Lightweight unmanned aerial vehicle object detection algorithm based on improved YOLOv8

Abstract

With the rapid development of unmanned aerial vehicle (UAV) technology and computer vision, real-time object detection in UAV aerial images has become a research hotspot. However, detection in UAV aerial images faces challenges such as disparate object scales, numerous small objects, and mutual occlusion. To address these issues, this paper proposes the ASM-YOLO model, which replaces the Neck of YOLOv8 with ABiFPN, a structure that combines efficient bidirectional cross-scale connections with adaptive feature fusion. Additionally, a Structural Feature Enhancement Module (SFE) is introduced to inject features extracted by the backbone network into the Neck, enhancing information exchange between the two. Furthermore, the MPDIoU bounding box loss function replaces the original CIoU bounding box loss function. A series of experiments was conducted on the VisDrone-DET dataset, with YOLOv8s as the baseline network. The experimental results show that the proposed model reduces the parameter count and model size by 26.1% and 24.7%, respectively. On the evaluation set, the proposed model improves AP50 and mAP by 7.4% and 4.6%, respectively, over the YOLOv8s baseline, validating its practicality and effectiveness. The generalizability of the algorithm was then validated on the DOTA and DIOR datasets, which resemble aerial images captured by drones. The experimental results indicate significant improvements on both datasets.

Keywords

computer vision; drone aerial images; multi-scale object detection; real-time object detection; feature fusion

1. Introduction

The integration of UAV aerial image processing with deep learning is currently a focal point within the academic community. UAVs, known for their maneuverability and flexibility, are used extensively in monitoring, particularly in intricate natural environments and terrains. Compared with conventional methods, UAV-based monitoring offers wide coverage, high efficiency, and cost-effectiveness, making UAVs an attractive choice in the monitoring domain. Nonetheless, generic object detection models often struggle to identify objects accurately in low-altitude UAV aerial images. Aerial photographs captured by UAVs differ significantly from ground-based photography: they exhibit expansive scenes, small objects (defined in the MS COCO [1] dataset as those smaller than 32 $\times$ 32 pixels), multiscale variations, complex backgrounds, and overlapping occlusions. Moreover, UAV object detection tasks require inference to run on embedded devices [2], demanding both high accuracy and real-time performance. Deploying complex object detection models on edge devices is challenging, while lightweight detectors struggle to reach high accuracy. These challenges hinder the advancement of deep learning methods in multi-object detection for UAVs. Hence, developing a model that balances detection accuracy and lightweight characteristics, and thereby overcomes the bottlenecks of deep learning in UAV aerial image applications, holds substantial practical significance.

Although YOLOv8 [3] is an excellent real-time object detection model, its performance is suboptimal when applied directly to real-time object detection on drones, particularly for the small targets and occlusions common in drone scenarios. Pursuing high accuracy alone fails to meet real-time requirements [4], while prioritizing real-time performance compromises accuracy [5]. This trade-off makes it challenging to achieve the desired results in both aspects. Therefore, this paper proposes a series of improvements to YOLOv8 that aim to satisfy the requirements of real-time performance and accuracy simultaneously in drone object detection tasks. To tackle the issue of varying scales in UAV object detection, we replace the PAN-FPN [6] structure in the original YOLOv8 Neck with ABiFPN, which combines efficient bidirectional cross-scale connections with adaptive feature fusion and adds a P2 layer. This adjustment aims to mitigate the common issue of overlooking small objects in low-altitude aerial images. Additionally, to improve information exchange between the Backbone and Neck networks, we introduce the Structural Feature Enhancement Module (SFE). The SFE module injects features from the Backbone into the Neck network for additional fusion processing. It makes fuller use of the Backbone features, and the Coordinate Attention (CA) mechanism [7] inside it enables the model to focus on spatial and positional information with long-range dependencies during information exchange. This improvement substantially enhances the accuracy of object detection in low-altitude aerial images. Finally, to ensure that the model captures multiscale variations and information on small objects during training, we replace the original YOLOv8 CIoU [8] bounding box loss function with the MPDIoU [9] bounding box loss function.

Ablation experiments were conducted to assess the viability and efficacy of the proposed network optimization methods. The experiments used the internationally recognized VisDrone-DET [10] dataset and compared the proposed model with the baseline model YOLOv8s. The results show that the proposed model reduces the parameter count and model size by 26.1% and 24.7%, respectively. Moreover, on this dataset it improves Average Precision at 50% Intersection over Union (AP50) by 7.4% and mean Average Precision (mAP) by 4.6% compared to the YOLOv8s baseline, affirming the practicality and effectiveness of the proposed model. Comparative analyses against state-of-the-art detection models and mainstream models with comparable parameter counts provide further evidence of the superior performance of the proposed methods. We also validate the algorithm’s generalizability on the DOTA [11] and DIOR [12] datasets, which resemble aerial images captured by drones. The experimental results indicate a significant improvement in the performance of ASM-YOLO on both datasets.

The remainder of this paper is structured as follows. Section 2 reviews relevant prior research. Section 3 introduces the enhanced model for detecting objects in aerial images and details its structure and functionality. Section 4 describes the evaluation metrics, experimental settings, and parameter configurations, and then reports ablation, comparative, and interpretability experiments on the open-source VisDrone-DET dataset, together with generalization validation on the DOTA and DIOR datasets. Section 5 summarizes the findings of the paper and outlines future research directions.

2. Related works

In recent years, the rapid development of universal object detection technology has become the foundation for research in numerous application domains. This progress not only highlights its significance in providing crucial technical support for the practical implementation of unmanned aerial vehicle (UAV) target detection tasks but also reflects its integral role in the broader domain. Section 2.1 of this paper will offer an in-depth exploration of the evolution of universal object detection. Moving forward, Section 2.2 will shift focus to the practical application scenarios of UAV target detection. The discussions in these two sections aim to contribute to a comprehensive understanding of the intricate interplay between universal object detection technology and the challenges posed by UAV target detection tasks.

2.1 Universal object detection model

Universal object detection techniques fall into two primary categories: one-stage and two-stage. One-stage techniques yield results in an end-to-end fashion, transforming the object detection challenge into a global regression problem. These techniques concurrently assign positions and categories to multiple candidate boxes, achieving a distinct separation between objects and backgrounds. Classic one-stage object detection methodologies encompass SSD [13], the YOLO (You Only Look Once) series [14, 15, 16], and RetinaNet [17]. In contrast, two-stage techniques employ heuristic methods or region proposal generation in the first stage to acquire multiple candidate boxes, subsequently subjecting them to filtering, classification, and regression in the second stage. Traditional two-stage object detection approaches include R-CNN [18], Faster R-CNN [19], and Mask R-CNN [20]. While two-stage techniques demonstrate superior detection accuracy on expansive datasets such as MS COCO [1] and PASCAL VOC [21], they often struggle to meet the real-time demands of edge computing architectures. Conversely, one-stage object detection models strike a balance between real-time performance and detection accuracy. Notably, the YOLO series stands out as a highly regarded single-stage object detection model.

The YOLO series of networks typically consists of three core components: Backbone, Neck, and Head. The Backbone extracts essential features from the image, with shallow layers capturing edge and texture features, while deep layers capture object and semantic information. The Neck connects the Backbone and the Head, aggregating and refining features from the Backbone, commonly performing feature fusion to handle information at different scales. As the final component of the YOLO model, the Head is responsible for predictions. It consists of one or more task-specific sub-networks for classification, localization, and, more recently, instance segmentation and pose estimation. The Head utilizes features provided by the Neck to generate predictions for each candidate object. Finally, post-processing steps such as NMS [22] filter out overlapping predictions, retaining only the detections with the highest confidence.
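The NMS post-processing step mentioned above can be illustrated with a minimal greedy sketch in plain Python. The box format `(x1, y1, x2, y2)`, the function names, and the IoU threshold of 0.5 are assumptions chosen for illustration; production detectors typically use optimized library implementations instead.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_thr: float = 0.5) -> List[int]:
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_thr for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores, iou_thr=0.5)  # the second box overlaps the first
```

In this toy input the second box overlaps the first with an IoU of 81/119, so only the first and third boxes survive.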

In 2023, Ultralytics introduced YOLOv8 for general object detection, incorporating the ELAN design philosophy of YOLOv7 [23] into its backbone network and Neck components. YOLOv8 replaced the C3 structure of YOLOv5 [24] with the C2f structure, which provides richer gradient flow, and adjusted channel numbers for models of different scales. The Head of YOLOv8 underwent substantial modifications: it adopts a decoupled head structure that separates the classification and detection heads, and it moves from the conventional Anchor-Based approach to an Anchor-Free one, further refining detection accuracy. Noteworthy innovations in loss calculation include the TaskAlignedAssigner positive sample assignment strategy and Distribution Focal Loss (DFL) [25], intended to strengthen the model’s robustness and accuracy. Furthermore, YOLOv8 uses the CIoU [8] bounding box loss function. Through these improvements, YOLOv8 has excelled in object detection tasks and become a widely applied model in diverse domains. It is offered in scaled versions, namely YOLOv8n (nano), YOLOv8s (small), YOLOv8m (medium), YOLOv8l (large), and YOLOv8x (extra large), and supports visual tasks including object detection, segmentation, pose estimation, tracking, and classification.

2.2 Object detection in aerial images from unmanned aerial vehicle perspective

Despite notable advancements in general object detection techniques, the task of detecting objects in low-altitude UAV aerial images still presents numerous challenges. These challenges encompass various factors, including the presence of small objects, multi-scale variations, complex backgrounds, occlusions, and lighting conditions. In order to address these issues, literature [26] proposed a sliding window cropping method specifically designed for detecting high-resolution images obtained from remote sensing satellites. The proposed method involves dividing the original captured image into multiple smaller images of a fixed size prior to inputting them into the network. This sequential approach effectively mitigates the problem of information loss that occurs when directly downsizing the original high-resolution image upon network entry. Presently, this method is widely employed to tackle the challenge of detecting small objects within high-resolution images. Furthermore, a recent study outlined in literature [4] introduced the utilization of image upscaling to enhance detection performance in tasks involving small objects. While these two image preprocessing methods offer significant improvements in the accuracy of small object detection within low-altitude UAV aerial images, their deployment in edge computing scenarios undoubtedly entails substantial computational overhead.
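The sliding window cropping idea can be sketched as follows. This is a minimal illustration and not the implementation of [26]: the tile size of 640, the overlap of 128 pixels, and the function names are assumptions, and the overlap itself is an added convention so that objects on tile borders appear whole in at least one crop.

```python
from typing import List, Tuple


def tile_starts(length: int, tile: int, overlap: int) -> List[int]:
    """Start offsets of fixed-size windows that cover [0, length)."""
    if tile <= overlap:
        raise ValueError("tile must be larger than overlap")
    if length <= tile:
        return [0]
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # shift the last window back inside the image
    return starts


def sliding_windows(width: int, height: int,
                    tile: int = 640, overlap: int = 128
                    ) -> List[Tuple[int, int, int, int]]:
    """Return crop boxes (x1, y1, x2, y2) in row-major order."""
    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in tile_starts(height, tile, overlap)
        for x in tile_starts(width, tile, overlap)
    ]


# A hypothetical 2000 x 1500 aerial image is split into 4 x 3 crops.
windows = sliding_windows(2000, 1500)
```

Each crop is passed through the detector separately, which is where the extra computational overhead noted above comes from: the number of forward passes grows with the number of tiles.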

In the literature [27], a structure similar to an encoder-decoder architecture is utilized to modify the dimensions of feature maps by employing various types of blocks and convolutional layers. Moreover, object detection is performed using five detection layers. The K-Means algorithm is employed to determine the optimal values for anchor boxes. Common data augmentation techniques, including the selective blacking out of specific regions, are employed. These techniques have shown a certain level of effectiveness in detecting objects in aerial images captured by UAVs. The literature [28] presents a UAV image object detection method named Multi-Proxy Detection Network with Unified Foreground Packing (UFPMP-Det). To tackle the challenge of detecting densely distributed small objects in UAV-captured scenes, this method introduces the Unified Foreground Packing (UFP). Firstly, background suppression is achieved through cluster merging, followed by packing the resulting sub-regions into mosaics for individual inferences, leading to a significant reduction in overall computational time. Moreover, to mitigate the confusion arising from object similarity and intra-class variations, the method introduces a Multi-Proxy Detection Network (MP-Det). By employing multiple proxies, the MP-Det learns to model the distribution of objects and enforces proxy diversification by minimizing the optimal transport loss guided by the Bag-of-Instance-Words (BoIW). The literature [29] introduces MS-YOLOv7, a UAV aerial image object detection model that improves upon YOLOv7 [23]. This model specifically focuses on detecting common large objects and small objects with higher aspect ratios in UAV aerial images. It achieves this by employing multiple detection heads and the CBAM [30] convolutional attention module to extract features at various scales.
To address the challenge of detecting densely packed objects, the model incorporates Swin Transformer [31] units and introduces a novel pyramid pooling module named SPPFS. Moreover, the model combines SoftNMS [32] and the Mish [33] activation function to enhance the network’s capability to recognize overlapping and occluded objects. The literature [34] addresses the issue of false positives and false negatives in the detection of small objects in UAV aerial images by introducing the Bi-PAN-FPN approach to enhance the Neck component of YOLOv8. To mitigate information loss during long-distance feature transmission and significantly reduce the number of model parameters, the Ghostblock v2 [35] structure is introduced in the backbone, replacing a portion of the original C2f modules in YOLOv8. Additionally, to improve the overall performance of the detection task, the WiseIoU [36] bounding box loss function is employed, along with a dynamic non-monotonic focusing mechanism and an "outlier" evaluation of anchor box quality. This combination enables the detector to consider anchor boxes of different qualities. The literature [37] presents UAV-YOLOv8, an object detection model specifically designed for UAV aerial scenes, which builds upon the YOLOv8 architecture. Firstly, WiseIoU [36] is employed as the bounding box regression loss, accompanied by a judicious gradient allocation strategy that prioritizes samples of common quality. This strategy effectively enhances the model’s localization capability. Secondly, BiFormer [38], an attention mechanism, is introduced to optimize the backbone network and enhance the model’s focus on crucial information. Lastly, a feature processing module called Focal FasterNet block (FFNB) is designed, and two new detection scales are proposed to integrate shallow and deep features.
The integration of these features through the proposed multi-scale feature fusion network significantly boosts the model’s detection performance and reduces the rate of missed detections for small objects. The literature [39] introduces a remote sensing image object detection method based on Consistency- and Dependence-Guided Knowledge Distillation (CDKD). This approach uses the Spatial- and Channel-oriented Structure Discriminative modules (SCSDM) to extract the discriminative spatial positions and channels on which the teacher model focuses. By eliminating noise and the influence of complex backgrounds, the method enhances the feature representation of the student model. Guided by SCSDM, the paper establishes the consistency and dependence of features between the teacher and student models. This work holds significant reference value for unmanned aerial vehicle image processing and model lightweighting.

Although the methods in [27, 28, 29, 34, 37, 39] have contributed to the development of this field, they still have several limitations. Firstly, these methods primarily focus on achieving high detection accuracy, which they accomplish effectively; however, their speed falls short of the deployment requirements of edge devices, impeding their practical application. Secondly, even where these methods perform well in both detection accuracy and speed, the model size has increased during the improvement process, adding to the burden of deployment. Practical applications therefore demand additional computational resources and storage space, which limits the feasibility of these methods. Therefore, when further enhancing low-altitude UAV aerial image detection methods, it is crucial to reduce the model’s size while simultaneously improving detection accuracy and speed.

Building upon the aforementioned considerations, this paper aims to propose a precise and lightweight method for detecting low-altitude UAV aerial images. Our focal point will be to tackle the aforementioned challenges by adopting novel algorithms and techniques, which enable efficient deployment on edge devices. To enhance detection accuracy, we will introduce novel feature representation methods or refine existing feature extractors. Furthermore, we will investigate approaches to optimize model architecture and parameter settings, thereby improving detection speed and efficiency.

3. Methods

3.1 Structure of the ASM-YOLO

To tackle the object detection task in low-altitude aerial images captured by UAVs, this paper proposes an improved version of YOLOv8 called ASM-YOLO. The overall architecture, as depicted in Figure 1, consists of several components. Firstly, the (a) Backbone structure, responsible for feature extraction, employs the CSPDarknet53 backbone network. Due to its excellent feature extraction capabilities and lower resource consumption, we adhere to its design. The useful feature layers extracted from this structure, namely B2, B3, B4, and B5, are also utilized in the subsequent Neck structure, denoted as P2 to P5 and H2 to H5 in the Neck and Head structures, respectively. Secondly, the (b) SFE structure combines the CA attention mechanism to inject the features extracted by the Backbone into the Neck structure. This process focuses on spatial and positional information. Next, the (c) Neck structure replaces the original PAN-FPN structure with a four-layer ABiFPN. It receives features from layers B2 to B5 of the Backbone and performs adaptive fusion. Lastly, the (d) Head structure follows the decoupled heads approach used in YOLOv8. It consists of separate branches for classification and bounding box regression. The classification task employs Binary Cross-Entropy (BCE) loss, while the regression task adopts MPDIoU loss and DFL. The detailed parts of this network are shown in Figure 2(a) to (i).

Figure 1.

The main workflow structure of ASM-YOLO.

Figure 2.

The detailed process structure of ASM-YOLO.

3.2 Improved neck structure

In one-stage object detectors, the Neck component typically incorporates a network resembling the Feature Pyramid Network (FPN) [40]. This network integrates high-level features with low-level features to enhance the representation capacity of the latter. Additionally, it assigns objects of different scales to different detection layers, thereby achieving a divide-and-conquer strategy. Over the years, several FPN variants have been introduced, including NAS-FPN [41], BiFPN [42], DetectoRS [43], and AFPN [44]. To address the drastic scale differences and the difficulty of detecting small objects in UAV aerial images, this paper first replaces the Neck component of YOLOv8 with a BiFPN structure. An adaptive feature fusion mechanism is then employed to fuse information from different scales within this structure. Furthermore, a shallow feature layer, P2, is introduced to improve the detection performance for small objects. The improved structure is referred to as ABiFPN, and its architecture is illustrated in Figure 3. The data flow of the ABiFPN structure is given in Eqs (1) to (6).

\begin{aligned} P_{2}^{out} & = Adaptive (B_{2}^{out}, UpSample (P_{3}^{out})) \end{aligned}

(1)

\begin{aligned} P_{3}^{out} & = Adaptive (B_{3}^{out}, UpSample (P_{4}^{out})) \end{aligned}

(2)

\begin{aligned} P_{4}^{out} & = Adaptive (B_{4}^{out}, UpSample (P_{5}^{out})) \end{aligned}

(3)

\begin{aligned} P_{3+}^{out} & = Adaptive (DownSample (P_{2}^{out}), P_{3}^{out}, B_{3}^{out}) \end{aligned}

(4)

\begin{aligned} P_{4+}^{out} & = Adaptive (DownSample (P_{3+}^{out}), P_{4}^{out}, B_{4}^{out}) \end{aligned}

(5)

\begin{aligned} P_{5}^{out} & = Adaptive (DownSample (P_{4+}^{out}), B_{5}^{out}) \end{aligned}

(6)

Figure 3.

Structure of ABiFPN.
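The evaluation order implied by Eqs (1) to (6) can be traced with a small shape-only sketch. It tracks just the spatial side length of each feature map and checks that every fusion receives inputs of matching resolution. Several points are assumptions rather than statements from the paper: the backbone strides of 4, 8, 16, and 32 for B2 to B5, the 2x factor of the sampling operators, and the choice to start the top-down pass from B5, since the top-level output of Eq. (6) is only available after the bottom-up pass.

```python
from typing import Dict


def upsample(size: int) -> int:
    """2x upsampling doubles the spatial side length (assumed factor)."""
    return size * 2


def downsample(size: int) -> int:
    """Stride-2 downsampling halves the spatial side length (assumed factor)."""
    return size // 2


def adaptive(*sizes: int) -> int:
    """Stand-in for adaptive fusion: all inputs must share one resolution."""
    if len(set(sizes)) != 1:
        raise ValueError(f"cannot fuse mismatched resolutions: {sizes}")
    return sizes[0]


def trace_abifpn(input_size: int = 640) -> Dict[str, int]:
    # Backbone outputs B2..B5 at strides 4, 8, 16, 32 (assumption).
    b = {lvl: input_size // 2 ** lvl for lvl in (2, 3, 4, 5)}

    # Top-down pass, highest level first.
    p4 = adaptive(b[4], upsample(b[5]))  # Eq. (3), B5 assumed as top input
    p3 = adaptive(b[3], upsample(p4))    # Eq. (2)
    p2 = adaptive(b[2], upsample(p3))    # Eq. (1)

    # Bottom-up pass, lowest level first.
    p3_plus = adaptive(downsample(p2), p3, b[3])       # Eq. (4)
    p4_plus = adaptive(downsample(p3_plus), p4, b[4])  # Eq. (5)
    p5 = adaptive(downsample(p4_plus), b[5])           # Eq. (6)

    return {"P2": p2, "P3+": p3_plus, "P4+": p4_plus, "P5": p5}


sizes = trace_abifpn(640)
```

For a 640 x 640 input the four outputs have side lengths 160, 80, 40, and 20, so the added P2 level carries four times as many spatial positions as P3.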

3.2.1 Structure of the ABiFPN

Although FPN has achieved effective cross-scale feature fusion, its unidirectional top-down connection structure still limits the expressive power of features. In order to overcome this limitation, BiFPN introduces bottom-up connections as well as bidirectional feature aggregation. Moreover, to address the imbalanced contributions of features at different scales, BiFPN employs a weighted feature fusion strategy that enables the network to learn the significance of each feature, thereby replacing simple addition operations. Additionally, BiFPN considers a single top-down and bottom-up connection as a layer and iteratively stacks this structure to form multiple layers of BiFPN, thereby enhancing feature representation. In terms of network connectivity, BiFPN optimizes the structure by removing single-input nodes, adding connections among nodes at the same scale, and employing other methods to improve the network’s representation capacity without increasing parameters or computational complexity. The ABiFPN utilized in this study adheres to the design principles of BiFPN.

The P2 detection layer exhibits significant advantages due to its relatively higher resolution, enabling it to capture fine-grained feature information in the image and enhance the precision of object detection. By performing convolution, pooling, and activation operations at the lower levels of the pyramid network, the P2 layer effectively extracts local features and provides a strong foundation for object detection. Considering that the original BiFPN network’s performance in handling small-sized objects remains unsatisfactory, this paper introduces the P2 detection layer to enhance the detection performance of small objects. The subsequent object detection head utilizes the P2 feature layer for predicting object categories and positions, achieving accurate object detection and localization. Additionally, the P2 feature layer collaborates with other pyramid layers, leveraging multi-scale information and enabling the entire object detection system to adapt to objects of different scales, thereby improving the overall detection performance. The calculation of the parameter count for the modified ABiFPN can be determined using Eqs (7) to (11).

\begin{aligned} P_{Neck} = \sum_{p_{2}}^{p_{5}} i \times (p_{UpSample} + p_{Fusion} + p_{C2f} + p_{DownSample}) \end{aligned}

(7)

Among them, $P_{Neck}$ represents the parameter amount of the entire structure; $p_{UpSample}$ represents the parameter amount in the up-sampling process; $p_{Fusion}$ represents the parameter amount in the feature fusion process; $p_{C2f}$ represents the parameter amount generated by the C2f module; $p_{DownSample}$ represents the parameter amount generated in the down-sampling process. $p_{2}$ and $p_{5}$ denote the first and last detection layers of the summation, each layer grouping features of the same size.

\begin{aligned} p_{UpSample} = 0 \end{aligned}

(8)

Because upsampling only interpolates new values from adjacent feature values, it incurs computation but introduces no parameters.

\begin{aligned} p_{Fusion} = p_{Conv 1 \times 1} + p_{Concat} + p_{Conv 1 \times 1} + p_{Softmax} + p_{Split} = 2 \times C_{in} \end{aligned}

(9)

Among them, $C_{in}$ represents the number of channels of the input features for the adaptive feature fusion, and $bs$ represents the batch size.

\begin{aligned} p_{C2f} = \sum_{i=2}^{5} \frac{InputSize}{2^{i}} \times C_{Neck} \times P_{C2f} \end{aligned}

(10)

Among them, InputSize is the size of the image input to the network, and $C_{Neck}$ is the number of channels in the Neck part of the network. In ASM-YOLO, the number of channels is fixed at 128, whereas in YOLOv8s the P3 to P5 layers have 128, 256, and 512 channels, respectively. $P_{C2f}$ is the number of parameters required for the single-channel C2f structure.

\begin{aligned} p_{DownSample} = C_{in} \times bs \times P_{DownSample} \end{aligned}

(11)

Among them, $P_{DownSample}$ represents the number of parameters required for single-channel downsampling convolution.

The equations above reveal that the excessive number of channels is one of the primary issues associated with the Neck component. Furthermore, previous literature [45] has highlighted the redundancy among different channels and noted that Depthwise Convolution requires an increased channel count to compensate for its accuracy degradation, which in turn raises memory access. To address these concerns, this study aims to diminish computational redundancy and alleviate memory pressure. Drawing inspiration from the design principles of Partial Convolution, only a subset of channels is retained for feature extraction, while the channel count of the original P3-P5 layers is standardized to 128. This channel reduction strategy not only reduces the computational burden but also capitalizes on the effectiveness of Partial Convolution by extracting scene features using representative channels, with Pointwise Convolution being utilized to integrate information from all channels. This approach optimizes both computation and memory access while preserving effective feature representation. Ultimately, the modification of the structural parameters yields a 26% decrease in the parameter count compared to the original PAN-FPN.
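The effect of the channel count on parameters can be illustrated with a toy calculation. The sketch below counts the weights of one bias-free 3 x 3 convolution per level, first with the YOLOv8s channel widths (128, 256, 512) and then with the uniform width of 128. It also counts a Partial Convolution followed by a Pointwise Convolution under the assumption that one quarter of the channels is convolved. The layer layout, the partial ratio, and the function names are illustrative assumptions, and the numbers do not reproduce the 26% figure reported for the full Neck.

```python
from typing import Sequence


def conv_weights(c_in: int, c_out: int, k: int) -> int:
    """Weight count of a bias-free k x k convolution."""
    return c_in * c_out * k * k


def stack_weights(channels: Sequence[int]) -> int:
    """One 3 x 3 convolution per level, with equal input and output width."""
    return sum(conv_weights(c, c, 3) for c in channels)


def partial_conv_weights(c: int, ratio: int = 4) -> int:
    """3 x 3 conv on c // ratio channels plus a 1 x 1 conv over all c channels."""
    part = c // ratio  # assumed share of channels that is actually convolved
    return conv_weights(part, part, 3) + conv_weights(c, c, 1)


yolov8s_like = stack_weights((128, 256, 512))  # widths grow with depth
uniform_128 = stack_weights((128, 128, 128))   # widths fixed at 128
full_3x3 = conv_weights(128, 128, 3)
pconv_pw = partial_conv_weights(128)
```

Because the weight count grows with the square of the channel width, the 512-channel level dominates the first total, which is why fixing all levels at 128 channels removes most of the weights in this toy setting.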

3.2.2 Adaptive feature fusion

To tackle the challenge of complex backgrounds in UAV object detection, this study proposes an adaptive feature fusion mechanism within the FPN-style Neck component. Feature fusion serves multiple purposes within the FPN network: firstly, it facilitates the generation of complementary multi-scale features; secondly, it enhances the information flow between different network layers; thirdly, it mitigates semantic inconsistencies; fourthly, it improves the detection performance of small objects; and finally, it reduces the overall parameter count. However, the conventional simple stacking method fails to discern the importance of individual feature maps, which degrades the detection of small objects in complex scenes. To achieve more intelligent and efficient feature fusion, this research develops an adaptive feature fusion module based on an attention mechanism. This module discerns the distinct contribution of each feature map at each position to the final object detection task and assigns attention weights accordingly. The structure is depicted in Figure 4. It first employs 1 $\times$ 1 convolutions to reduce the dimensionality of each feature map, then stacks the dimension-reduced feature maps along the channel axis, and subsequently increases the dimensionality of the stacked feature maps via 1 $\times$ 1 convolutions. The module then computes the correlation between feature maps at each position, generating attention weights that are split into the same number of channels as the input. Finally, the obtained weights are used to weight the corresponding feature maps, yielding a new fused feature representation. 
This adaptive fusion approach enables the network to autonomously learn the significance of different feature maps and positions, facilitating more intelligent and effective combinations of multi-scale features, thereby enhancing the detection performance of the model. In contrast to simple stacking fusion, this attention-based adaptive feature fusion mechanism enables learnable fusion and optimization of feature extraction in the backbone network, rendering the combination of multi-scale features more efficient and flexible.

Figure 4.

Adaptive feature fusion.
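The weighting step of the adaptive fusion can be sketched in plain Python for single-channel maps. In the module described above the per-position scores come from learned 1 x 1 convolutions; here they are simply passed in as inputs, which is a simplifying assumption, as are the function names and the nested-list representation.

```python
import math
from typing import List

Map2D = List[List[float]]  # one single-channel feature map, H x W


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax over a short list."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_fuse(features: List[Map2D], scores: List[Map2D]) -> Map2D:
    """Fuse N maps with per-position weights that sum to one across maps."""
    h, w = len(features[0]), len(features[0][0])
    fused = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Normalise the N scores at this position into attention weights.
            weights = softmax([s[y][x] for s in scores])
            fused[y][x] = sum(wt * f[y][x] for wt, f in zip(weights, features))
    return fused


# Two 1 x 2 maps: equal scores at x=0, a strong preference for map 2 at x=1.
features = [[[1.0, 1.0]], [[3.0, 3.0]]]
scores = [[[0.0, 0.0]], [[0.0, 10.0]]]
fused = adaptive_fuse(features, scores)
```

At the first position the two maps are averaged, while at the second the output is almost entirely taken from the second map, which is the position-dependent behaviour that plain stacking cannot express.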

3.3 Structural feature enhancement module

The complex and diverse backgrounds present in images captured by low-altitude aerial drones necessitate the introduction of the SFE module in this study to accurately capture the desired information. Figure 2 illustrates the composition of the SFE module, which comprises the Conv structure and the CA module. The primary objective of this module is to inject informative feature maps into the neck component of the network, thereby enhancing the representation capability and abstraction level of the features.

According to the operational flow illustrated in Figure 5, the mechanism initiates by conducting average pooling in both the height and width dimensions on the input features, resulting in a downsampled representation with reduced spatial resolution. Following that, stacked concatenation, convolution, batch normalization, and nonlinear mapping operations are employed to augment the expressive capabilities of the features. Subsequently, the features are partitioned into two parallel stages, where convolution and nonlinear mapping are independently performed in the height and width dimensions. Lastly, the mapping results are merged with the input features through residual connections. This adaptive computation iteratively occurs multiple times across the height, width, and channel dimensions to accentuate the positional and channel-specific information of the features.

The design of this mechanism considers the structural characteristics of the input data and enhances the expressive power of the features through operations such as stacked concatenation, convolution, batch normalization, and nonlinear mapping. Additionally, residual [46] connections are utilized to preserve the information of the input features, thereby preventing information loss. Consequently, when addressing scenarios with long-range dependencies, the incorporation of CA attention facilitates precise filtering of spatial and channel information pertaining to the relevant features. In the context of detecting objects in complex backgrounds of low-altitude aerial images captured by UAVs, this mechanism effectively and accurately locates the objects, ultimately accomplishing the objective of object detection.
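The direction-wise pooling at the start of this flow can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions: the feature map is a nested list indexed as `[channel][row][column]`, the learned convolution, batch normalization, and nonlinear mapping stages are omitted, and a sigmoid of the pooled values stands in for the learned attention weights. The function names `directional_pool` and `coordinate_gate` are hypothetical and do not come from the paper's code.

```python
import math


def directional_pool(x):
    """Average-pool a [C][H][W] feature map along the width and along the height.

    Returns (pool_h, pool_w): pool_h[c][i] is the mean of row i of channel c,
    pool_w[c][j] is the mean of column j of channel c.
    """
    pool_h = [[sum(row) / len(row) for row in channel] for channel in x]
    pool_w = [
        [sum(channel[i][j] for i in range(len(channel))) / len(channel)
         for j in range(len(channel[0]))]
        for channel in x
    ]
    return pool_h, pool_w


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def coordinate_gate(x):
    """Re-weight every position by a row gate and a column gate.

    Assumption: the gates are sigmoids of the pooled values; in the real
    module they would be produced by learned convolutions.
    """
    pool_h, pool_w = directional_pool(x)
    return [
        [
            [x[c][i][j] * sigmoid(pool_h[c][i]) * sigmoid(pool_w[c][j])
             for j in range(len(x[c][i]))]
            for i in range(len(x[c]))
        ]
        for c in range(len(x))
    ]


feature_map = [[[1.0, 3.0], [5.0, 7.0]]]  # one channel, 2 x 2
row_means, col_means = directional_pool(feature_map)
gated = coordinate_gate(feature_map)
```

Because each position is scaled by one gate from its row and one from its column, the output keeps the input shape while encoding where along each axis the response is strong.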

Figure 5.

Coordinate attention.

3.4 Enhancement of the MPDIoU bounding box loss function

In low-altitude aerial images acquired by UAVs, small objects are prominent, so a well-designed loss function is needed to enhance the model's detection performance. YOLOv8 uses the CIoU [8] and DFL [25] loss functions for bounding box regression and binary cross-entropy loss for classification. Nonetheless, the CIoU approach exhibits several limitations. Firstly, it fails to consider the balance between hard and easy samples. Secondly, because the aspect ratio enters the loss as a penalty term, the penalty vanishes when the ground truth box and predicted box share the same aspect ratio but differ in width and height, so it fails to capture the true disparity between the two boxes. Lastly, the computation of CIoU involves inverse trigonometric functions, which increases the computational cost. The computation of CIoU is given by Eq. (12).

\begin{aligned} L_{CIoU} & = 1 - IoU + \frac{ρ^{2} (b, b^{gt})}{(c_{w})^{2} + (c_{h})^{2}} + α v, \\ v & = \frac{4}{π^{2}} \left( \tan^{-1} \frac{w^{gt}}{h^{gt}} - \tan^{-1} \frac{w}{h} \right)^{2}, \quad α = \frac{v}{(1 - IoU) + v} \end{aligned}

(12)

In Eq. (12), the symbol IoU denotes the intersection over union ratio between the predicted box and the ground truth box. The term $ρ (b, b^{g t})$ represents the Euclidean distance between the center points of the predicted box and the ground truth box. The variables $w$ and $h$ correspond to the width and height of the predicted box, respectively. Similarly, $h^{g t}$ and $w^{g t}$ denote the height and width of the ground truth box. Lastly, $c_{h}$ and $c_{w}$ refer to the height and width of the minimum enclosing rectangle enclosing both the predicted box and the ground truth box.
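The aspect-ratio limitation discussed above can be checked numerically. The sketch below evaluates only the aspect-ratio term of the CIoU loss, written in its squared form $v = \frac{4}{π^{2}} (\tan^{-1} \frac{w^{gt}}{h^{gt}} - \tan^{-1} \frac{w}{h})^{2}$; the squared form is an assumption taken from the standard CIoU definition, and the function name `aspect_ratio_penalty` and the box sizes are illustrative only.

```python
import math


def aspect_ratio_penalty(w_gt, h_gt, w, h):
    """Aspect-ratio term v of the CIoU loss (squared-arctangent form)."""
    return (4 / math.pi ** 2) * (math.atan(w_gt / h_gt) - math.atan(w / h)) ** 2


# Predicted box is half the size of the ground truth but has the same 2:1 ratio.
same_ratio = aspect_ratio_penalty(20.0, 10.0, 10.0, 5.0)

# Predicted box has a different (1:1) ratio.
diff_ratio = aspect_ratio_penalty(20.0, 10.0, 10.0, 10.0)
```

In the first case the term is exactly zero even though the predicted box is only a quarter of the ground truth area, so the size mismatch has to be corrected by the IoU and center-distance terms alone.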

Figure 6.

MPDIoU.

Although the CIoU loss function incorporates the aspect ratio of bounding boxes as a penalty term, which can partially expedite the regression convergence of predicted boxes, it poses a constraint when the width and height of predicted boxes and ground truth boxes demonstrate a linear proportionality. In such cases, the width and height of the predicted boxes cannot simultaneously increase or decrease, impeding the efficacy of the CIoU loss function. To enhance the detection of small objects in low-altitude aerial images captured by UAVs, this study introduces the MPDIoU bounding box loss function as a substitute for the CIoU loss function. The MPDIoU loss function precisely characterizes the position and shape of objects by minimizing the distance between the top-left and bottom-right points of the predicted and ground truth bounding boxes. The specific computation process is delineated by Eqs (13) to (16). The advantages of the MPDIoU loss function lie in its streamlined computation process based on the minimum point distance measurement, rendering it applicable to both overlapping and non-overlapping bounding box regression. By incorporating the MPDIoU loss function, this study surmounts the limitations of the CIoU loss function during the convergence of width and height ratios, thereby amplifying the accuracy and performance of small object detection.

\begin{aligned} L_{M P D I o U} & = 1 - M P D I o U \end{aligned}

(13)

\begin{aligned} M P D I o U & = \frac{A \cap B}{A \cup B} - \frac{d_{1}^{2}}{w^{2} + h^{2}} - \frac{d_{2}^{2}}{w^{2} + h^{2}} \end{aligned}

(14)

\begin{aligned} d_{1}^{2} & = (x_{1}^{B} - x_{1}^{A})^{2} + (y_{1}^{B} - y_{1}^{A})^{2} \end{aligned}

(15)

\begin{aligned} d_{2}^{2} & = (x_{2}^{B} - x_{2}^{A})^{2} + (y_{2}^{B} - y_{2}^{A})^{2} \end{aligned}

(16)

Some of the parameters in Eqs (13) to (16) are shown in Figure 6. $A$ and $B$ are any two bounding boxes, and $w$ and $h$ are the width and height of the input image. $(x_{1}^{A}, y_{1}^{A})$ and $(x_{2}^{A}, y_{2}^{A})$ are the top-left and bottom-right corner points of box $A$, and $(x_{1}^{B}, y_{1}^{B})$ and $(x_{2}^{B}, y_{2}^{B})$ are the top-left and bottom-right corner points of box $B$. $d_{1}$ is the distance from point $(x_{1}^{A}, y_{1}^{A})$ to point $(x_{1}^{B}, y_{1}^{B})$, and $d_{2}$ is the distance from point $(x_{2}^{A}, y_{2}^{A})$ to point $(x_{2}^{B}, y_{2}^{B})$.
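Eqs (13) to (16) translate directly into code. The following is a minimal sketch rather than the training implementation: it assumes boxes are given as `(x1, y1, x2, y2)` tuples with the top-left corner first, handles a single pair of boxes instead of batched tensors, and uses the hypothetical function names `mpdiou` and `mpdiou_loss`.

```python
def mpdiou(box_a, box_b, img_w, img_h):
    """MPDIoU of two boxes (x1, y1, x2, y2) for an image of size img_w x img_h."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection over union, the first term of Eq. (14).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union if union > 0 else 0.0

    # Squared corner distances, Eqs (15) and (16).
    d1_sq = (bx1 - ax1) ** 2 + (by1 - ay1) ** 2
    d2_sq = (bx2 - ax2) ** 2 + (by2 - ay2) ** 2

    # Both distances are normalized by the squared image diagonal.
    norm = img_w ** 2 + img_h ** 2
    return iou - d1_sq / norm - d2_sq / norm


def mpdiou_loss(box_a, box_b, img_w, img_h):
    """Eq. (13): the loss is one minus MPDIoU."""
    return 1.0 - mpdiou(box_a, box_b, img_w, img_h)


identical = mpdiou_loss((10, 10, 50, 50), (10, 10, 50, 50), 100, 100)
disjoint = mpdiou_loss((0, 0, 10, 10), (20, 20, 30, 30), 100, 100)
```

For the disjoint pair the IoU term is zero, yet the loss still depends on how far apart the corners are, which is why the measure remains informative for non-overlapping boxes.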

4. Experiments and results

4.1 Evaluation indicators

The model’s detection performance in this study is evaluated using precision (P), recall (R), average precision at an IoU threshold of 0.5 (AP50), and mean average precision over IoU thresholds from 0.5 to 0.95 in steps of 0.05 (mAP). Model resource consumption is assessed using the parameter count (Par), the number of floating point operations (FLOPs), and the model size (MS). Detection speed is measured by the inference time required for a single image (Latency). The calculations for P, R, and mAP are described in Eqs (17) to (19).

\begin{aligned} P & = \frac{T P}{T P + F P} \end{aligned}

(17)

\begin{aligned} R & = \frac{T P}{T P + F N} \end{aligned}

(18)

\begin{aligned} mAP & = \frac{{AP}_{50} + {AP}_{55} + \dots + {AP}_{90} + {AP}_{95}}{10} \end{aligned}

(19)

where TP (true positives) is the number of positive samples correctly detected, FP (false positives) is the number of negative samples incorrectly detected as positive, and FN (false negatives) is the number of positive samples that were missed.
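Eqs (17) to (19) can be written out as follows. This is a small sketch with hypothetical function names; the counts and per-threshold AP values in the example are made up purely for illustration and are not results from the paper.

```python
def precision(tp, fp):
    """Eq. (17): fraction of detections that are correct."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


def recall(tp, fn):
    """Eq. (18): fraction of ground truth objects that are detected."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


def mean_ap(ap_per_threshold):
    """Eq. (19): mean of the AP values at IoU thresholds 0.50, 0.55, ..., 0.95."""
    if len(ap_per_threshold) != 10:
        raise ValueError("expected one AP value for each of the 10 IoU thresholds")
    return sum(ap_per_threshold) / 10


# Illustrative numbers only.
p = precision(tp=8, fp=2)
r = recall(tp=8, fn=8)
m = mean_ap([0.60, 0.55, 0.50, 0.45, 0.40, 0.35, 0.30, 0.20, 0.10, 0.05])
```

AP50 is simply the first entry of the list passed to `mean_ap`, which is why it is always at least as large as mAP when AP decreases with stricter thresholds.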

4.2 Dataset and experimental configuration

The experimental data used in this study come from the VisDrone2021 [10] dataset, publicly released by the AISKYEYE data mining team at Tianjin University. The dataset consists entirely of aerial footage captured by UAVs: 263 video clips totaling 179,264 frames, along with 10,209 static images. The data were collected with cameras mounted on various types of UAVs and vary in location (14 different cities in China), environment (urban and rural areas), objects (pedestrians, vehicles, and bicycles), and density (sparse and crowded scenes). The dataset also covers a wide range of weather and lighting conditions, representing diverse real-life scenarios. The video clips have a maximum resolution of 3840 $\times$ 2160, and the static images have a maximum resolution of 2000 $\times$ 1500. The images are annotated with 10 classes: pedestrians, people, bicycles, cars, vans, trucks, tricycles, awning-tricycles, buses, and motorcycles. In this study, 6,471 images are allocated to the training set, 548 to the validation set, and 1,610 to the test set.

The experiments were run on Ubuntu 20.04 with Python 3.8 and the PyTorch 2.0 deep learning framework. Training was accelerated using an NVIDIA GeForce RTX 3090 GPU (24 GB). The code used in this study is based on Ultralytics YOLOv8.0.114 with the modifications described above. To keep the comparison fair, none of the experiments in this research used pre-trained weights. Table 1 presents the primary parameter configurations employed during the training phase.

Table 1
Training parameter table.

Hyperparameters	Value
Epochs	300
Patience	20
Batch Size	4
Image Size	640 $\times$ 640
Optimizer	SGD
NMS IoU	0.7
Initial Learning Rate	0.01
Final Learning Rate	1e-4
Momentum	0.937
Weight-Decay	5e-4
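For readers who want to reproduce a comparable setup, the settings in Table 1 can be collected into a single configuration. The sketch below is an assumption about how they might map onto Ultralytics-style training argument names; the dictionary `train_cfg` is hypothetical and is not taken from the paper's code. In that naming scheme the final learning rate is specified as a fraction of the initial one (`lrf`), so a final rate of 1e-4 with an initial rate of 0.01 corresponds to `lrf = 0.01`.

```python
# Hypothetical mapping of Table 1 onto Ultralytics-style argument names.
train_cfg = {
    "epochs": 300,
    "patience": 20,         # early-stopping patience in epochs
    "batch": 4,
    "imgsz": 640,           # 640 x 640 input
    "optimizer": "SGD",
    "iou": 0.7,             # NMS IoU threshold
    "lr0": 0.01,            # initial learning rate
    "lrf": 0.01,            # final LR as a fraction of lr0 (assumed convention)
    "momentum": 0.937,
    "weight_decay": 5e-4,
    "pretrained": False,    # the paper trains without pre-trained weights
}

# Final learning rate implied by the two entries above.
final_lr = train_cfg["lr0"] * train_cfg["lrf"]
```

Keeping the values in one dictionary makes it easy to pass the same settings to the baseline and to the modified model, so that differences in the results reflect the architecture rather than the training schedule.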

The trained model is compared against the YOLOv8s baseline model by conducting training on the VisDrone-DET-train dataset and validation on the VisDrone-DET-val dataset. The training outcomes are depicted in Figure 7, illustrating the performance improvements achieved by the proposed approach.

4.3 Ablation experiment

The effectiveness of the proposed enhancements was demonstrated through ablation experiments conducted on the baseline model using the VisDrone-DET dataset. The experimental findings are presented in Table 2. In this table, ABiFPN signifies the incorporation of the Adaptive Weighted Bi-directional Feature Pyramid Network with a P2 detection layer serving as an improved Neck component within the network. SFE represents the Structural Feature Enhancement module, which facilitates the interaction between the Backbone and Neck networks. MPDIoU denotes the substitution of the original CIoU bounding box loss function of the baseline network with the MPDIoU loss function. The ✓ symbol indicates the adoption of the respective enhancement strategy.

Figure 7.

Training results of YOLOv8s and the proposed model: (a) precision; (b) recall; (c) AP50; (d) mAP.

Table 2

Ablation experiment. The best result is presented in bold.

ABiFPN	SFE	MPDIoU	ASM-YOLO	Par (M)	AP50(%)	mAP(%)
–	–	–	–	11.13	32.6	18.7
✓				8.21	39.4 $↑$ 6.8	22.8 $↑$ 4.1
	✓			11.15	33.6 $↑$ 1.0	19.4 $↑$ 0.7
		✓		11.13	33.0 $↑$ 0.4	18.9 $↑$ 0.2
			✓	8.23	40.0 $↑$ 7.4	23.3 $↑$ 4.6

The experimental findings presented in Table 2 show that each enhancement strategy improves the detection performance of the baseline model on the VisDrone-DET dataset, although by very different margins. Substituting the PAN-FPN component with an ABiFPN that incorporates a P2 detection layer yields the largest gain, increasing AP50 and mAP by 6.8 and 4.1 percentage points, respectively, while reducing the parameter count from 11.13 M to 8.21 M. This improvement can be attributed to the abundance of small objects in the scenes captured by drones, as well as the substantial variations in scale. Integrating the SFE module into the baseline network increases AP50 by 1.0 and mAP by 0.7 percentage points, which stems from the better use of backbone features facilitated by the SFE structure. Employing MPDIoU as the bounding box regression loss increases AP50 by 0.4 and mAP by 0.2 percentage points. However, the inclusion of the ABiFPN structure introduces complexity to the model and prolongs the inference time.

4.4 Comparative experiment

To evaluate the effectiveness of the proposed model, comparative experiments were conducted between the enhanced model presented in this paper and several widely recognized real-time object detection models. The selected models include YOLOv5 (version 7.0), released by Ultralytics in November 2022; YOLOv6 (version 3.0), released by Meituan in February 2023; YOLOv7, released by the original YOLOv4 team in July 2022; Gold-YOLO, released by Huawei in September 2023; and YOLOv8, released by Ultralytics in January 2023. The experimental results on the VisDrone-DET dataset are presented in Table 3, where ASM-YOLO achieves the best Recall, AP50, and mAP. Compared with the YOLOv8s baseline, ASM-YOLO reduces the parameter count and model size by 26.1% and 24.7%, respectively, while AP50 and mAP improve by 7.4 and 4.6 percentage points, respectively. The trade-off is higher computational cost and slower inference: FLOPs increase from 28.5 G to 56.4 G, and the per-image latency increases from 3.10 ms to 6.02 ms.

Table 3
Comparative experiment table. The best result is presented in bold.

Models	Par (M)	FLOPs (G)	MS	R(%)	AP50(%)	mAP(%)	Latency (ms)
YOLOv5s-7.0 [24]	9.11	23.8	21.4	33.5	32.7	18.2	2.59
YOLOv6s-3.0 [47]	18.51	45.18	38.7	32.3	30.2	17.6	2.73
YOLOv7-tiny [23]	6.03	13.10	11.7	34.1	29.5	15.1	2.80
Gold-YOLOs [48]	21.51	46.04	44.7	31.5	28.9	16.6	3.19
YOLOv8n	3.01	8.1	5.9	29.9	27.5	15.5	1.79
YOLOv8s	11.13	28.5	21.4	34.7	32.6	18.7	3.10
YOLOv8m	25.84	78.7	49.6	36.9	35.5	20.7	9.00
ASM-YOLO(ours)	8.23	56.4	16.1	39.9	40.0	23.3	6.02
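The headline comparison with the baseline can be recomputed from the YOLOv8s and ASM-YOLO rows of Table 3. The short sketch below does this arithmetic with hypothetical helper names. It separates the relative reduction in parameters from the absolute gains in AP50 and mAP, which are differences in percentage points, and it also makes the cost side of the trade-off explicit.

```python
def relative_reduction(baseline, new):
    """Percentage by which `new` is smaller than `baseline`."""
    return (baseline - new) / baseline * 100


def ratio(baseline, new):
    """How many times larger `new` is than `baseline`."""
    return new / baseline


# Values copied from the YOLOv8s and ASM-YOLO rows of Table 3.
yolov8s = {"par": 11.13, "flops": 28.5, "ap50": 32.6, "map": 18.7, "latency": 3.10}
asm_yolo = {"par": 8.23, "flops": 56.4, "ap50": 40.0, "map": 23.3, "latency": 6.02}

param_reduction = relative_reduction(yolov8s["par"], asm_yolo["par"])
ap50_gain_points = asm_yolo["ap50"] - yolov8s["ap50"]
map_gain_points = asm_yolo["map"] - yolov8s["map"]
flops_ratio = ratio(yolov8s["flops"], asm_yolo["flops"])
latency_ratio = ratio(yolov8s["latency"], asm_yolo["latency"])
```

With these inputs the parameter reduction rounds to 26.1%, and both the FLOPs and the latency come out at a little under twice the baseline values.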

Figure 8 visually demonstrates the outstanding comprehensive performance of ASM-YOLO compared to the prevailing object detection models on the VisDrone-DET dataset, accentuating its feasibility.

4.5 Experimental results analysis

In this section, we discuss and analyze the experimental results obtained from evaluating the ASM-YOLO model on the VisDrone-DET-val and VisDrone-DET-test-dev datasets. Table 4 presents the performance of ASM-YOLO across 10 categories, as well as the overall recall, AP50, and mAP metrics in the VisDrone-DET dataset. Notably, the detection performance for the car, van, and bus categories shows great promise, achieving AP50 scores of 79.2%, 47.6%, and 62.8%, respectively, on the test set.

Table 4
Detection Performance of ASM-YOLO on Various Categories in the VisDrone-DET-val and VisDrone-DET-test-dev Datasets.

Datasets	Ind.¹	All	Ped.²	People	Bicycle	Car	Van	Truck	Tricycle	Awn.³	Bus	Motor
val⁴	R	46.2	52.1	40.4	21.2	84.9	53.6	42.3	24.7	22.6	57.2	53.5
	AP50	48.7	56.6	45.8	21.4	86.4	54.5	42.7	35.5	22.6	64.9	56.8
	mAP	30.1	28.3	19.6	10.4	63.3	39.5	29.0	21.0	14.2	48.2	27.6
test⁵	R	39.9	34.2	19.7	14.2	78.9	49.7	51.5	30.9	23.0	57.4	39.6
	AP50	40.0	37.3	24.0	15.8	79.2	47.6	47.2	24.6	24.5	62.8	36.5
	mAP	23.3	15.9	9.03	6.67	50.8	31.8	30.8	13.8	14.2	44.4	15.7

¹Ind. is indicators; ²Ped. is pedestrian; ³Awn. is awning-tricycle; ⁴val is VisDrone-DET-val; ⁵test is VisDrone-DET-test-dev.

Figure 8.

Comparative experiment. The x-axis denotes latency and the y-axis denotes accuracy; the size of each triangular marker indicates the parameter count of the model. (a) Comparison using the AP50 metric; (b) comparison using the mAP metric.

The detection performance of the proposed model in practical application scenarios was validated by selecting several complex images from the VisDrone-DET dataset, as depicted in Figure 9. Figure 9(a) showcases three drone-captured images, arranged from left to right: a nighttime scene, a distant small target scene, and an image with a complex background. Figure 9(b) presents the detection results of YOLOv8s. In this set of images, it is evident that the lower half of the first image exhibits issues of both missed detections and false positives, while the upper half of the second image and the left side of the third image show missed detections of vehicles and pedestrians, respectively. Figure 9 (c) displays the detection results achieved using ASM-YOLO, which, compared to Figure 9(b), demonstrates a certain degree of improvement in addressing the observed missed detections and false positives issues.

Figure 9.

Detection Effect: (a) Original Image; (b) YOLOv8s Detection Result; (c) ASM-YOLO Detection Result. Red boxes indicate false positive detections, light blue boxes indicate missed detections.

4.6 Interpretability experiment

To examine which image regions the ASM-YOLO model attends to, this study employs the Grad-CAM technique [49]. Grad-CAM uses gradient information to highlight the regions of the image that contribute most to a target prediction, producing a visually interpretable heat map. Because it relies only on gradients, it can be applied to a wide range of deep learning models without retraining. Applying Grad-CAM here shows where the network focuses within each image, which helps explain its decisions and assess the reliability and applicability of the network.
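The core of the Grad-CAM computation is short enough to sketch directly. The example below is a simplified illustration under stated assumptions: the activations of one convolutional layer and the gradients of the target score with respect to them are already available as nested lists indexed `[channel][row][column]`, and the function name `grad_cam` is hypothetical. In practice both inputs would be captured from the network with framework hooks, and the resulting map would be upsampled to the image size.

```python
def grad_cam(activations, gradients):
    """Class activation map from layer activations and their gradients.

    Each channel weight is the spatial mean of that channel's gradients;
    the map is the ReLU of the weighted sum of the activation channels.
    """
    channels = len(activations)
    height = len(activations[0])
    width = len(activations[0][0])

    # Global-average-pool the gradients to get one weight per channel.
    weights = [
        sum(sum(row) for row in gradients[c]) / (height * width)
        for c in range(channels)
    ]

    # Weighted combination of channels followed by ReLU.
    return [
        [
            max(0.0, sum(weights[c] * activations[c][i][j] for c in range(channels)))
            for j in range(width)
        ]
        for i in range(height)
    ]


acts = [[[1.0, 2.0], [3.0, 4.0]], [[1.0, 1.0], [1.0, 1.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]], [[-2.0, -2.0], [-2.0, -2.0]]]
cam = grad_cam(acts, grads)
```

The ReLU keeps only positions that push the target score up, which is why the heat maps highlight object regions rather than everything the layer responds to.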

Figure 10.

Grad-CAM Visualization: (a) Original Image; (b) Grad-CAM Visualization of YOLOv8s; (c) Grad-CAM Visualization of ASM-YOLO.

Figure 10 illustrates the experimental results. The first row showcases three selected images captured by low-altitude drones. The second row presents the Grad-CAM visualization results of the baseline YOLOv8s model, while the third row exhibits the Grad-CAM visualization results of the proposed model in this study. From Figure 10, it is evident that the baseline YOLOv8s model has limitations in detecting small objects at longer distances, despite performing well in close-range object detection. The proposed model in this study successfully mitigates this drawback of the baseline model. Moreover, this study demonstrates the superior generalization ability of the proposed model compared to the baseline model in the task of detecting low-altitude drone-captured images.

4.7 Validation across multiple datasets

To validate the generality of our improved algorithm, we additionally employed the DOTA dataset and DIOR dataset, which bear a resemblance to the low-altitude unmanned aerial vehicle (UAV) perspective. The DOTA dataset comprises 2,806 aerial images with 188,282 annotated instances. The image sizes range from approximately 800 $\times$ 800 to 4,000 $\times$ 4,000 pixels, encompassing multiple objects of varying scales, orientations, and shapes. Fifteen distinct object categories are represented, including baseball diamonds, basketball courts, bridges, harbors, helicopters, ground track fields, large vehicles, airplanes, ships, small vehicles, soccer fields, tanks, swimming pools, tennis courts, and roundabouts. We uniformly cropped these images to 640 $\times$ 640 dimensions with a 20% overlap and obtained 24,888 cropped images. Out of these, 19,006 were allocated for training, and 5,882 were set aside for validation.
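The cropping step can be illustrated with a small helper that computes where the 640-pixel windows start along one image axis. This is a sketch under assumptions: a 20% overlap is taken to mean a stride of 512 pixels, and the last window is shifted back so that it ends exactly at the image border. The paper does not state how borders are handled, so that choice and the function name `tile_origins` are hypothetical.

```python
def tile_origins(length, tile=640, overlap=0.2):
    """Start offsets of overlapping windows covering `length` pixels."""
    stride = round(tile * (1 - overlap))  # 512 for the defaults
    if length <= tile:
        return [0]
    origins = list(range(0, length - tile + 1, stride))
    # Add a final window aligned to the border if the regular grid stops short.
    if origins[-1] != length - tile:
        origins.append(length - tile)
    return origins


def tile_boxes(width, height, tile=640, overlap=0.2):
    """(x1, y1, x2, y2) crop boxes for an image of the given size."""
    return [
        (x, y, x + tile, y + tile)
        for y in tile_origins(height, tile, overlap)
        for x in tile_origins(width, tile, overlap)
    ]


crops = tile_boxes(1664, 1000)
```

Applying the same function to both axes gives the full grid of crops; objects that straddle a window boundary appear in at least one neighboring crop thanks to the overlap.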

The second is the DIOR dataset, comprising 23,463 images with dimensions of 800 $\times$ 800 pixels and 192,472 annotated instances. The dataset encompasses 20 categories, including airplanes, airports, baseball fields, basketball courts, bridges, chimneys, dams, highway service areas, highway toll booths, harbors, golf courses, ground track fields, overpasses, boats, stadiums, tanks, tennis courts, train stations, vehicles, and wind turbines. We maintained the original image sizes for input into the network.

Experimental results demonstrate that the proposed algorithm not only performs well on the VisDrone-DET dataset but also improves performance on the DOTA and DIOR datasets. As shown in Table 5, the improved algorithm raises the mAP metric by 1.5 and 2.4 percentage points on DOTA and DIOR, respectively. These results validate the feasibility and effectiveness of the enhanced algorithm.

Table 5
Cross-dataset validation experimental results comparison table.

Datasets	Models	Images	Classes	Input-size	Recall	AP50	mAP
DOTA	YOLOv8s	24888	15	640	74.4	79.5	60.1
	ASM-YOLO(ours)	24888	15	640	76.0	81.5	61.6
DIOR	YOLOv8s	23463	20	800	70.7	73.5	50.2
	ASM-YOLO(ours)	23463	20	800	71.9	76.8	52.6
VisDrone-DET	YOLOv8s	10209	10	640	34.7	32.6	18.7
	ASM-YOLO(ours)	10209	10	640	39.9	40.0	23.3

5. Conclusion

In this study, we propose the ASM-YOLO algorithm as an improvement upon YOLOv8s for object detection in low-altitude UAV imagery. To address large scale variations and the difficulty of detecting small objects in UAV aerial images, we replace the PAN-FPN structure in YOLOv8 with the ABiFPN structure. By maintaining a fixed channel number of 128 in the ABiFPN structure, the model’s parameter count is reduced by 26.1%, while the AP50 metric improves from 32.6% to 39.4%. This shows that the proposed structure reduces the parameter count and improves detection accuracy at the same time. To handle the complex backgrounds in UAV aerial images, which make objects hard to detect, we introduce the SFE module. This module injects features extracted by the Backbone into the Neck structure, making more efficient use of those features. Through integration with the CA mechanism, the structure accentuates positional and directional information during feature injection. This focus on small object features within long-range dependencies enhances the model’s ability to detect small objects amidst intricate backgrounds. Additionally, the MPDIoU bounding box loss function, characterized by superior localization performance and faster convergence, further improves the overall detection performance of the model.

In the experiments conducted on the public dataset VisDrone-DET, this paper demonstrates the effectiveness of the proposed algorithm compared to the baseline model YOLOv8s. While reducing the parameter size and model size by 26.1% and 24.7% respectively, the AP50 and mAP metrics improve by 7.4% and 4.6% respectively. Additionally, we conducted algorithm generalization validation on the DOTA and DIOR datasets. Experimental results indicate a notable improvement in the performance of ASM-YOLO across both datasets.

In the future, we plan to further explore the application of drones based on this foundation. Currently, the field of smart transportation still faces several challenges, including issues related to traffic flow prediction [50] and traffic data collection [51, 52]. We intend to leverage the unique advantages of drones, particularly from their aerial perspective, to assist in addressing these challenges and contribute to the development of a more intelligent transportation system. This research direction is poised to provide substantial support for the future advancements in smart transportation.

Acknowledgments

This work was supported in part by the projects of the National Natural Science Foundation of China (62376059), Fujian Provincial Department of Science and Technology (2021Y4019) and Fujian Provincial Audit Office (MSK2306).

References

Lin

T.-Y.

Maire

Belongie

Hays

Perona

Ramanan

Dollár

Zitnick

C.L.

, Microsoft coco: Common objects in context, in: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, Springer, 2014, pp. 740–755.

Zhang

Lin

Pan

, Crftl: cache reallocation-based page-level flash translation layer for smartphones, IEEE Transactions on Consumer Electronics (2023).

Jocher

Chaurasia

Qiu

, YOLO by Ultralytics, URL: https://github.com/ultralytics/ultralytics (2023).

Liao

Luo

Xiao

Zou

Lin

, Eagle-YOLO: An Eagle-Inspired YOLO for Object Detection in Unmanned Aerial Vehicles Scenarios, Mathematics 11(9) (2023), 2093.

Lyu

Liu

Guo

, A Real-Time and Lightweight Method for Tiny Airborne Object Detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 3015–3024.

Wang

Liew

J.H.

Zou

Zhou

Feng

, Panet: Few-shot image semantic segmentation with prototype alignment, in: proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 9197–9206.

Hou

Zhou

Feng

, Coordinate attention for efficient mobile network design, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 13713–13722.

Zheng

Wang

Ren

Liu

Zuo

, Enhancing geometric factors in model learning and inference for object detection and instance segmentation, IEEE Transactions on Cybernetics 52(8) (2021), 8574–8586.

Siliang

Yong

, MPDIoU: A Loss for Efficient and Accurate Bounding Box Regression, arXiv preprint arXiv:2307.07662 (2023).

10.

Zhu

Wen

Bian

Fan

Ling

, Detection and tracking meet drones challenge, IEEE Transactions on Pattern Analysis and Machine Intelligence 44(11) (2021), 7380–7399.

11.

Xia

G.-S.

Bai

Ding

Zhu

Belongie

Luo

Datcu

Pelillo

Zhang

, DOTA: A large-scale dataset for object detection in aerial images, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 3974–3983.

12.

Wan

Cheng

Meng

Han

, Object detection in optical remote sensing images: A survey and a new benchmark, ISPRS Journal of Photogrammetry and Remote Sensing 159 (2020), 296–307.

13.

Liu

Anguelov

Erhan

Szegedy

Reed

C.-Y.

Berg

A.C.

, Ssd: Single shot multibox detector, in: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14, Springer, 2016, pp. 21–37.

14.

Redmon

Divvala

Girshick

Farhadi

, You only look once: Unified, real-time object detection, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.

15.

Redmon

Farhadi

, YOLO9000: better, faster, stronger, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 7263–7271.

16.

Redmon

Farhadi

, Yolov3: An incremental improvement, arXiv preprint arXiv:1804.02767 (2018).

17.

Lin

T.-Y.

Goyal

Girshick

Dollár

, Focal loss for dense object detection, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.

18.

Girshick

Donahue

Darrell

Malik

, Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.

19.

Ren

Girshick

Sun

, Faster r-cnn: Towards real-time object detection with region proposal networks, Advances in neural Information Processing Systems 28 (2015).

20.

Gkioxari

Dollár

Girshick

, Mask r-cnn, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969.

21.

Everingham

Van Gool

Williams

C.K.

Winn

Zisserman

, The pascal visual object classes (voc) challenge, International Journal of Computer Vision 88 (2010), 303–338.

22.

Neubeck

Van Gool

, Efficient non-maximum suppression, in: 18th international conference on pattern recognition (ICPR’06), Vol. 3, IEEE, 2006, pp. 850–855.

23.

Wang

C.-Y.

Bochkovskiy

Liao

H.-Y.M.

, YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 7464–7475.

24.

Jocher

Chaurasia

Stoken

Borovec

Kwon

Michael

Fang

Yifu

Wong

Montes

et al., ultralytics/yolov5: v7. 0-yolov5 sota realtime instance segmentation, Zenodo (2022).

25.

Wang

Chen

Tang

Yang

, Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection, Advances in Neural Information Processing Systems 33 (2020), 21002–21012.

26.

Van Etten

, Satellite imagery multiscale rapid detection with windowed networks, in: 2019 IEEE winter conference on applications of computer vision (WACV), IEEE, 2019, pp. 735–743.

27.

Sahin

Ozer

, Yolodrone: Improved yolo architecture for object detection in drone images, in: 2021 44th International Conference on Telecommunications and Signal Processing (TSP), IEEE, 2021, pp. 361–365.

28.

Huang

Chen

Huang

, UFPMP-Det: Toward accurate and efficient object detection on drone imagery, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36, 2022, pp. 1026–1033.

29.

Zhao

Zhu

, MS-YOLOv7: YOLOv7 Based on Multi-Scale for Object Detection on UAV Aerial Photography, Drones 7(3) (2023), 188.

30.

Clausen

Grov

Aspinall

, CBAM: A Contextual Model for Network Anomaly Detection, Computers (2021), 79. doi: 10.3390/computers10060079.

31.

Liu

Lin

Cao

Wei

Zhang

Lin

Guo

, Swin Transformer: Hierarchical Vision Transformer using Shifted Windows., in: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021. doi: 10.1109/iccv48922.2021.00986.

32.

Bodla

Singh

Chellappa

Davis

L.S.

, Soft-NMS – improving object detection with one line of code, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 5561–5569.

33.

Misra

, Mish: A self regularized non-monotonic activation function, arXiv preprint arXiv:1908.08681, (2019).

34.

Fan

Huang

Han

, A Modified YOLOv8 Detection Network for UAV Aerial Image Recognition, Drones 7(5) (2023), 304.

35.

Tang

Han

Guo

Wang

, GhostNetv2: enhance cheap operation with long-range attention, Advances in Neural Information Processing Systems 35 (2022), 9969–9982.

36.

Tong

Chen

, Wise-IoU: Bounding Box Regression Loss with Dynamic Focusing Mechanism, arXiv preprint arXiv:2301.10051 (2023).

37.

Wang

Chen

Hong

Huang

, UAV-YOLOv8: A Small-Object-Detection Model Based on Improved YOLOv8 for UAV Aerial Photography Scenarios, Sensors 23(16) (2023), 7190.

38.

Zhu

Wang

Zhang

Lau

R.W.

, BiFormer: Vision Transformer with Bi-Level Routing Attention, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 10323–10333.

39.

Chen

Lin

Polat

Alhudhaif

Alenezi

, Consistency-and dependence-guided knowledge distillation for object detection in remote sensing images, Expert Systems with Applications 229 (2023), 120519.

40.

Lin

T.-Y.

Dollár

Girshick

Hariharan

Belongie

, Feature pyramid networks for object detection, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2117–2125.

41.

Ghiasi

Lin

T.-Y.

Q.V.

, Nas-fpn: Learning scalable feature pyramid architecture for object detection, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 7036–7045.

42.

Tan

Pang

Q.V.

, Efficientdet: Scalable and efficient object detection, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 10781–10790.

43.

Qiao

Chen

L.-C.

Yuille

, Detectors: Detecting objects with recursive feature pyramid and switchable atrous convolution, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 10213–10224.

44. Yang, Lei, Zhu, Cheng, Feng, Liang, AFPN: Asymptotic Feature Pyramid Network for Object Detection, arXiv preprint arXiv:2306.15988 (2023).

45. Chen, Kao S.-h., Zhuo, Wen, Lee C.-H., Chan S.-H.G., Run, Don’t Walk: Chasing Higher FLOPS for Faster Neural Networks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 12021–12031.

46. Zhang, Ren, Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.

47. Geng, Jiang, Cheng, Zhang, Chu, YOLOv6 v3.0: A full-scale reloading, arXiv preprint arXiv:2301.05586 (2023).

48. Wang, Nie, Guo, Liu, Han, Wang, Gold-YOLO: Efficient Object Detector via Gather-and-Distribute Mechanism, arXiv preprint arXiv:2309.11331 (2023).

49. Selvaraju R.R., Cogswell, Das, Vedantam, Parikh, Batra, Grad-CAM: Visual explanations from deep networks via gradient-based localization, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 618–626.

50. Liao, Zheng, Zou, Qiu, Zhang, An improved dynamic Chebyshev graph convolution network for traffic flow prediction with spatial-temporal attention, Applied Intelligence 52(14) (2022), 16104–16116.

51. Liao, Lin, Zou, Luo, Traj2Traj: A road network constrained spatiotemporal interpolation model for traffic trajectory restoration, Transactions in GIS (2023).

52. Lin, Luo, HRST-LR: A Hessian Regularization Spatio-Temporal Low Rank Algorithm for Traffic Data Imputation, IEEE Transactions on Intelligent Transportation Systems (2023).


Table 1
Training parameter table.

| Hyperparameter | Value |
| --- | --- |
| Epochs | 300 |
| Patience | 20 |
| Batch Size | 4 |
| Image Size | 640 $\times$ 640 |
| Optimizer | SGD |
| NMS IoU | 0.7 |
| Initial Learning Rate | 0.01 |
| Final Learning Rate | 1e-4 |
| Momentum | 0.937 |
| Weight-Decay | 5e-4 |
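
The settings in Table 1 can be collected into a single training configuration. The sketch below is illustrative only: this section does not name the training framework, so the keyword names (`imgsz`, `lr0`, `lrf`, and so on) follow the Ultralytics YOLOv8 convention as an assumption, and `TABLE_1` and `to_train_args` are hypothetical helpers. In that convention `lrf` is a fraction of the initial learning rate rather than an absolute value, so the final learning rate of 1e-4 corresponds to `lrf = 1e-4 / 0.01`, which is about 0.01.

```python
# Hyperparameters copied from Table 1 (values only; the names are illustrative).
TABLE_1 = {
    "epochs": 300,
    "patience": 20,        # early-stopping patience, in epochs
    "batch_size": 4,
    "image_size": 640,     # square input, 640 x 640
    "optimizer": "SGD",
    "nms_iou": 0.7,
    "initial_lr": 0.01,
    "final_lr": 1e-4,
    "momentum": 0.937,
    "weight_decay": 5e-4,
}


def to_train_args(table):
    """Map Table 1 onto Ultralytics-style keyword names (an assumption)."""
    return {
        "epochs": table["epochs"],
        "patience": table["patience"],
        "batch": table["batch_size"],
        "imgsz": table["image_size"],
        "optimizer": table["optimizer"],
        "iou": table["nms_iou"],
        "lr0": table["initial_lr"],
        # Assumed convention: final LR is expressed as a fraction of lr0.
        "lrf": table["final_lr"] / table["initial_lr"],
        "momentum": table["momentum"],
        "weight_decay": table["weight_decay"],
    }


train_args = to_train_args(TABLE_1)
for name, value in train_args.items():
    print(f"{name}: {value}")
```

With such a dictionary, a training call in that framework would typically take the form `model.train(data=..., **train_args)`, where the dataset configuration is left unspecified here.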