DEMA-YOLO: An Effective Industrial Defect Detection Algorithm Based on a Double-flow Edge Detail Enhancement Module and a Multi-scale Attention Mechanism

Abstract

Accurate industrial defect detection is crucial for industrial quality control, yet challenges such as complex backgrounds and small targets hinder performance. This study presents DEMA-YOLO, an efficient neural network based on the YOLOv10 architecture, integrating a dual-stream edge detail enhancement module and a multiscale attention mechanism. These components enhance feature representation and fine-grained perception. An improved NWD loss further stabilizes small object detection. Extensive experiments on PCB board, NEU-DET, and mixed-type WM38 datasets show that DEMA-YOLO achieves mAP scores of 93.9%, 90.5%, and 98.7%, respectively, outperforming YOLOv10s by 6.7% and 0.9% on PCB and NEU-DET. In the mixed-type WM38 dataset, while accuracy is comparable, DEMA-YOLO reduces parameters by 0.3M and increases the inference speed by 5.8 FPS. Inference speeds reach 119.2, 112.0, and 134.2 FPS on the three datasets, respectively. These results demonstrate the model’s effectiveness and efficiency in deep learning-based computer vision for industrial defect detection.

Keywords

Industrial defect detection; YOLOv10; efficient neural network; deep learning; computer vision

1. Introduction

Industrial products are foundational to construction, infrastructure, and various manufacturing sectors, playing a critical role in ensuring sustainable development and driving economic growth (Hassan et al., 2024; ShiLong et al., 2023; Sun et al., 2024). However, surface defects—caused by limitations in production cost, equipment, and process stability—are prevalent in items such as steel plates, printed circuit boards, and wafers. These defects significantly compromise product quality, safety, and reliability. To prevent defective products from entering the market, effective surface defect detection across all stages of production is imperative.

In real-world manufacturing environments, challenges such as variable lighting, equipment inconsistencies, and material differences contribute to the diversity and complexity of defect types. Traditional visual inspection methods, particularly manual inspections, suffer from low efficiency, high false detection rates, and long-term health risks to human operators (Li et al., 2024; Zhang et al., 2023). These limitations highlight the urgent need for automated, robust, and scalable defect detection solutions.

With advances in machine vision, the field has shifted from manual inspection to algorithm-driven approaches. Machine learning methods, including support vector machines (SVM), random forests (RF), and K-means clustering, rely on handcrafted features such as texture, shape, and colour. For example, Damira et al. (2024) proposed a hybrid method combining non-destructive ultrasonic testing and machine learning to detect bonding defects in composite materials. While effective in specific scenarios, the generalization ability of these approaches is constrained by their reliance on manually engineered features, making them less adaptable to the variability of real-world defect patterns (Lin et al., 2024).

In contrast, deep learning, particularly convolutional neural networks (CNNs), has demonstrated remarkable success in visual recognition tasks (Nagata et al., 2020). CNN-based detectors can learn hierarchical and abstract representations from raw data, thus improving detection accuracy and reducing the reliance on domain-specific feature design. Numerous studies have applied deep learning to industrial defect detection (Haigang et al., 2023; Hongxin et al., 2024; Ming et al., 2024; Zekai et al., 2023), achieving notable performance improvements. However, many of these methods sacrifice the speed of inference for precision (Feifan et al., 2024; Shuwen et al., 2023; Yuanyuan et al., 2024; Zhuxi et al., 2023), making them unsuitable for real-time deployment in high-throughput industrial scenarios (Kim et al., 2024).

Therefore, it is crucial to design a detection algorithm that balances real-time inference speed with high accuracy, while also being lightweight enough for deployment in resource-constrained environments. To address these challenges, we propose DEMA-YOLO, a lightweight and efficient industrial surface defect detection model based on YOLOv10 (Wang, 2024). The main contributions of this work are as follows.

To improve detection efficiency and accuracy, we developed DEMA-YOLO, an efficient industrial defect detection model based on YOLOv10. The model reduces the number of parameters and increases inference speed while maintaining a high level of detection accuracy. In addition, we replaced the standard convolutions in the model with depthwise separable convolutions, which further improves inference speed without compromising detection accuracy.

Upsampling operations in traditional detectors often result in the loss of fine details, particularly at the edges of small or irregular defects. These edge features are crucial for accurate defect boundary regression in industrial applications. To address this issue, we propose a dual-flow Detail Enhancement Upsampling (DDEU) module, which extracts and fuses key information from both low-level and high-level features to preserve edge integrity during upsampling. By facilitating interactions across different scales, the DDEU module aids in recovering the detailed structure surrounding defects.

In complex industrial environments, defect patterns often exhibit variability in appearance and are embedded within noisy backgrounds. Conventional attention mechanisms typically treat all spatial information uniformly, which limits their capacity to adaptively focus on key defect areas from diverse perspectives. To address this limitation, we propose a Multiscale Attention Mechanism of Different Perspectives (MAMDP) that assigns varying weights to feature maps derived from different scales and views. This allows the network to prioritize information that is both spatially and semantically significant. Consequently, this mechanism improves feature discrimination across different defect types and improves robustness against background interference.

To address the limitations of traditional localization loss functions, which often struggle with accurately regressing bounding boxes for small objects due to scale imbalance and localization sensitivity, we introduced the novel Normalized Wasserstein Distance (NWD) loss (Rezatofighi et al., 2019). NWD treats bounding boxes as 2D Gaussian distributions and computes a distance metric that is robust to scale variations, thereby enhancing the model’s sensitivity to small target defects. Furthermore, we improved this loss by incorporating adaptive temperature coefficients, which dynamically adjust the influence of the distillation term during training, and by applying high-order distance metrics to better capture spatial discrepancies. These enhancements enable the model to autonomously adapt to varying object sizes, thereby improving the stability and accuracy of the bounding box regression for both small and large defects.

2. Related Work

2.1. Target Detection Methods Based on Deep Learning

Deep learning techniques for industrial defect detection are typically divided into two-stage and single-stage methods. Two-stage methods, such as Faster R-CNN (Ren et al., 2015) and Mask R-CNN (He et al., 2017), first generate candidate regions using a region proposal network (RPN) and then classify these regions and refine their positions through bounding box regression. Liyun et al. (2020) improved Faster R-CNN by optimizing the anchors, thereby improving the accuracy and efficiency of engine surface defect detection and localization. Shuhong et al. (2023) proposed a wheel hub defect detection model based on DS-Cascade R-CNN, which integrates spatial attention for multiscale localization and deformable convolution for adaptive feature extraction to improve detection accuracy. Ping et al. (2024) improved the Faster R-CNN model by integrating ResNet50 and a feature pyramid network (FPN) and used K-Means++ to optimize the generation of proposal boxes, achieving better detection accuracy.

Single-stage detection methods, such as YOLO and SSD, perform target localization and classification simultaneously, offering high-speed detection that is well suited for real-time industrial inspection. Although these methods were initially less accurate than two-stage approaches, recent research has significantly improved their accuracy. Several studies have enhanced real-time detection by combining multiscale feature processing with coordinate attention (CA) (Dehua et al., 2023; Jingni et al., 2024; Yuxi et al., 2024).

Liu et al. (2025) improved YOLOv8s for surface defect detection in metal castings by introducing modules for small object detection, spatial feature preservation, and multi-scale attention, along with the Wise-IoU loss. Their model achieves higher accuracy, faster inference, and a smaller size, showing strong performance on both proprietary and public datasets and real-time capability on edge devices. Pang et al. (2025) proposed ASSM-YOLO, a lightweight and high-precision model for detecting multi-scale defects on cathode copper plates, which combines ADFEM and SimAM for feature enhancement, a Slim Neck for efficiency, and MPDSIoU loss for improved localization performance. Ma et al. (2025) proposed ELA-YOLO, a YOLOv8-based model for steel surface defect detection, which integrates linear attention, a selective feature pyramid, and a lightweight detection head to balance accuracy and efficiency under complex industrial conditions. Zhou and Zhao (2025) proposed MPA-YOLO, a YOLOv8-based model for steel surface defect detection, which integrates a multi-path convolution attention module (MPCA), partial self-attention, and an auxiliary detection head to improve feature representation and localization. It achieved notable gains in mAP, precision, and recall on the NEU-DET and VOC2007 datasets. Tao et al. (2024) proposed EFE-YOLO, an enhanced YOLOv5-based model for industrial small object detection. It introduces PSRFA upsampling, MSE downsampling, and AFAF modules to improve feature extraction under occlusion and complex backgrounds, achieving higher precision and recall on a custom dataset.

2.2. Technical Gaps

Although significant progress has been made in deep learning-based industrial defect detection, several critical challenges remain unsolved:

– Loss of fine-grained features during upsampling. Many existing detectors rely on standard upsampling operations (e.g., nearest-neighbour, bilinear interpolation), which often discard edge and texture information crucial for accurately localizing small or boundary defects.
– Limited multi-perspective attention mechanisms. Current attention modules often fail to effectively distinguish features from multiple spatial perspectives, making it difficult to capture subtle, irregular, or occluded defects in complex industrial backgrounds.
– Insufficient accuracy in localizing small objects. Widely used localization loss functions such as GIoU and CIoU tend to underperform when regressing small targets, primarily due to their limited sensitivity to scale variations and position misalignment.
– Suboptimal trade-off between speed and accuracy. While many real-time detectors (e.g., YOLOv5/8) achieve high inference speed, they often sacrifice detection precision, particularly under high-speed or variable production conditions.
– Model complexity and deployment constraints. Some advanced models introduce heavy modules (e.g., cascaded R-CNN heads, complex attention blocks) that increase computation and memory load, making them less suitable for edge deployment or embedded systems in industrial settings.

To address the aforementioned challenges, this study proposes a lightweight detection model named DEMA-YOLO, which is based on the YOLOv10 architecture. This model significantly enhances feature extraction and attention representation while maintaining a compact size. In comparison to traditional two-stage and conventional single-stage methods, DEMA-YOLO achieves superior accuracy-speed trade-offs and demonstrates strong adaptability in high-speed and variable industrial environments, thereby showcasing substantial practical value for real-world deployment.

2.3. YOLOv10 Architecture

YOLOv10 is a recent single-stage detector evolved from YOLOv8, designed to improve detection performance through architectural optimizations. Key innovations include: (1) a consistent dual assignment strategy that supports end-to-end training and inference without requiring Non-Maximum Suppression (NMS); (2) lightweight classification heads and spatial-channel decoupling downsampling (SCDown) for improved efficiency; and (3) partial self-attention (PSA) and large kernel convolution modules that enhance receptive field and feature refinement with minimal computational cost. Its backbone integrates improved C2f and SPPF modules to enhance multi-scale feature representation, while the head employs multi-scale feature maps via the v10Detect layer for effective detection across object sizes. Given its high accuracy, speed, and real-time inference capabilities, YOLOv10 serves as a suitable and efficient baseline for the detection of industrial defects. The overall architecture is shown in Figure 1.

Figure 1.

YOLOv10 Architecture.

3. The Proposed Approach

3.1. Efficient Industrial Defect Detection Model: DEMA-YOLO

To address the resource limitations encountered in practical industrial defect detection and to improve the balance between detection accuracy and efficiency, we developed an efficient detection model, DEMA-YOLO, based on YOLOv10. The objective is to achieve favourable detection results while reducing the number of parameters. The DEMA-YOLO framework is shown in Figure 2. DEMA-YOLO employs YOLOv10 as the backbone network of the detection model. It replaces the standard convolutional layers in the backbone with the depthwise separable convolution and average pooling (DWAConv) module and substitutes the C2f module with the proposed MAMDP module for feature extraction. This approach selects key industrial defect information from both spatial and channel perspectives, distinguishing the importance of various input features by weighting the original feature map. Consequently, it reduces computational complexity while enhancing feature extraction capabilities. The neck employs the DDEU structure for upsampling so that the upsampled features retain rich detail information. Finally, we improve the model’s capacity to detect industrial defects by integrating and optimizing the NWD loss.

Figure 2.

DEMA-YOLO Architecture.

3.2. Depthwise Separable Convolution and Average Pooling (DWAConv)

To reduce the computational requirements of the model, we adopted and modified the YOLOv10 backbone. YOLOv10 employs a more efficient model architecture and a novel training strategy that enhance both performance and efficiency. First, we replaced the conventional convolution (Conv) module with the DWAConv module, which substantially reduces computational complexity. The structure of DWAConv is shown in Figure 3. DWAConv consists primarily of a depthwise separable convolution (DWConv), an average pooling layer, and a squeeze-and-excitation (SE) block. The SE block comprises two stages: Squeeze, which captures global information from the network, and Excitation, which reweights the extracted information. In the squeeze stage, global average pooling collapses the spatial dimensions of the feature map so that information is concentrated in the channel dimension, producing a lower-dimensional feature vector. In the excitation stage, this feature vector undergoes $1 \times 1$ convolution followed by a sigmoid layer to produce a weight vector. The weight vector is then multiplied by the original features to yield the reweighted features. The architecture of the SE block is detailed in Table 1. This architecture is designed to preserve important features while maintaining minimal computational complexity. The architecture of DWAConv is detailed in Table 2. DWConv consists of a $3 \times 3$ grouped convolution and a $1 \times 1$ standard convolution, and the feature map is also downsampled by a factor of 2.

Figure 3.

DWAConv Module.

Table 1.

Layer Parameters of the SE Block.

Layer	Kernel size	Stride	Padding	Dimension
Input	–	–	–	$h_{1} \times w_{1} \times c_{1}$
GAP	$1 \times 1$	1	0	$1 \times 1 \times c_{1}$
Conv	$1 \times 1$	1	0	$1 \times 1 \times c_{1} / 16$
ReLU	–	–	–	$1 \times 1 \times c_{1} / 16$
Conv	$1 \times 1$	1	0	$1 \times 1 \times c_{1}$
Sigmoid	–	–	–	$1 \times 1 \times c_{1}$
Output	–	–	–	$1 \times 1 \times c_{1}$
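
The squeeze and excitation steps listed in Table 1 can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `se_block`, the nested-list layout `[C][H][W]`, and the omission of biases are assumptions made for readability, and each $1 \times 1$ convolution applied to a $1 \times 1$ map is written as a matrix-vector product.

```python
import math


def se_block(x, w1, w2):
    """Reweight the channels of x (shape [C][H][W]) with squeeze-and-excitation.

    w1: [C_r][C] weights of the reducing 1x1 conv (C_r = C // 16 in Table 1).
    w2: [C][C_r] weights of the restoring 1x1 conv.
    Biases are omitted for brevity (an assumption of this sketch).
    """
    # Squeeze: global average pooling turns each channel into one scalar.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]
    # Excitation, step 1: 1x1 conv (matrix-vector product) followed by ReLU.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    # Excitation, step 2: 1x1 conv back to C channels, then a sigmoid gate.
    gate = [
        1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every pixel of channel c by that channel's gate value.
    out = [[[g * p for p in row] for row in ch] for ch, g in zip(x, gate)]
    return out, gate


# Toy input with 2 channels of size 2x2 and a single hidden unit.
x = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
out, gate = se_block(x, w1=[[1.0, 0.0]], w2=[[0.0], [0.0]])
```

With all-zero restoring weights the gate is 0.5 for every channel, so each channel is simply halved; trained weights would instead emphasize informative channels.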

Table 2.

Layer Parameters of the DWAConv Module.

Operator	Kernel size	Stride	Padding	Dimension
Input	–	–	–	$h_{1} \times w_{1} \times c_{1}$
DWConv	$3 \times 3$	1,1	1,0	$h_{1} \times w_{1} \times c_{2}$
BatchNorm	–	–	–	$h_{1} \times w_{1} \times c_{2}$
Sigmoid	–	–	–	$h_{1} \times w_{1} \times c_{2}$
AvgPool	$2 \times 2$	2	0	$h_{1} / 2 \times w_{1} / 2 \times c_{2}$
SE	–	–	–	$1 \times 1 \times c_{2}$
Output	–	–	–	$h_{1} / 2 \times w_{1} / 2 \times c_{2}$
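
To give a sense of why replacing standard convolutions with depthwise separable ones reduces model size, the following sketch counts the weights of both layer types. The channel sizes are illustrative and not taken from the paper, biases are ignored, and the depthwise step is assumed to use one filter per input channel.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def dwsep_params(c_in, c_out, k):
    """Weights in a depthwise k x k conv followed by a pointwise 1 x 1 conv."""
    depthwise = k * k * c_in  # one k x k filter per input channel
    pointwise = c_in * c_out  # the 1 x 1 conv mixes information across channels
    return depthwise + pointwise


# Illustrative layer sizes (assumed, not from the paper).
c_in, c_out, k = 64, 128, 3
standard = conv_params(c_in, c_out, k)
separable = dwsep_params(c_in, c_out, k)
ratio = separable / standard
print(standard, separable, round(ratio, 3))
```

For these sizes the separable layer needs roughly 12% of the weights of the standard layer, which matches the closed-form ratio $1 / c_{out} + 1 / k^{2}$.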

3.2.1. Multiscale Attention Mechanism of Different Perspectives (MAMDP)

To enhance the focus of the model on salient features, the MAMDP module is integrated into the backbone network, effectively capturing critical information from both spatial and channel dimensions. Spatial attention is achieved by computing the mean and maximum values along the channel axis, which are then concatenated and subjected to convolution to emphasize spatial dependencies. Concurrently, multi-scale features are extracted through three parallel convolutional branches, which are subsequently fused and adaptively weighted back into the original feature map. Additionally, channel-wise attention is further refined using the squeeze-and-excitation (SE) block, with the final output derived from the combination of spatial and channel features based on their respective contributions. The structure of the MAMDP module is illustrated in Figure 4, while its architecture is detailed in Table 3.

Figure 4.

MAMDP Block.

Table 3.

Layer Parameters of the MAMDP Module.

Operator	Kernel size	Stride	Padding	Dimension
Input	–	–	–	$h_{1} \times w_{1} \times c_{1}$
Mean	–	–	–	$h_{1} \times w_{1} \times 1$
Max	–	–	–	$h_{1} \times w_{1} \times 1$
Conv	$1 \times 1$	1	0	$h_{1} \times w_{1} \times 1$
Conv	$5 \times 5$	1	2	$h_{1} \times w_{1} \times 1$
Conv	$3 \times 3$	1	1	$h_{1} \times w_{1} \times 1$
Conv	$7 \times 7$	1	3	$h_{1} \times w_{1} \times 1$
Conv	$1 \times 1$	1	0	$h_{1} \times w_{1} \times 1$
BatchNorm	–	–	–	$h_{1} \times w_{1} \times 1$
Sigmoid	–	–	–	$h_{1} \times w_{1} \times 1$
SE	–	–	–	$1 \times 1 \times c_{1}$
Output	–	–	–	$h_{1} \times w_{1} \times c_{1}$
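
The first step of the spatial branch, pooling along the channel axis and fusing the two resulting maps, can be sketched as below. This covers only the mean/max stage, not the three parallel multi-scale branches. The function names and the two fusion weights `w_mean` and `w_max` (standing in for a learned $1 \times 1$ convolution over the concatenated maps) are hypothetical.

```python
import math


def channel_mean_max(x):
    """Return per-pixel mean and max over channels for x of shape [C][H][W]."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    mean_map = [
        [sum(x[k][i][j] for k in range(c)) / c for j in range(w)]
        for i in range(h)
    ]
    max_map = [
        [max(x[k][i][j] for k in range(c)) for j in range(w)]
        for i in range(h)
    ]
    return mean_map, max_map


def spatial_gate(x, w_mean=0.5, w_max=0.5):
    """Fuse the two maps with assumed 1x1 conv weights and apply a sigmoid."""
    mean_map, max_map = channel_mean_max(x)
    return [
        [1.0 / (1.0 + math.exp(-(w_mean * m + w_max * n)))
         for m, n in zip(mean_row, max_row)]
        for mean_row, max_row in zip(mean_map, max_map)
    ]


# Toy feature map: 2 channels, 1 row, 2 columns.
x = [[[0.0, 2.0]], [[4.0, -2.0]]]
mean_map, max_map = channel_mean_max(x)
gate = spatial_gate(x)
```

The resulting gate has one value per spatial location, so multiplying it into the feature map emphasizes the same locations in every channel.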

3.3. Dual-flow Detail Enhancement Upsampling (DDEU)

The neck network in YOLOv10 is designed for advanced feature fusion and multi-scale information extraction. It inherits the design philosophy of YOLOv8 and introduces a novel feature fusion module, C2fCIB, which replaces the original C2f module in semantically rich layers. By employing depthwise separable convolution in place of standard convolution, C2fCIB effectively reduces computational complexity while enlarging the receptive field, thereby achieving efficient feature aggregation with minimal computational overhead. However, the reliance on a basic upsampling module within the neck can result in the loss of fine-grained details, leading to frequent missed or false detections of small industrial defects. To address this limitation, the DDEU module is proposed. It comprises two parallel branches: one dedicated to upsampling the feature map to the target resolution, and the other designed to extract and preserve fine details prior to fusion. This structure enables the integration of high-resolution features containing shallow-layer details with semantically enriched deep features after upsampling, thereby enhancing spatial information reconstruction. In the detail recovery branch, a dual-stream enhancement mechanism is applied, incorporating $3 \times 3$ average pooling and $1 \times 1$ convolution to capture detailed information. In addition, a three-level pyramid structure is employed to facilitate cross-scale interaction, improving the model’s responsiveness to various characteristics, particularly in contexts sensitive to detail. The structure of the DDEU module is illustrated in Figure 5, and its architecture is detailed in Table 4.

A novel edge enhancement module, termed Focal Edge, is introduced to improve the model’s sensitivity to edge information. This module constructs an edge detector using a two-layer depthwise separable convolution with ReLU activation, thereby enhancing high-frequency components. To facilitate multiscale feature fusion, it integrates conventional pooling with dilated convolution. A spatial attention mechanism is further employed to adaptively modulate the enhancement strength at each spatial location, ensuring output values lie within the range of 1 to 2 for dynamic scaling. Additionally, a convolutional layer equipped with Batch Normalization and Parametric ReLU (PReLU) is used to strengthen feature representation while preserving the original information and reinforcing edge characteristics. Finally, multiscale information fusion is performed by leveraging the complementary responses of conventional pooling and dilated convolution, enabling the model to capture edge features across varying receptive fields. The architecture of Focal Edge is detailed in Table 5.

Figure 5.

DDEU Module.

Table 4.

Layer Parameters of the DDEU Module.

Layer	Kernel size	Stride	Padding	Dimension
Input	–	–	–	$h_{1} \times w_{1} \times c_{1}$
DSConv	$3 \times 3$	1	1	$h_{1} \times w_{1} \times c_{1}$
Conv	$1 \times 1$	1	0	$h_{1} \times w_{1} \times c_{1}$
AP	$3 \times 3$	1	1	$h_{1} \times w_{1} \times c_{1}$
Conv	$1 \times 1$	1	0	$h_{1} \times w_{1} \times c_{1}$
AP	$3 \times 3$	1	1	$h_{1} \times w_{1} \times c_{1}$
Conv	$1 \times 1$	1	0	$h_{1} \times w_{1} \times c_{1}$
AP	$3 \times 3$	1	1	$h_{1} \times w_{1} \times c_{1}$
Conv	$1 \times 1$	1	0	$h_{1} \times w_{1} \times c_{1}$
FocalEdge	–	–	–	$h_{1} \times w_{1} \times c_{1}$
FocalEdge	–	–	–	$h_{1} \times w_{1} \times c_{1}$
FocalEdge	–	–	–	$h_{1} \times w_{1} \times c_{1}$
Output	–	–	–	$h_{1} \times w_{1} \times 2$

Table 5.

Layer Parameters of the Focal Edge Module.

Layer	Kernel size	Stride	Padding	Dimension
Input	–	–	–	$h_{1} \times w_{1} \times c_{1}$
Conv	$3 \times 3$	1	1	$h_{1} \times w_{1} \times c_{1}$
ReLU	–	–	–	$h_{1} \times w_{1} \times c_{1}$
Conv	$3 \times 3$	1	1	$h_{1} \times w_{1} \times c_{1}$
AP	$3 \times 3$	1	1	$h_{1} \times w_{1} \times c_{1}$
Dilation Conv	$3 \times 3$	1	2	$h_{1} \times w_{1} \times c_{1}$
Conv	$3 \times 3$	1	1	$h_{1} \times w_{1} \times c_{1}$
Conv	$3 \times 3$	1	1	$h_{1} \times w_{1} \times c_{1} / 4$
ReLU	–	–	–	$h_{1} \times w_{1} \times c_{1} / 4$
Conv	$3 \times 3$	1	1	$h_{1} \times w_{1} \times c_{1}$
Sigmoid	–	–	–	$h_{1} \times w_{1} \times c_{1}$
Conv	$1 \times 1$	1	0	$h_{1} \times w_{1} \times c_{1}$
BatchNorm	–	–	–	$h_{1} \times w_{1} \times c_{1}$
PReLU	–	–	–	$h_{1} \times w_{1} \times c_{1}$
Conv	$1 \times 1$	1	0	$h_{1} \times w_{1} \times c_{1}$
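
One way to obtain an enhancement strength between 1 and 2 at each spatial location, as described for the Focal Edge attention, is to add 1 to a sigmoid output. The sketch below illustrates only this scaling step and assumes that construction; the function names are hypothetical and the paper does not state the exact formula.

```python
import math


def edge_scale(attention_logit):
    """Map an unbounded attention response to a scaling factor between 1 and 2."""
    return 1.0 + 1.0 / (1.0 + math.exp(-attention_logit))


def enhance(feature_row, logits_row):
    """Scale each feature value by its location-specific factor."""
    return [f * edge_scale(a) for f, a in zip(feature_row, logits_row)]


scales = [edge_scale(a) for a in (-5.0, 0.0, 5.0)]
enhanced = enhance([2.0, 2.0], [0.0, 0.0])
```

Because the factor never drops below 1, the original feature is always preserved and edge regions are only amplified, never suppressed.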

3.4. Loss Function

To enhance the detection performance of the model, we modified the loss function of DEMA-YOLO by incorporating the NWD loss, which improves the stability of model training. The loss function of YOLOv10 consists of two components built from three loss functions: the position loss $l_{box}$ between the target box and the predicted box, which combines the DFL and CIoU losses, and the classification loss $l_{cls}$ used for target classification. The overall loss is given in Equation (1):

$$
\mathrm{Loss} = l_{box} + l_{cls} \tag{1}
$$

Specifically, the $l_{box}$ loss is composed of the DFL loss and the CIoU loss. The DFL loss $l_{DFL}$ is computed as in Equation (2):

$$
\mathrm{DFL}(s_{i}, s_{j}) = -\left( (y_{j} - y) \log(s_{i}) + (y - y_{i}) \log(s_{j}) \right) \tag{2}
$$

Where $s_{i}$ and $s_{j}$ are the probability values of two adjacent intervals predicted by the model (corresponding to the distribution of discretized offsets), $y$ is the actual offset (such as the distance from the center point to the boundary), and $y_{i}$ and $y_{j}$ are the two boundary values nearest to $y$ in the discretization interval. The CIoU loss $l_{CIoU}$ is computed as in Equation (3):

$$
\mathrm{CIoU\ Loss} = 1 - \mathrm{IoU} + \frac{\rho^{2}(b_{pred}, b_{gt})}{c^{2}} + \alpha v \tag{3}
$$

Where IoU is the intersection-over-union ratio between the predicted box and the ground-truth box, $\rho^{2}(b_{pred}, b_{gt})$ is the squared Euclidean distance between the center points of the predicted box and the ground-truth box, $c$ is the diagonal length of the smallest box enclosing both the predicted box and the ground-truth box, $\alpha$ is a positive trade-off parameter, and $v$ measures the consistency of the aspect ratios of the two boxes. The classification loss $l_{cls}$ adopts the BCE loss and is calculated only for positive samples, as given in Equation (4):

$$
\mathrm{BCE\ Loss} = -\frac{1}{N_{pos}} \sum_{k=1}^{N_{pos}} \left[ y_{k} \log(p_{k}) + (1 - y_{k}) \log(1 - p_{k}) \right] \tag{4}
$$

Where $y_{k}$ is the ground-truth label (0 or 1), and only positive samples ($y_{k} = 1$) participate in the calculation; $p_{k}$ is the category probability predicted by the model (after sigmoid activation); and $N_{pos}$ is the number of positive samples, used to normalize the loss.

The combination of DFL and CIoU enhances robustness in complex scenarios by improving both local offset prediction and global geometric alignment. However, traditional IoU-based losses remain highly sensitive to small positional deviations, particularly for small targets, often resulting in unstable gradients. To address this issue, the NWD loss is introduced, leveraging distributional similarity to reduce sensitivity to localization errors. Even with slight deviations, high similarity scores can be maintained when the predicted and ground-truth boxes substantially overlap, thereby stabilizing gradient propagation. To enhance adaptability, a temperature coefficient is dynamically adjusted based on the target area, enabling sharper gradients for small objects and greater tolerance for larger ones. Furthermore, replacing the L2-based center distance with a higher-order term improves robustness against outliers and suppresses excessive penalties, thereby enhancing the stability and precision of bounding box regression across diverse target scales. The NWD loss is shown in Equation (5):

$$
\mathrm{NWD}(B_{p}, B_{t}) = \exp\left( -\frac{\sqrt{W^{2}(N_{p}, N_{t})}}{C} \right) \tag{5}
$$

Where $B_{p}$ and $B_{t}$ denote the predicted bounding box and the ground-truth bounding box, respectively; $N_{p}$ and $N_{t}$ are the two-dimensional Gaussian distributions used to model these bounding boxes; $W^{2}$ is the squared Wasserstein distance between the two Gaussian distributions; and $C$ is a normalization constant. The improved NWD loss is shown in Equation (6):

$$
\begin{aligned}
\mathrm{NWD}_{improved} &= \exp\left( -\frac{\sqrt{D_{center}^{4} + D_{wh}}}{\mathrm{adaptive\_constant}} \right) \\
\mathrm{adaptive\_constant} &= C \times \frac{\mathrm{target}}{\mathrm{mean\_target} + \epsilon} \\
D_{center}^{4} &= (x_{p} - x_{t})^{4} + (y_{p} - y_{t})^{4} + \epsilon
\end{aligned} \tag{6}
$$

Where target is the area of the ground-truth bounding box and mean_target is the average area of all ground-truth bounding boxes in the current batch; $\epsilon$ is a numerical stability coefficient (default 1e-7) that prevents division by zero; $D_{center}^{4}$ replaces the second-power term of the original NWD loss with a fourth-power term, allowing the model to respond appropriately to gradients of different strengths; and $D_{wh}$ is the distance between the widths and heights of the target box and the predicted box. The gradient comparison between L2 and L4 is shown in Equation (7):

$$
\begin{aligned}
L2 &= \frac{\partial L}{\partial d} \propto -d \cdot e^{-d} \\
L4 &= \frac{\partial L}{\partial d} \propto -2 d^{3} \cdot e^{-d^{2}}
\end{aligned} \tag{7}
$$

In scenarios with large localization errors, the gradient of the quadratic loss term rapidly approaches zero, limiting the network’s ability to adjust bounding box predictions effectively. By contrast, incorporating a fourth power term allows the gradient to increase cubically with the error, enabling suppression of excessive gradients while retaining sensitivity to large deviations during training. Compared to CIoU loss, the improved NWD loss dynamically adapts to object scale, enhancing gradient propagation for small targets through stronger responses and offering stable, appropriately scaled penalties for larger targets.
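
Equations (5) and (6) can be illustrated with the following plain-Python sketch for boxes given as `(cx, cy, w, h)`. Several details are assumptions rather than values from the paper: the Wasserstein term uses the usual closed form for axis-aligned Gaussians with covariance $\mathrm{diag}(w^{2}/4, h^{2}/4)$, $D_{wh}$ is taken to be the same width/height term, and the constant `c=12.8` is only a placeholder.

```python
import math

EPS = 1e-7  # numerical stability term, the default quoted in the text


def nwd(pred, gt, c=12.8):
    """Equation (5): similarity of two boxes modelled as 2D Gaussians.

    c is a dataset-dependent constant; 12.8 is an assumed placeholder.
    """
    (xp, yp, wp, hp), (xt, yt, wt, ht) = pred, gt
    # Squared 2-Wasserstein distance between the two axis-aligned Gaussians.
    w2 = (xp - xt) ** 2 + (yp - yt) ** 2 + ((wp - wt) ** 2 + (hp - ht) ** 2) / 4.0
    return math.exp(-math.sqrt(w2) / c)


def nwd_improved(pred, gt, mean_area, c=12.8):
    """Equation (6): fourth-power centre term and area-adaptive constant."""
    (xp, yp, wp, hp), (xt, yt, wt, ht) = pred, gt
    d_center4 = (xp - xt) ** 4 + (yp - yt) ** 4 + EPS
    # Assumption: D_wh reuses the width/height term of the original NWD.
    d_wh = ((wp - wt) ** 2 + (hp - ht) ** 2) / 4.0
    # The constant shrinks for targets smaller than the batch average.
    adaptive_c = c * (wt * ht) / (mean_area + EPS)
    return math.exp(-math.sqrt(d_center4 + d_wh) / adaptive_c)


# Same one-pixel centre offset applied to a small and a large target.
small = nwd_improved((11, 11, 4, 4), (10, 10, 4, 4), mean_area=64.0)
large = nwd_improved((11, 11, 16, 16), (10, 10, 16, 16), mean_area=64.0)
```

For the same centre offset, the small box receives a lower similarity score than the large box, reflecting the stricter treatment of small targets and the greater tolerance for large ones described above.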

3.5. Training Procedure

The DEMA-YOLO model is built upon a YOLOv10 backbone and is further enhanced by the proposed DDEU and MAMDP modules. As depicted in Algorithm 1, training begins by initializing the model with pre-trained YOLOv10 weights, which aids faster convergence. During each epoch, input batches undergo standard data augmentation, such as resizing, flipping, and mosaic transformations. Features are then extracted by the MAMDP-enhanced backbone and fused across scales in the neck, where the DDEU module performs upsampling. The network outputs classification scores, bounding box coordinates, and objectness predictions. Training is governed by a composite loss consisting of the BCE loss for classification, the CIoU loss for box regression, the DFL loss for enhancing localization precision, and the modified NWD loss designed for small object detection.
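
Two terms of the composite loss, the DFL term of Equation (2) and the BCE term of Equation (4), can be written directly from their definitions. The sketch below is a scalar illustration with hypothetical function names; a real training loop would evaluate these over batched tensors.

```python
import math


def dfl(y, y_i, y_j, s_i, s_j):
    """Equation (2): cross-entropy against the two bins that bracket y."""
    return -((y_j - y) * math.log(s_i) + (y - y_i) * math.log(s_j))


def bce(labels, probs):
    """Equation (4): mean binary cross-entropy over the given samples."""
    total = sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(labels, probs)
    )
    return -total / len(labels)


# True offset 2.3 lies between bins 2 and 3, so the ideal split is 0.7 / 0.3.
good = dfl(2.3, 2, 3, s_i=0.7, s_j=0.3)
bad = dfl(2.3, 2, 3, s_i=0.3, s_j=0.7)
cls_loss = bce([1, 1], [0.5, 0.5])
```

The DFL value is lower when the predicted probabilities match the distances of the true offset to its two neighbouring bins, which is how the term encourages sharp, well-placed offset distributions.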

4. Experiments and Results

4.1. Experimental Datasets

To assess the effectiveness of our proposed approach, we evaluate our method on three datasets: the PCB board dataset (Beijing University, 2024), the NEU-DET steel dataset (Northeast University, 2024), and the mixed-type WM38 dataset (Wang et al., 2020). We selected these three datasets because their distinct characteristics support a thorough evaluation of our method. The PCB dataset is particularly valuable, as it contains a significant number of small target defects on circuit boards, facilitating the assessment of our method’s performance on small targets. The NEU-DET dataset was chosen for its abundance of steel defects that closely resemble the background, allowing us to evaluate the effectiveness of our proposed module in recognizing the edges of target defects. Lastly, we included the complex mixed-type WM38 dataset, which features a variety of mixed wafer defects. This dataset is instrumental in demonstrating the model’s capacity to accurately identify different defects within a single image and in assessing detection accuracy. These highly representative industrial defect datasets enable a comprehensive comparison of the performance and robustness of our method, further supported by ablation experiments that validate the compatibility of each contribution.

4.2. Experimental Parameter Settings

To ensure a fair and consistent evaluation, all experiments in this study—including baseline comparisons, ablation studies, and variant analyses—were conducted in identical training settings. Specifically, we used the same dataset, number of epochs, learning rate, batch size, image resolution, AdamW optimizer, and data augmentation strategies across all models. This standardized configuration eliminates potential biases arising from inconsistent hyperparameters or preprocessing, thereby ensuring that performance differences can be attributed solely to the architectural design of the models. Table 6 shows the hardware and software environment of the experimental platform. Table 7 lists the default hyperparameter settings for the training procedure.

Table 6.
Experimental Platform.

Component	Configuration
CPU	Intel Xeon Gold 6330
GPU	NVIDIA Tesla T4
CUDA	11.7
Python	3.8
Framework	PyTorch 2.0.1

Table 7.

Experimental Hyperparameters.

Hyperparameter	Value
Batch size	64
Epoch	500
Initial learning decay	0.01
Weight decay	0.0005
Momentum decay	0.937
Learning rate	0.001
Training threshold of confidence	0.7

4.3. Evaluation Indicators

In this study, we adopt precision (P), recall (R), average precision (AP), mean average precision (mAP), and frames per second (FPS) as evaluation metrics because of their widespread use and practical relevance in industrial defect detection tasks. Precision and recall measure the model’s ability to correctly detect and localize defects while minimizing false alarms and missed detections, which is critical in high-risk industrial environments. Precision is the proportion of samples identified as positive by the model that are truly positive:

Precision = \frac{TP}{TP + FP}

(8)

Recall is the proportion of truly positive samples that the model correctly identifies:

Recall = \frac{TP}{TP + FN}

(9)

AP and mAP summarize detection accuracy across recall levels and across defect categories, respectively, enabling a balanced evaluation over multiple classes. AP is the mean precision over all recall levels, which corresponds to the area under the Precision-Recall (PR) curve. A higher AP value indicates a higher average precision of the model. It is computed as:

AP = \int_{0}^{1} Precision (Recall) d (Recall)

(10)

mAP is calculated by averaging the AP values in all categories. The AP metric reflects how accurately each category is predicted, whereas the mAP serves as a comprehensive indicator of model accuracy across all classes. This formula can be expressed as:

mAP = \frac{1}{N} \sum_{i = 1}^{N} {AP}_{i}

(11)

Here, TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively, and N is the number of defect categories; true negatives (TN) do not enter these formulas. FPS is included to assess the real-time capability of the model, which is essential for edge deployment scenarios where fast inference is required. Together, these metrics ensure a holistic evaluation of both the accuracy and efficiency of the proposed model in practical applications.
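As a minimal illustration of Eqs. (8) to (11), the following Python sketch computes precision, recall, AP, and mAP from detection outcomes. The function names and the toy inputs are illustrative assumptions and are not taken from the paper’s implementation; the integral in Eq. (10) is approximated here with the all-point interpolation that detection benchmarks commonly use, which may differ in detail from the evaluation code used in the experiments.

```python
from typing import List, Sequence, Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision and recall from raw counts, following Eqs. (8) and (9)."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


def average_precision(is_tp: Sequence[bool], num_gt: int) -> float:
    """Approximate Eq. (10) for a single class.

    `is_tp` lists the detections sorted by descending confidence; each flag
    says whether the detection matched a ground-truth box (for AP@.5, an IoU
    of at least 0.5). `num_gt` is the number of ground-truth boxes.
    """
    if num_gt == 0:
        return 0.0
    tp = fp = 0
    recalls: List[float] = [0.0]
    precisions: List[float] = [1.0]
    for flag in is_tp:
        if flag:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Replace the raw curve by its monotonically non-increasing envelope.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Accumulate the area under the envelope wherever recall increases.
    ap = 0.0
    for i in range(1, len(recalls)):
        ap += (recalls[i] - recalls[i - 1]) * precisions[i]
    return ap


def mean_average_precision(ap_per_class: Sequence[float]) -> float:
    """Eq. (11): the unweighted mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


# Toy example: four detections for one class, four ground-truth boxes.
toy_ap = average_precision([True, True, False, True], num_gt=4)
print(precision_recall(tp=8, fp=2, fn=2), toy_ap)
```

In the toy example, the single false positive lowers the envelope only for the last recall step, so the AP stays well above the raw precision at that point.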

4.4. Experiments on the PCB Board Dataset

The PCB Surface Defects Dataset is a synthetic, open-source dataset of 1386 images from Peking University. It covers six defect types: missing hole, mouse bite, open circuit, short, spur, and spurious copper. Each image contains several defects of the same type. Figure 6 shows samples of the various defects in the PCB dataset, with red boxes indicating their locations. The PCB dataset features a complex background and contains a large number of small targets. To address this, data augmentation techniques such as random rotation, flipping, and brightness adjustment were applied to improve the network’s generalizability, yielding a total of 2420 images. The images were split into training, validation, and test sets in an 8:1:1 ratio. During the experiments, images were resized to $640 \times 480$ as the model input. The distribution of each defect category is illustrated in Figure 7.
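The 8:1:1 split described above can be sketched as follows. This is only an illustration under stated assumptions: the helper `split_dataset`, the fixed random seed, and the image identifiers are hypothetical, and the paper does not specify how the shuffle was performed.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(
    items: Sequence[str],
    ratios: Tuple[int, int, int] = (8, 1, 1),
    seed: int = 0,
) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle `items` reproducibly and split them into train/val/test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_val = len(shuffled) * ratios[1] // total
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


# Hypothetical identifiers standing in for the 2420 augmented PCB images.
image_ids = [f"pcb_{i:04d}" for i in range(2420)]
train_ids, val_ids, test_ids = split_dataset(image_ids)
print(len(train_ids), len(val_ids), len(test_ids))  # 1936 242 242
```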

Figure 6.

Visualization of a PCB Dataset Containing Six Defect Categories.

Figure 7.

Distribution of the Number of Different Defect Categories in the Enhanced PCB Dataset.

4.4.1. Ablation Experiments for Different Modules

To verify the effectiveness of each module in DEMA-YOLO on small target defects, we conducted ablation experiments on the PCB dataset for three key components: DDEU, MAMDP, and the improved NWD loss. Table 8 reports the main performance indicators, including P, R, and mAP@.5. With the DDEU module, mAP@.5 rises from 87.8% to 92.5%, which can be attributed to the improved extraction of critical edge information during the reconstruction of the feature map for small target defects. With the MAMDP module, mAP@.5 reaches 93.1%; the module assigns importance weights to feature maps from several perspectives, which strengthens the model’s capacity to identify significant features. With the improved NWD loss, mAP@.5 reaches 93.6%, the highest value among the ablation settings. The original CIoU loss is highly sensitive to small bounding boxes, which causes considerable fluctuations and slow convergence during training and ultimately impairs the detection of small targets. The refined NWD loss mitigates the large gradient differences between defects of varying sizes and thus facilitates better convergence.
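The sensitivity of overlap-based losses to small boxes can be illustrated with a short Python sketch. It compares plain IoU (the overlap term underlying CIoU) with the standard normalized Wasserstein distance from the tiny-object detection literature, in which each box is modeled as a 2D Gaussian. This is not the improved NWD loss proposed in this paper, whose adaptive temperature coefficient and high-order terms are not reproduced here; the constant `c = 12.8` and the example boxes are placeholder assumptions.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (cx, cy, w, h) in pixels


def iou(a: Box, b: Box) -> float:
    """Plain intersection over union of two axis-aligned boxes."""
    ax1, ax2 = a[0] - a[2] / 2, a[0] + a[2] / 2
    ay1, ay2 = a[1] - a[3] / 2, a[1] + a[3] / 2
    bx1, bx2 = b[0] - b[2] / 2, b[0] + b[2] / 2
    by1, by2 = b[1] - b[3] / 2, b[1] + b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def nwd(a: Box, b: Box, c: float = 12.8) -> float:
    """Standard normalized Wasserstein distance between two boxes.

    Each box is treated as a Gaussian with mean (cx, cy) and standard
    deviations (w / 2, h / 2). `c` is a dataset-dependent constant; the
    default used here is only a placeholder.
    """
    w2_squared = (
        (a[0] - b[0]) ** 2
        + (a[1] - b[1]) ** 2
        + ((a[2] - b[2]) / 2) ** 2
        + ((a[3] - b[3]) / 2) ** 2
    )
    return math.exp(-math.sqrt(w2_squared) / c)


# The same 2-pixel horizontal shift applied to a small and a large box.
small_gt, small_pred = (10.0, 10.0, 6.0, 6.0), (12.0, 10.0, 6.0, 6.0)
large_gt, large_pred = (100.0, 100.0, 60.0, 60.0), (102.0, 100.0, 60.0, 60.0)
print(f"IoU: small={iou(small_gt, small_pred):.3f}, large={iou(large_gt, large_pred):.3f}")
print(f"NWD: small={nwd(small_gt, small_pred):.3f}, large={nwd(large_gt, large_pred):.3f}")
```

For this shift, IoU falls to 0.5 for the 6-pixel box but stays near 0.94 for the 60-pixel box, whereas the NWD value is identical for both, which is the scale-insensitive behavior that motivates an NWD-style loss for small defects.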

Table 8.
Ablation Study of Different Modules on the PCB Dataset.

Method	P (%)	R (%)	mAP@.5 (%)	Params (M)	FPS
YOLOv10s	89.5	84.1	87.8	8.1	112.0
$+$ DDEU	93.6	90.3	92.5	8.0	97.2
$+$ MAMDP	93.1	84.3	93.1	7.7	117.3
$+$ $NWD_{improved}$ loss	93.7	92.1	93.6	8.1	102.1

4.4.2. Comparison Experiments

To verify the effectiveness of the model, Table 9 compares the proposed DEMA-YOLO with several mainstream object detectors in terms of P, R, mAP, and FPS. The baselines include the two-stage models Faster R-CNN and Cascade R-CNN (Cai & Vasconcelos, 2018); the Transformer-based models EfficientViT (Xie & Liao, 2023), RT-DETR-R18 (Lv, 2023), and DETR (Carion, 2020), which integrate attention mechanisms; and the popular single-stage detectors YOLOv8s (Jocher et al., 2023), YOLOv9s (Wang et al., 2024), YOLOv11s (Jocher et al., 2023), and YOLOv10s. DEMA-YOLO outperforms the baseline YOLOv10s in precision, recall, and mAP, demonstrating superior detection performance. In addition, our model has the fewest parameters among the compared models. This reduction is attributed to the use of the lightweight YOLOv10 as the backbone network for feature extraction. We also designed the MAMDP structure, based on spatial and channel attention mechanisms, for feature fusion. This structure enhances multi-scale feature fusion while minimizing redundant calculations, further improving detection accuracy. Although DEMA-YOLO is slightly inferior to some models in recall, its efficiency offers significant advantages in resource-constrained environments.

Table 9.
Comparison of this Paper’s DEMA-YOLO Model with other Models on the PCB Dataset.

Model	P (%)	R (%)	mAP@.5 (%)	Params (M)	FPS
Faster R-CNN	85.6	84.1	84.7	138.4	10.1
Cascade R-CNN	88.2	89.7	89.5	71.1	46.0
EfficientViT	93.2	88.7	92.6	91.0	12.8
RT-DETR-R18	94.1	89.5	92.1	19.9	104.8
DETR	91.4	90.8	91.4	41	34.2
YOLOv8s	87.7	83.6	87.5	11.1	87.9
YOLOv9s	86.2	88.4	87.1	20.0	102.0
YOLOv11s	93.1	86.4	92.4	9.4	109.2
YOLOv10s	89.5	84.1	87.8	8.1	112.0
DEMA-YOLO(OURS)	94.4	87.3	93.9	7.8	119.2

4.4.3. The AP Score of Different Categories

Table 10 reports the performance of DEMA-YOLO on the six defect categories of the PCB dataset. The model achieves an AP@.5 above 90% in the first five classes, with the missing hole class reaching 98.3%. Spurious copper attains an AP@.5 of 88.6%, slightly below 90%, but overall DEMA-YOLO is effective at detecting small-target defects. Figure 8 shows the PR curves and confusion matrices of DEMA-YOLO and YOLOv10s, reflecting how the two models differ across defect types. For missing_hole and short, both P and R are high, and the PR curves remain stable at high values. For mouse_bite and spur, P and R are balanced but slightly lower, placing their curves in a moderate region. For open_circuit and spurious_copper, P is high but R is low, so the PR curves lean toward the high-P, low-R region. Overall, our model outperforms the baseline model in detecting these defects.

4.5. Experiments on the NEU-DET Steel Dataset

The publicly available NEU-DET steel surface defect dataset from Northeastern University contains 1800 images representing six types of defects: crazing, inclusion, patches, pitted_surface, rolled-in_scale, and scratches. The steel images were additionally rotated and brightness-adjusted to enhance the robustness of the model and better simulate real industrial defect scenarios. The category distribution of steel surface defects is imbalanced (non-IID), with inclusion and patches being the most numerous categories. The dataset was divided into training, validation, and test sets in a ratio of 8:1:1. During the experiments, images were resized to $640 \times 640$ as the model input. Figure 9 shows samples with various defects from the NEU-DET steel dataset, and Figure 10 shows the distribution of each defect category.

Figure 8.

PR Curves and Confusion Matrices of DEMA-YOLO and YOLOv10s on the PCB Dataset. (a) DEMA-YOLO, (b) YOLOv10s.

Figure 9.

Visualization of a NEU-DET Dataset Containing Six Defect Categories.

Figure 10.

Distribution of Six Types of Defects on the NEU-DET Dataset.

Table 10.

Per-category Results of the DEMA-YOLO Model on the PCB Dataset.

Defect	P (%)	R (%)	AP@.5 (%)
Missing_Hole	94.7	95.2	98.3
Mouse_Bite	89.6	87.1	93.8
Open_Circuit	96.1	81.6	91.0
Short	94.8	97.5	96.4
Spur	87.4	89.7	95.1
Spurious_Copper	96.5	80.0	88.6
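As a consistency check between the per-category and overall results, the mAP@.5 on the PCB dataset can be recomputed from the AP@.5 column of Table 10 using Eq. (11). The short Python sketch below does this; the dictionary simply transcribes the table, and the variable names are illustrative.

```python
# AP@.5 values (%) transcribed from Table 10 (PCB dataset).
ap_at_50 = {
    "Missing_Hole": 98.3,
    "Mouse_Bite": 93.8,
    "Open_Circuit": 91.0,
    "Short": 96.4,
    "Spur": 95.1,
    "Spurious_Copper": 88.6,
}

# Eq. (11): mAP is the unweighted mean of the per-class AP values.
map_at_50 = sum(ap_at_50.values()) / len(ap_at_50)
print(f"mAP@.5 = {map_at_50:.1f}%")  # mAP@.5 = 93.9%
```

The result agrees with the 93.9% reported for DEMA-YOLO in Table 9.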

4.5.1. Ablation Experiments for Different Modules

To verify how well each module in DEMA-YOLO handles defects in complex backgrounds, we conducted ablation experiments on the NEU-DET dataset for three key components: DDEU, MAMDP, and the improved NWD loss. Table 11 reports the main performance indicators, including P, R, FPS, and mAP@.5. The results indicate that the DDEU module yields the largest improvement, raising mAP@.5 from 89.6% to 90.4%. This enhancement can be attributed to the DDEU module’s role as an upsampling component, which preserves a substantial number of detailed features within the upsampled feature map. This preservation enables the model to locate defects accurately and to better separate defects that closely resemble the background. The DDEU module is therefore particularly robust in detection scenarios where the background and defects are highly similar.

Table 11.
Ablation Study of Different Modules on the NEU-DET Dataset.

Method	P (%)	R (%)	mAP@.5 (%)	Params (M)	FPS
YOLOv10s	88.7	86.1	89.6	8.1	121.4
$+$ DDEU	90.1	87.6	90.4	8.0	117.2
$+$ MAMDP	90.2	85.9	90.1	7.7	124.1
$+$ $NWD_{improved}$ loss	89.7	89.5	89.9	8.1	118.6

4.5.2. Comparison Experiments

Table 12 compares DEMA-YOLO with other object detectors on the NEU-DET dataset. Compared with detectors of similar size, DEMA-YOLO uses only 7.8 million parameters and outperforms YOLOv10s by 1.0%, 1.2%, and 0.9% in P, R, and mAP, respectively. It also achieves the highest mAP@.5 among the compared detectors.

Table 12.
Comparison of the Proposed DEMA-YOLO Model with other Models on the NEU-DET Dataset.

Model	P (%)	R (%)	mAP@.5 (%)	Params (M)	FPS
Faster R-CNN	87.9	84.7	86.3	138.4	8.6
Cascade R-CNN	75.6	73.8	76.9	71.1	45.2
EfficientViT	88.5	87.4	88.1	91.0	47.0
RT-DETR-R18	88.1	89.3	87.2	19.9	69.6
DETR	87.5	88.2	88.1	41	38.2
YOLOv8s	87.9	89.2	88.4	11.1	104.0
YOLOv9s	86.9	87.8	87.2	20.0	87.2
YOLOv11s	89.7	88.1	88.9	9.4	113.2
YOLOv10s	89.7	85.9	89.6	8.1	121.4
DEMA-YOLO(OURS)	90.7	87.2	90.5	7.8	122.0

4.5.3. The AP Score of Different Categories

Table 13 reports the per-category results of DEMA-YOLO on the six defect categories. The AP@.5 of the model exceeds 83% in all categories and exceeds 90% for crazing, rolled-in_scale, and scratches, which shows that DEMA-YOLO also achieves good results on defects in complex backgrounds. The PR curves and confusion matrices in Figure 11 show strong performance on crazing and rolled-in_scale, with P and R values of about 99%, indicating high detection accuracy with few false positives or false negatives. Inclusion and patches exhibit lower P and R, yet still achieve AP@.5 scores of 83.9% and 84.6%, respectively. Pitted_surface displays moderate performance, with a trade-off between P (87.4%) and R (80.3%) and an AP@.5 of 83.1%. Overall, compared with YOLOv10s, the model excels at identifying well-defined defects such as crazing and rolled-in_scale while still providing reliable results for the other defect types, even when the defects are very similar to the background.

4.6. Experiments on the Mixed-type WM38 Dataset

The mixed-type WM38 dataset comprises 3,798 images covering eight types of defects: donut, center, loc, edge_loc, edge_ring, near_full, scratch, and random, as shown in Figure 12. The dataset lacks real-world noise and precise boundary labels. To mitigate these limitations and enhance the model’s generalization, accuracy, and robustness, we expanded the dataset using rotation, flipping, brightness adjustment, and re-labeling, yielding a final dataset of 5,602 images. The training, validation, and test sets were then split in an 8:1:1 ratio. During the experiments, images were resized to $640 \times 640$ as the model input. Figure 13 illustrates the distribution of each category of defects.

Figure 11.

PR Curves and Confusion Matrices of DEMA-YOLO and YOLOv10s on the NEU-DET Dataset. (a) DEMA-YOLO, (b) YOLOv10s.

Figure 12.

Visualization of a Mixed Type WM38 Dataset Containing Eight Defect Categories.

Figure 13.

Distribution of Eight Types of Defects on the Mixed-Type WM38 Dataset.

Table 13.

Per-category Results of the DEMA-YOLO Model on the NEU-DET Dataset.

Defect	P (%)	R (%)	AP@.5 (%)
Crazing	99.8	99.1	99.5
Inclusion	81.9	74.9	83.9
Patches	82.7	77.2	84.6
Pitted_Surface	87.4	80.3	83.1
Rolled-in_Scale	99.1	99.5	99.5
Scratches	88.1	86.2	92.7

4.6.1. Ablation Experiments for Different Modules

Ablation experiments were also performed on the mixed-type WM38 dataset; the results are shown in Table 14. Every ablation setting keeps P and AP@.5 within 0.3 percentage points of the YOLOv10s baseline, so accuracy is essentially unchanged. The MAMDP setting in particular reduces the parameter count to 7.7M and raises the inference speed to 147.1 FPS, whereas the DDEU and improved NWD loss settings run slower than the baseline. With all three modules combined, DEMA-YOLO matches the accuracy of YOLOv10s while running faster (134.2 versus 128.4 FPS, Table 15), which shows that the model also performs well on mixed complex defects. In the context of mixed industrial defect detection, the synergy of the three modules thus gives the best balance of accuracy and speed.

Table 14.
Ablation Study of Different Modules on the Mixed Type WM38 Dataset.

Method	P (%)	R (%)	AP@.5 (%)	Params (M)	FPS
YOLOv10s	98.4	95.5	98.7	8.1	128.4
$+$ DDEU	98.5	94.6	98.5	8.0	114.7
$+$ MAMDP	98.7	95.3	98.4	7.7	147.1
$+$ $NWD_{improved}$ loss	98.6	96.6	98.7	8.1	104.8

4.6.2. Comparison Experiments

Table 15 compares the proposed DEMA-YOLO with several mainstream object detectors, including Faster R-CNN, Cascade R-CNN, EfficientViT, RT-DETR-R18, DETR, YOLOv8s, YOLOv9s, YOLOv11s, and YOLOv10s. DEMA-YOLO achieves an mAP of 98.7%, matching YOLOv10s, while reducing the number of parameters from 8.1 million to 7.8 million and raising the inference speed from 128.4 to 134.2 FPS. These results confirm the effectiveness of the design and implementation of DEMA-YOLO.

Table 15.
Comparison of the Proposed DEMA-YOLO Model with other Models on the Mixed Type WM38 Dataset.

Model	P (%)	R (%)	mAP@.5 (%)	Params (M)	FPS
Faster R-CNN	88.4	89.7	92.1	138.4	6.5
Cascade R-CNN	94.7	93	94.3	71.1	57.0
EfficientViT	93.2	94.6	94.9	91.0	32.4
RT-DETR-R18	94.0	94.8	95.3	19.9	102.1
DETR	95.2	93.1	96.1	41	45.8
YOLOv8s	93.8	91.1	93.2	11.1	67.0
YOLOv9s	97.3	93.9	97.5	20.0	92.3
YOLOv11s	98.4	95.1	98.3	9.4	119.7
YOLOv10s	98.4	95.5	98.7	8.1	128.4
DEMA-YOLO(OURS)	99.0	96.0	98.7	7.8	134.2

Table 16.

Per-category Results of the DEMA-YOLO Model on the Mixed Type WM38 Dataset.

Defect	P (%)	R (%)	AP@.5 (%)
Center	99.5	98.3	99.5
Donut	99.1	99.2	99.5
Edge_Loc	97.0	80.5	95.1
Edge_Ring	99.2	96.6	99.4
Loc	97.7	96.1	99.1
Near_Full	96.2	99.6	99.5
Scratch	98.3	96.9	98.2
Random	98.0	99.7	99.5

4.6.3. The AP Score of Different Categories

Table 16 shows that the model achieves its highest detection performance for center, donut, near_full, and random defects, each with an AP@.5 of 99.5%. In contrast, the recall rates for edge_loc and loc defects are relatively low. Nevertheless, the model maintains strong performance, with an AP@.5 above 95% in every category.

Figure 14 shows the PR curves and confusion matrices for the various types of defects detected by DEMA-YOLO and YOLOv10s on the mixed-type WM38 dataset. The curves show that the model achieves a high detection rate for most types of defects. The detection rates for edge_loc and scratch defects are lower than those for the other defect types, but the overall detection rate remains above 95%.

Figure 14.

PR Curves and Confusion Matrices of DEMA-YOLO and YOLOv10s on the Mixed Type WM38 Dataset. (a) DEMA-YOLO, (b) YOLOv10s.

Figure 15.

Visualization Showing Different Models on the PCB Dataset. (a) Represents DEMA-YOLO, (b) Represents YOLOv11s, (c) Represents YOLOv10s and (d) Represents YOLOv9s.

Figure 16.

Visualization Showing Different Models on the NEU-DET Dataset. (a) Represents DEMA-YOLO, (b) Represents YOLOv11s, (c) Represents YOLOv10s and (d) Represents YOLOv9s.

Figure 17.

Visualization Showing Different Models on Mixed Type WM38 Dataset. (a) Represents DEMA-YOLO, (b) Represents YOLOv11s, (c) Represents YOLOv10s and (d) Represents YOLOv9s.

4.7. Visualization Studies

Figure 15 illustrates the inference results of DEMA-YOLO, YOLOv11s, YOLOv10s, and YOLOv9s on the PCB dataset. Our method detects small targets accurately, as evidenced by the precise bounding-box positions and the high confidence of the category predictions, whereas the other models either miss or misdetect some objects. Figure 16 shows the inference results on the NEU-DET dataset. Despite significant differences in defect sizes and high similarity between defects and background, our model performs well, although slightly worse than on the other two datasets, and it produces no false positives or missed detections in the visualized samples. In Figure 17, our model performs well on the mixed-type WM38 dataset, which shows that it can also handle mixed-type industrial defects. Overall, our model outperforms most of the compared models on all three datasets.

4.8. Feasibility in Real-World Deployment

The proposed DEMA-YOLO model is optimized not only for high detection performance but also for practical demands in real-world deployment. Featuring a compact architecture with only 7.8 million parameters, the model significantly reduces memory consumption and computational load. This lightweight design facilitates deployment on resource-limited devices, such as embedded systems and industrial edge computing platforms, without necessitating powerful GPUs or servers. In terms of inference efficiency, the model achieves an average speed of 125.1 FPS on an NVIDIA Tesla T4 GPU across the PCB, NEU-DET, and Mixed type WM38 datasets. This high-speed performance meets the real-time requirements of most industrial surface defect detection tasks, particularly in high-throughput manufacturing environments where timely feedback is critical for quality control and process optimization. Furthermore, the model’s strong performance across three diverse datasets demonstrates its robustness and generalization ability in various defect scenarios, enhancing its practical applicability. Compared to traditional or heavier deep learning models, DEMA-YOLO strikes a favourable balance between detection accuracy, model complexity, and inference speed. These advantages confirm that DEMA-YOLO is well-suited for deployment in real-world industrial inspection systems, especially those constrained by hardware resources, latency requirements, or energy efficiency concerns.
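The average speed quoted above follows directly from the per-dataset FPS values in Tables 9, 12, and 15. The Python sketch below reproduces that average and converts each FPS value into a per-image latency; it assumes that latency is simply the reciprocal of throughput (one image at a time, no pipelining), which the paper does not state explicitly, and the variable names are illustrative.

```python
# DEMA-YOLO inference speed (FPS) as reported in Tables 9, 12, and 15.
fps_by_dataset = {
    "PCB": 119.2,
    "NEU-DET": 122.0,
    "Mixed-type WM38": 134.2,
}

# Unweighted mean across the three datasets.
mean_fps = sum(fps_by_dataset.values()) / len(fps_by_dataset)

# Assumed per-image latency in milliseconds: 1000 / FPS.
latency_ms = {name: 1000.0 / fps for name, fps in fps_by_dataset.items()}

print(f"mean FPS = {mean_fps:.1f}")  # mean FPS = 125.1
for name, ms in latency_ms.items():
    print(f"{name}: {ms:.2f} ms per image")
```

Under this assumption, each image takes roughly 7.5 to 8.4 ms on the Tesla T4, which gives a concrete budget to compare against the cycle time of a given production line.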

5. Discussion: Challenges in Model Development and Deployment

During the development of DEMA-YOLO, several key challenges were encountered. Firstly, severe class imbalance and label noise in industrial defect datasets hindered stable training and generalization. These issues necessitated the implementation of targeted data augmentation strategies and class-reweighting techniques. Secondly, the model encountered difficulties in preserving fine-grained features during upsampling and achieving efficient multi-scale representation, which were addressed through the design of the DDEU and MAMDP modules. Although deployment was not included in this study, future work will focus on real-world implementation to evaluate the robustness and efficiency of the model under practical industrial conditions.

6. Conclusion

To balance the efficiency and accuracy of industrial defect detection, we propose an efficient detection model, DEMA-YOLO, based on YOLOv10, which is suitable for resource-constrained environments. Owing to the use of YOLOv10 as the backbone network, the model has only 7.8M parameters. We designed the DDEU module, which addresses the loss of detail during upsampling by exchanging critical information about industrial defect edges at various scales. We also propose the MAMDP module to enhance the model’s feature extraction and fusion capabilities. Finally, we introduce an NWD loss augmented with adaptive temperature coefficients and high-order terms, which improves the model’s stability in small-target bounding-box regression and mitigates large fluctuations during training.

Experimental results show that DEMA-YOLO has achieved a good balance between detection efficiency and accuracy and is suitable for resource-constrained industrial defect detection scenarios.

Footnotes

ORCID iDs

Jiajin Zhong

Hongcheng Wang

Funding

This work was supported by Dongguan Science and Technology of Social Development Program under Grant 20231800940532, and Songshan Lake Sci-Tech Commissioner Program under Grant 20234373-01KCJ-G.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

Beijing University. PKU-market-PCB dataset. Retrieved November 1, 2024, from https://robotics.pkusz.edu.cn/resources/dataset

Cai

Vasconcelos

(2018). Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6154–6162).

Carion

, et al. (2020). End-to-end object detection with transformers. arXiv preprint arXiv:2005.12872. https://arxiv.org/abs/2005.12872.

Damira

Vykintas

Elena

(2024). Machine learning-based approach for automatic defect detection and classification in adhesive joints. NDT & E International, 148, 103221.

Dehua

, et al. (2023). An efficient lightweight convolutional neural network for industrial surface defect detection. Artificial Intelligence Review, 56, 10651–10677.

Feifan

Haigang

Jinfeng

, et al. (2024). YOLOv7-siamFF: Industrial defect detection algorithm based on improved YOLOv7. Computers & Electrical Engineering, 114, 109090.

Haigang

Ronghui

Fengjun

, et al. (2023). Zero-DD: Zero-sample defect detection for industrial products. Computers & Electrical Engineering, 105, 108516.

Hassan

S. A.

Beliatis

M. J.

Radziwon

, et al. (2024). Textile fabric defect detection using enhanced deep convolutional neural network with safe human–robot collaborative interaction. Electron, 13, 4314.

Gkioxari

Dollar

, et al. (2017). Mask R-CNN. In Proceedings of IEEE International Conference on Computer Vision (ICCV) (pp. 2961–2969).

10.

Hongxin

Xiaoxin

Hao

, et al. (2024). MD-YOLO: Surface defect detector for industrial complex environments. Optics and Lasers in Engineering, 178, 108170.

11.

Jingni

Shujuan

Yan

(2024). A real-time surface defects detection model via dual-branch feature extraction and dynamic multi-scale fusion attention. Digital Signal Processing, 152, 104582.

12.

Jocher

Chaurasia

Qiu

(2023). Ultralytics YOLO, Version 8.0.0. AGPL-3.0 License. https://github.com/ultralytics/ultralytics.

13.

Jocher

Qiu

Chaurasia

(2023). Ultralytics YOLO. Version 8.0.0. https://github.com/ultralytics/ultralytics.

14.

Kim

Tak

Shin

(2024). A deep learning model for wafer defect map classification: Perspective on classification performance and computational volume. Physica Status Solidi B, 70, 2300113.

15.

Xuefeng

Feng

, et al. (2024). Surface defect detection of industrial components based on improved YOLOv5s. IEEE Sensors Journal, 24, 23940–23950.

16.

Lin

T. H.

Chang

C. T.

Putranto

(2024). Tiny machine learning empowers climbing inspection robots for real-time multiobject bolt-defect detection. Engineering Applications of Artificial Intelligence, 133(Part F), 108618. https://doi.org/10.1016/j.engappai.2024.108618

17.

Liu

Sun

, et al. (2025). SFMW-YOLO: A lightweight metal casting surface defect detection method based on modified YOLOv8s. Expert Systems with Applications, 287, 128170.

18.

Liyun

Boyu

Hong

, et al. (2020). Improved faster R-CNN algorithm for defect detection in powertrain assembly line. Procedia CIRP, 93, 479–484.

19.

, et al. (2023). DETRs beat YOLOs on real-time object detection. arXiv preprint arXiv:2304.08069.

20.

Chen

Feng

, et al. (2025). ELA-YOLO: An efficient method with linear attention for steel surface defect detection during manufacturing. Advanced Engineering Informatics, 65, 103377.

21.

Ming

Wangqi

Ying

, et al. (2024). WSS-YOLO: An improved industrial defect detection network for steel surface defects. Meas, 236, 115060.

22.

Nagata

Miki

Otsuka

, et al. (2020). Pick-and-place robot using visual feedback control and transfer learning-based CNN. In proceedings of the 2020 IEEE International Conference on Mechatronics and Automation (ICMA) (pp. 850–855).

23.

Northeast University. Faculty resources. Retrieved November 1, 2024, from http://faculty.neu.edu.cn/songkechen/zh-CN/zhym/263269/list/index.html

24.

Pang

Lian

Dong

, et al. (2025). ASSM-YOLO: A lightweight detection model with adaptive deep feature enhancement for cathode copper surface defect recognition. Measurement, 255, 118061. https://doi.org/10.1016/j.measurement.2025.118061

25.

Ping

, et al. (2024). Enhanced detection of glass insulator defects using improved generative modeling and faster RCNN. Procedia CIRP, 129, 31–36.

26.

Ren

Girshick

, et al. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the International Conference on Neural Information Processing Systems (pp. 91–99).

27.

Rezatofighi

Tsoi

Gwak

, et al. (2019). Generalized Intersection over Union: A metric and a loss for bounding box regression. arXiv preprint arXiv:1902.09630. https://arxiv.org/abs/1902.09630.

28.

ShiLong

Gang

MingLe

, et al. (2023). ICA-Net: Industrial defect detection network based on convolutional attention guidance and aggregation of multiscale features.

29.

Shuhong

Jiaxin

Mutian

, et al. (2023). Wheel hub defect detection based on the DS-cascade RCNN. Meas, 206, 112208.

30.

Shuwen

Minqi

Zhenyu

, et al. (2023). An industrial interference-resistant gear defect detection method through improved YOLOv5 network using attention mechanism and feature fusion. Meas, 221, 113433.

31.

Sun

Geng

Guo

, et al. (2024). A strip steel surface defect salient object detection based on channel, spatial and self-attention mechanisms. Electron, 13, 4277.

32.

Tao

Zheng

Wang

, et al. (2024). Enhanced feature extraction YOLO industrial small object detection algorithm based on receptive-field attention and multi-scale features. Measurement Science & Technology 35, (10). https://doi.org/10.1088/1361-6501/ad633d

33.

Wang

, et al. (2024). YOLOv10: Real-time end-to-end object detection. arXiv preprint arXiv:2405.14458.

34.

Wang

C. Y.

Yeh

I. H.

Liao

H. Y. M.

(2024). YOLOv9: Learning what you want to learn using programmable gradient information. arXiv preprint arXiv:2402.13616.

35.

Wang

, et al. (2020). Deformable convolutional networks for efficient mixed-type wafer defect pattern recognition. IEEE Transactions on Semiconductor Manufacturing, 33, 587–596.

36. Xie & Liao (2023). Efficient-ViT: A light-weight classification model based on CNN and ViT. In Proceedings of the 2023 6th International Conference on Image and Graphics Processing (pp. 64–70).

37. Yuanyuan, Jialong, S. K., et al. (2024). YOLO-RLC: An advanced target-detection algorithm for surface defects of printed circuit boards based on YOLOv5. Computers, Materials & Continua, 70, 4973–4995.

38. Yuxi, Xiang, & Shiyan (2024). Machine vision-based detection of surface defects in cylindrical battery cases. Journal of Energy Storage, 101, 113949.

39. Zekai, Mingle, Honglin, et al. (2023). IDD-net: Industrial defect detection method based on deep-learning. Engineering Applications of Artificial Intelligence, 123, 106390.

40. Zhang, Zou, Wang, et al. (2023). Lightweight neural network-based real-time PCB defect detection system. In Proceedings of the 2023 CAA Symposium on Fault Detection, Supervision and Safety for Technical Processes (SAFEPROCESS) (pp. 1–6).

41. Zhou & Zhao (2025). MPA-YOLO: Steel surface defect detection based on improved YOLOv8 framework. Pattern Recognition, 168, 111897.

42. Zhuxi, Yibo, Minghui, et al. (2023). Online visual end-to-end detection monitoring on surface defect of aluminum strip under the industrial few-shot condition. Journal of Manufacturing Systems, 70, 31–47.

Table 6. Experimental Platform.

| Platform | Name |
| --- | --- |
| CPU | Intel Xeon Gold 6330 |
| GPU | NVIDIA Tesla T4 |
| CUDA | 11.7 |
| Python | 3.8 |
| Framework | PyTorch 2.0.1 |

Table 8. Ablation Study of Different Modules on the PCB Dataset.

| Method | P (%) | R (%) | mAP@.5 (%) | Params (M) | FPS |
| --- | --- | --- | --- | --- | --- |
| YOLOv10s | 89.5 | 84.1 | 87.8 | 8.1 | 112.0 |
| + DDEU | 93.6 | 90.3 | 92.5 | 8.0 | 97.2 |
| + MAMDP | 93.1 | 84.3 | 93.1 | 7.7 | 117.3 |
| + $NWD_{improved}$ loss | 93.7 | 92.1 | 93.6 | 8.1 | 102.1 |

Table 11. Ablation Study of Different Modules on the NEU-DET Dataset.

| Method | P (%) | R (%) | mAP@.5 (%) | Params (M) | FPS |
| --- | --- | --- | --- | --- | --- |
| YOLOv10s | 88.7 | 86.1 | 89.6 | 8.1 | 121.4 |
| + DDEU | 90.1 | 87.6 | 90.4 | 8.0 | 117.2 |
| + MAMDP | 90.2 | 85.9 | 90.1 | 7.7 | 124.1 |
| + $NWD_{improved}$ loss | 89.7 | 89.5 | 89.9 | 8.1 | 118.6 |

Table 14. Ablation Study of Different Modules on the Mixed-type WM38 Dataset.

| Method | P (%) | R (%) | AP@.5 (%) | Params (M) | FPS |
| --- | --- | --- | --- | --- | --- |
| YOLOv10s | 98.4 | 95.5 | 98.7 | 8.1 | 128.4 |
| + DDEU | 98.5 | 94.6 | 98.5 | 8.0 | 114.7 |
| + MAMDP | 98.7 | 95.3 | 98.4 | 7.7 | 147.1 |
| + $NWD_{improved}$ loss | 98.6 | 96.6 | 98.7 | 8.1 | 104.8 |