Semi-supervised YOLO-DEP for high-resolution X-ray component localization and counting

Abstract

Accurate localization and counting of tiny electronic components in high-resolution X-ray images is a critical yet challenging task in nuclear science, radiation imaging, and industrial quality control. Traditional methods generalize poorly in cluttered scenes, while deep learning approaches are limited by the lack of large-scale annotated datasets. This study aims to develop a semi-supervised detection framework that achieves high-precision component localization and counting in 3072 $\times$ 3072-pixel X-ray images while significantly reducing manual annotation costs. We propose YOLO-DEP, a novel object detector that integrates the YOLOv11 architecture with a Deep Encoding Processor (DEP) and a Graph Attention Network (GAT). The DEP module enhances feature discrimination for dense and small targets via half-channel and spatial attention mechanisms. A semi-supervised label propagation strategy generates high-confidence pseudo-labels from only one labeled image per category, leveraging feature similarity graphs and GAT-based confidence filtering. We also introduce LEEC, a large-scale X-ray dataset for electronic component counting, containing 720 images across 49 component types. YOLO-DEP outperforms state-of-the-art detectors on both the LEEC and DOTAv1 datasets. Specifically, YOLO-DEP-x achieves 79.2% mAP50 and 70.9% mAP50–95 on LEEC, with a counting error rate as low as 0.8‰, offering a practical solution for industrial automation, nuclear electronics inspection, fuel-assembly verification, and broader radiation-based quality-control lines.

Keywords

X-ray imaging; object counting; semi-supervised learning; convolutional neural network

Introduction

Industrial electronics manufacturing often requires fast and accurate counting of small components for inventory management, assembly verification, and quality control. Ensuring each product has the correct number is essential to prevent assembly errors or inventory discrepancies. Traditionally, this task has been performed by human workers or simple automated counters. Still, manual counting is time-consuming and prone to errors, especially when dealing with thousands of tiny components daily. Classical machine vision algorithms have been applied to automate counting, often using techniques like thresholding, contour detection, and template matching to identify parts in images. Leelawattananon et al.¹ used differential Gaussian edge detection technology for counting, Kumar et al.² used a minimum distance classifier to track objects, Perera et al.³ optimized the feature space and modeled it through Principal Component Analysis (PCA), and Wu et al.⁴ improved counting accuracy through image denoising. However, these algorithmic approaches typically struggled with cluttered scenes, changing lighting or component orientation, and overlapping parts, leaving a need for a more robust and efficient counting solution.

Beyond generic object detectors, a parallel line of research adapts such architectures to X-ray non-destructive testing. Early systems adopted dual-energy thresholding⁵ and morphological blob analysis⁶ for component presence checks, while recent CNNs exploit segmentation^7,8 or oriented bounding-box regression^9,10 for solder-joint and wire-bond inspection. Xie et al.¹¹ employed a convolutional neural network to perform macroscopic counting regression on synchrotron Kirkpatrick–Baez mirrors, which highlights the growing trend of using CNNs for dense counting. To correct spectral peak drift and obtain reliable net counts, Yang et al.¹² introduced a data-driven CNN-LSTM model. CNNs have also been adopted for macroscopic counting, such as automatic $\alpha$-track enumeration on imaging plates¹³ and dense neutron-radiograph quality scoring.¹⁴ Kang et al.¹⁵ proposed a multidimensional X-ray image sorting algorithm based on CdZnTe photon counting detectors, which accurately counted and classified coal and gangue through deep learning, greatly improving the accuracy of industrial sorting. Wei et al.¹⁶ used convolutional neural networks to automatically detect and identify dangerous objects in complex security inspection X-ray images, effectively improving detection and classification performance. Fauzi et al.¹⁷ compared the performance of multiple convolutional neural network models in distinguishing primary and metastatic tumors in bone tumor MRI sequences, verifying the superiority of deep learning in medical image classification tasks. These methods, however, rely on dense, clean annotations and struggle when targets are densely stacked.

In recent years, advances in computer vision and machine learning have led to significant improvements in object detection performance, opening new possibilities for automated counting. Convolutional Neural Networks (CNNs)^18,19 can learn rich feature representations and have revolutionized visual recognition tasks, outperforming earlier hand-crafted feature methods. de Arruda et al.²⁰ combined a CNN with feature map enhancement, Onoro-Rubio et al.²¹ designed a CNN for instance counting, Kilic et al.²² proposed a heatmap learner CNN, and Gao et al.²³ combined AdaBoost and a CNN for crowd counting. Two-stage detectors like Faster R-CNN^24,25 first generate region proposals and then classify them, achieving strong accuracy at the cost of speed.

In contrast, single-stage detectors such as Single Shot MultiBox Detector (SSD) and You Only Look Once (YOLO) directly predict object bounding boxes and classes in one pass, enabling real-time detection. SSD²⁶ enhanced the detection ability of small objects by detecting on feature maps of different scales. EfficientDet²⁷ optimized model performance at multiple scales through the compound scaling method. YOLOv1²⁸ achieved end-to-end object detection through a single network structure, while YOLOv2²⁹ further improved performance by introducing batch normalization and high-resolution classifiers. YOLOv3³⁰ introduced multi-scale prediction and a deeper network structure. YOLOv4³¹ achieved the optimal balance between speed and accuracy by integrating multiple technical improvements such as weight decay and CIOU loss. YOLOv5³² was completely based on the PyTorch framework and introduced auto anchor and multi-scale pre-trained models. YOLOv6³³ was developed by the Meituan technical team and emphasizes high performance in industrial applications. YOLOv7, YOLOv8, YOLOv9, and YOLOv10^34–37 continued to innovate in feature fusion, model architecture, and detection speed, demonstrating unparalleled performance in terms of speed and accuracy.

The task of counting objects in images has significant applications in various fields, including agriculture, industry, and public safety. In recent years, deep learning-based methods have achieved remarkable success in object detection and counting tasks, particularly in the domain of crowd counting. For example, Khan et al.³⁸ provided a comprehensive review of state-of-the-art methods in crowd counting, highlighting the effectiveness of convolutional neural networks (CNNs)^39,40 and density map estimation techniques.⁴¹ Similarly, Fan et al.⁴² surveyed CNN-based approaches for crowd counting and density estimation, emphasizing the importance of multi-scale feature extraction and regression methods. These techniques have also been successfully applied to other counting tasks. For instance, Huang et al.⁴³ surveyed deep learning-based object detection methods in crop counting, demonstrating how CNNs can handle occlusion and scale variation in agricultural settings. Gao et al.⁴⁴ proposed a method called SwinCounter for multi-class object counting in remote sensing images, leveraging transformer-based architectures to address challenges in aerial imagery. Additionally, Wang et al.⁴⁵ introduced the NWPU-Crowd dataset, a large-scale benchmark for crowd counting and localization, which has further advanced research in this area. Despite these advancements, industrial applications such as electronic component counting remain challenging due to densely packed and overlapping objects. This paper explores the adaptation of deep learning-based methods from crowd counting to electronic component counting. By integrating techniques such as density map estimation, multi-scale feature extraction, and transformer architectures, we aim to address the unique demands of industrial component counting. 
This work builds on methodologies developed for crowd counting⁴⁶ and applies them to analogous problems in electronic component detection and counting, as highlighted by Holla et al.⁴⁷ and Wang et al.⁴⁸.

Recently, transformer-based object detection models such as DETR⁴⁹ introduced an alternative architecture that uses an encoder–decoder transformer to predict objects without traditional anchor boxes or post-processing. DETR and subsequent transformer models such as Deformable DETR⁵⁰ and Dynamic DETR⁵¹ offer a new way to capture global context and relationships between objects, which could be beneficial for counting many similar items. Zhang et al.⁵² used an encoder-decoder CNN to generate high-quality density maps for accurate counting in X-ray images. These deep learning approaches represent a major improvement over earlier algorithmic methods.

Despite these advances, directly applying general-purpose detectors to industrial component counting still presents challenges. Electronic components on production lines or PCBs are often tiny, closely spaced, and visually similar,⁵³ and thus general object detection models not specifically trained on such data can miss small, densely packed items or confuse one type for another, resulting in inaccurate counts. Moreover, many existing deep models require large annotated datasets for training, but public datasets for component detection or counting are limited in the electronic manufacturing domain. To mitigate annotation cost, weakly-supervised counting^54–56 and semi-supervised detection^57–60 have been explored in natural scenes. In the X-ray domain, unsupervised domain adaptation⁶¹ alleviates cross-machine drift but rarely addresses label scarcity for counting tasks. Recent studies begin to fill this gap: Andriiashen et al.⁶² bridge simulation-to-reality gaps with scattering-calibrated synthetic data, and Jia et al.⁶³ adopt adversarial domain adaptation to segment PCB layers across dose regimes without extra target annotations.

To address these challenges, this work introduces a dedicated dataset and a specialized detection model for electronic component counting. We present the LEEC dataset, a new collection of annotated images specifically designed for learning to detect and count electronic components in industrial scenarios. The LEEC dataset contains a wide variety of components (e.g., different types of chips, capacitors, resistors, and connectors) captured under diverse conditions, including varying backgrounds, lighting, and degrees of occlusion. To mitigate the high cost of manual labeling, we adopt a semi-supervised approach that leverages a feature similarity graph and a GAT to propagate labels. This method requires only one fully labeled image per category to generate reliable pseudo-labels, significantly reducing annotation effort while expanding the training dataset.

Using insights gained from the dataset, we also develop YOLO-DEP, a deep neural network tailored for accurate, real-time counting of electronic components. YOLO-DEP builds upon the YOLOv11⁶⁴ framework and introduces domain-specific enhancements to improve the detection of small, closely packed objects. Through these modifications, YOLO-DEP achieves higher precision in identifying each component. We evaluate our proposed approach on the LEEC dataset and compare it against several state-of-the-art detectors. Experimental results show that YOLO-DEP significantly outperforms standard YOLO models and other baseline methods in counting accuracy and detection precision, while operating at speeds suitable for real-time deployment on assembly lines. In summary, our contributions in this paper include:

Novel detection and counting framework: This paper introduces YOLO-DEP, an advanced object detection and counting model specifically designed for X-ray imaging of electronic components. By integrating deep encoding processors and multi-scale attention mechanisms, YOLO-DEP significantly enhances feature extraction capabilities, particularly for densely packed, small electronic components, thereby improving counting accuracy and localization robustness regardless of component orientations.

Large-scale dataset with semi-supervised labeling: The paper constructs and introduces the LEEC dataset, which includes 720 high-resolution X-ray images (3072 $\times$ 3072 pixels) across 49 distinct component categories. To alleviate the substantial manual labeling costs associated with large-scale datasets, a semi-supervised labeling approach is proposed, which employs a GAT combined with feature similarity graphs to propagate labels effectively, requiring one labeled image per category to generate reliable pseudo-labels, substantially expanding the available training data.

Superior experimental performance: Extensive evaluations demonstrate that YOLO-DEP surpasses existing state-of-the-art detection methods on both the LEEC and the DOTAv1 OBB dataset. Ablation studies further validate the effectiveness and necessity of each module integrated into YOLO-DEP. Results confirm that the proposed method achieves high accuracy in detection and counting tasks, as well as practical real-time efficiency suitable for industrial deployment on moderate computational hardware.

Methodology

YOLO-DEP framework

Figure 1 shows the detailed structure of YOLO-DEP, which builds on the mature YOLOv11 framework. Unlike optical images, X-ray images are formed by radiation that penetrates the surface of an object, so they reveal its internal features. As a counting model tailored for X-ray images, YOLO-DEP is built around the DEP module, which replaces the conventional self-attention mechanism in YOLOv11, as illustrated in Figure 2. This module divides the input feature map into multiple slices (analogous to tomographic slices in X-ray imaging), each of which processes part of the channel information, thereby improving the focus and efficiency of information processing.

Figure 1.

The overall structure of YOLO-DEP.

Figure 2.

The structure of the deep encoding processor module (DEP).

The DEP module uses a half-channel attention mechanism (HCA), which applies the attention layer to only half of the input channels while the other half remain unchanged. This design enables the attention mechanism to attend to both the object’s external contours and its internal structure in X-ray images, so it captures features and localizes targets more accurately. Because attention is applied to only half of the channels, computational overhead is reduced, and the projection layer then propagates the refined attention features across all channels, further improving feature representation. In addition, the spatial attention mechanism enables the DEP module to capture feature information closely related to the spatial position of the target, which is crucial for accurate localization. The DEP module fuses the outputs of the two attention mechanisms through channel shuffle. This step not only improves feature expression but also distributes features evenly along the channel dimension by rearranging channels, improving information utilization efficiency.

To further enhance global feature interaction, YOLO-DEP replaces the original SPPF module with the Structural Adaptive Relative Interaction (SARI) module. As illustrated in Figure 3, SARI is a spatially aware and computationally efficient variant of AIFI that employs spatial reduction and relative position bias to facilitate long-range dependency modeling in high-resolution feature maps. With these changes, YOLO-DEP retains the core design of the YOLOv11 framework while extending it with mechanisms for high-precision detection and counting of objects in X-ray images. This is particularly important in applications where the internal structure of an object matters as much as its external form, such as security inspection, medical diagnosis, and industrial quality control.

Figure 3.

The structure of the Structural Adaptive Relative Interaction module (SARI).

Feature slice & split

Given a feature block $X$ with dimensions $H \times W \times C$, where $H$ is the height, $W$ is the width, and $C$ is the number of channels, we evenly divide $X$ into $K$ feature slices along the channel dimension. Each slice is denoted as $X_{n}$, where $n \in \{0, 1, 2, \dots, K\}$. For each feature slice $X_{n}$, instead of further dividing it into channel and spatial feature slices, we allocate parts of the $\frac{C}{K}$-sized slice based on the slice’s index $n$:

Allocate $\frac{C}{K} \times \frac{n}{K} = \frac{C \times n}{K^{2}}$ to the channel features $X_{n c}$ .

Allocate $\frac{C}{K} \times \frac{K - n}{K} = \frac{C \times (K - n)}{K^{2}}$ to the spatial features $X_{n s}$ .
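As a worked illustration of this allocation rule, the following minimal Python sketch computes the two widths for a given slice. The function name `slice_allocation` and the example values $C = 256$, $K = 4$ are illustrative assumptions, not settings reported in this paper.

```python
def slice_allocation(C, K, n):
    """Return (channel_width, spatial_width) for slice index n.

    Follows the allocation rule in the text: a slice of width C/K gives
    the fraction n/K to channel features and (K - n)/K to spatial features.
    """
    slice_width = C / K
    channel_width = slice_width * n / K        # C * n / K^2
    spatial_width = slice_width * (K - n) / K  # C * (K - n) / K^2
    return channel_width, spatial_width


# Illustrative values only (not taken from the paper).
for n in range(1, 4):
    print(n, slice_allocation(256, 4, n))
```

With these assumed values each slice holds 64 channels; slice $n = 1$ assigns 16 of them to the channel branch and 48 to the spatial branch, and the two widths always sum to $\frac{C}{K}$.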

Half channel attention (HCA)

The half channel attention module offers an efficient approach to modeling channel-wise dependencies in neural networks. It begins by taking a feature slice of dimension $\frac{C}{2 K} \times H \times W$ and splitting it equally along the channel dimension into two branches. The first branch preserves the original features through an identity mapping. The second branch, containing $\frac{C}{4 K}$ channels, is activated by a GELU function for nonlinearity, and then reshaped into Queries (Q), Keys (K), and Values (V) for the Self Attention mechanism, which computes attention scores by taking the dot product of Q and K, followed by a softmax operation to produce attention weights. These weights are applied to V, generating a weighted sum that represents the channel attention. The attention-weighted output and the identity branch are then concatenated and passed through a projection layer, restoring the channel dimensionality to $\frac{C}{2 K}$ .

The HCA module’s final output is a feature map that retains the input’s dimensionality while enhancing channel-wise dependencies, providing a refined representation for deep learning architectures. This concise process makes the HCA module a lightweight and effective choice for attention mechanisms.
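The data flow described above can be sketched in plain Python on a toy input. This is a simplified illustration under stated assumptions, not the authors' implementation: each channel is a flattened list of $H \times W$ values, Q, K, and V are all taken to be the GELU-activated channels (no learned weights), and the final learned projection layer is omitted. The names `gelu`, `softmax`, and `half_channel_attention` are hypothetical.

```python
import math


def gelu(x):
    """Exact GELU activation using the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def half_channel_attention(channels):
    """channels: list of flattened channel vectors, each of length H*W."""
    half = len(channels) // 2
    identity, active = channels[:half], channels[half:]

    # Attention branch: GELU first, as described in the text.
    act = [[gelu(v) for v in ch] for ch in active]

    # Assumption: Q = K = V = activated channels (no learned projections).
    # Scores are channel-to-channel dot products, followed by softmax.
    scores = [[sum(a * b for a, b in zip(q, k)) for k in act] for q in act]
    weights = [softmax(row) for row in scores]

    # Each output channel is a weighted sum of the value channels.
    width = len(act[0])
    attended = [
        [sum(w * v[p] for w, v in zip(wrow, act)) for p in range(width)]
        for wrow in weights
    ]

    # Concatenate identity and attention branches (projection omitted).
    return identity + attended
```

The output keeps the input's channel count, and the first half of the channels passes through unchanged, which is the property that keeps the module lightweight.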

Spatial attention (SA)

The spatial attention module enhances feature maps by focusing on informative spatial regions. It starts with Group Normalization (GN) on the input feature map of dimensions $\frac{C}{2 K} \times H \times W$ to normalize spatial statistics. A convolutional layer then refines these features, followed by an Exponential Linear Unit (ELU) activation to introduce non-linearity. Average pooling condenses the spatial information, and the resulting map is fused with the original through multiplication to highlight key areas. The process concludes with a sigmoid function to generate attention weights, which are applied to the original feature map to emphasize important spatial features and downplay less relevant ones.

Feature concat & shuffle

The outputs of the half channel attention and spatial attention branches are concatenated into a single feature map that combines channel-wise and spatial information. A channel shuffle operation then interleaves the channels of this map so that information can flow across groups. The result is a richer representation that is passed to subsequent network layers.
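A minimal sketch of the channel shuffle step is given below. It assumes the standard group-wise shuffle (view the channels as a groups-by-width grid, transpose, and flatten); the number of groups is not stated in the text, so `groups=2`, one group per attention branch, is an assumption, and `channel_shuffle` is a hypothetical helper name.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups.

    channels: list of per-channel items (indices, arrays, ...).
    groups:   number of contiguous groups to interleave (assumed value).
    """
    n = len(channels)
    if n % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = n // groups
    # View as (groups, per_group), transpose to (per_group, groups), flatten.
    return [
        channels[g * per_group + i]
        for i in range(per_group)
        for g in range(groups)
    ]


# Channels 0-2 stand for the HCA branch, 3-5 for the SA branch.
print(channel_shuffle([0, 1, 2, 3, 4, 5], groups=2))  # → [0, 3, 1, 4, 2, 5]
```

After the shuffle, every pair of neighboring channels contains one channel from each branch, so a following grouped operation sees both kinds of features.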

Semi-supervised learning pipeline

The proposed training process adopts a three-stage progressive semi-supervised learning framework, shown in Figure 4, which aims to make full use of a small amount of labeled data and a large amount of unlabeled data to improve model performance. In overview, labeled and unlabeled images are embedded by the pre-trained backbone; a dynamic graph combined with a GAT produces pseudo-labels, which are filtered by confidence and then combined with the original labels to retrain the detector.

Figure 4.

The overall framework of pseudo-label generation.

The specific workflow is as follows:

Initialization training with fully labeled data (red arrow section)

First, the fully labeled data (15% of the dataset) is used to initialize the training of the YOLO-DEP model. The feature extractor, combined with the backbone network, extracts multi-scale visual features.

GAT confidence filter training with mixed data (yellow arrow section)

The embedding layer maps image features to a low-dimensional space. A dynamic graph construction module establishes sample relationships based on feature similarity, improving the capture of local semantic structure. A GAT is integrated for joint training. Specifically, pseudo-labels are generated for the unlabeled data (80% of the dataset) and, together with the 15% labeled data, are fed into the GAT, which models global dependencies and adjusts pseudo-label weights using an attention mechanism.

A confidence filter selects high-quality pseudo-labels based on entropy thresholds and probability distribution consistency, reducing noise and improving reliability.
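One possible reading of this filter is sketched below. The text does not give the entropy threshold or define the consistency check precisely, so both are assumptions here: a pseudo-label is kept when the GAT output has low Shannon entropy and its top class agrees with the detector's top class. The names `entropy` and `keep_pseudo_label` and the threshold `max_entropy=0.5` are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(v * math.log(v) for v in p if v > 0)


def keep_pseudo_label(p_detector, p_gat, max_entropy=0.5):
    """Return True if a pseudo-label passes both checks.

    p_detector:  class probabilities from the detector (assumed input).
    p_gat:       class probabilities from the GAT (assumed input).
    max_entropy: hypothetical entropy threshold, not from the paper.
    """
    low_entropy = entropy(p_gat) < max_entropy
    # Assumed consistency check: both models agree on the top class.
    consistent = p_detector.index(max(p_detector)) == p_gat.index(max(p_gat))
    return low_entropy and consistent


print(keep_pseudo_label([0.9, 0.1], [0.95, 0.05]))  # confident and consistent
print(keep_pseudo_label([0.9, 0.1], [0.5, 0.5]))    # too uncertain
```

A stricter threshold keeps fewer but cleaner pseudo-labels, which is the trade-off the filter is meant to control.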

Iterative fine-tuning of YOLO-DEP (blue arrow section)

The confidence-filtered pseudo-labels are stored in a database and then fused with the original labeled data to form an upgraded training set. This strategy balances data diversity and reliability and avoids interference from redundant samples. The updated dataset is used for iterative fine-tuning of YOLO-DEP. By jointly optimizing feature generation, graph structure modeling, and data filtering, the method eases the performance bottleneck that models face under limited labeled data and offers a practical approach to semi-supervised object detection.

Loss functions within training GAT

Supervised classification loss

$$L_{sup} = -\frac{1}{|Y_{L}|} \sum_{i \in Y_{L}} \sum_{c=1}^{C} q_{ic} \log P(c \mid x_{i}) \tag{1}$$

where $Y_{L}$ represents the indices of labeled nodes, $C$ represents the number of classes, $q_{ic} = (1 - \epsilon) \cdot y_{ic} + \epsilon / C$ represents the smoothed label distribution ($\epsilon = 0.1$), and $y_{ic}$ represents the original one-hot label.
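The supervised loss in Eq. (1) can be illustrated with a short, dependency-free Python sketch. The helper names `smooth_labels` and `supervised_loss` and the toy probabilities are assumptions for illustration; only $\epsilon = 0.1$ comes from the text.

```python
import math


def smooth_labels(one_hot, eps=0.1):
    """q_c = (1 - eps) * y_c + eps / C for a one-hot vector y."""
    num_classes = len(one_hot)
    return [(1 - eps) * y + eps / num_classes for y in one_hot]


def supervised_loss(probs, labels, eps=0.1):
    """Mean label-smoothed cross-entropy over labeled nodes.

    probs:  list of predicted probability vectors P(c | x_i).
    labels: list of integer class indices, one per labeled node.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        one_hot = [1.0 if c == y else 0.0 for c in range(len(p))]
        q = smooth_labels(one_hot, eps)
        total += -sum(q_c * math.log(p_c) for q_c, p_c in zip(q, p))
    return total / len(probs)


# Toy example with C = 4 classes (values are illustrative).
print(smooth_labels([1.0, 0.0, 0.0, 0.0]))
print(supervised_loss([[0.7, 0.1, 0.1, 0.1]], [0]))
```

With four classes, smoothing moves the target from 1.0 to 0.925 for the true class and from 0 to 0.025 for each other class, which discourages over-confident predictions on the few labeled nodes.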

Confidence-aware pseudo-label loss

$$L_{pseudo} = -\frac{1}{|Y_{P}|} \sum_{j \in Y_{P}} \log P(\tilde{y}_{j} \mid x_{j}) \tag{2}$$

where $Y_{P} = \{ j \in Y_{U} \mid p_{j}^{max} > \tau \}$ represents the indices of pseudo-labeled nodes filtered by a confidence threshold, $Y_{U}$ represents the indices of unlabeled nodes, $p_{j}^{max} = \max_{c} P(y = c \mid x_{j})$ represents the maximum predicted probability, and $\tilde{y}_{j} = \arg\max_{c} P(y = c \mid x_{j})$ represents the hard pseudo-label.
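A minimal sketch of Eq. (2) follows. The value of $\tau$ is not reported in this section, so `tau=0.9` is a placeholder assumption, as is the function name `pseudo_label_loss`. Because the hard pseudo-label is the arg-max class, the probability inside the logarithm equals the maximum predicted probability.

```python
import math


def pseudo_label_loss(probs, tau=0.9):
    """Mean negative log-likelihood of hard pseudo-labels.

    probs: predicted probability vectors for unlabeled nodes.
    tau:   confidence threshold (placeholder value, not from the paper).
    Only nodes whose maximum probability exceeds tau contribute.
    """
    selected = [p for p in probs if max(p) > tau]
    if not selected:
        return 0.0  # no node passed the filter
    # P(pseudo-label | x_j) equals max(p) for an arg-max pseudo-label.
    return -sum(math.log(max(p)) for p in selected) / len(selected)


# Only the first node is confident enough to be used.
print(pseudo_label_loss([[0.95, 0.05], [0.6, 0.4]]))
```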

Structural consistency loss

$$L_{consist} = \frac{1}{|E|} \sum_{(i, j) \in E} \left\| \log P(y \mid x_{i}) - \log P(y \mid x_{j}) \right\|_{2}^{2} \tag{3}$$

where $E$ represents the edge set excluding self-loops, and $\log P(y \mid x_{i})$ represents the log-probability vector of node $i$.
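Eq. (3) penalizes connected nodes whose predicted distributions differ. The sketch below evaluates it on a toy graph; the function name `consistency_loss` and the example probabilities and edges are illustrative assumptions.

```python
import math


def consistency_loss(probs, edges):
    """Mean squared L2 distance between log-probability vectors on edges.

    probs: list of probability vectors, one per node (all entries > 0).
    edges: list of (i, j) node-index pairs with i != j (no self-loops).
    """
    total = 0.0
    for i, j in edges:
        total += sum(
            (math.log(a) - math.log(b)) ** 2
            for a, b in zip(probs[i], probs[j])
        )
    return total / len(edges)


# Nodes 0 and 1 agree exactly; node 2 differs from node 0.
probs = [[0.5, 0.5], [0.5, 0.5], [0.25, 0.75]]
print(consistency_loss(probs, [(0, 1)]))  # → 0.0
print(consistency_loss(probs, [(0, 2)]))
```

The loss is zero for neighbors with identical predictions and grows with their disagreement, which pushes similar samples in the feature graph toward the same label.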

Total loss function

$$L_{total} = L_{sup} + \lambda_{p}(t) \cdot L_{pseudo} + \lambda_{consist} \cdot L_{consist} \tag{4}$$

where $\lambda_{p}(t) = \min(0.25, t / 150)$ is the weight of the pseudo-label loss, $t$ represents the training step, and $\lambda_{consist} = 0.15$ represents the consistency loss weight.
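The weighting in Eq. (4) can be checked with a few lines of Python. The function names `pseudo_weight` and `total_loss` are hypothetical; the constants 0.25, 150, and 0.15 are those given in the text, and the loss values passed in are arbitrary.

```python
def pseudo_weight(t):
    """Ramp-up weight for the pseudo-label loss: min(0.25, t / 150)."""
    return min(0.25, t / 150)


def total_loss(l_sup, l_pseudo, l_consist, t, lam_consist=0.15):
    """Combine the three loss terms as in Eq. (4)."""
    return l_sup + pseudo_weight(t) * l_pseudo + lam_consist * l_consist


# The pseudo-label weight grows linearly, then saturates at 0.25.
for t in (0, 15, 30, 150):
    print(t, pseudo_weight(t))
```

With these constants the ramp reaches its cap at $t = 37.5$, so pseudo-labels contribute nothing at the start of training and a fixed weight of 0.25 from step 38 onward.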

Dataset and preprocessing methods

The LEEC dataset was crafted to fill the gap in resources for industrial counting tasks, a domain where specialized dataset resources are notably scarce. It comprises 720 high-resolution images, each with a resolution of 3072 $\times$ 3072 pixels, obtained via X-ray transmission technology of packaged electronic components. Encompassing 49 different types of electronic components, the dataset ensures that each type is represented in at least one image.

The LEEC contains images of various electronic components, ranging from standard resistors and capacitors to complex integrated circuits. It focuses on the diversity and comprehensiveness of the data, ensuring that the component images in the dataset have high variability in shape, size, contrast, and background to simulate various situations that may be encountered in the real world. In terms of image annotation, it uses a precise bounding box annotation method to assign a bounding box to each electronic component in the image and record its category information in detail. This annotation method not only provides the model with the precise location information of the component but also enables the model to learn the spatial relationship between different components. Accurate component count annotations have been provided for every image within the dataset, ensuring that each component is precisely identified and tallied. This approach furnishes the model with high-quality training data. Furthermore, special attention has been given to the balance of the data, ensuring that the representation of each component type within the dataset mirrors its distribution in real-world applications, thus preventing the overrepresentation of specific categories.

The LEEC dataset is the first dataset dedicated to small-sample, multi-class X-ray imaging of industrial electronic components. The components differ widely in appearance, size, and brightness because of their different shapes, sizes, and transmission characteristics. Many are small and densely distributed: the smallest target in the dataset measures only 15 $\times$ 8 pixels, or 0.00127% of the total image area, which increases the difficulty of data processing. Compared with existing counting datasets, LEEC has a larger image size and correspondingly higher computation and memory requirements. In terms of the total number of annotations and the average count per image, LEEC surpasses ShanghaiTech, NWPU-MOC, CARPK, UCF_CC_50, and UCF-QNRF, and is second only to XRAY-IECCD. LEEC also leads in the diversity of component types and in image resolution. Table 1 provides a detailed comparison of key indicators between LEEC and other datasets.

Table 1.

Comparison of ShanghaiTech, NWPU-MOC, CARPK, XRAY-IECCD, UCF_CC_50, UCF-QNRF, and LEEC.

Dataset	Image size (W $\times$ H)	Number of images	Number of classes	Number of annotations	Average count
ShanghaiTech-A	868 $\times$ 589	482	1	241,677	501
ShanghaiTech-B	1024 $\times$ 768	716	1	88,488	123
NWPU-MOC	1024 $\times$ 1024	3,416	14	383,195	112
CARPK	1280 $\times$ 720	1,448	1	89,777	62
XRAY-IECCD	2048 $\times$ 2048	1,460	10	2,915,126	1,996
UCF_CC_50	2888 $\times$ 2101	50	1	63,974	1,279
UCF-QNRF	2902 $\times$ 2013	1,535	1	1,251,642	815
LEEC	3072 $\times$ 3072	720	49	1,347,599	1,872

Figure 5 shows sample images from the LEEC dataset and normalization results for two different scenes. Figure 6 shows the preprocessing process of the LEEC dataset:

Image Normalization: Eliminates illumination and contrast variances while mitigating peripheral artifacts to ensure pixel uniformity and component visibility.

OBB Annotation: Employs Oriented Bounding Boxes (OBBs) on normalized images to achieve precise target identification and spatial localization.

Target Separation: Decouples components from the background through OBB-informed binary masking and bitwise operations to generate clean training samples.

Orientation Alignment: Rotates isolated components and their corresponding labels to a canonical orientation to facilitate standardized feature learning.

Figure 5.

The display of the proposed LEEC dataset. The left side shows 12 pictures of different types in the LEEC dataset, and the right side shows the normalized processing results of two original images in different scenes, corresponding to the conditions of bright strips on the edges (upper right) and dark scenes (lower right).

Figure 6.

LEEC dataset preprocessing method.

Experiment

Implementation details

The LEEC dataset is partitioned with 15% labeled data allocated for training and 5% labeled data reserved for testing, while the remaining 80% of the images are unlabeled. Each image undergoes a series of preprocessing steps, including rotational alignment, dimensional adjustment through cropping and scaling, followed by resolution normalization to 512 $\times$ 512 pixels. To enhance diversity and simulate real-world manufacturing conditions, photometric transformations (such as adjusting contrast, brightness, saturation, and hue) are also used. Detailed information about each component is provided in Table 2.

Table 2.

Distribution statistics of 9 types of electronic components. The train and test sets consist of non-empty images and corresponding annotations selected from the original images after segmentation; the Annotations column gives the total number of annotations in the original images.

Type	Train set	Test set	Annotations
4-SMD	69	17	1591
8mmcap	113	28	459
CDCard	29	7	699
DG-2002-40	97	24	1500
RH74-101M-S-Z	71	17	1000
RWO49-26	49	12	1995
SWITCH	100	25	1001
USB Connect	89	22	1000
XBH-8A	112	28	702

For training, a batch size of 16 was chosen to balance GPU memory usage and training stability. A momentum optimizer (momentum 0.937; Table 3), initialized with a learning rate of 0.01 and paired with a cosine annealing schedule, was used to achieve efficient convergence. Training was run on NVIDIA RTX A6000 GPUs, each with 48 GB of VRAM, and Automatic Mixed Precision (AMP) was used to increase computational throughput and reduce memory usage without compromising accuracy. Hyperparameter configurations are listed in Table 3.

Table 3.

Hyperparameters used in the model.

Hyperparameter	Value
Epochs	200
Batch_size	16
Img_size	512
Momentum	0.937
Patience (epochs without improvement)	1000
Lr_0 (initial learning rate)	0.01
Lr_f (learning rate decay factor)	0.2
Weight_decay (L2 regularization strength)	0.0005
Box (box loss gain)	0.05
Cls (classification loss gain)	0.5
Obj (objectness loss gain)	1.0
Iou_t (IOU threshold for positive anchors)	0.2
Anchor_t (anchor threshold)	4.0
Mosaic (probability of mosaic augmentation)	0.75
Mixup (probability of mixup augmentation)	0.1
Csl_radius (center sample radius)	2.0

Comparison study in the LEEC dataset

As shown in Table 4, the YOLO-DEP series shows clear advantages in the core detection metrics. Compared with recent YOLO and RT-DETR models of the same scale, YOLO-DEP leads in localization accuracy and overall detection performance. At all five scales (n/s/m/l/x), YOLO-DEP attains the highest mAP50 and mAP50–95. For YOLO-DEP-x-obb, mAP50 (79.2%) and mAP50–95 (70.9%) are 1.1 and 1.3 percentage points higher than those of YOLOv11-x-obb, and 1.3 and 3.3 percentage points higher than those of RT-DETR-x-obb.

Table 4.

Metrics for each model on the validation set after training on the full labeled training dataset: precision (%), recall (%), mean average precision at an IoU threshold of 50% (mAP50, %) and averaged over thresholds from 50% to 95% (mAP50–95, %), number of parameters (#Param.), floating-point operations (FLOPs), latency, and electronic component counting error rate (CE). (Bold numbers indicate the best result enclosed by the nearest top and bottom horizontal rules.)

Model	Precision (%)	Recall (%)	mAP50 (%)	mAP50–95 (%)	#Param. (M)	FLOPs (G)	Latency (ms)	CE (‰)
YOLOv5-n-obb	83.3	55.6	69.0	46.5	2.6	7.4	2.3	33.2
YOLOv8-n-obb	92.1	58.8	73.0	59.4	3.1	8.4	2.0	14.6
YOLOv11-n-obb	93.7	59.7	74.3	61.2	2.7	6.6	1.1	11.7
YOLO-DEP-n-obb	91.8	60.0	75.1	61.5	3.3	7.1	1.6	9.9
YOLOv5-s-obb	91.0	60.5	72.9	58.8	9.4	24.9	3.0	15.7
YOLOv8-s-obb	92.6	58.8	73.4	59.5	11.4	29.5	2.7	12.1
YOLOv11-s-obb	94.7	59.6	74.9	62.9	9.7	22.4	1.4	8.5
YOLO-DEP-s-obb	93.4	60.8	76.4	64.0	11.0	23.4	1.9	7.4
YOLOv5-m-obb	93.1	61.1	75.3	62.5	25.7	66.3	3.8	9.5
YOLOv8-m-obb	94.0	62.0	76.8	66.0	26.4	81.0	3.7	5.6
YOLOv11-m-obb	94.6	60.7	77.0	67.2	20.9	71.5	2.3	3.3
YOLO-DEP-m-obb	94.9	62.0	78.3	68.8	22.2	72.6	3.0	1.7
YOLOv5-l-obb	94.4	61.6	76.5	65.5	54.3	138.8	4.6	6.4
RT-DETR-l-obb	94.6	62.4	76.2	66.9	32.9	104.1	7.4	5.3
YOLOv8-l-obb	95.2	60.6	77.2	67.2	44.5	168.7	4.9	3.8
YOLOv11-l-obb	95.8	60.8	77.5	68.1	26.2	90.5	3.3	2.4
YOLO-DEP-l-obb	95.3	62.3	78.7	69.5	27.3	91.4	4.1	1.2
YOLOv5-x-obb	95.0	62.1	77.1	66.7	99.0	252.4	6.4	3.1
RT-DETR-x-obb	95.1	61.2	77.9	67.6	66.4	223.2	8.6	2.8
YOLOv8-x-obb	95.1	62.3	78.2	68.3	69.5	263.4	6.2	1.9
YOLOv11-x-obb	95.8	61.0	78.1	69.6	58.8	203.0	5.3	1.3
YOLO-DEP-x-obb	95.8	62.4	79.2	70.9	60.6	204.4	5.8	0.8

With respect to the key indicator, the counting error rate (CE), defined as the absolute difference between the predicted and true counts divided by the true count and reported in per mille (‰), YOLO-DEP demonstrates outstanding stability in industrial counting scenarios. Every YOLO-DEP model keeps CE below 10‰, and YOLO-DEP-x-obb reaches only 0.8‰, which is 38.5% lower than YOLOv11-x-obb (1.3‰) and 71.4% lower than RT-DETR-x-obb (2.8‰). This sub-per-mille error, combined with a latency of 5.8 ms (Table 4), supports the practical value of the architecture in real-time industrial inspection systems. The gain comes at a modest cost: relative to YOLOv11-x-obb, YOLO-DEP-x-obb adds 1.8 M parameters, 1.4 GFLOPs and 0.5 ms of latency. YOLO-DEP thus achieves a favorable balance among accuracy, speed, and resource consumption through its optimized feature fusion mechanism, providing a new technical paradigm for target detection and counting tasks in complex scenarios.
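
The CE definition above, and the relative reductions quoted from Table 4, can be reproduced with a minimal sketch. The function names and the example tray of 2500 components are hypothetical, and how CE is aggregated over images is not specified in the text, so only the per-sample formula is shown.

```python
def counting_error_permille(predicted, actual):
    """Counting error |predicted - actual| / actual, expressed in per mille."""
    if actual <= 0:
        raise ValueError("actual count must be positive")
    return 1000.0 * abs(predicted - actual) / actual


def relative_reduction_percent(ours, baseline):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline


# Hypothetical tray: 2500 components, 2498 of them counted by the detector.
print(counting_error_permille(2498, 2500))

# Relative CE reductions of YOLO-DEP-x-obb, using the values in Table 4.
print(round(relative_reduction_percent(0.8, 1.3), 1))  # vs. YOLOv11-x-obb
print(round(relative_reduction_percent(0.8, 2.8), 1))  # vs. RT-DETR-x-obb
```

In the hypothetical example, missing 2 of 2500 components corresponds to a CE of 0.8‰, the level reported for YOLO-DEP-x-obb.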

Comparison study on public dataset

As shown in Table 5, experiments on the DOTAv1 aerial imagery dataset show that YOLO-DEP achieves the highest mAP50 at every scale in multi-directional target detection. At the x-large model scale, YOLO-DEP-x-obb achieves the best detection accuracy with 85.5% mAP50, compared with 84.9% for YOLOv11-x-obb. YOLO-DEP optimizes feature expression through the deep encoding processor (DEP) module and enhances the contour resolution capability of complex targets. It does so with fewer parameters than YOLOv8 at the s to x scales and lower FLOPs than YOLOv8 at every scale, at a small overhead relative to YOLOv11. The hierarchical feature encoding strategy of the DEP module effectively improves the model’s detection robustness for small and rotated targets, and provides a new technical pathway for automatic interpretation of the detection results.

Table 5.

Metrics for each model on DOTAv1. (Bold numbers indicate the best result enclosed by the nearest top and bottom horizontal rules.)

Model	mAP50(%)	Params(M)	FLOPs (G)
YOLOv8-n-obb	79.9	3.1	8.3
YOLOv11-n-obb	80.4	2.7	6.6
YOLO-DEP-n-obb	80.7	3.2	7.0
YOLOv8-s-obb	82.4	11.4	29.4
YOLOv11-s-obb	82.5	9.7	22.3
YOLO-DEP-s-obb	82.9	11.0	23.4
YOLOv8-m-obb	83.0	26.4	80.9
YOLOv11-m-obb	83.8	20.9	71.4
YOLO-DEP-m-obb	84.4	22.2	72.4
YOLOv8-l-obb	83.4	44.5	168.6
YOLOv11-l-obb	84.1	26.1	90.3
YOLO-DEP-l-obb	84.8	27.3	91.2
YOLOv8-x-obb	83.9	69.5	263.2
YOLOv11-x-obb	84.9	58.8	202.8
YOLO-DEP-x-obb	85.5	60.5	204.2

Ablation study

Table 6 presents an ablation study of YOLO-DEP on the LEEC dataset, evaluating the impact of the DEP and SARI modules on model performance, measured by AP^val(%) (whose values match the mAP50 column of Table 4), parameters, FLOPs, and latency. The table is organized by YOLO-DEP variant (n, s, m, l, x), with four rows per variant showing the different module combinations. Checkmarks (✓) indicate included modules.

Table 6.

Ablation study with YOLO-DEP on LEEC. (Bold numbers indicate the best result enclosed by the nearest top and bottom horizontal rules. All models use the OBB head; ‘-obb’ suffix is omitted for compactness.)

Model	Baseline	DEP	SARI	AP^val(%)	#Param.(M)	FLOPs(G)	Latency(ms)
YOLO-DEP-n	✓			74.3	2.7	6.6	1.1
YOLO-DEP-n	✓	✓		73.4	2.6	6.6	1.3
YOLO-DEP-n	✓		✓	74.7	3.3	7.1	1.3
YOLO-DEP-n	✓	✓	✓	75.1	3.3	7.1	1.6
YOLO-DEP-s	✓			74.9	9.7	22.4	1.4
YOLO-DEP-s	✓	✓		75.3	9.6	22.3	1.4
YOLO-DEP-s	✓		✓	75.4	11.2	23.5	1.8
YOLO-DEP-s	✓	✓	✓	76.4	11.0	23.4	1.9
YOLO-DEP-m	✓			77.0	20.9	71.5	2.3
YOLO-DEP-m	✓	✓		77.3	20.8	71.4	2.4
YOLO-DEP-m	✓		✓	77.4	22.4	72.7	2.8
YOLO-DEP-m	✓	✓	✓	78.3	22.2	72.6	3.0
YOLO-DEP-l	✓			77.5	26.2	90.5	3.3
YOLO-DEP-l	✓	✓		77.3	25.9	90.2	3.6
YOLO-DEP-l	✓		✓	77.7	27.6	91.6	3.9
YOLO-DEP-l	✓	✓	✓	78.7	27.3	91.4	4.1
YOLO-DEP-x	✓			78.1	58.8	203.0	5.3
YOLO-DEP-x	✓	✓		77.7	58.1	202.5	5.5
YOLO-DEP-x	✓		✓	78.6	61.3	205.0	5.6
YOLO-DEP-x	✓	✓	✓	79.2	60.6	204.4	5.8

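The ablation study shows that combining the DEP and SARI modules improves AP^val over the baseline at every scale, by 0.8 (n), 1.5 (s), 1.3 (m), 1.2 (l) and 1.1 (x) percentage points. SARI alone yields a gain of 0.2 to 0.5 points at every scale, whereas DEP alone helps only at the s and m scales (by 0.4 and 0.3 points) and lowers AP^val at the n, l and x scales (by 0.9, 0.2 and 0.4 points), so the two modules are complementary rather than independently beneficial. YOLO-DEP-x achieves the highest absolute AP^val (79.2%), while the largest gain over the baseline occurs at the s scale; in every case the full model costs additional parameters, FLOPs and latency. This suggests that the attention mechanisms are important for enhancing the model’s ability to detect objects, especially in more complex scenarios that require finer-grained feature extraction and spatial awareness.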
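
The per-scale effect of each module can be read off Table 6 with a short script. The sketch below copies the AP^val values from the table and reports each configuration's difference from the baseline in percentage points; the names `TABLE6_AP` and `ablation_deltas` are hypothetical.

```python
# AP^val (%) from Table 6, per scale: (baseline, +DEP, +SARI, +DEP+SARI).
TABLE6_AP = {
    "n": (74.3, 73.4, 74.7, 75.1),
    "s": (74.9, 75.3, 75.4, 76.4),
    "m": (77.0, 77.3, 77.4, 78.3),
    "l": (77.5, 77.3, 77.7, 78.7),
    "x": (78.1, 77.7, 78.6, 79.2),
}


def ablation_deltas(rows):
    """Return AP^val differences from the baseline, in percentage points."""
    deltas = {}
    for scale, (base, dep, sari, both) in rows.items():
        deltas[scale] = {
            "DEP": round(dep - base, 1),
            "SARI": round(sari - base, 1),
            "DEP+SARI": round(both - base, 1),
        }
    return deltas


deltas = ablation_deltas(TABLE6_AP)
for scale, row in deltas.items():
    print(scale, row)

# Scale with the largest combined gain over its baseline.
best_scale = max(deltas, key=lambda s: deltas[s]["DEP+SARI"])
print("largest combined gain:", best_scale)
```

Tabulating the differences this way separates the contribution of each module from the combined effect, which a single AP^val column does not make obvious.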

Pseudo-labeling study

The experiments in Table 7 show that pseudo-labeling improves the recall rate and the strict-IoU localization accuracy (mAP50 $-$ 95) of all three models, but its impact on mAP50 and precision varies with the model structure. For YOLO-DEP-x and YOLOv11-x, precision increased by 1.1 and 0.9 percentage points, recall by 0.9 and 0.7 points, and mAP50 $-$ 95 by 0.7 and 1.0 points, respectively, while the mAP50 of both decreased slightly ( $- 0.2$ and $- 0.1$ points), indicating that pseudo-labels may introduce slight noise that shows up at the looser 0.5 IoU threshold. In contrast, the recall rate and mAP50 $-$ 95 of RT-DETR-x increased by 1.0 and 0.8 points, but its precision and mAP50 decreased by 0.1 and 0.4 points, respectively, suggesting that the transformer architecture may be more sensitive to pseudo-label noise. Therefore, future research needs to combine dynamic thresholds or confidence weighting strategies to improve the robustness of different models in pseudo-labeling applications.

Table 7.

Performance comparison before and after pseudo-labeling. (All models use the OBB head; ‘-obb’ suffix is omitted for compactness.)

Model	Metric	Before/After PL	Improvement (percentage points)
YOLO-DEP-x	Precision(%)	95.8/96.9	$+ 1.1 ↑$
	Recall(%)	62.4/63.3	$+ 0.9 ↑$
	mAP50(%)	79.2/79.0	$- 0.2 ↓$
	mAP50 $-$ 95(%)	70.9/71.6	$+ 0.7 ↑$
YOLOv11-x	Precision(%)	95.8/96.7	$+ 0.9 ↑$
	Recall(%)	61.0/61.7	$+ 0.7 ↑$
	mAP50(%)	78.1/78.0	$- 0.1 ↓$
	mAP50 $-$ 95(%)	69.6/70.6	$+ 1.0 ↑$
RT-DETR-x	Precision(%)	95.1/95.0	$- 0.1 ↓$
	Recall(%)	61.2/62.2	$+ 1.0 ↑$
	mAP50(%)	77.9/77.5	$- 0.4 ↓$
	mAP50 $-$ 95(%)	67.6/68.4	$+ 0.8 ↑$
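
The dynamic thresholds and confidence weighting suggested above as future work could take many forms. The sketch below shows one hypothetical reading, not the authors' method: each class receives its own threshold (a quantile of that class's confidence scores, bounded below by a fixed floor), and each retained pseudo-label carries its confidence as a training weight. The function name, the `floor` and `quantile` parameters, and the example detections are all assumptions made for illustration.

```python
from collections import defaultdict


def filter_pseudo_labels(detections, floor=0.5, quantile=0.5):
    """Keep pseudo-labels above a per-class threshold and weight them by confidence.

    detections: list of (class_id, confidence) pairs.
    Returns (kept, thresholds), where kept holds (class_id, confidence, weight).
    """
    by_class = defaultdict(list)
    for class_id, conf in detections:
        by_class[class_id].append(conf)

    thresholds = {}
    for class_id, confs in by_class.items():
        ordered = sorted(confs)
        # Pick the quantile by index, clamped to the last element.
        idx = min(int(quantile * len(ordered)), len(ordered) - 1)
        # Never drop below the global floor, even for low-scoring classes.
        thresholds[class_id] = max(floor, ordered[idx])

    # Confidence weighting: the weight is simply the detection confidence.
    kept = [(c, conf, conf) for c, conf in detections if conf >= thresholds[c]]
    return kept, thresholds


# Hypothetical detections for two component classes.
example = [(0, 0.95), (0, 0.90), (0, 0.40), (1, 0.55), (1, 0.45)]
kept, thresholds = filter_pseudo_labels(example)
print(thresholds)
print(kept)
```

A per-class threshold lets classes that the detector scores conservatively contribute pseudo-labels, while the floor keeps clearly unreliable boxes out of training.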

Results visualization

The first column in Figure 7 displays the input images of 9 types of electronic components, the second column shows the corresponding ground truth detection maps, and the third to fifth columns display the detection maps predicted by the proposed method and by two other recent methods (YOLOv11 and RT-DETR). Red rotated boxes mark targets annotated in the ground truth or detected by the model, and blue rotated boxes mark targets that were not detected.

Figure 7.

Detection maps generated by YOLO-DEP and other methods for 9 types of components in the LEEC dataset.

As can be seen, YOLO-DEP-x-obb accurately locates components in most cases, and its predicted boxes fit the real targets well, which reduces false detections and missed detections. YOLOv11-x-obb has similarly good overall detection performance, but is slightly inferior to YOLO-DEP-x-obb when dealing with complex or mutually interfering components, and occasionally misses targets. RT-DETR-x-obb can capture subtle features and identify occluded parts, but has a higher missed-detection rate.

Heatmap visualization

According to Figure 8, YOLO-DEP allocates attention in a clear primary-secondary pattern. Most of the attention is concentrated on the core area of the circular material disk, which supports high-precision detection of the main targets and reduces the likelihood of false and missed detections. At the same time, part of the attention is allocated to the four small areas between the square disk and the circular disk, which helps capture potential targets at the edge or in occluded regions (such as offset or stacked material), thereby improving robustness in complex scenes. Through this primary-secondary allocation, YOLO-DEP preserves the accuracy of core target detection while avoiding the omission of key details that would result from ignoring the background completely.

Figure 8.

Heatmap of YOLO-DEP and other models.

YOLOv11 mainly focuses on the edge area of the square disk and the circular disk, but does not pay enough attention to the core circular material disk, which may lead to a decrease in the stability of the main target detection. In addition, its excessive focus on the background or secondary areas makes it more sensitive to noise (such as lighting changes or irrelevant objects), thereby increasing the false detection rate.

The attention distribution of YOLOv8 is relatively scattered, lacks a clear focus, and fails to highlight the core area, resulting in insufficient localization accuracy for the circular material disk. At the same time, it pays less attention to small edge areas (such as gaps between materials or stacked parts) and is prone to missing small targets or occluded targets in complex scenes.

Conclusion

This work introduces YOLO-DEP, a semi-supervised detector and counter for high-resolution X-ray radiographs. A Deep Encoding Processor inside the YOLO backbone preserves fine structure while capturing multi-scale context, achieving sub-pixel localization and top-tier counting accuracy on dense micro-electronic assemblies. To support reproducible research, we release LEEC (49 classes, 3072 $\times$ 3072 px), the first large dataset for dense radiographic counting. Sparse expert labels are expanded via a graph-attention-driven label-propagation scheme, substantially reducing annotation effort. Experiments show that YOLO-DEP performs well across varied scales, shapes, and rotations, making it well suited to fuel fabrication, detector-module assembly, and other radiation-based inspection lines.

Current work still needs faster inference for edge deployments and rigorous validation under extreme imaging conditions. Going forward, we will streamline the network by pruning parameters and applying knowledge distillation to boost efficiency, and we will extend the framework to neutron radiography, proton CT and $\gamma$-ray imaging, where label scarcity and dense-target ambiguity remain open challenges.

Footnotes

Acknowledgement

This research was made possible through the collaborative efforts of our research team, who contributed their expertise in deep learning, industrial automation, and X-ray imaging. We extend special thanks to the team members who assisted with dataset collection, annotation, and experimental validation. Additionally, we thank the reviewers for their valuable feedback, which helped improve the quality of this work.

ORCID iDs

Zhixuan Xiao

Huahai Sun

Xu Tuo

Liang Li

Author contributions

Zhixuan Xiao: Conceptualization, Methodology, Software, Formal analysis, Investigation, Data Curation, Writing – Original Draft, Visualization. Huahai Sun: Methodology, Software, Validation, Investigation, Data Curation, Writing – Review & Editing. Xu Tuo: Data Curation, Investigation, Writing – Review & Editing, Supervision. Liang Li: Conceptualization, Supervision, Project administration, Funding acquisition, Writing – Review & Editing.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Beijing Natural Science Foundation under Grant No. L222001.

Declaration of competing interest

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
