Abstract
Unmanned sorting technology can significantly improve the transportation efficiency of the logistics industry, and package detection is an important component of unmanned sorting. This paper proposes a lightweight deep learning network called EPYOLO, for which a lightweight self-attention feature extraction backbone network named EPnet is designed. EPYOLO reduces the Floating-Point Operations (FLOPs) and parameter count of the feature extraction process through an improved Contextual Transformer-slim (CoTs) self-attention module and a GSNConv module. To balance network performance and obtain semantic information for express packages of different sizes and shapes, a multi-scale pyramid structure is adopted using the Feature Pyramid Network (FPN) and the Path Aggregation Network (PAN). Finally, comparative experiments were conducted against state-of-the-art (SOTA) models on a self-built dataset of express packages. The results demonstrate that the mean Average Precision (mAP) of the EPYOLO network reaches 98.8%, with a parameter count only 11.63% of that of YOLOv8s and FLOPs only 9.16% of those of YOLOv8s. Moreover, compared to the YOLOv8s network, the EPYOLO network shows superior detection performance for small targets and overlapping express packages.
Introduction
Nowadays, with the flourishing development of mobile hardware devices and software technology, the e-commerce market has become the major consumer market, which has led to a significant increase in the number of express packages [1]. Package sorting is the most tedious step in logistics transportation and requires a substantial amount of time and labor. The logistics industry therefore urgently needs efficient and intelligent package sorting technology to reduce labor costs, improve sorting efficiency, accelerate transportation, and enhance the competitiveness of logistics companies [2]. The application of target detection technology to express package classification is a future trend. Compared to manual inspection, machines are less likely to produce false or missed detections over long working hours [3]. Target detection can also continue to improve as deep learning technology advances, greatly reducing the need for human resources. A logistics sorting center usually has tens of thousands of sorting devices, each requiring one or more technical personnel to operate. Applying target detection technology can reduce these costs and help keep income and expenses in a more stable balance. Therefore, it is necessary to transform existing sorting centers into intelligent sorting centers [4]. By using more affordable embedded detection devices, logistics companies can reduce initial investment and maintenance costs. This in turn calls for a lightweight and highly effective target detection network that runs efficiently and stably on embedded devices with low computing power. Based on the image features of express packages, this paper designs a lightweight target detection network named EPYOLO. As shown in Fig. 1, the EPYOLO model consists of four parts: Input, Backbone network, Neck and Prediction. On the input side, it uses mainstream Mosaic data augmentation and adaptive image scaling for image preprocessing. Fig. 2 compares the performance of EPYOLO with YOLOv8s: the mAP of the EPYOLO network reaches 0.988, and its weight file is only 2.82 MB, whereas that of YOLOv8s is 24.4 MB. This paper makes the following contributions:

Architecture of EPYOLO network.

Effect comparison of EPYOLO and YOLOv8s.
This paper designs a lightweight backbone network named EPnet, which consists of multiple lightweight self-attention feature extraction modules called EPBlock. To enhance the global semantic information of the model and preserve more image information with a minimal number of parameters, a self-attention module named CoTs is designed and embedded into the feature extraction modules. EPnet combines the low computational cost and high efficiency of Convolutional Neural Networks (CNNs) with the generalization ability of the self-attention module. CoTs addresses the high memory access cost within the modules through group convolution and combination with CNN, and it also improves the detection of overlapping packages.
An express package dataset was created, and the self-attention network was applied for the first time in the logistics field to express package detection. The research in this paper fills a gap in the field of logistics.
Based on the size and shape of express packages, a multi-scale detection layer with a lightweight pyramid structure is innovatively added after the lightweight self-attention feature extraction module, reducing the false detection probability of small objects at the edges.
In the multi-scale pyramid, a lightweight convolution module named GSNConv and two Coordinate Attention (CA) modules [5] are designed. The former significantly reduces the parameter count without sacrificing accuracy. The latter focuses on high-level semantic information, suppresses unnecessary information, and improves the performance of the express package detection network. The convolutional channels are also adjusted to reduce data redundancy in the network.
With the development of artificial intelligence, deep learning has gone through two stages, the CNN and the Recurrent Neural Network (RNN), and has now advanced to the Transformer stage. Although self-attention models such as ViT [6], DeiT [7] and NesT [8] have shown excellent performance in large-scale tasks, they struggle to model input sequences deeply on small-scale datasets because the data are sparse, and their performance there is not as good as that of CNNs. Moreover, their large computational requirements and the higher cost of deployment devices limit their practical application. Although lightweight detection models such as ShuffleNet-V2 [9], MobileNet v3 [10] and YOLOX [11] may lose detection accuracy on large-scale datasets, they still perform well on small-scale datasets and meet the demand for high real-time performance. Hence, despite the weaker performance of CNNs in large-scale tasks, CNN detection models such as YOLOv5 [12], YOLOv6 [13] and YOLOv7 [14] are still widely used in real-world scenarios. The combination of CNN and Transformer in lightweight networks has become an important research direction, as it can overcome the limited receptive field of traditional CNNs and reduce the computational cost of self-attention. Representative models include Lite Transformer [15], MobileViT [16] and ConvNeXt [17].
The purpose of this study is to classify and detect express packages in logistics sorting scenarios using deep learning and computer vision. The focus is on the size and detection performance of the constructed network. Therefore, this study discusses the approaches and limitations in the field of target detection in recent years.
Jiang et al. [18] improved the traffic sign recognition performance of YOLOv3 [19] by decomposing the convolution process into Depthwise Convolution (DWConv) and Pointwise Convolution (PWConv), thereby separating intra-channel and inter-channel convolutions. They also replaced the original loss function with the GIoU [20] loss function, improving detection accuracy and addressing real-time performance issues, but the detection speed still needs improvement. Aggarwal et al. [21] proposed a method that uses regions of interest (ROIs) to identify vehicle license plates. They segmented the image into ROIs using low-pass filtering and dynamic thresholding and then extracted and recognized license plates from the ROIs, achieving high accuracy and a low false positive rate in license plate detection. Liu et al. [22] proposed the Swin Transformer, a hierarchical vision Transformer intended as a general-purpose backbone. It reduces computational complexity by computing self-attention within shifted local windows and enables the self-attention structure to construct hierarchical feature maps similar to those used in convolutional neural networks. Although it achieves high accuracy, its computational complexity remains high. Ren et al. [23] proposed a lightweight real-time detection algorithm, YOLOv5-R, which replaces the backbone network with GhostNet [24] and incorporates the Efficient Channel Attention (ECA) module [25] into GhostNet to improve the inference speed. However, it still has limitations in accuracy. Maaz et al. [26] proposed a lightweight hybrid architecture that combines the strengths of CNN and Transformer models. It consists of convolutional and self-attention-based efficient encoders that integrate local and global information, improving model efficiency and reducing computational complexity. However, it cannot construct hierarchical feature maps like those of the Swin Transformer and has poor detection performance for small objects.
Therefore, to enhance the detection capability and performance of the network, this paper further improves the feature extraction module based on the combination of CNN and Transformer models, significantly reducing the computational complexity. Additionally, a multi-scale pyramid structure is incorporated into the self-attention model to improve the performance and feature representation capability of the model. By fusing features from different scales [27], the detection performance for objects of different scales is improved.
EPYOLO network
Backbone design
This paper proposes a lightweight self-attention backbone network called EPnet. At the top of the backbone network, a CBRM layer is introduced, consisting of convolutions and max pooling that achieve 4-fold downsampling, suppress non-essential features, and increase the receptive field. As shown in Fig. 3, this layer removes redundant information from non-essential features while preserving the essential ones.

Feature maps after the CBRM layer.
Immediately following is the main part of the backbone network, as shown in Fig. 4. This main body consists of six lightweight self-attention feature extraction modules called EPBlock, which use the classic inverted residual structure so that the convolutions can extract more features and the learning capacity of the network is enhanced. Based on the stride value, the block takes one of two forms, and EPnet performs different convolution operations on the feature maps in each [28]. When the stride is 1, a unit consists of two PWConv layers, one DWConv layer, and the CoTs module. First, a 1x1 PWConv expands the input channels by a factor of four. Next, a 3x3 DWConv extracts high-dimensional information, which is passed to CoTs to obtain separable high-dimensional feature information [29], thereby enhancing the effectiveness of feature weighting. Lastly, a 1x1 PWConv adjusts the channels so that their number matches the input. When the stride is 2, a unit is divided into left and right branches, and the input feature map is fed to both. Each branch consists of a 1x1 PWConv for dimension expansion, a 3x3 DWConv with stride 2, and a 1x1 PWConv for dimension reduction. The first PWConv in the left branch always sets the number of channels to a fixed value of 32, while the first PWConv in the right branch expands the channels to four times the input. This design allows more linear combinations of feature maps at different high and low dimensions. Finally, the two branches are joined at the same network depth by concatenation (Concat), giving an output with twice as many channels as the input. This preserves the original two-dimensional structure and improves feature abstraction and network representation.

Architecture of EPBlock.
The features output by the two units at the end of EPBlock undergo a Channel Shuffle operation to maintain information interaction between neuron groups and improve model performance [30]. EPBlock also uses the Rectified Linear Unit 6 (ReLU6) activation function, a variant of the Rectified Linear Unit (ReLU) that limits the output to the range from 0 to 6, allowing better control of the activation function's output range. This improves the stability and generalization ability of the model in some boundary-sensitive tasks. The formula of ReLU6 is as follows:
$$\mathrm{ReLU6}(x) = \min(\max(0, x), 6)$$
where “x” represents the input value, the “min” function returns the smaller of its two arguments, and the “max” function returns the larger. ReLU6 therefore outputs 0 if “x” is less than 0, outputs “x” if “x” lies between 0 and 6, and truncates the output to 6 if “x” is greater than 6. The formula gives the output of the ReLU6 activation function for any real number and helps ensure that low-dimensional feature information is not lost when mapping from high-dimensional space to low-dimensional space.
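The clamping behaviour of ReLU6 described above can be illustrated with a minimal scalar sketch in plain Python. The function name `relu6` and the sample inputs are chosen here for illustration only and are not taken from the EPYOLO implementation.

```python
def relu6(x: float) -> float:
    """Clamp x to the closed interval [0, 6], i.e. min(max(0, x), 6)."""
    return min(max(0.0, x), 6.0)


# Illustrative inputs covering the three regimes: x < 0, 0 <= x <= 6, x > 6.
for value in (-2.0, 0.0, 3.5, 6.0, 9.0):
    print(f"relu6({value}) = {relu6(value)}")
```

Negative inputs map to 0, inputs between 0 and 6 pass through unchanged, and larger inputs saturate at 6.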
EPBlock achieves this by adjusting the number of convolutional kernels to maintain the same input and output channel numbers, thereby ensuring minimal memory cost and alleviating the increased memory cost introduced by the inclusion of the CoTs self-attention module. Finally, a Spatial Pyramid Pooling Fast (SPPF) module is added at the end of the backbone network to combine feature maps from different receptive domains, thereby improving runtime speed and achieving adaptive output sizing.
In this paper, to combine the capability of Transformers to capture global information with the capability of CNNs to capture neighboring local information, the Contextual Transformer (CoT) module is introduced. This module integrates self-attention learning into a unified framework, thereby enhancing the visual expressive ability of the network model. To reduce the computational cost of the CoT module, CoTs, a variant with lower resource demands, is proposed. As shown in Fig. 5, the CoTs module first reduces the number of channels through a 1x1 convolution (Conv), thereby reducing the number of parameters. Next, it uses a 3x3 DWConv to embed the input keys with contextual information and obtain neighboring local information. Then, the number of channels is restored to that of the input using a 1x1 Conv, and the feature map is concatenated and overlaid with the original image [31]. Subsequently, the feature information is adjusted through a 1x1 Conv and Softmax activation. Softmax is a commonly used activation function that maps a real-valued input vector to a probability distribution, so that each output value lies between 0 and 1 and the output values sum to 1. This allows Softmax to be used in multi-category classification tasks, where the input vector is converted to an output vector representing the probability of each category, thus facilitating classification decisions. The formula for Softmax is:
$$\mathrm{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}$$
where $z_i$ represents the i-th element of the input vector and $n$ is the number of output nodes, which is the number of categories for classification. For each element $z_i$ of the input vector, first calculate its exponential value $e^{z_i}$, then take the sum of all $n$ exponential values $e^{z_1}, \dots, e^{z_n}$ as the denominator and divide $e^{z_i}$ by it to obtain the corresponding probability value. In this way the Softmax function maps the input vector to an output vector representing a probability distribution: each element $\mathrm{Softmax}(z_i)$ is the probability of the corresponding category, and the probabilities sum to 1.

Architecture of CoTs module.
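The Softmax mapping described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the CoTs implementation: the function name `softmax` and the input scores are hypothetical, and the subtraction of the maximum is a standard numerical-stability step that leaves the result unchanged.

```python
import math


def softmax(z):
    """Map a list of real scores to probabilities that sum to 1."""
    shift = max(z)  # subtracting the max avoids overflow and does not change the result
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.1, -1.0]  # hypothetical scores for four categories
probs = softmax(scores)
print([round(p, 4) for p in probs], "sum =", round(sum(probs), 6))
```

The largest score receives the largest probability, and every output lies strictly between 0 and 1.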
Finally, the global information is obtained through a self-attention operation on the Value Map. The neighboring information and global information are combined for output fusion. As shown in Fig. 6, compared to the CoT module, the CoTs module reduces the number of parameters while slightly outperforming the CoT module.

Effect comparison of CoT and CoTs.
The comparison between EPnet and other lightweight backbone networks on the express package dataset is presented in Table 1. Our model achieves higher accuracy than Shufflenetv2 and Mobilenetv3 with a similar number of parameters. Compared to Ghostnet, our backbone network reduces the parameter count and FLOPs by 26.91% and 1.1 G, respectively, while achieving the same level of accuracy. Therefore, the proposed backbone network design outperforms others on this dataset.
Comparison of different lightweight backbone networks
According to the distribution characteristics of the express package dataset constructed in this paper, the size of each object in the dataset has been examined. The investigation found that there was a lot of redundant information in the images. In order to improve the model’s ability to extract shallow graphical features and deep semantic features, this paper adopted a multi-scale pyramid structure with FPN and PAN to improve the efficiency of feature fusion in the model [32].
This paper presents a statistical analysis of the distribution characteristics of the express package dataset, as shown in Table 2. This paper investigates the range of pixels in the width and height of the label box for each category of express packages. To establish the criteria for defining small targets, this paper refers to the MSCOCO dataset, where targets with a resolution below 32 pixels×32 pixels are classified as small. Interestingly, all targets in our express package dataset exceed this small target threshold established by the MSCOCO. Nonetheless, due to the presence of overlapping express packages during the classification process, the unobstructed area of the overlapped packages is significantly limited. At this point, the detection boxes meet the definition of small targets according to MSCOCO, indicating the continued need for a small target detection layer.
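The small-target criterion discussed above can be written as a one-line check. The sketch below assumes the MSCOCO convention that a target is small when its area is below 32×32 pixels, and it models the unobstructed part of an occluded package as a rectangle. The function name and the example box sizes are hypothetical and are not values from Table 2.

```python
COCO_SMALL_AREA_PX = 32 * 32  # MSCOCO small-object threshold, in square pixels


def is_small_target(width_px: float, height_px: float) -> bool:
    """Return True if a box of the given size counts as a small target."""
    return width_px * height_px < COCO_SMALL_AREA_PX


full_label_box = (120, 90)  # hypothetical fully visible package
visible_region = (40, 20)   # hypothetical unobstructed part of an overlapped package

print(is_small_target(*full_label_box))  # full box is above the threshold
print(is_small_target(*visible_region))  # visible part falls below the threshold
```

This shows how a package whose label box is well above the threshold can still behave as a small target once most of it is covered by another package.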
Range of label box widths and heights for statistical packages
Additionally, as there are distinct feature differences between the foreground and background of the express image data, and relative feature extraction can be achieved, this paper adjusted the number of convolution kernels, reduced non-primary features, and improved model efficiency. Figure 7 shows the multi-scale feature pyramid structure of EPYOLO, which further enhances the model performance through the lightweight design of the number of filters and output sizes for each layer. The feature pyramid structure sends the feature maps extracted by the EPBlock module convolution from the 4th layer of the backbone network to the Concat layer of the 10th layer, sends the feature maps extracted by the EPBlock module convolution from the 2nd layer to the Concat layer of the 15th layer, sends the feature maps extracted by the GSNConv module convolution from the 13th layer to the Concat layer of the 19th layer, and sends the feature maps extracted by the EPBlock module convolution from the 8th layer to the Concat layer of the 22nd layer. By employing this approach, the low-dimensional feature information fuses with high-dimensional feature information to facilitate multi-scale object detection.

Feature pyramid network architecture of EPYOLO.
This paper proposes a novel convolutional module for downsampling called GSNConv, as illustrated in Fig. 8. The module consists of a 1x1 Conv operation that reduces the number of parameters and a 3x3 DWConv operation that extracts deep information at that dimension, separating redundant feature map information and reducing the computational cost. The output of the 3x3 DWConv operation is concatenated with the output of the previous feature extraction step to promote the interaction of feature information and reduce the information separation caused by the channels of the input images. Batch normalization is then applied to optimize the interaction weights so that the output is normalized in mean and variance [33]. Finally, the Channel Shuffle dense connection allows the dense convolutional features to penetrate the separate features of the depthwise convolution, enhancing the feature extraction and fusion ability of the convolutional operation.
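The Channel Shuffle step used in EPBlock and GSNConv can be sketched on a flat list of channel labels. This is a minimal illustration of the usual reshape, transpose and flatten formulation. The choice of two groups and the labels `pw*` (dense 1x1 Conv features) and `dw*` (depthwise features) are assumptions made here, not details given in the paper.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups: reshape to (groups, n), transpose, flatten."""
    if len(channels) % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    per_group = len(channels) // groups
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


dense_features = ["pw0", "pw1", "pw2"]      # hypothetical output of the 1x1 Conv
depthwise_features = ["dw0", "dw1", "dw2"]  # hypothetical output of the 3x3 DWConv
concatenated = dense_features + depthwise_features

print(channel_shuffle(concatenated, groups=2))
```

After the shuffle, dense and depthwise channels alternate, so any later grouped or depthwise operation sees both kinds of feature side by side.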

Architecture of GSNConv module.
As shown in Table 3, GSNConv has almost the same number of parameters as the GSConv module and the same FLOPs, but it significantly improves the overall accuracy. Compared to standard convolution, the GSNConv module has 123,456 fewer parameters and 0.2 G fewer FLOPs, while also improving accuracy.
Comparison of different downsampling methods
The CA module performs pooling operations on input features by decomposing channel attention into two one-dimensional feature encodings, each of which aggregates features separately along the horizontal and vertical directions. Furthermore, the module performs a series of operations such as average pooling, dimension adjustment, feature concatenation, one-dimensional convolution, Sigmoid activation function and coefficient multiplication. Compared to the SE channel attention mechanism, the CA module preserves accurate positional information to enhance the feature representation of objects. Additionally, the CA module avoids the broadcast mechanism. The weight matrix generated by the module positively impacts position feature extraction, dual spatial fusion, and final model detection. Figure 9 illustrates the specific structure of the CA module.

Architecture of CA module.
This paper examined four frequently utilized attention models and deployed them in identical positions. Following experimentation of the above-mentioned methods, we selected the CA module. Table 4 presents the impact of distinct attention modules formulated in this research.
Image size distribution of express package
Environment setup and dataset preparation
The training configurations in this paper are shown in Table 5:
Software and hardware configuration
Currently, no public datasets are available for detecting express packages, so this paper uses a self-constructed dataset of 6,710 express package images classified into four categories: paper bag, plastic bag, envelope and bubble bag. The images were divided into training, validation, and test sets in an 8:1:1 ratio. The dataset was then fed into the EPYOLO network, which was trained for 200 epochs to obtain the optimal model weights and performance metrics for express package detection. The specific training parameter settings are shown in Table 6.
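The 8:1:1 split described above can be sketched as follows. The paper does not state whether the split was random or stratified by category, so the seeded random shuffle, the function name `split_dataset` and the integer image identifiers are assumptions for illustration only.

```python
import random


def split_dataset(items, ratios=(8, 1, 1), seed=0):
    """Shuffle items reproducibly and split them into train/val/test by ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


image_ids = range(6710)  # stand-in identifiers for the 6,710 images
train, val, test = split_dataset(image_ids)
print(len(train), len(val), len(test))  # 5368 671 671
```

Under these assumptions, 6,710 images yield 5,368 training, 671 validation and 671 test images.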
Training parameters configuration
To better fit the express package training data, constrain model learning and stability, and effectively evaluate and optimize model performance, this paper uses the Complete Intersection over Union (CIoU) loss function. CIoU is a commonly used loss function for target detection that uses the intersection and union of the true box (the bounding box annotated by the labeler) and the predicted box (the bounding box predicted by the algorithm) to compute the similarity between the two boxes. It is computed using the following formulas:
$$L_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v$$
$$v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2$$
$$\alpha = \frac{v}{(1 - IoU) + v}$$
where “b” represents the center coordinates of the predicted box, “b^gt” represents the center coordinates of the true box, and “ρ” represents the Euclidean distance between “b” and “b^gt”. “c” represents the diagonal length of the smallest enclosing region that contains both the predicted and true boxes. “IoU” is the overlap between the predicted box and the true box. “v” measures the consistency of the aspect ratios and “α” is its weighting coefficient; both are determined by “w”, “h”, “w^gt” and “h^gt”, the predicted width, predicted height, true width, and true height respectively [34]. First, calculate the squared distance between the center points of the two boxes, which measures their positional offset in space. Next, calculate the squared diagonal length of the smallest enclosing region, which reflects the scale of the boxes, and obtain the aspect ratio parameters “α” and “v”. The third step is to calculate the intersection area of the bounding boxes, which represents the overlapping region between the two boxes and is used to measure their degree of overlap. Finally, calculate the CIoU loss value. The CIoU loss function handles the overlap and position offset between bounding boxes well and measures their similarity accurately, thus improving the performance of object detection models.
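The four steps above can be followed in a minimal plain-Python sketch of the standard CIoU loss for a single pair of boxes. Boxes are given as (center x, center y, width, height). The function name `ciou_loss`, the small `eps` guard against division by zero and the example boxes are assumptions for illustration, not the training code used in the paper.

```python
import math


def ciou_loss(pred, target, eps=1e-9):
    """CIoU loss for two boxes given as (cx, cy, w, h)."""
    px, py, pw, ph = pred
    tx, ty, tw, th = target

    # Corner coordinates of both boxes.
    p_x1, p_y1, p_x2, p_y2 = px - pw / 2, py - ph / 2, px + pw / 2, py + ph / 2
    t_x1, t_y1, t_x2, t_y2 = tx - tw / 2, ty - th / 2, tx + tw / 2, ty + th / 2

    # Step 1: squared distance between the two centers (rho^2).
    rho2 = (px - tx) ** 2 + (py - ty) ** 2

    # Step 2: squared diagonal of the smallest enclosing box (c^2), then v and alpha.
    c_w = max(p_x2, t_x2) - min(p_x1, t_x1)
    c_h = max(p_y2, t_y2) - min(p_y1, t_y1)
    c2 = c_w ** 2 + c_h ** 2 + eps
    v = (4 / math.pi ** 2) * (math.atan(tw / th) - math.atan(pw / ph)) ** 2

    # Step 3: intersection area, union area and IoU.
    inter_w = max(0.0, min(p_x2, t_x2) - max(p_x1, t_x1))
    inter_h = max(0.0, min(p_y2, t_y2) - max(p_y1, t_y1))
    inter = inter_w * inter_h
    union = pw * ph + tw * th - inter
    iou = inter / (union + eps)
    alpha = v / ((1 - iou) + v + eps)

    # Step 4: combine the terms.
    return 1 - iou + rho2 / c2 + alpha * v


print(ciou_loss((0, 0, 2, 2), (0, 0, 2, 2)))   # identical boxes: loss close to 0
print(ciou_loss((0, 0, 2, 2), (10, 0, 2, 2)))  # disjoint boxes: loss above 1
```

For disjoint boxes the IoU term alone would be constant, whereas the center-distance term still grows with the separation, which is what gives the loss a useful gradient in that case.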
Training and validation datasets were input into the network model to produce loss function curves for boundary box loss, object loss, and classification loss.
The more stable the loss function curve, the better the model performed, as shown in Fig. 10. It is apparent that the loss function curve decreases rapidly when the model is near 30 epochs, then slowly decreases when approaching 120 epochs, until it converges almost completely when nearing 200 epochs.

Loss function curve.
The paper evaluates model performance using FLOPs, Frames Per Second (FPS), Average Precision (AP), mean Average Precision (mAP), Precision and Recall. FPS represents the detection speed of the model: images are captured from the camera in real time, converted into digital signals and sent to the detection model, and the detection speed is determined by the number of images processed per second. FLOPs measure the complexity of algorithms and models, with higher complexity requiring more computation and hardware resources. The calculations for these metrics are as follows:
Where the “FLOPs” represent the computational cost of the current convolutional layer. “Cin” represents the number of input channels, “K” represents the size of the convolutional kernel, “Cout” represents the number of output channels, and “HW” represents the size of the output feature map. The total computational cost is obtained by summing up the computational costs of each convolutional layer during one training round.
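As a rough illustration of how the quantities above combine, the sketch below counts multiply-accumulate operations for a convolutional layer as HW × Cin × K² × Cout. This is one common convention and an assumption here: it ignores the bias term, and some tools report twice this value, so it may differ from the exact formula used in the paper. The layer sizes are hypothetical.

```python
def conv_macs(c_in, c_out, k, out_h, out_w, groups=1):
    """Multiply-accumulates of a k x k convolution producing an out_h x out_w map."""
    return out_h * out_w * (c_in // groups) * k * k * c_out


def depthwise_separable_macs(c_in, c_out, k, out_h, out_w):
    """Cost of a k x k depthwise convolution followed by a 1x1 pointwise convolution."""
    depthwise = conv_macs(c_in, c_in, k, out_h, out_w, groups=c_in)
    pointwise = conv_macs(c_in, c_out, 1, out_h, out_w)
    return depthwise + pointwise


# Hypothetical layer: 64 -> 128 channels, 3x3 kernel, 80x80 output feature map.
standard = conv_macs(64, 128, 3, 80, 80)
separable = depthwise_separable_macs(64, 128, 3, 80, 80)
print(standard, separable, f"{separable / standard:.3f}")
```

For this hypothetical layer the DWConv plus PWConv pair needs roughly 12% of the operations of the standard convolution, which illustrates why the DWConv and PWConv layers used throughout EPYOLO lower the FLOPs.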
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
$$AP = \int_0^1 P(R)\,dR$$
$$mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i$$
where “TP” represents the number of positive samples correctly predicted by the model, “FP” represents the number of negative samples incorrectly predicted as positive, and “FN” represents the number of positive samples incorrectly predicted as negative. “AP” represents the Average Precision, “P” represents Precision, “R” represents Recall, and “mAP” represents the mean Average Precision [35]. “N” represents the number of categories. The images of express packages are divided into four categories, so the value of “N” is 4. The variable “i” represents the i-th category.
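The precision, recall, AP and mAP definitions above can be sketched in plain Python. The AP here is the area under the P-R curve computed with the trapezoidal rule, which is a simplification: detection toolkits usually interpolate the precision values first. All counts, curve points and per-class AP values below are hypothetical and are not results from the paper.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Fraction of actual positives that are found."""
    return tp / (tp + fn) if tp + fn else 0.0


def average_precision(recalls, precisions):
    """Area under the P-R curve (trapezoidal rule); points sorted by recall."""
    area = 0.0
    for i in range(1, len(recalls)):
        width = recalls[i] - recalls[i - 1]
        area += width * (precisions[i] + precisions[i - 1]) / 2
    return area


def mean_average_precision(ap_per_class):
    """Mean of the per-category AP values (N = number of categories)."""
    return sum(ap_per_class) / len(ap_per_class)


print(precision(tp=90, fp=10), recall(tp=90, fn=30))
ap = average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 0.5])
print(ap, mean_average_precision([ap, 1.0, 0.5, 0.75]))
```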
In this paper, the self-made express package dataset was used for validation, and the AP for each category was obtained. The P-R curve of the EPYOLO network is shown in Fig. 11, and the final mAP value for the EPYOLO network model is 0.988.

P-R curve of the EPYOLO network.
For the detection of express packages on the conveyor belt, cameras are installed 1.2 meters directly above the belt to accommodate the height variations of different express packages and to avoid occlusion and feature loss. The running speed of the conveyor belt is set to 1.5 m/s to ensure operational efficiency. To reduce the motion blur caused by the rapid movement of packages and ensure continuity of the image sequence, the camera captures one frame every 30 milliseconds. This deployment supports accurate detection of express packages.
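The deployment figures above imply a simple timing budget, worked through in the sketch below. The belt speed and frame interval come from the text. The 137 FPS figure is the detection speed reported for EPYOLO in the comparative experiments, and assuming it also holds on the deployment hardware is an assumption made here, since the deployment device is not specified.

```python
BELT_SPEED_M_PER_S = 1.5   # conveyor speed stated in the text
FRAME_INTERVAL_S = 0.030   # one frame every 30 ms, as stated in the text
MODEL_FPS = 137.0          # reported EPYOLO detection speed (assumed to hold on the deployed device)


def travel_per_frame(speed_m_per_s, interval_s):
    """Distance a package moves between two consecutive frames, in meters."""
    return speed_m_per_s * interval_s


def keeps_up(model_fps, interval_s):
    """True if one inference finishes before the next frame arrives."""
    return 1.0 / model_fps <= interval_s


travel_m = travel_per_frame(BELT_SPEED_M_PER_S, FRAME_INTERVAL_S)
capture_fps = 1.0 / FRAME_INTERVAL_S
inference_ms = 1000.0 / MODEL_FPS

print(f"belt travel between frames: {travel_m * 100:.1f} cm")
print(f"capture rate: {capture_fps:.1f} frames/s")
print(f"inference time per frame: {inference_ms:.1f} ms")
print("model keeps up with camera:", keeps_up(MODEL_FPS, FRAME_INTERVAL_S))
```

A package moves about 4.5 cm between frames, and under the stated assumption each inference takes well under the 30 ms frame interval.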
To further verify the effectiveness of this method, this paper conducted comparative experiments between EPYOLO and multiple SOTA models on the self-made express package dataset. To ensure the rigor of the experimental results, each result was averaged over multiple cross-validation runs. Table 7 shows the comparative results of the different network models on the various evaluation metrics. The mAP of the EPYOLO network reaches 98.8%. The network size of this method is only 2.82 MB, which is only 2.52% of that of the Swin Transformer and also smaller than that of the other models. The EPYOLO network achieves 137 FPS, which meets the real-time requirement. At the same time, the FLOPs of the EPYOLO network are 2.6 G, which is only 2.47% of those of YOLOv7. In summary, the proposed method significantly reduces the number of parameters and FLOPs while ensuring high accuracy, making it easier to deploy on embedded hardware devices.
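The relative figures quoted above are simple ratios, which the sketch below works through. The EPYOLO values (2.82 MB, 2.6 G FLOPs) and the YOLOv8s weight size (24.4 MB) are taken from the text. The Swin Transformer size and YOLOv7 FLOPs printed here are back-calculated from the stated percentages and are not values quoted from Table 7. The helper names are hypothetical.

```python
def percent_of(part, whole):
    """Express part as a percentage of whole."""
    return 100.0 * part / whole


def implied_whole(part, percent):
    """Recover the reference value from a part and its stated percentage."""
    return part / (percent / 100.0)


EPYOLO_WEIGHTS_MB = 2.82
YOLOV8S_WEIGHTS_MB = 24.4
EPYOLO_GFLOPS = 2.6

weight_ratio = percent_of(EPYOLO_WEIGHTS_MB, YOLOV8S_WEIGHTS_MB)
swin_mb = implied_whole(EPYOLO_WEIGHTS_MB, 2.52)       # back-calculated, not from Table 7
yolov7_gflops = implied_whole(EPYOLO_GFLOPS, 2.47)     # back-calculated, not from Table 7

print(f"EPYOLO weight file vs YOLOv8s: {weight_ratio:.2f}%")
print(f"implied Swin Transformer size: {swin_mb:.1f} MB")
print(f"implied YOLOv7 FLOPs: {yolov7_gflops:.1f} G")
```

The weight-file ratio of about 11.6% is close to the 11.63% parameter ratio given in the abstract, which is expected because file size scales with parameter count.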
Comparison of the proposed model and other models
As shown in Figs. 12 and 13, both the proposed EPYOLO network and the YOLOv8s network demonstrate good detection performance on the regular test image in the top-left corner.

Effect of EPYOLO.

Effect of YOLOv8s.
However, in the images located in the top-right and bottom-left corners, the YOLOv8s network missed the black plastic bag package and the bubble bag package pressed under an envelope package, respectively. The EPYOLO network, benefiting from the combination of CNN and CoTs self-attention modules, has a larger global receptive field, which helps the network establish fine-grained correspondence between targets and backgrounds. Even when both the background and the plastic bag packaging are nearly black, it can still capture subtle differences between the target object and the background. Furthermore, by exploiting the advantages of the multi-scale pyramid structure for small object detection, the EPYOLO network can resolve ambiguity between overlapping target objects. Despite the extensive overlap between the white envelope package and the white bubble bag package, the EPYOLO network accurately identifies both overlapping express packages.
In the bottom-right image, the YOLOv8s network produced a false detection by recognizing one express package as two, while the EPYOLO network correctly identified the plastic bag package. In these comparisons, the EPYOLO network detects express packages considerably better than YOLOv8s.
According to the demand for unmanned sorting in the logistics industry, this paper proposes a lightweight and high-precision express package sorting target detection network, EPYOLO. In terms of structure, this paper designs a lightweight feature extraction network EPnet, which consists of multiple lightweight feature extraction modules EPBlock, and introduces self-attention module CoTs in this CNN feature extraction module. The design of this feature extraction network not only balances the computational cost but also improves the network’s detection of complex scenes. Due to variations in the size and shape of express packages, the advantages of a multi-scale pyramid structure in object detection are combined with the self-attention feature extraction structure. The GSNConv module and the CA module are also introduced to further optimize the model and reduce the parameter count. Experimental results demonstrate that this combined structure can effectively handle objects of different scales and achieve excellent detection performance for overlapping and edge feature-depleted express packages. Ultimately, the mAP of the EPYOLO network reaches 98.8%, with a computational complexity of 2.6 G FLOPs and a weight size of only 2.82 MB. Despite the breakthroughs achieved by the proposed EPYOLO network in lightweight design, accuracy and detection performance, its detection speed still needs improvement. In future work, the focus of research will be on enhancing model performance through architectural refinement and model pruning, making the model even lighter, and optimizing the model’s detection speed.
Footnotes
Acknowledgement
This study was funded by the Natural Science Foundation of Fujian Province (No.2020J05236 and No.2020J01277).
