Abstract
With intelligent breeding attracting growing attention, rapid and accurate detection of pigs during breeding makes it possible to monitor pig health scientifically and improve pig welfare. Existing live pig detection methods cannot complete the detection task both accurately and in real time, so a pig detection model named TR-YOLO is proposed. Cameras were used to collect data at a pig breeding site in Rongchang District, Chongqing City; LabelImg software was used to mark the positions of pigs in the images, and data augmentation was used to expand the data samples, thus constructing a pig dataset. The lightweight YOLOv5n is selected as the baseline detection model. To complete the pig detection task more accurately, a C3DW module built from depth-wise separable convolution with large convolution kernels replaces the C3 module in YOLOv5n, which enlarges the receptive field of the whole detection model; a C3TR module built on the Transformer structure is used to extract more refined global feature information. Compared with the baseline model YOLOv5n, the new detection model adds no extra computational load and improves detection accuracy by 1.6 percentage points. Compared with other lightweight detection models, it has advantages in parameter quantity, computational load, and detection accuracy. It detects pigs during feeding more accurately while satisfying the real-time requirements of target detection, providing an effective method for live monitoring and analysis of pigs at the production site.
Introduction
Pigs are the largest livestock product in China and are of significant strategic importance [1]. Scientific and intelligent pig farming management benefits pig health in the daily management, asset inventory, and breeding processes of large-scale pig farms. Real-time, accurate detection and counting of pigs are software-level prerequisites for intelligent management [2]. Traditional feeding management relies mainly on visual inspection by feeding personnel, which consumes a large amount of manpower and material resources and is inefficient [3, 4]. In an intelligent breeding environment, using visual technology for non-contact, low-stress pig detection and counting is important for achieving precise, individualized health monitoring of group-raised pigs [5, 6]. The application of such technology has been confirmed in several studies. For example, Karahan et al. [7] used machine learning to detect facial features and objects, Sharifi et al. [8] used machine learning and meta-heuristic algorithms to predict Bitcoin prices, and Mekawy [9] used neural networks for object detection in smart homes.
Researchers have extensively investigated computer vision techniques for pig detection and have obtained relevant results. Li et al. [10] established a relationship between explicit directional templates, brightness ratio templates, and edge features, and used clustering algorithms to automatically select representative templates and images from the training images for matching. The algorithm attained an average detection rate of 86.8%. Li [11] obtained the pig contour image through simple image processing methods and then built the feature vector of individual pigs from the height-to-width ratio of the bounding rectangle and the Fourier descriptor of the image contour. Detection was then performed by calculating the squared Mahalanobis distance to the baseline side view, and the detection accuracy reached 91.7%. These studies are based on traditional object detection algorithms, which are constructed from hand-crafted features and shallow trainable architectures and are characterized by slow detection speed and low detection performance. The current trend is to use deep learning methods for pig detection. Yang et al. [12] used the Faster R-CNN [13] object detection model to detect the body and head positions of pigs, and judged whether feeding behavior occurred based on how long the head position intersected the feeding area. This method achieved a detection accuracy of over 95% and a recall rate exceeding 80%. On this basis, the distance, overlapping area, and crossing angle between two pigs in a single frame were selected as spatial features related to behavior, and XGBoost was then used as a classifier to recognize climbing behavior between pigs. Two-stage pig detection methods have high detection accuracy, but their detection speed is relatively slow, making it difficult to meet the requirements of real-time detection. Moreover, the large model size demands high device performance [14], making such models challenging to port to embedded platforms.
Therefore, an increasing number of scholars have turned their attention to one-stage detection algorithms [15, 16]. Alameer et al. [17] applied the YOLOv2 [18] model to detect drinking and feeding behaviors of pigs. They replaced the original backbone network with ResNet [19] and modified the preset anchor box sizes using the K-medoids clustering algorithm, which greatly improved detection accuracy. Gao et al. [20] introduced time-dimension information into the 2D features of images and built a pig aggression behavior judgment model using 3DConv. The detection accuracy for short-term aggressive behavior was improved by fusing temporal and spatial features extracted at different stages of the network. However, the effect of temporal motion information on recognition is not significant, and the error rate is high when identifying long video segments. Similarly, Khodaverdian et al. [21] and Algarni [22] used advanced technologies for energy reduction and fire detection, respectively, yet highlighted challenges and limitations in real-world application. Psota et al. [23] constructed a pig instance-level detection dataset; the work in [24] also analyzes swine data. They then employed a fully convolutional neural network to detect the position and direction of each pig, and simulated the actual environment under different light intensities, achieving good detection results. However, under occlusion there may be missed or false detections because the head and tail positions of the pigs cannot be obtained accurately. Li et al. [25] built a multi-view pig detection dataset to address the low level of automation in testing pig feeding behavior in the pigsty environment. They added a spatial attention module (SAM) to YOLOv4 to make the model focus more on the target objects in the image, which can effectively detect pig feeding behaviors in side and overhead views.
In summary, some progress has been made in the detection of pigs, but it remains difficult to balance the parameter quantity and accuracy of a detection model. In practical intelligent breeding scenarios, live monitoring of pig conditions requires detection algorithms that are both precise and sufficiently lightweight. Therefore, we propose a lightweight deep learning-based object detection model for efficient pig detection, with the goal of monitoring pig growth effectively.
Materials
Data acquisition
The data was collected at a hog farm in Rongchang, Chongqing, China, where there are multiple pig houses with an area of about 5.41 m × 3.87 m each, and multiple pigs are raised in each house. To enhance the diversity of the collected data, monitoring cameras were installed to monitor pigs in different houses, as shown in Fig. 1. Video monitoring was used to obtain the large amount of image data required for training deep learning models. To ensure that the entire scene of a single pig house could be captured and the actual living conditions of each pig could be clearly observed, the camera was installed on the ceiling directly above the pen at a height of about 2.75 meters. Keyframes were cut from the collected monitoring videos, and images with ghosting or blur were removed. This resulted in a total of 2500 pig images at a resolution of 1920×1280 pixels, as shown in Fig. 2.

Collection scenario.

Collection result.
The positions of the pigs in the image data need to be manually marked before the images can be used to train deep learning object detection models. We used the open-source software LabelImg to manually annotate the selected pig images. To facilitate transfer between different models, the annotations use the commonly used PASCAL VOC [24] dataset format, which records the top-left and bottom-right coordinates of each target along with its category. The annotation interface is shown in Fig. 3.

Image data annotation.
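The annotation format described above can be illustrated with a minimal sketch. It assumes a standard PASCAL VOC XML layout; the file name and box coordinates are invented for illustration, and the conversion to normalized center/width/height values is an added assumption about how such labels are commonly passed to YOLO-style trainers, not a step the paper describes.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation for one image; file name and box values are made up.
VOC_XML = """
<annotation>
  <filename>pig_0001.jpg</filename>
  <size><width>1920</width><height>1280</height><depth>3</depth></size>
  <object>
    <name>pig</name>
    <bndbox><xmin>412</xmin><ymin>305</ymin><xmax>873</xmax><ymax>640</ymax></bndbox>
  </object>
  <object>
    <name>pig</name>
    <bndbox><xmin>1020</xmin><ymin>150</ymin><xmax>1500</xmax><ymax>520</ymax></bndbox>
  </object>
</annotation>
"""

def parse_voc(xml_text):
    """Return a list of (class_name, xmin, ymin, xmax, ymax) tuples."""
    root = ET.fromstring(xml_text.strip())
    boxes = []
    for obj in root.findall("object"):
        name = obj.findtext("name")
        bndbox = obj.find("bndbox")
        coords = [int(bndbox.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")]
        boxes.append((name, *coords))
    return boxes

def voc_to_yolo(box, img_w, img_h):
    """Convert corner coordinates to normalized (cx, cy, w, h)."""
    _, xmin, ymin, xmax, ymax = box
    return (
        (xmin + xmax) / 2 / img_w,
        (ymin + ymax) / 2 / img_h,
        (xmax - xmin) / img_w,
        (ymax - ymin) / img_h,
    )

boxes = parse_voc(VOC_XML)
yolo_boxes = [voc_to_yolo(b, 1920, 1280) for b in boxes]
```

Keeping the corner-coordinate form as the stored format and converting on demand is one way to keep the annotations reusable across detectors that expect different box encodings.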
The number of data samples is, to some extent, positively correlated with the model's generalization ability and directly affects training effectiveness. In reality, pigs exhibit different characteristics in different environments, and detecting pigs in complex and diverse environments is a challenge. To address this, the original pig image data underwent a series of data augmentation operations to obtain more samples: adding Gaussian noise, applying affine transformations, and changing the saturation and brightness of the image. Adding Gaussian noise is a common data augmentation technique. It adds a certain amount of random noise to the image data, simulating disturbances that the image may encounter in the actual environment and thereby increasing the robustness of the model to noise. Affine transformations include image scaling, rotation, and translation. These operations simulate pigs at different angles, sizes, and positions, so that the model can detect pigs accurately under various conditions. Changing the saturation and brightness simulates images of pigs under different lighting conditions, which helps the model adapt to changes in lighting. Examples of these operations are depicted in Fig. 4. Through data augmentation, we obtained 4000 pig data samples from the original pig image data. The resulting dataset is used to improve the generalization ability of the model, so that it can accurately detect pigs in complex and diverse environments.

Example of pig image data enhancement.
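The augmentation operations described above can be sketched in a few lines. This is a simplified illustration, not the pipeline used in the paper: it works on a tiny grayscale image stored as nested lists, covers only noise, brightness, and translation (one special case of an affine transformation), and the noise level, brightness factor, and shift are arbitrary assumed values. Saturation changes need color channels and are omitted.

```python
import random

def add_gaussian_noise(img, sigma=10.0, seed=0):
    """Add zero-mean Gaussian noise to every pixel and clip to [0, 255]."""
    rng = random.Random(seed)
    return [[min(255, max(0, round(p + rng.gauss(0.0, sigma)))) for p in row]
            for row in img]

def adjust_brightness(img, factor):
    """Scale pixel intensities by `factor` and clip to [0, 255]."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in img]

def translate(img, dx, dy, fill=0):
    """Shift the image by (dx, dy) pixels; uncovered pixels get `fill`."""
    h, w = len(img), len(img[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = img[y][x]
    return out

def translate_box(box, dx, dy, width, height):
    """Shift an (xmin, ymin, xmax, ymax) box with the image and clip it to the frame."""
    xmin, ymin, xmax, ymax = box
    return (max(0, xmin + dx), max(0, ymin + dy),
            min(width, xmax + dx), min(height, ymax + dy))

# Toy 3x4 grayscale image with one annotated box (values are illustrative only).
image = [[100] * 4 for _ in range(3)]
box = (1, 1, 3, 3)

noisy = add_gaussian_noise(image, sigma=10.0)
brighter = adjust_brightness(image, 1.5)
shifted = translate(image, dx=1, dy=0)
shifted_box = translate_box(box, 1, 0, width=4, height=3)
```

Geometric operations such as translation must be applied to the bounding boxes as well as the pixels, whereas noise and brightness changes leave the annotations untouched.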
YOLOv5n model
YOLOv5 is a widely used family of detection models. Its structure consists of the backbone feature extraction network (Backbone), the feature enhancement network (Neck), and the detection decoding layer (Head). YOLOv5n is the lightest version in the YOLOv5 series. It differs from the other versions in that its network is narrower, with fewer output feature map channels per layer, resulting in a significantly reduced parameter quantity and computational load. The structure of the YOLOv5n detection model, which is composed of Conv modules and C3 modules, is shown in Fig. 5. The CSPDarkNet network is used in the Backbone for main feature extraction. It consists of multiple residual convolutions. Increasing the depth improves detection accuracy, and skip connections are used internally to mitigate the vanishing gradient problem caused by depth in deep neural networks. In the Neck, the Feature Pyramid Network (FPN) processes the three feature layers extracted by the backbone, effectively fusing feature information from different levels to extract better features. Finally, decoding is performed in the Head to detect the object's position and class.

Structure of YOLOv5n.
Practical pig detection scenarios call for a lightweight detection model. YOLOv5n has a lower computational load and fewer parameters, making it more suitable for pig detection tasks than other detection models. However, because of its smaller network width factor, YOLOv5n is not deep enough and lacks a large receptive field and deep semantic information, resulting in slightly lower detection accuracy [27]. To address the insufficient receptive field of YOLOv5n, this article uses depth-wise separable convolution [28] with large 5×5 kernels to construct a new C3DW structure that replaces the C3 module in YOLOv5n, enlarging the receptive field of the detection model and improving detection accuracy without additional computation or parameters. To address the insufficient extraction of feature information in YOLOv5n, a C3TR module was constructed using the multi-head self-attention of the Vision Transformer [29] and placed at the end of the backbone, which is better suited to detecting larger targets such as pigs. The TR-YOLO pig detection model combines these two improvements on the basis of YOLOv5n, as shown in Fig. 6.

Structure of TR-YOLO.
The C3 module in YOLOv5 primarily uses 3×3 convolution for feature extraction, and the size of the convolution kernel determines the receptive field of the detection network. For the pig detection task in particular, a larger receptive field captures more detailed feature information and yields better detection performance, but directly increasing the convolution kernel size increases the computational load. Therefore, the C3 module in YOLOv5n is optimized with depth-wise separable convolution: two 5×5 depth-wise separable convolutions replace the convolutions responsible for feature extraction in the Bottleneck structure, forming a new C3DW module, as shown in Fig. 7. Replacing the C3 module in YOLOv5n with the C3DW module not only reduces the computational load and the number of parameters but also enlarges the receptive field of the detection model, consequently improving detection accuracy.
Depth-wise separable convolution decomposes the traditional convolution into a depthwise convolution and a 1×1 pointwise convolution, as illustrated in Fig. 8. In the depthwise convolution, each convolution kernel corresponds to one channel, so the number of feature maps produced equals the number of input channels. The pointwise convolution then combines these feature maps along the depth direction with learned weights, producing as many output feature maps as there are pointwise kernels.

Structure of C3DW.

Structure of depth-separable convolution.
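To compare the parameter quantity and computational load of the two convolution methods, assume that the input and output feature map sizes remain unchanged: the input feature map has size M×M×N, the convolution kernel has size K×K×P, where P is taken here to be the number of kernels (output channels), and the convolution stride is 1. The parameter quantity and computational complexity of the standard convolution are then:
$$\mathrm{Params}_{\mathrm{std}} = K \cdot K \cdot N \cdot P, \qquad \mathrm{FLOPs}_{\mathrm{std}} = K \cdot K \cdot N \cdot P \cdot M \cdot M$$
The corresponding parameter quantity and computational complexity of the depth-wise separable convolution are:
$$\mathrm{Params}_{\mathrm{dw}} = K \cdot K \cdot N + N \cdot P, \qquad \mathrm{FLOPs}_{\mathrm{dw}} = (K \cdot K \cdot N + N \cdot P) \cdot M \cdot M$$
The ratio of the two is therefore $\frac{1}{P} + \frac{1}{K^{2}}$ for both parameters and computation.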
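A short numerical check makes this comparison concrete. The sketch below uses the standard textbook counts, K·K·N·P parameters for a standard convolution and K·K·N + N·P for a depth-wise separable one, each multiplied by M·M for the multiply-accumulate count at stride 1. Treating P as the number of output channels, ignoring bias and normalization layers, and the example sizes (M = 40, N = P = 64) are all assumptions made for illustration.

```python
def standard_conv_cost(m, n, p, k):
    """Parameters and multiply-accumulates of a KxK standard convolution.

    m: spatial size of the (square) feature map, n: input channels,
    p: output channels, k: kernel size. Stride 1, size-preserving padding.
    """
    params = k * k * n * p
    macs = params * m * m
    return params, macs

def depthwise_separable_cost(m, n, p, k):
    """Parameters and multiply-accumulates of a KxK depthwise + 1x1 pointwise pair."""
    params = k * k * n + n * p
    macs = params * m * m
    return params, macs

# Assumed example: a 40x40 feature map with 64 input and 64 output channels.
std_3x3 = standard_conv_cost(40, 64, 64, 3)
dw_5x5 = depthwise_separable_cost(40, 64, 64, 5)
ratio = dw_5x5[0] / std_3x3[0]
```

Under these assumptions a 5×5 depth-wise separable convolution uses 5,696 parameters against 36,864 for a 3×3 standard convolution, which illustrates how the kernel can be enlarged while the cost still falls.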
The original YOLOv5n detection model uses convolution operations to extract feature information from the input data but makes little use of global contextual information. In this paper, we employ the multi-head attention of the Transformer module to extract more robust target feature information by learning the relationships between different pixel positions and exploiting the long-range feature dependencies that the Transformer captures. The Encoder layer with Multi-Head Attention in the Transformer module replaces the residual units in the original C3 module to construct the new C3TR module, as shown in Fig. 9.

Structure of C3TR.
In the C3TR module, self-attention allows the model to establish direct connections between elements at different locations and to aggregate these connections in a learnable way. The self-attention calculation uses three matrices: the query matrix Q, the key matrix K, and the value matrix V. Self-attention takes as input a matrix X whose rows are the input vectors x. Q, K, and V are obtained from X through three distinct learned linear transformations. The calculation formula is:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
where $d_k$ is the dimension of the key vectors.
Multi-head self-attention is an extension of self-attention that learns different queries, keys, and values in different attention subspaces. The queries, keys, and values are projected into several subspaces (heads), and self-attention is computed in each of them. The per-head results are then concatenated and passed through a fully connected layer. Multi-head attention allows the model to capture the information in the input sequence more fully, without the limitation of a single self-attention. Each head uses its own query, key, and value projection matrices, and the multi-head attention is computed as follows:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^{O}, \qquad \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$$
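The attention computation can be traced with a small dependency-free sketch. It follows the standard scaled dot-product formulation, softmax(QKᵀ/√d_k)V per head, followed by concatenation and an output projection. The toy input, the two hand-written projection matrices per head, and the identity output projection are illustrative assumptions; the actual C3TR module also contains the remaining Transformer encoder components, which are not shown here.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(r) for r in zip(*a)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [v / total for v in exps]

def attention(q, k, v):
    """Scaled dot-product attention; returns (output, attention weights)."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

def multi_head_attention(x, heads, w_o):
    """heads: list of (W_q, W_k, W_v) projection triples, one per head."""
    outputs = []
    for w_q, w_k, w_v in heads:
        out, _ = attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
        outputs.append(out)
    # Concatenate the per-head outputs token by token, then project.
    concat = [sum((o[i] for o in outputs), []) for i in range(len(x))]
    return matmul(concat, w_o)

# Toy setup: 3 tokens of dimension 4, two heads of dimension 2 (assumed values).
first_half = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
second_half = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
heads = [(first_half,) * 3, (second_half,) * 3]
w_o = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

x = [[1.0, 0.0, 1.0, 0.0], [0.0, 2.0, 0.0, 2.0], [1.0, 1.0, 1.0, 1.0]]
y = multi_head_attention(x, heads, w_o)
```

Because every token attends to every other token, each output row is a weighted mixture of all input positions, which is the global context that purely convolutional layers lack.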
Evaluation indicators
Target detection is more complex than general image classification because it requires the network to output both the location and the class of the target. Simple accuracy metrics from classification do not reflect the quality of detection results, so we design experiments that assess the model's detection capability using Mean Average Precision (mAP), Recall, and Precision.
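How Precision and Recall are obtained for a detector can be shown with a minimal sketch. It assumes boxes in (xmin, ymin, xmax, ymax) form and a greedy matching of predictions to ground-truth boxes in descending confidence order, which is a common convention rather than a procedure stated in the paper. The default IoU threshold of 0.5 and confidence threshold of 0.6 mirror the values the paper uses for its visual comparison; the example boxes are invented. mAP, which averages precision over recall levels and classes, is not implemented here.

```python
def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def precision_recall(preds, gts, iou_thr=0.5, conf_thr=0.6):
    """preds: list of (box, score); gts: list of ground-truth boxes."""
    kept = sorted((p for p in preds if p[1] >= conf_thr), key=lambda p: -p[1])
    matched = set()
    tp = 0
    for box, _ in kept:
        best_j, best_iou = -1, 0.0
        for j, gt in enumerate(gts):
            if j in matched:
                continue  # each ground-truth box may be matched only once
            value = iou(box, gt)
            if value > best_iou:
                best_j, best_iou = j, value
        if best_iou >= iou_thr:
            matched.add(best_j)
            tp += 1
    fp = len(kept) - tp   # confident predictions with no matching pig
    fn = len(gts) - tp    # pigs that no confident prediction covered
    precision = tp / (tp + fp) if kept else 0.0
    recall = tp / (tp + fn) if gts else 0.0
    return precision, recall

# Invented example: two pigs, three confident predictions, one low-confidence one.
gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [((0, 0, 10, 10), 0.9), ((21, 21, 31, 31), 0.8),
         ((50, 50, 60, 60), 0.7), ((0, 0, 9, 9), 0.3)]
precision, recall = precision_recall(preds, gts)
```

In this example both pigs are found and one confident prediction is spurious, so recall is 1.0 and precision is 2/3.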
The experimental environment of this paper is shown in Table 1. During the training of the enhanced model in this study, the network input size is fixed at 640×640 pixels. The training parameters are configured as follows: the initial learning rate is 0.01, the weight decay rate is 0.0005, the batch size is 16, and the Stochastic Gradient Descent optimizer is used for training. Additionally, a warm-up learning strategy is employed to mitigate the impact of weight initialization. Training runs for 150 epochs, and early stopping is enabled as a preventive measure against overfitting.
Experimental environment configuration
To illustrate the viability of each enhanced module within the TR-YOLO model, we conducted ablation experiments. The results of these experiments are presented in Table 2.
Ablation results
As shown in Table 2, the original YOLOv5n detection model, without C3DW and C3TR, serves as the reference for parameter quantity and computational complexity. When the C3 module is replaced with the C3DW module, the number of parameters and the computational load of the model are significantly reduced, because the depth-wise separable design of C3DW is more efficient and lowers both while maintaining the performance of the model. However, this change improved mAP by only 0.5 percentage points, which is not accurate enough for the pig detection task. To further improve accuracy, we added the C3TR module to the YOLOv5n detection model with the C3DW module. Although this slightly increased the number of parameters and the computational complexity, the mAP was 1.6 percentage points higher than that of the original YOLOv5n and 1.1 percentage points higher than that of the YOLOv5n detection model with only the C3DW module. This result supports the effectiveness of the C3TR module for improving the detection model. In summary, by adding the C3DW and C3TR modules to the YOLOv5n detection model, we improved the detection accuracy of the model while maintaining computational efficiency, enabling it to detect pigs more accurately in complex and diverse environments.
To assess the effectiveness of the TR-YOLO detection model, we conducted experiments using the same configuration and compared it with other state-of-the-art detection models described in previous studies. The experimental results are presented in Table 3.
Comparative experimental results
The comparison results in Table 3 indicate that the TR-YOLO model is clearly better than the other models in parameter quantity and computational load, at only 1.87M and 3.9G, respectively. The TR-YOLO model is also the best in recall and mAP@0.5, at 88.9% and 95.7%, which are 2.9% and 2% higher than YOLOV3-Tiny; 6.7%, 6.1%, and 4.9% higher than PICODET_S; and 0.7%, 1.8%, and 0.9% higher than YOLOV7-Tiny. In terms of speed, because of the added Transformer structure, the FPS of the TR-YOLO model is 120, which is lower than that of YOLOV3-Tiny and the 142 FPS of YOLOV7-Tiny but higher than the 104 FPS of PICODET_S, and it fully meets industrial vision requirements. In summary, the TR-YOLO model achieves higher detection accuracy and efficiency with a lower parameter quantity and computation, which can meet actual production requirements.
To present the detection results of the TR-YOLO model more intuitively, the pig detection results of TR-YOLO and the other detection models are compared in Fig. 10, with the IoU threshold uniformly set to 0.5 and the confidence threshold set to 0.6. Both the original YOLOV5n (Fig. 10(a)) and YOLOV3-Tiny (Fig. 10(b)) exhibit missed detections. YOLOV7-Tiny (Fig. 10(c)) and PICODET_S (Fig. 10(d)) misdetect the strong light and other interfering parts of the window as pigs, so their detection results are poor. Only the model in this paper maintains a high confidence level without false or missed detections while also achieving a high detection speed.

Comparison of detection results.
For the pig detection task, we use YOLOv5n as the benchmark model. To address the limited receptive field of YOLOv5n, we replaced the C3 module in YOLOv5n with the C3DW module, which is constructed using depth-wise separable convolution with a large convolution kernel. This modification enlarges the receptive field of the entire detection model. Meanwhile, the C3TR module is constructed by incorporating the multi-head attention of the Transformer structure to address the insufficient extraction of feature information in YOLOv5n. To keep the overall computational load of the model unchanged, the C3TR module is positioned at the very end of the backbone feature extraction network, enhancing the extraction of global feature information related to the pig. This article evaluates the detection model from three aspects: model accuracy, number of parameters, and detection speed. Compared with the original benchmark model YOLOv5n, TR-YOLO improves detection accuracy by 1.6 percentage points without significant loss of speed or additional computational load, and it also achieves a large advantage over other lightweight detection models. Real-time pig detection tasks require both high detection accuracy and fast detection speed. TR-YOLO achieves an average detection accuracy of 95.7% at 120 FPS and is very lightweight, so it can be deployed effectively on embedded terminals for real-time pig detection, providing an effective way to closely monitor pig growth and improve the living environment of pigs.
Acknowledgments
This work received support from the Chongqing Special Financial Funds Project (22520C).
Data availability
Data will be provided upon request.
