Abstract
Occlusion is ubiquitous in real feedlot environments, yet research on cattle face recognition under occlusion is almost non-existent. We therefore design an attention mechanism module with high accuracy and low model complexity and incorporate it into MobileNet so that occluded cattle faces in RGB images captured in a ranch environment can be identified accurately. We also construct a Simmental cattle face image dataset for modeling and method evaluation, which contains 10,239 images of 103 cattle. The experimental results show that when the occluder is in the upper left or lower right corner, Top_1 accuracy exceeds 90% for occlusion rates below 30% and exceeds 80% for occlusion rates below 50%. Even when the occluder covers the middle of the face, where much of the important information lies, Top_1 accuracy still exceeds 80% at an occlusion rate of 40%. Furthermore, compared with MobileNet, the proposed model has the same number of parameters and the same model size, at the cost of only a slight increase in computation. The proposed model is therefore suitable for deployment on embedded systems in the future.
Introduction
In recent years, animal biometrics has become a popular and promising research field [1–3]. Within it, the identification of cattle plays an important role in registration and traceability [4–6]. With the widespread and successful use of biometric applications over the past few years, cattle face recognition has received considerable attention [7]. For example, references [8–10] combine traditional feature extraction operators with image processing techniques to recognize cattle faces. These methods have not been widely adopted because they rely heavily on manually designed components, require a complicated parameter adjustment process, and show poor generalizability and robustness. With the development of deep learning [11, 12], convolutional neural networks (CNNs) have made huge breakthroughs in many fields. Reference [13] shows that CNNs are a feasible solution to animal biometrics problems. Accordingly, references [14–16] study cattle face recognition based on CNNs, which does not depend on manually designed components and achieves high recognition accuracy. However, these studies all address cattle face recognition under constrained conditions, whereas cattle face images collected in a real feedlot environment are unconstrained, so occlusion is inevitable. We therefore evaluate how well traditional CNNs recognize occluded images on the dataset we built. The experimental results are shown in Fig. 1. The results in blue show that the recognition accuracy of traditional CNNs on frontal unoccluded cattle face images is above 90%; the results in green show the recognition accuracy of the same CNNs after 40% of the cattle face is occluded, where the accuracy of every network is below 84%.
The results show that traditional CNNs extract features well from unoccluded data, but their recognition of occluded data urgently needs improvement. The figure also shows that MobileNet [17] achieves good recognition accuracy under occlusion compared with the other models and has very few parameters, which is useful for deployment on embedded systems.

Fig. 1. Accuracy of traditional convolutional neural networks for cattle face recognition with and without occlusion.
At present, research on image recognition under occlusion focuses on human face recognition [18–23], and there are no reports on cattle face recognition under occlusion. We therefore study cattle face recognition under occlusion by combining the unique characteristics of cattle with the results and methods of occluded human face recognition. One such method uses an attention mechanism to distinguish the occluded part from the unoccluded part, so that the network focuses on and extracts the feature information of the unoccluded part, and various methods have been proposed to integrate attention mechanisms into CNNs [21–23]. Building on this research, we design a channel attention mechanism module with high accuracy and low model complexity. Specifically, we introduce Lp Pooling, a more complex spatial aggregation strategy, and use the lightweight Efficient Channel Attention (ECA) module to reduce the overall number of network parameters. Fig. 1 shows that MobileNet is lightweight and achieves high recognition accuracy on occluded images, so it is selected as the baseline network for the occluded cattle face recognition model. The accuracy of cattle face recognition under occlusion is verified on the dataset we built. The experimental results show that the attention mechanism module we designed improves the accuracy of cattle face recognition under occlusion.
Our main contributions are summarized as follows. (1) Dataset. Since there is no public standard face dataset of Simmental beef cattle, we collected and screened data in a real feedlot environment. We designed a YOLOv3-based system that automatically filters background noise to generate the dataset for the classification task, and on this basis used Python to add occlusion automatically. (2) Architecture of the attention mechanism. After analyzing the traditional channel attention architecture, we introduce Lp Pooling, a more complex spatial aggregation strategy, to improve the representation ability of the global receptive field. The traditional channel excitation is implemented by a fully connected bottleneck structure that carries a large number of trainable parameters, so we replace the fully connected structure with ECA-Net, a lighter module. (3) Improved accuracy of cattle face recognition under occlusion. The attention mechanism module we designed is plugged into MobileNet. Experiments on the occluded cattle face dataset show that the model distinguishes between occluded and unoccluded parts and identifies the cattle from the features of the unoccluded part.
The rest of this paper is organized as follows. Section 2 describes the process and details of dataset production. Section 3 explains our attention mechanism architecture. Section 4 presents the experimental configuration and analyzes the experimental results. Section 5 summarizes the paper.
This section introduces the acquisition of the Simmental beef cattle dataset, the extraction of key frames, the filtering of background noise and the addition of occlusion. Figure 2 shows the flow chart of dataset production.

Fig. 2. Production process of the cattle face dataset.
The research object of this paper is Simmental beef cattle, which is the second largest cattle breed in the world after Holstein and the largest beef cattle breed in the world. The video data were collected by members of our group during two visits to cattle farms: videos of 48 cattle were collected at the Shengyuan beef cattle farm in Hebei, China, from July to August 2018, and videos of another 55 cattle were collected in Kulun Banner, Tongliao, Inner Mongolia Autonomous Region, China, from August to September 2020. All the cattle were kept in a large open-air environment surrounded by fences. We entered the farm with a Sony camera mounted on a handheld mobile bracket and recorded the cattle at close range. The image resolution was 1920×1080, and the frame rate was set to 50 fps to reduce the motion blur caused by the movement of the cattle. Shooting took place on sunny days without fog or haze, between 8:00 and 16:00. In total, 295 videos of 103 cattle were shot. Each video was about 2-4 min long and saved in mp4 format. Our raw and annotated datasets have been shared at https://github.com/Oliver6999/cattle_dataset/tree/main.
After obtaining the cattle face videos with background noise, we used the Free Video to JPG Converter software to decompose each video into 1000 frames. The extracted frames contain a large amount of invalid data: some frames contain only background, in some the deflection angle of the cattle face is too large to recognize any valid facial information, in some the face is entirely occluded by other cattle, and adjacent frames are very similar. To address these problems, we combine manual selection with algorithmic filtering. Manual selection handles the first three problems, and the manually selected images are then passed to an algorithm based on Structural Similarity (SSIM) [24] that filters out highly similar adjacent images. The filtering algorithm is as follows:
Where the value of Th is set to 0.32. The number of raw images of the 103 cattle is 68260, and the number of images after manual selection and algorithmic filtering is 10239, so about 85% of the images are discarded. Each cattle has 100 images, except for four cattle that have 77, 92, 91, and 79 images.
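The SSIM-based filtering step can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that a frame is kept only when its SSIM with the most recently kept frame falls below Th, and it computes SSIM over the whole grayscale image as a single window rather than with the sliding window of [24]. The function names global_ssim and filter_similar are hypothetical.

```python
def global_ssim(x, y, data_range=255.0):
    """SSIM of two equally sized grayscale images given as flat pixel lists.

    Simplification (assumption): the whole image is treated as one window.
    """
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * data_range) ** 2  # standard SSIM stabilising constants
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def filter_similar(frames, th=0.32):
    """Keep a frame only if it differs enough from the last kept frame.

    Assumption: "differs enough" means SSIM < th, with th = 0.32 as in the text.
    """
    kept = []
    for frame in frames:
        if not kept or global_ssim(kept[-1], frame) < th:
            kept.append(frame)
    return kept


if __name__ == "__main__":
    a = [0, 50, 100, 150, 200, 250]
    b = list(a)            # duplicate of a, should be filtered out
    c = list(reversed(a))  # very different structure, should be kept
    print(len(filter_similar([a, b, c])))
```

Comparing each frame with the last kept frame, rather than with its immediate predecessor, prevents a slow drift of nearly identical frames from all being retained; whether the original algorithm does this is not stated in the text.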
The images that remain after manual selection and algorithmic filtering still contain background noise, which does not match the form of dataset required by the classification task [25, 26]. The traditional way of filtering background noise is manual cropping, which is both time-consuming and inaccurate. We therefore design an automatic image cropping algorithm based on YOLOv3 [27] to generate the classification dataset automatically; it is fast and accurate. The algorithm flow chart is shown in Fig. 3:

Fig. 3. Background noise filtering flowchart.
For training, 1/5 of all the data were randomly selected and labeled for the YOLOv3 model. The K-Means algorithm is replaced by K-Means++ [28] to improve the accuracy of anchor box clustering, and the DarkNet-53 feature extraction network is replaced by MobileNet to improve detection speed. The labeled data are then fed into YOLOv3 for training and the weights are saved. For testing, all the data are loaded into the YOLOv3 model with the weights learned during training. The traditional testing process only detects the target and draws a rectangular box around it on the raw image. We therefore embed an image cropping module on top of it: the coordinates of the upper left and lower right corners of the rectangular box are passed to a cropping function, and automatic cropping is carried out in a loop.
Because neither the occluded area nor the occlusion rate can be controlled for occluded cattle face images collected in a real feedlot environment, such images do not satisfy the most basic requirement of the experiment (controlled variables). Therefore, in this section the most direct image compositing method [29] is used to apply an occluder to the frontal unoccluded cattle face dataset. The raw cattle picture is resized to M*M using bilinear interpolation [30]. The size of the occluder is M*M*ratio, where ratio is the fraction of the raw cattle picture that is occluded, and the position of the occluder is obtained by manually setting its coordinates. The specific operations are shown in Fig. 4. The coordinates of the upper left corner of the occluder (Lw and Lh) are set manually, where Lw and Lh are in the range 0 to M. The coordinates of the lower right corner then follow from the occluder size M*M*ratio. This operation is implemented in Python, and a loop is added to avoid the time spent on manual cutting and pasting.
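The cropping module can be sketched as follows. This is only an illustration under stated assumptions: the detector output is taken to be a list of (x1, y1, x2, y2) pixel boxes, the image is a nested list of rows, and crop_boxes is a hypothetical name; the YOLOv3 detection itself is not reproduced here.

```python
def crop_boxes(image, boxes):
    """Crop every detected box out of an image given as a list of pixel rows.

    Each box is (x1, y1, x2, y2): upper left and lower right corners in pixels.
    Boxes are clamped to the image, and empty boxes are skipped.
    """
    height = len(image)
    width = len(image[0])
    crops = []
    for x1, y1, x2, y2 in boxes:
        x1, x2 = max(0, x1), min(width, x2)
        y1, y2 = max(0, y1), min(height, y2)
        if x2 > x1 and y2 > y1:
            crops.append([row[x1:x2] for row in image[y1:y2]])
    return crops


if __name__ == "__main__":
    # Toy 4x5 "image" whose pixel value encodes its position (10 * row + column).
    image = [[10 * r + c for c in range(5)] for r in range(4)]
    for crop in crop_boxes(image, [(1, 1, 3, 3), (3, 2, 10, 10)]):
        print(crop)
```

In a real pipeline the same loop would run over every frame and write each crop to the folder of the corresponding cattle, which is what removes the manual cropping step.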

Fig. 4. Representation of the occluded cattle face image.
For the specific parameter settings, the raw cattle face image is normalized to 64*64*3. Since a ratio greater than 0.6 is meaningless, the ratio ranges from 0 to 0.6 when calculating the size of the occluder, and occluders of different proportions are added to the upper left corner, the middle and the lower right corner of the raw cattle face picture.
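The occluder placement described above can be sketched in Python. The sketch assumes a square occluder whose area is M*M*ratio, so its side is M*sqrt(ratio); the text gives only the area, so the square shape is an assumption, as are the function names occluder_box and apply_occluder.

```python
import math


def occluder_box(m, ratio, lw, lh):
    """Return (left, top, right, bottom) of the occluder on an m x m image.

    Assumption: the occluder is square with area m * m * ratio.
    (lw, lh) is the manually chosen upper left corner; the box is clipped to
    the image, so a corner too close to the border reduces the covered area.
    """
    side = round(m * math.sqrt(ratio))
    right = min(m, lw + side)
    bottom = min(m, lh + side)
    return lw, lh, right, bottom


def apply_occluder(image, occluder, box):
    """Paste the occluder pixels into a copy of the image inside the box."""
    left, top, right, bottom = box
    out = [row[:] for row in image]
    for r in range(top, bottom):
        for c in range(left, right):
            out[r][c] = occluder[r - top][c - left]
    return out


if __name__ == "__main__":
    m = 64
    for ratio in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6):
        side = round(m * math.sqrt(ratio))
        centre = (m - side) // 2
        print(ratio, occluder_box(m, ratio, 0, 0),
              occluder_box(m, ratio, centre, centre),
              occluder_box(m, ratio, m - side, m - side))
```

The loop in the usage example mirrors the three positions used in the experiments (upper left, middle, lower right) for each occlusion ratio.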
The convolution kernel, the core of CNNs, is usually regarded as fusing spatial information and channel information within the local receptive field. When channel information is fused, every channel contributes equally. Motivated by this, Hu et al. [21] proposed the Squeeze-and-Excitation network (SENet) to explicitly model the interdependence between channels. Specifically, SENet is applied to the feature map produced by the convolution operation to obtain the importance of each channel. According to this importance, the essential features are emphasized and the features that are useless for the current task are suppressed. This is similar to the brain signal processing mechanism of the human visual system, which selects and focuses on some salient areas after a series of partial scans, while ignoring the content of other areas [31].
The goal of cattle face recognition under partial occlusion is for the computer to learn to ignore the feature information of the occluded part of a picture and extract the feature information of the remaining part, which is then sent to the top layer of the CNN for classification and recognition. Therefore, in this paper a novel attention mechanism module is designed and plugged into MobileNet to further improve the model's ability to distinguish between occluded and unoccluded parts. The module mainly includes two parts: squeeze and excitation.
Experimental results of squeeze modules composed of different pooling methods on the frontal unoccluded cattle face dataset
Wang et al. [35] proposed the ECA module for deep CNNs, which avoids dimensionality reduction and efficiently captures cross-channel interaction, aiming to balance performance and model complexity. The weight of each channel is calculated by considering only the interaction between the i-th channel and its k neighbors. The calculation formula is shown as follows.
Where $y_i$ represents the i-th input channel,
Although the optimal kernel size K can be tuned manually, manual cross-validation costs a lot of computing resources. Further analysis shows that there is a mapping between K and the number of channels C. Since the number of channels is usually a power of 2, an exponential function with base 2 is used to express the mapping between K and C, and the authors propose a method for adaptively determining the kernel size. Its calculation formula is shown as follows. Where the constants γ and b are set to 2 and 1 according to the experiment.
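As an illustration, the adaptive rule can be written in a few lines of Python. The sketch follows the mapping published with ECA-Net, in which K is the odd integer nearest to log2(C)/γ + b/γ, obtained by truncating and then moving even values up by one; the function name eca_kernel_size is hypothetical.

```python
import math


def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive 1D kernel size for a layer with the given number of channels.

    Uses gamma = 2 and b = 1 as stated in the text; the result is forced odd.
    """
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 else t + 1


if __name__ == "__main__":
    for c in (32, 64, 128, 256, 512, 1024):
        print(c, eca_kernel_size(c))
```

With these constants the kernel size stays between 3 and 5 for channel counts from 32 to 1024, which is why the excitation step adds so few trainable parameters.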
Therefore, the feature map is computed by the novel attention mechanism module as illustrated in formula 5, where ⊗ denotes element-wise multiplication and ECA is calculated by formulas 3 and 4.
Based on the above analysis, the architecture of the attention mechanism module in this paper is shown in Fig. 5. The squeeze part connects Max Pool, Avg Pool and Lp Pool in parallel and aggregates them to obtain more effective features. Among them, Lp Pool is used in an attention mechanism architecture for the first time and achieves surprisingly good results. The squeeze output is then passed to the lightweight ECA-Net to obtain the activation value of each channel, which reduces the number of parameters compared with SENet.
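A minimal pure-Python sketch of such a module is given below to make the data flow concrete. Several details are assumptions rather than the authors' design: the three pooled descriptors are aggregated by summation, Lp pooling uses p = 2 with mean normalisation, and the 1D convolution across channels uses zero padding with weights supplied by the caller. The names squeeze, eca_weights and channel_attention are hypothetical.

```python
import math


def squeeze(channel, p=2):
    """Global descriptor of one H x W channel: avg + max + Lp pooling.

    Assumptions: the three results are summed, and Lp pooling is the p-th root
    of the mean of |x|^p.
    """
    values = [v for row in channel for v in row]
    avg = sum(values) / len(values)
    mx = max(values)
    lp = (sum(abs(v) ** p for v in values) / len(values)) ** (1.0 / p)
    return avg + mx + lp


def eca_weights(descriptor, kernel):
    """ECA-style excitation: 1D convolution over neighbouring channels, then sigmoid."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(descriptor) + [0.0] * pad
    weights = []
    for i in range(len(descriptor)):
        s = sum(kernel[j] * padded[i + j] for j in range(k))
        weights.append(1.0 / (1.0 + math.exp(-s)))
    return weights


def channel_attention(feature_map, kernel, p=2):
    """Rescale every channel of a C x H x W feature map by its attention weight."""
    descriptor = [squeeze(ch, p) for ch in feature_map]
    weights = eca_weights(descriptor, kernel)
    return [[[w * v for v in row] for row in ch]
            for ch, w in zip(feature_map, weights)]


if __name__ == "__main__":
    fmap = [[[0.0, 0.0], [0.0, 0.0]],
            [[1.0, 1.0], [1.0, 1.0]],
            [[2.0, 2.0], [2.0, 2.0]]]
    print(channel_attention(fmap, kernel=[0.0, 1.0, 0.0]))
```

Because the kernel is shared across all channels, the excitation step has only as many trainable weights as the kernel has entries, independent of the number of channels.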

Fig. 5. Architecture diagram of the novel attention mechanism module.
In this section, an ablation analysis is first conducted to verify the feasibility and effectiveness of the algorithm, with MobileNet adopted as the basic network architecture. The attention mechanism module we designed is plugged into MobileNet, and the architecture of the proposed network is shown in Fig. 6. Since channel attention is composed of squeeze and excitation, we first search for an effective way to compose the squeeze module and then consider how to reduce the excess parameters of the traditional excitation.

Fig. 6. Architecture of the proposed network for cattle face recognition under partial occlusion.
To evaluate the performance of the classification network, we use common performance measures: accuracy (Top_1 and Top_5), number of parameters, model size, MFLOPs and average test time. Among them, parameters and MFLOPs are our focus because the model is intended to be deployed on embedded systems in the future. Top_1 is the number of correctly identified images as a percentage of the testing dataset. Top_5 is the number of images whose correct label is among the five highest classification probabilities, as a percentage of the testing dataset. The model size is the storage space occupied by the model and its parameters. The average test time is the average time taken to identify one cattle face image. The parameters and MFLOPs are calculated by formulas 6 and 7 to represent the complexity of the model.
Where $C_0$ represents the number of output channels, $C_i$ represents the number of input channels, $K$ represents the size of the convolution kernel, and $H$ and $W$ represent the height and width of the output feature map.
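For orientation, the sketch below counts parameters and operations for a single standard convolution layer using the symbols just defined. It assumes the common convention of K*K*C_i*C_0 weights (bias ignored) and one multiply-accumulate per weight per output position; the paper's own formulas 6 and 7 may differ in such details. The excitation comparison uses the usual SE bottleneck count of 2*C*C/r with r = 16, which is also an assumption, and all function names are hypothetical.

```python
def conv_params(c_in, c_out, k):
    """Weights of a standard K x K convolution, bias ignored (assumption)."""
    return k * k * c_in * c_out


def conv_mflops(c_in, c_out, k, h, w):
    """Millions of multiply-accumulates for an H x W output feature map."""
    return h * w * conv_params(c_in, c_out, k) / 1e6


def se_excitation_params(channels, reduction=16):
    """Two fully connected layers of an SE bottleneck, bias ignored."""
    return 2 * channels * channels // reduction


def eca_excitation_params(kernel_size):
    """One shared 1D kernel, regardless of the number of channels."""
    return kernel_size


if __name__ == "__main__":
    print(conv_params(3, 32, 3), conv_mflops(3, 32, 3, 32, 32))
    print(se_excitation_params(512), eca_excitation_params(5))
```

The last line of the usage example shows why replacing the fully connected bottleneck with a 1D kernel leaves the parameter count of the host network essentially unchanged.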
All experiments in this paper are modeled and trained on the hardware platform shown in Table 1, and all models and ablation analyses are trained with the following parameters and hyperparameters. The 10239 cattle face images in the dataset are randomly shuffled and divided into training, validation and testing sets in the ratio 6:2:2. The input image size is reduced to 64×64×3, batch_size is 32, the learning rate is 0.001, the loss function is CrossEntropyLoss, the optimizer is Adam [36], and the number of epochs is set to 600.
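The 6:2:2 split can be reproduced with a short Python sketch. The fixed seed and the decision to give the rounding remainder to the testing set are assumptions made for illustration, and split_indices is a hypothetical name.

```python
import random


def split_indices(n, seed=0):
    """Shuffle indices 0..n-1 and split them 6:2:2 into train, val and test.

    Integer arithmetic avoids floating-point rounding; any remainder goes to
    the testing set (assumption).
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = n * 6 // 10
    n_val = n * 2 // 10
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


if __name__ == "__main__":
    train, val, test = split_indices(10239)
    print(len(train), len(val), len(test))
```

For the 10239 images this yields 6143 training, 2047 validation and 2049 testing images under the stated rounding convention.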
Table 1. Hardware experiment platform configuration
Experimental results of different excitation methods on the frontal unoccluded cattle face dataset
This section tests the robustness of our proposed method for cattle face recognition under partial occlusion. Since mutual occlusion between the faces of two cattle is more frequent in the actual environment, the face of another cattle is chosen as the occluder. To verify the robustness of the algorithm to interference, we place occluders of different sizes in the upper left corner, the middle and the lower right corner of the cattle face picture. Figure 7 shows occlusion rates from 0 to 60% applied to the different parts of a cattle face.

Fig. 7. Images of a cattle face under different occlusion rates and positions.
MobileNet is used as the baseline network, and the parameters and hyperparameters are the same as in Section 4.2. The results on the test samples are shown in Table 4. They show that, at each occlusion rate, the recognition accuracy is almost the same whether the upper left corner or the lower right corner of the cattle face image is occluded. When the occlusion rate is less than 30%, Top_1 exceeds 90%; when the occlusion rate is less than 50%, Top_1 exceeds 80%; when the occlusion rate is more than 60%, the accuracy drops sharply. Fig. 7 shows that at an occlusion rate of 60% most of the facial features of the cattle are missing, and it is difficult to achieve high recognition accuracy from the remaining facial information alone. Because an occluder in the middle covers more facial information, the recognition accuracy there is much lower than when the occluder is in the upper left or lower right corner. For occlusion in the middle, when the occlusion rate is less than 20%, Top_1 exceeds 90%; when the occlusion rate is less than 40%, Top_1 exceeds 80%; when the occlusion rate exceeds 50%, the accuracy drops sharply. This can be verified from Fig. 8. When the occlusion rate is lower than 40%, the training and validation curves fit well and basically overlap; as the occlusion rate increases further, the fit of the two curves deteriorates and Top_1 declines. These results show that our proposed method performs well on the occluded cattle face dataset and can meet the needs of everyday cattle face recognition under partial occlusion.
Table 4. Cattle face recognition results under different occlusion rates and positions. Left, Right and Center denote occlusion of the upper left corner, the lower right corner and the middle of the cattle face image, respectively.

Fig. 8. Convergence curves of training and validation accuracy under different occlusion rates when the occluder is in the middle of the cattle face picture.
Occlusion is ubiquitous in real feedlot environments, where existing models perform poorly. We therefore study cattle face recognition under occlusion by combining the unique features of cattle with an attention mechanism. In this paper, an attention mechanism module with high accuracy and low model complexity is designed and plugged into MobileNet to improve cattle face recognition under occlusion. Experiments show that when the occlusion rate is less than 20%, the accuracy of our proposed method is more than 90%; when the occlusion rate is less than 40%, the accuracy is more than 80%. Moreover, compared with MobileNet, neither the number of parameters nor the model size increases, at the cost of only a small increase in computation. The experimental results of the traditional CNNs and our proposed method on the dataset in which 40% of the cattle face image is occluded are shown in Table 5. They show that the model we designed can distinguish the occluded part from the unoccluded part of the image and extract the feature information of the unoccluded part for recognition and classification. Our model is therefore suitable for deployment on embedded systems used in real feedlot environments. Furthermore, the proposed method can be applied to recognize other animals under occlusion, not only by face but also by back, side and so on.
Table 5. Results of our proposed method and traditional CNNs on the cattle face dataset with an occlusion rate of 40%
Acknowledgments
This work was supported by the National Natural Science Foundation project (61640011) and the Natural Science Foundation of Inner Mongolia (2016MS0617).
