Abstract
YTNR (Yunnan Tongbiguan Nature Reserve) is located in the westernmost part of China’s tropical regions and is the only area in China with the tropical biota of the Irrawaddy River system. The reserve has abundant tropical flora and fauna resources. To realize real-time detection of wild animals in this area, this paper proposes an improved YOLO (You Only Look Once) network. The original YOLO model achieves high detection accuracy, but because of its complex structure it cannot reach a fast detection speed on a CPU platform. Therefore, the lightweight network MobileNet is introduced to replace the backbone feature extraction network in YOLO, which enables real-time detection on the CPU platform. In response to the difficulty of collecting wild animal image data, the research team deployed 50 high-definition cameras in the study area and conducted continuous observations for more than 1,000 hours. In the end, this research uses 1410 wildlife images collected in the field and 1577 wildlife images from the Internet, combined with manual annotation by domain experts, to construct a research dataset. At the same time, transfer learning is introduced to address the shortage of training data and the resulting difficulty of fitting the network. The experimental results show that our model, trained on a training set containing 2419 animal images, has a mean average precision of 93.6% and an FPS (Frames Per Second) of 3.8 on the CPU. Compared with YOLO, the mean average precision is increased by 7.7% and the FPS is increased by 3.
Introduction
Biological resources are an organic part of natural resources and the natural basis for human survival and development, and wild animal resources are an important part of biological resources. YTNR (Yunnan Tongbiguan Nature Reserve) [1] is located in the westernmost part of the tropical region of China, bordering Myanmar and close to East Assam in India. As it was not affected by the Quaternary glaciation [2], the YTNR has become a refuge for many ancient biological groups, and many ancient, primitive animal and plant species have been preserved there. It is the most concentrated and typical area of China’s Indo-Burma [3] tropical biogeographical resources. It is also the only area in China with the tropical biota of the Irrawaddy River [4] system. The reserve contains iconic tropical rainforest vegetation such as Shorea assamica and Dipterocarpus retusus [5], as well as tropical animals such as the Skywalker hoolock gibbon [6] and hornbills. Protecting the animal and plant resources of this area and their habitats is of great significance to the rescue and development of endangered species and the maintenance of ecological balance. The primary task in protecting wild animals is to monitor them in real time, but this is very time-consuming and laborious. Applying the research results of object detection to the monitoring of wild animals can reduce the related workload and lay the foundation for subsequent wildlife protection work.
In recent years, researchers have carried out a series of studies on classification, identification and detection in the field of wildlife. Sean Ward et al. [7] described a system that uses low-cost drones, predictive models, computer vision, and thermal imaging to autonomously detect wild animals. The system is effective, low-cost and independent, and it can use thermal sensors to detect animals. However, its thermal imaging has low resolution and cannot distinguish between many different animals. Luo Wei et al. [8] used the deep learning model Mask R-CNN to detect and locate various large herbivores in aerial images. By extracting the mask generated in the detection process of Mask R-CNN, the contour vector of each animal is obtained, and the population size and distribution of various large herbivores are then estimated. Anh Nguyen et al. [9] tested the capabilities of deep neural networks (DNNs). Their method automatically extracts image information from the SS dataset, the largest existing labelled wild animal dataset, which can save a lot of time for biological researchers and volunteers. For images containing multiple species, however, the deep neural network provides only one species label.
At present, object detection algorithms based on deep learning are the mainstream. Existing research results include two-stage object detection algorithms, namely R-CNN [10], Fast R-CNN [11], Faster R-CNN [12] and Mask R-CNN [13], and single-stage object detection algorithms, namely SSD (single-shot multi-box detector) [14] and the YOLO (You Only Look Once) series. Each algorithm has its own advantages and disadvantages. Considering that wildlife monitoring mostly relies on video surveillance equipment that usually runs on a CPU platform, this paper makes some improvements based on the single-stage object detection algorithm YOLO.
The main contributions of this paper are as follows. (1) To address the slow detection speed of YOLO on non-GPU platforms, the MobileNet-Yolo model is proposed. Its lightweight network greatly accelerates wildlife detection and improves the feasibility of deploying the algorithm at the edge. (2) The Tongbiguan Wildlife Data Set is collected and constructed, combining field observation data with public data from the Internet. It can provide a reference for the training and testing of models in such studies. Finally, a transfer learning strategy is used to train the model. A comparison of the model test results shows that the proposed method can greatly improve the detection speed while ensuring detection accuracy.
The paper is organized as follows: Section 2 presents the related models used in the proposed method; Section 3 describes the details of the MobileNet-Yolo based wildlife detection method; Section 4 presents the experimental results and discussion; finally, Section 5 gives the conclusion and recommendations.
Related research models
YOLOv4
YOLOv4 [15] is a single-stage object detection model that trades off speed and accuracy well. It adds many small improvements on top of YOLOv3 [16]. For example, the Mosaic [17] data enhancement method is added in the data processing stage, the backbone feature extraction network is changed from Darknet53 to CSPDarknet53 by modifying the residual module, SPP (Spatial Pyramid Pooling) [18] and PANet (Path Aggregation Network) [19] are combined in the feature pyramid part, and CIoU (Complete Intersection over Union) [20] is incorporated into the loss function. However, the overall detection idea of YOLOv4 is not much different from that of YOLOv3: both use feature layers at three different scales (large, medium and small) for classification and regression detection of target objects of different sizes.
Specifically, the Mosaic data enhancement algorithm mentioned above stitches four original images together into a new image that is passed into the model for training, enriching the background of the target object. A new residual structure, CSPNet (Cross Stage Partial Network) [21], is introduced into the backbone feature extraction network CSPDarknet53. CSPNet splits the residual structure of YOLOv3 into two parts. The main part continues to stack residual blocks, while the other part acts as a large residual edge that is combined directly with the output of the main part after a small amount of processing, with the aim of extracting more feature information. The SPP structure applies max pooling at four different scales (pooling kernel sizes of 13×13, 9×9, 5×5 and 1×1) to the last feature layer of CSPDarknet53 after three convolutions, which enlarges the receptive field and separates out the most significant contextual features. PANet is characterized by repeated extraction of features. Incorporating CIoU into the loss function takes into account the distance between the target and the anchor, the overlap rate, the scale, and a penalty term, making the target box regression more stable.
MobileNets
MobileNets [22] is a family of lightweight neural networks proposed by Google for mobile and embedded devices. The main feature of the network structure is the use of DSC (Depthwise Separable Convolution) [23]. DSC changes how the convolution is computed so that fewer parameters are needed, which reduces the complexity of the model and speeds up detection. It divides the standard convolution into two parts: a depthwise convolution and a pointwise convolution. For example, suppose a 3×3 convolutional kernel is used to process input data with 3 channels and produce 6 output channels. A standard convolution uses six 3×3 convolution kernels, each of which traverses all three input channels, to compute the six output channels, so the number of parameters is 3×6×3×3 = 162. The DSC first uses three 3×3 convolution kernels, one per input channel, to obtain three feature maps, and then uses six 1×1 convolution kernels to traverse these three feature maps and obtain six output channels, so the number of parameters is 3×3×3 + 3×6×1×1 = 45. Changing the way the convolution is computed therefore reduces the number of parameters significantly, here to less than a third of the original.
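The parameter arithmetic in the example above can be checked with a short script. This is only an illustrative sketch: the function names are hypothetical and bias terms are ignored, as they are in the worked example.

```python
def standard_conv_params(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Weights in a standard convolution (bias terms ignored)."""
    return in_channels * out_channels * kernel_size * kernel_size


def depthwise_separable_params(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Weights in a depthwise convolution followed by a 1x1 pointwise convolution."""
    depthwise = in_channels * kernel_size * kernel_size  # one k x k filter per input channel
    pointwise = in_channels * out_channels               # 1 x 1 filters that mix the channels
    return depthwise + pointwise


# The example from the text: 3 input channels, 6 output channels, 3x3 kernel.
standard = standard_conv_params(3, 6, 3)
separable = depthwise_separable_params(3, 6, 3)
print(standard, separable, round(separable / standard, 3))  # prints: 162 45 0.278
```

The saving grows with the number of output channels, because the k×k cost is paid once per input channel rather than once per input-output channel pair.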
Transfer learning
Transfer learning [24] is a machine learning method in which weights already trained on a given network structure are migrated to a new network with the same structure, where training then continues on new data. It is difficult to collect a large amount of training data for most object detection tasks, and this is where transfer learning comes into play. Starting from weights pre-trained on large datasets such as ImageNet [25], COCO [26], and VOC [27] not only eases the difficulty of obtaining an accurate, well-generalized model from little training data but also speeds up training and saves training time.
MobileNet-Yolo based wildlife detection method
In the actual monitoring of wildlife, staff usually install remote infrared and video monitoring equipment in areas where animals are often active. These devices usually run on CPU platforms and do not have the parallel accelerated computing capabilities of GPUs. Because YOLO is slow to compute on a CPU platform, it is difficult to achieve real-time wildlife detection with it. Therefore, this paper proposes the MobileNet-Yolo model, which further improves YOLOv4 by replacing the backbone feature extraction network CSPDarknet53 with the lightweight neural network MobileNet while retaining the Mosaic data augmentation and the original feature pyramid network.
MobileNetv2 [28] is an enhanced version of MobileNetv1 that adds the inverted residual with linear bottleneck to the DSC in order to obtain higher accuracy with fewer parameters. As shown in Fig. 1, the inverted residual structure first expands the channel dimension using a 1×1 convolution, then extracts features using a 3×3 convolution, and finally reduces the channel dimension again using a 1×1 convolution, with the residual edge connecting the input and output directly.

The inverted residual structure.
Although MobileNetv3 [29] is the most complete among the currently released versions of MobileNet and is an improvement on the previous two versions, it is difficult to make a good trade-off between model accuracy and model size. MobileNetv3 has two sub-versions, large and small. The number of parameters for several versions of MobileNet is shown in Table 1.
The number of parameters for several versions of MobileNet
Table 1 shows that MobileNetv3-small has the smallest number of parameters, followed by MobileNetv2, and that MobileNetv3-large has the largest number of parameters. As a trade-off, MobileNetv2 is therefore chosen as the replacement backbone network in YOLOv4. The structure of our model is shown in Fig. 2.

Model structure of MobileNetv2-Yolov4. There are four parts: Mobilenetv2, SPP, PANet and Head. Mobilenetv2 acts as the Backbone of the network. SPP and PANet form the main part of a feature pyramid network. Head performs prediction work.
The whole network structure of YOLOv4 has three parts: Backbone, Neck and Head. The Backbone performs preliminary feature extraction and obtains effective feature layers at three scales. The Neck performs deeper feature extraction to obtain more effective feature layers at the same three scales. The Head uses these more effective feature layers to make predictions. To substitute MobileNetv2 into YOLOv4, the three preliminary effective feature layers are taken from MobileNetv2 and passed on to the Neck for enhanced feature extraction.
After the input image is passed into the MobileNetv2-YOLOv4 network, the network first needs to locate the target animal and then determine which species the animal belongs to. The detailed process is as follows.
Training samples are input, and the pre-clustered anchors are adjusted towards the ground-truth bounding boxes of the targets. This adjustment is driven by the box regression loss, the classification loss, and the confidence loss. The loss function is shown in Equation (1).
MobileNetv2-YOLOv4 uses the same loss function as YOLOv4, in which CIoU replaces MSE (Mean-Square Error) for box regression. CIoU takes into account the distance between the target and the anchor, the overlap rate, the scale, and a penalty term, making the target box regression more stable. The penalty term accounts for how well the aspect ratio of the prediction box fits that of the target box. The CIoU is calculated as shown in Equation (2).
In Equation (2), IoU is the ratio of the area of overlap between the ground-truth box and the prediction box to the area of their union, and it takes values in the range [0,1]. d² is the squared Euclidean distance between the centre point of the prediction box and the centre point of the ground-truth box. c² is the squared diagonal length of the smallest enclosing box that contains both the prediction box and the ground-truth box. α is a trade-off parameter, and v is a parameter that measures the consistency of the aspect ratios.
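Equation (2) is not reproduced in this text. As a hedged illustration, the sketch below follows the commonly published CIoU definition, CIoU = IoU − d²/c² − αv, with v = (4/π²)(arctan(w_gt/h_gt) − arctan(w/h))² and α = v/((1 − IoU) + v). The exact forms of α and v, the (x1, y1, x2, y2) box format, the function name and the small `eps` stabilizer are assumptions, not details taken from the paper.

```python
import math


def ciou(box_pred, box_gt, eps=1e-9):
    """Return (IoU, CIoU) for two boxes given as (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = box_pred
    gx1, gy1, gx2, gy2 = box_gt

    # IoU: overlap area divided by union area.
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    union = (px2 - px1) * (py2 - py1) + (gx2 - gx1) * (gy2 - gy1) - inter
    iou = inter / (union + eps)

    # d^2: squared distance between the two box centres.
    d2 = ((px1 + px2) / 2 - (gx1 + gx2) / 2) ** 2 + ((py1 + py2) / 2 - (gy1 + gy2) / 2) ** 2

    # c^2: squared diagonal of the smallest box enclosing both boxes.
    c2 = (max(px2, gx2) - min(px1, gx1)) ** 2 + (max(py2, gy2) - min(py1, gy1)) ** 2 + eps

    # v measures aspect-ratio consistency; alpha is the trade-off weight (assumed forms).
    angle_gt = math.atan((gx2 - gx1) / (gy2 - gy1 + eps))
    angle_pred = math.atan((px2 - px1) / (py2 - py1 + eps))
    v = (4 / math.pi ** 2) * (angle_gt - angle_pred) ** 2
    alpha = v / (1 - iou + v + eps)

    return iou, iou - d2 / c2 - alpha * v


iou_value, ciou_value = ciou((0, 0, 2, 2), (1, 1, 3, 3))
box_regression_loss = 1 - ciou_value  # the regression loss is typically 1 - CIoU
print(iou_value, ciou_value, box_regression_loss)
```

Unlike plain IoU, the value stays informative for boxes that do not overlap at all: the d²/c² term still tells the optimizer how far apart the centres are.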
Once the loss function settles around a value and no longer shows large fluctuations, the network is considered to have converged and can more accurately locate and classify the target animal in the input image.
Building the dataset
In practice, it is difficult to collect a large number of usable wildlife images. Therefore, to support this study, 50 high-definition cameras were first deployed in the study area for over 1000 hours of continuous observation. Secondly, a large number of images containing the Phayre’s leaf-monkey, the Malabar pied hornbill, the Wreathed hornbill, the Great hornbill, the Skywalker hoolock gibbon, and the Red-thighed falconet were collected from the Internet. The final research dataset was constructed from the observational data and the data collected on the Internet, combined with manual annotation by domain experts.
Figure 3 shows a map of the study area and photographs of the field deployment of the data collection equipment. Researchers deployed monitoring equipment in three wildlife habitats in the YTNR: Yingjiang Hornbill Valley [30], Daniang Mountain [31] and Mangshi [32]. The equipment is connected to the intelligent monitoring system, and the monitoring data can be viewed and downloaded in real time from the system backend.

Map of the study area and photographs of the field deployment of data collection equipment.
As shown in Fig. 4, a total of 2987 images were available in the self-built dataset. The numbers of images of the six species of animals, namely, Phayre’s Leaf-monkey, Malabar pied hornbill, Red-thighed Falconet, Great Hornbill, Wreathed hornbill, Skywalker hoolock gibbon, are 404, 711, 419, 548, 501 and 551 respectively. The numbers of animals included in these images are 760, 833, 700, 623, 628 and 437 respectively.

Bar graphs of the dataset statistics. (a) shows the number of pictures per type of animal in the dataset. (b) shows the number of animals of each type in the dataset. (c) shows the number of pictures per type of animal in the benchmark dataset. In the horizontal axis labels of the bar graphs, ‘0’ represents Phayre’s Leaf-monkey, ‘1’ represents Malabar pied hornbill, ‘2’ represents Red-thighed Falconet, ‘3’ represents Great hornbill, ‘4’ represents Wreathed hornbill and ‘5’ represents Skywalker hoolock gibbon.
To demonstrate the validity of the self-built dataset, the images of animals collected in the field were selected as a benchmark dataset for experimental comparison. A total of 1410 images were included in the benchmark dataset, as shown in Fig. 4(c), with the numbers of images for the six animals being 189, 288, 233, 217, 253 and 230 respectively.
As shown in Fig. 5, the six wildlife species are hard to distinguish from their surroundings because the animals are photographed in a wild environment. Hence, to train detection models with good generalization capabilities, the dataset is also augmented in real time with common data augmentation methods (flipping, rotating, cropping, contrast adjustment, Gaussian noise, etc.) in combination with Mosaic data enhancement.

Sample images from the self-built dataset. These images represent the different types of wildlife present in the dataset: (a) is Phayre’s leaf-monkey, (b) is Malabar pied hornbill, (c) is Red-thighed Falconet, (d) is Great Hornbill, (e) is Wreathed hornbill and (f) is Skywalker hoolock gibbon. Images (a), (b) and (c) are screenshots from the observation video, and (d), (e) and (f) are from the Internet.
The specific configuration of the experimental platform used in this study is shown in Table 2.
Experimental platform configuration
The model in this study was trained using weights pre-trained on the VOC2007 dataset as initialization weights. The initial learning rate is 0.001, and the learning rate follows the cosine annealing schedule of YOLOv4: it first rises linearly and then falls following a cosine function, and this cycle is repeated several times. Momentum is 0.9, the number of epochs is 100, the batch size is 8, and the number of iterations is 27,200.
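A learning-rate schedule of this kind, a linear rise followed by a cosine fall, can be sketched as below. This is a minimal illustration of a single cycle only; the warm-up length (5% of the iterations), the minimum learning rate and the function name are hypothetical choices that the paper does not specify, and the repeated cycles mentioned above are not modelled.

```python
import math


def learning_rate(step, total_steps, base_lr=1e-3, warmup_steps=None, min_lr=1e-5):
    """Linear warm-up to base_lr, then cosine decay towards min_lr (one cycle)."""
    if warmup_steps is None:
        warmup_steps = int(0.05 * total_steps)  # assumed warm-up length
    if step < warmup_steps:
        # Linear rise: reaches base_lr on the last warm-up step.
        return base_lr * (step + 1) / warmup_steps
    # Cosine fall over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


TOTAL_STEPS = 27_200  # 100 epochs x 272 iterations per epoch, as reported above
for step in (0, 1_359, 14_000, 27_199):
    print(step, learning_rate(step, TOTAL_STEPS))
```

With a batch size of 8, the reported 27,200 iterations over 100 epochs correspond to 272 iterations, or about 2176 images, per epoch.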
The loss function curve for the training of the model in this paper is shown in Fig. 6. It can be seen that the decrease in the loss slows down from the 70th epoch, and that the loss is stable from the 90th to the 100th epoch.

Training loss curves. Box_loss is box regression loss, cls_loss is classification loss, and obj_loss is confidence loss.
To verify the effectiveness of the model, average testing time, FPS (Frames Per Second), AP (Average Precision) and mAP (mean Average Precision) were used as evaluation metrics. The average testing time and FPS are used to evaluate the real-time detection performance of the model. The calculation process is shown in Equations (3) and (4).
Here, time represents the average testing time, frameNum represents the number of frames and elapsedTime represents the time elapsed. FPS is the number of frames a model can process in one second: the larger the number, the faster the processing speed. Likewise, a shorter average testing time indicates faster detection.
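Equations (3) and (4) are not reproduced in this text. From the definitions above they are presumably time = elapsedTime / frameNum and FPS = frameNum / elapsedTime; the sketch below assumes exactly that. The helper names and the `detect` callable are hypothetical stand-ins for the detector under test.

```python
import time


def speed_metrics(frame_num, elapsed_time):
    """Return (average testing time in seconds, FPS) under the assumed definitions."""
    average_time = elapsed_time / frame_num
    fps = frame_num / elapsed_time
    return average_time, fps


def measure_speed(detect, frames):
    """Time a detector callable over a list of frames and report the two metrics."""
    start = time.perf_counter()
    for frame in frames:
        detect(frame)
    elapsed_time = time.perf_counter() - start
    return speed_metrics(len(frames), elapsed_time)


# 100 frames processed in 26 s gives 0.26 s per frame and roughly 3.8 FPS.
average_time, fps = speed_metrics(100, 26.0)
print(f"{average_time * 1000:.0f} ms per frame, {fps:.1f} FPS")  # prints: 260 ms per frame, 3.8 FPS
```

The two metrics are reciprocals of each other, which is consistent with the reported pair of 260 ms and 3.8 FPS.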
The average precision is related to Precision and Recall, the formulas for Precision and Recall are shown in Equations (5) and (6).
TP (True Positives) is the number of positive samples that the model detects correctly, FP (False Positives) is the number of negative samples wrongly detected as positive, and FN (False Negatives) is the number of positive samples that the model misses. (TP + FP) is therefore the total number of samples that the model detects as positive, and (TP + FN) is the actual total number of positive samples. Precision is the ratio of the number of positive samples correctly identified by the model to the number of all samples classified as positive. Recall is the ratio of the number of positive samples correctly identified by the model to the total number of actual positive samples.
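Equations (5) and (6) are not reproduced in this text. Under the standard definitions that the paragraph above describes, with FP denoting false positives and FN denoting false negatives, they would read as follows; the equation numbering is assumed from the references in the text.

```latex
\begin{align}
\mathrm{Precision} &= \frac{TP}{TP + FP} \tag{5} \\
\mathrm{Recall}    &= \frac{TP}{TP + FN} \tag{6}
\end{align}
```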
A range of precision and recall values are calculated for each category based on the formula and confidence thresholds. These recall and precision values are plotted as the x and y-axis for the P-R curve and the area under the curve is the average precision.
The mAP is an arithmetic mean of the average precision of all categories in the test set, calculated as shown in Eq. (7). A larger mean average precision indicates a higher detection accuracy of the model.
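The AP and mAP computation described above can be sketched as follows. The paper does not state which interpolation of the P-R curve it uses, so this example assumes the common all-point interpolation (precision made non-increasing in recall before summing the area). The function names and the toy inputs are hypothetical, and each detection is assumed to have already been matched against the ground truth.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the P-R curve for one class (num_ground_truth must be > 0)."""
    # Rank detections from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)  # TP / (TP + FN)
        precisions.append(tp / (tp + fp))      # TP / (TP + FP)

    # All-point interpolation: precision becomes non-increasing as recall grows.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])

    # Sum rectangle areas between consecutive recall values.
    ap, previous_recall = 0.0, 0.0
    for recall, precision in zip(recalls, precisions):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


def mean_average_precision(ap_per_class):
    """Arithmetic mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Toy example: three detections of one class, two ground-truth animals.
ap = average_precision([0.9, 0.8, 0.7], [True, False, True], num_ground_truth=2)
print(ap)
```

Sweeping down the ranked detections is equivalent to lowering the confidence threshold step by step, which is how the range of precision and recall values for the P-R curve arises.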
To verify the efficiency of our model, Faster R-CNN, YOLOv3, YOLOv4, YOLOv5, as well as MobileNetv1-Yolov4 and MobileNetv3-Yolov4, were selected for comparison in the experiment. The training parameters and experimental platform configuration were kept consistent across models, as detailed in Section 3.2. The models were trained on the GPU platform, while the detection speed and accuracy of each model were compared on the CPU platform.
The performance of the different algorithms on the CPU platform is compared in Tables 3 and 4. The experimental data show that Faster R-CNN, YOLOv5 and MobileNetv2-Yolov4 all have an mAP of over 90%. Further, the mAP of MobileNetv2-Yolov4 is 93.6%, which is a 7.7% improvement over YOLOv4 and the highest detection accuracy of all the experimental models. Comparing the detection speed of the models, our model is the fastest, with an average testing time of 260 ms and an FPS of 3.8, which can meet the demand for real-time detection in the wild. In summary, by introducing the lightweight network MobileNetv2, the proposed model not only speeds up the detection of wild animals on CPU platforms but also maintains detection accuracy. It can therefore play an important role in wildlife detection in Yunnan.
Comparison of the detection accuracy of different models
Comparison of detection speed of different models under CPU
To further verify the validity of the data in this paper, MobileNetv2-Yolov4 was trained using the self-built dataset and the benchmark dataset, respectively. Table 5 shows the detection results of MobileNetv2-Yolov4 on the self-built dataset and the benchmark dataset. This comparison experiment uses the same test dataset containing 143 images of animals collected in the field. We can see that the self-built dataset used in this paper does improve the overall detection accuracy of the model. The mAP increased from 87.4% to 89.4%. There was also an increase in AP for each animal species. In addition, the detection AP for the Skywalker hoolock gibbon on the benchmark dataset was only 73.2%, indicating that the quality of the Skywalker hoolock gibbon images collected in the field was poor. It is difficult for the camera to capture a clear image because the Skywalker hoolock gibbon is so mobile by nature.
Comparison of the detection accuracy of benchmark dataset (I) and self-built dataset (II)
Figure 7 shows the P-R curves for MobileNetv2-Yolov4 on the six wildlife species of the self-built dataset. The area under the curve is above 0.9 for five of the six wildlife categories, the exception being the Phayre’s Leaf-monkey.

P-R curves. In the legend, feishiyehou denotes Phayre’s leaf-monkey, guanbanxiniao denotes Malabar pied hornbill, hongtuixiaosun denotes Red-thighed Falconet, shuang jiaoxiniao denotes Great hornbill, zhoukuixiniao denotes Wreathed hornbill and baimeichangbiyuan denotes Skywalker hoolock gibbon.
To give a more intuitive view of the model’s detection performance, a representative set of six results is shown in Fig. 8. The proposed model is able to detect the Phayre’s leaf-monkey accurately even when it is partly hidden by the leaves of a tree. Although the Great hornbill and the Malabar pied hornbill are very similar, the model does not confuse the two. Because only the head of the Great hornbill is visible in the sample image, the confidence of that detection is only 0.88. All three Red-thighed Falconets in the detection area are detected. The Wreathed hornbill is detected accurately even with its back to the camera. The Skywalker hoolock gibbon and the Phayre’s leaf-monkey are also similar, but the model does not mistake the former for the latter.

Sample test results. These images show the results of six wildlife species tested under MobileNetv2-Yolov4. (a) is Phayre’s leaf-monkey, (b) is Malabar pied hornbill, (c) is Red-thighed Falconet, (d) is Great Hornbill, (e) is Wreathed hornbill and (f) is Skywalker hoolock gibbon.
This paper proposes replacing the backbone feature extraction network CSPDarknet53 in the YOLOv4 model with the lightweight network MobileNetv2, which simplifies the network structure and improves detection speed while ensuring detection accuracy. Regarding the dataset, this study collected and constructed a wildlife dataset combining field observation data and public data from the Internet, which provides a benchmark for model training and testing in such studies. The resulting MobileNetv2-Yolov4, trained using transfer learning, achieves a mean average precision of 93.6%, which is 7.7% better than YOLOv4 and 3% better than YOLOv5 under the same conditions. Our model achieved a detection speed of 3.8 FPS on the CPU platform, which is 4.75 times the speed of YOLOv4.
The model has been applied to the backend of the Dehong Smart Monitoring System, which, together with the system’s front-end and back-end architecture, enables real-time detection and counting of six types of wildlife: the Phayre’s leaf-monkey, the Malabar pied hornbill, the Red-thighed Falconet, the Great Hornbill, the Wreathed hornbill and the Skywalker hoolock gibbon. Users only need to log on to the system via their computers or mobile phones to view animal incursions, which greatly saves labour costs. Although this study has solved some practical problems and has been used with some success in practice, it is only applicable to the six wildlife species included in the self-built dataset. Our model is not yet able to accurately detect new animals if they appear in the YTNR. In addition, animals in the field are easily occluded, their postures vary, and they often appear as small targets; all of these factors place higher demands on wildlife detection models, and our model does not yet address them specifically. In our experiments, we also found that different models were sensitive to different animals to differing degrees, but we did not address the interpretability of the models in this study.
Our future work will focus on two points. The first is to develop more widely applicable models that can detect a broader range of wildlife, and the second is to explore the interpretability of the model in depth. For the first point, it may be possible to modify the MobileNet network structure and to introduce new data enhancement algorithms that improve the quality of images collected in the field, so that the model can better address practical needs. For the second point, existing methods for interpreting deep learning models could be explored.
