Abstract
Human-wildlife conflict in habitats along forest fringes is a substantial issue. An automated monitoring system that can detect animal breaches and deter animals from foraging in fields is essential to resolving this conflict. However, automatically fending off intruding animals is a challenging task. In this paper, we propose a deep learning model for elephant identification using YOLO lite with knowledge distillation, which can be easily deployed on edge devices. We also propose an elephant re-identification system using a Siamese network, which helps track the number of times an elephant tries to forage in the field. This re-encounter information about the same elephant can be used to decide the averting sound for that particular elephant. The proposed system shows an accuracy of 89%, which is a good performance improvement compared to the state-of-the-art models proposed for animal identification. Thus, the proposed lightweight knowledge-distillation-based animal identification model and deep-learning-based animal re-identification model can be employed on edge devices for real-time monitoring and animal deterrence to safeguard farm fields.
Introduction
The conflict between humans and elephants arises in the overlapping human and elephant habitats, as elephants forage into crop fields for food. Mitigation of such conflicts should be addressed from a conservation, societal, and technological standpoint. Traditionally, farmers have used a variety of mitigation strategies like bursting sparklers or carbide cannons, screaming and shouting, hitting metals, and throwing stones to prevent elephants from raiding crop fields [1]. These techniques are effective at keeping elephants away from crops, but they disrupt farmers’ as well as animals’ psychosocial well-being [2, 3].
Modern technology could assist in developing early elephant intrusion warning and drive-away systems. As ear-splitting sound scares elephants away, anti-crop raiding acoustic gun devices generate a safe, low-frequency, ear-splitting audio signal. These high-tech acoustic deterrents are also problematic. Audio playbacks of threatening sounds such as wild cat growls, human yells, bee buzzes, and matriarchal elephant vocalisations are found to be short-term elephant repellents. According to some studies, elephants quickly learn to tolerate these sounds and return to raid crops. Australia’s national research organization, CSIRO, developed an artificial intelligence scarecrow with three components: sensors to detect pests, a processing brain to identify them and decide how to respond, and deterrent devices that can respond with the right combination of sound or light. These can be dispersed to cover any area that requires protection. However, after a few encounters, elephants simply dismissed the sound as a stationary threat and walked around it.
All scarecrow techniques have a long history of failure, and elephants quickly learn to tune out a deterrent if no potential danger exists. Because of elephants' high intelligence and adaptive learning capability, developing a long-lasting drive-away technology is difficult, so conventional drive-away methods based on human intervention are widely accepted and used to this day. While mitigation techniques may succeed or fail, more emphasis must be placed on the development of ecologically relevant strategies. Few attempts have been made to prioritise the incorporation of information about elephant behaviour and sensory perception into conflict mitigation strategies. One solution would be to identify the elephant and use different sounds and frequencies when the same animal approaches the farm again. This would make the elephant believe that various types of danger are present in the farm and would deter it from foraging in the field. Hence, in this work we propose a lightweight deep learning model for animal identification and re-identification with a knowledge distillation framework. This work could also be used to tag and label elephants and keep track of those that frequently visit the field. In section 2, the existing works on animal re-identification are discussed along with their limitations; in section 3, the methodology adopted in the proposed technique is outlined, followed by a discussion of the results in section 4.
Our primary contributions are:
(i) Develop a lightweight knowledge distillation model for animal identification.
(ii) Develop an animal re-identification model using a Siamese network.
Related works
Ravoor et al. [4] attempted to reduce animal-human conflicts by recognising animal intrusions automatically using a deep learning network, the MobileNetv2-SSD model. They developed a computer vision-based networked cross-camera tracking system for animal trespass detection, in which animal re-identification is used to monitor an animal's movement between different mounted cameras. Object detection runs on Raspberry Pi devices that are connected to a laptop. Testing was carried out on tigers, jaguars, and elephants, with detection rates of 80%, 89.47%, and 92.56%, respectively, although the system runs at only around 2-3 fps. Rose et al. [5], in their book "Deep Learning using Python", demonstrated several deep learning models for real-time object tracking and detection.
Meivel et al. [6] proposed an alternative approach for detecting wild animals using elevated thermal cameras; this method outperformed existing multiscale image detection methods such as the Haar cascade classifier and the HOG descriptor. Korschens et al. [7] presented a re-identification study of elephants with 276 elephant individuals in 2078 photos. The photographs were taken over a period of around 15 years and present a number of challenges, such as large variation between animals, viewpoints of an elephant that differ from those seen during training, animal ageing effects, and significant skin colour changes. They used YOLO for object detection, ImageNet-based feature extraction, and a support vector machine for classification, and obtained a highest classification accuracy of 56%. Numerous dataset-specific problems, such as substantial colour contrasts in the animals, multiple elephants in one image, and mud occluding critical features, are also present in the dataset.
Oishi et al. [8] noted that statistical methods are used to estimate wild animal numbers, but that applying such strategies to large areas is problematic. They applied the computer-aided detection of moving wild animals (DWA) technique to thermal remote sensing images. The precision on aerial thermal images over a large region was around 77.3%. For recognising moving wild animals, the suggested strategy can help reduce the need for supervision. Ravoor et al. [9] reviewed the available animal re-identification techniques that enable fully automated monitoring of individual animals in a cross-camera arrangement. As different modern object tracking techniques are examined, it becomes clear how important feature extraction is in these kinds of systems. Person re-identification and object tracking systems that have proven successful can be studied in the hope of transferring their concepts to individual animal monitoring.
Meena et al. [10] proposed an effective wildlife detection and recognition system using invariant features and fuzzy logic. The proposed approach uses Zernike, shape, texture, and skeleton path features. When considered collectively, these features are invariant to rotation, scale, translation, lighting, and partly to posture. Low contrast or illumination, haze or blur, occlusion, camouflage, background clutter, and pose change are all challenges that the suggested model can handle. The suggested fuzzy system has a 97% average accuracy, which is comparable to domain experts' accuracy in recognising animals from the thermal dataset.
Xie et al. [11] presented the viewpoint robust knowledge distillation (VRKD) method to speed up vehicle re-identification. The VRKD approach is made up of a sophisticated teacher network and a basic student network. The teacher network uses triple-directional deep networks to learn viewpoint-robust features. The student network is made up of only a narrow backbone sub-network and a global average pooling layer. The student network distils viewpoint-rich knowledge from the teacher network by minimising the Kullback-Leibler divergence between the posterior probability distributions produced by the student and teacher networks. As a result, vehicle re-identification is substantially faster because only a small amount of test-time computation is required. Ashok Kumar et al. [12], in their book, presented several deep learning research applications related to current state-of-the-art research using deep learning models.
Chawla et al. [13] presented the Deep Inversion for Object Detection (DIODE) model, which enables data-free knowledge distillation for neural network models trained on object detection tasks. To improve image fidelity and distillation efficacy, DIODE relies on two key components. First, it uses a large number of differentiable augmentations. Second, a novel automated bounding box and category sampling approach for image synthesis allows the generation of a large number of images with a diverse range of spatial and category features. Data-free knowledge distillation from a teacher to a newly initialised student is possible using the generated images. In a series of experiments, they demonstrated that DIODE's ability to consistently match the original training image distribution enables more effective knowledge distillation than out-of-distribution proxy datasets, which inevitably occur in a data-free setup because the original domain knowledge is unavailable.
Kumar et al. [14] proposed a deep learning based assistive technology utilizing knowledge distillation for audio-visual speech recognition. Zheng et al. [15] suggested a different deep learning model for re-identification of humans. Although re-identification has come a long way in recent years, spatial localisation and view-invariant representation learning for robust cross-view matching remain significant obstacles. They address these problems using the Consistent Attentive Siamese Network, a new attention-driven Siamese learning architecture.
Though there are many works focusing on animal identification, as reviewed in [9], only a few works address animal re-identification. These re-identification systems require high-end hardware, making them less feasible for use in animal intrusion averting systems. Less resource-demanding deep learning models could allow the deployed edge devices to use low-cost processing units, which in turn can make these devices economically viable for farmers along forest borders. So, exploiting the research outputs in areas like vehicle identification [11, 13] and human re-identification [15], we propose to identify and re-identify elephants from thermal camera images. The knowledge distillation technique is highly helpful as it reduces the computational demand, which can then be met easily by the edge devices installed in remote areas. Also, as a huge number of installations has to be made across the entire crop field or plantation, the hardware requirements can be met at low cost if the knowledge distillation technique is used. The kind of auditory signal used to threaten a frequently revisiting elephant can be altered based on its intrusion count. To make this possible, re-identification is done using a Siamese network, and the count is stored in the edge device to make the decision.
Proposed methodology
Animal identification using RCNN
Object detection is the process of finding and classifying objects in an image. The region-based convolutional neural network (R-CNN) is a deep learning model that combines rectangular region proposals with convolutional neural network features for object detection. The goal of R-CNN is to take in an image and correctly identify where the object is in the picture. The R-CNN detection system consists of three modules, as in Fig. 1. The first generates category-independent region proposals, or bounding boxes, using a process called Selective Search: it proposes a set of boxes in the image and checks whether any of them correspond to an object. As soon as the proposals are created, R-CNN warps each region to a standard square size and passes it to the next module. The second module is a deep convolutional neural network that extracts a feature vector from each region. The third module is a set of class-specific classifiers, i.e. linear Support Vector Machines (SVMs), that classify whether the region contains an elephant.

R-CNN Architecture for animal identification.
Instead of selecting interesting regions of an image as in R-CNN, the You Only Look Once (YOLO) model predicts classes and bounding boxes for the whole image in one run of the algorithm. YOLO is a real-time detection technique that divides an image into a grid, each cell of which detects objects within its own boundaries. It can be used to detect objects in real time from data streams. The network divides the image into regions and predicts bounding boxes and probabilities for each region. These bounding boxes are weighted by the predicted probabilities. During testing, it considers the entire image in order to make predictions in a global context. It also makes its predictions with a single network evaluation, unlike R-CNN, which requires thousands of evaluations for a single image [16]. YOLOv5 is a single-stage detector with the same three main parts as any other single-stage detector:
1. Model Backbone
2. Model Neck
3. Model Head
The Model Backbone is mainly used to extract important features from the input image. Figure 2 shows the YOLOv5 architecture, in which cross-stage partial (CSP) networks are used as the backbone to extract informative features from the input image. The Model Neck is mainly used to generate feature pyramids, which help the model handle objects at different scales. Path Aggregation Network (PANet) is used as the neck for building feature pyramids in YOLOv5.

YOLO V5 Cross stage partial network architecture.
Knowledge distillation refers to model compression by teaching a small network, step by step, what to do using a large network that has already been trained. A compact model, known as the student network, is trained from one or more large pre-trained models, known as teacher networks. Knowledge distillation achieves this by guiding the student with the teacher's predictions, which carry detailed inter-class and object information. Soft labels refer to the output feature maps produced by the large network after each convolution layer. Figure 3 shows the knowledge distillation framework-based lightweight animal identification model. The small network is then trained to behave like the larger network by trying to replicate its outputs. The architecture of the student network is substantially simpler than that of the teacher network. Because the student network is considerably simpler, it must be strengthened by inheriting knowledge from the teacher network. To be more exact, the loss function for the student network is as follows:

Knowledge distillation framework-based lightweight animal identification model.
where
x - training sample
y - class label
L_CE - cross-entropy loss function
L_Kld - Kullback–Leibler divergence loss
θ_S - student network parameters
λ ≥ 0 - weight that controls the Kullback–Leibler divergence loss
The formula for the loss L_Kld is as follows:
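A minimal sketch of the two expressions, written in the standard knowledge distillation form that is consistent with the symbols defined above. The posterior notation p_T and p_S (teacher and student class probabilities), the class index k, and the absence of a temperature term are assumptions for illustration and are not taken from the source.

```latex
% Equation 1 (assumed form): student loss as cross-entropy plus weighted KL term
L(x, y; \theta_S) = L_{CE}(x, y; \theta_S) + \lambda \, L_{Kld}(x; \theta_S), \qquad \lambda \ge 0

% Equation 2 (assumed form): KL divergence between teacher and student posteriors
L_{Kld}(x; \theta_S) = \sum_{k} p_T(k \mid x) \, \log \frac{p_T(k \mid x)}{p_S(k \mid x; \theta_S)}
```

In this form only the student parameters θ_S are optimised; the teacher contributes fixed probabilities p_T.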
From Equations 1 and 2, it can be seen that the teacher network's parameters do not appear in Equation 1, implying that the teacher network only supplies the logit values during the student network's training phase and that its parameters are fixed. The student network's output (i.e., a logit value z) is then driven to be similar to the teacher network's output by optimising the KLD loss function. As a result, the student network can inherit the teacher network's knowledge while requiring only a small amount of computation for fast testing.
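The student loss described above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the authors' training code: the function names, the use of a softmax over logits, and the example logit values are all hypothetical.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_probs, label):
    """Cross-entropy between the student prediction and the true class label."""
    return -math.log(student_probs[label])


def kl_divergence(teacher_probs, student_probs):
    """Kullback-Leibler divergence from the student to the teacher posterior."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(teacher_probs, student_probs)
        if p > 0
    )


def student_loss(student_logits, teacher_logits, label, lam):
    """Cross-entropy plus lam times the KL term (lam >= 0)."""
    p_student = softmax(student_logits)
    # The teacher is frozen: it only supplies logits and is never updated.
    p_teacher = softmax(teacher_logits)
    return cross_entropy(p_student, label) + lam * kl_divergence(p_teacher, p_student)


# Hypothetical logits for a two-class problem (elephant / background).
loss = student_loss([1.2, -0.3], [2.5, -1.0], label=0, lam=0.5)
print(loss)
```

Setting `lam` to zero reduces the loss to plain cross-entropy, and the KL term vanishes when the student reproduces the teacher's logits exactly.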
A re-identification system distinguishes between individuals of the same species or category, whether humans, automobiles, or wildlife. Re-identification systems are built by generating embedding vectors from images and measuring their similarity using conventional metrics such as the Euclidean distance. The process presented in this research performs animal detection and re-identification, using features to determine whether the same animal appeared earlier. As a result, the animal may be tracked across cameras, and, in the case of distributed cameras, the number of animals detected can be correctly calculated.
A Siamese Neural Network (SNN) is a class of neural network architectures that contain two or more identical sub-networks. The basic Siamese network structure is shown in Fig. 4. "Identical" here means that the sub-networks have the same configuration with the same parameters and weights. Parameter updates are mirrored across the sub-networks, and the network is used to find similarities between inputs by comparing their feature vectors. The similarity of the inputs is then computed as either a difference or a numerical value. The target output is 1 for input pairs of the same class and 0 for input pairs of different classes.

Siamese network basic structure.
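The weight sharing that defines a Siamese network can be illustrated with a toy sketch. The single linear layer with ReLU, the weight values, and the input vectors below are assumptions chosen for readability; they stand in for the learned sub-network and are not the architecture used in this work.

```python
import math


def embed(image_vector, weights):
    """Shared sub-network: one linear layer followed by ReLU (toy stand-in)."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, image_vector)))
        for row in weights
    ]


def euclidean_distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def siamese_distance(image_a, image_b, weights):
    """Embed both inputs with the same weights and compare the embeddings."""
    # Passing the same weights to both branches is what makes them identical.
    return euclidean_distance(embed(image_a, weights), embed(image_b, weights))


# Hypothetical shared weights and flattened inputs.
shared_weights = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
same = siamese_distance([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], shared_weights)
different = siamese_distance([1.0, 2.0, 3.0], [3.0, 0.0, 1.0], shared_weights)
print(same, different)
```

Identical inputs map to identical embeddings and therefore to a distance of zero, while differing inputs yield a positive distance that a threshold or a learned classifier can act on.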
The pipeline of the animal re-identification module is shown in Fig. 5. The system detects animal intrusion and counts the number of times each animal has entered the field using deep learning models. Further, knowledge distillation reduces the resource demand of complex models, making them easier to deploy on edge devices. To identify the animal, the animal must first be detected in the image. The image is then cropped using the bounding box, and the cropped image is saved for re-identification. The original image is also stored along with the bounding box. The features of the cropped image are extracted through the feature extraction process. If the features of the animal match any of the previous images, the counter for that animal is increased; otherwise, the counter is stored as such. Finally, the class and counter of the detected animal are returned.

Pipeline for proposed algorithm.
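The crop, feature extraction, matching, and counting steps of this pipeline can be sketched as follows. All names here (`crop`, `extract_features`, `update_counters`), the mean-intensity features, the exact-match rule, and the choice to start a new animal's counter at 1 are hypothetical placeholders, not the implementation used in this work.

```python
def crop(image, box):
    """Crop a 2-D image (list of rows) to box = (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return [row[x_min:x_max] for row in image[y_min:y_max]]


def extract_features(patch):
    """Toy stand-in for the learned feature extractor: mean intensity per row."""
    return [sum(row) / len(row) for row in patch]


def update_counters(gallery, counters, features, is_same_animal):
    """Match features against stored animals and return (animal_id, counter)."""
    for animal_id, stored in gallery.items():
        if is_same_animal(stored, features):
            counters[animal_id] += 1  # the same animal has entered again
            return animal_id, counters[animal_id]
    new_id = len(gallery)  # no match: register a new animal
    gallery[new_id] = features
    counters[new_id] = 1  # assumption: the first sighting counts as 1
    return new_id, counters[new_id]


# Hypothetical 4 x 4 thermal frame with a warm object in the centre.
frame = [[0, 0, 0, 0], [0, 9, 8, 0], [0, 7, 9, 0], [0, 0, 0, 0]]
gallery, counters = {}, {}
features = extract_features(crop(frame, (1, 1, 3, 3)))


def exact_match(a, b):
    """Placeholder matcher; a real system would compare learned embeddings."""
    return a == b


first = update_counters(gallery, counters, features, exact_match)
second = update_counters(gallery, counters, features, exact_match)
print(first, second)
```

Keeping the matcher as a parameter lets the same bookkeeping work with any similarity test, including one based on a Siamese network.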
The dataset used in this model is the open source Arribada Elephant Dataset. Arribada's thermal elephant initiative gathered a dataset of Asian elephant images at ZSL Whipsnade Zoo in collaboration with the Zoological Society of London. They captured thermal photographs of the Asian elephant herd using FLIR Lepton 2.5 and 3.5 microbolometer sensors (80 x 60 and 160 x 120, respectively). The total number of images in this dataset is 75,830. Figure 6 shows sample images from the dataset. The dataset consists of images categorised by angle, distance, object, and sensor. The angle category consists of front, rear, and side view thermal images of elephants. The distance category consists of images taken at different distances between the elephant and the camera. The object category consists of images containing other objects such as humans, goats, or multiple elephants. The sensor category groups images by the camera used, such as the Lepton 2.5 or Lepton 3.5. To overcome the issue of overfitting, the 75,830 images in the training dataset were augmented to 150,000 images. For each image, a maximum of 2 augmented versions were generated by randomly applying horizontal mirroring and changes in the grey scale.
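The augmentation step can be sketched in plain Python on a small greyscale image stored as a list of rows. The 0.5 mirroring probability, the 0.8 to 1.2 grey-level scaling range, and the function names are assumptions for illustration; the source states only that mirroring and grey-scale changes were applied at random, with at most two augmented versions per image.

```python
import random


def horizontal_mirror(image):
    """Flip a 2-D image (list of rows) left to right."""
    return [list(reversed(row)) for row in image]


def scale_grey_levels(image, factor):
    """Scale pixel intensities by factor and clamp them to the 0-255 range."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in image]


def augment(image, rng, max_versions=2):
    """Return up to max_versions randomly augmented copies of image."""
    versions = []
    for _ in range(rng.randint(0, max_versions)):
        out = image
        if rng.random() < 0.5:  # assumed mirroring probability
            out = horizontal_mirror(out)
        # Assumed range for the grey-level change.
        out = scale_grey_levels(out, rng.uniform(0.8, 1.2))
        versions.append(out)
    return versions


sample = [[10, 20, 30], [40, 50, 60]]
augmented = augment(sample, random.Random(0))
print(len(augmented))
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible across runs.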

Sample images from the Arribada Elephant Dataset.

Individual image prediction accuracy of RCNN.
R-CNN vs YOLO lite
The performance metrics of the RCNN and YOLO models are first analysed to choose the one that performs better in identifying the elephant. The prediction confidence on individual elephant images using the RCNN method is around 52–86%, as depicted in Fig. 8. The precision and recall rates are 0.75 and 0.667, respectively. The overall training accuracy of the RCNN method is found to be 60%.

Individual image prediction accuracy of YOLO.
The prediction confidence on individual images with the YOLO lightweight model is around 89% to 94%, as shown in Fig. 9. The recall and precision are found to be 0.98 and 0.995, respectively. The overall training accuracy of the YOLO model is 89%. The YOLO lightweight model has therefore been chosen for animal identification, as its training accuracy of 89% is 29 percentage points higher than the 60% obtained with the RCNN model. The receiver operating characteristic (ROC) curve, which shows the performance of a classification model at all classification thresholds, is plotted for both models as shown in Fig. 9. It can be inferred from the ROC curve that YOLO performs better than RCNN.

ROC curve for the proposed RCNN and YOLO models for animal identification.
Training deep learning models is normally quite complex and requires graphics processing units (GPUs). However, knowledge distillation helps to reduce the processing power requirement. The complex models are trained in the teacher module, and the knowledge is distilled and fed to the student model. Thus, the processing resource demand of the student model is very low compared to the teacher model, and so it can be easily deployed on the edge device. The proposed Siamese network based knowledge distillation model can be deployed on the edge-enabled thermal camera. All the images from the thermal camera are then processed by the student model, which is lightweight and ready for deployment on the edge device. In the proposed system, knowledge distillation is done using the YOLO lightweight deep learning model. Here, the teacher model is a complex model, and the student model is a lightweight model. Figure 10 shows the graph of the mAP, precision, and recall of the teacher and student models, where blue represents the teacher model and orange represents the student model. The proposed method achieved a precision of 0.99, a recall of 0.98, an accuracy of 89%, and an F1 score of 0.98.

Evaluation metrics for YOLO with KD framework.
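The F1 score is the harmonic mean of precision and recall, so the reported value can be checked directly from the precision of 0.99 and recall of 0.98 given above. The function name is a hypothetical helper for this check.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the proposed method.
print(round(f1_score(0.99, 0.98), 2))  # prints 0.98
```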
The box loss represents how well the algorithm can locate the centre of an object and how well the predicted bounding box covers an object. Objectness is essentially a measure of the probability that an object exists in a proposed region of interest. If the objectness is high, the image window is likely to contain an object. From Fig. 11 it can be seen that the box and objectness losses show a rapid decline and reduce to near zero as the epochs approach 150. The model improved swiftly in terms of precision, recall, and mean average precision before plateauing after about 150 epochs. The confusion matrix for the classification is shown in Fig. 12.
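How well a predicted box covers an object is commonly quantified with the intersection over union (IoU), on which box losses in YOLO-style detectors are typically built. The sketch below computes plain IoU for two axis-aligned boxes; it is an assumption that it approximates the box-coverage notion described above, and it is not the exact loss used in this work.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlapping rectangle (zero if disjoint).
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical predicted and ground-truth boxes in pixel coordinates.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

An IoU of 1 means the predicted box matches the ground truth exactly, and a loss of the form 1 - IoU falls towards zero as the overlap improves.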

Plots of box loss, objectness loss, classification loss, precision, recall and mean average precision (mAP) over the training epochs for the training and validation sets of the YOLO with knowledge distillation model.

Confusion matrix of the YOLO with knowledge distillation model.
Animal re-identification is done using the dissimilarity rate, which is calculated by the Siamese network. When the dissimilarity rate is less than 30%, the counter variable is incremented, which confirms that the same animal has entered again. If the dissimilarity rate is more than 30%, the animal is a new one that has entered for the first time. Figure 13 shows the dissimilarity obtained for two images. Here the dissimilarity rate is greater than 30%, so the animals in the two images are different. The neural network system was trained on a workstation with NVIDIA hardware, and training the knowledge distillation framework took 0.8 hours.

Dissimilarity rate.
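The 30% decision rule can be written as a small helper. The names are hypothetical, the rate is assumed to be expressed as a fraction between 0 and 1, and treating a rate of exactly 30% as a new animal is an assumption, since the text only specifies the cases below and above the threshold.

```python
SAME_ANIMAL_THRESHOLD = 0.30  # dissimilarity rate below this means the same animal


def is_same_animal(dissimilarity_rate, threshold=SAME_ANIMAL_THRESHOLD):
    """Return True when the dissimilarity rate indicates a returning animal."""
    return dissimilarity_rate < threshold


def register_sighting(counter, dissimilarity_rate):
    """Increment the counter for a returning animal, otherwise leave it as is."""
    if is_same_animal(dissimilarity_rate):
        return counter + 1
    return counter


print(register_sighting(1, 0.12))  # returning animal
print(register_sighting(1, 0.45))  # new animal
```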
Table 1 shows the experimental results of the suggested method in comparison to other methods. The proposed method surpassed Resnet152, Resnet101, SSD_mobilenet_v2, CNN-SVM, CNN-XGBoost, and ANN, achieving an accuracy of 89%.
Comparative Analysis
The proposed model addresses the problem of intrusion detection and provides various metrics for preventing intrusion. The accuracy of the YOLOv5 model can be increased by training it on a larger collection of images and fine-tuning it. Fine-tuning on the training dataset is also expected to reduce the latency of the Siamese network so that re-identification runs at an efficient speed. The proposed YOLOv5 method achieved a precision of 0.995, a recall of 0.98, an accuracy of 89%, and an F1 score of 0.98, values that are greater than those of other models such as Resnet152, Resnet101, SSD_mobilenet_v2, CNN-SVM, CNN-XGBoost, and ANN. Future work is to extend and integrate the animal re-identification module and the animal identification module into a lightweight edge device that decides the averting sound for deterring the elephant.
Acknowledgement
This research is funded by the Department of Science and Technology, Technology Development and Transfer Division, New Delhi, sanction number DST/TDT/AGRO-23/2019(G).
