Abstract
Human-Elephant Conflict (HEC) and its mitigation have long been a serious conservation issue in India. It arises mainly from human encroachment on forests as part of societal development. The resulting settlements are highly affected by the intrusion of wild elephants, which cause extensive crop-raiding, injuries and, in many cases, death. HEC is a growing problem in rural areas of India that border forests and other elephant habitats. Studies make clear that HEC is an important conservation issue affecting the peaceful co-existence of humans and elephants near forest areas. The desirable solution would be to facilitate this co-existence, but this often fails because of technical difficulties. Hence, this paper presents an end-to-end technological solution to facilitate smoother co-existence of humans and elephants. The proposed work deploys a live video surveillance system along with deep learning strategies to detect the presence of elephants. Numerical analysis shows that the post-training accuracy of the deep learning model used in the proposed approach is 98.7%, and that the approach outperforms an out-of-the-box image detector. The layered approach used in the proposed work improves resource management, which is a major bottleneck in real-time deployment scenarios.
Introduction
The loss and fragmentation of elephant habitat due to expanding human activity lead to serious conservation problems in India. This has continuously escalated Human-Elephant Conflict (HEC), which ranges from crop-raiding and injuries and deaths of humans caused by elephants to elephants being killed by humans to prevent crop loss, as well as land encroachment. In addition, established elephant trails are now cut by roads and railways. As a result, elephants are often killed by trains as they attempt to cross railway lines, and elephants charge at humans and vehicles that cross their trails.
Various conventional deterrent systems, such as chili and beehive fencing, have been developed to prevent HEC. Technology-based methods, such as satellite tracking of elephants using the Global Positioning System (GPS), have also been developed to track elephants and to warn of their presence in specific regions. These methods were primarily developed to study elephant behaviour or to build early warning systems. However, most of these techniques are invasive, are not cost-effective, and become complex when scaled up to a larger population and a wider geographical area. This paved the way for non-invasive techniques such as acoustic, seismic and vision-based detection.
Studies have shown that elephants may remain silent, or can learn to be silent and still, for long periods of time [1]. Seismic detection of the vibrations produced by elephants degrades when the ground is moist. Also, seismic noise similar to that made by elephants can be produced by external sources. These limitations show the need for an effective early warning system that detects elephants using the strengths of visual detection techniques [2]. This paper proposes a robust vision-based method that uses deep learning techniques to accurately detect the presence of elephants.
Related works
India’s forests hold 60% of the Asian elephant population. The habitat ranges of elephants in the country are fragmented and affected by increased human activity [3]. HEC is an important conservation issue in India, especially in the forest fringes where protected areas are surrounded by highly fertile and cultivable lands [4]. HEC, which ranges from crop raiding and injuries and deaths of humans caused by elephants to the killing of elephants to prevent crop loss, as well as land encroachment, creates severe localized economic and socio-political issues [5]. In addition, developmental pressures, including the construction of roads, railroads, dams, and canals in the forested regions of the country, have increased the magnitude of the problem. The present management plans for HEC in India fundamentally include the forced relocation of people, especially the indigenous communities inhabiting forested regions, and conventional methods such as electric fencing, digging trenches and using biological repellents such as chili or other plant parts to drive away the elephants [6]. None of these has been found to be a sustainable and economically viable solution, since they do not address the problem holistically. The development of systems that provide detection and tracking of the animal and communication with stakeholders is considered a path-breaking step towards solving HEC [7]. These technologies underwent fundamental improvement, taking the form of invasive systems initially and non-invasive elephant detection systems later [8]. Invasive elephant detection systems, which are cost-intensive and non-scalable (such as satellite tracking, GPS, light sensors with laser beams and vibration-based sensors), were replaced by reliable, low-cost and scalable methods such as acoustic and visual monitoring systems in many parts of the world [9].
However, in India, the application of non-invasive elephant detection systems to mitigate HEC is still in its infancy. Traditional methods are still in use, and the issues related to HEC are severe in most parts of peninsular India [10].
Most conventional detection systems either involve manual effort, invasively tracking elephants with radio collars or GPS, or use infrared sensors [11]. These are not cost-effective and become complex when scaled up to a larger population and a wider geographical area. Another major concern with invasive techniques is the maintenance and monitoring needed to keep them functioning properly. The ease of replacing a failed or malfunctioning unit must also be considered. Hence, non-invasive techniques are preferred over invasive ones.
Elephant detection systems today use one of the non-invasive methods, relying on stimuli such as visual, vocal, or seismic signals to detect the presence of elephants [12]. Elephant calls are rare events, and elephants may remain silent or can learn to be silent for a long time. Because of this, a vocalization-based method may have difficulty identifying their presence [13]. Elephants produce seismic signals through rumbles and foot stomps; the amplitude of these signals is considerably above the background noise [15], which makes elephants potentially detectable. These signals, along with acoustic cues, are used to communicate with the herd. However, seismic vibration-based detection suffers when the ground is moist, and seismic noise similar to that made by elephants can be produced by external sources. With the available technology, it is also difficult to read these signals beyond a certain range: studies confirm that seismic signals can be accurately detected at a range of about 40 meters [16], and the signal attenuates as the distance increases. Hence, seismic detection might not be effective for broader surveillance, and it would be difficult to pinpoint the presence of elephants with this stimulus.
Acoustic signals produced by elephants have an advantage over other stimuli, as they tend to travel longer distances through the medium [17]. Studies reveal that these signals may travel over 2 km in air. They make elephant detection possible even when the animal is not in close proximity to the detection system, which is a strong argument for using acoustic signals for elephant presence detection [18]. Although acoustic signals appear to be the better choice, the challenge of noise filtration has to be handled [19]. The acoustic signals produced by other animals and birds in the deployed region must also be suppressed. These properties make it difficult to identify elephant presence using only a vocalization-based method [20]. These reasons show the need for efficient detection of elephants using the strengths of visual detection techniques [14].
Some image-based detection techniques use range and segmentation of the elephant image from the landscape to detect the presence of elephants. Zeppelzauer et al. [5] developed a multi-modal early warning system that primarily uses an automatic vision-based module to detect elephants and track them in wild video recordings. The authors identify elephants based on a color model of their body; the background color model is built from the remaining colors of the elephant’s surroundings. These models were used to train an SVM classifier to predict the presence of elephants in the recordings. Such approaches face issues when deployed in environments with limited computational resources. Hence, there is a need for an effective real-time approach for detecting elephants using the strengths of visual detection techniques.
Proposed system
The system proposed in this paper aims to detect the presence of elephants using a non-invasive video surveillance method, as shown in Figure 1. It employs a layered approach to process the live video streams obtained from surveillance cameras placed near the timberline of the forest. It combines classic machine learning and deep learning algorithms to identify the presence of elephants while minimizing false positive and false negative predictions. The computational layers stacked in this system primarily focus on offloading the intensive computation that happens at the edge of the deployed system, which in turn improves response time.

Figure 1. System Design.
The first layer preprocesses the captured image frame for the subsequent layers. The second layer uses machine learning algorithms to identify the presence of anomalies in the image if any. The third layer is a computationally intensive layer and it will be activated only if there are anomalies found in the image by the second layer. This layer confirms the presence of an elephant by using a pre-trained convolutional neural network (CNN) model, which takes the preprocessed image as an input and runs a single shot detection on it [21]. The final layer is activated if the neural network detects elephants in the image; this layer either sends an alert message to the concerned authorities or logs the visual information.
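The conditional activation described above can be sketched as a short control-flow function. This is a minimal illustration rather than the authors' implementation: `process_frame` and the four callables passed to it are hypothetical names, and the stand-in functions in the usage part only mimic the real SVM and CNN.

```python
from typing import Any, Callable, List


def process_frame(
    frame: Any,
    preprocess: Callable[[Any], Any],
    has_anomaly: Callable[[Any], bool],
    detect_elephants: Callable[[Any], List[Any]],
    alert_or_log: Callable[[Any, List[Any]], None],
) -> str:
    """Run one frame through the four layers; later layers run only if needed."""
    image = preprocess(frame)              # layer 1: runs on every frame
    if not has_anomaly(image):             # layer 2: cheap gate, runs on every frame
        return "no_anomaly"
    detections = detect_elephants(image)   # layer 3: expensive, anomalies only
    if not detections:
        return "anomaly_without_elephant"
    alert_or_log(frame, detections)        # layer 4: confirmed detections only
    return "elephant_detected"


# Usage with stand-in layers (hypothetical; strings play the role of frames).
calls: List[str] = []


def fake_preprocess(frame):
    calls.append("preprocess")
    return frame


def fake_svm(image):
    calls.append("svm")
    return image != "empty scene"


def fake_cnn(image):
    calls.append("cnn")
    return ["elephant"] if "elephant" in image else []


def fake_alert(frame, detections):
    calls.append("alert")


quiet = process_frame("empty scene", fake_preprocess, fake_svm, fake_cnn, fake_alert)
calls_after_quiet = list(calls)
busy = process_frame("elephant scene", fake_preprocess, fake_svm, fake_cnn, fake_alert)
print(quiet, busy)
```

In this sketch the expensive detector is never called for the empty scene, which is the property the layered design relies on.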
The areas where such systems are deployed demand smaller and lighter computers with a minimal technological footprint that can blend with the environment. This naturally favors an ultra-small computer such as an Arduino board or a Raspberry Pi. The compromise is the limited computational power such devices provide. The layered approach used in this system provides conditionally activated layers, which reduces the average computational load on the processor and thereby makes for efficient resource and power usage.
Detection accuracy often determines the success of a system deployed in such scenarios. Because the system will be deployed in timberline areas, it must make decisions automatically with minimal human intervention. These preconditions demand that the features taken into consideration be few and important, as real response times are critical and computing power is limited. The architecture of the proposed system is shown in Figure 2.

Figure 2. Proposed System.
The layers which have been implemented to build the system are described in detail in the subsequent subsections.
The first layer: preprocessing
This layer preprocesses the input data for the subsequent layers. Individual frames are extracted from the continuous live video feed and converted into two-dimensional arrays of pixel values [22]. The image is then resized to a smaller dimension to reduce the processing time in the next layers. Since the support vector machine (SVM) in the next layer was trained on 64×64 grayscale images, every frame is converted to grayscale, resized to match, and then flattened. Flattening converts a 2D matrix to a 1×N matrix by appending each row of the 2D matrix one after another.
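The preprocessing steps can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the luminance weights (the common 0.299/0.587/0.114 weighting) and the nearest-neighbour resize are choices made for this illustration, since the conversion and interpolation methods are not specified here, and the function names are hypothetical. A real deployment would more likely use an image library.

```python
def to_grayscale(rgb_frame):
    """Convert rows of (R, G, B) tuples to rows of 0-255 luminance values."""
    # Assumption: standard luminance weights; the paper does not name a formula.
    return [
        [int(round(0.299 * r + 0.587 * g + 0.114 * b)) for (r, g, b) in row]
        for row in rgb_frame
    ]


def resize_nearest(gray, out_h=64, out_w=64):
    """Nearest-neighbour resize of a 2D list to out_h x out_w (an assumed method)."""
    in_h, in_w = len(gray), len(gray[0])
    return [
        [gray[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


def flatten(matrix):
    """Append each row after the previous one, giving a 1 x N list."""
    return [value for row in matrix for value in row]


def preprocess(rgb_frame, size=64):
    """Grayscale, resize to size x size, then flatten for the SVM."""
    return flatten(resize_nearest(to_grayscale(rgb_frame), size, size))


# Example: a synthetic 128 x 256 frame of mid-grey pixels.
frame = [[(128, 128, 128)] * 256 for _ in range(128)]
vector = preprocess(frame)
print(len(vector))  # 64 * 64 = 4096 values
```

The flattened vector has one entry per pixel of the 64×64 image, which is the input shape the SVM in the next layer expects.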
The second layer: checking for anomalies
This layer passes the output of the first layer through a pre-trained SVM. The SVM is highly selective and sensitive to the area where the system is deployed: it is trained on thousands of images (with slight variation) of that area. The parameters of the SVM used for training are listed in Table 1. The SVM model predicts whether there are any anomalies in the current frame [23]. An anomaly is anything out of the ordinary for the location where the system is deployed; it could be wild animals, humans, or other out-of-place objects. This layer is computationally light and needs little processing power; therefore, all incoming frames are checked for anomalies by this layer. If an anomaly is detected, control passes to the next layer, which confirms whether the anomaly is indeed an elephant. This layer is key to the entire system, as it acts as a filter for all incoming frames, restricting the frames that reach the next layer. The post-training accuracy of this model, after evaluation, is 98.7%.
Table 1. Parameters of the SVM used for training
In the third layer, a pre-trained CNN (trained on thousands of images of elephants) is employed to detect the presence of elephants. This is a computationally intensive task on a small computer like the Raspberry Pi. The CNN uses the MobileNet architecture along with the single-shot detection algorithm (well known for image-related detection) and predicts the probability, or confidence, of an elephant being present [24]. The layer can detect multiple elephants in a single frame, even if an elephant is not entirely in the frame. As this layer is activated only when the second layer detects an anomaly, considerable energy and time are saved on average. After training, the CNN was tested against a variety of images. For clear and obvious images, the detection confidence was more than 95% in most cases. For occluded and obstructed images, the neural network was still able to detect the elephants, although with lower confidence scores, as shown in Figure 3.

Figure 3. Confidence scores in challenging scenarios: a) 83% b) 15% c) 27% d) 52%.
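How such confidence scores translate into a decision depends on the cut-off applied to them. The sketch below applies a threshold to the four scores reported in Figure 3. The threshold of 0.5, the `Detection` type and the `confirmed_elephants` function are hypothetical and chosen only for illustration; no threshold is stated in the text.

```python
from typing import List, NamedTuple


class Detection(NamedTuple):
    label: str
    confidence: float  # value in [0, 1] reported by the detector


def confirmed_elephants(detections: List[Detection], threshold: float = 0.5) -> List[Detection]:
    """Keep only elephant detections at or above the confidence threshold."""
    return [
        d for d in detections
        if d.label == "elephant" and d.confidence >= threshold
    ]


# Confidence scores of the four challenging panels in Figure 3.
figure3 = {
    "a": [Detection("elephant", 0.83)],
    "b": [Detection("elephant", 0.15)],
    "c": [Detection("elephant", 0.27)],
    "d": [Detection("elephant", 0.52)],
}

# With the assumed threshold of 0.5, only panels a and d count as confirmed.
confirmed = {panel: bool(confirmed_elephants(dets)) for panel, dets in figure3.items()}
print(confirmed)
```

A lower threshold would also accept the occluded cases in panels b and c, at the cost of accepting more low-confidence detections in general.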
If an elephant was detected in the previous layer, the final layer draws bounding boxes wherever an elephant was detected in the current frame. Other information, such as the time, date and location, is saved along with the image for later retrieval and analysis. Alerting systems (SMS alerts, etc.) can be attached to this layer, adding the potential for end-to-end deployment. A comparative analysis of an out-of-the-box image detector and the proposed CNN detector is shown in Table 2. It reveals that the proposed approach is 1.5 times faster than the out-of-the-box image detector.
Table 2. Comparison between proposed model and out-of-the-box model
Description of dataset
The data set of images used to train and test the model was collected from real-time video recordings and various Internet sources. The recordings were made in various regions, including the Nagarhole and Bandipur national parks, on different days and under different lighting and climatic conditions. The entire data set consists of 1792 curated images, of which 67.96% contain elephants in arbitrary positions and angles, including images with multiple elephants in the frame. The collected images were compiled, standardized, preprocessed and annotated with the necessary metadata to form the data set.
Performance of the proposed system
Detection time is of the essence in such systems, and it is therefore critical that the system performs as quickly as it can. A trade-off between detection time and computing capacity has to be made. The following results were obtained by running the system on a Raspberry Pi 3 with these specifications: CPU: 1.2 GHz quad-core ARM Cortex-A53 (ARMv8 instruction set); GPU: Broadcom VideoCore IV @ 400 MHz; memory: 1 GB LPDDR2-900 SDRAM.
The average time taken by the SVM to process one frame from the camera ($t_{svm}$) is 0.12 s. The average time taken by the CNN to run detection on one frame ($t_{cnn}$) is 2.1 s. Miscellaneous tasks and alerting on one frame ($t_{m}$) take 0.2 s. Therefore, the total time ($T$) taken to process one frame that contains an anomaly is:
$$T = t_{svm} + t_{cnn} + t_{m} = 0.12 + 2.1 + 0.2 = 2.42\ \text{s}$$
If the anomaly detector (SVM) was not present, the total time to process all frames (T0) in seconds can be calculated by:
With the anomaly detector (SVM), total time to process all frames (T1) in seconds can be calculated by:
Comparing the time taken in each case:
It can be seen that the second layer speeds up the entire system by almost 4 times while keeping power and processor consumption low, because it filters out the frames that have no chance of containing an elephant. The camera’s effective frame rate depends on the average time of processing one frame.
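The effect of the SVM gate on average processing time can be illustrated with a short calculation based on the measured per-frame timings. This is a sketch under assumptions that are not stated in the text: every frame costs the CNN and miscellaneous time when there is no gate, and with the gate only a fraction `anomaly_fraction` of frames (a hypothetical parameter) reaches the CNN.

```python
T_SVM = 0.12   # s, SVM anomaly check per frame (measured value from the text)
T_CNN = 2.1    # s, CNN detection per frame (measured value from the text)
T_MISC = 0.2   # s, miscellaneous tasks and alerting per frame (from the text)


def avg_time_without_gate() -> float:
    """Average seconds per frame if every frame goes straight to the CNN."""
    return T_CNN + T_MISC


def avg_time_with_gate(anomaly_fraction: float) -> float:
    """Average seconds per frame when only flagged frames reach the CNN."""
    return T_SVM + anomaly_fraction * (T_CNN + T_MISC)


def speedup(anomaly_fraction: float) -> float:
    """Ratio of ungated to gated average time per frame."""
    return avg_time_without_gate() / avg_time_with_gate(anomaly_fraction)


# anomaly_fraction values are illustrative, not measurements.
for p in (0.1, 0.2, 0.5):
    print(f"p={p:.1f}  avg={avg_time_with_gate(p):.2f} s  speedup={speedup(p):.2f}x")
```

Under this simplified model, a speed-up of almost 4 would correspond to roughly one frame in five being flagged as anomalous; that fraction is an inference from the model, not a reported figure. The reciprocal of the average time per frame gives the effective frame rate the system can sustain.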
The performance of the machine learning model in detecting the presence of elephants in the dataset is described in the confusion matrix shown in Table 3.
Table 3. Confusion Matrix
The test results of the proposed method are shown in Table 4 in comparison with the state-of-the-art method [5], and the proposed method is observed to outperform it. The proposed method has a fairly high true positive rate and a low false positive rate, with better accuracy. Note that the test datasets for the two methods are different, and the results of the other method are taken from the respective paper.
Table 4. Comparison of test results
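The true positive rate, false positive rate and accuracy used in this comparison follow from the four cells of a binary confusion matrix. The sketch below shows the standard definitions. The counts in the example are hypothetical and are not results from this paper, and the function name `rates` is chosen for illustration.

```python
def rates(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard rates derived from a binary confusion matrix."""
    total = tp + fp + fn + tn
    return {
        "tpr": tp / (tp + fn),          # elephants present and detected
        "fpr": fp / (fp + tn),          # no elephant, but one was reported
        "accuracy": (tp + tn) / total,  # all correct decisions
    }


# Hypothetical counts for illustration only; not taken from Table 3.
example = rates(tp=90, fp=5, fn=10, tn=95)
print(example)
```

Because the two methods were evaluated on different test sets, rates computed this way are comparable only to the extent that the test sets are similar.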
When the image contains elephants hidden behind vegetation, the model tends to produce false negatives. Similarly, when a herd of elephants is in the frame, the model is not able to correctly predict the number of elephants. These are well-known problems in computational image processing, namely occlusion and clutter. Our future work will primarily focus on providing solutions for these issues.
Conclusion
The paper proposes an approach for detecting elephants in live video surveillance using a CNN. The proposed approach reduces the average computational load on the hardware, making for effective resource and power usage without compromising detection accuracy. Once a detection is confirmed, a trigger can be passed to an alert module to inform the locals about the presence of elephants. The proposed work emphasizes the wellbeing of both elephants and locals in the area where it is deployed. Deploying the system in the identified region would help reduce the number of HEC incidents, which can contribute to improved conservation, management, and research on elephants. The system can also be added to automate existing camera-based surveillance systems.
