Abstract
Traditional machine learning-based pest classification is a tedious and time-consuming process. Multi-class pest detection based on deep learning and convolutional neural networks offers a solution, since it automatically extracts the complex features of different pests from crop pest images. In this paper, several significant deep learning-based object detection models, namely SSD, EfficientDet, Faster R-CNN, and CenterNet, are implemented using the TensorFlow Object Detection framework. MobileNet_V2, ResNet101_V1, Inception_ResNet_V2, EfficientNet, and HourGlass104 are employed as backbone networks for these models to extract the different features of the pests. Object detection models are capable of identifying and locating pests in crops. The models are first pre-trained on the COCO dataset and then fine-tuned on a target pest dataset of 20 pest classes. Experiments on this pest dataset show that Faster R-CNN_ResNet101_V1 outperformed every other model, achieving an mAP of 74.77%. Additionally, it is a lightweight model of about 9 MB that can detect pest objects in 130 milliseconds per image, allowing it to be used on the resource-constrained devices commonly used by farmers.
Introduction
India is one of the leading producers of many crops worldwide. As of 2018, agriculture provided the livelihood of over 50% of people in India and generated 17–18% of India’s GDP. Hence, productivity in agriculture is an essential factor for India’s economy. Raising agricultural productivity also lets farmers earn more profit and helps people meet their needs.
The main threats that curtail crop yields are weeds, diseases, pests, and unseasonal climatic changes. This research focuses on pests that affect production in agricultural land and on the different methods of identifying these pests at an early stage. Pests are creatures that damage vegetation and introduce diseases into it. Hence, detecting pests in their earliest stages results in better productivity: farmers can harvest larger quantities of good-quality crops and vegetables, and every individual farmer can earn more profit from the yield, contributing to the growth of the Indian economy.
The traditional method of manual pest identification is time-consuming, labor-intensive, and expensive, and it makes pests hard to identify in their early stages [39]. Machine learning-based pest identification is effective but can be very tedious. It involves several stages, such as image pre-processing, feature extraction, and classification. Because pests have complicated structures and different species closely resemble one another, manually extracting features and classifying pests is an arduous task, and hand-crafted features may limit the model’s discrimination power. To overcome this challenge, a deep learning-based pest detection model has been developed that takes the original pest images as input and automatically extracts the complex features of each pest [8].
Many of the existing deep learning-based multi-class pest identification models predict only one object in an image. If an image contains more than one object, a multi-class classification model assigns the image to the single class for which the last layer gives the highest probability. In the agricultural field, however, multiple pests may be present on a single crop or plant. This issue is addressed by building the pest detection model as a multi-label classification model using the TensorFlow object detection framework. These object detection models can classify multiple pests in a single image and also identify where the pests are located in the image.
Object detection is one of the challenging tasks in computer vision that entails determining the existence, location, and type of one or more objects in an image. The object detection tasks are accomplished using Convolutional Neural Network (CNN) based deep learning models. Object detection models can be categorized into two types based on their construction and functionality: one-stage and two-stage detectors. A two-stage detector consists of two phases: the region proposal phase and the classification phase. The first phase proposes various object candidates called regions of interest (RoI) and the second phase categorizes the proposals. The locations of proposals are fine-tuned during the second phase [17]. In contrast, a one-stage detector uses a single convolutional network to provide the bounding boxes and the object classification without using region proposals [17]. For example, Faster R-CNN, Cascade R-CNN, and Mask R-CNN are two-stage detectors, and SSD, EfficientDet, and CenterNet512 are one-stage detectors. The general structure of one-stage and two-stage detectors is shown in Fig. 1.

The general structure of Object detection models, a) One-stage detectors, b) Two-stage detectors (Faster R-CNN).
During the object detection process, detecting objects at different scales is laborious, especially for smaller objects. A pyramid of images containing the same object at different scales can be used to detect objects at different scales, but this approach is time-consuming and demands a large amount of memory for end-to-end training. Instead, the object detection process uses a pyramid of multi-scale feature maps. The Feature Pyramid Network (FPN) acts as a feature extractor in an object detection network and produces this pyramid of multi-scale feature maps. The FPN is independent of the convolutional architecture of the object detection model and constructs the feature pyramids that the object detection model uses.
Transfer learning is an effective technique used in computer vision applications to construct an accurate deep learning model in a shorter training time [25]. Transfer learning can be accomplished by using pre-trained models. The model is initially trained on a larger benchmark dataset, such as ImageNet [30], MS COCO [23], or PASCAL VOC [5, 13], and then fine-tuned on the target pest dataset [28, 42]. Instead of training the models from scratch, models are fine-tuned using weights learned by already-trained models. As a result, the training time required for building a significant model decreases.
The literature survey has been conducted along two lines. The first examines the best-performing object detection models, especially those suited to small objects with complex features, because pests are generally small and have distinctive features. The second covers existing pest detection works.
Object detection models
Tan et al. 2020 developed EfficientDet, a one-stage detector, and evaluated its configurations D0 to D7 on the COCO dataset, achieving better accuracy with fewer parameters [36]. Yankun et al. 2021 proposed a vehicle tracking algorithm based on the large motion trend combined with the color histogram, in which EfficientDet achieved a higher mAP than Mask R-CNN as the object detector [44].
For automatic object detection from high-resolution panchromatic (PAN) images of military and civilian fields, Hou et al. 2020 developed an accurate and fast object detection model, called a refined single-shot multi-box detector [14]. The motivation behind the work proposed by Chen et al. 2021 was guiding model selection. In this work, all the models are initially trained on the COCO dataset and later fine-tuned on the target KITTI dataset to detect cars and pedestrians [6].
Xu et al. 2019 proposed a global average pooling (GAP) based adversarial Faster-RCNN to generate robust samples and enhance the performance of object detection algorithms [43]. Ahmed et al. 2019 used Faster-RCNN and Mask-RCNN models pre-trained on a frontal view dataset to facilitate video analytics in the IoT for overhead view multiple object detection and segmentation [13].
Liu et al. 2020 constructed an anchor-free CNN architecture and a frame-by-frame technique incorporating a lightweight stacked hourglass network to predict the heatmap at the center point of a surgical tool for real-time surgical tool detection during robot-assisted surgery [18]. On the VisDrone2019 dataset, Pailla et al. 2019 employed CenterNet with the HourGlass-104 backbone network for real-time object detection, outperforming other significant object detection models [26].
Existing pest identification approaches
Thenmozhi and Reddy 2019 introduced a model for pest management which achieved better classification accuracy than other pre-trained models [39]. It is a multi-class classification model. Panchbhaiyye and Ogunfunmi 2018 used VGG16, ResNet, and Inception for identifying pests in agricultural land [28]. Roldán-Serrato et al. 2018 proposed an automatic pest detection system for detecting pests on potato and bean crops [29]. The authors used neural classifiers, namely RSC (Random Subspace Classifier) and LIRA (Limited Receptive Area), in this pest detection task.
Chen et al. 2021 proposed an AI-based pest detection system that detects pests based on pest images [7]. Their work utilized different detection models to detect three types of pests, namely Mealybugs, Coccidae, and Diaspididae. The motivation behind the work proposed by Selvaraj et al. 2019 was to utilize the benefits of transfer learning on various object detection models to detect banana pests and disease symptoms on different parts of banana plants [35]. The proposed model can distinguish between healthy and infected plant portions for several diseases.
Li et al. 2019 proposed an effective augmentation strategy for a CNN-based method to detect pests [21]. In this work, a CNN is used as a backbone to extract the pest object’s features. These features are passed on to an RPN to obtain the pest object’s location and class. Four classes of pest are considered in this work, namely wheat sawfly, wheat aphid, wheat mite, and rice planthopper.
Dataset and Pre-processing
The dataset used in the proposed work is one of the benchmark pest datasets, namely IP102: A Large-Scale Benchmark Dataset for Insect Pest Recognition (v1.1), provided in the research work [41]. This dataset contains 102 different types of pests, out of which 20 different types are randomly chosen and used in this study. These pest types are Molecricket, Redspider, Wireworm, Armyworm, Grub, Ampelophaga, Lytta Polita, Meadow Moth, Pieris Canidia, Wheat Sawfly, Xylotrechus, Cicadellidae, Cicadella Viridis, Miridae, Papilio Xuthus, Prodenia Litura, Lawana Imitata Melichar, Salurnis Marginella Guerr, Apolygus lucorum, and Locustoidea.
First-level data pre-processing includes a data cleaning step that removes insignificant and unclear images, which negatively impact the model and lead to misclassification. More importantly, the dataset for this work is prepared with a larger number of pest images taken in actual agricultural field conditions and a smaller number of lab-conditioned pest images, in a ratio of approximately 7:3. Some sample lab-conditioned and agricultural field pest images are shown in Fig. 2. A model trained only with lab-conditioned images will not perform well in the actual agricultural field [27]. Incorporating both agricultural field and lab-conditioned images in the training dataset makes it possible to construct a model capable of identifying pests in agricultural fields.

Sample pest images from the dataset (both lab-conditioned and field images). The images in (a) to (f) and (g) to (l) are the agricultural field images and lab-conditioned images, respectively, of the Mole Cricket, Armyworm, Lytta Polita, Meadow Moth, Xylotrechus, and Apolygus Lucorum pest classes.
While building the training and testing datasets, the dataset is split in an 8:2 ratio. Initially, the pest images in the dataset are of variable size, and it is hard to train deep learning models with pest images of varying scale. The training images should be resized to a fixed size to facilitate mini-batch learning and let the CNN model work properly. Here all the pest images are resized to 512×512.
To create a balanced distribution of images among the classes, approximately 780 images per class are taken, comprising 624 images for training and 156 images for testing, which follows the 8:2 split ratio. Of the 20 pest classes used in this work, six pest classes, namely Mole cricket, Xylotrechus, Cicadellidae, Cicadella Viridis, Miridae, and Prodenia Litura, have sufficient instances. The training and test images for the remaining 14 pest classes are brought up to the above size using the augmentation process.
Augmentation is the process of creating transformed versions of images using various operations like shearing, rotation, width and height shifting, horizontal and vertical flipping, and zooming [9, 12]. The effect of these augmentation operations on a sample pest image is shown in Fig. 3. The transformed images act as different instances, turning the dataset into a rich and sufficient one. Hence, augmentation helps the model to perform well, become more accurate, and avoid overfitting.
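The per-class 8:2 split described above can be illustrated with a minimal Python sketch. The function name, the fixed seed, and the file names are hypothetical and only serve the example; the text does not state how the split was actually implemented.

```python
import random


def split_class_images(image_ids, train_fraction=0.8, seed=0):
    """Shuffle one class's image ids and split them into train and test lists."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    n_train = round(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]


# Hypothetical file names for one pest class with 780 images.
images = [f"molecricket_{i:04d}.jpg" for i in range(780)]
train, test = split_class_images(images)
print(len(train), len(test))  # 624 156
```

Splitting each class separately, as in this sketch, keeps the 624/156 proportion identical across all 20 classes.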

Effect of different augmentation operations, a) Original Image, b) Horizontal Flip, c) Vertical Flip, d) Shearing, e) Width shift, f) Height Shift, g) Rotation, h) Zoom.
Further, mosaic augmentation is utilized for preparing the training dataset, which forms a new image by combining multiple pest regions [40]. Since four-tile mosaic augmentation is used, four pest images are combined to create each mosaic augmented image [3]. In this work, the mosaic augmented images are created in the following three ways: single-pest, two-pests, and four-pests mosaic augmented images.
The four regions of single-pest, two-pests, and four-pests mosaic augmented images contain the same pest class, two different pest classes, and four different pest classes, respectively. Some sample mosaic augmented images are shown in Fig. 4. Including these mosaic augmented images in the training dataset helps the model to detect multiple pests in the same image. It also enhances the detection of pests outside their regular context.

Sample mosaic augmented images, a) Single-pest mosaic augmented image containing the Grub pest class, b) Two-pests mosaic augmented image containing the Salurnis Marginella Guerr and Lawana Imitata Melichar pest classes, c) Four-pests mosaic augmented image containing the Cicadellidae, Cicadella Viridis, Xylotrechus, and Wheat Sawfly pest classes, d) Four-pests mosaic augmented image containing the Wireworm, Locustoidea, Molecricket, and Redspider pest classes.
In this work, the cutout augmentation technique, which randomly masks out square regions of training images, is also applied to the training dataset. The main motive of cutout augmentation is to let the model detect partial and occluded objects [11]. For example, it helps the model to differentiate a pest that overlaps with another pest. It also lets the model concentrate on pest objects that are only partially visible. For reference, some sample cutout augmented images are shown in Fig. 5.
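The idea of a four-tile mosaic can be sketched in a few lines of Python. This is a simplified illustration that assumes four equally sized tiles placed in a fixed 2×2 grid and represents images as nested lists of pixel values; the Roboflow tooling used in this work may additionally crop or rescale the tiles, and the function names are hypothetical.

```python
def mosaic_four(top_left, top_right, bottom_left, bottom_right):
    """Stitch four equally sized tiles (lists of pixel rows) into a 2x2 mosaic."""
    top = [left + right for left, right in zip(top_left, top_right)]
    bottom = [left + right for left, right in zip(bottom_left, bottom_right)]
    return top + bottom


def solid_tile(value, size=2):
    """Build a size x size tile filled with one value, standing in for a pest image."""
    return [[value] * size for _ in range(size)]


# Four distinct values mimic a four-pests mosaic; reusing one value mimics single-pest.
tiles = [solid_tile(v) for v in (1, 2, 3, 4)]
mosaic = mosaic_four(*tiles)
for row in mosaic:
    print(row)
```

With 256×256 tiles, the same stitching would yield the 512×512 input size used in this work.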

Sample cutout augmented images.
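A minimal Python sketch of cutout is given below for illustration. It assumes a single square mask that lies entirely inside the image and is filled with zeros, and it represents the image as a nested list; the mask size, the fill value, and the function name are assumptions, not settings reported in this work.

```python
import random


def cutout(image, mask_size, rng, fill=0):
    """Return a copy of `image` with one random mask_size x mask_size square set to `fill`."""
    height, width = len(image), len(image[0])
    top = rng.randrange(height - mask_size + 1)
    left = rng.randrange(width - mask_size + 1)
    masked = [row[:] for row in image]  # copy so the input image is left untouched
    for y in range(top, top + mask_size):
        for x in range(left, left + mask_size):
            masked[y][x] = fill
    return masked


image = [[255] * 8 for _ in range(8)]
masked = cutout(image, mask_size=3, rng=random.Random(0))
print(sum(pixel == 0 for row in masked for pixel in row))  # 9
```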
Following this augmentation process, both the training and testing dataset images are brought into the annotation process. Annotation is the process of attaching bounding boxes to the pest region in the images. The details like selected pest class and bounding box coordinates are stored in its label file.
This annotation process was done with the help of the LabelImg (v1.8.0) tool. Label files are created in the Pascal VOC format, which can be used by all models implemented on the TensorFlow 2.0 framework. The format of the label file is visualized in Fig. 6. The complete details of the dataset used in this work before and after augmentation are shown in Table 1.

Annotation process, a) Bounding the pest object in an image using the LabelImg tool, b) Corresponding label file in Pascal VOC format.
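To make the label format concrete, the following Python sketch reads the class name and bounding box coordinates from a Pascal VOC style XML label. The sample label content (file name, image size, and coordinates) is made up for illustration, and `read_boxes` is a hypothetical helper, not part of LabelImg or TensorFlow.

```python
import xml.etree.ElementTree as ET

# Hypothetical label file content; the file name, size, and coordinates are made up.
SAMPLE_LABEL = """
<annotation>
  <filename>grub_0001.jpg</filename>
  <size><width>512</width><height>512</height><depth>3</depth></size>
  <object>
    <name>Grub</name>
    <bndbox><xmin>120</xmin><ymin>96</ymin><xmax>340</xmax><ymax>410</ymax></bndbox>
  </object>
</annotation>
"""


def read_boxes(xml_text):
    """Return a list of (class_name, xmin, ymin, xmax, ymax) tuples."""
    root = ET.fromstring(xml_text.strip())
    boxes = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        coords = [int(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")]
        boxes.append((obj.findtext("name"), *coords))
    return boxes


print(read_boxes(SAMPLE_LABEL))  # [('Grub', 120, 96, 340, 410)]
```

An image with several pests simply carries several `object` elements, which is what allows one label file to describe multiple pests in the same image.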
The details of the dataset used in this pest detection work
In this work, four significant deep learning-based object detection models are utilized for the pest detection task, namely SSD (Single-Shot Detector) [15, 34], EfficientDet [16, 36], Faster R-CNN [1, 15], and CenterNet [10, 46]. Here, the MobileNet_V2 [33, 37], ResNet101_V1 [2], Inception_ResNet_V2 [6], EfficientNet [31, 38], and HourGlass104 [24] networks are used as backbone networks for extracting the discriminative features of the various pests, and these features are utilized by the object detection models for detecting pests. These models are initially pre-trained on the COCO dataset [23].
Object detection models
SSD
SSD stands for Single Shot Detector. This deep learning model is a one-stage object detector that predicts the object type and bounding box coordinates directly, without creating region proposals. The SSD architecture determines which bounding boxes in each image should be regarded as objects and then classifies the type of object based on the area inside each bounding box [4]. The main advantage of this model is that it detects the multiple objects present in an image in a single pass. As a result, the SSD model has a substantially faster detection speed than two-stage detectors like Faster R-CNN, while the detection accuracy of the two approaches is nearly the same. In this work, the SSD model is combined with two different backbone networks, ResNet101_V1 and MobileNet_V2.
EfficientDet
The EfficientDet-D1 object detection model, which follows the one-stage detector paradigm, is utilized in this pest detection work. It employs EfficientNet [38] as the backbone network, a Bi-directional Feature Pyramid Network (BiFPN) as the feature network, and a shared class/box prediction network [36]. The BiFPN optimizes multi-scale feature fusion: it collects the features from level 3 to level 7 of the backbone network and repeatedly applies bidirectional (both top-down and bottom-up) feature fusion to them. These fused features are then passed to the class/box network to generate class and box predictions. The weights of the class/box networks are shared among all feature levels, as in [22].
Faster R-CNN
Faster R-CNN is one of the most widely used object detection models that follows the two-stage detection paradigm. A significant aspect of this model is that it combines a Region Proposal Network (RPN) with the Fast R-CNN detector [7]. Initially, the input images are fed to convolution layers that act as feature extractors, and their output (the feature map) is given to the RPN [19]. The RPN then generates region proposals on this feature map to select candidate regions for the input image. This RPN replaces the traditional selective search used in Fast R-CNN, which is time-consuming. As backbone networks for Faster R-CNN, ResNet101_V1 and Inception_ResNet_V2 are used for extracting features from the image.
CenterNet
The CenterNet model used here is an anchor-free object detection model [10]. Anchor-based detection models usually create a large number of predictions. The prediction with the highest overlap with the object and the highest confidence score is selected, and every other prediction is ignored. Hence, anchor-based models spend more time on irrelevant predictions. CenterNet is a keypoint-based object detection approach. As a first step, it finds the center of the box and treats it as both the object and a key point. From this predicted center, it derives the other coordinates of the bounding box. The HourGlass104 and ResNet101_V1 networks are used as the backbone networks for this pest detection task.
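The step from a predicted center to a bounding box can be illustrated with a small Python sketch. It assumes a single object, a center heatmap given as a nested list, and a per-cell (width, height) prediction; the sub-pixel offsets and multi-peak extraction that a full CenterNet implementation uses are omitted, and `decode_center` is a hypothetical name.

```python
def decode_center(heatmap, sizes):
    """Pick the highest-scoring heatmap cell and turn it into one bounding box.

    heatmap: 2D list of center scores; sizes: 2D list of (width, height) per cell.
    Returns (xmin, ymin, xmax, ymax, score) in heatmap-cell units.
    """
    score, cy, cx = max(
        (value, y, x)
        for y, row in enumerate(heatmap)
        for x, value in enumerate(row)
    )
    w, h = sizes[cy][cx]
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, score)


heatmap = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.3, 0.0],
    [0.1, 0.3, 0.2, 0.0],
]
sizes = [[(2, 2)] * 4 for _ in range(3)]
print(decode_center(heatmap, sizes))  # (0.0, 0.0, 2.0, 2.0, 0.9)
```

Because only the peak of the heatmap is decoded, no set of overlapping anchor predictions has to be generated and discarded.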
Backbone networks
ResNet101_V1
The key concept behind the ResNet is to establish residual connections between convolution layers to combine output from previous layers with output from stacked layers [2]. In turn, this allows us to train a much deeper network. ResNet101_V1 is a deep residual network used here as a feature extractor. It employs 101 convolution layers for extracting the features from the image.
MobileNet_V2
MobileNet_V2 is an updated version of MobileNet_V1 and serves as a feature extractor for the SSD model in this pest detection task. As in MobileNet_V1, the standard convolution operations in the convolution layers of MobileNet_V2 [33] are replaced by depth-wise separable convolutions, each consisting of a depth-wise convolution followed by a point-wise convolution. This helps to build a computationally efficient model. Unlike MobileNet_V1, MobileNet_V2 also has a residual layer structure similar to the ResNet architecture [37].
Inception_ResNet_V2
The key idea behind Inception_ResNet_V2 is to combine the advantages of Inception units with residual connections [6]. Here, residual connections are used to combine the multiple-size convolution filters. Aside from avoiding the problems of very deep architectures, this also reduces training time and helps the model to attain better accuracy in a shorter time.
HourGlass104
HourGlass104 is an HourGlass network whose depth is 104 layers. An HourGlass is a specialized form of fully convolutional neural network that follows Encoder-Decoder based architecture [24]. This model extracts the feature map from the input images and then combines earlier layers of the model containing higher spatial information than feature information, thereby helping to identify the object’s location.
EfficientNet
The primary building block of EfficientNet is MBConv, to which squeeze-and-excitation optimization is also added. The structure of MBConv is the same as that of the residual blocks used in MobileNet_V2, which form a residual connection between the start and end of a convolution block [38]. To optimize the accuracy and efficiency of the model, EfficientNet uses a compound scaling method. Compound scaling is the process of jointly scaling the network depth, width, and input resolution, which results in reduced model size and FLOPs.
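Compound scaling can be illustrated with a short Python sketch. The base coefficients below (1.2 for depth, 1.1 for width, 1.15 for resolution) are the values reported in the EfficientNet paper [38] and are not taken from this work; the function names are hypothetical, and the FLOPs estimate uses the approximation that cost grows linearly with depth and quadratically with width and resolution.

```python
# Base coefficients from the EfficientNet paper (an assumption here, not from this work).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution


def compound_scale(phi):
    """Return the depth, width, and resolution multipliers for compound coefficient phi."""
    return {"depth": ALPHA ** phi, "width": BETA ** phi, "resolution": GAMMA ** phi}


def flops_multiplier(phi):
    """Approximate FLOPs growth: linear in depth, quadratic in width and resolution."""
    s = compound_scale(phi)
    return s["depth"] * s["width"] ** 2 * s["resolution"] ** 2


for phi in range(3):
    print(phi, round(flops_multiplier(phi), 2))
```

With these coefficients, each unit increase of the single coefficient phi roughly doubles the FLOPs, so all three dimensions are controlled by one parameter instead of being tuned independently.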
Experimental environment and evaluation metrics
Experimental setup design
In this work, various experiments are conducted with four object detection models and five backbone networks. Based on the combinations listed in Table 2, these four object detection models yield seven variants: SSD-MobileNet_V2, SSD-ResNet101_V1, EfficientDet with EfficientNet, Faster R-CNN-ResNet101_V1, Faster R-CNN-Inception_ResNet_V2, CenterNet-HourGlass104, and CenterNet-ResNet101_V1. All these variants are evaluated on the dataset described in Table 1.
Different Combinations of object detection models with the backbone networks (feature extractors)
All the seven object-model variants and their experiments are implemented on the Colab-Pro platform, which provides GPU Tesla T4, P100, 25GB of virtual RAM, and 166GB of virtual disk space. Before training the models, various augmentation techniques were used to enrich the instances in the dataset [45]. The augmentation operations like shearing, rotation, width and height shifting, horizontal flipping, and zooming are implemented using Keras API - Image Data Preprocessing. The mosaic and cutout augmentation are implemented using tools provided by Roboflow. The integration of bounding boxes to these pest regions and assigning their respective pest class are done using LabelImg (v1.8.0) annotation tool.
Initially, all seven object detection model variants are pre-trained on the COCO dataset [23] and then fine-tuned on the target pest dataset to detect pests in the agricultural field. This transfer learning approach helps the models increase their detection ability. Each model is trained until it reaches its best performance. From Table 1, the dataset used in this work contains about 15,603 images of 20 pest classes, of which 12,614 images are used for training and 3,089 images are used for testing. The split ratio followed here is 8:2 for training and testing. All the images are resized to 512×512.
For evaluating the model using unseen images (test dataset), COCO detection metric_set is also used. It computes the intersection over union, mAP at IoU thresholds, and precision/recall values for small, medium, and large objects.
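Intersection over union (IoU), on which these metrics are built, can be computed as in the following minimal Python sketch. It assumes axis-aligned boxes given as (xmin, ymin, xmax, ymax) tuples; the function name is hypothetical and this is not the COCO evaluation code itself.

```python
def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two 10x10 boxes overlapping on half their width: 50 / 150.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
```

A detection counts as a true positive at a given IoU threshold (for example 0.5 or 0.75) only if its IoU with a ground-truth box of the same class reaches that threshold.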
The performance of all the above object detection models is evaluated using the following metrics: Accuracy (Acc), Precision (Pr), Recall (Rc), F1-Score (F1), Average Precision (AP), Average Recall (AR), mean Average Precision (mAP), Model Loading Time (MLT), Average Detection Time (ADT), and Average Processing Time (APT).
Accuracy, Precision, Recall, and F1 Score
The Accuracy (Acc) of a model is the ratio of correctly predicted objects to the total number of predictions. The Precision (Pr) is the ratio of correctly predicted positive observations to all predicted positive observations. Recall (Rc) is the ratio of correctly predicted positive observations to the total number of observations in the actual class. A well-performing object detection model should have both high precision and high recall, which is measured by the F1-score (F1), the harmonic mean of recall and precision. The precision and recall values are also the sub-metrics used to calculate AP, AR, and mAP. The mathematical forms of Accuracy, Precision, Recall, and F1-Score are given in Equations (1), (2), (3), and (4), respectively.
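In their standard form, which is assumed here to be what Equations (1) to (4) denote, these four metrics can be written as:

```latex
\begin{align}
\mathrm{Acc} &= \frac{TP + TN}{TP + TN + FP + FN} \tag{1}\\
\mathrm{Pr}  &= \frac{TP}{TP + FP} \tag{2}\\
\mathrm{Rc}  &= \frac{TP}{TP + FN} \tag{3}\\
\mathrm{F1}  &= \frac{2 \cdot \mathrm{Pr} \cdot \mathrm{Rc}}{\mathrm{Pr} + \mathrm{Rc}} \tag{4}
\end{align}
```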
where TP is True Positive, FP is False Positive, FN is False Negative and TN is True Negative.
Average precision is a combined measure of recall and precision for ranked intervals. The area under the interpolated precision-recall curve is defined as AP, and its mathematical form is mentioned in Equation (5).
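The area under the interpolated precision-recall curve can be computed as in the following minimal Python sketch. It assumes all-point interpolation over recall and precision values that are already ordered by detection rank; the COCO evaluation instead samples precision at 101 fixed recall levels, so its numbers can differ slightly. The function name and the example detections are hypothetical.

```python
def average_precision(recalls, precisions):
    """Area under the interpolated precision-recall curve.

    recalls must be sorted in increasing order, with precisions aligned to them.
    """
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Interpolate: each precision becomes the maximum precision at any higher recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangle areas wherever recall increases.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


# Hypothetical ranked detections: TP, FP, TP against 2 ground-truth objects.
recalls = [0.5, 0.5, 1.0]
precisions = [1.0, 0.5, 2 / 3]
print(round(average_precision(recalls, precisions), 4))  # 0.8333
```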
Average Recall (AR) is the recall averaged over the IoU threshold values from 0.5 to 1.0. It can be calculated as two times the area under the recall-IoU curve. Its mathematical form is given in Equation (6).
The mean Average Precision (mAP) is the mean of the AP values across all N classes in the dataset, where each AP is computed for a single class. The mathematical form of mAP is given in Equation (7).
Average Detection Time (ADT) is the average time taken to detect pest objects over the entire test dataset. The Model Loading Time (MLT) is the time required to load the trained model into the running environment for detecting the pest objects. The Average Processing Time (APT) is the time required to process an image in order to detect the pest objects in it. The mathematical forms of ADT and APT are given in Equations (8) and (9).
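Written out under the definition above, and assuming the symbol o for the IoU threshold (a notation introduced here for illustration), Average Recall takes the form:

```latex
\begin{equation}
\mathrm{AR} = 2 \int_{0.5}^{1} \mathrm{recall}(o)\, \mathrm{d}o \tag{6}
\end{equation}
```

The factor of two normalizes the integral over the interval of length 0.5, so that a detector with perfect recall at every threshold obtains AR = 1.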
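Following this definition, and assuming AP_i denotes the Average Precision of the i-th class (an index notation introduced here for illustration), mAP can be written as:

```latex
\begin{equation}
\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{AP}_i \tag{7}
\end{equation}
```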
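The relation between these timing metrics can be sketched in Python as follows. The sketch assumes that ADT is the mean per-image detection time over the test set and that APT is the sum of MLT and ADT; the second assumption is inferred from the figures reported for Table 5 (for example, an MLT of 14 sec and an ADT of 53 ms against a reported total of 14.05 sec) and is not a formula stated in the text. The function names and the per-image times are hypothetical.

```python
def average_detection_time_ms(per_image_times_ms):
    """Mean detection time per image over the test set, in milliseconds."""
    return sum(per_image_times_ms) / len(per_image_times_ms)


def average_processing_time_s(mlt_s, adt_ms):
    """Assumed relation: model loading time plus average detection time, in seconds."""
    return mlt_s + adt_ms / 1000.0


# Hypothetical per-image timings that average to 53 ms.
adt = average_detection_time_ms([50.0, 53.0, 56.0])
print(f"{average_processing_time_s(14, adt):.3f}")  # 14.053
```

The loading time dominates the total by more than two orders of magnitude, which is why model size matters as much as per-image speed on resource-limited devices.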
In this section, the performance of the object detection models, namely SSD-MobileNet_V2, SSD-ResNet101_V1, EfficientDet, Faster R-CNN-ResNet101_V1, Faster R-CNN-Inception_ResNet_V2, CenterNet-HourGlass104, and CenterNet-ResNet101_V1, is demonstrated on the pest detection task. As mentioned above, all these models are pre-trained on the COCO dataset [23] and later fine-tuned on the pest dataset described in Table 1.
Initially, all the object detection models utilized in this pest detection task are evaluated using Accuracy (Acc), Precision (Pr), Recall (Rc), and F1-Score (F1). The values achieved by these models are recorded in Table 3. Among the results presented in Table 3, Faster R-CNN-ResNet101_V1 achieved the highest Accuracy (Acc), Precision (Pr), Recall (Rc), and F1-Score (F1) values of 91.09%, 92.62%, 91.16%, and 91.89%, respectively. CenterNet-ResNet101_V1 and SSD-MobileNet_V2 also performed well in the pest detection task, achieving the second and third highest values in the Acc, Pr, Rc, and F1 metrics. Furthermore, the top three models are evaluated at the level of each pest class by examining the Acc and F1 score, and the values are recorded in Table 4. The confusion matrices of these three models are presented in Fig. 7.
Comparison of Accuracy(Acc), Precision(Pr), Recall(Rc), F1 Score(F1) achieved by the Tensorflow based Object Detection models used in this work
Comparison of metrics achieved in individual pest class levels by the top three good performing models, Faster RCNN-ResNet101_V1 (PM1), CenterNet-Resnet101_V1(PM2), SSD MobileNet_V2 (PM3)

Confusion Matrix produced by top three better-performing object detection models, a) Faster RCNN-ResNet101_V1, b) CenterNet-Resnet101_V1, c) SSD MobileNet_V2, d) Pest class and Pest name details.
From Fig. 7, the Faster R-CNN-ResNet101_V1 model performs well and attains the highest number of True Positive (TP) values for all the pest classes except Cicadella Viridis (P13), for which it produces 82 TP out of 156 instances. Despite producing more TP results, it also produces a slightly higher number of False Positive (FP) values for the pest classes Cicadellidae, Papilio Xuthus, and Salurnis Marginella Guerr than the other better-performing models. Hence, the Faster R-CNN-ResNet101_V1 model achieves the highest Acc and F1 values in all pest classes except these four, and the difference in those four is negligible.
In the pest classes Cicadellidae and Cicadella Viridis, Papilio Xuthus, and Salurnis Marginella Guerr, the CenterNet-ResNet101_V1 and SSD-MobileNet_V2 models attained the highest Acc and F1 values, respectively. Because of the complex structure and small size of the Redspider pest, all the better-performing models have difficulty detecting it. Hence, the F1 value of the Redspider pest class is lower than that of the other pest classes.
The ability to detect pest objects precisely is evaluated with AP, AR, and mAP, whereas the ability to respond rapidly is evaluated with three metrics: ADT, MLT, and APT.
The mean Average Precision (mAP) is the most common metric for evaluating the performance of an object detection model. The mAP attained by all the models is recorded in Table 5, and the comparison chart is visualized in Fig. 8. Since the models are intended for deployment on resource-limited devices, their response time is first tested in the current running environment using the ADT, MLT, and APT metrics, to determine how quickly they respond to pest objects in an input test image. The ADT, MLT, APT, and the corresponding final model size are also recorded in Table 5.

The comparison chart of mAP (mean Average Precision) achieved by the different models used in this pest detection task.
The comparison of mAP, response time, and model size of the different models used in the pest detection work
From Table 5, concerning mAP, the Faster R-CNN-ResNet101_V1 model achieves a better performance among all, that is 74.77%, while CenterNet-ResNet101_V1 and SSD-MobileNet_V2 have achieved the second and third largest values, which are 73.42% and 72.77%, respectively. From Table 5, SSD-MobileNet_V2 is the least-size model, and its size is 7.98 MB. The Faster R-CNN-ResNet101_V1 and SSD-ResNet101_V1 models are the second and third least-size models, 8.95 MB and 15.73 MB, respectively.
Regarding Average Detection Time (ADT), the SSD-MobileNet_V2 model consumes very less time for detecting the pest objects in an image, that is 14.05 sec (ADT = 53 ms, MLT = 14 sec). After that, Faster R-CNN-ResNet101_V1 and CenterNet-ResNet101_V1 consume less time, that is 20.13 sec (ADT = 130 ms, MLT = 20 sec) and 34.08 sec (ADT = 85 ms, MLT = 34 sec). Since the model Faster R-CNN-ResNet101_V1 follows a two-detection strategy, its ADT is a little larger than SSD-MobileNet_V2 and CenterNet-ResNet101_V1 but at the same time, its MLT is less when compare to CenterNet-ResNet101_V1, because its model size is lesser than CenterNet-ResNet101_V1. Even though the CenterNet-ResNet101_V1 model consumes lesser ADT(85ms), its MLT is larger (34 sec) because of its model size, 16.31 MB.
In addition, AP and AR metrics are calculated based upon different IoU scales, object dimensions, and the number of detections. There are a total of 11 metrics, namely APsmall, APmedium, APlarge, AP @ IoU = 0.75, AP @ IoU = 0.5, AR @ 1, AR @ 10, AR @ 100, AR @ 100 (small), AR @ 100 (medium), and AR @ 100 (large). The detailed results attained by the object detection models on these 11 metrics are presented in Table 6.
The comparison of other metrics achieved by the different TensorFlow-based object detection models used in this pest detection task
Considering the other metrics in Table 6, the Faster R-CNN ResNet101_V1 model achieved the highest value in APlarge, AR @ 1, AR @ 10, AR @ 100, AR @ 100 (medium), and AR @ 100 (large). The CenterNet-ResNet101_V1 model achieved the highest value in APmedium and AP @ IoU = 0.75. The highest scores in APsmall, AP @ IoU = 0.5, and AR @ 100 (small) are achieved by SSD-MobileNet_V2, Faster R-CNN-Inception_ResNet_V2, and CenterNet-HourGlass104, respectively.
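The IoU thresholds and object-size categories behind these AP and AR variants can be sketched as follows. The helper names are hypothetical. The area limits of 32² and 96² pixels are those of the COCO evaluation protocol, which is assumed here because the models are pre-trained on COCO; box area is used as a simplification of the object area.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (ymin, xmin, ymax, xmax)."""
    ymin = max(box_a[0], box_b[0])
    xmin = max(box_a[1], box_b[1])
    ymax = min(box_a[2], box_b[2])
    xmax = min(box_a[3], box_b[3])
    intersection = max(0.0, ymax - ymin) * max(0.0, xmax - xmin)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def size_category(box):
    """COCO-style size bucket from the box area in pixels.

    Boxes exactly on a limit are placed in the larger bucket here.
    """
    area = (box[2] - box[0]) * (box[3] - box[1])
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"


# Hypothetical ground-truth box and two hypothetical predictions, in pixels.
ground_truth = (0, 0, 100, 100)
loose_prediction = (0, 50, 100, 150)
tight_prediction = (0, 10, 100, 110)

for name, prediction in [("loose", loose_prediction), ("tight", tight_prediction)]:
    overlap = iou(ground_truth, prediction)
    print(f"{name}: IoU = {overlap:.3f}, "
          f"match at 0.5: {overlap >= 0.5}, match at 0.75: {overlap >= 0.75}")
print("ground-truth size category:", size_category(ground_truth))
```

A prediction counts as correct for AP @ IoU = 0.5 only if its IoU with a ground-truth box is at least 0.5, and AP @ IoU = 0.75 applies the stricter 0.75 threshold, so it rewards tighter localization.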
The object detection task is more challenging than the object classification task: object classification models only classify objects, while object detection models both locate and classify the objects present in an image. It is even more challenging in the agricultural field, since for some kinds of pests the pest color is close to the background color. Hence, it is hard to distinguish the pest from the background.
Detection images from all the object detection models are shown in Fig. 9 for qualitative comparison. For this, an unseen image from the test dataset is used that contains two pest objects, both belonging to the Cicadellidae pest class (Fig. 9a). The quality of detection is evaluated using the confidence score, a value predicted by the model that represents the probability that an anchor box contains the object. Overall, all the object detection models used in this work performed well in detecting the two pest objects and correctly classifying them as Cicadellidae.
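To make the role of the confidence score concrete, the sketch below keeps only the detections whose score reaches a threshold. The dictionary keys mirror the output names of the TensorFlow Object Detection API (detection_boxes, detection_scores, detection_classes), but plain Python lists stand in for tensors. The threshold of 0.5, the sample values, the class id, and the function name are illustrative assumptions, not outputs of the models in this work.

```python
def filter_detections(detections, min_score=0.5):
    """Keep detections whose confidence score is at least min_score."""
    kept = []
    for box, score, class_id in zip(
        detections["detection_boxes"],
        detections["detection_scores"],
        detections["detection_classes"],
    ):
        if score >= min_score:
            kept.append({"box": box, "score": score, "class_id": class_id})
    return kept


# Made-up detections for one image: normalized (ymin, xmin, ymax, xmax) boxes.
sample = {
    "detection_boxes": [
        (0.10, 0.15, 0.45, 0.60),
        (0.50, 0.40, 0.90, 0.85),
        (0.05, 0.70, 0.20, 0.95),
    ],
    "detection_scores": [0.96, 0.90, 0.30],
    "detection_classes": [7, 7, 3],  # hypothetical class ids
}

confident = filter_detections(sample, min_score=0.5)
for detection in confident:
    print(f"class {detection['class_id']} with score {detection['score']:.2f}")
```

Raising the threshold removes low-confidence boxes at the cost of possibly missing true pests, so the value is usually tuned on validation images.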

Detection images of the different object detection models used in the pest detection task. a) Actual image belonging to the Cicadellidae pest class, b) Using Faster R-CNN-ResNet101_V1 model, c) Using CenterNet-ResNet101_V1 model, d) Using SSD MobileNet_V2 model, e) Using EfficientDet_D1 model, f) Using Faster R-CNN-Inception_ResNet_V2 model, g) Using CenterNet-HourGlass104 model, h) Using SSD-ResNet101_V1 model.
Further, when analyzing the detections of each model based on the confidence score, the Faster R-CNN ResNet101_V1 model detects both pest objects with a confidence score of 100% (Fig. 9b). Next, the CenterNet-ResNet101_V1 and SSD MobileNet_V2 models also detect both pest objects, with confidence scores of 96% and 90%, and 100% and 80%, respectively (Fig. 9c and 9d). The EfficientDet_D1 and Faster R-CNN Inception_ResNet_V2 models detected both pest objects with confidence scores ranging from 81% to 88% (Fig. 9e and 9f), and the CenterNet-HourGlass104 and SSD-ResNet101_V1 models detected both pest objects with confidence scores from 71% to 87% (Fig. 9g and 9h).

Some of the sample pest images detected by the Faster R-CNN-ResNet101_V1 model, a) Mole Cricket, b) Pieris Canidia, c) Wireworm, d) Xylotrechus, e) Wheat Sawfly, f) Papilio Xuthus, g) Grub, h) Salurnis Marginella Guerr.
In this pest detection task, the Faster R-CNN ResNet101_V1 model detects pest objects with high confidence scores. In most images it yields a confidence score higher than 85%, and in some images it achieves a confidence score of 100%. A few of its detected images are shown in Fig. 10.
In addition, an object detection model must identify multiple objects in an image, whereas an object classification model assigns the whole image to one class even if it contains multiple objects. The objects present in an image may belong to the same class or to different classes. Detections of multiple objects of the same class and of multiple objects of different classes are shown in Fig. 10 (f, g, h) and Fig. 11, respectively.

Image of multiple objects of different classes detected by the Faster R-CNN-ResNet101_V1 model in a mosaic-augmented image.
From Tables 3, 5, and 6, among the 16 evaluation metrics used in this work, the Faster R-CNN ResNet101_V1 model achieved the highest score in 11 metrics. Further, it achieved the second-highest value in APmedium and the third-highest values in AP @ IoU = 0.5 and APsmall. Additionally, it has the second-smallest model size and the second-lowest Average Processing Time (APT), i.e., the second-best values.
The backbone network extracts the features that are used for detecting the objects, so it directly affects the performance of an object detection model. Here, the Faster R-CNN uses ResNet101_V1 as its backbone network, which adopts top-down information flow and residual connections, helping to produce a high-level feature map of good resolution. In addition, a two-stage detector is better suited to achieving high accuracy for the following reasons: i) a two-stage detector filters out most of the undesirable proposals by sampling a sparse set of region proposals, ii) it obtains high-quality features for the sampled proposals through the RoIAlign operation, and iii) it regresses the object location twice, i.e., once in each stage, so the bounding boxes are better refined than in one-stage methods [20]. Since the Faster R-CNN replaces the selective search used in the Fast R-CNN with a Region Proposal Network (RPN), its detection time is more or less the same as that of a one-stage detector.
According to Tables 3, 5, and 6, the CenterNet ResNet101_V1 model also performed well in this pest detection task. Among the 16 metrics, this model achieved the highest score in 2 metrics and the second-highest value in 11 metrics. It achieved the highest value in APmedium and AP @ IoU = 0.75, and the second-highest value in Acc, Pr, Rc, F1, mAP, APlarge, AR @ 1, AR @ 10, AR @ 100, AR @ 100 (medium), and AR @ 100 (large). The model size of CenterNet-ResNet101_V1 is ∼16 MB. Following this, the SSD MobileNet_V2 model also scores well in the pest detection task. It achieved the highest score in one metric, APsmall, and the second-highest score in one metric, AR @ 100 (small). It achieved the third-highest value in Acc, Pr, Rc, F1, APmedium, APlarge, AP @ IoU = 0.5, AR @ 1, AR @ 10, and AR @ 100 (medium). Compared with the other models, SSD MobileNet_V2 has the smallest size, 8 MB, and the lowest response time (ADT = 53 ms, APT = 14.05 s).
For the pest detection task on the IP102 dataset (IP102: A Large-Scale Benchmark Dataset for Insect Pest Recognition), Wu et al. [41] employed the ResNet model pre-trained with the ImageNet dataset. The pest class-wise accuracy attained by the Faster R-CNN ResNet101, CenterNet ResNet101_V1, and SSD MobileNet_V2 models (pre-trained with the COCO dataset) is compared with that of the existing ResNet model (pre-trained with the ImageNet dataset) used in [41] and presented in Table 7. Utilizing efficient backbone networks for the object detection models, pre-training these models with the COCO dataset, and applying an effective augmentation technique to the training dataset yield better Accuracy (Acc) values for pest detection than the pre-trained ResNet model of [41].
Comparison with existing work
From the above investigation, among the top three performing models, the Faster R-CNN ResNet101 provides the best trade-off among accuracy, model size, and detection time. The SSD-MobileNet_V2 model is smaller but gives somewhat lower accuracy than the Faster R-CNN ResNet101 and CenterNet ResNet101_V1 models. The CenterNet ResNet101_V1 model performs better than SSD-MobileNet_V2, but its model size and detection time are high compared to the other two top-performing models.
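One way to make this trade-off explicit is a Pareto check: a model is dominated if another model is at least as good on every criterion and strictly better on at least one. The sketch below applies that idea to the mAP, model size, and APT values reported above for the top three models. The Pareto framing and the function names are an illustration added here, not the selection procedure used in this work.

```python
# (mAP in %, model size in MB, APT in seconds), as reported in the text.
models = {
    "Faster R-CNN-ResNet101_V1": (74.77, 8.95, 20.13),
    "CenterNet-ResNet101_V1": (73.42, 16.31, 34.08),
    "SSD-MobileNet_V2": (72.77, 7.98, 14.05),
}


def dominates(a, b):
    """True if a is no worse than b on every criterion and better on one.

    Higher mAP is better; smaller size and lower APT are better.
    """
    no_worse = a[0] >= b[0] and a[1] <= b[1] and a[2] <= b[2]
    strictly_better = a[0] > b[0] or a[1] < b[1] or a[2] < b[2]
    return no_worse and strictly_better


def pareto_front(candidates):
    """Names of the candidates that no other candidate dominates."""
    front = []
    for name, values in candidates.items():
        dominated = any(
            dominates(other_values, values)
            for other_name, other_values in candidates.items()
            if other_name != name
        )
        if not dominated:
            front.append(name)
    return front


front = pareto_front(models)
print("Non-dominated models:", front)
```

On these three criteria, CenterNet-ResNet101_V1 is dominated by Faster R-CNN-ResNet101_V1, while Faster R-CNN-ResNet101_V1 and SSD-MobileNet_V2 remain as alternatives that trade accuracy against size and speed. The check uses APT; on ADT alone, CenterNet-ResNet101_V1 is faster than Faster R-CNN-ResNet101_V1.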
Pests are a severe threat to agricultural yield: they damage vegetation and induce various diseases in it. As a means of assisting farmers in identifying these pests in their earlier stages, we have performed extensive experiments on a variety of deep learning-based object detection models along with a variety of feature extraction networks. All the models are pre-trained with the COCO dataset and later fine-tuned to the target pest dataset. The weights pre-trained on the COCO dataset enable a robust pest detection model to be built in a relatively short training period. The performance of these models regarding object detection is evaluated using 16 different metrics. The Faster R-CNN-ResNet101_V1 model achieved the highest mAP, 74.77%, among all the models. Of the remaining 15 metrics, it achieved the highest value in 10 metrics, the second-highest value in one metric, and the third-highest value in two metrics. Hence, we demonstrate that the Faster R-CNN with ResNet101_V1 model has more discriminative power in detecting multiple pest objects in agricultural fields. Following this, CenterNet-ResNet101_V1 and SSD MobileNet_V2 also detect consistently well in the pest detection task. The visualized detection images and the obtained results provide evidence of this.
When building a computer vision-based mobile application to aid farmers, a small and quickly responding deep learning model is required, because it is convenient to deploy on the resource-constrained devices used by farmers. Among the top three best-performing models, the SSD MobileNet_V2 and Faster R-CNN ResNet101_V1 models are the smaller ones, at ∼8 MB and ∼9 MB, respectively, and their average processing times (APT) per image are 14.05 seconds and 20.13 seconds, respectively. The CenterNet-ResNet101_V1 model is a little larger, at ∼16 MB, and hence its average processing time (APT) is also a little higher, at 34.08 seconds.
As the Faster R-CNN ResNet101_V1 model performs consistently well across all the metrics, is quick in detecting objects, and is convenient to deploy on resource-constrained devices, we conclude that it is an efficient model for this pest detection task. The two-stage detection strategy and the use of ResNet101_V1 as a backbone network help the Faster R-CNN model attain better results in this pest detection task, even though it is smaller in size.
The pest detection approach proposed in this work is based on the analysis of still images. This approach has two possible limitations. First, the object detection model fails to locate pest objects if the image is unclear or if the pest objects are not properly positioned within the image. Second, dead insects and dust bags may be mistakenly detected as living pests in an image-based pest detection approach. Future work will focus on developing a video pest detection system that detects pest objects from video captured in the agricultural field. In that case, the detection decision relies on continuous frames of a video instead of a single frame.
