Effective and efficient multi-crop pest detection based on deep learning object detection models

Abstract

Traditional machine learning-based pest classification is a tedious and time-consuming process. Multi-class pest detection based on deep learning and convolutional neural networks offers a solution, because it automatically extracts the complex features of different pests from crop pest images. In this paper, several deep learning-based object detection models, namely SSD, EfficientDet, Faster R-CNN, and CenterNet, are implemented with the TensorFlow Object Detection framework. MobileNet_V2, ResNet101_V1, Inception_ResNet_V2, EfficientNet, and HourGlass104 are employed as backbone networks for these models to extract the features of the pests. Object detection models are capable of identifying and locating pests in crops. The models are first pre-trained on the COCO dataset and then fine-tuned on a target pest dataset of 20 pest classes. Experiments on this dataset show that Faster R-CNN_ResNet101_V1 outperformed every other model and achieved an mAP of 74.77%. Additionally, it is developed as a lightweight model, whose size is ∼9 MB, and can detect pest objects in 130 milliseconds per image, allowing it to be used on resource-constrained devices commonly used by farmers.

Keywords

Deep learning; Convolutional Neural Network; Object detection; Pest detection; Transfer learning

1 Introduction

India is one of the leading producers of numerous crops worldwide. As of 2018, over 50% of people in India depended on agriculture for their livelihood, and the sector generated 17–18% of India’s GDP. Hence, agricultural productivity is an essential factor for India’s economy. Raising agricultural productivity also lets farmers earn more profit and helps people meet their needs.

The main threats to crop yields are weeds, diseases, pests, and unseasonal climatic changes. This research focuses on pests that reduce production on agricultural land and on methods of identifying these pests at an early stage. Pests are creatures that damage vegetation and introduce diseases into it. Detecting pests at their earliest stages therefore results in better productivity: farmers can harvest larger quantities of good-quality crops and vegetables. As a result, individual farmers can earn more profit from their yields, contributing to the growth of the Indian economy.

The traditional method of manual pest identification is time-consuming, labor-intensive, and expensive, and it makes pests hard to identify at an early stage [39]. Machine learning-based pest identification is effective but can be very tedious. It involves several stages, such as image pre-processing, feature extraction, and classification. Because of the complicated structure of pests and the high resemblance in appearance among different pests, manually extracting features and classifying pests is an arduous task, and this may limit the model’s discrimination power. To overcome this challenge, a deep learning-based pest detection model has been developed that uses the original pest images as input and automatically extracts the complex features of each pest [8].

Many existing deep learning-based multi-class pest identification models predict only one object per image. If an image contains more than one object, a multi-class classification model assigns it the single class for which the last layer gives the highest probability. In the agricultural field, however, multiple pests may be present on a single crop or plant. This issue is addressed by building the pest detection model as a multi-label classification model using the TensorFlow object detection framework. These object detection models can classify multiple pests in a single image. In addition, they identify where the pests are located in the image.

Object detection is one of the challenging tasks in computer vision that entails determining the existence, location, and type of one or more objects in an image. The object detection tasks are accomplished using Convolutional Neural Network (CNN) based deep learning models. Object detection models can be categorized into two types based on their construction and functionality: one-stage and two-stage detectors. A two-stage detector consists of two phases: the region proposal phase and the classification phase. The first phase proposes various object candidates called regions of interest (RoI) and the second phase categorizes the proposals. The locations of proposals are fine-tuned during the second phase [17]. In contrast, a one-stage detector uses a single convolutional network to provide the bounding boxes and the object classification without using region proposals [17]. For example, Faster R-CNN, Cascade R-CNN, and Mask R-CNN are two-stage detectors, and SSD, EfficientDet, and CenterNet512 are one-stage detectors. The general structure of one-stage and two-stage detectors is shown in Fig. 1.

Fig. 1

The general structure of Object detection models, a) One-stage detectors, b) Two-stage detectors (Faster R-CNN).

During object detection, detecting objects at different scales is laborious, especially for smaller objects. A pyramid of images containing the same object at different scales can be used to detect objects at different scales. But this approach is time-consuming and demands a large amount of memory for end-to-end training. Instead, the object detection process uses a pyramid of multi-scale feature maps. The Feature Pyramid Network (FPN) acts as a feature extractor in an object detection network and produces this pyramid of multi-scale feature maps. The FPN is independent of the convolutional architecture of the object detection model and constructs feature pyramids that object detection models then use.
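The top-down merging that builds such a feature pyramid can be illustrated with a minimal pure-Python sketch. It assumes single-channel feature maps stored as nested lists and leaves out the 1×1 lateral and 3×3 smoothing convolutions of a real FPN; the function names and toy values are hypothetical and are not taken from the models used in this work.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2-D feature map (list of rows)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


def top_down_pyramid(levels):
    """Merge backbone maps from coarse to fine.

    `levels` is ordered fine -> coarse, each level half the size of the
    previous one. Lateral and smoothing convolutions are omitted here.
    """
    merged = [levels[-1]]                 # the coarsest map is used as-is
    for fine in reversed(levels[:-1]):
        up = upsample2x(merged[0])        # bring the coarser map to this size
        fused = [[a + b for a, b in zip(row_f, row_u)]
                 for row_f, row_u in zip(fine, up)]
        merged.insert(0, fused)
    return merged


# Toy backbone outputs at three scales (4x4, 2x2, 1x1).
c3 = [[1.0] * 4 for _ in range(4)]
c4 = [[2.0] * 2 for _ in range(2)]
c5 = [[3.0]]
p3, p4, p5 = top_down_pyramid([c3, c4, c5])
```

Each output level keeps the resolution of its backbone level while also carrying information from all coarser levels, which is what lets a detector look for small and large objects on different maps.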

Transfer learning is an effective technique used in computer vision applications to construct an accurate deep learning model in a shorter training time [25]. Transfer learning can be accomplished by using pre-trained models. The model is initially trained on a larger benchmark dataset, such as ImageNet [30], MS COCO [23], or PASCAL VOC [5, 13], and then fine-tuned on the target pest dataset [28, 42]. Instead of training the models from scratch, models are fine-tuned using weights learned by already-trained models. As a result, the training time required for building an effective model decreases.

2 Related work

The literature survey covers two aspects. The first examines the best-performing object detection models, especially those suited to small objects with complex features, because pests are generally small and have distinctive features. The second reviews existing pest detection work.

2.1 Object detection models

EfficientDet is a one-stage detector developed by Tan et al. 2020; its configurations D0 to D7 were evaluated on the COCO dataset and achieved better accuracy with fewer parameters [36]. Yankun et al. 2021 proposed a vehicle tracking algorithm based on the large motion trend combined with the color histogram, in which EfficientDet achieved higher mAP than Mask R-CNN as the object detector [44].

For automatic object detection from high-resolution panchromatic (PAN) images of military and civilian fields, Hou et al. 2020 developed an accurate and fast object detection model called a refined single-shot multi-box detector [14]. The work of Chen et al. 2021 was motivated by guiding model selection. In that work, all the models are initially trained on the COCO dataset and later fine-tuned on the target KITTI dataset to detect cars and pedestrians [6].

Xu et al. 2019 proposed a global average pooling (GAP) based adversarial Faster-RCNN to generate robust samples and enhance the performance of object detection algorithms [43]. Ahmed et al. 2019 used Faster-RCNN and Mask-RCNN models pre-trained on a frontal view dataset for overhead view multiple object detection and segmentation, to facilitate video analytics in the IoT [13].

Liu et al. 2020 constructed an anchor-free CNN architecture and a frame-by-frame technique incorporating a lightweight stacked hourglass network to predict the heatmap at the center point of a surgical tool for real-time surgical tool detection during robot-assisted surgery [18]. On the VisDrone2019 dataset, Pailla et al. 2019 employed CenterNet with the HourGlass-104 backbone network for real-time object detection, outperforming other significant object detection models [26].

2.2 Existing pest identification approaches

Thenmozhi and Reddy 2019 introduced a model for pest management which achieved better classification accuracy than other pre-trained models [39]; it is a multi-class classification model. Panchbhaiyye and Ogunfunmi 2018 used VGG16, ResNet, and Inception for identifying pests in agricultural land [28]. Roldán-Serrato et al. 2018 proposed an automatic pest detection system for detecting pests on potato and bean crops [29]. The authors used neural classifiers, such as RSC (Random Subspace Classifier) and LIRA (Limited Receptive Area), in this pest detection task.

Chen et al. 2021 proposed an AI-based pest detection system that detects pests based on pest images [7]. Their work utilized different detection models to detect three types of pests: Mealybugs, Coccidae, and Diaspididae. The work of Selvaraj et al. 2019 was motivated by applying the benefits of transfer learning to various object detection models to detect banana pest and disease symptoms on different parts of the banana plant [35]. The proposed model can distinguish between healthy and infected plant portions for several diseases.

Li et al. 2019 proposed an effective augmentation strategy for a CNN-based method to detect pests [21]. In that work, a CNN is used as a backbone to extract the pest object’s features. These features are passed on to an RPN to obtain the pest object’s location and class. Four classes of pest are considered, namely wheat sawfly, wheat aphid, wheat mite, and rice planthopper.

3 Dataset and Pre-processing

The dataset used in the proposed work is one of the benchmark pest datasets, namely IP102: A Large-Scale Benchmark Dataset for Insect Pest Recognition (v1.1), provided in the research work [41]. This dataset contains 102 different types of pests, out of which 20 different types are randomly chosen and used in this study. These pest types are Molecricket, Redspider, Wireworm, Armyworm, Grub, Ampelophaga, Lytta Polita, Meadow Moth, Pieris Canidia, Wheat Sawfly, Xylotrechus, Cicadellidae, Cicadella Viridis, Miridae, Papilio Xuthus, Prodenia Litura, Lawana Imitata Melichar, Salurnis Marginella Guerr, Apolygus lucorum, and Locustoidea.

The first level of data pre-processing is data cleaning, which removes insignificant and unclear images that would negatively affect the model and lead to misclassification. More importantly, the dataset was prepared with more pest images taken under actual agricultural field conditions than under lab conditions, in an approximate 7:3 ratio. Some sample lab-conditioned and agricultural field pest images are shown in Fig. 2. A model trained only with lab-conditioned images will not perform well in the actual agricultural field [27]. Incorporating both agricultural field and lab-conditioned images in the training dataset makes it possible to construct a model capable of identifying pests in agricultural fields.

Fig. 2

Sample pest images from the dataset (both lab-conditioned and field images). Images (a) to (f) are agricultural field images and images (g) to (l) are lab-conditioned images of the Molecricket, Armyworm, Lytta Polita, Meadow Moth, Xylotrechus, and Apolygus Lucorum pest classes, respectively.

The dataset is split into training and testing sets in an 8:2 ratio. The original pest images in the dataset vary in size, and it is hard to train deep learning models with images of varying scale. The training images should be resized to a fixed size to facilitate mini-batch learning and let the CNN model work properly. Here, all the pest images are resized to 512×512.
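A per-class 8:2 split can be sketched as follows. This is a minimal illustration that assumes a seeded random shuffle and hypothetical file names; the source does not state how the images were assigned to each subset.

```python
import random


def split_train_test(items, train_fraction=0.8, seed=0):
    """Shuffle `items` reproducibly and split them into train and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical file names for one pest class with 780 cleaned images.
images = [f"P01_{i:04d}.jpg" for i in range(780)]
train, test = split_train_test(images)
print(len(train), len(test))  # 624 156
```

Fixing the seed keeps the split reproducible, so the same images stay in the test set across experiments.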

To create a balanced distribution of images among the classes, approximately 780 images per class are taken, comprising 624 images for training and 156 images for testing, following the 8:2 split ratio. Of the 20 pest classes used in this work, six pest classes, namely Molecricket, Xylotrechus, Cicadellidae, Cicadella Viridis, Miridae, and Prodenia Litura, already have sufficient instances. The training and test images for the remaining 14 pest classes are brought up to this size using augmentation.

Augmentation is the process of creating transformed versions of images using operations like shearing, rotation, width and height shifting, horizontal and vertical flipping, and zooming [9, 12]. The effect of these augmentation operations on a sample pest image is shown in Fig. 3. The transformed images act as additional instances, making the dataset richer and sufficiently large. Hence, augmentation helps the model perform well, become more accurate, and avoid overfitting.
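Three of these operations can be written out on a tiny image stored as a nested list. This is a pure-Python illustration of what the transforms do to pixel positions, not the library code used in this work, and the zero padding in the shift is an assumption.

```python
def horizontal_flip(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def vertical_flip(img):
    """Mirror the image top to bottom."""
    return [list(row) for row in img[::-1]]


def width_shift(img, dx, fill=0):
    """Shift pixels dx columns right (left if dx is negative), padding with `fill`."""
    w = len(img[0])
    out = []
    for row in img:
        if dx >= 0:
            out.append([fill] * min(dx, w) + row[: max(w - dx, 0)])
        else:
            out.append(row[min(-dx, w):] + [fill] * min(-dx, w))
    return out


img = [[1, 2, 3],
       [4, 5, 6]]
flipped_h = horizontal_flip(img)   # [[3, 2, 1], [6, 5, 4]]
flipped_v = vertical_flip(img)     # [[4, 5, 6], [1, 2, 3]]
shifted = width_shift(img, 1)      # [[0, 1, 2], [0, 4, 5]]
```

Each call returns a new image of the same size, so one source image can yield several distinct training instances.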

Fig. 3

Effect of different augmentation operations, a) Original Image, b) Horizontal Flip, c) Vertical Flip, d) Shearing, e) Width shift, f) Height Shift, g) Rotation, h) Zoom.

Further, mosaic augmentation is utilized for preparing the training dataset, which forms a new image by combining multiple pest regions [40]. Since we have used four-tile mosaic augmentation, four pest images are combined to create mosaic augmented images [3]. In this work, the mosaic augmented images are created in the following three different ways.

Single-pest mosaic augmented image

Two-pests mosaic augmented image

Four-pests mosaic augmented image

The four regions of single-pest, two-pests, and four-pests mosaic augmented images contain the same pest class, two different pest classes, and four different pest classes, respectively. Some sample mosaic augmented images are shown in Fig. 4. Including these mosaic augmented images in the training dataset helps the model detect multiple pests in the same image. It also enhances the detection of pests outside their regular context.
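When four images are tiled into one, the bounding boxes of each source image have to be mapped into the mosaic frame. The sketch below assumes the simplest layout, four equal quadrants with each 512×512 source image scaled by one half; the augmentation tool used in this work may crop and place tiles differently, and the box coordinates are made up.

```python
def mosaic_boxes(tile_boxes, size=512):
    """Map per-image boxes into a 2x2 mosaic of side `size`.

    tile_boxes: four lists of (label, xmin, ymin, xmax, ymax), one list per
    source image, in the pixel frame of a `size` x `size` source image.
    Tiles are placed top-left, top-right, bottom-left, bottom-right.
    """
    half = size / 2
    offsets = [(0, 0), (half, 0), (0, half), (half, half)]
    out = []
    for boxes, (ox, oy) in zip(tile_boxes, offsets):
        for label, xmin, ymin, xmax, ymax in boxes:
            # Scale by 0.5 (tile is half the size), then move into its quadrant.
            out.append((label,
                        xmin * 0.5 + ox, ymin * 0.5 + oy,
                        xmax * 0.5 + ox, ymax * 0.5 + oy))
    return out


# Hypothetical boxes for a four-pests mosaic.
tiles = [
    [("Wireworm", 100, 100, 300, 300)],
    [("Locustoidea", 0, 0, 512, 512)],
    [("Molecricket", 200, 50, 400, 250)],
    [("Redspider", 256, 256, 512, 512)],
]
boxes = mosaic_boxes(tiles)
```

A single-pest or two-pests mosaic uses the same mapping; only the class labels of the four tiles differ.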

Fig. 4

Sample mosaic augmented images: a) single-pest mosaic augmented image containing the Grub pest class, b) two-pests mosaic augmented image containing the Salurnis Marginella Guerr and Lawana Imitata Melichar pest classes, c) four-pests mosaic augmented image containing the Cicadellidae, Cicadella Viridis, Xylotrechus, and Wheat Sawfly pest classes, d) four-pests mosaic augmented image containing the Wireworm, Locustoidea, Molecricket, and Redspider pest classes.

In this work, the cutout augmentation technique is also used in the training dataset that randomly masks out square regions of training images. The main motive of this cutout augmentation is to let the model detect partial and occluded objects [11]. For example, it helps the model to differentiate a pest that overlaps with another pest. It also lets the model concentrate on pest objects that occur partially. For reference, some sample cutout augmented images are shown in Fig. 5.
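The masking step can be sketched on an image stored as a nested list. This minimal version assumes the square mask lies fully inside the image and is filled with zeros; the mask size, count, and seed are hypothetical parameters, not the settings used in this work.

```python
import random


def cutout(img, mask_size, n_masks=1, fill=0, seed=None):
    """Return a copy of `img` (list of rows) with random square regions set to `fill`."""
    rng = random.Random(seed)
    h, w = len(img), len(img[0])
    out = [list(row) for row in img]      # leave the input untouched
    for _ in range(n_masks):
        top = rng.randrange(0, h - mask_size + 1)
        left = rng.randrange(0, w - mask_size + 1)
        for y in range(top, top + mask_size):
            for x in range(left, left + mask_size):
                out[y][x] = fill
    return out


img = [[1] * 8 for _ in range(8)]
masked = cutout(img, mask_size=3, seed=42)
```

Because the hidden region changes from image to image, the model cannot rely on any single part of the pest being visible.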

Fig. 5

Sample cutout augmented images.

Following this augmentation process, the images of both the training and testing datasets are annotated. Annotation is the process of attaching bounding boxes to the pest regions in the images. Details such as the selected pest class and the bounding box coordinates are stored in a label file for each image.

This annotation process was done with the help of the LabelImg (v1.8.0) tool. Label files are created in the Pascal VOC format, which can be used by all models implemented on the TensorFlow 2.0 framework. The format of the label file is visualized in Fig. 6. The complete details of the dataset used in this work before and after augmentation are shown in Table 1.

Fig. 6

Annotation process, a) Bounding the pest object in an image using the LabelImg tool, b) Corresponding label file in Pascal VOC format.
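A label file of this kind is an XML document with one `object` element per bounding box. The sketch below reads such a file with the Python standard library; the sample content (file name, class, and coordinates) is invented for illustration and is not taken from the dataset.

```python
import xml.etree.ElementTree as ET

# Hypothetical label file content in Pascal VOC layout.
SAMPLE = """<annotation>
  <filename>P05_0001.jpg</filename>
  <size><width>512</width><height>512</height><depth>3</depth></size>
  <object>
    <name>Grub</name>
    <bndbox><xmin>120</xmin><ymin>96</ymin><xmax>340</xmax><ymax>410</ymax></bndbox>
  </object>
</annotation>"""


def read_voc_boxes(xml_text):
    """Return a list of (class_name, xmin, ymin, xmax, ymax) tuples."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        bb = obj.find("bndbox")
        coords = tuple(int(bb.findtext(tag))
                       for tag in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append((obj.findtext("name"), *coords))
    return boxes


print(read_voc_boxes(SAMPLE))  # [('Grub', 120, 96, 340, 410)]
```

An image containing several pests simply has several `object` elements, and the function returns one tuple for each.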

Table 1

The details of the dataset used in this pest detection work

S.No	Class	Pest name	Actual no. of images	No. of images after cleaning	Train (before augmentation)	Test (before augmentation)	Train (after augmentation)	Test (after augmentation)
1	P01	Molecricket	1649	780	624	156	624	156
2	P02	Redspider	529	450	360	90	650	160
3	P03	Wireworm	887	650	520	130	620	155
4	P04	Armyworm	1071	600	480	120	625	155
5	P05	Grub	860	600	480	120	630	154
6	P06	Ampelophaga	764	554	443	111	624	156
7	P07	Lytta Polita	654	530	424	106	625	155
8	P08	Meadow Moth	270	240	192	48	620	150
9	P09	Pieris Canidia	469	350	280	70	625	150
10	P10	Wheat Sawfly	339	280	224	56	625	152
11	P11	Xylotrechus	1150	780	624	156	624	156
12	P12	Cicadellidae	5740	780	624	156	624	156
13	P13	Cicadella Viridis	1279	780	624	156	624	156
14	P14	Miridae	5081	780	624	156	624	156
15	P15	Papilio Xuthus	449	375	300	75	624	150
16	P16	Prodenia Litura	1304	780	624	156	624	156
17	P17	Lawana Imitata Melichar	578	410	328	82	620	155
18	P18	Salurnis Marginella Guerr	487	385	308	77	630	155
19	P19	Apolygus Lucorum	381	350	280	70	620	150
20	P20	Locustoidea	1394	780	624	156	624	156
21	A1	Single-pest mosaic augmented image	-	-	-	-	20	2
22	A2	Two-pests mosaic augmented image	-	-	-	-	20	2
23	A3	Four-pests mosaic augmented image	-	-	-	-	60	6
24	A4	Cutout augmented images	-	-	-	-	60	6
		Total	25535	11234	8897	2247	12614	3089

4 Materials and methods

In this work, four deep learning-based object detection models are utilized for the pest detection task: SSD (Single-Shot Detector) [15, 34], EfficientDet [16, 36], Faster R-CNN [1, 15], and CenterNet [10, 46]. MobileNet_V2 [33, 37], ResNet101_V1 [2], Inception_ResNet_V2 [6], EfficientNet [31, 38], and HourGlass104 [24] are used as backbone networks for extracting the discriminative features of the various pests, and these features are utilized by the object detection models for detecting pests. These models are initially pre-trained on the COCO dataset [23].

4.1 Object detection models

4.1.1 SSD

SSD stands for Single Shot Detector. This deep learning model is a one-stage object detector that predicts the object class and bounding box coordinates directly, without creating region proposals. The working principle of the SSD architecture is to determine the bounding boxes in each image that should be regarded as objects and then classify the type of object within each bounding box [4]. The main strength of this model is that it detects multiple objects in an image in a single pass. As a result, the SSD model has a substantially faster detection speed than two-stage detectors like Faster R-CNN, while the detection accuracy of the two approaches is nearly the same. In this work, the SSD model is combined with two different backbone networks, ResNet101_V1 and MobileNet_V2.

4.1.2 EfficientDet

The EfficientDet-D1 object detection model, which follows the one-stage detector paradigm, is utilized in this pest detection work. It employs EfficientNet [38] as the backbone network, a Bi-directional Feature Pyramid Network (BiFPN) as the feature network, and a shared class/box prediction network [36]. The BiFPN optimizes multi-scale feature fusion: it collects the features from level 3 to level 7 of the backbone network and repeatedly applies bidirectional (top-down and bottom-up) fusion to them. These fused features are then passed to the class/box network to generate class and box predictions. The weights of the class/box networks are shared among all feature levels, as in [22].

4.1.3 Faster R-CNN

Faster R-CNN is one of the most widely used object detection models that follow the two-stage detection paradigm. A significant aspect of this model is that it includes the Region Proposal Network (RPN) and the Fast R-CNN detector [7]. Initially, the input images are fed to convolution layers that act as feature extractors, and their output (the feature map) is given to the RPN [19]. The RPN then generates region proposals on this feature map to select candidate regions for the input image. The RPN replaces the traditional selective search used in Fast R-CNN, which is time-consuming. ResNet101_V1 and Inception_ResNet_V2 are used as backbone networks for Faster R-CNN to extract features from the image.

4.1.4 CenterNet

CenterNet, the deep learning model used here, is an anchor-free object detection model [10]. Anchor-based detection models usually create a large number of predictions. The prediction with the highest overlap with the object and the highest confidence score is selected, and every other prediction is ignored. Hence, anchor-based models spend more time on irrelevant predictions. CenterNet is a keypoint-based object detection approach. As a first step, it finds the center of the box and treats it as the key point representing the object. From this predicted center, it derives the other coordinates of the bounding box. The HourGlass104 and ResNet101_V1 networks are used as backbone networks for this pest detection task.
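The step from a predicted center to a bounding box can be sketched as follows. This simplified illustration assumes a single-class center heatmap, a per-location width and height prediction, and an output stride of 4; it omits the sub-pixel offset head and multi-peak extraction of a full detector, and all names and values are hypothetical.

```python
def decode_center(heatmap, sizes, stride=4):
    """Turn the strongest heatmap peak into (xmin, ymin, xmax, ymax, score).

    heatmap[y][x] is the center score of one class; sizes[y][x] is the
    predicted (width, height) in input pixels at that location.
    """
    score, cy, cx = max((heatmap[y][x], y, x)
                        for y in range(len(heatmap))
                        for x in range(len(heatmap[0])))
    w, h = sizes[cy][cx]
    px, py = cx * stride, cy * stride    # center in input-image pixels
    return (px - w / 2, py - h / 2, px + w / 2, py + h / 2, score)


heatmap = [[0.1, 0.2, 0.1],
           [0.1, 0.3, 0.9],
           [0.0, 0.2, 0.4]]
sizes = [[(8, 4)] * 3 for _ in range(3)]
box = decode_center(heatmap, sizes)
```

Only one box is produced per peak, so no large set of overlapping anchor predictions has to be filtered afterwards.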

4.2 Backbone networks

4.2.1 ResNet101_V1

The key concept behind the ResNet is to establish residual connections between convolution layers to combine output from previous layers with output from stacked layers [2]. In turn, this allows us to train a much deeper network. ResNet101_V1 is a deep residual network used here as a feature extractor. It employs 101 convolution layers for extracting the features from the image.

4.2.2 MobileNet_V2

MobileNet_V2 is an updated version of MobileNet_V1 and serves as the feature extractor for the SSD model in this pest detection task. As in MobileNet_V1, the standard convolution operations in the convolution layers of MobileNet_V2 [33] are replaced by depth-wise separable convolutions, each consisting of a depth-wise convolution followed by a point-wise convolution. This helps to build a computationally efficient model. Unlike MobileNet_V1, MobileNet_V2 also has a residual layer structure like the ResNet architecture [37].
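The saving from this replacement can be checked with a simple weight count. The sketch below ignores bias terms, and the layer shape (3×3 kernel, 32 input and 64 output channels) is chosen only for illustration.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution."""
    return k * k * c_in * c_out


def separable_conv_params(k, c_in, c_out):
    """Weights in a depth-wise separable convolution."""
    depthwise = k * k * c_in      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


std = standard_conv_params(3, 32, 64)     # 18432
sep = separable_conv_params(3, 32, 64)    # 2336
print(f"reduction factor: {std / sep:.2f}")
```

For this layer shape the separable form needs roughly one eighth of the weights, which is the source of the lower computational cost.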

4.2.3 Inception_ResNet_V2

The key idea behind Inception_ResNet_V2 is to combine the advantages of Inception units with residual connections [6]. Here, residual connections are used to combine the multiple-size convolution filters. Aside from avoiding the deep architecture issue, it also reduces training time. It helps the model to attain better accuracy in a shorter time.

4.2.4 HourGlass104

HourGlass104 is an HourGlass network whose depth is 104 layers. An HourGlass is a specialized form of fully convolutional neural network that follows Encoder-Decoder based architecture [24]. This model extracts the feature map from the input images and then combines earlier layers of the model containing higher spatial information than feature information, thereby helping to identify the object’s location.

4.2.5 EfficientNet

The primary building block of EfficientNet contains MBConv, to which squeeze-and-excitation optimization is also added. The structure of MBConv is the same as residual blocks used in MobileNet_V2 that form a residual connection between the start and end of a convolution block [38]. For optimizing the accuracy and efficiency of the model, an efficient compound scaling method is used in EfficientNet. Compound scaling is the process of mutually scaling the network depth, width, and resolution that results in reduced model size and FLOPs.
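Compound scaling can be illustrated numerically. The sketch assumes the base coefficients reported in the original EfficientNet paper (depth 1.2, width 1.1, resolution 1.15); these values are not stated in this document and are used here only as an example.

```python
# Base coefficients from the EfficientNet paper (an assumption, see above).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi):
    """Depth, width, and resolution multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi


def flops_multiplier(phi):
    """Approximate FLOPs growth: linear in depth, quadratic in width and resolution."""
    d, w, r = compound_scale(phi)
    return d * w ** 2 * r ** 2


for phi in range(4):
    d, w, r = compound_scale(phi)
    print(phi, round(d, 3), round(w, 3), round(r, 3), round(flops_multiplier(phi), 2))
```

With these coefficients, each increment of the compound coefficient roughly doubles the FLOPs, so the three dimensions grow together under a single budget knob.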

5 Experimental environment and evaluation metrics

5.1 Experimental setup design

In this work, various experiments are conducted with four object detection models and five backbone networks. Based on the combinations listed in Table 2, these four object detection models yield seven variants: SSD-MobileNet_V2, SSD-ResNet101_V1, EfficientDet with EfficientNet, Faster R-CNN-ResNet101_V1, Faster R-CNN-Inception_ResNet_V2, CenterNet-HourGlass104, and CenterNet-ResNet101_V1. All these variants are evaluated on the dataset described in Table 1.

Table 2
Different Combinations of object detection models with the backbone networks (feature extractors)

Backbone	SSD	EfficientDet	Faster R-CNN	CenterNet
MobileNet_V2	√
ResNet101_V1	√		√	√
EfficientNet		√
Inception_ResNet_V2			√
HourGlass104				√

All seven model variants and their experiments are implemented on the Colab Pro platform, which provides Tesla T4 and P100 GPUs, 25 GB of virtual RAM, and 166 GB of virtual disk space. Before training the models, various augmentation techniques were used to enrich the instances in the dataset [45]. Augmentation operations such as shearing, rotation, width and height shifting, horizontal flipping, and zooming are implemented using the Keras image data preprocessing API. The mosaic and cutout augmentations are implemented using tools provided by Roboflow. Drawing bounding boxes around the pest regions and assigning their respective pest classes are done using the LabelImg (v1.8.0) annotation tool.

Initially, all seven object detection model variants are pre-trained on the COCO dataset [23] and then fine-tuned on the target pest dataset to detect pests in the agricultural field. This transfer learning approach helps the models increase their detection ability. Each model is trained until it achieves its best performance. As shown in Table 1, the dataset used in this work contains 15,703 images of 20 pest classes, of which 12,614 images are used for training and 3,089 images are used for testing. The split ratio followed here is 8:2 for training and testing. All the images are resized to 512×512.

For evaluating the model using unseen images (test dataset), COCO detection metric_set is also used. It computes the intersection over union, mAP at IoU thresholds, and precision/recall values for small, medium, and large objects.
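The two quantities these metrics build on, intersection over union and the object size category, can be sketched in a few lines. Boxes are assumed to be (xmin, ymin, xmax, ymax) tuples in pixels, and the size thresholds follow the COCO convention of 32×32 and 96×96 pixels; the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def size_bucket(box):
    """Classify a box as small, medium, or large by its pixel area."""
    area = (box[2] - box[0]) * (box[3] - box[1])
    if area < 32 * 32:
        return "small"
    if area <= 96 * 96:
        return "medium"
    return "large"


pred = (0, 0, 100, 100)
truth = (50, 0, 150, 100)
overlap = iou(pred, truth)   # intersection 5000, union 15000
```

A detection counts as correct at a given threshold only if its IoU with a ground-truth box of the same class reaches that threshold.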

5.2 Evaluation metrics

The performance of all the above object detection models is evaluated using the following metrics.

Accuracy (Acc)

Precision (Pr)

Recall (Rc)

F1-Score (F1)

Average Precision (AP)

Average Recall (AR)

mean Average Precision (mAP)

Model Loading Time (MLT)

Average Detection Time (ADT)

Average Processing Time (APT)

5.2.1 Accuracy, Precision, Recall, and F1 Score

The Accuracy (Acc) of a model is the ratio of correct predictions to the total number of predictions. The Precision (Pr) is the ratio of correctly predicted positive observations to all predicted positive observations. Recall (Rc) is the ratio of correctly predicted positive observations to the total number of observations in the actual class. A well-performing object detection model should have both high precision and high recall, which is measured by the F1-score (F1), the harmonic mean of recall and precision. Precision and recall are also the sub-metrics used to calculate AP, AR, and mAP. The mathematical forms of Accuracy, Precision, Recall, and F1-Score are given in Equations (1), (2), (3), and (4), respectively.

$Acc = \frac{TP + TN}{TP + FP + FN + TN}$ (1)

$Pr = \frac{TP}{TP + FP}$ (2)

$Rc = \frac{TP}{TP + FN}$ (3)

$F1 = 2 \times \frac{Pr \times Rc}{Pr + Rc}$ (4)

where TP is True Positive, FP is False Positive, FN is False Negative and TN is True Negative.
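Equations (1) to (4) translate directly into code. The counts in the example are invented for illustration and are not results from this work.

```python
def detection_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, F1) from confusion counts."""
    acc = (tp + tn) / (tp + fp + fn + tn)
    pr = tp / (tp + fp)
    rc = tp / (tp + fn)
    f1 = 2 * pr * rc / (pr + rc)
    return acc, pr, rc, f1


# Hypothetical counts for one pest class.
acc, pr, rc, f1 = detection_metrics(tp=90, fp=10, fn=30, tn=870)
print(f"Acc={acc:.2%} Pr={pr:.2%} Rc={rc:.2%} F1={f1:.2%}")
```

In this example the accuracy is high mainly because of the large number of true negatives, while the F1-score reflects the lower recall. This is why per-class accuracy and F1 can differ noticeably.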

5.2.2 Average Precision (AP)

Average precision is a combined measure of recall and precision for ranked detections. AP is defined as the area under the interpolated precision-recall curve, as given in Equation (5).

$AP = \int_{0}^{1} Pr(Rc) \, dRc$ (5)

where AP is the Average Precision, Rc is the recall value, and Pr(Rc) is the corresponding precision value at Rc. In this work, the AP value is calculated at two IoU thresholds, namely AP@IoU=0.5 and AP@IoU=0.75, and for three object sizes, namely AP_small, AP_medium, and AP_large. AP@IoU=0.5 is the average precision computed at the single threshold IoU = 0.5; it gives a rough sense of precision that is lenient about the exact location of the bounding boxes. AP@IoU=0.75 is the average precision computed at the single threshold IoU = 0.75. It is strict about the position of the bounding boxes because a detection requires at least IoU = 0.75 to count as positive. AP_small is the average precision for bounding boxes whose area is less than 32×32 pixels. AP_medium is the average precision for bounding boxes with an area between 32×32 pixels and 96×96 pixels. AP_large is the average precision for bounding boxes with an area exceeding 96×96 pixels.
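The area under the interpolated precision-recall curve can be computed numerically as sketched below. This version uses all-point interpolation, in which the precision at each recall level is replaced by the maximum precision at any higher recall. The COCO evaluation samples the curve at fixed recall points instead, so its values can differ slightly. The curve points are invented for illustration.

```python
def average_precision(recalls, precisions):
    """Area under the interpolated precision-recall curve.

    `recalls` must be sorted in ascending order and `precisions[i]` is the
    precision measured at `recalls[i]`.
    """
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Interpolate: make precision non-increasing from right to left.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangles between consecutive recall values.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


# Hypothetical precision-recall points for one pest class.
ap = average_precision([0.2, 0.4, 0.6, 0.8], [1.0, 0.8, 0.9, 0.6])
```

Running the same function on detections matched at IoU = 0.5 or IoU = 0.75 gives the two threshold-specific AP values.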

5.2.3 Average Recall (AR)

Average Recall (AR) is the recall averaged over IoU thresholds from 0.5 to 1.0. It can be calculated as two times the area under the recall-IoU curve, as given in Equation (6).

$AR = 2 \times \int_{0.5}^{1.0} Rc(I) \, dI$ (6)

where AR is the Average Recall, I is the IoU value, and Rc(I) is the corresponding recall value at I. In this work, Average Recall values are sliced by the number of detections (AR@1, AR@10, AR@100) and by the size of the detected bounding boxes (AR@100_small, AR@100_medium, AR@100_large). AR@1 calculates the average recall of each class of images based on a maximum of one detection. AR@10 and AR@100 are defined in the same way, with at most 10 and 100 detections, respectively. AR@100_small, AR@100_medium, and AR@100_large calculate the AR@100 values for bounding boxes with an area less than 32×32 pixels, between 32×32 pixels and 96×96 pixels, and more than 96×96 pixels, respectively.
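Equation (6) has a simple closed form when the best IoU achieved for each ground-truth box is known: a box with best IoU v is recalled at every threshold up to v, so it contributes max(v - 0.5, 0) to the integral. The sketch below uses that identity. The COCO evaluation instead averages recall over a discrete set of thresholds, so its numbers can differ slightly, and the IoU values here are invented.

```python
def average_recall(best_ious):
    """AR = 2 * integral of recall(IoU) over [0.5, 1.0].

    best_ious: for each ground-truth box, the highest IoU reached by any
    detection of the same class (0.0 if the box was missed entirely).
    """
    n = len(best_ious)
    return 2.0 * sum(max(v - 0.5, 0.0) for v in best_ious) / n


# Hypothetical best IoUs for four ground-truth pests.
ar = average_recall([0.9, 0.7, 0.4, 1.0])
```

Limiting the detections considered per image to 1, 10, or 100 before computing the best IoUs gives the corresponding AR@1, AR@10, and AR@100 values.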

5.2.4 mean Average Precision (mAP)

The mean Average Precision (mAP) is the mean of AP across all N classes in the dataset, where each AP is calculated for a single class. The mathematical form of mAP is given in Equation (7).

$mAP = \frac{\sum_{j = 1}^{N} AP_j}{N}$ (7)

where mAP is the mean Average Precision, AP_j is the Average Precision of class j, and N is the total number of classes.

5.2.5 ADT, MLT, and APT

Average Detection Time (ADT) is the average time taken to detect pest objects in an image, averaged over the entire test dataset. The Model Loading Time (MLT) is the time required to load the trained model into the running environment for detecting the pest objects. The Average Processing Time (APT) is the time required to process an image to detect the pest objects in it. The mathematical forms of ADT and APT are given in Equations (8) and (9).

$ADT = \frac{1}{n} \sum_{i = 1}^{n} DT_i$ (8)

$APT = MLT + ADT$ (9)

where DT_i is the detection time of the i-th test image and n is the number of images in the test dataset.
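These three timings can be measured with a small harness such as the one below. It is a sketch only: `load_model` and `detect` are hypothetical callables standing in for a real detector, and the sleeping stand-ins exist purely to make the example runnable.

```python
import time


def measure_times(load_model, detect, images):
    """Return (MLT, ADT, APT) in seconds, following Equations (8) and (9)."""
    start = time.perf_counter()
    model = load_model()
    mlt = time.perf_counter() - start

    per_image = []
    for img in images:
        start = time.perf_counter()
        detect(model, img)
        per_image.append(time.perf_counter() - start)
    adt = sum(per_image) / len(per_image)
    return mlt, adt, mlt + adt


# Stand-ins for a real model loader and detector (assumptions).
def fake_load():
    time.sleep(0.01)
    return object()


def fake_detect(model, img):
    time.sleep(0.001)
    return []


mlt, adt, apt = measure_times(fake_load, fake_detect, range(5))
print(f"MLT={mlt:.4f}s ADT={adt:.4f}s APT={apt:.4f}s")
```

The loading time is measured once, whereas the detection time is averaged over all test images, which matches how the two terms are combined in Equation (9).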

6 Results and discussion

In this section, the performance of the object detection models SSD-MobileNet_V2, SSD-ResNet101_V1, EfficientDet, Faster R-CNN-ResNet101_V1, Faster R-CNN-Inception_ResNet_V2, CenterNet-HourGlass104, and CenterNet-ResNet101_V1 is demonstrated for the pest detection task. As mentioned above, all these models are pre-trained on the COCO dataset [23] and later fine-tuned on the pest dataset described in Table 1.

Initially, all the object detection models utilized in this pest detection task are evaluated using Accuracy (Acc), Precision (Pr), Recall (Rc), and F1-Score (F1). The values achieved by these models are recorded in Table 3. Among the results presented in Table 3, Faster R-CNN-ResNet101_V1 achieved the highest Accuracy, Precision, Recall, and F1-Score values of 91.09%, 92.62%, 91.16%, and 91.89%, respectively. CenterNet-ResNet101_V1 and SSD-MobileNet_V2 also performed well in the pest detection task, achieving the second and third highest values in the Acc, Pr, Rc, and F1 metrics. Furthermore, the top three models are evaluated at the level of each pest class by examining the Acc and F1 score, and the values are recorded in Table 4. The confusion matrices of these three models are presented in Fig. 7.

Table 3
Comparison of Accuracy(Acc), Precision(Pr), Recall(Rc), F1 Score(F1) achieved by the Tensorflow based Object Detection models used in this work

Model	Acc	Pr	Rc	F1
CenterNet-HourGlass104	73.64	84.13	73.68	78.56
CenterNet-Resnet101_V1	88.31	89.46	88.38	89.91
SSD MobileNet V2	87.95	89.12	88.03	88.5
SSD-ResNet101_V1	80.12	84.84	80.21	82.46
EfficientDet_D1	86.14	88.28	86.21	87.21
Faster RCNN-ResNet101_V1	91.09	92.62	91.16	91.89
Faster RCNN-Inception_ResNet_V2	86.14	88.13	86.27	87.19
Highest Value		Second Highest Value		Third Highest Value

Table 4

Comparison of metrics achieved at the individual pest class level by the three best-performing models: Faster RCNN-ResNet101_V1 (PM1), CenterNet-Resnet101_V1 (PM2), and SSD MobileNet_V2 (PM3)

Pest Class	Pest Name	PM1		PM2		PM3
		Acc	F1	Acc	F1	Acc	F1
		(%)	(%)	(%)	(%)	(%)	(%)
P01	Molecricket	99.77	97.71	98.99	89.19	98.51	83.68
P02	Redspider	97.53	73.42	96.89	64.96	97.31	66.66
P03	Wireworm	99.02	89.65	98.38	81.48	98.15	78.65
P04	Armyworm	99.02	89.02	98.86	88.37	98.76	88.41
P05	Grub	99.67	96.85	99.48	94.96	99.38	94.04
P06	Ampelophaga	99.8	98.07	99.32	93.02	99.32	93.11
P07	Lytta polita	99.25	93.09	98.67	87.9	98.73	88.42
P08	Meadow moth	99.09	90.14	98.6	83.89	98.51	82.77
P09	Pieris canidia	99.87	98.67	99.48	94.83	99.44	94.53
P10	Wheat sawfly	99.52	91.69	99.22	91.3	99.19	90.9
P11	Xylotrechus	99.61	96.17	98.47	86.29	97.37	78.51
P12	Cicadellidae	98.25	85.16	98.38	86.03	98.25	85.08
P13	Cicadella viridis	96.34	60.0	98.51	84.24	98.8	87.45
P14	Miridae	99.51	94.98	99.22	92.4	99.15	91.87
P15	Papilio Xuthus	99.51	95.17	99.06	91.02	99.7	97.02
P16	Prodonia Litura	99.54	95.33	99.12	91.48	98.96	90.06
P17	Lawana Imitata Milichar	98.89	90.0	98.57	87.13	98.41	85.95
P18	Salurnis Marginella Guerr	99.06	91.29	99.22	91.94	99.35	93.71
P19	Apolygus Lucorum	99.83	98.32	99.22	91.94	99.19	91.63
P20	Locustoidea	99.61	96.05	98.76	87.66	99.35	93.19
	The highest Accuracy value			The highest F1 Score

Fig. 7

Confusion Matrix produced by top three better-performing object detection models, a) Faster RCNN-ResNet101_V1, b) CenterNet-Resnet101_V1, c) SSD MobileNet_V2, d) Pest class and Pest name details.

From Fig. 7, the Faster R-CNN ResNet101_V1 model attained the highest number of True Positives (TP) for every pest class except Cicadella Viridis (P13), for which it correctly detected only 82 of 156 instances. Although it produced more TPs, it also produced a slightly higher number of False Positives (FP) than the other two top-performing models for the Cicadellidae, Papilio Xuthus, and Salurnis Marginella Guerr classes. Hence, the Faster R-CNN ResNet101_V1 model achieved the highest Acc and F1 values in all pest classes except these four. For these classes the differences in Table 4 are small, apart from the F1 score of Cicadella Viridis.

For the Cicadellidae class, the CenterNet-ResNet101_V1 model attained the highest Acc and F1 values, while for the Cicadella Viridis, Papilio Xuthus, and Salurnis Marginella Guerr classes the SSD-MobileNet_V2 model attained the highest values (Table 4). Because of the complex structure and small size of the Redspider pest, all three top-performing models have difficulty detecting it. Hence, the F1 value of the Redspider pest class is low compared to the other pest classes.
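The gap between the high per-class Acc values and the lower F1 values in Table 4 can be illustrated with a small sketch. It assumes the standard one-vs-rest definitions, in which true negatives count toward accuracy but not toward F1. The function name and the 3-class confusion matrix are hypothetical and are not taken from Fig. 7.

```python
from typing import List, Tuple


def per_class_acc_f1(cm: List[List[int]], k: int) -> Tuple[float, float]:
    """One-vs-rest accuracy and F1 for class k.

    cm[i][j] counts instances of true class i that were predicted as class j.
    """
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]
    fn = sum(cm[k]) - tp                  # class-k instances predicted as another class
    fp = sum(row[k] for row in cm) - tp   # other classes predicted as class k
    tn = total - tp - fn - fp
    accuracy = (tp + tn) / total
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Hypothetical confusion matrix for three classes, for illustration only.
cm = [
    [50, 2, 3],
    [4, 40, 1],
    [6, 2, 8],
]
acc, f1 = per_class_acc_f1(cm, 2)
print(round(acc, 4), round(f1, 4))  # 0.8966 0.5714
```

In this toy example the third class is recognized in only half of its instances, yet its one-vs-rest accuracy stays close to 90% because the many true negatives from the other classes dominate. Under this assumption, F1 is the more informative per-class measure for a class such as Redspider.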

The ability to detect pest objects precisely is evaluated with AP, AR, and mAP, whereas the ability to respond rapidly is evaluated with three metrics: ADT, MLT, and APT.

The mean Average Precision (mAP) is the most common metric for evaluating the performance of an object detection model. The mAP attained by all the models is recorded in Table 5, and the comparison chart is shown in Fig. 8. Since the models are intended for deployment on resource-limited devices, their response time is first tested in the current running environment using the ADT, MLT, and APT metrics, to determine how quickly they respond to pest objects in an input test image. The ADT, MLT, APT, and the corresponding final model size are also recorded in Table 5.

Fig. 8

The comparison chart of mAP (mean Average Precision) achieved by the different models used in this pest detection task.

Table 5

The comparison of mAP, response time, and model size of the different models used in the pest detection work

Object Detection Models	mAP (%)	Model Size (MB)	MLT (s)	ADT (ms)	APT (s)
CenterNet-HourGlass104	69.87	61.66	69	150	69.15
CenterNet-Resnet101_V1	73.42	16.31	34	85	34.08
SSD MobileNet V2 FPN	72.77	7.98	14	53	14.05
SSD-ResNet101_V1	72.63	15.73	35	133	35.13
EfficientDet_D1	70.58	30.31	52	132	52.13
Faster RCNN-ResNet101_V1	74.77	8.95	20	130	20.13
Faster RCNN-Inception_ResNet_V2	71.9	17.18	35	650	35.65
	Best Value		Second Best Value		Third Best Value

From Table 5, concerning mAP, the Faster R-CNN-ResNet101_V1 model achieves the best performance among all models, 74.77%, while CenterNet-ResNet101_V1 and SSD-MobileNet_V2 achieve the second and third highest values, 73.42% and 72.77%, respectively. SSD-MobileNet_V2 is the smallest model, at 7.98 MB. The Faster R-CNN-ResNet101_V1 and SSD-ResNet101_V1 models are the second and third smallest, at 8.95 MB and 15.73 MB, respectively.

Regarding Average Processing Time (APT), the SSD-MobileNet_V2 model requires the least time to process an image, 14.05 s (ADT = 53 ms, MLT = 14 s). It is followed by Faster R-CNN-ResNet101_V1 and CenterNet-ResNet101_V1, with 20.13 s (ADT = 130 ms, MLT = 20 s) and 34.08 s (ADT = 85 ms, MLT = 34 s), respectively. Since Faster R-CNN-ResNet101_V1 follows a two-stage detection strategy, its ADT is somewhat larger than that of SSD-MobileNet_V2 and CenterNet-ResNet101_V1. At the same time, its MLT is lower than that of CenterNet-ResNet101_V1 because its model size is smaller. Even though the CenterNet-ResNet101_V1 model has a lower ADT (85 ms), its MLT is larger (34 s) because of its model size, 16.31 MB.

In addition, AP and AR metrics are calculated for different IoU thresholds, object sizes, and numbers of detections. There are a total of 11 metrics, namely $AP_{small}$, $AP_{medium}$, $AP_{large}$, $AP_{IoU=0.75}$, $AP_{IoU=0.5}$, $AR_{@1}$, $AR_{@10}$, $AR_{@100}$, $AR_{@100}^{small}$, $AR_{@100}^{medium}$, and $AR_{@100}^{large}$. The detailed results attained by the object detection models on these 11 metrics are presented in Table 6.
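To make the size-based and IoU-based metrics concrete, the sketch below assigns a bounding box to a size bucket and compares an IoU value against the 0.5 and 0.75 thresholds. It assumes the COCO evaluation convention for object sizes (area below 32² pixels is small, below 96² pixels is medium, otherwise large), which this paper does not state explicitly. The box format, function names, and example boxes are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax), in pixels


def box_area(box: Box) -> float:
    """Area of an axis-aligned box; degenerate boxes have zero area."""
    xmin, ymin, xmax, ymax = box
    return max(0.0, xmax - xmin) * max(0.0, ymax - ymin)


def size_bucket(box: Box) -> str:
    """COCO-style size bucket (assumed thresholds: 32^2 and 96^2 pixels)."""
    area = box_area(box)
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter = box_area((max(a[0], b[0]), max(a[1], b[1]),
                      min(a[2], b[2]), min(a[3], b[3])))
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


ground_truth = (10.0, 10.0, 60.0, 60.0)  # hypothetical 50x50 box
prediction = (20.0, 10.0, 70.0, 60.0)    # same box shifted right by 10 pixels
overlap = iou(ground_truth, prediction)
print(size_bucket(ground_truth))          # medium
print(overlap >= 0.5, overlap >= 0.75)    # True False
```

In this example the prediction would count as a correct detection for $AP_{IoU=0.5}$ but not for the stricter $AP_{IoU=0.75}$.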

Table 6

The comparison of the other metrics achieved by the different Tensorflow-based object detection models used in this pest detection task

S.No	Object Detection Models	Average Precision (AP)			AP @ IoU		Average Recall (AR)			Average Recall @100
		Small	Medium	Large	0.75	0.5	AR @1	AR @10	AR @100	Small	Medium	Large
		(%)	(%)	(%)	(%)	(%)	(%)	(%)	(%)	(%)	(%)	(%)
1	CenterNet-HourGlass104	24.5	34.79	72.65	79.56	92.17	71.19	77.55	77.68	35.71	45.28	80.37
2	CenterNet-Resnet101_V1	21.85	58.03	75.59	84.64	92.6	72.41	80.31	80.43	30.14	61.96	82.39
3	SSD MobileNet V2 FPN	34.82	53.66	74.1	83.73	93.7	71.7	79.88	79.93	35.28	58.93	81.09
4	SSD-ResNet101_V1	21.52	45.15	74.23	84.55	94.07	71.36	79.69	80.22	25.71	53.07	81.71
5	EfficientDet_D1	18.87	45.04	72.74	82.61	94.72	69.53	76.28	76.75	29.0	59.89	78.62
6	Faster RCNN-ResNet101_V1	24.5	54.12	77.4	83.73	93.7	74.77	82.35	82.83	27.85	65.33	84.46
7	Faster RCNN-Inception_ResNet_V2	27.21	53.66	74.03	84.42	94.82	71.48	78.73	79.02	32.57	58.91	80.93
		Highest value					Second Highest Value					Third Highest Value

Considering the other metrics from Table 6, the Faster R-CNN ResNet101_V1 model achieved the highest value in $AP_{large}$, $AR_{@1}$, $AR_{@10}$, $AR_{@100}$, $AR_{@100}^{medium}$, and $AR_{@100}^{large}$. The CenterNet-ResNet101_V1 model achieved the highest value in $AP_{medium}$ and $AP_{IoU=0.75}$. The highest scores in $AP_{small}$, $AP_{IoU=0.5}$, and $AR_{@100}^{small}$ are achieved by SSD-MobileNet_V2, Faster R-CNN-Inception_ResNet_V2, and CenterNet-HourGlass104, respectively.

The object detection task is more challenging than the object classification task: object classification models only classify objects, while object detection models both locate and classify the objects present in an image. It is even more challenging in the agricultural field, since for some kinds of pests the pest color closely resembles the background color. Hence, it is hard to distinguish the pest from the image background.

Detection results of all the object detection models are visualized in Fig. 9 for qualitative comparison. For this, an unseen image from the test dataset is used that contains two pest objects, both belonging to the Cicadellidae pest class (Fig. 9a). The quality of detection is evaluated using the confidence score, a value predicted by the model that expresses the probability that an anchor box contains the object. Overall, all the object detection models used in this work performed well in detecting the two pest objects and correctly classifying them as Cicadellidae.
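In practice, confidence scores are typically used to discard weak detections before results are shown to the user. The following is a minimal sketch of such a filter. The `Detection` type, the function name, the 0.5 threshold, and the example scores are all assumptions for illustration and are not settings reported in this work.

```python
from typing import List, NamedTuple


class Detection(NamedTuple):
    label: str    # predicted pest class name
    score: float  # confidence score in [0, 1]


def keep_confident(detections: List[Detection], min_score: float = 0.5) -> List[Detection]:
    """Return detections with score >= min_score, highest score first."""
    kept = [d for d in detections if d.score >= min_score]
    return sorted(kept, key=lambda d: d.score, reverse=True)


# Hypothetical raw detections for a single image.
raw = [
    Detection("Miridae", 0.12),
    Detection("Cicadellidae", 0.90),
    Detection("Cicadellidae", 0.96),
]
for det in keep_confident(raw):
    print(f"{det.label}: {det.score:.0%}")
```

A higher threshold reduces false positives at the cost of missing low-confidence pests, so the value would need to be tuned for the target deployment.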

Fig. 9

Detection images of the different object detection models used in the pest detection task. a) Actual image belonging to the Cicadellidae pest class, b) Using Faster RCNN-ResNet101_V1 model, c) Using CenterNet-Resnet101_V1 model, d) Using SSD MobileNet_V2 model, e) Using EfficientDet_D1 model, f) Using Faster RCNN-Inception_ResNet_V2 model, g) Using CenterNet-HourGlass104 model, h) Using SSD-ResNet101_V1 model.

Analyzing the detections in terms of confidence score, the Faster R-CNN ResNet101_V1 model detects both pest objects with a confidence score of 100% (Fig. 9b). Next, the CenterNet-Resnet101_V1 and SSD MobileNet_V2 models also detect both pest objects, with confidence scores of 96% and 90%, and 100% and 80%, respectively (Fig. 9c and 9d). The EfficientDet_D1 and Faster RCNN-Inception_ResNet_V2 models detected both pest objects with confidence scores ranging from 81% to 88% (Fig. 9e and 9f), and the CenterNet-HourGlass104 and SSD-ResNet101_V1 models detected both pest objects with confidence scores from 71% to 87% (Fig. 9g and 9h).

Fig. 10

Some of the sample pest images detected by the Faster R-CNN-ResNet101_V1 model, a) Mole Cricket, b) Pieris Canidia, c) Wireworm, d) Xylotrechus, e) Wheat Sawfly, f) Papilio Xuthus, g) Grub, h) Salurnis Marginella Guerr.

In this pest detection task, the Faster R-CNN ResNet101_V1 model detects pest objects with high confidence scores. It mostly yields a confidence score higher than 85%, and in some images it achieves a confidence score of 100%. A few of its detections are shown in Fig. 10.

In addition, an object detection model must identify multiple objects in an image, whereas an object classification model assigns the image to one class even if it contains multiple objects. The objects present in an image may belong to the same class or to different classes. Detection of multiple objects of the same class and of different classes is visualized in Fig. 10 (f, g, h) and Fig. 11, respectively.

Fig. 11

Multiple objects of different classes detected by the Faster R-CNN-ResNet101_V1 model in a mosaic-augmented image.

From Tables 3, 5, and 6, among the 16 evaluation metrics used in this work, the Faster R-CNN ResNet101_V1 model achieved the highest score in 11 metrics. Further, it achieved the second highest value in $AP_{medium}$ and the third highest values in $AP_{IoU=0.5}$ and $AP_{small}$. Additionally, it has the second smallest model size and the second lowest Average Processing Time (APT), i.e., the second-best value in both.

Usually, the backbone network is used to extract features, and these features strongly affect the performance of an object detection model because they are what the detector uses to detect the objects. Here, the Faster R-CNN uses ResNet101_V1 as the backbone network, which adopts top-down information flow and residual connections, helping to produce a high-level feature map of good resolution. In addition, a two-stage detector tends to achieve better accuracy for the following reasons: i) two-stage detectors filter out most of the undesirable proposals by sampling a sparse set of region proposals, ii) they obtain high-quality features for the sampled proposals through the RoIAlign operation, and iii) they regress the object location twice, i.e., once in each stage, so bounding boxes are better refined than in one-stage methods [20]. Since the Faster R-CNN replaces the selective search used in the Fast R-CNN with a Region Proposal Network (RPN), its detection time is more or less the same as that of a one-stage detector.

According to Tables 3, 5, and 6, the CenterNet Resnet101_V1 model also performed well in this pest detection task. Among the 16 metrics, this model achieved the highest score in 2 metrics and the second-highest value in 11 metrics: the highest value in $AP_{medium}$ and $AP_{IoU=0.75}$, and the second-highest value in Acc, Pr, Rc, F1, mAP, $AP_{large}$, $AR_{@1}$, $AR_{@10}$, $AR_{@100}$, $AR_{@100}^{medium}$, and $AR_{@100}^{large}$. The model size of CenterNet-Resnet101_V1 is ∼16 MB. Following this, the SSD MobileNet_V2 model also achieved good scores in the pest detection task. It achieved the highest score in one metric, $AP_{small}$, the second-highest score in one metric, $AR_{@100}^{small}$, and the third-highest score in Acc, Pr, Rc, F1, $AP_{medium}$, $AP_{large}$, $AP_{IoU=0.5}$, $AR_{@1}$, $AR_{@10}$, and $AR_{@100}^{medium}$. Compared with the other models, SSD MobileNet_V2 has the smallest size, ∼8 MB, and the lowest response time (ADT = 53 ms, APT = 14.05 s).

For the pest detection task on the IP102 dataset (IP102: A Large-Scale Benchmark Dataset for Insect Pest Recognition), Wu et al. [41] employed the ResNet model pre-trained on the ImageNet dataset. The class-wise accuracy attained by the Faster R-CNN ResNet101, CenterNet Resnet101_V1, and SSD MobileNet_V2 models (pre-trained on the COCO dataset) is compared with that of the existing ResNet model (pre-trained on the ImageNet dataset) used in [41], and the results are presented in Table 7. Utilizing efficient backbone networks for object detection models, pre-training these models on the COCO dataset, and applying an effective augmentation technique to the training dataset offers better Accuracy (Acc) values for pest detection than using the pre-trained ResNet model [41].

Table 7

Comparison with existing work

Pest Class	Pest Name	Acc (%)
		EM	PM1	PM2	PM3
P01	Molecricket	35	99.77	98.99	98.51
P02	Redspider	27	97.53	96.89	97.31
P03	Wireworm	45	99.02	98.38	98.15
P04	Armyworm	18	99.02	98.86	98.76
P05	Grub	13	99.67	99.48	99.38
P06	Ampelophaga	48	99.8	99.32	99.32
P07	Lytta polita	59	99.25	98.67	98.73
P08	Meadow moth	15	99.09	98.6	98.51
P09	Pieris canidia	59	99.87	99.48	99.44
P10	Wheat sawfly	19	99.52	99.22	99.19
P11	Xylotrechus	7	99.61	98.47	97.37
P12	Cicadellidae	38	98.25	98.38	98.25
P13	Cicadella viridis	55	96.34	98.51	98.8
P14	Miridae	31	99.51	99.22	99.15
P15	Papilio Xuthus	52	99.51	99.06	99.7
P16	Prodonia Litura	58	99.54	99.12	98.96
P17	Lawana Imitata Milichar	30	98.89	98.57	98.41
P18	Salurnis Marginella Guerr	60	99.06	99.22	99.35
P19	Apolygus Lucorum	56	99.83	99.22	99.19
P20	Locustoidea	36	99.61	98.76	99.35

EM - ResNet (Used in the work [41]), PM1 - Faster RCNN-ResNet101_V1, PM2 - CenterNet-Resnet101_V1, PM3 - SSD MobileNet_V2.

From the above investigation, among the top three performing models, the Faster R-CNN ResNet101 provides the best trade-off among accuracy, model size, and detection time. The SSD-MobileNet_V2 model is smaller in size but gives somewhat lower accuracy than the Faster R-CNN ResNet101 and CenterNet Resnet101_V1 models. The CenterNet Resnet101_V1 model performs better than SSD-MobileNet_V2, but its model size and processing time are higher than those of the other two top-performing models.
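One way to examine this trade-off is a Pareto-dominance check over the mAP, model size, and APT values reported in Table 5. This is only an illustrative sketch and not the selection procedure used in this work. The function names are hypothetical, and all three criteria are assumed to be equally important.

```python
from typing import Dict, List, Tuple

Metrics = Tuple[float, float, float]  # (mAP in %, model size in MB, APT in s)

# Values as reported in Table 5.
MODELS: Dict[str, Metrics] = {
    "CenterNet-HourGlass104": (69.87, 61.66, 69.15),
    "CenterNet-Resnet101_V1": (73.42, 16.31, 34.08),
    "SSD MobileNet V2 FPN": (72.77, 7.98, 14.05),
    "SSD-ResNet101_V1": (72.63, 15.73, 35.13),
    "EfficientDet_D1": (70.58, 30.31, 52.13),
    "Faster RCNN-ResNet101_V1": (74.77, 8.95, 20.13),
    "Faster RCNN-Inception_ResNet_V2": (71.9, 17.18, 35.65),
}


def dominates(a: Metrics, b: Metrics) -> bool:
    """True if a is no worse than b on every criterion and strictly better on one.

    Higher mAP is better; lower model size and lower APT are better.
    """
    no_worse = a[0] >= b[0] and a[1] <= b[1] and a[2] <= b[2]
    better = a[0] > b[0] or a[1] < b[1] or a[2] < b[2]
    return no_worse and better


def pareto_front(models: Dict[str, Metrics]) -> List[str]:
    """Names of the models that are not dominated by any other model."""
    return [
        name
        for name, metrics in models.items()
        if not any(
            dominates(other, metrics)
            for other_name, other in models.items()
            if other_name != name
        )
    ]


print(pareto_front(MODELS))
```

Under these assumptions only SSD MobileNet V2 FPN and Faster RCNN-ResNet101_V1 remain undominated: the former is the smallest and fastest model, and the latter has the highest mAP while being nearly as small.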

7 Conclusion

Pests are a severe threat to agricultural yield: they damage vegetation and induce various diseases in it. As a means of assisting farmers in identifying these pests in their earlier stages, we have performed extensive experiments on a variety of deep learning-based object detection models along with a variety of feature extraction networks. All the models are pre-trained on the COCO dataset and later fine-tuned on the target pest dataset. The weights pre-trained on the COCO dataset make it possible to build a robust pest detection model within a relatively short training period. The performance of these models regarding object detection is evaluated using 16 different metrics. The Faster R-CNN-ResNet101_V1 model achieved the highest mAP, 74.77%, among all the models. Of the remaining 15 metrics, it achieved the highest value in 10 metrics, the second highest value in one metric, and the third highest value in two metrics. Hence, we demonstrate that the Faster RCNN with ResNet101_V1 model has more discriminative power in detecting multiple pest objects in agricultural fields. Following this, CenterNet-Resnet101_V1 and SSD MobileNet_V2 also perform consistently well in the pest detection task. The visualized detection images and the obtained results provide evidence for this.

When building a computer vision-based mobile application to aid farmers, small and quickly responding deep learning models are required, because they are convenient to deploy on the resource-constrained devices used by farmers. Among the top three best-performing models, the SSD MobileNet_V2 and Faster R-CNN ResNet101_V1 models are smaller, at ∼8 MB and ∼9 MB, respectively, and their average processing times (APT) per image are 14.05 seconds and 20.13 seconds, respectively. The CenterNet-Resnet101_V1 model is somewhat larger, at ∼16 MB, and hence its average processing time (APT) is also higher, at 34.08 seconds.

As the Faster R-CNN ResNet101_V1 model performs consistently well across all the metrics, detects objects quickly, and is convenient to deploy on resource-constrained devices, we conclude that it is an efficient model for this pest detection task. The two-stage detection strategy and the use of ResNet101_V1 as the backbone network help the Faster R-CNN model attain better results in this pest detection task despite its small size.

The pest detection approach proposed in this work relies on the analysis of still images, which has two limitations. First, the object detection model fails to locate pest objects if the image is unclear or if the pest objects are not properly positioned within the image. Second, dead insects and dust bags may be mistakenly detected as living pests in an image-based pest detection approach. Future work will focus on developing a video-based pest detection system that detects pest objects in video captured in the agricultural field. In that case, the detection decision relies on consecutive frames of a video instead of a single frame.

References

1. Ahmed, Din, Jeon and Piccialli, Exploring deep learning models for overhead view multiple object detection, IEEE Internet of Things Journal 7(7) (2019), 5737–5744.

2. Aleem, Raj, Khan, Comparative performance analysis of the resnet backbones of mask rcnn to segment the signs of covid-19 in chest ct scans, arXiv preprint arXiv:2008.09713, 2020.

3. Bochkovskiy, Wang C.Y., Liao H.Y.M., Yolov4: Optimal speed and accuracy of object detection, arXiv preprint arXiv:2004.10934, 2020.

4. Boonsirisumpun, Puarungroj, Wairotchanaphuttha, Automatic detector for bikers with no helmet using deep learning, Proc of IEEE 22nd Int Computer Science and Engineering Conf. (ICSEC), 2018, pp. 1–4.

5. Bechar, Moisan, On-line counting of pests in a greenhouse using computer vision, In VAIB 2010-Visual Observation and Analysis of Animal and Insect Behavior, 2010.

6. Chen, Lin, Lu, Cao, Wu, Guo, Liu and Wang F.Y., Deep Neural Network Based Vehicle and Pedestrian Detection for Autonomous Driving: A Survey, IEEE Transactions on Intelligent Transportation Systems 22(6) (2021), 3234–3246.

7. Chen J.W., Lin W.J., Cheng H.J., Hung C.L., Lin C.Y. and Chen S.P., A smartphone-based application for scale pest detection using multiple-object detection methods, Electronics 10(4) (2021), 372.

8. Chen K.H., Shou T.D., Li J.K.H. and Tsai C.M., Vehicles detection on expressway via deep learning: Single shot multibox object detector, Proc of IEEE Int Conf on Machine Learning and Cybernetics (ICMLC) 2 (2018), pp. 467–473.

9. Das, Sharma, Gourisaria M.K., Rautaray S.S., Pandey, A Model for Probabilistic Prediction of Paddy Crop Disease Using Convolutional Neural Network, In Intelligent and Cloud Computing, Springer, Singapore (2021), pp. 125–134.

10. Duan, Bai, Xie, Qi, Huang, Tian, Centernet: Keypoint triplets for object detection, Proc of the IEEE/CVF Int Conference on Computer Vision, 2019, pp. 6569–6578.

11. DeVries, Taylor G.W., Improved regularization of convolutional neural networks with cutout, arXiv preprint arXiv:1708.04552, 2017.

12. Emera, Sandor, Creation of Farmers’ Awareness on Fall Armyworms Pest Detection at Early Stage in Rwanda using Deep Learning, Proc of IEEE 8th Int Congress on Advanced Applied Informatics (IIAI-AAI), 2019, pp. 538–541.

13. Everingham, Van Gool, Williams C.K., Winn and Zisserman, The pascal visual object classes (voc) challenge, Int J of Computer Vision 88(2) (2010), pp. 303–338.

14. Hou, Ren, Zhao, Wu and Jiao, Object Detection in High-Resolution Panchromatic Images Using Deep Models and Spatial Template Matching, IEEE Transactions on Geoscience and Remote Sensing 58(2) (2019), 956–970.

15. Jiang, Chen, Liu, He and Liang, Real-time detection of apple leaf diseases using deep learning approach based on improved convolutional neural networks, IEEE Access 7 (2019), 59069–59080.

16. Kim H.J., Lee D.H., Niaz, Kim C.Y., Memon A.A. and Choi K.N., Multiple-Clothing Detection and Fashion Landmark Estimation Using a Single-Stage Detector, IEEE Access 9 (2021), 11694–11704.

17. Lohia, Kadam K.D., Joshi R.R., Bongale A.M., Bibliometric Analysis of One-stage and Two-stage Object Detection, Library Philosophy and Practice (2021), 1–32.

18. Liu, Zhao, Chang and Hu, An anchor-free convolutional neural network for real-time surgical tool detection in robot-assisted surgery, IEEE Access 8 (2020), 78193–78201.

19. Liu, Ghazali K.H., Han and Mohamed I.I., Automatic Detection of Oil Palm Tree from UAV Images Based on the Deep Learning Method, Applied Artificial Intelligence 35(1) (2021), 13–24.

20. , Li, Yan, Mimicdet: Bridging the gap between one-stage and two-stage object detection, Proc of European Conf on Computer Vision, Springer, Cham, (2020), pp. 541–557.

21. , Wang, Zhang, Xie, Liu, Wang, Chen, Hu, Jia and Hu, An effective data augmentation strategy for CNN-based pest localization and recognition in the field, IEEE Access 7 (2019), 160274–160283.

22. Lin T.Y., Goyal, Girshick, He, Dollár, Focal loss for dense object detection, Proc of the IEEE Int Conf on Computer Vision, 2017, pp. 2980–2988.

23. Lin T.Y., Maire, Belongie, Hays, Perona, Ramanan, Dollár, Zitnick C.L., Microsoft coco: Common objects in context, Proc of European Conf on Computer Vision, Springer, Cham, 2014, pp. 740–755.

24. Newell, Yang, Deng, Stacked hourglass networks for human pose estimation, Proc of European Conf on Computer Vision, Springer, Cham, 2016, pp. 483–499.

25. Pattnaik, Shrivastava V.K. and Parvathi, Transfer learning-based framework for classification of pest in tomato plants, Applied Artificial Intelligence 34(13) (2020), 981–993.

26. Pailla D.R., Kollerathu, Chennamsetty S.S., Object detection on aerial imagery using CenterNet, arXiv preprint arXiv:1908.08244, 2019.

27. Picon, Seitz, Alvarez-Gila, Mohnke, Ortiz-Barredo and Echazarra, Crop conditional Convolutional Neural Networks for massive multi-crop plant disease classification over cell phone acquired images taken on real field conditions, Computers and Electronics in Agriculture 167 (2019), 105093.

28. Panchbhaiyye, Ogunfunmi, Experimental results on using deep learning to identify agricultural pests, Proc of IEEE Global Humanitarian Technology Conference (GHTC), 2018, pp. 1–2.

29. Roldán-Serrato K.L., Escalante-Estrada J.A.S., Rodríguez-González M.T., Automatic pest detection on bean and potato crops by applying neural classifiers, Engineering in Agriculture, Environment and Food 11(4) (2018), 245–255.

30. Russakovsky, Deng, Su, Krause, Satheesh, Ma, Huang, Karpathy, Khosla, Bernstein and Berg A.C., Imagenet large scale visual recognition challenge, Int J of Computer Vision 115(3) (2015), 211–252.

31. Srikanth, Srinivasan, Indrajit, Venkateswaran, Contactless Object Identification Algorithm for the Visually Impaired using EfficientDet, Proc of IEEE 6th Int Conf on Wireless Communications, Signal Processing and Networking (WiSPNET), (2021), pp. 417–420.

32. Shi, Qi, Qin, Scott P.J. and Jiang, Intersecting Machining Feature Localization and Recognition via Single Shot Multibox Detector, IEEE Transactions on Industrial Informatics 17(5) (2020), 3292–3302.

33. Sandler, Howard, Zhu, Zhmoginov, Chen L.C., Mobilenetv2: Inverted residuals and linear bottlenecks, Proc of the IEEE Conf on Computer Vision and Pattern Recognition, (2018), pp. 4510–4520.

34. Sun, Kong, Huang, Tan, Fang and Liu, Feature pyramid reconfiguration with consistent loss for object detection, IEEE Transactions on Image Processing 28(10) (2019), 5041–5051.

35. Selvaraj M.G., Vergara, Ruiz, Safari, Elayabalan, Ocimati and Blomme, AI-powered banana diseases and pest detection, Plant Methods 15(1) (2019), 1–11.

36. Tan, Pang, Le Q.V., Efficientdet: Scalable and efficient object detection, Proc of the IEEE/CVF Conf on Computer Vision and Pattern Recognition (2020), pp. 10781–10790.

37. Teng T.W., Veerajagadheswar, Ramalingam, Yin and Elara, Mohan and B.F. Gómez, Vision based wall following framework: A case study with HSR robot for cleaning application, Sensors 20(11) (2020), 3298.

38. Tan, Le, Efficientnet: Rethinking model scaling for convolutional neural networks, Proc of Int Conf on Machine Learning (PMLR), (2019), pp. 6105–6114.

39. Thenmozhi and Reddy U.S., Crop pest classification based on deep convolutional neural network and transfer learning, Computers and Electronics in Agriculture 164 (2019), 104906.

40. Wei, Duan, Song, Tian, Wang, AMRNet: Chips Augmentation in Aerial Images Object Detection, arXiv preprint arXiv:2009.07168, 2020.

41. Wu, Zhan, Lai Y.K., Cheng M.M., Yang, IP102: A large-scale benchmark dataset for insect pest recognition, Proc of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition, (2019), pp. 8787–8796.

42. Xin, Wang, An Image Recognition Algorithm of Soybean Diseases and Insect Pests Based on Migration Learning and Deep Convolution Network, Proc of IEEE Int Wireless Communications and Mobile Computing (IWCMC) (2020), pp. 1977–1980.

43. , Zhang, Cheng, Song and Wang, Occlusion problem-oriented adversarial faster-RCNN Scheme, IEEE Access 7 (2019), 170362–170373.

44. Yankun, Xiaoping, Wenbo and Qiqige, A Color Histogram based Large Motion Trend Fusion Algorithm for Vehicle Tracking, IEEE Access 9 (2021), 83394–83401.

45. Yang, Cui, Yu, Yuan, Deep Learning Based Steel Pipe Weld Defect Detection, arXiv preprint arXiv:2104.14907, 2021.

46. Zhu, Yang, Zhang, Chen, CenterNet-Triplets application: surveillance camera illegal management detection, J of Physics: Conference Series 1684 (2020), 012091, IOP Publishing.

Table 2

Different Combinations of object detection models with the backbone networks (feature extractors)

Backbone/Models	A	B	C	D
MobileNet V2	√			
ResNet101 V1	√		√	√
EfficientNet		√		
Inception_ResNet_V2			√	
HourGlass 104				√
A: SSD, B: EfficientDet, C: Faster R-CNN, D: CenterNet