A new method for the recognition of day instar of adult silkworms using feature fusion and image attention mechanism

Abstract

Identifying the day instar of silkworms is a fundamental task for precision rearing and behavioral analysis. This study proposes a new method for identifying the day instar of adult silkworms based on deep learning and computer vision. Images from the first day of instar 3 to the seventh day of instar 5 were photographed using a mobile phone, and a dataset containing 7,000 images was constructed. An effective recognition network, called CSP-SENet, was proposed based on CSPNet, in which hierarchical kernels were adopted to extract feature maps from different receptive fields, and an image attention mechanism (SENet) was added to learn more important information. Experiments showed that CSP-SENet achieved a recognition precision of 0.9743, a recall of 0.9743, a specificity of 0.9980, and an F1-score of 0.9742. Compared to state-of-the-art and related networks, CSP-SENet achieved better recognition performance with lower computational complexity. The study can provide theoretical and technical references for future work.

Keywords

Identification of day instar, CSPNet, feature fusion, image attention mechanism, silkworm

1 Introduction

The silkworm is an insect of high economic value that is mainly used to produce natural silk and has been widely reared in China, India, Japan, and other countries. The life cycle of silkworms is very short and follows a fixed pattern, in which the day instar, i.e., the age of the silkworm in days, is an important biological indicator. It not only reflects the growth state of the silkworm but also corresponds to different rearing requirements. For example, silkworms become dormant after the third day of instar 1, the third day of instar 2, the third day of instar 3, and the fourth day of instar 4, and spin silk and form cocoons after the seventh day of instar 5. Failure to follow this pattern indicates a likelihood of disease or an improper rearing method. In addition, silkworms at different instars require different humidity and temperatures to ensure healthy growth, and the amount of mulberry leaves they need varies [1–3]. Therefore, identifying the day instar is a fundamental task for precision rearing and behavior analysis. However, the traditional method for instar recognition mainly relies on daily manual recording, which is inefficient and labor-intensive. A method that recognizes the day instar accurately and automatically is therefore urgently needed.

In recent years, with the rapid development of computer science, deep learning and computer vision have received much attention and have been widely applied in agricultural fields [4–6]. Many studies have used convolutional neural networks (CNNs) to recognize plant leaf diseases, insect or pest species, animal growth behavior, and crop or seed species, with encouraging results [7–9]. In the field of silkworms, Yu et al. [10] proposed a CNN-based method for the identification of male and female silkworm pupae, which obtained better results than traditional methods. Shi et al. [11] used MobileNet to recognize silkworm species with promising precision. Ding and Chen [12] presented an image recognition method for silkworm diseases using feature map slicing and the AlexNet architecture. Zhang et al. [13] proposed a cocoon quality detection model based on YOLO v4, which achieved state-of-the-art performance. These studies illustrate the broad applications of image recognition and deep learning to silkworms, but none of them has considered day instar recognition. To fill this gap, this study addresses this task using deep learning and computer vision.

Many classic CNNs have been proposed since 2012. ResNet [14] is one of the most influential of these networks. The residual connection it introduced greatly alleviated the vanishing gradient problem, allowing deep learning networks to become deeper and more powerful. It has been adopted in a large number of subsequent algorithms, in both CNNs and natural language processing (NLP), and has become a representative achievement that influenced the development of deep learning. CSPNet [15] was developed based on ResNet. It not only inherited the excellent feature extraction performance of ResNet but also optimized the residual connection method; therefore, it can provide better recognition accuracy while significantly reducing the computational complexity. CSPNet has been widely used as the backbone network of YOLO networks [16]. We also observed the advantages of the recently prevalent feature fusion and attention mechanisms in various vision tasks, and combined them with CSPNet to design an efficient and accurate network, which achieved a superior recognition result.

The main contributions of this work could be summarized as follows:

A new recognition method for day instars of adult silkworms was proposed based on deep learning and computer vision. To the best of our knowledge, this is the first study on this task.

An accurate network, CSP-SENet, was proposed based on feature fusion and an image attention mechanism. Experiments indicated that CSP-SENet outperformed the state-of-the-art and related networks.

An image dataset containing 7,000 images covering 14 day instars was constructed.

2 Related works

2.1 Feature fusion in CNNs

Feature fusion has a wide range of applications in CNNs, where it can significantly expand the receptive fields and enhance the data representations of the network. Feature Pyramid Network (FPN) [17] adopted a top-down architecture with lateral connections to obtain high-level feature maps at all scales, which improved the performance of object detection algorithms on the COCO benchmark. Path Aggregation Network (PANet) [18] added a bottom-up path augmentation to the FPN to enhance the overall feature hierarchy. Bi-directional Feature Pyramid Network (BiFPN) [19] employed learnable weights to learn the importance of different input features, while repeatedly applying top-down and bottom-up multi-scale feature fusion. In the application field, Sun et al. [20] presented a multi-level feature fusion network for fruit-bearing branch key point detection, which combined features of the same and different spatial sizes. Shi et al. [21] proposed a feature fusion method for the recognition of silkworm diseases based on ResNet-50 and FPN. Dong et al. [22] proposed an automatic method for crop pest detection using multi-scale feature fusion. Wei et al. [23] adopted dilated convolution to obtain multi-scale features and presented a method for pest recognition by fusing them. Wang et al. [24] proposed an efficient module for an instance segmentation network in pest monitoring using feature fusion and image attention mechanisms. Xia et al. [25] presented a flower bud detection model for hydroponic Chinese kale using FPN to fuse features extracted by Inception v3. These studies showed that feature fusion is an effective tool for improving performance. Based on the above works and inspired by the recent development of large kernel models [26, 27], we proposed a feature extraction and fusion method using hierarchical large kernels.

2.2 SENet with applications

Squeeze-and-Excitation Network (SENet) [28] is a pioneering algorithm for image attention mechanisms that allows the network to focus on the key information and suppress the noise. Researchers have conducted numerous studies using SENet to enhance the capability of networks, with encouraging results. GhostNet [29] and MobileNet v3 [30] took advantage of SENet to achieve state-of-the-art performance in a lightweight manner. ECANet [31] replaced the fully connected layers of SENet with a 1D convolutional layer and achieved superior recognition results. Qi et al. [32] proposed a tomato leaf disease detection method by adding SENet to the backbone of YOLO v5. Zhou et al. [33] used MobileNet and SENet for ore image classification and provided competitive results. Yue et al. [34] proposed a few-shot learning method for synthetic aperture radar image recognition, in which SENet was integrated into a recognition network to enhance the expression capability of the features. Kushnure and Talbar [35] utilized SENet to recalibrate the fused features to capture prominent details of modified high-level features in medical image segmentation. Zhang et al. [36] proposed a MobileNetV2-SENet-based method to identify fish school feeding behavior, which provided better performance than other lightweight networks. These studies indicate that SENet can significantly improve the effectiveness of CNNs in various tasks, which prompted us to design a recognition algorithm using SENet.

3 Materials and method

3.1 Experimental data

The dataset is essential for training the network. To obtain an accurate dataset of the day instar, image collection and dataset construction were performed in real environments.

3.1.1 Image collection

In this study, a specific silkworm variety named Fang·Xiu×Bai·Chun [37], which is mainly reared in Sichuan Province, China, was used as the experimental sample. All silkworms were reared by a professional in a standard rearing house to ensure the accuracy and objectivity of their instar. The number of silkworms was approximately 500. Images were collected once per day instar at a fixed time (08:00 to 10:00 a.m.), except during the dormant periods, from April 28 to May 14, 2022, i.e., from the first day of instar 3 to the seventh day of instar 5. The collection site was located in the Sericulture Institute of Sichuan Academy of Agricultural Sciences (Nanchong City, Sichuan Province, China). The collection environment was indoors with natural light.

An iPhone 6 smartphone (Apple Computer Inc., Cupertino, California, USA) with a 12-megapixel camera was used as the image acquisition device. The device was mounted on a tripod with the lens facing straight down, and the silkworm larvae maintained their natural posture during image collection. The aspect ratio of the device screen was set to 1:1 to ensure that the body shape of the silkworm was not distorted after image resizing. Mulberry leaves were randomly used as the background. Examples of the original images are shown in Fig. 1. The image of each silkworm was collected only once per day, and 7,000 images covering the 14 day instars were collected.

Fig. 1

Day instar images of Fang·Xiu×Bai·Chun. (a) ∼ (c) are images of the first day to the third day at instar 3, (d) ∼ (g) are images of the first day to the fourth day at instar 4, (h) ∼ (n) are images of the first day to the seventh day at instar 5.

3.1.2 Dataset construction

The size of the original images was 2448×2448 pixels, which was too large for network training. The original images were resized to 224×224 pixels using bilinear interpolation, and no other preprocessing was performed. Each day instar was used as a recognition category with 500 images, so that a dataset of 14 categories was constructed. The dataset was divided into training, validation, and test sets by randomly selecting images in a ratio of 6:2:2.
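The 6:2:2 division can be sketched as follows. This is a minimal illustration rather than the authors' script: it assumes that the split is drawn separately within each category (the text only states that images were selected randomly), and the function name `split_per_category`, the seed, and the file names are hypothetical.

```python
import random

def split_per_category(paths_by_category, ratios=(0.6, 0.2, 0.2), seed=0):
    """Randomly split each category into train/val/test lists of (path, label)."""
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for category, paths in sorted(paths_by_category.items()):
        shuffled = list(paths)
        rng.shuffle(shuffled)
        n_train = round(len(shuffled) * ratios[0])
        n_val = round(len(shuffled) * ratios[1])
        splits["train"] += [(p, category) for p in shuffled[:n_train]]
        splits["val"] += [(p, category) for p in shuffled[n_train:n_train + n_val]]
        splits["test"] += [(p, category) for p in shuffled[n_train + n_val:]]
    return splits

# 14 categories: instar 3 (3 days), instar 4 (4 days), instar 5 (7 days)
days_per_instar = {3: 3, 4: 4, 5: 7}
categories = [f"{i}-{d}" for i, n in days_per_instar.items() for d in range(1, n + 1)]
# Hypothetical file names, 500 images per category
dataset = {c: [f"{c}/img_{k:03d}.jpg" for k in range(500)] for c in categories}
splits = split_per_category(dataset)
print({name: len(items) for name, items in splits.items()})
```

Under these assumptions each category contributes 300, 100, and 100 images to the training, validation, and test sets, which matches the 1,400 test images reported in Section 4.1.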

3.1.3 Data enhancement

Data enhancement can effectively improve the stability of the model and prevent overfitting. Since each silkworm was manually placed on the background for image acquisition, factors such as operator habits, the lighting environment, and the image background could cause overfitting. Therefore, preprocessing methods including random rotation, zooming, and horizontal or vertical flipping were used to enhance the images in the training set.

As shown in Fig. 2, several images could be generated from a single image using the data enhancement techniques. During training, only one of them was randomly selected to replace the original image, so the number of images in the training set was not changed.

Fig. 2

Example of data enhancement. (a) is the original image, (b) ∼ (f) are the enhanced images.
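The replace-in-place enhancement described above can be sketched with NumPy. This is a simplified illustration under stated assumptions, not the authors' pipeline: rotation is restricted to multiples of 90°, zooming is implemented as a centre crop followed by nearest-neighbour resampling, and the function name `random_enhance` and the parameter `max_zoom` are hypothetical, since the paper does not report rotation angles or zoom ranges.

```python
import numpy as np

def random_enhance(image, rng, max_zoom=1.2):
    """Return one randomly transformed copy of a square H x W x C image."""
    out = np.rot90(image, k=int(rng.integers(0, 4)))  # rotate by 0/90/180/270 degrees
    if rng.random() < 0.5:
        out = out[:, ::-1]                            # horizontal flip
    if rng.random() < 0.5:
        out = out[::-1, :]                            # vertical flip
    # Zoom in: crop the centre, then resample back to the original size
    size = out.shape[0]
    zoom = rng.uniform(1.0, max_zoom)
    crop = max(1, int(round(size / zoom)))
    start = (size - crop) // 2
    idx = start + (np.arange(size) * crop) // size    # nearest-neighbour source indices
    out = out[idx][:, idx]
    return np.ascontiguousarray(out)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)
enhanced = random_enhance(image, rng)   # replaces `image` for this training step
print(enhanced.shape, enhanced.dtype)
```

Because each call returns exactly one transformed copy with the same shape as the input, the training set keeps its original size while the network sees a different variant of each image in every epoch.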

3.2 Recognition network for the recognition of silkworm day instar

Due to the high similarity of silkworms in successive instars, it is difficult to distinguish them manually. A strong extraction strategy was required for the network design. Therefore, a powerful module was proposed to realize effective feature extraction for silkworm images.

3.2.1 Effective block for feature extraction and fusion

We designed our feature extraction block based on the basic block of CSPNet. Figure 3 illustrates the basic block of CSPNet and the proposed CSP-SE block.

Fig. 3

Structure of the CSP block and the proposed CSP-SE block. The “3×3 Conv” refers to the 3×3 convolutional layer, the “Ch” means the number of channels, the “s” is stride, the “p” is padding, the “+” is the addition of feature maps, the “C” is the concatenation operation.

As shown in Fig. 3, the proposed block takes an input X with the dimension of W × H × C, where W, H, and C denote the width, height, and number of channels, respectively. First, two 1×1 convolutional layers were used to project the input into two sub-feature maps, each with the dimension of W × H × C/2. The formulas are as follows:

$Y_{1} = \mathrm{Conv}_{1 \times 1}(X)$ (1)

$Y_{2} = \mathrm{Conv}_{1 \times 1}(X)$ (2)

where $\mathrm{Conv}_{1 \times 1}$ refers to the 2D convolutional operation with a kernel size of 1 × 1, and $Y_{1}, Y_{2} \in \mathbb{R}^{W \times H \times C/2}$.

Then, feature extraction was conducted mainly on one of them. A 1×1 convolutional layer was utilized to reduce the dimension to W × H × C/4, followed by stacked 7×7 and 5×5 convolutional layers. These hierarchical kernels were used to expand the receptive fields of the network. The output of the 1×1 layer was concatenated with the output of the large-kernel path, and a conventional 3×3 convolutional layer was applied to the concatenated feature map. A residual connection was adopted to combine the input and output feature maps. The formula is as follows:

$Z_{1} = Y_{1} + \mathrm{Conv}_{3 \times 3}(\mathrm{Concat}(\mathrm{Conv}_{1 \times 1}(Y_{1}), \mathrm{Conv}_{5 \times 5}(\mathrm{Conv}_{7 \times 7}(\mathrm{Conv}_{1 \times 1}(Y_{1})))))$ (3)

where Concat means the concatenation operation, which realizes the feature fusion across different receptive fields.

The network depth is increased by repeating the computation of Z₁, with the number of repetitions denoted by “×L” in Fig. 3.

Finally, an image attention mechanism (SENet) was applied to learn more key information and suppress the interference. The two sub-feature maps were concatenated and a 1×1 convolutional layer was utilized to obtain the output. The formula is as follows:

$Z = \mathrm{Conv}_{1 \times 1}(\mathrm{Concat}(\mathrm{Atten}(Z_{1}), Y_{2}))$ (4)

where Atten represents the image attention mechanism and $Z \in \mathbb{R}^{W \times H \times C}$.

In summary, our CSP-SE block uses channel compression to improve the extraction efficiency, and the hierarchical kernel sizes to obtain the different receptive fields. The large kernel sizes of 7×7 and 5×5 were used to obtain the feature maps of the different receptive fields, and the 3×3 convolutional layer was used to extract more fine-grained information and fuse the features. The image attention mechanism was adopted to achieve the extraction of key information and suppress the interference. Two concatenation operations were used to combine feature maps from the different receptive fields and obtain different semantic information. The pseudocode of the CSP-SE block is presented in Algorithm 1.

Algorithm 1 Pseudocode of the CSP-SE block in a PyTorch-like style.
import torch
import torch.nn as nn

# Conv (2D convolution, batch normalization and ReLU, with size-preserving padding)
# and SENet (channel attention, Section 3.2.2) are helper modules defined elsewhere.

class Bottleneck(nn.Module):
    def __init__(self, c1, c2, shortcut=True, g=1, e=0.5):
        # in_channel, out_channel, shortcut, groups, expansion
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)  # in_channel, out_channel, kernel size, stride
        self.cv2 = Conv(c_, c_, 7, 1)  # 7×7 Conv
        self.cv3 = Conv(c_, c_, 5, 1)  # 5×5 Conv
        self.cv4 = Conv(2 * c_, c2, 3, 1, g=g)  # 3×3 Conv on the concatenated map
        self.add = shortcut and c1 == c2

    def forward(self, x):
        x_1 = self.cv1(x)
        y = self.cv3(self.cv2(x_1))
        y = self.cv4(torch.cat((x_1, y), dim=1))  # fuse the two receptive fields
        return x + y if self.add else y

class CSP_SE(nn.Module):
    def __init__(self, c1, c2, l=1, shortcut=True, g=1, e=0.5):
        # in_channel, out_channel, number of L, shortcut, groups, expansion
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c1, c_, 1, 1)
        self.cv3 = Conv(2 * c_, c2, 1)
        self.m = nn.Sequential(*[Bottleneck(c_, c_, shortcut, g, e=0.5) for _ in range(l)])
        self.attention = SENet(c_)  # image attention

    def forward(self, x):
        x_1 = self.m(self.cv1(x))
        x_2 = self.cv2(x)
        x_1 = self.attention(x_1)
        return self.cv3(torch.cat((x_1, x_2), dim=1))
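The receptive fields that one pass of the block fuses can be worked out with a few lines of arithmetic. The sketch below assumes stride-1, undilated convolutions, as in Algorithm 1, for which the receptive field of a stack equals 1 plus the sum of (k − 1) over its kernel sizes; the function name `receptive_field` is hypothetical.

```python
def receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1, undilated convolutions."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

# Paths through one CSP-SE bottleneck, measured on its input Y1
identity_path = receptive_field([1, 3])            # 1x1 -> 3x3
large_kernel_path = receptive_field([1, 7, 5, 3])  # 1x1 -> 7x7 -> 5x5 -> 3x3
# Variant with two 3x3 layers in place of the 7x7 and 5x5 layers (Section 4.4)
small_kernel_path = receptive_field([1, 3, 3, 3])
print(identity_path, large_kernel_path, small_kernel_path)  # 3 13 7
```

Under these assumptions a single bottleneck combines a 3×3 view with a 13×13 view of its input, whereas replacing the large kernels with 3×3 layers would narrow the wider view to 7×7.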

3.2.2 Image attention mechanism

The image attention mechanism can direct deep learning networks to focus on the key information, and it mainly comprises spatial and channel attention modules. Because the proposed block expands the receptive fields and fuses feature maps of different granularity, it already has a strong ability to extract spatial information. Therefore, channel attention was adopted to obtain key information by recalibrating the channels.

The selected attention mechanism was SENet, which is the pioneering algorithm in this field. As depicted in Fig. 4, the input feature map X has the dimension of W × H × C, where W, H, and C denote the width, height, and number of channels, respectively. When calculating the attention weights, a convolutional operation was first used to transform the input into a feature map with the dimension of W × H × C₁, and global average pooling was used to generate the channel-wise statistics. The formula is as follows:

$u = \mathrm{GAP}(\mathrm{Conv}_{1 \times 1}(X))$ (5)

where GAP is the global average pooling.

Fig. 4

Schematic diagram of the SENet. “FC” represents a fully connected layer.

Two fully connected (FC) layers and the rectified linear unit (ReLU) function were adopted to capture the channel-wise dependencies of the information aggregated by the global average pooling:

$w = \delta(W_{2} * \mathrm{ReLU}(W_{1} * u))$ (6)

where W₁ and W₂ are the learnable weights of the FC layers, ReLU refers to the rectified linear unit, and “*” represents matrix multiplication. δ is the sigmoid activation function, defined as follows:

$\delta(x) = \frac{1}{1 + e^{-x}}$ (7)

Finally, the output was obtained by rescaling the input X with the attention weights w using channel-wise multiplication. The formula is as follows:

$Y = X * w$ (8)
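Equations (5) to (8) can be traced numerically with a small NumPy sketch. It is an illustration under stated assumptions rather than the authors' implementation: the preliminary 1×1 convolution is omitted (so C₁ = C), bias terms are left out, the weights are random instead of learned, and the reduction ratio `r` between the two FC layers is a hypothetical value that the paper does not report.

```python
import numpy as np

def se_attention(x, w1, w2):
    """Channel attention on x (H x W x C) with w1: (C//r, C) and w2: (C, C//r)."""
    u = x.mean(axis=(0, 1))                    # Eq. (5): global average pooling, shape (C,)
    hidden = np.maximum(w1 @ u, 0.0)           # first FC layer followed by ReLU
    w = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # Eqs. (6)-(7): second FC layer and sigmoid
    return x * w                               # Eq. (8): channel-wise rescaling

rng = np.random.default_rng(0)
C, r = 8, 4                                    # r is an assumed reduction ratio
x = rng.standard_normal((7, 7, C))
w1 = rng.standard_normal((C // r, C))
w2 = rng.standard_normal((C, C // r))
y = se_attention(x, w1, w2)
print(y.shape)
```

Because the sigmoid keeps every weight between 0 and 1, each channel of the output is a damped copy of the corresponding input channel; the network learns which channels to damp least.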

3.2.3 Structure of the proposed network

Based on the proposed block, a recognition network for the day instar was proposed, namely CSP-SENet. As shown in Fig. 5, CSP-SENet inherits the streamlined structure of general networks, such as ResNet, CSPNet, and MobileNet. The input size of the proposed network was 224×224×3. First, a 7×7 convolutional layer and a 2×2 max pooling layer were utilized to extract the feature map and reduce the dimension, resulting in a feature map with a dimension of 56×56×32. Then, four stages of the proposed blocks were stacked to extract the feature maps, and a feature map with a dimension of 7×7×512 was obtained. The value of L in the CSP-SE blocks of the four stages was set to 2, 2, 6, and 2, respectively, to increase the network depth. Finally, an average pooling layer was used to obtain a feature map with the dimension of 1×1×512, and a fully connected layer was utilized to obtain the recognition result.

Fig. 5

Schematic diagram of our CSP-SENet.

General operations such as batch normalization and ReLU function were used to normalize and activate the feature values after each convolutional layer.
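The spatial sizes quoted for the stem and the final pooling layer follow from the standard output-size formula, floor((n + 2p − k) / s) + 1. The sketch below assumes a stride of 2 for both stem layers, a padding of 3 for the 7×7 convolution, and a 7×7 pooling window at the end; these hyperparameters are not stated in the text and are chosen only because they reproduce the reported sizes.

```python
def conv_out(size, kernel, stride, padding):
    """Output side length of a convolution or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1

size = 224
size = conv_out(size, kernel=7, stride=2, padding=3)  # 7x7 convolution: 224 -> 112
size = conv_out(size, kernel=2, stride=2, padding=0)  # 2x2 max pooling: 112 -> 56
stem_size = size

pooled = conv_out(7, kernel=7, stride=7, padding=0)   # average pooling: 7 -> 1
total_blocks = sum([2, 2, 6, 2])                      # repetitions L over the four stages
print(stem_size, pooled, total_blocks)                # 56 1 12
```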

3.3 Experimental environment

The experiments were performed on a Dell Precision 5820 workstation with an Intel® Core i7-9800X processor, an RTX 2080 Ti GPU with 11 GB of memory, and the CUDA 10.0 computational platform. The operating system was Windows 10 Professional (64 bit), the programming language was Python 3.7, the programming environment was Jupyter Notebook, and the deep learning framework was PyTorch. Categorical cross-entropy was used as the loss function, and Adam was used as the optimizer. The mini-batch size was 32, and the number of epochs was 300. The initial learning rate was 0.001, and it was multiplied by 0.8 whenever the loss value did not decrease in five consecutive iterations.
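The learning-rate rule can be sketched in plain Python. This is one possible reading of the description, not the authors' code: "iterations" is taken to mean successive loss evaluations (for example, one per epoch), the best loss seen so far is used as the reference, and the class name `PlateauDecay` is hypothetical. PyTorch's `torch.optim.lr_scheduler.ReduceLROnPlateau` offers comparable behaviour through its `factor` and `patience` arguments, although the paper does not say whether it was used.

```python
class PlateauDecay:
    """Multiply the learning rate by `factor` after `patience` checks without a lower loss."""

    def __init__(self, lr=0.001, factor=0.8, patience=5):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.stalled = 0          # consecutive checks without improvement

    def step(self, loss):
        if loss < self.best:
            self.best = loss
            self.stalled = 0
        else:
            self.stalled += 1
            if self.stalled >= self.patience:
                self.lr *= self.factor
                self.stalled = 0
        return self.lr

scheduler = PlateauDecay()
losses = [1.0, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]   # improves once, then stalls five times
history = [scheduler.step(loss) for loss in losses]
print(history[-1])
```

In this example the rate stays at 0.001 until the fifth stalled check and is then reduced to roughly 0.0008.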

All networks were trained on the training and validation sets, and the recognition results were calculated using the test set. The precision, recall, specificity, and F1-score were used to reflect the recognition performance. The calculation formulas are as follows:

$\mathrm{Precision} = \frac{TP}{TP + FP}$ (9)

$\mathrm{Recall} = \frac{TP}{TP + FN}$ (10)

$\mathrm{Specificity} = \frac{TN}{TN + FP}$ (11)

$\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ (12)

where TP (true positive) means that the silkworm was recognized as a positive sample and the recognition result was correct. FP (false positive) means that the silkworm was classified as a positive sample, but the actual category was a negative sample. FN (false negative) means that the silkworm was identified as a negative sample, but the actual category was a positive sample. TN (true negative) means that the silkworm was identified as a negative sample and the actual category was a negative sample.
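For a multi-class task, Eqs. (9) to (12) are usually evaluated one category at a time, treating that category as positive and all others as negative, and then averaged. The sketch below follows that convention; the row/column orientation of the confusion matrix, the function names, and the 3-class example values are assumptions made for illustration, and no guard is included for categories that are never predicted.

```python
def per_class_metrics(cm):
    """cm[i][j]: number of samples of true class i predicted as class j."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                       # true class k, predicted as another class
        fp = sum(cm[i][k] for i in range(n)) - tp  # another class, predicted as k
        tn = total - tp - fn - fp
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        specificity = tn / (tn + fp)
        f1 = 2 * precision * recall / (precision + recall)
        results.append((precision, recall, specificity, f1))
    return results

def macro_average(results):
    """Unweighted mean of each metric over all categories."""
    return tuple(sum(column) / len(results) for column in zip(*results))

# Hypothetical 3-class confusion matrix with 10 test images per class
cm = [[9, 1, 0],
      [2, 8, 0],
      [0, 0, 10]]
results = per_class_metrics(cm)
average = macro_average(results)
print(results[0], average)
```

For the first class of this example, TP = 9, FN = 1, FP = 2, and TN = 18, giving a precision of 9/11, a recall of 0.9, and a specificity of 0.9.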

4 Experimental results

4.1 Recognition results of our CSP-SENet

In this section, our CSP-SENet was trained and evaluated on the constructed dataset, and the loss value and accuracy on the training and validation sets were recorded to evaluate the convergence effect.

Figure 6 depicts the loss value and accuracy during the training process on the training and validation sets. The loss value was considerably high and the accuracy was extremely low at the initial stages of training. As the epoch increased, the loss decreased, and the accuracy increased sharply. The model reached relative stability after 150 epochs. The loss value reached about 0.1 and 0.2, and the accuracy reached about 99% and 97% in the training and the validation sets, respectively, at the end stages of training.

Fig. 6

The loss value and accuracy during the training process in the training and validation sets.

The model weights that achieved the best accuracy on the validation set were used for model testing. To observe the details of the recognition results, a confusion matrix was adopted to visualize the results. As shown in Fig. 7, the labels “3-1” to “5-7” represent the days from the first day at instar 3 to the seventh day at instar 5, the horizontal axis denotes the predictions of the trained model, and the vertical axis refers to the true label of each category. The numbers on the diagonal of the matrix indicate the numbers of correct recognitions, and the other cells represent the numbers of incorrect recognitions.

Fig. 7

The confusion matrix of CSP-SENet.

Misrecognitions involving three or more images were as follows: four images of the third day at instar 5 were recognized as the fourth day at instar 5; three and four images of the fourth day at instar 5 were recognized as the second and third days at instar 5, respectively; and three images of the fifth day at instar 5 were recognized as the seventh day at instar 5. These results show that images of adjacent days within an instar are more likely to be misrecognized due to their high similarity, and that images of instar 5 are more likely to be misclassified. This corresponds to real situations, in which it is more difficult to distinguish the day instar by appearance in late instar 5 due to the growth of the silkworm.

Table 1 lists the recognition results for each category according to the confusion matrix. The best results were found for “4-1” and “4-4,” while the lowest results were found for “5-4.” In summary, CSP-SENet achieved a precision of 0.9743, a recall of 0.9743, a specificity of 0.9980, and an F1-score of 0.9742 on the test set of 1,400 images.

Table 1

Recognition results of CSP-SENet for each category

Category	Precision	Recall	Specificity	F1-score
3-1	0.99	0.99	0.99923	0.99
3-2	0.99	0.98020	0.99923	0.98507
3-3	0.99	0.99	0.99923	0.99
4-1	1.0	0.99010	1.0	0.99502
4-2	0.97	0.97980	0.99769	0.97487
4-3	0.99	0.98020	0.99923	0.98507
4-4	1.0	0.99010	1.0	0.99502
5-1	0.99	0.97059	0.99923	0.98020
5-2	0.96	0.94120	0.99691	0.95050
5-3	0.95	0.95	0.99615	0.95
5-4	0.9	0.92784	0.99233	0.91371
5-5	0.94	0.98947	0.99540	0.96410
5-6	0.98	0.98990	0.99846	0.98492
5-7	0.99	0.97060	0.99923	0.98020
Average	0.9743	0.9743	0.9980	0.9742

4.2 Comparison with the state-of-the-art networks

To verify the recognition capability of the proposed network, several state-of-the-art networks, including DarkNet-53 [38], CSPNet [15], ResNet-50 [14], and Swin Transformer [39], were selected for comparison with CSP-SENet.

Table 2 reports the comparison results of the five selected networks. CSP-SENet outperformed all the selected algorithms not only in terms of the number of parameters, weight size, and GFLOPs, but also in terms of precision, recall, specificity, and F1-score. Specifically, compared to DarkNet-53, the number of parameters, weight size, and GFLOPs of CSP-SENet were less than one third, while the precision, recall, and F1-score were 0.0700, 0.0594, and 0.0715 higher, respectively. Due to the reduced number of convolutional kernels, the number of parameters, weight size, and GFLOPs of CSP-SENet were significantly lower than those of CSPNet, while the feature fusion and SENet brought improvements of 0.0207, 0.0209, and 0.0189 in precision, recall, and F1-score, respectively. The same advantages are also reflected in the comparison with ResNet-50 and Swin Transformer. These results show that the proposed CSP-SENet has better overall capability.

Table 2
Comparison results of CSP-SENet with other state-of-the-art networks

Network	Params	Weight size	GFLOPs	Precision	Recall	Specificity	F1-score
DarkNet-53	31.5 M	120.2 MB	11.2 G	0.9043	0.9149	0.9927	0.9027
CSPNet	20.8 M	79.4 MB	6.2 G	0.9536	0.9534	0.9964	0.9553
ResNet-50	23.6 M	89.8 MB	8.3 G	0.9479	0.9481	0.9960	0.9476
Swin-Trans (tiny)	27.5 M	104.9 MB	8.7 G	0.9471	0.9429	0.9955	0.9430
CSP-SENet (ours)	9.7 M	36.8 MB	3.0 G	0.9743	0.9743	0.9980	0.9742

4.3 Comparison with related networks

In this section, two types of related networks were compared with our CSP-SENet. The first type combines CSPNet with image attention mechanisms, such as CSPNet+ECA [40], CSPNet+CBAM [41], and CSPNet+CANet [42], and the second type comprises networks that apply SENet, such as GhostNet [29] and MobileNet v3 small [30].

Table 3 summarizes the results of the selected networks. Adding CBAM, CANet, or ECANet to CSPNet increased the recognition precision without significantly increasing the number of parameters or the computational burden. However, the performance of these networks on the dataset was not as good as that of CSP-SENet. Meanwhile, owing to their lightweight design, GhostNet and MobileNet v3 have significant advantages over CSP-SENet in terms of the number of parameters and GFLOPs, but their recognition results were relatively limited. These results illustrate the recognition performance of the proposed method.

Table 3
Comparison results of CSP-SENet with other related networks

Network	Params	Weight size	GFLOPs	Precision	Recall	Specificity	F1-score
CSPNet+CBAM	20.9 M	80.1 MB	6.3 G	0.9657	0.9659	0.9974	0.9656
CSPNet+CANet	20.9 M	79.7 MB	6.2 G	0.9657	0.9657	0.9974	0.9655
CSPNet+ECA	20.9 M	79.4 MB	6.2 G	0.9650	0.9648	0.9973	0.9647
GhostNet	3.9 M	15.0 MB	0.31 G	0.9579	0.9584	0.9968	0.9579
MobileNet v3	1.7 M	6.4 MB	0.13 G	0.9314	0.9326	0.9947	0.9311
CSP-SENet (ours)	9.6 M	36.8 MB	3.0 G	0.9743	0.9743	0.9980	0.9742

4.4 Ablation study

To evaluate the effectiveness of the crucial components of the proposed method, we decomposed the CSP-SE block into different structures and used them to build networks with the same configuration as CSP-SENet (same number of convolutional kernels and L) for training and testing, in order to verify the recognition effect of each structure. As shown in Fig. 8, the first structure added SENet to the CSP block (Method I), the second was our CSP-SE block without SENet (Method II), and the third used two 3×3 convolutional layers to replace the 7×7 and 5×5 convolutions in the CSP-SE block (Method III). In addition, we also tested changing the value of L in the four stages of CSP-SENet from 2:2:6:2 to 2:3:5:2 (which is the same ratio as ResNet-50) (Method IV).

Fig. 8

Three structures for ablation analysis.

Table 4 reports the comparison results for the different components. First, when SENet was added to CSPNet, the recognition precision, recall, and F1-score increased by 0.0135, 0.0139, and 0.0118, respectively, compared to the original CSPNet. When SENet was removed from our CSP-SE block, the recognition precision, recall, and F1-score decreased by 0.0122, 0.0123, and 0.0123, respectively, compared to CSP-SENet. This demonstrated the effect of SENet on recognition performance. Second, when 3×3 convolutional layers were used to replace the 7×7 and 5×5 convolutional layers in the CSP-SE block, the recognition precision, recall, and F1-score each decreased by 0.0186, demonstrating the effect of the feature fusion from the different receptive fields. Finally, when the value of L in the four stages was changed from 2:2:6:2 to 3:3:6:6, the recognition precision, recall, and F1-score were reduced by 0.0036, 0.0036, and 0.0038, respectively, thus demonstrating the role of the stage ratio.

Table 4

Comparison results for different components

Method	Precision	Recall	Specificity	F1-score
I	0.9671	0.9673	0.9975	0.9671
II	0.9621	0.9620	0.9971	0.9619
III	0.9557	0.9557	0.9966	0.9556
IV	0.9707	0.9706	0.9978	0.9704
CSP-SENet	0.9743	0.9743	0.9980	0.9742

5 Conclusions

This study proposes a new method for the day instar recognition of adult silkworms based on feature fusion and image attention mechanisms. The images from the first day of instar 3 to the seventh day of instar 5 were photographed using a mobile phone under real rearing conditions, and a dataset containing 7,000 images was constructed for recognition. An effective module was proposed based on feature fusion from different receptive fields and the image attention mechanism to extract diversified features and key information, and a network was built using the proposed module. Based on the experimental results, the following specific conclusions can be drawn:

A recognition method for the day instar of adult silkworms was proposed, which can fill the gaps in this field and provide a theoretical reference for related work on the recognition of silkworms and other insects.

CSP-SENet was developed based on CSPNet and SENet, in which feature fusion from different receptive fields and SENet were adopted to improve the recognition performance. The experiments demonstrated the advantages of CSP-SENet, which has better recognition precision and a lower computational cost.

However, our study still has some shortcomings. The dataset contains images of only one silkworm variety, and the number of images and categories in the dataset needs to be further expanded. The key visual and behavioral characteristics of each day age, such as dormancy, food ringing, molting, and spitting, were not sufficiently considered, and the depth of the research on day age recognition needs to be further enhanced. For these reasons, in a follow-up study, we plan to enrich the dataset with more images of silkworms. At the same time, we will also study key behavioral traits during the silkworm’s growth to expand the practical implications of the study.

Acknowledgments

This study was supported by the Open Competition Mechanism to select the best Candidates from Sichuan Academy of Agricultural Sciences (1 + 9KJGG008), the Natural Science Foundation of Sichuan, China (2023NSFSC0498), and the National Modern Agricultural Industrial Technology System Special Project (CARS-18).
