SE-Mask R-CNN: An improved Mask R-CNN for apple detection and segmentation

Abstract

Fruit detection and segmentation are essential operations in orchard yield estimation, and the quality of the estimate depends directly on their speed and accuracy. In this work, we propose an effective method based on Mask R-CNN to detect and segment apples in the complex environment of an orchard. Firstly, the squeeze-and-excitation block is introduced into the ResNet-50 backbone, which distributes the available computational resources channel-wise to the most informative feature maps. Secondly, the aspect ratio is introduced into the bounding box regression loss, which promotes the regression by deforming the predicted bounding boxes toward the shape of the apple boxes. Finally, we replace the NMS operation in Mask R-CNN with Soft-NMS, which removes redundant bounding boxes while retaining correct detections of overlapping apples. Experimental results on the Minneapple dataset demonstrate that our method outperforms several state-of-the-art methods on apple detection and segmentation.

Keywords

Apple detection and segmentation; complex background; squeeze-and-excitation block; aspect ratio; Soft-NMS

1 Introduction

Apples are grown on a large scale all over the world because of their taste and nutritional value. The most time-consuming and laborious task in apple orchards is harvesting [1]. Yield estimation is important for the efficient management of harvest operations, but traditional manual estimation is no longer suitable for industrial-scale apple planting. Many machine learning algorithms perform well at apple detection and segmentation and are widely used in apple orchards; they allow yield estimation with little manual work and higher precision. However, the background in a real apple orchard is complex: fluctuating illumination, dense fruit distribution, occlusion of fruits by branches and leaves, overlapping fruits, and varying camera angle and distance all affect target detection and make accurate identification of fruits difficult [2].

At present, machine learning algorithms combined with computer vision [3–8] are still the dominant approaches for fruit detection and segmentation in orchards. Usually, these methods use a series of image pre-processing operations, such as color threshold segmentation, Circular Hough Transform (CHT), fruit edge detection, region growing, and watershed segmentation, to extract the features of fruit from images. The extracted features, including color, texture, morphology, or a combination of several features, are fed into machine learning models such as support vector machines (SVM), K-means clustering, and template matching for supervised or unsupervised learning. In [9], a series of processing operations was applied to citrus images: conversion of the RGB image to HSV, thresholding, orange color detection, noise removal, watershed segmentation, and counting. A correlation coefficient R² of 0.93 was obtained between the citrus counting algorithm and counting performed through human observation. Zhuang et al. [10] used block-based local homomorphic filtering on citrus images to obtain an illumination-compensated image, from which an adaptive enhanced red and green chromatic map was generated. The Otsu algorithm, morphological operations, marker-controlled watershed segmentation, and convex hull operations were then used in combination to locate potential citrus regions in the chromatic map. Local texture information was extracted from the potential regions using local binary patterns and fed to a histogram intersection kernel-based support vector machine to make the final decision. Gene-Mola et al. [11] proposed a fruit detection algorithm based on reflectance thresholding and SVM, and reduced fruit occlusion for LiDAR-based approaches in two different ways: applying forced air flow through an air-assisted sprayer and using multi-view sensing. Although the above methods can identify the fruit target in images, the identification accuracy decreases when the background conditions change. Furthermore, it is difficult to find a universal method that can detect and segment fruits at different growth stages, especially in complex environments.

Compared with the traditional methods, deep learning methods are widely used in image object detection because of their automatic feature extraction, outstanding learning performance, and strong adaptability to variation in the working scene. In recent years, the advantages of deep learning over algorithms based on handcrafted features such as color, shape, and texture have become obvious [12]. Tian et al. [13] proposed an improved YOLO-V3 model that uses DenseNet to process low-resolution feature layers for detecting apples during different growth stages in orchards with complex backgrounds. Mao et al. [8] proposed a cucumber detection method with a multi-path convolutional neural network (MPCNN), combined with color component selection and SVM. Sa et al. [14] used imagery from two modalities, RGB and Near-Infrared (NIR), to train the Faster R-CNN model, and explored early and late fusion methods for combining the multi-modal information. Jia et al. [2] used a Residual Network (ResNet) combined with DenseNet as the backbone network for Mask Region Convolutional Neural Network (Mask R-CNN), which reduces the number of input parameters and speeds up recognition. Kang and Chen [15] proposed a multi-function network for real-time detection and semantic segmentation of apples and branches in orchard environments using a visual sensor. Atrous spatial pyramid pooling (ASPP) and a gated feature pyramid network (FPN) enhance the feature extraction ability of the network, and a lightweight backbone based on the residual network architecture improves its real-time computation performance. Yu et al. [1] proposed a visual localization method for strawberry picking that uses Mask R-CNN to generate mask images of ripe fruits; compared with four traditional methods, this method showed improved universality and robustness in a non-structural environment.

The methods mentioned above achieved good results in fruit detection and segmentation. However, these general object detection and segmentation methods include no specific improvements for the features of fruit in complex backgrounds. Therefore, the performance of apple detection and segmentation can be considerably improved by a method designed for the features of apples in complex backgrounds.

In this paper, we propose an improved Mask R-CNN for apple detection and segmentation named SE-Mask R-CNN. Firstly, the detector network should pay more attention to the feature maps that contain more information about apples, so the squeeze-and-excitation block [16] is introduced into the ResNet-50 backbone to distribute the available computational resources channel-wise to the most informative feature maps. Secondly, we optimize the bounding box regression loss with the aspect ratio to make Mask R-CNN more suitable for apple detection and segmentation; this assists the regression by deforming the predicted bounding boxes toward the shape of the apples during training. Finally, to improve detection in complex backgrounds with overlap and occlusion, we replace the traditional NMS in Mask R-CNN with Soft-NMS [17], which removes redundant bounding boxes while retaining correct detections.

The rest of this paper is organized as follows. Section 2 describes the SE-Mask R-CNN for apple detection and segmentation in detail. Section 3 introduces the model training and loss function of SE-Mask R-CNN. Section 4 reports the experimental evaluation and analysis of our method. Section 5 presents conclusions and future work.

2 Method

Mask R-CNN [18] is a simple but efficient image segmentation model that performs object detection and instance segmentation at the same time. Mask R-CNN combines Faster R-CNN [19] for object detection and FCN [20] for semantic segmentation. In this section, we introduce SE-Mask R-CNN in three parts, which correspond to its three main operations, and we optimize some of them to make Mask R-CNN more suitable for apple detection and segmentation. The framework of SE-Mask R-CNN is shown in Fig. 1.

Fig. 1

The framework of SE-Mask R-CNN.

2.1 Feature extraction

The feature extraction network of SE-Mask R-CNN can be varied in its weight layers and depth. In general, a deeper feature extraction network may yield higher accuracy, but training and testing become slower, and once the network reaches a certain depth the test error may even increase. ResNet deals with this problem effectively through its residual structure, which learns the residual between input and output, accelerating training and avoiding this loss of accuracy.

The feature maps extracted by ResNet are abundant; however, the subsequent network should pay more attention to the feature maps that contain more information about apples. It is therefore necessary to assign more weight to the channels that contain more apple features than to the others. The attention mechanism was developed to solve this problem; it can be interpreted as a method that distributes the available computational resources unevenly, usually toward the most informative components of a signal. The squeeze-and-excitation block (SE block) is a channel-wise attention mechanism, so we introduce it into ResNet-50 (Fig. 2). The feature maps generated by the residual module of each layer are processed by an SE block, which assigns a weight to each channel and gives more weight to the feature maps that contain more information about apples.

Fig. 2

The schema of the original Residual module (left) and the SE-ResNet module (right).

There are three steps in the SE block: squeeze, excitation, and scale (Fig. 3) [16]. Firstly, the feature maps U_C are squeezed by global average pooling to generate channel-wise statistics. Then, the scale factors S_C, which describe the channel-wise dependencies, are captured by a bottleneck of two fully-connected (FC) layers around a non-linearity. Finally, the output of the block is obtained by scaling the input feature maps U_C with the scale factors S_C. The output of the SE block is defined channel by channel as follows:

$\tilde{X}_{i} = F_{scale} (U_{i}, S_{i}) = S_{i} \cdot U_{i}, \quad i = 1, \ldots, C$ (1)

where U_i and S_i respectively represent the i-th input feature map and its scale factor.
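The three steps can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `se_block`, the tiny weight matrices `w1` and `w2`, and the omission of FC biases are assumptions made to keep the example short, while the ReLU and sigmoid follow the original SE design [16].

```python
import math

def se_block(feature_maps, w1, w2):
    """Apply squeeze, excitation, and scale to a list of C feature maps.

    feature_maps: list of C maps, each a list of rows (H x W floats).
    w1: (C/r) x C weights of the first FC layer (hypothetical values).
    w2: C x (C/r) weights of the second FC layer (hypothetical values).
    """
    # Squeeze: global average pooling gives one statistic per channel.
    z = [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
         for fmap in feature_maps]
    # Excitation: FC -> ReLU -> FC -> sigmoid yields one scale factor per channel.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    scales = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
              for row in w2]
    # Scale: multiply every value of channel i by its scale factor S_i.
    scaled = [[[s * v for v in row] for row in fmap]
              for fmap, s in zip(feature_maps, scales)]
    return scaled, scales

# Two 2x2 channels; the second carries a stronger response.
maps = [[[0.0, 1.0], [1.0, 0.0]], [[4.0, 4.0], [4.0, 4.0]]]
w1 = [[0.5, 0.5]]       # reduces 2 channels to 1 hidden unit
w2 = [[-1.0], [1.0]]    # expands back to 2 channel weights
scaled, scales = se_block(maps, w1, w2)
print(scales)
```

With these illustrative weights the second channel receives the larger scale factor, so its feature map is passed on almost unchanged while the first is suppressed.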

Fig. 3

The structure of the squeeze-and-excitation block.

In this way, the subsequent layers pay more attention to the feature maps that contain more information about apples, which helps to improve the accuracy of apple detection and segmentation.

In a feature extraction network with multiple convolution layers, the higher layers extract low-resolution, semantically strong features, while the lower layers extract high-resolution, semantically weak features. To better detect apples at different scales, the Feature Pyramid Network (FPN), which was developed for building high-level semantic feature maps at all scales, is introduced into the feature extraction network (Fig. 4). In the FPN architecture, the top-level feature map is upsampled to the size of the lower-level feature map, the number of channels of the lower-level feature map is adjusted by a 1×1 convolution, and finally the top-level features are merged with the lower-level features.
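The top-down merge can be illustrated with a small single-channel sketch. It assumes nearest-neighbour upsampling by a factor of 2 and element-wise addition as the merge, and it treats the lower-level map as already projected by the 1×1 convolution; the helper names `upsample2x` and `merge_top_down` are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour upsampling: repeat every value twice along both axes."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def merge_top_down(top, lateral):
    """Add the upsampled top-level map to the lateral (lower-level) map."""
    up = upsample2x(top)
    return [[u + l for u, l in zip(urow, lrow)]
            for urow, lrow in zip(up, lateral)]

top = [[1.0, 2.0]]                    # 1x2 coarse, semantically strong map
lateral = [[0.1, 0.1, 0.1, 0.1],
           [0.2, 0.2, 0.2, 0.2]]      # 2x4 finer map after the 1x1 convolution
merged = merge_top_down(top, lateral)
print(merged)
```

The merged map keeps the resolution of the lower level while carrying the coarse map's values into every location it covers.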

Fig. 4

The structure of FPN. The top-level features are merged with the underlying features.

2.2 RoIs generation and RoIAlign

The feature maps generated by the feature extraction network are used as the input of the Region Proposal Network (RPN) (Fig. 5). According to the size and scale of the apples and the shooting distance of the images, five area scales (32 × 32, 64 × 64, 128 × 128, 256 × 256, and 512 × 512) are combined with three aspect ratios (1:1, 1:2, and 2:1) to generate 15 kinds of anchor boxes for each pixel of the feature map. Two tasks are performed on the anchors. On the one hand, the classification branch with a SoftMax layer classifies the anchor boxes. We assign a positive label to two kinds of anchors: (i) the anchor with the highest non-zero Intersection over Union (IoU) with a ground-truth box, and (ii) any anchor whose IoU with a ground-truth box is higher than 0.7. On the other hand, the bounding box regression branch regresses the coordinates of the anchor boxes. At the end of the RPN, the proposal boxes are screened from the anchor boxes using the outputs of the two branches: 2 × 15 scores give the object and non-object probabilities, and 4 × 15 coordinates give the offsets to the target boxes. The Regions of Interest (RoIs) are generated by mapping the proposal boxes onto the feature maps extracted by the backbone.
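The anchor construction and the two positive-label rules can be sketched as below. This is an illustration under stated assumptions, not the authors' code: boxes are (x1, y1, x2, y2) tuples, each ratio is taken as height divided by width, and the names `make_anchors`, `iou`, and `label_anchors` are hypothetical.

```python
import math

SCALES = [32, 64, 128, 256, 512]   # side of the square with the anchor's area
RATIOS = [1.0, 0.5, 2.0]           # height / width for 1:1, 1:2, 2:1 (assumed order)

def make_anchors(cx, cy):
    """Return the 15 anchors (x1, y1, x2, y2) centred on one location."""
    anchors = []
    for s in SCALES:
        for r in RATIOS:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)   # keeps w * h == s * s and h / w == r
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def label_anchors(anchors, gt_boxes, pos_thresh=0.7):
    """Positive if IoU > pos_thresh (rule ii) or best match of a box (rule i)."""
    ious = [[iou(a, g) for g in gt_boxes] for a in anchors]
    positive = [max(row) > pos_thresh for row in ious]
    for j in range(len(gt_boxes)):
        best = max(range(len(anchors)), key=lambda i: ious[i][j])
        if ious[best][j] > 0:
            positive[best] = True
    return positive

anchors = make_anchors(100.0, 100.0)
labels = label_anchors(anchors, [(70.0, 70.0, 130.0, 130.0)])
print(len(anchors), labels.index(True))
```

For a 60 × 60 ground-truth box centred on the same location, only the 64 × 64 square anchor exceeds the 0.7 threshold, which matches the intuition that near-square anchors of similar size fit apples best.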

Fig. 5

The structure of the Region Proposal Network (RPN).

Before classification, bounding box regression, and instance segmentation, the features corresponding to each RoI must be extracted from the feature maps. In Faster R-CNN, this is done by RoI Pooling. Firstly, the floating-point RoIs are mapped to the corresponding positions on the feature maps. Secondly, the RoIs are subdivided into spatial bins. Finally, the feature values of each bin are aggregated by max pooling. However, the first two steps each perform a quantization operation; these operations may introduce misalignment between the RoIs and the extracted features [18] and have a negative effect on apple detection and segmentation. RoIAlign solves this problem by removing the quantization operations of RoI Pooling and using bilinear interpolation to compute the feature values at the exact positions of the RoIs in the feature maps.
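The difference between quantizing a coordinate and interpolating at it can be seen in a short sketch. It samples a single point only (a full RoIAlign averages or max-pools several such samples per bin), assumes the point lies inside the map, and uses a hypothetical helper name `bilinear`.

```python
import math

def bilinear(fmap, y, x):
    """Sample fmap (a list of rows) at the real-valued location (y, x)."""
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1 = min(y0 + 1, len(fmap) - 1)
    x1 = min(x0 + 1, len(fmap[0]) - 1)
    dy, dx = y - y0, x - x0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

fmap = [[0.0, 1.0, 2.0],
        [3.0, 4.0, 5.0],
        [6.0, 7.0, 8.0]]
aligned = bilinear(fmap, 0.5, 1.5)      # RoIAlign-style sample at the exact point
quantized = fmap[int(0.5)][int(1.5)]    # RoI Pooling-style truncated coordinate
print(aligned, quantized)
```

On this linear map the interpolated value reflects the true position between four cells, whereas truncation silently snaps the sample to the upper-left cell.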

2.3 Apple detection and Instance segmentation

In the last stage of SE-Mask R-CNN, the target detection and instance segmentation results are generated by a multi-branch prediction network, which contains three prediction branches. The fully convolutional network (FCN) branch segments the apples accurately at the pixel level; it is an end-to-end network that uses convolution and deconvolution to classify each pixel. The fully connected branch classifies the boxes in the RoIs. The regression branch performs bounding box regression.

In the regression operation, non-maximum suppression (NMS) is used to remove redundant bounding boxes. If two bounding boxes with different confidence scores overlap heavily, NMS removes the box with the lower confidence score regardless of how high that score is, so a correct detection of an overlapping apple is removed by mistake. We therefore introduce Soft-NMS into Mask R-CNN, which solves this problem by changing only one line of code in NMS. Instead of removing the lower-scoring box as NMS does, Soft-NMS decreases its score: the higher the IoU, the larger the decrease. Finally, the boxes whose scores fall below a threshold are removed. The results generated by NMS and Soft-NMS are compared in Fig. 6.
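A minimal sketch of this score decay is given below. The text does not state which Soft-NMS variant or parameters were used, so the Gaussian decay, `sigma=0.5`, and `score_thresh=0.3` are assumptions chosen for illustration, as are the function names.

```python
import math

def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.3):
    """Gaussian Soft-NMS: decay scores of overlapping boxes instead of deleting them."""
    remaining = list(zip(boxes, scores))
    kept = []
    while remaining:
        # Pick the highest-scoring box still under consideration.
        best = max(range(len(remaining)), key=lambda i: remaining[i][1])
        box, score = remaining.pop(best)
        kept.append((box, score))
        decayed = []
        for other, s in remaining:
            # The larger the overlap with the picked box, the stronger the decay.
            s *= math.exp(-iou(box, other) ** 2 / sigma)
            if s >= score_thresh:
                decayed.append((other, s))
        remaining = decayed
    return kept

boxes = [(0.0, 0.0, 10.0, 10.0),    # A: best detection of the first apple
         (3.0, 0.0, 13.0, 10.0),    # B: a second, overlapping apple (IoU with A about 0.54)
         (0.5, 0.0, 10.5, 10.0)]    # C: near-duplicate of A (IoU with A about 0.90)
scores = [0.9, 0.8, 0.7]
kept = soft_nms(boxes, scores)
print([(b, round(s, 3)) for b, s in kept])
```

In this toy case a hard NMS with an IoU threshold of 0.5 would discard both B and C, whereas the decay keeps the overlapping apple B with a reduced score and still drops the near-duplicate C.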

Fig. 6

The comparison of the results generated by NMS and Soft-NMS. Apples detected by the regression branch (left), NMS (middle), and Soft-NMS (right).

3 Model training and loss function

3.1 Transfer learning

Transfer learning transfers the parameters of a trained model to a new model, which promotes the training of the new model and compensates for a lack of training data. Because most of the basic features extracted from images, such as edges and shapes, are shared across tasks, the pre-trained parameters can be shared with the new model to speed up and optimize its learning, without learning from scratch [21]. In this work, an SE-Mask R-CNN pre-trained on the COCO dataset [22] is used. COCO is a large dataset with 118k images and 91 categories.

3.2 Loss function

The loss function of SE-Mask R-CNN consists of two major parts: the loss L_RPN computed by the classification and regression operations in the RPN, and the loss L_Mul-Branch of the multi-branch prediction network. The total loss function can be described as follows:

$L_{total} = L_{RPN} + L_{Mul - Branch}$ (2)

In the training process of the RPN, the classification loss is generated by the SoftMax layer to measure the probability that an anchor box contains an apple, and the bounding box regression loss is generated by the regression operation, which regresses the anchors that belong to the foreground. L_RPN is described as follows:

$L_{RPN} = \frac{1}{N_{cls}} \sum_{i} L_{cls} (p_{i}, p_{i}^{*}) + \lambda \frac{1}{N_{reg}} \sum_{i} p_{i}^{*} L_{reg} (t_{i}, t_{i}^{*})$ (3)

where i is the index of an anchor in a batch and p_i is the predicted probability that anchor i contains an apple. The ground-truth label $p_{i}^{*}$ is 1 if the anchor is positive and 0 if the anchor is negative. t_i is a vector representing the 4 parameterized coordinates of the predicted bounding box, and $t_{i}^{*}$ is that of the ground-truth box associated with a positive anchor. The classification loss is the SoftMax loss over two classes (object vs. non-object). For the regression loss, Mask R-CNN uses the smooth L1 loss. The two terms are normalized by N_cls and N_reg and balanced by the weight λ.

The loss function L_Mul-Branch of the multi-branch prediction network consists of three parts: the classification loss L_cls computed by the classification branch, the bounding box regression loss L_reg computed by the regression branch, and the mask loss L_mask computed by the FCN. L_Mul-Branch is described as follows:

$L_{Mul-Branch} = \frac{1}{N_{cls2}} \sum_{i} L_{cls} (p_{i}, p_{i}^{*}) + \lambda_{2} \frac{1}{N_{reg2}} \sum_{i} p_{i}^{*} L_{reg} (t_{i}, t_{i}^{*}) + \gamma \frac{1}{N_{mask}} \sum_{i} L_{mask} (s_{i}, s_{i}^{*})$ (4)

where the hyperparameters λ₂ and γ balance the losses of the regression and mask branches, and s_i and $s_{i}^{*}$ respectively represent the binary mask matrices of the prediction and the ground-truth label. The classification loss L_cls, regression loss L_reg, and mask loss L_mask in the above two formulas are described as follows:

$L_{cls} (p_{i}, p_{i}^{*}) = - p_{i}^{*} \log (p_{i}) - (1 - p_{i}^{*}) \log (1 - p_{i})$ (5)

$L_{reg} (t_{i}, t_{i}^{*}) = smooth_{L1} (t_{i} - t_{i}^{*}), \quad smooth_{L1} (x) = \begin{cases} 0.5 x^{2}, & \text{if } | x | < 1 \\ | x | - 0.5, & \text{otherwise} \end{cases}$ (6)

$L_{mask} (s_{i}, s_{i}^{*}) = - (s_{i}^{*} \log (s_{i}) + (1 - s_{i}^{*}) \log (1 - s_{i}))$ (7)

In this work, we aim to detect and segment apples in orchards. Because apples are roughly round, both the predicted boxes and the ground-truth boxes should be close to square, that is, their aspect ratios are near 1. The aspect ratios of a predicted box and its ground-truth box should also be similar, so the ratio between the two aspect ratios is also near 1. We therefore optimize the regression loss L_reg by introducing the aspect ratio into it, which deforms the shape of the predicted boxes and makes Mask R-CNN better suited to the shape of apples. The regression loss L_reg is computed from the four parameterized coordinates t_x, t_y, t_w, t_h, which respectively represent the offsets of the box center location, width, and height between two boxes. They are defined as follows:

$\begin{aligned} t_{x} &= \frac{x - x_{a}}{w_{a}}, & t_{y} &= \frac{y - y_{a}}{h_{a}}, \\ t_{w} &= \log \left( \frac{w}{w_{a}} \right), & t_{h} &= \log \left( \frac{h}{h_{a}} \right), \\ t_{x}^{*} &= \frac{x^{*} - x_{a}}{w_{a}}, & t_{y}^{*} &= \frac{y^{*} - y_{a}}{h_{a}}, \\ t_{w}^{*} &= \log \left( \frac{w^{*}}{w_{a}} \right), & t_{h}^{*} &= \log \left( \frac{h^{*}}{h_{a}} \right) \end{aligned}$ (8)

where x, x_a, and x^* respectively denote the center x-coordinate of the predicted box, the anchor box, and the ground-truth box (likewise for y, w, and h).
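The parameterization in Eq. (8) can be checked with a short round-trip sketch. Boxes are assumed to be (center x, center y, width, height) tuples, and the helper names `encode` and `decode` are hypothetical.

```python
import math

def encode(box, anchor):
    """Eq. (8): offsets of a box (cx, cy, w, h) relative to an anchor box."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))

def decode(t, anchor):
    """Invert Eq. (8): recover (cx, cy, w, h) from offsets and the anchor."""
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))

anchor = (100.0, 100.0, 64.0, 64.0)
gt_box = (108.0, 96.0, 80.0, 64.0)
t_star = encode(gt_box, anchor)
recovered = decode(t_star, anchor)
print(t_star)
print(recovered)
```

Here the center offsets are fractions of the anchor size (0.125 and -0.0625), the width offset is log(1.25), and the height offset is 0 because the box and the anchor have the same height.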

The ratio between the aspect ratios of the predicted box and the ground-truth box can then be calculated as follows:

$ratio (t, t^{*}) = \frac{w / h}{w^{*} / h^{*}} = \frac{\exp (t_{w} - t_{h})}{\exp (t_{w}^{*} - t_{h}^{*})}$ (9)

where the second equality follows from $w = w_{a} \exp (t_{w})$ and $h = h_{a} \exp (t_{h})$, so the anchor terms cancel. A ratio of 1 means that the aspect ratio of the predicted box is the same as that of the ground-truth box. A natural aspect ratio loss would be 1 − ratio, but this quantity is not symmetric: its value changes if the predicted box and the ground-truth box are swapped, which is unfair for training. We therefore combine a symmetrized aspect ratio term with the bounding box regression loss L_reg to form an aspect ratio bounding box regression loss L_at-reg as follows:

$L_{at-reg} = L_{reg} + \left| 1 - \frac{ratio (t, t^{*}) + ratio (t^{*}, t)}{2} \right|$ (10)
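A small numerical sketch of Eqs. (6), (9), and (10) is given below. The `ratio` function starts from (w/h)/(w*/h*); substituting w = w_a exp(t_w) and h = h_a exp(t_h) from Eq. (8) cancels the anchor terms and leaves exp(t_w − t_h)/exp(t_w* − t_h*). Summing the smooth L1 loss over the four coordinates is an assumption, and the function names are hypothetical.

```python
import math

def smooth_l1(x):
    """Eq. (6): quadratic near zero, linear elsewhere."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def ratio(t, t_star):
    """Eq. (9): aspect ratio (w/h) of box t over that of t_star; t = (tx, ty, tw, th)."""
    return math.exp(t[2] - t[3]) / math.exp(t_star[2] - t_star[3])

def at_reg_loss(t, t_star):
    """Eq. (10): smooth L1 regression loss plus the symmetric aspect ratio term."""
    l_reg = sum(smooth_l1(a - b) for a, b in zip(t, t_star))
    symmetric = (ratio(t, t_star) + ratio(t_star, t)) / 2
    return l_reg + abs(1 - symmetric)

t_pred = (0.0, 0.0, math.log(2.0), 0.0)   # predicted box twice as wide as the anchor
t_gt = (0.0, 0.0, 0.0, 0.0)               # ground-truth box equal to the anchor
loss = at_reg_loss(t_pred, t_gt)
print(ratio(t_pred, t_gt), ratio(t_gt, t_pred), loss)
```

The two one-sided ratios are 2 and 0.5, so 1 − ratio would give different penalties depending on the direction, while the averaged term gives 0.25 either way. Since (r + 1/r)/2 is at least 1 for any positive r, the added term is zero only when the two aspect ratios match.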

4 Experiments

The SE-Mask R-CNN proposed in this paper is implemented in the PyTorch deep learning framework, with a TITAN Xp GPU for acceleration and an Intel(R) Xeon(R) E5-2620 v4 CPU. The network initialization parameters are shown in Table 1.

Table 1
Initialization parameters of SE-Mask R-CNN

Size of input images	Epoch	Batch size	Momentum	Initial learning rate	Decay
1280x720	13	2	0.9	0.0025	0.0001

4.1 Dataset

Limited by time and effort, researchers have usually worked with small datasets that show little variation. Sa et al. [14] used images acquired in complex indoor environments, but the dataset contains only 122 images. Bargoti and Underwood [23] built a dataset of roughly 1000 images to train and test an apple detection network, but the images are too small for the multiple convolution layers of Mask R-CNN to learn useful feature maps. To verify the stability and reliability of the proposed SE-Mask R-CNN, we conduct experiments on the Minneapple dataset [24]. The Minneapple dataset contains 1000 apple images with a resolution of 1280 × 720 pixels (670 in the training set and 330 in the test set), which were taken from either the sunny or the shady side of the tree row against a complex background (Fig. 7). Data acquisition was spread over multiple days to ensure varied illumination conditions. In addition, the images focus not on the fruit but on the whole fruit tree, which matches actual orchard conditions.

Fig. 7

The different complex conditions of apple images in Minneapple dataset.

4.2 Evaluation metrics

In this work, we use two kinds of evaluation metrics to evaluate the results of apple detection and segmentation respectively. Intersection over Union (IoU) is used to evaluate the result of apple detection, and pixel accuracy is used to evaluate the result of apple segmentation. Moreover, we compute the class IoU and class pixel accuracy for apples. The formulas are as follows:

$IoU = \frac{S_{overlap}}{S_{union}}$ (11)

$Pixel\ accuracy = \frac{\sum_{i} n_{ii}}{\sum_{i} a_{i}}$ (12)

where S_overlap is the intersection area of the predicted bounding box and the true bounding box, S_union is the union area of the two bounding boxes, n_ij is the number of pixels of class i predicted as class j, and a_i is the total number of pixels of class i.
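Both metrics are simple to compute; the sketch below follows Eqs. (11) and (12) directly. The box format (x1, y1, x2, y2), the function names, and the toy confusion matrix are assumptions for illustration and are not taken from the experiments.

```python
def box_iou(a, b):
    """Eq. (11): overlap area divided by union area of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    overlap = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - overlap)
    return overlap / union if union > 0 else 0.0

def pixel_accuracy(confusion):
    """Eq. (12): confusion[i][j] counts pixels of class i predicted as class j."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)   # sum of a_i over all classes
    return correct / total

iou_value = box_iou((0.0, 0.0, 10.0, 10.0), (5.0, 0.0, 15.0, 10.0))
# Rows: true background, true apple; columns: predicted background, predicted apple.
confusion = [[950, 10],
             [15, 25]]
acc = pixel_accuracy(confusion)
print(iou_value, acc)
```

In this toy matrix the pixel accuracy is 0.975 even though only 25 of the 40 apple pixels are labeled correctly, which illustrates why class-level metrics for apples are reported alongside the overall pixel accuracy.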

4.3 Performance of the proposed method

In this section, we compare the original Mask R-CNN with our proposed method. The experimental results are summarized in Table 2. The first row shows the performance of the original Mask R-CNN. The second row shows the performance of Mask R-CNN with the SE block combined with ResNet-50 as the feature extraction network. The third row shows the performance of Mask R-CNN with the SE block and with the aspect ratio added to the bounding box regression loss. The fourth row shows the performance of our full method, which additionally replaces NMS with Soft-NMS.

Table 2
Ablation Experiment of the SE-Mask R-CNN

Method	Mean IoU	Pixel Acc.	Class IoU	Class Mean Acc.
Mask R-CNN (baseline)	0.766	0.980	0.552	0.848
Mask R-CNN & SE block	0.778	0.982	0.574	0.835
Mask R-CNN & SE block & aspect ratio	0.786	0.983	0.591	0.836
SE-Mask R-CNN (ours)	0.794	0.984	0.605	0.852

As shown in Table 2, Mask R-CNN & SE block improves on the original Mask R-CNN in Mean IoU, pixel accuracy, and Class IoU, although its Class Mean Accuracy is lower (0.835 versus 0.848). This result suggests that the feature maps weighted channel-wise by the SE block are highly discriminative. Mask R-CNN & SE block & aspect ratio performs better than Mask R-CNN & SE block on all four metrics, which indicates that optimizing the bounding box regression loss with the aspect ratio is effective for apple detection and segmentation: the aspect ratio constrains the shape of the bounding box to make it similar to the shape of an apple. Our full method performs better than Mask R-CNN & SE block & aspect ratio, which indicates that Soft-NMS removes redundant bounding boxes and retains correct detections more reliably than NMS; as a result, the Class Mean Accuracy of SE-Mask R-CNN (0.852) is higher than that of the original Mask R-CNN (0.848). In addition, as shown in Fig. 8, the segmentation result of our method is better than that of the original Mask R-CNN in complex and dense environments.

Fig. 8

The original image (left) and the segmentation results generated by Mask R-CNN (middle) and SE-Mask R-CNN (right).

4.4 Comparisons with the state-of-the-art

To further evaluate the performance of our method, we compare it with state-of-the-art methods. Semi-supervised GMM [25] is a semi-supervised clustering method based on Gaussian Mixture Models (GMM), which can be trained with few labels. User-supervised GMM [25] uses the same model as the semi-supervised GMM, but relies on human supervision to create one model per tree row. U-Net [26] is a semantic segmentation network built as a fully convolutional network with a contracting path and an expansive path. U-Net can perform well with very few training images, so it is suitable for apple detection and segmentation when the dataset is not large. The experimental results are summarized in Table 3.

Table 3
Comparison of the proposed method and state-of-the-art methods

Method	Mean IoU	Pixel Acc.	Class IoU	Class Mean Acc.
Semi-supervised GMM	0.635	0.968	0.341	0.455
User-supervised GMM	0.649	0.959	0.455	0.634
U-Net	0.685	0.962	0.410	0.848
SE-Mask R-CNN (ours)	0.794	0.984	0.605	0.852

As shown in Table 3, our method outperforms the compared state-of-the-art methods on all four metrics. Our method extracts feature maps with SE-ResNet, which distributes the available computational resources channel-wise to the most informative feature maps, so its Class IoU and Class Mean Accuracy are higher than those of the compared methods. In addition, we introduce the aspect ratio into the bounding box regression loss, which adapts Mask R-CNN to the shape of apples, and we replace NMS with the more robust Soft-NMS algorithm, which removes redundant bounding boxes while retaining overlapping detections, so individual apples are distinguished more accurately than by the compared methods.

In addition, we submitted the segmentation results of our method to the Robotic Sensor Network Laboratories (RSN) MinneApple Fruit Segmentation Challenge, where they were superior to those of the other teams on all evaluation metrics.

5 Conclusion and future work

In this paper, we propose an effective apple detection and segmentation method named SE-Mask R-CNN. Firstly, in the feature extraction network, the SE block assigns a weight to each channel of the feature maps extracted by ResNet, which distributes the available computational resources to the most informative feature maps. Then, in the RPN, the proposals are detected from anchors, and the RoIs are generated by mapping the proposals onto the feature maps. The bounding box regression loss with aspect ratio deforms the predicted bounding boxes toward the shape of apples. Finally, the detection and segmentation results are generated by the multi-branch prediction network, in which we replace NMS with the Soft-NMS algorithm to remove redundant bounding boxes robustly. Experimental results on the Minneapple dataset demonstrate that our method outperforms several state-of-the-art methods on apple detection and segmentation.

Despite the considerable performance achieved by our method, there is still room for improvement, such as missed detections and over-segmentation, especially where apples are densely distributed in the image. In future work, we will further explore the shape information of apples to improve performance.

Acknowledgments

This work was supported in part by the Key Research and Development Project of Shandong Province under Grant 2019JZZY010706 and in part by the Foundation of Taishan Industry Leading Talents.

Conflict of interest

The authors declare that they have no conflict of interest.

References

1. Yu, Zhang, Yang, et al., Fruit Detection for Strawberry Harvesting Robot in Non-Structural Environment Based On Mask-RCNN, Comput Electron Agr (2019), 104846.
2. Jia, Tian, Luo, et al., Detection and Segmentation of Overlapped Fruits Based On Optimized Mask R-CNN Application in Apple Harvesting Robot, Comput Electron Agr (2020), 105380.
3. Afonso M.A.M.A., Detection of Tomato Flowers From Greenhouse Images Using Colorspace Transformations, Cham (2019), 146–155.
4. Dias P.A., Tabb and Medeiros, Multispecies Fruit Flower Detection Using a Refined Semantic Segmentation Network, IEEE Robotics and Automation Letters 4 (2018), 3003–3010.
5. Gongal, Silwal, Amatya, et al., Apple Crop-Load Estimation with Over-The-Row Machine Vision System, Comput Electron Agr (2016), 26–35.
6. Kang, Chen, Fast Implementation of Real-Time Fruit Detection in Apple Orchards Using Deep Learning, Comput Electron Agr (2020), 105108.
7. Kang, Chen, Fruit Detection, Segmentation and 3D Visualisation of Environments in Apple Orchards, Comput Electron Agr (2020), 105302.
8. Mao, Li, Ma, et al., Automatic Cucumber Recognition Algorithm for Harvesting Robots in the Natural Environment Using Deep Learning and Multi-Feature Fusion, Comput Electron Agr (2020), 105254.
9. Dorj, Lee, Yun, An Yield Estimation in Citrus Orchards Via Fruit Detection and Counting Using Image Processing, Comput Electron Agr (2017), 103–112.
10. Zhuang J.J., Luo S.M., Hou C.J., et al., Detection of Orchard Citrus Fruits Using a Monocular Machine Vision-Based Method for Automatic Fruit Picking Applications, Comput Electron Agr (2018), 64–73.
11. Gene-Mola, Gregorio, Auat Cheein, et al., Fruit Detection, Yield Prediction and Canopy Geometric Characterization Using LiDAR with Forced Air Flow, Comput Electron Agr (2020), 105121.
12. Koirala, Walsh K.B., Wang, et al., Deep Learning - Method Overview and Review of Use for Fruit Detection and Yield Estimation, Comput Electron Agr (2019), 219–234.
13. Tian, Yang, Wang, et al., Apple Detection During Different Growth Stages in Orchards Using the Improved YOLO-V3 Model, Comput Electron Agr (2019), 417–426.
14. Sa, Ge, Dayoub, et al., DeepFruits: A Fruit Detection System Using Deep Neural Networks, Sensors (Basel) 8 (2016), 1222.
15. Kang and Chen, Fruit Detection and Segmentation for Apple-Harvesting Using Visual Sensor in Orchards, Sensors (Basel) 20 (2019), 4599.
16. Hu, Shen, Sun, Squeeze-and-Excitation Networks. (2018).
17. Bodla, Singh, Chellappa, et al., Improving Object Detection with One Line of Code. (2017).
18. He, Gkioxari, Dollár, et al., Mask R-CNN, 2017 IEEE International Conference on Computer Vision (ICCV) (2017), 2980–2988.
19. Ren, He, Girshick, et al., Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, arXiv e-prints (2015), 1497–1506.
20. Long, Shelhamer, Darrell, Fully Convolutional Networks for Semantic Segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015).
21. Shao, Zhu and Li, Transfer Learning for Visual Categorization: A Survey, IEEE T Neur Net Lear 5 (2015), 1019–1034.
22. Lin T.Y., Maire, Belongie S.J., et al., Microsoft COCO: Common Objects in Context. In: Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V. Lecture Notes in Computer Science, Springer (2014), 740–755.
23. Bargoti, Underwood, Deep Fruit Detection in Orchards, 2017 IEEE International Conference on Robotics and Automation (ICRA) (2017), 3626–3633.
24. Hani, Roy and Isler, MinneApple: A Benchmark Dataset for Apple Detection and Segmentation, IEEE Robotics and Automation Letters 2 (2020), 852–858.
25. Roy, Kislay, Plonski P.A., et al., Vision-Based Preharvest Yield Mapping for Apple Orchards, Comput Electron Agr (2019), 104897.
26. Ronneberger, Fischer, Brox, U-Net: Convolutional Networks for Biomedical Image Segmentation, In: Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015 - 18th International Conference Munich, Germany, October 5-9, 2015, Proceedings, Part III. Lecture Notes in Computer Science, Springer (2015), 234–241.
