Abstract
Fruit detection is essential for harvesting robot platforms. However, complicated environmental attributes such as illumination variation and occlusion have made fruit detection a challenging task. In this study, a Transformer-based Mask region-based convolutional neural network (R-CNN) model for tomato detection and segmentation is proposed to address these difficulties. Swin Transformer is used as the backbone network for better feature extraction. Multi-scale training techniques are shown to yield significant performance gains. Apart from accurately detecting and segmenting tomatoes, the method effectively identifies tomato cultivars (normal-size and cherry tomatoes) and tomato maturity stages (fully-ripened, half-ripened, and green). Compared with existing work, the method has the best detection and segmentation performance for these tomatoes, with mean average precision (mAP) results of 89.4% and 89.2%, respectively.
Introduction
Traditional harvesting is labor-intensive and time-consuming, and the process is gradually being automated. Harvesting robots are increasingly used in agricultural production, where they can perform some of the most labor-intensive harvesting tasks in orchards [1]. Intelligence is an inevitable direction for harvesting robots [2], and vision-based fruit detection is a vital task for them, as it is the front-end perception system before subsequent grasping. However, the vision system is challenging due to factors such as fruit overlapping, occlusion, and illumination variation [3]. Accordingly, fruit detection has been explored by many researchers over the past several decades. Many methods use handcrafted features to detect fruit, and supervised methods based on machine learning are also widely applied in fruit detection.
Handcrafted features are often used to compute fruit class membership. A number of researchers have used color-based segmentation to detect fruit with distinct colors, sometimes integrating additional features such as shape and texture [4]. Plebe and Grasso [5] combined a hue saturation value (HSV) color space with adaptive shape analysis to recognize oranges. Huirong et al. [6] used the red-blue (R-B) chromatic aberration information of images to identify oranges. Kelman and Linker [7] proposed a shape-based method for apple detection, with Canny filters to identify edges in images. Edges corresponding to 3D convex objects were then detected using preprocessing operations and convexity tests. Linker et al. [8] combined color, texture, and edge shape features to detect green apples, achieving 85% accuracy for green apple recognition under natural lighting. Zhao et al. [9] investigated a tomato recognition algorithm based on image feature fusion, and used wavelet transformation to fuse the a*- and I-component extracted from the L*a*b* and YIQ color spaces. Moreover, algorithms such as the circular Hough transform (CHT) and random Hough transform (RHT) have been widely used to fit circular and oval shapes around possible regions in an image to segment fruits [10–12].
Many studies have applied machine learning techniques to fruit detection in agriculture. Ji et al. [13] used a support vector machine (SVM) to improve the accuracy and efficiency of apple recognition after extracting the color and shape features. Liu et al. [14] trained an SVM classifier using histogram of oriented gradients (HOG) descriptors to detect tomatoes. Bulanon et al. [15] used k-means clustering to detect red apples. Zhao et al. [16] identified tomatoes by extracting Haar-like features of grayscale images and classifying them using AdaBoost. Wu et al. [17] proposed an automatic algorithm for the recognition of ripening tomatoes, using a weighted relevance vector machine (RVM) classifier to classify image blocks based on color and texture features.
While traditional machine learning has brought great improvements to fruit detection, these methods require handcrafted features and are usually tedious. Deep learning enables automatic feature extraction, which is highly adaptable in a variety of fruit recognition scenarios [18]. Fruit detection algorithms based on deep learning include two-stage and single-stage detectors.
Two-stage detectors (region proposal followed by classification and detection) include the region-based convolutional neural network (R-CNN) [19], Faster R-CNN [20], and Mask R-CNN [21]. Bargoti and Underwood [22] used Faster R-CNN to detect mangoes and apples against orchard backgrounds with an F1 score over 0.9. Sa et al. [23] explored early and late fusion of multimodal (RGB and NIR) information, and applied the Faster R-CNN model with multi-vision sensors to detect fruits and vegetables, improving the detection accuracy of sweet peppers to 83.8%. In [24], Mask R-CNN was improved by combining ResNet and DenseNet to make it more suitable for segmentation of overlapping apples. Yu et al. [25] used Mask R-CNN to generate segmented images of strawberry fruits, and proposed a visual localization method for strawberry picking points.
Single-stage detectors have a more concise structure, which removes the region proposal stage. Representative methods are You Only Look Once (YOLO) [26–28] and the single shot MultiBox detector (SSD) [29]. For the operation and visual positioning of a banana robot, Wu et al. [30] proposed an improved YOLOv5 model for multi-target recognition of bananas. To detect and locate oil-seeded camellia fruit, YOLO-Oleifera was developed as a fruit detection model [31]. Wang et al. [32] applied YOLOv5 object detection network and DBSCAN point cloud clustering method to determine the location of litchi bunch fruits. Koirala et al. [33] developed a MangoYOLO framework, which achieved an F1 score of 0.968 and average accuracy of 98.3% in mango detection. Yuan et al. [34] proposed a method based on SSD to detect cherry tomatoes.
In the case of tomato detection, current methods have two main limitations. First, a rectangular bounding box cannot match the shape of a tomato exactly. Second, these methods cannot distinguish cultivars or ripeness; most can only identify normal-sized ripe or green tomatoes [35–37].
Vision Transformer algorithms applied to image feature extraction have been proposed with good results [38]. Inspired by this, we propose a Transformer-based Mask R-CNN structure that is better suited to identifying tomato varieties and ripeness, and apply it to tomato detection and segmentation. The remainder of this paper is organized as follows. Section 2 specifies the method of implementation by introducing feature extraction, generation of RoIs, and the multi-branch network structure, and also describes the model training and loss function. Experiments and their results are presented in Section 3. We describe our conclusions in Section 4.
Methods
Mask R-CNN [21] is an object detection and instance segmentation algorithm. An extension of Faster R-CNN [20], it adds a mask branch at the end of the model that predicts the class of each pixel, alongside the original classification and regression branches. We propose a Transformer-based Mask R-CNN structure and apply it to the vision system of a tomato harvesting robot so that it can effectively detect and segment various tomatoes. We replace the backbone network with the Swin Transformer [38] network for feature extraction. The feature maps obtained through the backbone network are used as input to the region proposal network (RPN), which produces candidate object bounding boxes as regions of interest (RoIs). After obtaining the features of RoIs, the fully connected layers (FC) and fully convolutional network (FCN) are used to generate the classification scores of objects, bounding boxes, and segmentation masks. The model structure is shown in Fig. 1.

Fig. 1. Overall structure of improved Mask R-CNN.
Computer vision modeling has long been dominated by convolutional neural networks (CNNs), such as VGG [39], GoogLeNet [40], ResNet [41], DenseNet [42], and EfficientNet [43]. In contrast, the most popular network in natural language processing (NLP) is the Transformer [44], which is designed for sequence modeling and transduction tasks and focuses on modeling long-range dependencies in data.
We use a recent vision Transformer, the Swin Transformer, as the backbone network for feature extraction. This backbone is a hierarchical Transformer whose representation is computed with shifted windows. By restricting the self-attention computation to non-overlapping local windows while allowing cross-window connections, the shifted-window scheme brings higher efficiency. The hierarchy has the flexibility to model at different scales, with computational complexity that is linear in image size. These qualities allow the Swin Transformer to better preserve the features of tomato images at multiple scales.
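The efficiency argument can be made concrete with the complexity expressions given in the Swin Transformer paper [38]. The symbols here are taken from [38] and are not defined in the text above: the feature map has h × w patches, C is the channel dimension, and M is the window size. Under those definitions, global multi-head self-attention (MSA) and window-based self-attention (W-MSA) cost approximately:

```latex
\Omega(\mathrm{MSA}) = 4hwC^{2} + 2(hw)^{2}C, \qquad
\Omega(\mathrm{W\text{-}MSA}) = 4hwC^{2} + 2M^{2}hwC
```

The first expression is quadratic in the number of patches hw, while the second is linear in hw when M is fixed, which is the sense in which the complexity is linear in image size.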
To achieve the best speed-accuracy trade-off, we use the tiny version of the Swin Transformer (Swin-T). The model structure, shown in Fig. 2, has four steps, each with similar repeating units. In step 1, the input H × W × 3 RGB image is partitioned into non-overlapping patches of size 4 × 4. Each patch is treated as a token with feature dimension 4 × 4 × 3 = 48, giving H/4 × W/4 tokens.

Fig. 2. Structure of Swin Transformer.
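The patch partition in step 1 can be illustrated with a small calculation. This is only a sketch: the function name `patch_partition` and the 800 × 1332 example size are hypothetical, and the code assumes the image sides are divisible by the patch size.

```python
def patch_partition(height, width, patch_size=4, channels=3):
    """Return (tokens_high, tokens_wide, token_dim) for non-overlapping patches.

    Assumes height and width are divisible by patch_size.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by the patch size")
    tokens_high = height // patch_size   # number of patches along the height
    tokens_wide = width // patch_size    # number of patches along the width
    token_dim = patch_size * patch_size * channels  # raw values per patch
    return tokens_high, tokens_wide, token_dim


# Example with a hypothetical 800 x 1332 RGB input.
rows, cols, dim = patch_partition(800, 1332)
print(rows, cols, dim)  # 200 333 48
```

Each token carries the 48 raw pixel values of one 4 × 4 RGB patch, and the token grid is one quarter of the image size in each direction.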
The Swin Transformer block is the core of each step. Figure 3 shows the structure of two consecutive Swin Transformer blocks. The input feature $z^{l-1}$ is normalized by LayerNorm (LN) and passed through window multi-head self-attention (W-MSA), and a residual connection yields the intermediate feature $\hat{z}^{l}$. A second LN, a multilayer perceptron (MLP), and another residual connection then produce the block output $z^{l}$.

Fig. 3. Two Swin Transformer blocks.
The output feature $z^{l}$ of the first block is used as the input feature of the second block, whose output feature is $z^{l+1}$. SW-MSA is a multi-head self-attention module with a shifted window configuration, which addresses the lack of information exchange between non-overlapping windows in the W-MSA layer. The shifted window partition introduces connections between adjacent non-overlapping windows of the previous layer, which greatly enlarges the receptive field and improves model performance. As a result, the Swin Transformer has been reported to outperform comparable CNN backbones at lower latency, giving a better speed-accuracy trade-off [38].
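The computation of the two consecutive blocks can be written compactly. The equations below follow the formulation in the Swin Transformer paper [38]; the hatted symbols denote the intermediate features after the attention sub-layer, and MLP is the feed-forward sub-layer of each block.

```latex
\begin{aligned}
\hat{z}^{l}   &= \mathrm{W\text{-}MSA}\big(\mathrm{LN}(z^{l-1})\big) + z^{l-1}, \\
z^{l}         &= \mathrm{MLP}\big(\mathrm{LN}(\hat{z}^{l})\big) + \hat{z}^{l}, \\
\hat{z}^{l+1} &= \mathrm{SW\text{-}MSA}\big(\mathrm{LN}(z^{l})\big) + z^{l}, \\
z^{l+1}       &= \mathrm{MLP}\big(\mathrm{LN}(\hat{z}^{l+1})\big) + \hat{z}^{l+1}.
\end{aligned}
```

The only difference between the two blocks is the attention module: the first uses regular windows, the second uses shifted windows.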
The output of the Swin-T backbone network is used as the input of the region proposal network (RPN), which generates a series of region proposals. The tomatoes in the dataset differ in size, mainly because of the distance from which they were photographed. Normal-sized tomatoes and cherry tomatoes also differ in shape, resulting in a large variation in the width-to-height ratio of the visible tomato region. Therefore, we designed anchor scales of 4 × 4, 8 × 8, 16 × 16, 32 × 32, and 64 × 64 and aspect ratios of 1:1, 1:2, and 2:1. Combining every scale with every ratio gives 15 anchors of different sizes and shapes at each point of the feature map, which, mapped back to the original image, cover essentially all possible tomato targets. The RPN filters the large number of generated anchors and refines their positions to obtain proposals, which are mapped onto the feature map produced by the backbone network as regions of interest (RoIs).
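The anchor design can be sketched as the product of the five scales and three aspect ratios. The function name `make_anchors` is hypothetical, and two conventions are assumptions not stated in the text: the ratio is read as height divided by width, and each anchor keeps the area of its square scale.

```python
import math


def make_anchors(scales=(4, 8, 16, 32, 64), ratios=(0.5, 1.0, 2.0)):
    """Return (width, height) pairs, one per (scale, ratio) combination.

    Each anchor keeps the area scale * scale; ratio is height / width
    (an assumed convention, the text does not say which side is which).
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            width = scale / math.sqrt(ratio)
            height = scale * math.sqrt(ratio)
            anchors.append((width, height))
    return anchors


anchors = make_anchors()
print(len(anchors))  # 15
```

Five scales times three ratios gives the 15 anchors per feature-map location mentioned above.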
The subsequent fully connected network requires features of a fixed dimensionality, but each RoI corresponds to a different feature size. RoIAlign pools the RoIs to a fixed dimensionality. Unlike the RoI Pooling used in Faster R-CNN, RoIAlign keeps floating-point coordinates instead of rounding them to integers. The values at several sample points in each bin are obtained by bilinear interpolation, and the maximum of these values is taken as the final value of the bin. RoIAlign achieves better performance due to the use of sample points and the preservation of floating-point coordinates.
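The RoIAlign step described above can be sketched in plain Python for a single-channel feature map. This is an illustrative sketch, not the implementation used in the experiments: the function names, the 2 × 2 output size, and the 2 × 2 sample grid per bin are assumptions, and the maximum over sample points follows the description in the text.

```python
import math


def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a 2D feature map (list of rows) at float (y, x)."""
    h, w = len(feature), len(feature[0])
    # Clamp the sampling point to the valid coordinate range.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feature, roi, out_size=2, samples=2):
    """Pool an RoI to out_size x out_size without rounding its coordinates.

    roi = (y_min, x_min, y_max, x_max) in feature-map coordinates (floats).
    Each output bin takes the maximum over samples x samples points.
    """
    y_min, x_min, y_max, x_max = roi
    bin_h = (y_max - y_min) / out_size
    bin_w = (x_max - x_min) / out_size
    output = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            values = []
            for sy in range(samples):
                for sx in range(samples):
                    # Sample points are spread evenly inside the bin.
                    y = y_min + (i + (sy + 0.5) / samples) * bin_h
                    x = x_min + (j + (sx + 0.5) / samples) * bin_w
                    values.append(bilinear_sample(feature, y, x))
            row.append(max(values))
        output.append(row)
    return output


feature = [[0.0, 1.0], [2.0, 3.0]]
print(bilinear_sample(feature, 0.5, 0.5))  # 1.5
```

Because the bin boundaries and sample points stay as floats, no misalignment is introduced between the RoI and the feature map, unlike the integer rounding in RoI Pooling.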
Object detection and instance segmentation (FC/FCN)
After RoIAlign, the object detection and instance segmentation results are generated by a multi-branch prediction network. One branch employs a fully connected network (FC) to get the prediction of classification and bounding box regression, and the other, the mask branch, employs a fully convolutional network (FCN) to predict image segmentation. FCN uses convolution and deconvolution to build an end-to-end network. It convolves and pools the image to reduce the size of its feature map, and performs deconvolution to continuously improve the resolution of the feature map through interpolation. Finally, FCN determines the category of each pixel.
Model training and loss function
Because the tomato dataset contains few images, we use a model pre-trained on the ImageNet-1K [45] dataset, which contains 1.28 million training images and 50,000 validation images from 1000 classes. The pre-trained model has learned general features from the ImageNet classes, and its parameters can be fine-tuned on the tomato data. We use Swin-T as the backbone network, initialized with the ImageNet pre-trained weights, and train with the AdamW [46] optimizer. An initial learning rate of 0.0001, weight decay of 0.05, and batch size of 2 are used. The training schedule is 36 epochs, with the learning rate reduced by a factor of 10 at epochs 27 and 33.
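The step schedule can be written as a small function. This is a sketch under stated assumptions: the function name `learning_rate` is hypothetical, epochs are counted from 0, and each decay takes effect from its milestone epoch onward, which the text does not specify. Any warm-up phase is ignored.

```python
def learning_rate(epoch, base_lr=1e-4, milestones=(27, 33), factor=0.1):
    """Step schedule: multiply the rate by `factor` at each milestone epoch.

    Assumes 0-based epoch indices and that a milestone takes effect from
    that epoch onward (an assumption; the text does not specify this).
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** drops


# Print the rate before and after each decay over the 36-epoch schedule.
for epoch in (0, 26, 27, 33, 35):
    print(epoch, learning_rate(epoch))
```

With these settings the rate stays at its initial value for most of training and is reduced tenfold twice near the end.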
We use multi-scale training for data augmentation. Several input scales are defined, and one is randomly selected in each iteration; the input image is resized to the selected scale and fed into the network. Although each iteration uses only one scale, the scale differs between iterations. Multi-scale training can effectively improve object detection accuracy and increases the robustness of the network without additional computational cost. The input image size is adjusted by random scale jitter, where the scale of the short edge is randomly sampled from 480 to 800 pixels, and the scale of the long edge is fixed at 1333 pixels.
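The scale jitter can be sketched as follows. The function names are hypothetical, and one point is an assumption: the 1333-pixel long edge is treated as an upper bound applied while preserving the aspect ratio, which is a common convention in detection toolboxes but is not spelled out in the text.

```python
import random


def rescale_keep_ratio(height, width, short_target, long_cap=1333):
    """Resize so the short edge reaches short_target unless the long edge
    would exceed long_cap; the aspect ratio is preserved."""
    scale = min(long_cap / max(height, width), short_target / min(height, width))
    return int(height * scale + 0.5), int(width * scale + 0.5)


def sample_training_size(height, width, rng, short_range=(480, 800), long_cap=1333):
    """Draw one short-edge target for this iteration and return the new size."""
    short_target = rng.randint(short_range[0], short_range[1])
    return rescale_keep_ratio(height, width, short_target, long_cap)


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
print(rescale_keep_ratio(3024, 4032, 800))  # (800, 1067)
print(sample_training_size(3024, 4032, rng))
```

For the 3024 × 4032 images in the dataset, the short edge always determines the scale, because a 3:4 aspect ratio never reaches the 1333-pixel cap.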
The training loss has two components. The first is the RPN loss, with both classification and regression components, and the second is the training loss of the multi-branch prediction network. The loss function is

$$L = L_{RPN} + L_{multi}, \qquad L_{multi} = L_{cls} + \lambda L_{box} + \gamma L_{mask}.$$

The first two parts of the multi-branch loss, $L_{cls}$ and $L_{box}$, are calculated in the same way as in the RPN, with the addition of $L_{mask}$ to represent the segmentation loss. The hyperparameters λ and γ balance the training losses of the regression and mask branches.
Experiments were implemented with the PyTorch framework [47] and mmdetection [48], an open source, PyTorch-based object detection toolbox. We trained and tested models with an Nvidia RTX 2070 for GPU acceleration, Intel Core i7-9700k CPU, and 32 GB memory.
Dataset
An instance segmentation dataset of tomatoes at different maturity stages from Laboro Tomato [49] is used. A total of 804 RGB images of tomatoes were taken from Ide tomato farm in Fujisawa, Kanagawa Prefecture, Japan. The image format is JPG, and two cameras with different resolutions were used to capture the images. One part has 305 tomato images captured with a 3024 × 4032 pixel camera, and the other part has 499 tomato images captured with a 3120 × 4160 pixel camera. These tomatoes were grown in greenhouses and include normal-sized and cherry tomatoes of various maturity stages. Images were taken from different camera angles and under different lighting conditions. This better simulates the operation of the harvesting robot vision system.
The Laboro Tomato dataset has six categories. All tomatoes are divided into two categories according to cultivar (normal size and cherry tomato), and also into three categories depending on the stage of ripening (fully ripened, half-ripened, and green). Fully ripened tomatoes are red on 90% or more of the surface and are ready to be harvested. Half-ripened tomatoes are greenish and need time to ripen. Their surface is 30% to 90% red. Green tomatoes are almost completely green or white, with less than 30% red. Figure 4 shows the categories of tomatoes. The number of tomatoes labeled in each category is shown in Table 1.

Fig. 4. Images of six categories of tomatoes. a. Fully ripened, normal size; b. half-ripened, normal size; c. green, normal size; d. fully ripened cherry tomato; e. half-ripened cherry tomato; f. green cherry tomato.
For the experiments, we selected 643 tomato images for training and 161 test images for model evaluation. All images were detected and segmented.
Table 1. Number of labeled tomatoes in each category of the dataset
We adopted the mean average precision (mAP) metric to quantitatively measure the performance of the Transformer-based Mask R-CNN. Both box mAP and mask mAP were evaluated. AP is calculated from precision and recall. For binary classification, samples can be divided into four types: true positive (TP), false positive (FP), true negative (TN), and false negative (FN).
Precision is the overlap area of the predicted area and ground-truth divided by the total predicted area, and recall is the overlap area divided by the area of all ground-truth. In terms of the sample counts, precision (P) and recall (R) are defined as

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}.$$
The precision-recall (P-R) curve is obtained by plotting precision on the vertical axis against recall on the horizontal axis. AP is the average precision, which represents the area enclosed by the P-R curve and the coordinate axes, so a high AP requires both high precision and high recall. It is calculated as

$$AP = \int_{0}^{1} P(R)\, dR,$$

and mAP is the mean of the AP values over all categories.
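The metric can be sketched for one category as follows. The function names are hypothetical, and the way the area under the P-R curve is approximated is an assumption: this sketch uses all-point interpolation with a monotone precision envelope, whereas the exact protocol of the evaluation toolbox used in the experiments is not stated in the text.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the P-R curve for one category.

    scores: confidence of each detection.
    is_true_positive: whether each detection matched a ground-truth object.
    num_ground_truth: number of ground-truth objects of this category.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    precisions, recalls = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)
    # Precision envelope: make precision non-increasing as recall grows.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_category):
    """mAP is the mean of the per-category AP values."""
    return sum(ap_per_category) / len(ap_per_category)


# Four detections sorted by confidence, three correct, four ground-truth objects.
print(average_precision([0.9, 0.8, 0.7, 0.6], [True, False, True, True], 4))  # 0.625
```

A detection counts as a true positive only if it overlaps a ground-truth object by at least the chosen IoU threshold, so the same detections can give different AP values at different thresholds.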
We performed ablation experiments to verify the performance of our model. Experiments were initialized with the appropriate pre-training model. The long and short edge scales of images were resized to 1333 and 800 pixels, respectively, without changing the aspect ratio. The experimental results are summarized in Table 2.
Table 2. Ablation experiment results
In the first row, we evaluate the performance of the original Mask R-CNN using ResNet50 as the backbone, which is the baseline of our method. The second row shows the performance of Mask R-CNN, which uses a ResNet101 backbone. Compared with the baseline, the performance improvement with ResNet101 is not significant, but the parameters and GFLOPs of the model increase substantially. In the third row, we use Swin-T as the backbone. Compared to the baseline, we see 2.9% and 3.5% gains in terms of mAP box and mAP mask (IoU=0.5), respectively, with slightly higher parameters and GFLOPs. In the fourth row, we use multi-scale training to resize the input so that the scale of the short edge is between 480 and 800 pixels, while the scale of the long edge is 1333 pixels. The results indicate the effectiveness of multi-scale training. Overall, our approach provides a significant performance improvement over the baseline, with slightly more parameters.

Fig. 5. Prediction results of improved Mask R-CNN. a. and b. Predicted results for normal-sized tomatoes. c. and d. Predicted results for cherry tomatoes.
We also compare the proposed Mask R-CNN model with five typical object detection and instance segmentation methods: (1) YOLOv3 [28], (2) SSD [29], (3) Faster R-CNN [20], (4) Cascade Mask R-CNN [50], and (5) Hybrid Task Cascade (HTC) [51]. The first two of these methods perform the prediction and loss calculation of boxes only once, i.e., they have one-stage detection architectures. The other methods use a two-stage detection architecture, which takes advantage of region proposals to significantly improve localization accuracy. Both Cascade Mask R-CNN and HTC progressively refine the results by cascading multiple detectors. Each detector defines positive and negative samples based on a different IoU threshold, with the output of one detector serving as the input to the next. Table 3 compares the performance of each model.
Table 3. Comparison with different methods
Table 3 shows that our improved Mask R-CNN with Swin Transformer is superior to the other methods. Its detection and segmentation mAP reach 89.4% and 89.2%, respectively, at an IoU threshold of 0.5. The prediction results of our method are shown in Fig. 5.
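The IoU threshold of 0.5 decides whether a predicted box counts as a correct detection. A minimal sketch for axis-aligned boxes is given below; the function names and the (x_min, y_min, x_max, y_max) box format are assumptions for illustration, and mask IoU is computed analogously over pixels rather than box areas.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred_box, gt_box, threshold=0.5):
    """A prediction matches a ground-truth box if their IoU reaches the threshold."""
    return box_iou(pred_box, gt_box) >= threshold


print(is_match((0, 0, 10, 10), (0, 0, 10, 6)))  # True, since IoU = 0.6
```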
The first three methods only predict the bounding box without predicting the segmentation mask. Compared with Cascade Mask R-CNN and HTC, our model has higher detection accuracy with fewer parameters and GFLOPs. This indicates that our model is able to accurately detect and segment tomatoes from different varieties and growth stages. Our model achieves the best speed-accuracy trade-off, which is essential for tomato harvesting robots.
We presented a fruit detection and segmentation algorithm for tomatoes using Transformer-based Mask R-CNN. Our model can effectively detect normal-sized tomatoes and cherry tomatoes, and determine whether tomatoes are fully-ripened, half-ripened, or green. The model can accurately segment tomatoes in orchards. Based on that, harvesting robots can accurately identify specific types of tomatoes, so as to automatically pick fully-ripened tomatoes, for example. There are wider applications in tomato production based on our study, including harvest forecasts based on tomato maturity, pesticide spraying at specific ripening stages, and temperature control in greenhouses according to ripeness.
In developing this system, we replaced the backbone network of the Mask R-CNN model. Using the Swin Transformer for feature extraction improves on CNN backbones. We investigated multi-scale training methods to further improve performance. Compared with other methods, ours had the best detection and segmentation performance for all kinds of tomatoes, with detection accuracy of 89.4% and segmentation accuracy of 89.2%. Our method has a small number of parameters and GFLOPs, giving the best speed-accuracy trade-off. However, many aspects still need improvement. We will explore the detection of tomatoes under different occlusion areas by using more metrics. Future work will also integrate the proposed algorithm with a practical harvesting robot. We will compress the model by methods such as network pruning to make it lighter and faster.
Acknowledgments
This work was supported in part by the Key Research and Development Project of Shandong Province under Grant 2019JZZY010706 and in part by the TaiShan Industrial Experts Programme under Grant tscy20200303.
