Detection of multi-size peach in orchard using RGB-D camera combined with an improved DEtection Transformer model

Abstract

This paper proposes an improved DEtection Transformer network (named R2N-DETR) combined with a Kinect-V2 camera for detecting multi-size peaches in orchards under varied illumination and fruit occlusion. The R2N-DETR model first employs Res2Net-50 to extract a fused low- and high-level feature map containing fine spatial features and precise semantic information of multi-size peaches from Red-Green-Blue-Depth (RGB-D) images. Second, an encoder-decoder is applied to the feature map to obtain the global context. Finally, all objects are detected according to each object’s global context. On 1101 RGB-D images (collected from two orchards over three years), the R2N-DETR model achieves an average precision of 0.944 and an average detection time of 53 ms per image. The developed system could provide precise visual guidance for robotic picking and contribute to improving yield prediction by providing accurate fruit counting.

Keywords

Deep learning; peach detection; RGB-D image; R2N-DETR; open orchard

1. Introduction

Peach is one of the most economically important fruit species worldwide, with a global yield of 24.5 million tons in 2018 [1]. Given this huge yield, peach harvesting remains a challenging task for the peach industry. Manual harvesting is still the most common method for harvesting fresh peaches. However, labor shortages, high labor costs, and the inefficiency of manual harvesting are forcing growers to adopt more efficient methods such as automated robot-based picking. Because fresh peaches bruise easily in collisions, most existing studies focus on harvesting peaches from trees by picking rather than by bulk methods [2, 3]. A typical fruit-picking robot first uses an optical sensor to image the surrounding environment, then detects fruits in the images using a machine vision algorithm, and finally directs the manipulator to pick the fruits through the control system [4]. Due to the variation in peach size, shape, color, and texture, as well as varied illumination and fruit occlusion, the collected images have both a complex foreground (peaches) and a complex background (the surrounding environment) [5]. Accurately and reliably detecting peaches in the images is an essential preliminary step and one of the key challenges of robotic picking. To address this challenge, this study developed a deep learning-powered computer vision system to detect multi-size peaches in images collected in orchards, from which the position of the peaches and morphological features such as size and shape can be derived. This information is useful both for robotic peach picking and for evaluating peach growth and development.

For robotic picking, the vision system is only required to detect fruits in the target area, which is the area reachable by the manipulator. Information in non-target areas negatively affects the robotic picking, and thus should be removed. To this end, an RGB-D (Red, Green, Blue-Depth) sensor is suitable for image capturing. The Kinect v2 launched by Microsoft can capture RGB and depth information of the scene simultaneously and has been widely used to build low-cost vision systems for fruit detection in orchards [6]. RGB images provide rich color, texture, and geometric shape information of fruits, and the corresponding depth images record the distance between the camera (manipulator) and the objects. The non-target objects in the background could be removed by fusing the color and depth images [7]. After that, fruit positions could be detected from the fused images using object detection algorithms [8].

“General object detection” aims to detect different types of objects under a unified framework. Before 2014, traditional object detection algorithms built on hand-crafted features attracted most of the attention; after 2014, deep learning techniques achieved remarkable breakthroughs, which made deep learning-based detection a research hot-spot [9]. In the deep learning era, object detection can be grouped into two branches: anchor-based detection and anchor-free detection. Anchor-based detection first presets numerous anchors on the image, then refines each anchor’s coordinates and classifies the object in it, and finally outputs the refined anchors as predictions. This type of detection has developed along two lines: “two-stage detection” and “one-stage detection”. The former (such as the Region-based Convolutional Neural Network (R-CNN) series) frames detection as a “coarse-to-fine” process [10], whereas the latter (such as the You Only Look Once (YOLO) series) frames detection as “complete in one step” [11]. Unlike anchor-based detection, anchor-free detectors, such as CornerNet, CenterNet, and DEtection Transformer (DETR), eliminate the need to design anchors manually and directly predict objects using a key point, center point, or region of the object [12].

As a “detection application”, the methods used to detect fruit in orchards have closely followed developments in “general object detection”. Tu et al. [13], Wu et al. [3], and Wu et al. [14] detected fruits using classification methods based on hand-crafted features. Because even well-designed hand-crafted features generalize poorly, recent studies have developed deep learning methods, which offer efficient learning capability, robust generalizability, and plasticity, to detect fruits on trees. For example, Song et al. [15] applied a Faster R-CNN to detect kiwifruits in the orchard, taking 347 ms per image and achieving 87.61% average precision (AP). Chu et al. [16] used a suppression Mask R-CNN for apple detection and segmentation, with an F1-score of 0.905 and a detection time of 250 ms. Both two-stage R-CNN-based models achieved good detection performance but needed a long time for detection. Tian et al. [17] employed an improved YOLO version 3 (YOLO-V3) model for real-time apple detection during different growth stages in orchards. Mirhaji et al. [18] employed YOLO-V4 to detect oranges in real time, obtaining an mAP of 90.88% and a processing time of 23.6 ms. Both studies reported that the one-stage YOLO models achieved a tradeoff between accuracy and time consumption. However, the four anchor-based detectors above need prior knowledge to set up the anchor boxes. During training, the massive number of redundant boxes causes a serious imbalance between positive and negative samples, which reduces the detection accuracy of the model. Furthermore, detection requires a post-processing step, namely non-maximum suppression (NMS), to remove redundant detected boxes [17]. These inherent characteristics reduce the detection accuracy of the models. Moreover, because the scales and aspect ratios of anchor boxes are fixed, these models have difficulty handling object candidates with large shape variations [17]. The size and shape of the fruits in the images vary with shooting distance and occlusion. Hence, it is challenging to use anchor-based models to accurately detect fruits of multiple shapes and sizes in orchards.

From the perspective of “information perception”, disturbances such as illumination variations, occlusions, and color changes are common in orchards. As a result, the object area, which is composed of a fruit and its background, appears in the captured images in various forms. When objects must be detected against complex backgrounds, the interaction between regions of the image helps to determine the attributes of each region [19]. However, the four detection models above extract spatial features from the image through convolutional neural networks (CNNs). Each convolutional module of a CNN can only attend to the information in a specific area of the image, that is, local convolutional features. CNNs thus sever the connection between different spatial positions of the image and cannot model interactions between local features, which reduces the accuracy of the fruit detection models [20].

An anchor-free DETR, which comprises a backbone, encoder-decoder transformers, and a detection head, was proposed to address these limitations [21]. DETR does not depend on human-designed priors (including anchors and region proposals) and dispenses with NMS entirely, performing truly end-to-end detection. Meanwhile, the encoder-decoder transformers of DETR can deeply exploit the relationships between the spatial positions of the feature map. In this way, each position receives attention from all other positions, which allows DETR to achieve 42.0%–44.9% AP on the COCO dataset [22]. However, the main disadvantage of DETR is that it struggles to detect small objects. The reason is that, to spare DETR a huge computational cost, its backbone provides the subsequent transformers with a low-spatial-resolution map containing single-scale convolutional features [23]. Because several fruits in an image are defined as small objects [24], fruit detection in orchards using DETR remains challenging.

The overall goal of this research was to develop an improved DETR to enhance the accuracy of identification and localization of peaches of various sizes, including small and medium objects, in orchards. The specific objectives of this study were to (1) apply the Res2Net module [25] to optimize the backbone of the DETR; (2) compare the performance of the improved DETR, named R2N-DETR, with DETR and two popular CNNs, namely Faster R-CNN and YOLO-V4; and (3) verify that the improved backbone optimizes the detection of multi-size objects by the R2N-DETR model.

2. Data collection and processing

2.1 Experimental orchards

The plant materials used in this study were grown in two orchards - Yangshan Town (YT) orchard and Hongsha Bay (HB) orchard - located in Wuxi City, Jiangsu Province, China. There are two cultivars in the YT orchard, named ‘Hujin’ and ‘Baifeng’. Mature ‘Hujin’ peaches appear shadow light red and light green, and the surface of mature ‘Baifeng’ peaches is mostly light green. The trees are less than 3 m tall, and the distance between the trunks of two adjacent trees ranges from 2.5 m to 3.5 m. The tree height and planting density in the HB orchard were the same as those in the YT orchard. Imaging data were collected in open orchards under changing illumination and weather conditions. Figure 1 presents representative RGB images collected in the two orchards under different weather conditions (cloudy and sunny) and illumination conditions (overcast, direct lighting, back-lighting). Consequently, the peaches vary widely in color and size, and the images have complex backgrounds. All the peaches were judged to be mature by the orchard manager.

Figure 1.

Four sample images from the collected dataset: (a) ‘Hujin’ peaches imaged in the YT orchard on July 26, 2018, under cloudy and overcast conditions; (b) ‘Hujin’ peaches imaged in the YT orchard on July 07, 2019, under sunny and direct lighting conditions; (c) ‘Baifeng’ peaches imaged in the YT orchard on July 15, 2021, under sunny and overcast conditions; (d) ‘Hujin’ peaches imaged in the HB orchard on July 24, 2021, under sunny and back lighting conditions.

2.2 Image acquisition and dataset preparation

A Microsoft Kinect-V2 sensor interfaced through the iai_kinect2 Robot Operating System (ROS) package (URL: https://github.com/code-iai/iai_kinect2) on an Ubuntu 14.04 system was used to acquire imaging data. The iai_kinect2 ROS package was used because it contains an intrinsic calibration procedure that corrects camera distortions; thus, the Kinect-V2 sensor could capture corrected RGB images (Fig. 2a) and depth images (Fig. 2b) with a resolution of 960 $\times$ 540 pixels. Following [7], a distance threshold of 1200 was applied to the depth map to obtain the mask of the target area. Then, RGB images containing only the object areas (Fig. 2c) were generated by multiplying the original RGB images pixel-wise by the mask. Because the RGB and depth sensors have different fields of view, the border of each matched image contained only background. In this study, the central 768 $\times$ 512 pixels of each image were used as the input of the detection model to remove this background without losing information (Fig. 2d). A total of 1101 preprocessed images were obtained through fusion and cropping. Hereafter, the preprocessed images are referred to as RGB-D images. The images were manually labeled with the labelImg tool (URL: https://github.com/tzutalin/labelImg) and randomly divided into training (70%, 770 images) and test (30%, 331 images) datasets. The details of the datasets are summarized in Table 1.
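The masking and cropping steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the study: the function names `center_offsets` and `mask_and_crop` are hypothetical, images are represented as plain nested lists, the threshold is assumed to be expressed in the same unit as the depth map, and treating a depth reading of 0 as invalid (background) is an added assumption.

```python
def center_offsets(width, height, crop_w, crop_h):
    """Return (left, top) of a crop centered in a width x height image."""
    return (width - crop_w) // 2, (height - crop_h) // 2


def mask_and_crop(rgb, depth, max_depth=1200, crop_w=768, crop_h=512):
    """Zero out pixels outside the target area, then keep the central crop.

    rgb   : list of rows, each row a list of (r, g, b) tuples
    depth : list of rows of depth readings, aligned with rgb
    """
    height, width = len(depth), len(depth[0])
    left, top = center_offsets(width, height, crop_w, crop_h)
    cropped = []
    for y in range(top, top + crop_h):
        row = []
        for x in range(left, left + crop_w):
            # Assumption: a reading of 0 means "no depth measured".
            in_target_area = 0 < depth[y][x] <= max_depth
            row.append(rgb[y][x] if in_target_area else (0, 0, 0))
        cropped.append(row)
    return cropped


print(center_offsets(960, 540, 768, 512))  # (96, 14)
```

For a 960 $\times$ 540 input and a 768 $\times$ 512 crop, the sketch removes a 96-pixel margin on the left and right and a 14-pixel margin at the top and bottom.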

Table 1
Imaging datasets for peach detection in the two orchards

Datasets	Time	Cultivar	Location	Total #images	Total #peaches
HY-18 ${}^{\text{a}}$	26-Jul-18	Hujin	YT orchard	112	688
HY-19 ${}^{\text{b}}$	7-Jul-19	Hujin	YT orchard	111	467
BY-21 ${}^{\text{c}}$	15-Jul-21	Baifeng	YT orchard	402	2533
HH-21 ${}^{\text{d}}$	24-Jul-21	Hujin	HB orchard	476	3113

${}^{\text{a}}$ Each image contained 4 $\sim$ 20 mature peaches. ${}^{\text{b}}$ Each image contained 2 $\sim$ 18 mature and immature peaches. ${}^{\text{c}}$ Each image contained 5 $\sim$ 18 mature and immature peaches. ${}^{\text{d}}$ Each image contained 5 $\sim$ 20 mature peaches.

Figure 2.

Illustration of the preprocessing of the raw imaging data. (a) Raw RGB image captured by Kinect-V2; (b) Aligned depth image extracted from the point clouds; (c) Preprocessed RGB-D image; and (d) Cropped RGB-D image.

In general, a deep learning-based detection model requires tens of thousands of training images to achieve high accuracy, generalization, and robustness. The self-collected training dataset contained only 770 images, which is not enough for training. To augment the training data, five operations (flipping, rotation, scaling, translation, and noise addition) were applied to the raw training dataset using Python 3.8. First, all images were flipped horizontally (770 flipped images). Second, each raw and flipped image was rotated by each angle in $\{-15^{\circ}, -10^{\circ}, -5^{\circ}, 5^{\circ}, 10^{\circ}, 15^{\circ}\}$ (9240 rotated images). Third, each raw image was randomly scaled to 0.7 $\sim$ 1.3 times its original size, and the scaled image was cropped or padded to the original size (770 scaled images). Fourth, each raw image was translated left or right by 0 $\sim$ 122 pixels and up or down by 0 $\sim$ 65 pixels (770 translated images). Finally, random Gaussian noise was added to each pixel of the raw image (770 noised images). After data augmentation, the training dataset contained 13090 RGB-D images.
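The size of the augmented dataset follows directly from the counts given above. The short calculation below only reproduces that arithmetic; the variable names are illustrative.

```python
RAW = 770                                  # raw training images
ANGLES = [-15, -10, -5, 5, 10, 15]         # rotation angles in degrees

flipped = RAW                              # one horizontal flip per raw image
rotated = (RAW + flipped) * len(ANGLES)    # raw and flipped images, six angles each
scaled = RAW                               # one random scaling per raw image
translated = RAW                           # one random translation per raw image
noised = RAW                               # one noisy copy per raw image

total = RAW + flipped + rotated + scaled + translated + noised
print(rotated, total)                      # 9240 13090
```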

3. Methodologies

In this research, an anchor-free framework, named R2N-DETR, was developed for peach detection (Fig. 3). The R2N-DETR consists of four parts: backbone, encoder, decoder, and prediction heads. (1) Res2Net-50, which has a strong multi-scale representation ability, was used as the backbone to learn a 2D convolutional feature map from an input image. (2) The feature map was flattened, supplemented with a positional encoding, and then fed to an encoder to obtain a hidden map containing the positions of the objects. (3) A small fixed number of object queries and the output of the encoder were passed to a decoder to obtain the same fixed number of output embeddings. (4) In the prediction heads, each output embedding was fed to a shared feed forward network (FFN) to predict either a detection (peach and bounding box) or “no peach”.

Figure 3.

Structure of the R2N-DETR.

Figure 4.

Illustration of the Res2Net-50.

3.1 Backbone convolutional network

The DETR framework has limited ability to detect multi-size objects, especially small ones, because the feature map output by the backbone has low spatial resolution and contains only single-size features. However, on the one hand, a high-resolution feature map may make the detection model impractical, because the computing resources used by the encoder and decoder increase with the square of the spatial size of the feature map. On the other hand, each layer of the backbone can perceive information from an area of only one size on the input map. To address this problem, the R2N-DETR integrated the first 7 $\times$ 7 convolutional operation and Layers 1 $\sim$ 4 of Res2Net-50 (Fig. 4a) to construct a backbone that generates a feature map $O_{\textit{map}}\in\mathbb{R}^{24\times 16\times 2048}$ from the input image $I_{\text{img}}\in\mathbb{R}^{768\times 512\times 3}$ (with 3 color channels). Owing to the Res2Net module (Fig. 4b), the map can fuse multi-scale convolutional features while keeping a low resolution. Specifically, the number of channels of an input map $X\in\mathbb{R}^{W\times H\times C}$ was first adjusted to an integer multiple of the scale dimension $s$ through a 1 $\times$ 1 convolution. The adjusted map $X\in\mathbb{R}^{W\times H\times C_{1}}$ was then evenly split along the channel dimension into $s$ feature subsets, denoted as $x_{i}$ , where each feature subset has the same spatial size and $C_{1}/s$ channels. Each feature subset except $x_{1}$ was followed by a 3 $\times$ 3 convolution, denoted as ${\bm{K}}_{i}\left(\cdot\right)$ . The output of ${\bm{K}}_{i}\left(\cdot\right)$ was denoted as $y_{i}$ , which can be calculated by Eq. (1).

$\displaystyle y_{i}=\left\{\begin{array}{ll}x_{i}, & i=1;\\ {\bm{K}}_{i}\left(x_{i}\right), & i=2;\\ {\bm{K}}_{i}\left(x_{i}+y_{i-1}\right), & 2<i\leqslant s.\end{array}\right.$ (1)

Note that each ${\bm{K}}_{i}\left(\cdot\right)$ could indirectly receive information from all feature subsets before $x_{i}$ , denoted as $x_{j}$ , where $1<j<i$ . Before being processed by ${\bm{K}}_{i}\left(\cdot\right)$ , $x_{j}$ obtains a larger receptive field through one or more convolutions. Due to the combinatorial explosion effect, the $s$ outputs $y_{i}$ contain multiple different combinations of receptive fields of varied sizes. The information in these receptive fields was then fused through a 1 $\times$ 1 convolution, which allows the Res2Net module to complete the extraction of multi-scale features. In short, compared with an ordinary convolution, the output of the Res2Net module encodes convolutional features at more scales.
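The way receptive fields accumulate across the feature subsets can be traced with a small bookkeeping sketch. It does not perform any convolution: it only records which subsets feed each output $y_{i}$ and how large the resulting receptive field is, assuming stride-1 3 $\times$ 3 convolutions, the recursion of Eq. (1) applied to all $s$ subsets, and $s=4$ as listed in Table 2. The function name `res2net_split_trace` is hypothetical.

```python
def res2net_split_trace(s=4, kernel=3):
    """Trace the sources and receptive field of each output y_1..y_s.

    Each subset x_i starts with a 1x1 receptive field; a k x k convolution
    enlarges the receptive field of its input by (k - 1).
    Returns a list of (source_subset_indices, receptive_field).
    """
    outputs = []
    for i in range(1, s + 1):
        if i == 1:
            sources, rf = {1}, 1                   # y_1 = x_1, no convolution
        elif i == 2:
            sources, rf = {2}, 1 + (kernel - 1)    # y_2 = K_2(x_2)
        else:
            prev_sources, prev_rf = outputs[-1]
            sources = prev_sources | {i}           # y_i = K_i(x_i + y_{i-1})
            rf = max(1, prev_rf) + (kernel - 1)
        outputs.append((sources, rf))
    return outputs


for i, (sources, rf) in enumerate(res2net_split_trace(), start=1):
    print(f"y_{i}: from x_{sorted(sources)}, receptive field {rf}x{rf}")
```

Under these assumptions the four outputs carry receptive fields of 1, 3, 5, and 7 pixels, which is the multi-scale mixture that the final 1 $\times$ 1 convolution fuses.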

Figure 5.

Illustration of the transformer encoder.

3.2 Encoder

The encoder consisted of one 1 $\times$ 1 convolution, one flatten layer, and six standard transformer encoders (each including an 8-head self-attention module [26], a feed forward network (FFN), and two ‘skip connections and layer normalization’). The 1 $\times$ 1 convolution reduced the channel number of ${\bm{O}}_{\textit{map}}$ from 2048 to 256 to create a new feature map $Z_{\textit{map}}\in\mathbb{R}^{24\times 16\times 256}$ . By flattening the spatial dimensions so that all channels of a pixel form one row vector, ${\bm{Z}}_{\textit{map}}$ was transformed into a two-dimensional sequence $M_{\textit{map}}\in\mathbb{R}^{384\times 256}$ . A positional encoding $P\in\mathbb{R}^{384\times 256}$ initialized with cosine and sine functions was added to $M_{\textit{map}}$ , and the sum was fed to the transformer encoder. Finally, the feature map $E\in\mathbb{R}^{384\times 256}$ with added attention at each position was output. The self-attention mechanism performed a global analysis on the feature map and could extract the correlation between different positions and different objects. Since the feature map output by the backbone contained multi-scale features, this correlation involved objects of multiple sizes, including small-size objects. Because the transformer encoder is permutation-invariant, a fixed positional encoding was added to the input of each attention layer to supply position information [27].
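The tensor shapes quoted above, and the quadratic growth of the attention cost with feature-map size noted in Section 3.1, can be checked with simple shape bookkeeping. The sketch below assumes a total backbone downsampling factor of 32 (inferred from 768/24 and 512/16); the function name `encoder_shapes` is hypothetical.

```python
def encoder_shapes(width=768, height=512, stride=32, d_model=256, heads=8):
    """Shape bookkeeping for the encoder input (no tensors are created)."""
    fw, fh = width // stride, height // stride   # backbone output map
    tokens = fw * fh                             # flattened sequence length
    return {
        "feature_map": (fw, fh),
        "sequence": (tokens, d_model),
        "dims_per_head": d_model // heads,
        # one attention weight per (query position, key position) pair, per head
        "attention_weights_per_head": tokens * tokens,
    }


base = encoder_shapes()
finer = encoder_shapes(stride=16)                # hypothetical 2x finer map
print(base["feature_map"], base["sequence"])     # (24, 16) (384, 256)
print(finer["attention_weights_per_head"] // base["attention_weights_per_head"])  # 16
```

Doubling the resolution of the feature map in each direction multiplies the number of attention weights by 16, which illustrates why the backbone keeps the map at low resolution.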

Figure 6.

Illustration of the transformer decoder.

3.3 Decoder and prediction heads

Following the DETR, the decoder of the R2N-DETR used the standard transformer structure (Fig. 6). The R2N-DETR preset 100 tokens (object queries) for peach detection. Because the transformer decoder is permutation-invariant [21], the 100 tokens must differ from one another to detect peaches at different locations. The tokens were fed into the first 8-head self-attention module to generate the attention between tokens. Then, the feature map ${\bm{E}}$ , the positional encoding ${\bm{P}}$ , and the tokens were processed by the second 8-head attention module and the FFN. In this procedure, the correlation between each token and each position of the feature map was determined. Further, using the context of the whole image, the pair-wise relations between all objects and tokens, contained in the output embeddings, were established. As discussed above, the decoder could decode the 100 object queries in parallel. After the output embeddings were passed into two shared FFNs (one classification FFN and one prediction FFN), the classes and locations of the 100 objects were obtained.

4. Results and discussion

4.1 Network training

The R2N-DETR model had the same training characteristics as the DETR model: both were trained end-to-end with a linear combination of a negative log-likelihood for class prediction and a bounding box loss [21]. The parameters and resources used to train the R2N-DETR model are summarized in Table 2. The popular AdamW optimizer was used to update the network weights in training iterations [28]. Note that the initial learning rates of the backbone and transformers were set to 1e-5 and 1e-4, respectively, and both were dropped every 100 epochs. A dropout probability of 0.1 was used in the FFNs of all transformer encoders and decoders. Additionally, each self-attention module contained 8 heads.

Table 2
Parameters and resources used in training models

Parameters	Value or model	Resource	Value or version
Batch size	4	Operating system	Ubuntu 18.04
Optimizer	AdamW	Backend	PyTorch 1.8.0
Epoch	1000	Python	3.8
Learning rate drop	100	GPU	GeForce GTX 1080Ti 11GB
Weight decay	1.00E-04	CUDA cores	3584
Backbone	Res2Net-50	Memory	32 GB
Backbone learning rate	1.00E-05	CUDA	9.1
Scale dimension ( $s$ )	4	CUDNN	7.5
Encoder and decoder	Transformer
Encoder and decoder learning rate	1.00E-05
Dropout probability	0.1

In this research, the weights of the R2N-DETR were initialized by transfer learning. First, the backbone weights were taken from the ImageNet-pretrained Res2Net-50 model with the global average pooling layer and fully connected layer removed. Second, the weights of the DETR pre-trained on the COCO dataset were used as the initial weights of the proposed encoder-decoder model. The R2N-DETR model was then fine-tuned on our training dataset for 1000 epochs. For the competitors, DETR, Faster R-CNN, and YOLO-V4, the hyperparameters and initial weights were set according to the studies by Carion et al. [21], Fu et al. [7], and Mirhaji et al. [18], respectively.

After training, all peach detection models were evaluated using Precision ( $P$ , Eq. (2)), Recall ( $R$ , Eq. (3)), Average Precision (AP, Eq. (4)), and detection speed.

$\displaystyle P=\textit{TP}/(\textit{TP}+\textit{FP})$ (2)

$\displaystyle R=\textit{TP}/(\textit{TP}+\textit{FN})$ (3)

$\displaystyle\textit{AP}=\int_{0}^{1}P(R)\,\mathrm{d}R$ (4)

where TP, FP, and FN correspond to true positives (peach objects detected correctly), false positives (objects incorrectly detected as peaches), and false negatives (peach objects that were missed), respectively. Precision measures the model’s ability to correctly detect peach objects. Recall represents the model’s ability to find all peach objects in the image. The AP, which combines $P$ and $R$ , was used to evaluate the sensitivity of the detection model to peaches and reflects the global performance of the model. The larger the AP value, the better the performance of the detection model.
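As an illustration of Eq. (4), the integral can be approximated from a ranked list of detections. The sketch below is one common way to do this and is not necessarily the evaluation code used in this study: it assumes that each detection has already been marked as a true or false positive, and it makes precision non-increasing in recall before summing (an all-point interpolation, which is an assumption here). The function name `average_precision` and the toy detections are hypothetical.

```python
def average_precision(scored_hits, num_ground_truth):
    """Approximate AP = integral of P(R) dR from ranked detections.

    scored_hits      : list of (confidence, is_true_positive) pairs
    num_ground_truth : number of labeled objects (TP + FN)
    """
    ranked = sorted(scored_hits, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    points = []                                  # (recall, precision) per detection
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        points.append((tp / num_ground_truth, tp / (tp + fp)))
    # Replace each precision by the best precision at any higher recall.
    for k in range(len(points) - 2, -1, -1):
        recall, precision = points[k]
        points[k] = (recall, max(precision, points[k + 1][1]))
    # Sum precision over each increase in recall.
    ap, prev_recall = 0.0, 0.0
    for recall, precision in points:
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


toy = [(0.9, True), (0.8, False), (0.7, True)]   # two labeled peaches, three detections
print(round(average_precision(toy, num_ground_truth=2), 3))  # 0.833
```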

4.2 Effectiveness of improved backbone

The feature maps extracted by ResNet-50 (the backbone of DETR) and Res2Net-50 (the backbone of R2N-DETR) are visualized in Fig. 7. The ‘7 $\times$ 7 Conv’ column shows the feature map output by the first convolution of the two backbones for the input image. Both backbones could provide a feature map with rich semantic information for the encoder-decoder. In the ‘ResNet-50’ row, from layer 1 to layer 4, the spatial information of the objects (peaches) became coarser while the abstract semantic information became richer. Although the two low-level feature maps (layers 1 and 2) in the ‘ResNet-50’ row contained minor spatial information, the spatial information in the two high-level feature maps (layers 3 and 4) disappeared (i.e., the locations of the white pixels on the two feature maps hardly reflected the locations of the peaches in the input RGB-D image). Additionally, ResNet-50 provided only single-scale features for the subsequent encoder, making it difficult for DETR to detect small-size peaches. In contrast, the two low-level feature maps output by Res2Net-50 contained more detailed spatial information, because the Res2Net module fused the original input and the feature maps with three receptive field sizes (Fig. 4b), addressing the problem of spatial information loss caused by downsampling. Meanwhile, due to the cascade of convolutions within the Res2Net module, high-level semantic information was also captured by the Res2Net modules in layers 1 and 2. Furthermore, the high-level feature maps contained not only rich semantic information but also spatial information that could locate the objects. Therefore, Res2Net could be used to improve the ability of DETR’s backbone to extract and fuse spatial (low-level) and semantic (high-level) features, and objective (1) was achieved.

Figure 7.

Visualized feature maps from ResNet-50 and Res2Net-50.

4.3 Comparison of the R2N-DETR model with other deep learning models

Overall, the proposed R2N-DETR model achieved a better AP with no more weights than the other popular detection models, i.e., YOLO-V4, Faster R-CNN, and DETR (Table 3). The R2N-DETR model achieved an AP of 0.944 on the test peach images, which was 8.9%, 3.6%, and 3.2% higher in relative terms than the APs obtained by YOLO-V4 (0.867), Faster R-CNN (0.911), and DETR (0.915), respectively.

Table 3
Performance comparison between the three popular networks and R2N-DETR

Models	AP	Detection time (ms)	Weights (M)
YOLO-V4	0.867	32	64
Faster R-CNN	0.911	110	128
DETR	0.915	51	41
R2N-DETR	0.944	53	41
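The percentage gains quoted for the R2N-DETR are relative improvements over each competitor’s AP rather than differences in AP points. The following short calculation reproduces them from the AP values in Table 3; the variable names are illustrative.

```python
ap = {"YOLO-V4": 0.867, "Faster R-CNN": 0.911, "DETR": 0.915, "R2N-DETR": 0.944}
ours = ap["R2N-DETR"]

# Relative gain of R2N-DETR over each competitor, in percent.
gains = {
    name: (ours - value) / value * 100
    for name, value in ap.items()
    if name != "R2N-DETR"
}
for name, gain in gains.items():
    print(f"{name}: +{gain:.1f}%")   # +8.9%, +3.6%, +3.2%
```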

Further, the one-stage model YOLO-V4 showed the lowest accuracy, for the following reasons. (1) YOLO-V4 mixed the box regression loss function and the object classification loss function, which increases the difficulty of learning the model weights [29]. (2) The YOLO-V4 model regressed and classified an anchor using the convolutional features of a single point on the deep feature map. A single point can supply only little information to the detection head, and the widespread noise in the orchard further reduces the recognizability of the point. These shortcomings made YOLO-V4 often misidentify peaches in the orchard [30]. As shown in the representative detection examples (Fig. 8), several peaches were missed by the YOLO-V4 model, such as a ‘Baifeng’ peach in a bright area and a ‘Hujin’ peach whose surface was mostly covered by leaves.

Figure 8.

Two examples of peach detection using the four deep learning models. The red, yellow, and purple bounding boxes denoted the truly detected, falsely detected, and missing peaches, respectively.

Unlike YOLO-V4, the Faster R-CNN, DETR, and R2N-DETR models set up separate branches for regressing object locations and classifying objects, so the weights in the two branches can be learned independently. This separation resolved the mutual interference between regression and classification and improved the accuracy of the detection models [31]. In addition, the Faster R-CNN, DETR, and R2N-DETR models each used region-specific deep convolutional features (anchors for the Faster R-CNN model, queries for the DETR and R2N-DETR models) to regress and classify objects. Region-level features were both more expressive and more robust than point-level features [30].

In terms of detection accuracy, the Faster R-CNN model was lower than the DETR and R2N-DETR models (Table 3). With the Faster R-CNN model, some areas were falsely identified as peaches; additionally, a ‘Baifeng’ peach was detected as two objects due to interference from the leaves (Fig. 8). The reasons are that (1) the Faster R-CNN model used only local features, which lose information, to detect peaches and was therefore prone to low precision and recall; and (2) during training, the dense anchors caused a serious imbalance between positive and negative samples. The encoder and decoder of the DETR and R2N-DETR models were constructed from transformers, so the two models could extract long-range visual dependencies from deep feature maps, enriching the available information. Even if a region lost information or contained a great deal of noise, its properties could be determined from information in other regions of the feature map [32]. In contrast to the Faster R-CNN model, the DETR and R2N-DETR models set only 100 object queries (similar to anchors) for each detection, and each query was matched to a ground truth or to background through the Hungarian algorithm, alleviating the impact of the imbalance [21].
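The one-to-one matching between queries and ground truths can be illustrated on a toy example. The sketch below finds the minimum-cost assignment by brute-force enumeration, which gives the same result as the Hungarian algorithm on small inputs but is not the Hungarian algorithm itself and would be far too slow for 100 queries. The function name `best_assignment` and the cost values are hypothetical; in DETR-style training the cost combines class probability and box terms.

```python
from itertools import permutations


def best_assignment(cost):
    """Minimum-cost one-to-one matching of queries to ground-truth boxes.

    cost[q][g] is the cost of assigning query q to ground truth g, with at
    least as many queries as ground truths. Returns (assignment, total),
    where assignment[q] is a ground-truth index or None (background).
    """
    num_queries, num_gt = len(cost), len(cost[0])
    best, best_total = None, float("inf")
    # Try every way of picking one distinct query per ground truth.
    for queries in permutations(range(num_queries), num_gt):
        total = sum(cost[q][g] for g, q in enumerate(queries))
        if total < best_total:
            best, best_total = queries, total
    matched = {q: g for g, q in enumerate(best)}
    # Queries left unmatched are supervised as background ("no peach").
    return [matched.get(q) for q in range(num_queries)], best_total


toy_cost = [
    [0.9, 0.2],   # query 0
    [0.1, 0.8],   # query 1
    [0.5, 0.6],   # query 2
]
assignment, total = best_assignment(toy_cost)
print(assignment)   # [1, 0, None]
```

Because each ground truth receives exactly one query and every other query is labeled as background, no dense set of near-duplicate positives or post-hoc NMS step is needed.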

From Fig. 8, we observed that although the DETR could detect some small-size peaches, it also detected leaves as false positives, because ResNet-50 could not fuse low- and high-level features in the extracted feature maps, as described in Section 3.1. The R2N-DETR model used Res2Net-50 and self-attention to address the problems of feature fusion and the use of global context, respectively, which made it outperform DETR.

Regarding the weight size and detection time of the four peach detection models, the YOLO-V4 model contained 64 M weights and took 32 ms on average to process an image. It has the lowest requirements for hardware devices. However, this model has low detection accuracy, which makes it difficult to apply in the field. Although the Faster R-CNN model has good detection accuracy and acceptable value for field application, it contained 128 M weights and thus required substantial computing resources, and it took 110 ms to detect each image, resulting in insufficient real-time performance [7]. Both the DETR model and the R2N-DETR model contained 41 M weights and could be executed on one GPU unit (GeForce GTX 1080Ti 11 GB), taking 51 ms and 53 ms, respectively, to detect an image. Compared with Faster R-CNN, the two models have lower hardware requirements, faster detection speed, and higher detection accuracy, so they are better suited to use in the field. Furthermore, considering real-time performance and detection accuracy together [33], the R2N-DETR model appears to be the most suitable of the four models for field applications. This research thus achieved the preset objective (2).

4.4 Detection performance of the R2N-DETR model on multi-size peaches

In general, according to the pixel area of its ground truth box, each object in the images was defined as a small object (area $\leqslant$ $32^{2}$ ) or a medium object ( $32^{2}$ $<$ area $\leqslant$ $96^{2}$ ) [34]. The RGB-D images used in this research involved 4523 small peaches and 2278 medium peaches. Table 4 shows the detection results for the two sizes of peaches using the R2N-DETR and DETR models, and a visual detection result for an input image is shown in Fig. 9. Both models detected medium peaches with P and R values exceeding 0.90. However, the DETR model obtained a P of 0.817 and an R of 0.909 on small peaches, which were lower than those on medium peaches. Peaches that are severely occluded and far from the imaging sensor were easily missed by DETR (Fig. 9a). Low detection performance on small objects was also reported in [21, 34, 35]. The R2N-DETR model outperformed the DETR model in detecting small peaches, achieving a P of 0.852 and an R of 0.948. Even peaches that occupy only a few pixels in the input image can be accurately detected by R2N-DETR (Fig. 9b). The results demonstrated that the proposed R2N-DETR with the incorporation of Res2Net-50 improved the ability to detect multi-size peaches, especially small peaches, and objective (3) of this research was achieved.

Table 4
Performance comparison between the DETR and R2N-DETR model on small and medium object subsets

Dataset	Total # of labeled peaches	DETR TP	DETR FP	DETR $P$	DETR $R$	R2N-DETR TP	R2N-DETR FP	R2N-DETR $P$	R2N-DETR $R$
Small	1897	1725	387	0.817	0.909	1798	313	0.852	0.948
Medium	819	798	86	0.903	0.974	801	84	0.905	0.978
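
As a consistency check, the $P$ and $R$ values in Table 4 can be recomputed from the TP and FP counts, assuming the usual definitions $P = \mathrm{TP} / (\mathrm{TP} + \mathrm{FP})$ and $R = \mathrm{TP} /$ (total number of labeled peaches). The sketch below is illustrative only; the helper names `precision` and `recall` and the dictionary layout are hypothetical.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of predicted boxes that match a labeled peach."""
    return tp / (tp + fp)


def recall(tp: int, total_labeled: int) -> float:
    """Fraction of labeled peaches that were detected."""
    return tp / total_labeled


# (total labeled, TP, FP) copied from Table 4.
table4 = {
    ("Small", "DETR"): (1897, 1725, 387),
    ("Small", "R2N-DETR"): (1897, 1798, 313),
    ("Medium", "DETR"): (819, 798, 86),
    ("Medium", "R2N-DETR"): (819, 801, 84),
}

# Recomputed (P, R) pairs, rounded to three decimals as in the table.
metrics = {
    key: (round(precision(tp, fp), 3), round(recall(tp, total), 3))
    for key, (total, tp, fp) in table4.items()
}

for (subset, model), (p, r) in metrics.items():
    print(f"{subset:6s} {model:8s} P={p:.3f} R={r:.3f}")
```

Under these definitions the recomputed values agree with the tabulated ones to three decimals, including the recall of 0.948 for R2N-DETR on small peaches.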

Figure 9.

Small and medium object detection examples using the DETR and R2N-DETR models. The red, yellow, and purple bounding boxes denote medium, small, and missed objects, respectively.

4.5 Comparison of R2N-DETR with state-of-the-arts

In a previous study using an improved Faster R-CNN for small fruit detection, Mai et al. [36] reported a P of 0.6743 and an R of 0.8522, both lower than the P of 0.852 and R of 0.948 obtained by the R2N-DETR model for small peaches. Chu et al. [16] developed a suppression Mask R-CNN model to detect apples on trees and reported a P of 0.88 and an R of 0.93; the former is higher than the R2N-DETR model's P of 0.87, whereas the latter is lower than the R2N-DETR model's R. Using an improved YOLOv3-tiny model (named DY3TNet) for real-time kiwifruit detection in orchards, Fu et al. [24] reported an AP of 0.9005, relative to which the AP of 0.944 obtained in this research is 4.83% higher. Because the datasets differ, this study also retrained DY3TNet on the peach training dataset; the retrained model achieved an AP of 0.873, which is also lower than that of the R2N-DETR model. Wan et al. [37] reported AP values of 0.9251, 0.8894, and 0.9073 for apple, mango, and orange detection using an improved Faster R-CNN, all lower than the AP of 0.944 obtained by the R2N-DETR model. However, caution is needed when comparing results across studies, as factors such as fruit type, image data type, and fruit size in the image could all have affected the detection results. Nevertheless, this study demonstrated the viability of combining the Kinect-V2 camera with R2N-DETR to detect multi-size peaches on trees.
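
The 4.83% figure quoted for the comparison with DY3TNet appears to be a relative difference rather than a difference in AP points. The short sketch below, with the hypothetical helper `ap_gap`, computes both quantities from the AP values given in the text so that the two readings can be distinguished.

```python
# AP values as stated in the text.
ap_r2n_detr = 0.944
ap_dy3tnet_reported = 0.9005   # kiwifruit dataset of Fu et al. [24]
ap_dy3tnet_retrained = 0.873   # retrained on the peach training dataset


def ap_gap(ap_ours: float, ap_other: float) -> tuple[float, float]:
    """Return (absolute AP difference, relative difference in percent)."""
    absolute = ap_ours - ap_other
    relative_percent = 100.0 * absolute / ap_other
    return absolute, relative_percent


abs_reported, rel_reported = ap_gap(ap_r2n_detr, ap_dy3tnet_reported)
abs_retrained, rel_retrained = ap_gap(ap_r2n_detr, ap_dy3tnet_retrained)

print(f"vs reported DY3TNet:  {abs_reported:.4f} AP, {rel_reported:.2f}% relative")
print(f"vs retrained DY3TNet: {abs_retrained:.4f} AP, {rel_retrained:.2f}% relative")
```

The absolute gap to the reported DY3TNet result is 0.0435 AP, which corresponds to about 4.83% of 0.9005; the gap to the retrained model, evaluated on the same peach data, is larger at 0.071 AP.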

There is still room for the R2N-DETR model to improve peach detection in orchards. Specifically: (1) On the construction of the training dataset: R2N-DETR is a data-hungry deep learning model that requires a large amount of image data for training. To meet this requirement, this study applied five data augmentation techniques to expand the image dataset. However, these techniques inevitably introduced noise and ambiguity into the training process, reducing the accuracy of the detection model. Hence, a data augmentation technique that always keeps important regions untouched during augmentation could improve the performance of the R2N-DETR model [38]. (2) On the size of the model: Both the encoder and decoder of R2N-DETR consist of six standard transformer modules with fully connected layers, so the network has a large number of weights to learn and is difficult to train. A promising solution is to construct the encoder and decoder of R2N-DETR from two lightweight transformers with recursive atrous self-attention [39]. The improved encoder and decoder would use fewer weights to achieve functions similar to those of the original encoder and decoder. (3) On small peach detection: Although the R2N-DETR model performed well on small peaches, the detection accuracy for small objects could be further improved if the encoder and decoder of the R2N-DETR model were replaced by a dynamic encoder and a dynamic decoder, respectively, and a typical feature pyramid were used to connect the backbone with the encoder [40]. We will work on these improvements in the future.

5. Conclusion

This research examined an improved DETR, called R2N-DETR, for multi-size peach detection using RGB and depth images collected by a Kinect-V2 sensor in open orchards, which could help improve robotic fruit picking by providing reliable detections of small- and medium-size fruits. In the R2N-DETR network, Res2Net-50 extracted multi-scale convolutional features from an input image, and the transformer module improved the model's ability to capture long-range visual dependencies. Both components contributed significantly to reliable peach detection in orchards. In addition, the R2N-DETR model contained 41 M weights and had a detection speed of 53 ms/image, which could support real-time field applications. Overall, R2N-DETR can detect multi-size peaches in orchards efficiently and effectively, and the Kinect-V2 sensor combined with the R2N-DETR model could be suitable for peach-picking robots. Further work will focus on exploring methods for fusing RGB and depth images, as well as 3D point cloud object detection for recognizing peaches in orchards.

Acknowledgments

This project was supported by the National Natural Science Foundation of China (Grant no. 62273166; 61772240), the 111 Project (B12018), and the Natural Sciences and Engineering Research Council of Canada Discovery Grants program (Grant no. G256643). The authors thank the China Scholarship Council (CSC) for the financial support provided to the author Yu Yang to conduct his doctoral research in the Department of Bioresource Engineering at McGill University. The authors also gratefully thank Meng Zhang and Liang Wang for their help with data collection in the Yangshan Town orchard and the Hongsha Bay orchard, Wuxi City, China.

References

1. and Wang, Genetic resources, breeding programs in China, and gene mining of peach: A review, Horticultural Plant Journal 6 (2020), 205–215. doi: 10.1016/j.hpj.2020.06.001.

2. Saedi, S.I. and Khosravi, A deep neural network approach towards real-time on-branch fruit recognition for precision horticulture, Expert Systems with Applications 159 (2020), 113594. doi: 10.1016/j.eswa.2020.113594.

3. Wang and Yang, Water quality monitoring and evaluation using remote sensing techniques in China: A systematic review, Ecosystem Health and Sustainability 5 (2019), 47–56. doi: 10.1080/20964129.2019.1571443.

4. Kapach, Barnea, Mairon, Edan and Ben-Shahar, Computer vision for fruit harvesting robots–state of the art and challenges ahead, International Journal of Computational Vision and Robotics 3 (2012), 4–34. doi: 10.1504/IJCVR..

5. Wang, Chen, Zhang and Zhang, Applications of machine vision in agricultural robot navigation: A review, Computers and Electronics in Agriculture 198 (2022), 107085. doi: 10.1016/j.compag.2022.107085.

6. Wang, Liu, Chen, Huang and Han, A band selection approach based on a modified gray wolf optimizer and weight updating of bands for hyperspectral image, Applied Soft Computing 112 (2021), 107805. doi: 10.1016/j.asoc.2021.107805.

7. Majeed, Zhang, Karkee and Zhang, Faster R-CNN-based apple detection in dense-foliage fruiting-wall trees using RGB and depth features for robotic harvesting, Biosystems Engineering 197 (2020), 245–256. doi: 10.1016/j.biosystemseng.2020.07.007.

8. Barnea, Mairon and Ben-Shahar, Colour-agnostic shape-based 3D fruit detection for crop harvesting robots, Biosystems Engineering 146 (2016), 57–70. doi: 10.1016/j.biosystemseng.2016.01.013.

9. Zou, Shi, Guo and Ye, Object detection in 20 years: A survey, arXiv preprint arXiv:1905.05055, 2019. doi: 10.48550/arXiv.1905.05055.

10. Zhao, Z.-Q., Zheng, S.-T. and Wu, Object detection with deep learning: A review, IEEE Transactions on Neural Networks and Learning Systems 30 (2019), 3212–3232. doi: 10.1109/TNNLS.2018.2876865.

11. Diwan, Anirudh and Tembhurne, J.V., Object detection using YOLO: challenges, architectural successors, datasets and applications, Multimedia Tools and Applications (2022), 1–33. doi: 10.1007/s11042-022-13644-y.

12. X.-T. and Jo, K.-H., A review on anchor assignment and sampling heuristics in deep learning-based object detection, Neurocomputing 506 (2022), 96–116. doi: 10.1016/j.neucom.2022.07.003.

13. Xue, Zheng, Wan and Mao, Detection of passion fruits and maturity classification using Red-Green-Blue Depth images, Biosystems Engineering 175 (2018), 156–167. doi: 10.1016/j.biosystemseng.2018.09.004.

14. Zhu, Huang and Guo, Using color and 3D geometry features to segment fruit point cloud and improve fruit recognition accuracy, Computers and Electronics in Agriculture 174 (2020), 105475. doi: 10.1016/j.compag.2020.105475.

15. Song, Liu and Cui, Kiwifruit detection in field images using Faster R-CNN with VGG16, IFAC-PapersOnLine 52 (2019), 76–81. doi: 10.1016/j.ifacol.2019.12.500.

16. Chu, Lammers and Liu, Deep learning-based apple detection using a suppression mask R-CNN, Pattern Recognition Letters 147 (2021), 206–211. doi: 10.1016/j.patrec.2021.04.022.

17. Tian, Yang, Wang and Liang, Apple detection during different growth stages in orchards using the improved YOLO-V3 model, Computers and Electronics in Agriculture 157 (2019), 417–426. doi: 10.1016/j.compag.2019.01.012.

18. Mirhaji, Soleymani, Asakereh and Mehdizadeh, S.A., Fruit detection and load estimation of an orange orchard using the YOLO models through simple approaches in different imaging and illumination conditions, Computers and Electronics in Agriculture 191 (2021), 106533. doi: 10.1016/j.compag.2021.106533.

19. Zheng, Gao, Zhang, Wang and Dong, End-to-end object detection with adaptive clustering transformer, arXiv preprint arXiv:2011.09315, 2020. doi: 10.48550/arXiv.2011.09315.

20. Heo, Yun, Han, Chun, Choe and Oh, S.J., Rethinking spatial dimensions of vision transformers, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 11936–11945.

21. Carion, Massa, Synnaeve, Usunier, Kirillov and Zagoruyko, End-to-end object detection with transformers, in: European Conference on Computer Vision, Springer, 2020, pp. 213–229.

22. Lin, T.-Y., Maire, Belongie, Hays, Perona, Ramanan, Dollár and Zitnick, C.L., Microsoft COCO: Common objects in context, in: European Conference on Computer Vision, Springer, 2014, pp. 740–755.

23. Zhu, Wang and Dai, Deformable DETR: Deformable transformers for end-to-end object detection, arXiv preprint arXiv:2010.04159, 2020. doi: 10.48550/arXiv.2010.04159.

24. Fu, Feng, Liu, Gao, Majeed, Al-Mallahi, Zhang and Cui, Fast and accurate detection of kiwifruit in orchard using improved YOLOv3-tiny model, Precision Agriculture 22 (2021), 754–776. doi: 10.1007/.

25. Gao, S.-H., Cheng, M.-M., Zhao, Zhang, X.-Y., Yang, M.-H. and Torr, Res2Net: A new multi-scale backbone architecture, IEEE Transactions on Pattern Analysis and Machine Intelligence 43 (2019), 652–662. doi: 10.1109/TPAMI.2019.2938758.

26. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, A.N., Kaiser, Ł. and Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017), 5998–6008.

27. Bello, Zoph, Vaswani, Shlens and Le, Q.V., Attention augmented convolutional networks, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 3286–3295.

28. Loshchilov and Hutter, Decoupled weight decay regularization, arXiv preprint arXiv:1711.05101, 2017. doi: 10.48550/arXiv.1711.05101.

29. Roy, A.M., Bose and Bhaduri, A fast accurate fine-grain object detection model based on YOLOv4 deep neural network, Neural Computing and Applications 34 (2022), 3895–3921. doi: 10.1007/s00521-021-06651-x.

30. Qiu, Liu and Sun, BorderDet: Border feature for dense object detection, in: European Conference on Computer Vision, Springer, 2020, pp. 549–564.

31. Song, Liu and Wang, Revisiting the sibling head in object detector, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11563–11572.

32. Yang, Zhang, Dai, Xiao, Yuan and Gao, Focal attention for long-range interactions in vision transformers, Advances in Neural Information Processing Systems 34 (2021), 30008–30022.

33. Zhang, Liu, Chen and Ding, Application of deep learning algorithms in geotechnical engineering: a short critical review, Artificial Intelligence Review 54 (2021), 5633–5673. doi: 10.1007/s10462-021-09967-1.

34. Zheng, Chen, Pang, Yang, Chen and Xue, A mango picking vision algorithm on instance segmentation and key point detection from RGB images in an open orchard, Biosystems Engineering 206 (2021), 32–54. doi: 10.1016/j.biosystemseng.2021.03.012.

35. Bochkovskiy, Wang, C.-Y. and Liao, H.-Y.M., YOLOv4: Optimal speed and accuracy of object detection, arXiv preprint arXiv:2004.10934, 2020. doi: 10.48550/arXiv.2004.10934.

36. Mai, Zhang, Jia and Meng, M.Q.-H., Faster R-CNN with classifier fusion for automatic detection of small fruits, IEEE Transactions on Automation Science and Engineering 17 (2020), 1555–1569. doi: 10.1109/TASE.2020.2964289.

37. Wan and Goudos, Faster R-CNN for multi-class fruit detection using a robotic vision system, Computer Networks 168 (2020), 107036. doi: 10.1016/j.comnet.2019.107036.

38. Gong, Wang, Chandra and Liu, KeepAugment: A simple information-preserving data augmentation approach, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 1055–1064.

39. Yang, Wang, Zhang, Wei, Lin and Yuille, Lite vision transformer with enhanced self-attention, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11998–12008.

40. Dai, Chen, Yang, Zhang, Yuan and Zhang, Dynamic DETR: End-to-end object detection with dynamic attention, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2988–2997.
