Domain adaptive fruit detection method based on multiple alignments

Abstract

Although deep learning based object detection methods have achieved high accuracy in fruit detection, they rely on large labeled datasets for training and assume that the training and test samples come from the same domain. This paper proposes a cross-domain fruit detection method that combines image-level and feature-level alignment. It first converts source domain images into the target domain with an attention-guided generative adversarial network to achieve image-level alignment. Then, knowledge distillation with a mean teacher model is integrated into the yolov5 network to achieve feature-level alignment between the source and target domains. A contextual aggregation module similar to a self-attention mechanism is added to the detection network to improve cross-domain feature learning by learning global features. One source domain dataset (orange) and two target domain datasets (tomato and apple) are used to evaluate the proposed method. The detection accuracies on the tomato and apple datasets are 87.2% and 89.9%, respectively, improvements of 10.3% and 2.4% over existing methods on the same datasets.

Keywords

Domain adaptation; Deep learning; Knowledge distillation; Fruit detection

1 Introduction

The global fresh fruit trade is a representation of modern globalization. Increasing fruit consumption is a vital component of the shift to a healthier and more sustainable diet [1]. Fruit yield estimation is important for fruit production and benefits orchard management and decision making regarding labor requirements, storage, transport, and marketing [2]. Object detection is an important step towards fruit yield estimation, since the detections can be counted to give the number of fruits in an image [3].

As a core problem in computer vision, object detection has been intensively studied with traditional image processing techniques and machine learning methods. Currently, two-stage [4] and one-stage [5–8] deep learning methods have been developed and applied in plant phenotyping, including fruit detection, achieving state-of-the-art performance. However, these supervised detection methods, whether two-stage or one-stage, rely on vast quantities of training samples and assume that the training and test samples are drawn from an identical distribution. A model pre-trained on a source domain may not perform well when applied to an unseen target domain with a different distribution, because of the domain shift between them.

To address the above problem, fine-tuning methods resort to annotation in the target dataset, whereas domain adaptation (DA) methods facilitate the transfer of learned knowledge from a pre-trained network in the source domain to the target domain, thereby eliminating the need for extensive annotation in the target domain dataset.

Under the generative adversarial network (GAN) framework, feature-level and image-level alignment approaches are normally employed to reduce the distribution difference between the source domain and the target domain. In plant phenotyping applications, relatively few DA methods have been proposed for object detection. Existing DA methods use CycleGAN [9] for image-to-image transformation [10] and contrastive unpaired translation (CUT) [11] for image patch-level transformation [12]. While these approaches successfully preserve the inherent details of the target itself, they may inadvertently lose background information in the images, which in turn leads to domain discrepancies. It is therefore desirable to perform the alignment on the local region of the object to be detected.

In this paper, we propose a DA method that fuses image-level and feature-level alignment for cross-domain fruit detection. First, an attention-guided GAN [13] is used to achieve image-level alignment from the source image to the target domain, which suppresses unimportant background features in the source domain and migrates the target features from the source domain to the target domain. The adopted attention-guided GAN helps the network to generate images with a more realistic background and a more detailed target during the transformation that migrates the labels of the source domain to the target domain.

A multiple alignments domain adaptive yolov5 (MADA-yolov5) detection method was then implemented, which integrates the yolov5s (6.0) [7] network and the knowledge distillation framework [14] with the mean teacher model approach [15] for feature-level alignment. The method learns the feature changes from the source domain to the target domain, so that the student model acquires features from the target domain for higher accuracy. A spatial context perception module [16] is employed to enable the network to filter out unimportant background information, thereby learning the characteristics of the target itself. Experimental results on several public datasets demonstrate the improvement in detection accuracy with the proposed method compared to existing methods.

2 Related work

2.1 Supervised object detection

Typical two-stage detection algorithms include Faster R-CNN [4], while one-stage methods include the yolo series [5–8], FCOS [17], etc. The region-based convolutional network [4] first generates many object proposal candidates with a region proposal module, and then classifies the proposals and refines their locations by regression. Faster R-CNN was used to detect fruit in [18], achieving high detection accuracy and demonstrating robustness to field imaging conditions. Combined with feature pyramid networks, Faster R-CNN was used for fruit detection in [19] with an improved loss function that includes a focal loss term. With a modified definition of intersection over union, Faster R-CNN was used to detect apple, orange, tomato, pomegranate and mango in [20]. Although these methods achieve accurate results, their two-stage strategy makes the processing complex and slow.

Unlike the two-stage methods above, the one-stage detector unifies the category confidence prediction and the bounding box regression in a single-shot manner. An improved yolov3 object detection structure was used in [21] for apple detection, in which densely connected convolutional networks were integrated to process feature layers with low resolution. Furthermore, in addition to integrating densely connected convolutional networks into yolov3 [5], a circular bounding box replaced the traditional rectangular bounding box in [22] for tomato localization; the Swish activation function and prior box optimization were used for the detection of oil palm fruits in [23]. Compared to two-stage methods, one-stage methods are clearly superior in inference speed.

2.2 Domain adaptation object detection

Domain adaptive detection methods can be broadly classified into feature-level and image-level alignment methods. In feature-level methods, invariant feature learning using an adversarial network is typically embedded in a basic detection model, such as Faster R-CNN [4]. The Faster R-CNN based gradient inversion module in [24] inverts the features of the source domain to those of the target domain. The region mining and alignment based approach in [25] demonstrates that the key to domain adaptation is local region attention that bridges domain gaps. The detector in these methods is trained to produce domain-invariant features that can deceive the domain discriminator of the GAN, which distinguishes between the source and target domains. A detector that generates domain-invariant features in this way may be very slow. In recent work, many domain adaptive detection methods have been proposed based on single-stage methods [14, 27].

In addition to the feature alignment approach, image-level alignment has been adopted to mitigate the domain gap for DA detection [28, 29]. The alignment is normally implemented with GAN based image-to-image translation [9]. For plant phenotyping applications, image-to-image translation was used for DA fruit detection and wheat detection in [10, 12], respectively. Zhang et al. [10] converted the source fruit images into target fruit images with CycleGAN networks [9], and then performed a pseudo-labelling process to automatically label the target fruit images. The self-learning method improved the labeling accuracy of the pseudo-labelling. CycleGAN [9] networks and clustering methods were used in [30] to generate domain-adaptive images that expand the diversity of wheat head datasets. CUT [11] was used in [12] to transfer sorghum labels to the wheat dataset. The source domain was transferred from a computer-synthesised 3D grape model to a real-world grape model in [31]. This method preserves the position and geometric information of the fruit well, but other objects such as leaves, sky and tree trunks are not restricted.

Image-to-image translation can reduce the domain gap between the source and target domains significantly. Such an image-level translation, however, is prone to mixing the object distribution with that of the background, which introduces noise in the background region and degrades the accuracy of the subsequent detection process. As a remedy, the detection result on the target domain, called the pseudo-label in [10], was refined through a self-learning approach. The self-learning used a cyclic update operation to fine-tune the detection model using the detection results at varied confidence thresholds, which is time-consuming. CUT [11], adopted in [12], went further and used patch-wise contrastive learning for image-to-image translation. The alignment is local but not context-aware. A post-processing pipeline was employed to reassign and validate the labels after image translation [12]. As a preparation step for the subsequent object detection, it is desirable that the transformation between images is context-aware and can differentiate the object from the background region. Context-aware instance-level alignment was realized in the adversarial learning embedded DA method [31], in which an additional network structure was employed for the local alignment. Our proposed MADA-yolov5 method combines feature and image alignments, which suppresses the influence of the background and enhances the weight of the target itself.

2.3 Attention mechanisms

The attention mechanism is integrated with yolov5 [7] for apple and dragon fruit detection in [32, 33], respectively. The attention mechanism is also combined with FCOS [17] for blurry green fruit detection in [34], where a residual feature pyramid network is used instead of the feature pyramid network in FCOS. In addition to fruit detection, yolov5 with integrated attention mechanism is also used for weed detection in [35, 36] for seedling detection and ear detection, respectively. In [26], a DA method based on an attention mechanism is proposed, and a self-attention method is introduced to make the target focus on the main region and reduce the effect of domain bias. Adding an attention mechanism to the network can bias the assignment of the most informative feature representations, while suppressing the less useful ones. In our method, a context-aware module is added to the detection network with a self-attention-like operation that assigns attention weights to the network using global features.

3 Materials and methods

3.1 Datasets

Two groups of datasets were adopted to validate the proposed domain adaptive detection method: image transformation datasets and object detection datasets. The image transformation dataset [9] consists of apple2orange and orange2tomato, which share the same orange subset. The details of the three subsets are listed in Table 1, in which the training sets were used to train our image transformation model between orange and tomato/apple and the (underscored) test sets were not used in this study. The object detection dataset consists of a source domain orange dataset $D_{S} : \{I_{S}^{k}, L_{S}^{k}\}_{k = 1}^{M}$ [10], where I and L represent the image and the corresponding label, respectively, and two target domain datasets: the apple [37] and tomato [38] datasets. The target domain dataset is denoted as $D_{T} : \{I_{T}^{k}, L_{T}^{k}\}_{k = 1}^{N}$. The object detection dataset (Table 2) was used to test our image transformation model as well as to train and test our DA object detection model.

Table 1
Image transformation dataset

Dataset	Training set	Test set	Size(pixels)
Orange	1019	248	256×256
Tomato	654	102	256×256
Apple	995	266	256×256

Table 2

Object detection dataset

Class	Dataset	Training set	Test set	Size(pixels)
Source	Orange	464	200	416×416
Target	Tomato	598	150	1920×1080
Target	Apple	331	82	719×898

3.2 Proposed domain adaptive detection method

The proposed method consists of the following steps: (1) training the attention-guided generative adversarial network using the training sets of the image transformation dataset (Table 1) to obtain a generator G that transforms orange images into apple or tomato style images; (2) feeding the source domain images $\{I_{S}^{k}\}_{k = 1}^{M}$, including the training and test sets of the object detection dataset, into G to produce the fake target domain images $\{I_{S^{'}}^{k}\}_{k = 1}^{M}$, and feeding the target domain images $\{I_{T}^{k}\}_{k = 1}^{N}$, including the training sets of the object detection dataset, into G to produce the fake source domain images $\{I_{T^{'}}^{k}\}_{k = 1}^{N}$; (3) training the MADA-yolov5 fruit detection model using the datasets $D_{S} : \{I_{S}^{k}, L_{S}^{k}\}_{k = 1}^{M}$, $D_{S^{'}} : \{I_{S^{'}}^{k}, L_{S}^{k}\}_{k = 1}^{M}$, $D_{T} : \{I_{T}^{k}\}_{k = 1}^{N}$ and $D_{T^{'}} : \{I_{T^{'}}^{k}\}_{k = 1}^{N}$; (4) inputting the target domain test set of the object detection dataset (Table 2) into the MADA-yolov5 model to obtain the fruit detection result. In step (3), knowledge distillation with a mean teacher model is integrated into the detection model. An overview of the proposed method is presented in Fig. 1.

Fig. 1

Framework of the proposed method. From left to right: the generative model G is trained using the image transformation dataset; the source domain dataset D_S and the target domain dataset D_T are input into G to obtain the fake target domain dataset $D_{S^{'}}$ and the fake source domain dataset $D_{T^{'}}$; the source domain and fake target domain datasets (with labels) and the target domain and fake source domain datasets (without labels) are input into MADA-yolov5 for training.

3.2.1 Image transformation

To reduce the domain shift between the source and target datasets, the generative model G is trained to transform source images into target domain images in an unsupervised learning framework. Instead of the large images in the object detection dataset (Table 2), the smaller images in the training set of the image transformation dataset (Table 1) are used to train the model G. We adopt the attention-guided GAN [13] to achieve the transformation, which adds two attention mechanisms to CycleGAN [9].

The network implements a transformation between two sets X and Y, denoting the training sets of the image transformation dataset listed in Table 1. As shown in Fig. 2, it consists of two attention-guided generator networks, G : X → Y and F : Y → X, and two discriminators, D_X and D_Y. With the same structure, the generators G and F both contain three subcomponents: a parameter-sharing encoder G_E/F_E, a content mask generator G_C/F_C, and an attention mask generator G_A/F_A. The feature information of image x ∈ X is extracted by G_E and shared between G_A and G_C. Following the feature extraction, G_A generates q attention masks with 4 output channels, including a background attention mask A^b and q - 1 foreground attention masks $\{A^{k}\}_{k = 1}^{q - 1}$; G_C generates q - 1 content masks $\{C^{k}\}_{k = 1}^{q - 1}$ with 3 output channels, which, together with the original input image x, make a total of q content masks.

Fig. 2

Attention-guided GAN model for image transformation, consisting of two generators G and F, and two discriminators D_Y and D_X.

The generated image G(x) is composed of the attention masks, the content masks and the input image x as:

$G(x) = \sum_{k = 1}^{q - 1} (C^{k} A^{k}) + x A^{b}$ (1)

Conversely, the generator F realizes the analogous transformation from Y to X.
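
The composition in Eq. (1) can be illustrated with a small sketch. It is not the authors' implementation: the function name `compose_output` and the toy values are hypothetical, and each mask is flattened to a list of per-pixel values for a single channel. The attention weights at each pixel are chosen to sum to 1, in line with the channel-wise Softmax mentioned for Fig. 3.

```python
def compose_output(x, content_masks, fg_attention, bg_attention):
    """Blend pixels as G(x) = sum_k C^k * A^k + x * A^b (Eq. 1).

    x and bg_attention are flat lists of per-pixel floats for one channel;
    content_masks and fg_attention each hold q - 1 such lists.
    """
    if len(content_masks) != len(fg_attention):
        raise ValueError("need one foreground attention mask per content mask")
    out = []
    for i, pixel in enumerate(x):
        value = pixel * bg_attention[i]  # background is copied from the input
        for c_k, a_k in zip(content_masks, fg_attention):
            value += c_k[i] * a_k[i]  # foreground comes from generated content
        out.append(value)
    return out


# Toy example (hypothetical values): 2 pixels, q = 3.
x = [0.2, 0.8]
content = [[1.0, 0.0], [0.5, 0.5]]
fg = [[0.6, 0.0], [0.1, 0.0]]
bg = [0.3, 1.0]  # with fg, the attention weights sum to 1 at every pixel
generated = compose_output(x, content, fg, bg)
```

In the second pixel the background attention is 1, so the input value is passed through unchanged, which is how the background of the source image is kept.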

To balance the transformation between the two domains, the mapping is regularized with a cycle consistency loss:

$L_{cycle}(G, F) = E_{x \sim X}[\| F(G(x)) - x \|_{1}] + E_{y \sim Y}[\| G(F(y)) - y \|_{1}]$ (2)

where F(G(x)) is the fake source domain image reconstructed from G(x) and G(F(y)) is the fake target domain image reconstructed from F(y). This regularization reduces the space of possible mappings between X and Y.

To match the distribution of the images generated from the source domain to that of the target domain, an adversarial loss is defined as:

$L_{GAN}(G, D_{Y}, X, Y) = E_{y \sim Y}[\log D_{Y}(y)] + E_{x \sim X}[\log(1 - D_{Y}(G(x)))]$ (3)

The generator G minimizes this loss so that the discriminator D_Y cannot distinguish between G(x) and y; the generated image G(x) therefore looks similar to images from Y. At the same time, D_Y maximizes the adversarial loss to remain a reliable discriminator. Likewise, an adversarial loss $L_{GAN}(F, D_{X})$ is defined for the generator F and the discriminator D_X, so that F generates images similar to those from X.
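
As a rough numerical illustration of Eq. (3), the sketch below estimates the adversarial loss from discriminator outputs by averaging over a batch. The function name, the small `eps` guard against log(0), and the example probabilities are assumptions made for illustration only.

```python
import math


def adversarial_loss(d_real, d_fake, eps=1e-12):
    """Batch estimate of E[log D_Y(y)] + E[log(1 - D_Y(G(x)))] (Eq. 3).

    d_real: discriminator outputs D_Y(y) in (0, 1) for real target images.
    d_fake: discriminator outputs D_Y(G(x)) in (0, 1) for generated images.
    """
    real_term = sum(math.log(p + eps) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p + eps) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# Hypothetical discriminator outputs.
confident = adversarial_loss([0.9, 0.95], [0.1, 0.05])  # D_Y rejects the fakes
fooled = adversarial_loss([0.9, 0.95], [0.6, 0.7])      # D_Y is partly fooled
```

The value is lower when the discriminator is fooled, which matches the roles in the text: G pushes the loss down while D_Y pushes it up.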

An identity preserving loss helps to preserve the color of the image and prevent color confusion. It is defined as:

$L_{id}(G, F) = E_{y \sim Y}[\| G(y) - y \|_{1}] + E_{x \sim X}[\| F(x) - x \|_{1}]$ (4)

Based on the above losses, the overall optimization objective of the image transformation is:

$L = L_{GAN} + \lambda_{cycle} L_{cycle} + \lambda_{id} L_{id}$ (5)

where $L_{GAN}$, $L_{cycle}$ and $L_{id}$ are the adversarial loss, cycle consistency loss, and identity preserving loss, respectively. $L_{GAN}$ is the sum of $L_{GAN}(G, D_{Y})$ and $L_{GAN}(F, D_{X})$. $\lambda_{cycle}$ and $\lambda_{id}$ are weight parameters.
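
The following sketch shows how the terms of Eqs. (2), (4) and (5) could be combined for one pair of images. It is illustrative only: the helper names are hypothetical, the L1 distance is normalised by the pixel count (an assumption, since the equations state the plain norm), and the weights 10 and 0.5 and the adversarial value are placeholders because the text does not report them.

```python
def l1_distance(a, b):
    """Mean absolute difference between two flat pixel lists."""
    return sum(abs(u - v) for u, v in zip(a, b)) / len(a)


def total_objective(l_gan, l_cycle, l_id, lambda_cycle, lambda_id):
    """L = L_GAN + lambda_cycle * L_cycle + lambda_id * L_id (Eq. 5)."""
    return l_gan + lambda_cycle * l_cycle + lambda_id * l_id


# Hypothetical two-pixel images standing in for generator outputs.
x, y = [0.2, 0.8], [0.5, 0.5]
x_cycled = [0.25, 0.7]    # stands in for F(G(x))
y_cycled = [0.5, 0.4]     # stands in for G(F(y))
g_of_y = [0.45, 0.5]      # stands in for G(y)
f_of_x = [0.2, 0.8]       # stands in for F(x)

l_cycle = l1_distance(x_cycled, x) + l1_distance(y_cycled, y)  # Eq. (2)
l_id = l1_distance(g_of_y, y) + l1_distance(f_of_x, x)         # Eq. (4)
loss = total_objective(l_gan=-0.5, l_cycle=l_cycle, l_id=l_id,
                       lambda_cycle=10.0, lambda_id=0.5)
```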

After the attention-guided generative adversarial network was trained on the image transformation dataset to obtain the generator G, the source domain images $\{I_{S}^{k}\}_{k = 1}^{M}$ in the object detection dataset were fed into G to generate the fake target domain images $\{I_{S^{'}}^{k}\}_{k = 1}^{M}$, as illustrated in Fig. 3. This gives the fake target domain dataset $D_{S^{'}} = \{I_{S^{'}}^{k}, L_{S^{'}}^{k}\}_{k = 1}^{M}$, whose labels are the same as those of the source domain, i.e., $\{L_{S^{'}}^{k}\}_{k = 1}^{M}$ = $\{L_{S}^{k}\}_{k = 1}^{M}$. The same transformation was applied to the target domain images to obtain the fake source domain images $\{I_{T^{'}}^{k}\}_{k = 1}^{N}$.

Fig. 3

Illustration of the inference of G using the source domain image I_S to get the fake source-like target domain image I_S′. The symbols ⊗, ⊕ and Ⓢ denote element-wise multiplication, element-wise addition, and channel-wise Softmax, respectively.

3.2.2 Fruit detection network

After the previous transformation, the images in the fake target domain D_S′ retain the background of the source domain D_S, while the key features are converted to the style of the essential features in the target domain D_T. This property helps the detection model trained with the fake target domain images to generalize to the real target domain. However, false detections remain possible because of the size, shape and colour differences between the fake and real target domain images. Thus, besides the image transformation that aligns the distributions of the source and target domains, feature alignment is performed during detection model training to reduce missed detections.

Knowledge distillation with a mean teacher model [14] is employed for the feature alignment in the proposed method. The teacher and student models are connected by an exponential moving average (EMA). The teacher model learns the characteristics of the target domain, as shown in Fig. 1, and then teaches the student model how to identify the correct target features. The teacher and student models use the same network structure, shown at the top of Fig. 4, which integrates yolov5 [7] and a context aggregation block (CA) [16].

Fig. 4

Backbone, Neck and Head are the original yolov5 components, with the context aggregation block added between the Neck and Head.

The base model yolov5 (v6.0) includes three modules: Backbone, Neck and Head. The Backbone module, consisting of C3 and Conv modules, is used for feature extraction. The composition of the C3 module is shown at the bottom of Fig. 4, where the Conv module consists of Conv2d, Batch Normalisation (BN) and Sigmoid-weighted Linear Units (SiLU). For an input image of size H × W × 3, the Backbone network generates feature maps at three scales, with sizes of H/8 × W/8, H/16 × W/16, and H/32 × W/32. These feature maps are fused in the Neck module through the feature pyramid and path aggregation network structures. Finally, the Head module outputs three detection layers, with output feature map sizes of H/8 × W/8, H/16 × W/16, and H/32 × W/32. The final detections are obtained through non-maximum suppression and IoU filtering. To improve the feature fusion ability of the network, the CA block [16], which is similar to the self-attention mechanism, is added between the Neck and Head modules, and a skip connection is added to enrich the network features.
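
The three feature map sizes follow directly from the downsampling factors 8, 16 and 32. A minimal sketch, assuming a hypothetical helper name and a 416 × 416 input (the image size of the orange dataset in Table 2; the network input size used for training is not stated here):

```python
def detection_grid_sizes(height, width, strides=(8, 16, 32)):
    """Return the (H/s, W/s) feature map size for each detection scale."""
    sizes = []
    for stride in strides:
        if height % stride or width % stride:
            raise ValueError("input size must be divisible by every stride")
        sizes.append((height // stride, width // stride))
    return sizes


grids = detection_grid_sizes(416, 416)
# grids == [(52, 52), (26, 26), (13, 13)]
```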

The overall loss function of the network is:

$L = L_{\det}(I_{S}, B_{S}, C_{S}) + L_{\det}^{\dagger}(I_{S^{'}}, B_{S}, C_{S}) + \alpha \cdot L_{dis}^{\dagger}(I_{T^{'}}, I_{T}) + \beta \cdot L_{con}$ (6)

where α and β are hyperparameters. $L_{\det}$ is the loss function with the source domain set $D_{S} : \{I_{S}^{k}, L_{S}^{k}\}_{k = 1}^{M}$ as the input for training the detection model:

$L_{\det}(I_{S}, B_{S}, C_{S}) = L_{box}(B_{S}; I_{S}) + L_{cls, obj}(C_{S}; I_{S})$ (7)

where $L_{box}$ is the CIoU loss [39] and $L_{cls, obj}$ is the Focal loss [40]. I_S, B_S and C_S are the input image, its bounding box annotation and its category information, respectively. Similarly, $L_{\det}^{\dagger}$ is the loss function with the fake target domain set $D_{S^{'}} : \{I_{S^{'}}^{k}, L_{S^{'}}^{k}\}_{k = 1}^{M}$ as the model input.
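
To make the weighting in Eq. (6) concrete, the sketch below combines already-computed scalar loss values. It is a simplified illustration, not the training code: the function names are hypothetical, the values of α and β are placeholders, and the consistency term follows Eq. (10), where for scalar losses the 2-norm reduces to an absolute difference.

```python
def consistency_loss(l_det_source, l_det_fake_target):
    """L_con for scalar detection losses (Eq. 10)."""
    return abs(l_det_source - l_det_fake_target)


def overall_loss(l_det_source, l_det_fake_target, l_dis, alpha, beta):
    """L = L_det + L_det^dagger + alpha * L_dis^dagger + beta * L_con (Eq. 6)."""
    l_con = consistency_loss(l_det_source, l_det_fake_target)
    return l_det_source + l_det_fake_target + alpha * l_dis + beta * l_con


# Hypothetical loss values and weights.
total = overall_loss(l_det_source=0.8, l_det_fake_target=0.9,
                     l_dis=0.5, alpha=0.5, beta=0.1)
```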

The distillation loss function $L_{dis}^{\dagger}$ is used to compensate for the inconsistency between the teacher and student models. Its inputs are the unlabeled fake source domain images $D_{T^{'}} : \{I_{T^{'}}^{k}\}_{k = 1}^{N}$ and the target domain images $D_{T} : \{I_{T}^{k}\}_{k = 1}^{N}$. It is defined as:

$L_{dis}^{\dagger}(I_{T^{'}}, I_{T}) = L_{\det}(I_{T}, G_{B}[F_{B}(I_{T^{'}})], G_{C}[F_{C}(I_{T^{'}})])$ (8)

where $F_{B}$ and $F_{C}$ are the predictions of the teacher model for bounding box coordinates and object classes with high maximum category scores, respectively, and $G_{B}$ and $G_{C}$ are the corresponding filters. Specifically, high-confidence bounding boxes are selected by filtering to generate pseudo-labels for the student model in the target domain. The relationship between the student and teacher models is then established through the EMA weight update:

$P_{t} = \gamma P_{t} + (1 - \gamma) P_{s}$ (9)

where $P_{s}$ and $P_{t}$ are the weights of the student and teacher models, respectively, and γ is the exponential decay. Finally, the consistency constraint $L_{con}$ is used to keep the predicted results as consistent as possible:

$L_{con} = \| L_{\det}(I_{S}, B_{S}, C_{S}) - L_{\det}^{\dagger}(I_{S^{'}}, B_{S}, C_{S}) \|_{2}$ (10)
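
The two mechanics described above, confidence filtering of teacher predictions and the EMA update of Eq. (9), can be sketched as follows. This is an illustrative toy, not the authors' code: weights are stored in a plain dictionary, and the decay 0.99, the score threshold 0.5 and the prediction format are assumptions.

```python
def ema_update(teacher, student, gamma):
    """Return teacher weights after P_t = gamma * P_t + (1 - gamma) * P_s."""
    return {name: gamma * teacher[name] + (1.0 - gamma) * student[name]
            for name in teacher}


def filter_pseudo_labels(predictions, score_threshold):
    """Keep only teacher predictions whose confidence reaches the threshold."""
    return [p for p in predictions if p["score"] >= score_threshold]


# Hypothetical single-weight models.
teacher = {"conv.weight": 1.0}
student = {"conv.weight": 0.0}
teacher = ema_update(teacher, student, gamma=0.99)

# Hypothetical teacher predictions on a fake source domain image.
predictions = [
    {"box": (10, 20, 50, 60), "cls": 0, "score": 0.92},
    {"box": (70, 15, 90, 40), "cls": 0, "score": 0.35},
]
pseudo_labels = filter_pseudo_labels(predictions, score_threshold=0.5)
```

With a decay close to 1 the teacher changes slowly, so its pseudo-labels stay stable while the student is updated by gradient descent.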

The added spatial context awareness module is shown in Fig. 5. The mapping is performed by 1 × 1 convolution, which replaces the matrix multiplication between the query and the key with a linear transformation. This reduces the computational complexity of self-attention and simplifies its implementation.
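
One plausible reading of such a block is sketched below in plain Python: a 1 × 1 convolution produces spatial weights, a Softmax-weighted sum pools the values into one global context vector, and the result is gated per position and added back through a skip connection. The layer layout, the sigmoid gate, the function names and all weights are assumptions for illustration and may differ from the block in [16].

```python
import math


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def pointwise(weights, features):
    """Bias-free 1x1 convolution: weights is out x in, features is in x N."""
    n = len(features[0])
    return [[sum(w * features[c][i] for c, w in enumerate(row))
             for i in range(n)] for row in weights]


def context_aggregation(x, w_key, w_value, w_gate, w_merge):
    """x is a C x N feature map (N = H * W flattened); returns C x N."""
    key = softmax(pointwise([w_key], x)[0])        # N spatial weights, sum to 1
    value = pointwise(w_value, x)                  # C' x N values
    context = [sum(v * k for v, k in zip(row, key)) for row in value]
    merged = [sum(w * c for w, c in zip(row, context)) for row in w_merge]
    gate = [sigmoid(g) for g in pointwise([w_gate], x)[0]]  # per-position gate
    # Skip connection: the gated global context is added to the input.
    return [[x[c][i] + merged[c] * gate[i] for i in range(len(gate))]
            for c in range(len(x))]


# Hypothetical 2-channel, 2-position feature map with simple weights.
x = [[1.0, 2.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out = context_aggregation(x, w_key=[0.0, 0.0], w_value=identity,
                          w_gate=[0.0, 0.0], w_merge=identity)
```

Because the context is a single vector per image, the cost grows linearly with the number of positions rather than quadratically as in full query-key attention.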

Fig. 5

The context aggregation block. C, H and W denote the number of channels, height, and width, respectively. ⊗ and ⨀ denote batched matrix multiplication and broadcast Hadamard product, respectively.

4 Results

The proposed DA fruit detection method was tested with the datasets described in Sec. 3.1. The training sets of the image transformation dataset were used to train the generative model G that transforms the orange domain to the apple and tomato domains. Inference of G on the source and target images of the object detection dataset produced fake source-like target images and fake source images, respectively, which were used to train the object detection model R along with the real source and target images. Inference of R on the real target domain (test sets) of the object detection datasets produced the final detection result. Precision, Recall, F₁ score, and mAP were used to evaluate the detection results.
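
For reference, Precision, Recall and the F₁ score are related as in the sketch below (mAP is omitted because it requires the full precision-recall curve). The function names and the counts are illustrative; the last two lines recompute F₁ from the Precision and Recall values reported in Table 4.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(tp, fp, fn):
    """Compute the three metrics from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


# Hypothetical counts: 8 correct detections, 2 false alarms, 2 missed fruits.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)

# Recomputing F1 from the Table 4 Precision/Recall pairs.
f1_attention_gan = round(f1_score(0.937, 0.774), 2)
f1_cyclegan = round(f1_score(0.943, 0.742), 2)
```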

The experiments were conducted with PyTorch (1.11.0) on Ubuntu 20.04.4 LTS with an NVIDIA GeForce RTX3060 (12G) graphics card. For training of the generator G, linear decay was adopted for the learning rate. The learning rate was set to 2 × 10^-5 for the first 100 training cycles and declined linearly to 0 over the subsequent 100 training cycles. The momentum factor and batch size were set to 0.5 and 1, respectively.
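
The schedule described for G can be written as a small function. This is a sketch under stated assumptions: the function name is hypothetical, one training cycle is treated as one epoch, and the decay is taken to start at epoch 100 and reach exactly 0 at epoch 200 (the exact off-by-one convention is not given in the text).

```python
def generator_learning_rate(epoch, base_lr=2e-5,
                            constant_epochs=100, decay_epochs=100):
    """Constant learning rate, then linear decay to zero."""
    if epoch < constant_epochs:
        return base_lr
    remaining = constant_epochs + decay_epochs - epoch
    return base_lr * max(remaining, 0) / decay_epochs


lr_start = generator_learning_rate(0)      # constant phase
lr_middle = generator_learning_rate(150)   # halfway through the decay
lr_end = generator_learning_rate(200)      # fully decayed
```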

The inference of G on the source domain images D_S produced the fake source-like target domains $D_{S^{'}\_tomato}$ and $D_{S^{'}\_apple}$. Fig. 6 illustrates the image transformation results, compared to those transformed with the CycleGAN adopted in EasyDAM [10]. The images generated by CycleGAN are shifted in tone, contain a lot of noise and have confusing background colors. The fruit is blurred in the image, and a large gap remains between the generated images and the target domain images. In contrast, the attention-guided GAN adopted in our method produces images with higher similarity to the target domain. Increased noise, blurring, and tonal shift all impact the accuracy of a single-stage detector [41]. Thus, images generated by the attention-guided GAN are more beneficial to the subsequent object detection.

Fig. 6

Comparison of the generated images using different image transformations. (a) Result using CycleGAN; (b) Result using attention-guided GAN; (c) Target domain image.

A further comparison of the results transformed by CycleGAN and by the attention-guided GAN is presented in Fig. 7, where the t-SNE [42] dimensionality reduction algorithm is employed to analyze and display the feature distribution of the images.

Fig. 7

Feature distribution of the target domain overlapping with that of the transformed source domain (i.e. fake source-like target domain) using CycleGAN and attention-guided GAN separately. As indicated by the arrows, the distribution of the transformed source using attention-guided GAN is closer to target domain than that using CycleGAN.

4.1 Label validation of fake target domain image

Training a yolov5 model on the training set images of D_S, $D_{S^{'}\_tomato}$ and $D_{S^{'}\_apple}$ produced three different detection models. The batch size was set to 32 and the number of training epochs to 100. The stochastic gradient descent (SGD) algorithm was used as the optimizer. The learning rates of the three models were set to 0.01, 0.01, and 0.001, respectively. The inference results of the three models on the corresponding test set images are listed in Table 3. The source dataset D_S has ground truth labels, while the labels of $D_{S^{'}\_tomato}$ and $D_{S^{'}\_apple}$ were transferred from D_S through image transformation. Thus, the consistent detection accuracy between the source and the transformed domains reported in Table 3 validates the usability of the labels in $D_{S^{'}\_tomato}$ and $D_{S^{'}\_apple}$.

Table 3
Evaluation results for test datasets of D_S, $D_{S^{'}\_tomato}$ and $D_{S^{'}\_apple}$

Model	Datasets	Precision	Recall	F1 Score	mAP
yolov5	D_S	0.937	0.912	0.92	0.963
yolov5	$D_{S^{'}\_tomato}$	0.935	0.912	0.92	0.955
yolov5	$D_{S^{'}\_apple}$	0.935	0.915	0.92	0.959

4.2 Label validation of real target domain image: orange to tomato

The last row of Table 4 shows the detection accuracy of our method, i.e. the inference accuracy on the target domain tomato dataset. The proposed method achieves an mAP of 0.872, compared with 0.856 when CycleGAN is used for the image transformation. For the image transformation stage shown in Fig. 8, the proposed method outperforms the CycleGAN method in terms of generated image quality, background realism, and similarity between the target and source domains.

Table 4
Comparison of different models for tomato detection

Model	Precision	Recall	F1 Score	mAP
MADA-yolov5+CycleGAN	0.943	0.742	0.83	0.856
MADA-yolov5+AttentionGAN	0.937	0.774	0.85	0.872

Fig. 8

Comparison of the generated images using different image transformations. (a) Result using CycleGAN; (b) Result using attention-guided GAN; (c) Original image.

Table 5 presents a comparison of the detection accuracy of the MADA-yolov5 method with other methods on the target domain tomato dataset. In the table, KD stands for knowledge distillation and CA stands for context aggregation block. Knowledge distillation achieves feature alignment across domains, which allows the target domain information to be passed to the student model. This improves the model's ability to identify targets (Fig. 9b). Compared to the original yolov5 network, the improved yolov5 detects the differently coloured tomato in the top right corner.

Table 5

Comparison of different models for tomato detection

Model	Precision	Recall	F1 Score	mAP
EasyDAM [10]	0.769	0.767	0.768	0.769
yolov7 [8]	0.893	0.771	0.83	0.829
MADA-yolov5(KD,CA)	0.937	0.774	0.85	0.872
yolov5 [7]	0.901	0.735	0.81	0.813
MADA-yolov5(KD)	0.892	0.784	0.83	0.861(+4.8)
MADA-yolov5(KD,CA)	0.937	0.774	0.85	0.872(+1.1)

Fig. 9

(a) Original yolov5 detection result; (b) Detection result after adding knowledge distillation structure; (c) Detection result after adding context aggregation block.

To illustrate the role of the CA block, a heat map is used to show the attentional weights of the network. As shown in Fig. 10, after adding the CA block, the network learns the global features and the attention weight of the network shifts from the lower left corner to the target itself.

Fig. 10

Heat map comparison for the CA block. (a), (c) Heat maps without the CA block; (b), (d) heat maps with the CA block added. The deeper the colour in the heat map, the higher the weighting.

4.3 Label validation of real target domain image: orange to apple

The last row of Table 6 shows the detection accuracy of our method, i.e. the inference accuracy on the target domain apple dataset. The proposed method achieves an mAP50 of 0.899, compared with 0.856 when CycleGAN is used for the image transformation. For the image conversion stage shown in Fig. 11, the conversion using the CycleGAN method loses some image information in the target domain (Fig. 7a), and the background of the image becomes cluttered with many noisy spots, all of which greatly affect the detection accuracy of the subsequent model. As a result, many leaves are identified as targets. In contrast, the attention-guided GAN keeps the background realistic and converts the vast majority of targets to the style of the corresponding domain, which preserves the accuracy of the detector and greatly lowers the probability of recognition errors.

Table 6
Comparison of different models for apple detection

Model	Precision	Recall	F1 Score	mAP50
MADA-yolov5+CycleGAN	0.818	0.796	0.81	0.856
MADA-yolov5+AttentionGAN	0.864	0.855	0.86	0.899

Fig. 11

Comparison of the generated images using different image transformations. (a) Result using CycleGAN; (b) Result using attention-guided GAN; (c) Original image.

Table 7 compares the detection accuracy of the MADA-yolov5 method with that of other methods on the target domain apple dataset. As shown in Fig. 12, yolov5 misses many targets when tested on the apple dataset (Fig. 12a), and its recall increases from 0.835 to 0.861 after the target features are aligned through knowledge distillation. This is because, after alignment, the darker targets are transformed into brighter source domain targets (Fig. 12b); these target features are learned by the student model and therefore improve the performance of the detector. However, because occlusion in the apple dataset is very heavy, the feature alignment also causes the network to learn some leaf information, which increases the false positive rate (precision drops from 0.828 to 0.809). With the CA block, the network learns context-rich features, so its precision increases and false detections decrease (Fig. 12c).

Table 7

Comparison of different models for apple detection

Model	Precision	Recall	F1 Score	mAP
EasyDA [10]	0.828	0.836	0.832	0.875
yolov7 [8]	0.83	0.775	0.80	0.866
MADA-yolov5(KD,CA)	0.864	0.855	0.86	0.899
yolov5 [7]	0.828	0.835	0.83	0.87
MADA-yolov5(KD)	0.809	0.861	0.83	0.878(+0.8)
MADA-yolov5(KD,CA)	0.864	0.855	0.86	0.899(+2.9)
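The bracketed gains in Table 7 and the improvements quoted in the abstract can be reproduced as differences in percentage points. The sketch below assumes that the Table 7 gains are measured against the yolov5 baseline and that the abstract's figures are measured against EasyDA [10]; the helper name `gain_points` is illustrative.

```python
def gain_points(ours, baseline):
    """Return the mAP difference in percentage points, to one decimal."""
    return round((ours - baseline) * 100.0, 1)


# Ablation on the apple dataset, relative to the yolov5 baseline (Table 7).
kd_gain = gain_points(0.878, 0.870)
kd_ca_gain = gain_points(0.899, 0.870)

# Full method against EasyDA [10] on tomato (Table 5) and apple (Table 7).
tomato_gain = gain_points(0.872, 0.769)
apple_gain = gain_points(0.899, 0.875)

print(kd_gain, kd_ca_gain, tomato_gain, apple_gain)
```

Under these assumptions the four values are 0.8, 2.9, 10.3 and 2.4 points, matching the table and the abstract.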

Fig. 12

(a) Original yolov5 detection result; (b) Detection result after adding knowledge distillation structure; (c) Detection result after adding CA Block.

As can be seen from the heat map (Fig. 13), adding the CA block gives the network a richer feature extraction capability: its attention weight shifts to the apple target itself, which reduces false detections and increases accuracy, improving the overall detection performance of the model.

Fig. 13

CA block contrast heat map. (a) Heat map without the CA block; (b) heat map with the CA block added. The more intense the colour in the heat map, the higher the weight.

5 Conclusion

This paper proposes an attention-guided domain adaptive fruit detection method. The images generated by the adopted attention-guided GAN are more similar to the target domain images and have a more realistic background. Integrating knowledge distillation with the mean teacher model achieves feature alignment, and introducing the context aggregation module enriches feature learning and improves the detection capability of the network. Image-to-image transformation is a viable approach to DA fruit detection, and a context-aware local transformation is more desirable and benefits the subsequent detection. A limitation of this study is that performance in the label transfer process is not yet stable enough and depends on the hyperparameter settings. Future work will address this issue.

References

1. Mason-D’Croz, Bogard J.R. and Sulser T.B., Gaps between fruit and vegetable production, demand, and recommended consumption at global and national levels: An integrated modelling study, The Lancet Planetary Health 3(7) (2019), e318–e329.

2. He, Fang and Zhao, Fruit yield prediction and estimation in orchards: A state-of-the-art comprehensive review for both direct and indirect methods, Computers and Electronics in Agriculture 195 (2022), 106812.

3. Koirala, Walsh K.B. and Wang, Deep learning–Method overview and review of use for fruit detection and yield estimation, Computers and Electronics in Agriculture 162 (2019), 219–234.

4. Ren, He and Girshick, Faster r-cnn: Towards real-time object detection with region proposal networks, Advances in Neural Information Processing Systems 28 (2015).

5. Redmon and Farhadi, Yolov3: An incremental improvement, arXiv preprint arXiv:1804.02767, 2018.

6. Bochkovskiy, Wang C.Y. and Liao H.Y.M., Yolov4: Optimal speed and accuracy of object detection, arXiv preprint arXiv:2004.10934, 2020.

7. Glenn, https://github.com/ultralytics/yolov5, 2020.

8. Wang C.Y., Bochkovskiy and Liao H.Y.M., YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 7464–7475.

9. Zhu J.Y., Park and Isola, Unpaired image-to-image translation using cycle-consistent adversarial networks, Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2223–2232.

10. Zhang, Chen and Wang, Easy domain adaptation method for filling the species gap in deep learning-based fruit detection, Horticulture Research 8 (2021).

11. Park, Efros A.A. and Zhang, Contrastive learning for unpaired image-to-image translation, Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, Proceedings, Part IX 16, Springer International Publishing, 2020, pp. 319–345.

12. James, Gu and Chapman, Domain Adaptation for Plant Organ Detection with Style Transfer, 2021 Digital Image Computing: Techniques and Applications (DICTA), IEEE, 2021, pp. 1–9.

13. Tang, Liu and Xu, Attentiongan: Unpaired image-to-image translation using attention-guided generative adversarial networks, IEEE Transactions on Neural Networks and Learning Systems, 2021.

14. Zhou, Jiang and Lu, SSDA-YOLO: Semi-supervised domain adaptive YOLO for cross-domain object detection, Computer Vision and Image Understanding 229 (2023), 103649.

15. Tarvainen and Valpola, Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results, NIPS 30 (2017).

16. Liu, Li and Hu, Learning to Aggregate Multi-Scale Context for Instance Segmentation in Remote Sensing Images, arXiv preprint arXiv:2111.11057, 2021.

17. Tian, Shen and Chen, Fcos: Fully convolutional one-stage object detection, Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9627–9636.

18. Bargoti and Underwood, Deep fruit detection in orchards, IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2017, pp. 3626–3633.

19. Häni, Roy and Isler, A comparative study of fruit detection and counting methods for yield mapping in apple orchards, Journal of Field Robotics 37(2) (2020), 263–282.

20. Behera S.K., Rath A.K. and Sethy P.K., Fruits yield estimation using Faster R-CNN with MIoU, Multimedia Tools and Applications 80 (2021), 19043–19056.

21. Tian, Yang and Wang, Apple detection during different growth stages in orchards using the improved YOLO-V3 model, Computers and Electronics in Agriculture 157 (2019), 417–426.

22. Liu, Nouaze J.C. and Touko Mbouembe P.L., YOLO-tomato: A robust algorithm for tomato detection based on YOLOv3, Sensors 20(7) (2020), 2145.

23. Junos M.H., Mohd Khairuddin A.S. and Thannirmalai, Automatic detection of oil palm fruits from UAV images using an improved YOLO model, The Visual Computer (2021), 1–15.

24. Chen, Li, Sakaridis, Dai and Gool L.V., Domain adaptive faster r-cnn for object detection in the wild, in CVPR, 2018, pp. 3339–3348.

25. Zhu, Pang and Yang, Adapting object detectors via selective cross-domain alignment, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 687–696.

26. Vidit and Salzmann, Attention-based domain adaptation for single stage detectors, arXiv preprint arXiv:2106.07283, 2021.

27. Hnewa and Radha, Multiscale domain adaptive yolo for cross-domain object detection, arXiv preprint arXiv:2106.01483, 2021.

28. Hsu H.K., Yao C.H. and Tsai Y.H., Progressive domain adaptation for object detection, Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2020, pp. 749–757.

29. Shen, Huang and Shi, CDTD: A large-scale cross-domain benchmark for instance-level image-to-image translation and domain adaptive object detection, International Journal of Computer Vision 129 (2021), 761–780.

30. Hartley Z.K.J. and French A.P., Domain adaptation of synthetic images for wheat head detection, Plants 10(12) (2021), 2633.

31. Chen, Zheng and Ding, Harmonizing transferability and discriminability for adapting object detectors, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 8869–8878.

32. Yan, Fan and Lei, A real-time apple targets detection method for picking robot based on improved YOLOv5, Remote Sensing 13(9) (2021), 1619.

33. Zhang, Wang and Zhang, Dragon fruit detection in orchard natural environment by integrating lightweight network and attention mechanism, Frontiers in Plant Science (2022), 4171.

34. Liu, Jia and Wang, An accurate detection and segmentation model of obscured green fruits, Computers and Electronics in Agriculture 197 (2022), 106984.

35. Wang, Cheng and Huang, A deep learning approach incorporating YOLO v5 and attention mechanisms for field real-time detection of the invasive weed Solanum rostratum Dunal seedlings, Computers and Electronics in Agriculture 199 (2022), 107194.

36. Li and Wu, Improved YOLO v5 wheat ear detection algorithm based on attention mechanism, Electronics 11(11) (2022), 1673.

37. Häni, Roy and Isler, MinneApple: A benchmark dataset for apple detection and segmentation, IEEE Robotics and Automation Letters 5(2) (2020), 852–858.

38. Mu, Chen T.S. and Ninomiya, Intact detection of highly occluded immature tomatoes on plants using deep learning techniques, Sensors 20(10) (2020), 2984.

39. Zheng, Wang and Ren, Enhancing geometric factors in model learning and inference for object detection and instance segmentation, IEEE Transactions on Cybernetics 52(8) (2021), 8574–8586.

40. Lin T.-Y., Goyal, Girshick, He and Dollár, Focal loss for dense object detection, in ICCV, 2017, pp. 2980–2988.

41. Yuan, Choi and Bolkas, Sensitivity examination of YOLOv4 regarding test image distortion and training dataset attribute for apple flower bud classification, International Journal of Remote Sensing 43(8) (2022), 3106–3130.

42. Van der Maaten and Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research 9(11) (2008).
