Abstract
Yucatan has a variety of plant species of melliferous importance. The honey produced in Yucatan has several special properties that make it one of the most sought-after honeys internationally. Analyzing the pollen grains present in honey is essential to determine its quality and identify its plants of origin. This analysis is a time-consuming process that must be carried out by highly trained palynologists. In this work, we propose an improved model based on a fully convolutional neural network for the automatic detection of pollen grains in microscopic images of four plant species of Yucatan, as a contribution to the analysis of the honey designation of origin.
Introduction
The state of Yucatan, located in the southeast of Mexico, is home to a wide variety of plant species essential for honey production. Yucatan’s honey has a distinctive color, flavor, and aroma, which make it exceptional and increase its popularity and demand in international markets. Honey contains pollen grains, and because their morphology differs widely between species, they can be analyzed to assess the quality of the honey and to determine the plant species from which it was produced [3]. However, identifying and counting pollen grains requires trained specialists to inspect samples visually under a microscope, which is a slow process. Therefore, this work proposes to automate the identification of pollen grains with deep learning models in order to accelerate the process and improve its accuracy.
Currently, there are several computational methods for pollen grain classification using machine learning and pattern recognition techniques [18]; however, few works focus on automatic detection (localization and classification). For instance, in [7], Martínez et al. extracted second-order texture features from the gray level co-occurrence matrix (GLCM) at multiple scales in microscopic images of pollen grains. The extracted features served as input to an artificial neural network used for classification. This approach yielded a classification accuracy of 88%. However, the classification relies on a reduced set of features that may not be sufficient to differentiate properly between species. On the other hand, in [5], He et al. proposed an unsupervised method for the automatic analysis of pollen grains in unlabeled images. Their database consists of 650 pollen grain images collected from honey samples. They used a convolutional neural network based on the YOLO structure [12] to generate a bounding box for every object detected in the image. Image fragments of potential grains were then extracted from the bounding box coordinates, and a convolutional neural network based on the VGG-16 architecture [15] automatically extracted features from these fragments. Finally, the K-means algorithm grouped the fragments according to the extracted features. This method identifies clusters of pollen grains with similar features in unlabeled images; it does not classify the objects by species.
In [9], Murkute classifies pollen grains with an ensemble of fully convolutional neural networks (FCNNs) based on the EfficientNet architecture [16]. These models were trained and tested on the publicly available Pollen-13K database [1], which contains 13,000 images of pollen grains, and achieved an accuracy of 97.28% when classifying individual pollen grains. This method can be used only for single-object classification, that is, for images containing a single pollen grain. Because several models run as an ensemble, the response time increases. Additionally, each EfficientNet-based module ends with two fully connected layers, which slows down inference even further.
In this work, we introduce a more efficient deep learning model based on the architectures Res2Unet [10] and YOLOv4 [2] to achieve real-time detection of pollen grains of four beekeeping-relevant species in the state of Yucatan (see Fig. 2).
Methodology
Dataset
The database used in this study consists of microscopic images obtained from the Laboratorio de Ciencia de Alimentos del Campo Experimental Mocochá del Centro de Investigación Regional del Sureste. The dataset comprises 370 images of pollen grains from four plant species of beekeeping importance in the state of Yucatan. The plant species, shown in Fig. 1, are the following: Box Catzín (Acacia gaumeri), Chaká (Bursera simaruba), Jabín (Piscidia piscipula), and Tajonal (Viguiera dentata). Table 1 shows the detailed distribution of the pollen grain images in the dataset. All images were acquired using an optical Motic microscope equipped with 10×, 40× and 100× objectives. The images were taken in RGB format at a resolution of 3664 × 2748 pixels. The database is available at http://tinyurl.com/5n7bk3yr. To speed up training and reduce GPU memory usage, the images were downscaled to 25% of their original dimensions, resulting in a final size of 916 × 687 pixels.
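The reported image sizes can be checked with a short calculation. The sketch below is only an illustration and not the authors' preprocessing code; the function name `scaled_size` is hypothetical, and it assumes that both dimensions are multiplied by the same factor of 0.25.

```python
def scaled_size(width, height, factor):
    """Return the integer (width, height) after uniform scaling."""
    return round(width * factor), round(height * factor)


# Original resolution reported for the microscope images.
original = (3664, 2748)
reduced = scaled_size(*original, factor=0.25)
print(reduced)  # (916, 687)
```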

Fig. 1. Pollen grain samples of the four species used in this work. (a) Box catzín; (b) Chaká; (c) Jabín; (d) Tajonal.
Table 1. Distribution of pollen images across the training, validation and test subsets

Fig. 2. YOLO-based model architecture diagram.
A wide range of neural network architectures has been designed specifically for object detection. Among them, the YOLO (You Only Look Once) network [12] has gained popularity because it detects objects quickly while achieving accuracy comparable to that of more complex networks that require several seconds per detection. Consequently, YOLO is well suited to real-time object detection applications. YOLO is a single-stage convolutional model that treats object detection as a regression problem. Since its first release, successive versions of the architecture have incorporated incremental enhancements, resulting in improved performance and faster detection [2, 19].
YOLO comprises three primary components: a backbone that extracts significant features from the input image through a series of convolutions, a neck that combines these features into more complex and abstract representations, and a head that performs detection, producing a vector with the coordinates of the bounding boxes that enclose the objects in the image. One notable benefit of the YOLO architecture is the flexibility of its backbone: the original module (CSPDarkNet-53) can be replaced with a different one, so that a model can be trained for a particular set of objects. This adaptability improves efficiency, because CSPDarkNet-53 was designed to detect a wide variety of objects, including those found in the ImageNet dataset [4, 13].
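To make the bounding box coordinates concrete, the sketch below converts one box from a normalized center format (x center, y center, width, height, each in the range 0 to 1) into pixel corner coordinates. This box format is an assumption made for illustration, since the exact layout of the output vector is not specified here, and the function name `box_to_corners` is hypothetical.

```python
def box_to_corners(box, img_w, img_h):
    """Convert a normalized (cx, cy, w, h) box to pixel (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    x1 = (cx - w / 2) * img_w
    y1 = (cy - h / 2) * img_h
    x2 = (cx + w / 2) * img_w
    y2 = (cy + h / 2) * img_h
    return (x1, y1, x2, y2)


# A box centered in a 916 x 687 image that covers half of each dimension.
print(box_to_corners((0.5, 0.5, 0.5, 0.5), 916, 687))
```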
One strategy for improving the ability of a fully convolutional neural network (FCNN) to identify and analyze patterns in images is to add layers, increasing the depth of the network. Greater depth increases the number of parameters that must be optimized by gradient descent, and it also limits how many layers can usefully be added, because deep neural networks suffer from the well-documented vanishing gradient problem. This problem occurs when the gradient vector, which represents the direction and magnitude of the error signal used to update the model’s parameters, diminishes toward zero as it is propagated back through the layers. The parameters then stop changing, and the model can no longer continue learning. To address this problem, He et al. proposed the residual connection [6].
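A toy calculation illustrates why the gradient shrinks with depth. The sketch assumes a plain chain of layers whose local derivative is at most 0.25 (the maximum slope of the sigmoid function) and ignores the layer weights. It only illustrates the effect and does not model any network evaluated in this work; the function name is hypothetical.

```python
def chained_gradient(local_derivative, depth):
    """Product of identical per-layer derivatives along a chain of layers."""
    return local_derivative ** depth


# The backpropagated signal shrinks exponentially as layers are added.
for depth in (1, 5, 20):
    print(depth, chained_gradient(0.25, depth))
```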
This concept can be mathematically represented as follows:
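In the notation commonly used for residual learning (assumed here; the symbols are not taken from this text), a residual unit with input $x$, stacked layers $\mathcal{F}$, and layer weights $\{W_i\}$ computes:

```latex
y = \mathcal{F}(x, \{W_i\}) + x,
\qquad
\frac{\partial y}{\partial x} = \frac{\partial \mathcal{F}}{\partial x} + I
```

The identity term $I$ in the derivative gives the error signal a direct path through the unit, so the gradient does not vanish even when $\partial \mathcal{F} / \partial x$ is small.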
A fully convolutional network architecture is presented in [10] for the segmentation of the Trypanosoma cruzi parasite. The architecture, based on the ResUnet, enhances the segmentation capability of the original model. Improvements are achieved by combining the Unet and ResNet architectures and modeling ResNets as a system that follows a differential equation in the form of:
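As general background, and not necessarily the exact form used in [10], a residual unit $x_{n+1} = x_n + \mathcal{F}(x_n)$ can be read as one forward Euler step of size $h = 1$ for an ordinary differential equation. The symbols below are assumptions chosen for illustration:

```latex
\frac{dx}{dt} = \mathcal{F}(x(t)),
\qquad
x_{n+1} = x_n + h\,\mathcal{F}(x_n)
```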
The FCN consists of three primary modules, namely the encoder, bridge, and decoder. The encoder module is responsible for extracting the inherent features of the input image. The bridge module acts as a connector between the encoder and decoder, facilitating the flow of information. Lastly, the decoder module is designed to generate a feature map that represents the segmented image, restoring it to its original size. This is necessary because the size of the feature map decreases by half after each convolution operation in the encoder module.
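The halving of the feature map in the encoder and its restoration in the decoder can be traced with simple arithmetic. The sketch assumes a square 512 × 512 input and four encoder stages. Both numbers are illustrative and are not taken from [10], and the function names are hypothetical.

```python
def encoder_sizes(size, stages):
    """Feature map side length after each halving step of the encoder."""
    sizes = [size]
    for _ in range(stages):
        size //= 2
        sizes.append(size)
    return sizes


def decoder_sizes(size, stages):
    """Feature map side length after each doubling step of the decoder."""
    sizes = [size]
    for _ in range(stages):
        size *= 2
        sizes.append(size)
    return sizes


down = encoder_sizes(512, 4)
up = decoder_sizes(down[-1], 4)
print(down)  # [512, 256, 128, 64, 32]
print(up)    # [32, 64, 128, 256, 512]
```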
This work introduces a novel backbone for the YOLOv4 architecture, trained specifically for the identification and localization of pollen grains from various plant species found in the region of Yucatan (code may be available at https://github.com/clirlab/polen). The design of our backbone is based on the Res2Unet architecture, improved with the Mish activation function [8] and the utilization of weighted residual connections. The proposed model introduces a convolutional technique that incorporates residual components, as depicted in Fig. 3.

Fig. 3. Proposed residual unit scheme.
Each of the weights used in the short and long skips is normalized as follows:
There is evidence that shallower residual networks can outperform their deeper counterparts [17, 20]. Furthermore, weighting the residual connections mitigates the vanishing gradient problem and helps the network converge faster than a conventional network without residual connections [14].
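A weighted residual connection can be sketched as a weighted sum of the convolutional path and the skip path. The normalization used below (each raw weight is passed through ReLU and divided by the sum of all weights plus a small constant) is an assumption chosen for illustration and may differ from the formula used by the authors. All function and variable names are hypothetical.

```python
def normalize_weights(weights, eps=1e-4):
    """Clamp each raw weight at zero, then divide by the total (plus eps)."""
    positive = [max(w, 0.0) for w in weights]
    total = sum(positive) + eps
    return [w / total for w in positive]


def weighted_shortcut(branches, weights):
    """Element-wise weighted sum of equally sized feature vectors."""
    norm = normalize_weights(weights)
    length = len(branches[0])
    return [sum(w * branch[i] for w, branch in zip(norm, branches))
            for i in range(length)]


conv_path = [0.2, -0.1, 0.4]   # output of the convolutional layers
skip_path = [1.0, 1.0, 1.0]    # input carried by the skip connection
print(weighted_shortcut([conv_path, skip_path], weights=[0.8, 1.2]))
```

With two branches, the two normalized weights play the role of W1 and W2 in a shortcut layer; because they sum to approximately 1, the scale of the fused features stays close to that of the inputs.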
Each residual unit consists of two layers, each of which merges a convolution with batch normalization and a Mish activation function, as depicted in Fig. 4. Figure 5 shows how the proposed backbone is integrated into the YOLOv4 network. Table 2 gives the number of filters and the kernel size for each convolution in our proposed model. In Table 2, only 3 × 3 kernels have padding, “sc” means shortcut, and the Filters column uses the format: number of filters, kernel size/stride.
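The Mish activation [8] is defined as the input multiplied by the hyperbolic tangent of its softplus. The sketch below evaluates it for scalar inputs in plain Python; the helper names are hypothetical, and a real implementation would apply the function element-wise to feature maps.

```python
import math


def softplus(x):
    """Numerically stable ln(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


# Mish is smooth, passes through zero, and approaches the identity for large x.
for x in (-2.0, 0.0, 2.0, 20.0):
    print(x, mish(x))
```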

Fig. 4. Proposed residual unit structure.

Fig. 5. Proposed YOLOv4-based model.
Table 2. Proposed WRes2Net backbone structure
Four convolutional neural network architectures were evaluated in experiments on the automatic detection of pollen grains: the original YOLOv4 and YOLOv7 architectures, used for comparison, and versions of both that incorporate our backbone (WRes2Net). The modifications to the original architectures were implemented with code from the open-source Darknet repository [11]. The experiments were run on a computer equipped with an Nvidia RTX 2070 GPU to accelerate training.
Each of the four models was trained for 10,000 iterations. The optimization algorithm was stochastic gradient descent with restarts, with an initial learning rate of 0.0026, a momentum of 0.949, and a decay rate of 0.0005. Figure 6 shows the training error of the YOLOv4, YOLOv7, YOLOv4+WRes2Net, and YOLOv7+WRes2Net networks. Our model (YOLOv4+WRes2Net) converges rapidly, much like the original YOLOv4.
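Stochastic gradient descent with restarts is commonly implemented as cosine annealing with warm restarts: within each cycle the learning rate decays from its initial value along a cosine curve and then resets. The sketch below uses the initial learning rate of 0.0026 given above. The cycle length of 1,000 iterations and the minimum rate of 0 are assumptions for illustration, and the schedule implemented in Darknet may differ in detail.

```python
import math


def sgdr_learning_rate(iteration, lr_max=0.0026, lr_min=0.0, cycle_length=1000):
    """Cosine-annealed learning rate that restarts every `cycle_length` steps."""
    position = (iteration % cycle_length) / cycle_length
    cosine = 1.0 + math.cos(math.pi * position)
    return lr_min + 0.5 * (lr_max - lr_min) * cosine


# Start of a cycle, middle of a cycle, and the first step after a restart.
for iteration in (0, 500, 1000):
    print(iteration, sgdr_learning_rate(iteration))
```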

Fig. 6. Learning curves of the models: (a) Original YOLOv4; (b) Original YOLOv7; (c) YOLOv4+WRes2Net; (d) YOLOv7+WRes2Net.
Our model has approximately 10.2 million trainable parameters, whereas the original YOLOv4 model has roughly 27.5 million, as indicated in Table 3. Despite having far fewer parameters, our model achieves a comparable level of detection performance.
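The size difference can be expressed as a relative reduction. The sketch below uses the two approximate parameter counts given in the text; the function name is hypothetical.

```python
def parameter_reduction(ours, baseline):
    """Fraction of the baseline's parameters that the smaller model removes."""
    return 1.0 - ours / baseline


# Approximate trainable parameters, in millions.
reduction = parameter_reduction(ours=10.2, baseline=27.5)
print(f"{reduction:.1%}")  # 62.9%
```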
Table 3. Number of trainable parameters of the evaluated models (backbone only)
Figure 7 shows sample qualitative results for all plant species and every model. The performance of all trained models is presented in Table 4. Table 5 reports the values of the weights (W1 and W2) learned in the shortcut layers of the models that incorporate our backbone.

Fig. 7. Detection results for all plant species: (a)–(d) Ground truth; (e)–(h) YOLOv4; (i)–(l) YOLOv7; (m)–(p) YOLOv4+WRes2Net; (q)–(t) YOLOv7+WRes2Net.
Table 4. Performance results of the evaluated models
Table 5. Final weight values in the shortcut layers of the YOLOv4 and YOLOv7 models with our backbone
Pollen analysis is a very useful technique for assessing the quality of honey, reliably labeling its botanical origin, and determining where the bees have collected nectar. In this study, we presented a deep learning model developed to detect pollen grains in microscopic images. Our model is based on YOLOv4 and Res2Unet, which are widely recognized for their effectiveness in object detection and segmentation tasks, respectively. The model achieved a mean average precision (mAP) of 98.83% at a threshold of 0.5 on our database. With fewer parameters than other architectures, it achieves a performance similar to that of YOLOv4 with its CSPDarkNet-53 backbone. This gain in detection speed without loss of precision is of considerable significance. To the best of our knowledge, this is the first deep model for pollen analysis of plant species of Yucatan.
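In detection benchmarks, a threshold of 0.5 for mAP usually refers to the intersection over union (IoU) between a predicted box and a ground-truth box, and that reading is assumed here. The sketch below computes IoU for boxes given as pixel corners (x1, y1, x2, y2) and applies the threshold. It illustrates the criterion only, is not the evaluation code used in this work, and its function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def is_match(predicted, ground_truth, threshold=0.5):
    """A prediction counts as correct when its IoU reaches the threshold."""
    return iou(predicted, ground_truth) >= threshold


truth = (100, 100, 200, 200)
print(is_match((110, 105, 205, 195), truth))  # large overlap
print(is_match((180, 180, 280, 280), truth))  # small overlap
```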
