Tomato maturity stage prediction based on vision transformer and deep convolution neural networks

Abstract

Automated assessment of tomato crop maturity is vital for improving agricultural productivity and reducing food waste. Traditionally, farmers have relied on visual inspection and manual assessment to predict tomato maturity, which is time-consuming and prone to human error. Computer vision and deep learning automate this process by analysing visual characteristics, enabling data-driven harvest decisions, optimising quality, and reducing waste for sustainable and efficient agriculture. This research demonstrates that deep learning models can accurately classify tomato maturity stages using computer vision techniques, utilising a novel dataset of 4,353 tomato images. The Vision Transformer (ViT) model exhibited superior performance in classifying tomatoes into three ripeness categories (immature, mature, and partially mature), achieving a testing accuracy of 98.67%. The convolutional neural network (CNN) models EfficientNetB1, EfficientNetB5, EfficientNetB7, InceptionV3, ResNet50, and VGG16 achieved testing accuracies of 88.52%, 89.84%, 91.16%, 90.94%, 93.15%, and 92.27%, respectively, when tested with unseen data. ViT thus surpassed every CNN model, exceeding the best of them (ResNet50) by 5.52 percentage points. This research highlights the potential for deploying ViT in agricultural environments and packaging facilities to monitor tomato maturity stages. Transformer-based systems could substantially reduce food waste and improve producer profits and productivity by optimising fruit harvest time and sorting decisions.

Keywords

Tomato maturity assessment, vision transformer, computer vision, deep learning, convolution neural network

1. Introduction

The agriculture industry has witnessed significant technological advancements, particularly in crop monitoring and management. Among various crops, tomatoes hold significant economic importance and are extensively cultivated worldwide. Determining tomato maturity stages ensures optimal harvest timing, quality control, and efficient supply chain management. Traditionally, farmers rely on manual inspection to assess tomato maturity, which is labour-intensive, time-consuming, and prone to subjective errors. Automated tomato maturity stage prediction based on recent advances in machine learning and computer vision has gained attention because of its potential to streamline the process, enhance accuracy, and reduce labour costs. Computer vision offers a promising set of tools for rapid, low-cost, non-destructive tomato maturity prediction. Computer vision and deep learning techniques are popular in the agriculture and food industries due to their effectiveness in performing critical tasks, such as grading, sorting, and fruit recognition. In addition, deep neural techniques have been used to classify tomatoes into maturity stages such as immature, partially mature and mature [1]. Deep neural networks can learn directly from raw tomato images across various datasets and independently identify meaningful patterns that characterise maturity levels. This data-driven strategy circumvents the need for manual feature extraction. Deep learning models based on convolutional architectures have set new benchmarks for tomato image analysis, particularly in maturity prediction [2] and sorting applications [3]. The key to their success lies in the efficient use of pre-trained neural networks. These networks have been trained on extensive labelled datasets, which can significantly reduce the computational cost of the training process.
Through automated feature learning and transfer learning, deep neural networks can be fine-tuned efficiently for specific tasks such as tomato maturity classification and can outperform manual inspection when analysing tomato images to determine maturity level. Tomatoes go through distinct colour changes, from green (immature) through pink (partially mature) to red (mature). Image processing analyses tomato images and extracts colour features to determine their ripeness stage. Feature extraction from images using deep learning presents a promising solution for the classification process with high accuracy and efficiency. This colour-based computer vision approach allows consistent and scalable assessment of tomato maturity compared to subjective human evaluation. The concept could also be extended to assess the ripeness of other fruits and vegetables.

Several studies have explored the application of deep learning in agriculture, particularly using Convolutional Neural Networks (CNNs) for crop and fruit classification tasks. CNNs have demonstrated remarkable success in image recognition tasks by automatically extracting and learning features from images. Non-destructive prediction of tomato maturity has been evaluated using colour sensors that measure parameters like green and red [4]. Hyperspectral imaging [5] and spatially resolved spectroscopic systems [6] have been used to assess tomato qualities such as soluble solids and firmness. Computer vision systems and deep neural networks have shown promise for various fruit quality assessments [7, 8], including size estimation [9, 10] and colour-based maturity prediction [11]. Techniques such as deep learning, transfer learning, and Back Propagation Neural Networks (BPNNs) have been developed for applications such as maturity detection [12, 13, 14], crop identification [15] and fruit sorting [16]. Maturity detection methods involving image processing and CNNs have been proposed to detect tomato maturity and even predict harvest times for individual fruits [17, 18]. Recent deep learning models can also predict disease in tomato leaves [19]. Additionally, Transformers have been used for tomato ripeness prediction using segmentation [20, 21]. These technologies aim to support the development of automated tomato sorting and grading tools to optimise harvesting. Moreover, real-time tomato detection has been deployed using deep learning models such as the Multi-level deep residual network [22], the YOLOv4-tiny model [23] and the YOLOv5 model [24]. Despite the proven effectiveness of CNNs in image-based applications within agriculture, they have limitations in capturing long-range dependencies between image parts due to their local receptive fields. Vision Transformer (ViT), a novel architecture proposed by Dosovitskiy et al. [25], has shown remarkable performance in various computer vision tasks, surpassing traditional CNN architectures in certain scenarios. ViT uses self-attention mechanisms to process images as sequences of patches rather than relying on the inductive biases of convolutional layers. This potentially gives ViTs more flexibility and efficiency in capturing global dependencies within images. However, ViT remains relatively unexplored for agricultural uses, particularly in predicting crop maturity stages such as those of tomatoes, which provides an interesting research opportunity. Additionally, comparative studies evaluating efficiency and performance against established transfer learning CNN approaches could offer useful insights.

This research assesses the effectiveness of the proposed ViT model by comparing it with transfer learning-based CNNs in predicting the maturity stages of tomatoes from images. It involves a comprehensive evaluation based on accuracy, computational efficiency, and the ability to generalise across different tomato varieties and environmental conditions. The results mark a significant step towards harnessing the power of deep learning in transforming agricultural practices, ultimately contributing to enhanced food security and sustainability. Moreover, the study provides detailed insights into each approach. This research endeavours to contribute to advancement in agricultural practices and pave the way for more effective crop management strategies.

The remainder of this manuscript is organised as follows: Section 2 reviews previous studies related to this field. Section 3 introduces the deep learning techniques used for categorising tomato ripeness, and Section 4 presents the results of the classification models on the proposed dataset. Finally, Section 5 summarises the findings and future implications of this study.

2. Related studies

Tomato quality and maturity are crucial in marketability, shelf life, and overall consumer acceptance. Traditional methods for assessing these factors often involve destructive testing, which is time-consuming, labour-intensive, and results in product loss. Researchers have explored non-destructive techniques for evaluating tomato maturity and quality factors during ripening and storage to overcome these challenges.

Hyperspectral imaging and visible and near-infrared (VIS/NIR) spectroscopy have proven effective techniques for assessing various quality factors in tomatoes during storage and ripening. Mollazade et al. [26] predicted quality factors of tomato fruit during storage using backscattering and multispectral imaging techniques, employing a novel strategy that utilised wavelengths identified by the fusion of genetic algorithm and t-Score techniques. Similarly, Dai et al. [5] utilised hyperspectral imaging to identify maturity stages in tomatoes and quantify lycopene content. In the domain of VIS/NIR spectroscopy, Lu et al. [27] accurately estimated tomato ripeness, highlighting the effectiveness of this method in evaluating fruit maturity. Saad et al. [28] monitored the physicochemical qualities of intact tomatoes over a 12-day storage period using VIS/NIR spectroscopy in the range of 350–1050 nanometers, while Akpolat et al. [29] developed a rapid technique using handheld Raman spectroscopy and chemometric methods for profiling and quantifying major carotenoids in tomatoes. Zhang et al. [30] evaluated tomato soluble solids content, an indicator of maturity, using VIS/NIR spectroscopy across different stages. Their study utilised 168 tomato samples at various ripeness levels, employing two systems that measured wavelength ranges of 500–930 nanometers and 900–1400 nanometers. Huang et al. [6] demonstrated that spatially resolved spectroscopy could assess maturity in different layers of the tomato pericarp, while Huang et al. [31] specifically focused on predicting tomato fruit firmness using visible and near-infrared spectroscopy. These studies collectively show the ability of hyperspectral imaging and spectroscopic techniques to predict and monitor various tomato quality parameters throughout the storage and ripening stages. Beyond hyperspectral imaging and spectroscopy, other optical techniques have shown promise for nondestructive tomato analysis.
For example, Pieczywek et al. [32] tested biospeckle methods against spectroscopy for predicting maturity indices like firmness and pigment content in tomatoes. Damayanti et al. [4] designed an RGB imaging system using a TCS3200 colour sensor to predict tomato maturity and vitamin C content automatically. Nurulhuda et al. [33] monitored fluorescence characteristics on tomato skin and flesh during over-ripening.

Several studies have developed fuzzy logic systems for agricultural modelling and prediction tasks. Chen et al. [34] used a fuzzy optimisation strategy to improve the prediction of fruit maturity using visible-NIR spectroscopy. Huang et al. [6] proposed a fuzzy Mask R-CNN model to identify the ripeness levels of cherry tomatoes automatically. Villaseñor-Aguilar et al. [35] compared fuzzy logic and neural networks for classifying bell pepper ripeness levels from images. Beyond fuzzy logic, some studies have combined fuzzy methods with deep learning for agriculture. Villaseñor-Aguilar et al. [36] developed a low-cost sensor correlating tomato lycopene content to epicarp colour using fuzzy logic. Hsieh et al. [37] used computer vision and fuzzy logic techniques to identify mature green tomatoes and predict fruit size. Soft computing techniques, such as fuzzy logic and genetic algorithms, are increasingly being employed in various health monitoring sectors to address complex challenges and enhance the robustness and accuracy of healthcare systems. Umoh et al. [38] underscored the integration of interval type-2 fuzzy logic (IT2FL) with the flower pollination algorithm (FPA) in healthcare systems, proposing a hybrid intelligent telemedical monitoring and predictive system. This model enhances robustness in handling healthcare data, overcoming the limitations inherent in individual methods. Kadam et al. [39] explored cardiac arrhythmia classification through a two-stage model that combines Genetic Algorithm feature selection and an ensemble of Support Vector Machines with bagging. This approach not only optimises the feature subset but also improves classification quality through bootstrap aggregating.

Deep learning techniques have shown promise for various agricultural applications such as plant disease detection, fruit quality assessment, and crop growth monitoring. For plant disease detection, Prabhakar et al. [40] used a deep Residual Network (ResNet101) to assess the severity of early blight disease in tomato leaves. Juyal and Sharma [41] proposed using Mask R-CNN to identify and segment disease-infected regions in tomato leaves accurately. Elleuch et al. [42] introduced a deep learning-based diagnostic method for identifying plant diseases caused by nutrient deficiencies, utilising transfer learning architectures like VGG-16 and ResNet. This approach aims to surpass the detection accuracy of existing methods by analysing leaf images. Some studies have focused on monitoring crop growth for robotic harvesting. Seo et al. [43] developed a real-time tomato monitoring system to detect ripe fruits in greenhouses using Faster R-CNN. Tej et al. [44] compared ResNet152 and ResNet50 with and without data augmentation to recognise tomato and pepper diseases. Islam et al. [45] proposed an attention-based dilated CNN logistic regression (ADCLR) model for tomato leaf disease detection. Beyond disease and growth monitoring, some work has assessed fruit maturity and quality. Aherwadi et al. [46] aimed to identify the best deep learning algorithms for non-destructive prediction of fruit maturity and quality to determine shelf life. Bouni et al. [19] used CNNs and transfer learning for tomato leaf disease identification. Nahak et al. [47] evaluated diseases present in the plant utilising CNN and LabVIEW software. Nennuri et al. [48] proposed a multi-stage deep learning technique for different crop and disease identification. Finally, for quality assessment of stored produce, Stasenko et al. [49] synthesised and segmented visible/near-infrared spectral imaging data to predict decay and fungal zones in stored apples using generative adversarial networks and CNNs. Furthermore, the study by Li et al. [50] utilised MHSA-YOLOv8 for tomato maturity detection and counting, achieving mAP scores of 0.864 and 0.916, respectively. ViT also plays a pivotal role in image classification. Kadri et al. [51] addressed Alzheimer’s disease classification using a hybrid approach that combines CNNs and ViT in a Cross ViT Wide Residual Squeeze-and-Excitation Network. They also introduced a self-attention progressive generative adversarial neural network for data augmentation, tackling the challenge of limited data in neuroimaging analysis. Deep learning has shown significant promise for various agricultural applications, with continued research improving disease detection, growth monitoring, maturity and quality prediction, and post-harvest quality assessment.

In recent years, various techniques have been employed to predict tomato crop maturity, including hyperspectral imaging, visible and near-infrared (VIS/NIR) spectroscopy, and CNNs. While these methods have shown promising results, the application of ViT for tomato maturity prediction remains relatively unexplored. The proposed study aims to bridge the gap in the existing literature by conducting a comprehensive evaluation of the ViT model compared to various transfer learning-based CNN approaches for tomato maturity prediction.

3. Methodology

This study introduces a sequential process for classifying the maturity stages of tomatoes, depicted in Fig. 1. The process involves taking images from the tomato dataset and applying image pre-processing techniques before training and validating the deep learning models. Subsequently, rigorous testing is carried out to ensure precise classification of tomato maturity stages.

Figure 1.

Flowchart for tomato maturity stage classification.

3.1 Dataset description

The dataset utilised in this research contains 4353 images of tomatoes at three stages: immature, partially mature and mature. The training set contains 3120 images, and 780 images were assigned to the validation set, which is used for model optimisation and hyperparameter tuning during training. The validation images allow model performance to be monitored on data that is not used for weight updates. Finally, 453 unseen tomato images constituted the test set for final model testing and accuracy quantification. Rigorous benchmarking on the test data checks whether the machine learning approach can precisely classify the maturity stage of tomato crops. Strategic data preparation techniques, such as pre-processing, resizing and splitting among train, validation and test sets, are crucial for creating standardised and unbiased image datasets.
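The subset sizes quoted above can be checked with a short script. The sketch below is only illustrative: the helper name `split_dataset`, the fixed seed and the use of a plain random shuffle are assumptions, since the text does not state how images were assigned to each subset.

```python
import random

def split_dataset(items, n_train, n_val, n_test, seed=0):
    """Shuffle items and cut them into train/validation/test subsets.

    Hypothetical helper: the assignment procedure is not described in the
    paper, so a seeded random shuffle is assumed here.
    """
    if n_train + n_val + n_test != len(items):
        raise ValueError("subset sizes must sum to the dataset size")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Placeholder identifiers standing in for the 4353 tomato images.
image_ids = list(range(4353))
train, val, test = split_dataset(image_ids, 3120, 780, 453)

total = len(image_ids)
for name, subset in (("train", train), ("validation", val), ("test", test)):
    print(f"{name}: {len(subset)} images ({100 * len(subset) / total:.1f}% of the dataset)")
```

The three subsets sum to the full 4353 images, and the training and validation sets stand in a 4:1 ratio (3120:780).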

3.2 ViT model architecture

ViT has emerged as a promising architecture for computer vision tasks. This study customises a ViT model for automated tomato maturity stage prediction. ViT applies the transformer architecture commonly used in natural language processing, shown in Fig. 2, directly to image classification. The ViT model first splits tomato images into small patches, treating each as a visual word. The sequence of image patches passes through multiple encoder blocks containing multi-headed self-attention layers to model global dependencies between patches. After the sequence of encoder self-attention operations, the model applies a multilayer perceptron (MLP) head block for final classification into maturity categories. Within each encoder block, a layer normalisation step first normalises image patch inputs.

Figure 2.

ViT model architecture.

The normalised inputs enter a multi-headed self-attention layer to relate different spatial regions, capturing visual relationships valuable for assessing ripeness based on colour, firmness, and shape changes over time. Residual skip connections around each attention layer help convergence during backpropagation. The specific model parameters, such as the number of encoder blocks, attention heads, hidden layer dimensions, and embedded image patch resolution, can be optimised via cross-validation on the tomato image dataset. Additionally, regularisation techniques help to limit overfitting and improve generalisation across diverse tomato varieties, lighting conditions, and growth levels. The ViT architecture’s strength in implicitly learning global interdependencies between local pattern representations in images can complement convolutional approaches. This ability to relate distant spatial regions makes the transformer-based architecture a promising basis for robust computer vision across diverse settings.
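The patch-sequence and self-attention steps can be illustrated with a minimal, framework-free sketch. It is not the model used in this study: the 224 $\times$ 224 input and 16 $\times$ 16 patch size are assumed values (the paper does not report them), and the attention function uses a single head with identity query, key and value projections in place of learned weights.

```python
import math

def num_patches(height, width, patch):
    """Number of non-overlapping patch x patch tiles in an image."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch) * (width // patch)

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product attention with identity projections.

    Every token attends to every other token, so the output for one patch
    mixes information from all patches in the image.
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    outputs, weights = [], []
    for query in tokens:
        scores = [sum(a * b for a, b in zip(query, key)) / scale for key in tokens]
        attn = softmax(scores)
        weights.append(attn)
        outputs.append([sum(w * value[i] for w, value in zip(attn, tokens))
                        for i in range(dim)])
    return outputs, weights

# Assumed sizes: a 224 x 224 RGB image cut into 16 x 16 patches.
n = num_patches(224, 224, 16)
patch_dim = 16 * 16 * 3
print(n, patch_dim)  # 196 patches, each flattened to 768 values

# Toy sequence of four 2-dimensional patch embeddings (made-up values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
mixed, attn = self_attention(tokens)
```

Each row of `attn` sums to one and spans all patches, which is the sense in which self-attention captures global rather than local dependencies.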

3.3 CNN model architectures

A deep convolutional neural network (DCNN) architecture (Fig. 3) for automated assessment of agricultural products typically contains input and output layers alongside multiple hidden layers of convolutional filters, pooling operations, nonlinear activations, and fully connected mappings that enable representation learning. Through hierarchical feature extraction, CNN models can capture visual qualities indicative of internal ripeness and projected shelf life, enabling classification from image datasets.

Figure 3.

Basic CNN model architecture.

Low-level convolutional layers detect basic edges, colours, and textures, while deeper layers integrate these into higher-level features correlating with maturity timelines. Various established CNNs leverage these translation-invariant local receptive fields to achieve state-of-the-art performance on similar perception tasks. For this research, established architectures such as EfficientNetB1, EfficientNetB5, EfficientNetB7, InceptionV3, ResNet50 and VGG16 are evaluated as baseline models for non-destructive tomato quality assessment.
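The way a low-level convolutional layer responds to an edge can be sketched on a toy image. The example below is illustrative only: the 6 $\times$ 6 image and the hand-written vertical-edge kernel are made up, whereas in a trained CNN the kernel weights are learned from data.

```python
def conv2d(image, kernel):
    """Valid (no padding) 2D cross-correlation with stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

def relu(fmap):
    """Element-wise rectified linear activation."""
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool2x2(fmap):
    """2 x 2 max pooling with stride 2 (a trailing odd row or column is dropped)."""
    return [[max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
             for c in range(0, len(fmap[0]) - 1, 2)]
            for r in range(0, len(fmap) - 1, 2)]

# Toy 6 x 6 single-channel image: dark left half, bright right half.
image = [[0.0, 0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(6)]

# Hand-written vertical-edge filter; a trained network would learn such weights.
edge_kernel = [[-1.0, 0.0, 1.0],
               [-1.0, 0.0, 1.0],
               [-1.0, 0.0, 1.0]]

response = conv2d(image, edge_kernel)          # 4 x 4 map, non-zero only near the edge
features = max_pool2x2(relu(response))         # 2 x 2 map after activation and pooling
print(features)
```

The filter fires only where the window straddles the dark-to-bright boundary, and pooling keeps that response while halving the spatial size.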

3.3.1 EfficientNetB1

EfficientNetB1 [52] is a state-of-the-art CNN architecture designed for high performance and efficiency. It features an input image dimension of 240 $\times$ 240 and uses compound scaling to uniformly scale up the network depth, width, and resolution with a set of fixed scaling coefficients. The base architecture comprises multiple inverted bottleneck residual blocks called MBConv. These MBConv blocks employ depthwise separable convolutions and squeeze-and-excitation optimisation to maximise accuracy and efficiency. As the input progresses through the network, the number of channels increases while the spatial dimensions decrease, allowing the network to capture low-level features and high-level semantic information. EfficientNetB1 has 7.8 million parameters. The compound scaling and advanced MBConv blocks provide an optimal balance between accuracy and computational efficiency.
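Compound scaling ties depth, width and resolution to a single coefficient $\phi$: depth grows as $\alpha^{\phi}$, width as $\beta^{\phi}$ and resolution as $\gamma^{\phi}$. The sketch below assumes the base values $\alpha = 1.2$, $\beta = 1.1$ and $\gamma = 1.15$ reported for the EfficientNet baseline in [52]; the function names are hypothetical, and the released B1 to B7 variants use rounded settings, so the printed multipliers are indicative rather than exact layer or pixel counts.

```python
# Base coefficients for depth, width and resolution (assumed from [52]).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

def flops_factor(phi):
    """Approximate growth in FLOPs: depth * width^2 * resolution^2."""
    depth, width, resolution = compound_scale(phi)
    return depth * width ** 2 * resolution ** 2

for phi in range(4):
    depth, width, resolution = compound_scale(phi)
    print(f"phi={phi}: depth x{depth:.2f}, width x{width:.2f}, "
          f"resolution x{resolution:.2f}, FLOPs x{flops_factor(phi):.2f}")
```

Because $\alpha \cdot \beta^{2} \cdot \gamma^{2}$ is close to 2 under these values, each unit increase in $\phi$ roughly doubles the computational cost.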

3.3.2 EfficientNetB5

EfficientNetB5 [52] is a scaled-up variant of EfficientNetB4 that utilises compound scaling to uniformly increase the depth, width, and resolution of the network architecture for improved accuracy. It features an increased input image dimension of 456 $\times$ 456 compared to 240 $\times$ 240 for EfficientNetB1. With compound scaling, the width is expanded using more channels in the MBConv blocks, and spatial resolution is increased. The base building block remains the MBConv, with the same combination of inverted bottleneck convolution, depthwise separable convolution, and squeeze-and-excitation optimisation. As the input image progresses through the network, channel capacity progressively increases while spatial dimensions are decreased through strided convolutions, enabling more semantic feature representations. EfficientNetB5 contains 30 million parameters. The compound scaling balances accuracy gains from increased depth, width and resolution with efficiency from the depthwise separable convolutions in MBConv blocks.

3.3.3 EfficientNetB7

EfficientNetB7 [52] is the most advanced model in the EfficientNet family of CNN architectures, achieving state-of-the-art performance and efficiency through extreme compound scaling. It uses an ultra-high-resolution input size of 600 $\times$ 600, which provides an opportunity for the model to learn more granular spatial feature representations. With the scaled coefficients from compound scaling, the width is also expanded significantly relative to EfficientNetB1 and EfficientNetB5 by using substantially more channels in the inverted bottleneck MBConv blocks in the later stages. As the spatial size is reduced through strided convolutions down the network backbone, the number of channels is progressively expanded. This widening of channels enables the capture of richer semantic representations and intricate interdependencies between spatial features in the data. EfficientNetB7 has a very high capacity with 66 million parameters. The ultra-wide design and high-resolution training empower EfficientNetB7 to achieve state-of-the-art accuracy on computer vision benchmarks.

3.3.4 InceptionV3

InceptionV3 [53] is a CNN architecture designed for efficient computation and high performance in computer vision tasks. It is built around Inception modules, which consist of parallel convolutional filters of different sizes (1 $\times$ 1, 3 $\times$ 3, 5 $\times$ 5) that capture spatial context and cross-channel correlations. The outputs of these filters are concatenated into a single output vector. The architecture also employs 1 $\times$ 1 convolutions for dimensionality reduction, batch normalisation for higher learning rates, and auxiliary classifiers to combat overfitting. InceptionV3’s design allows deep iterations without overfitting while minimising computational cost, making it suitable for resource-constrained environments.
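The benefit of 1 $\times$ 1 convolutions for dimensionality reduction can be made concrete by counting weights. The channel sizes below are illustrative assumptions and are not taken from the InceptionV3 configuration used in this study; biases are ignored.

```python
def conv_weights(kernel, in_channels, out_channels):
    """Weight count of a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * in_channels * out_channels

# Illustrative channel sizes (assumptions for this sketch).
in_ch, reduced_ch, out_ch = 192, 16, 32

# 5 x 5 convolution applied directly to the 192-channel input.
direct = conv_weights(5, in_ch, out_ch)

# 1 x 1 reduction to 16 channels followed by the 5 x 5 convolution.
bottleneck = conv_weights(1, in_ch, reduced_ch) + conv_weights(5, reduced_ch, out_ch)

print(direct, bottleneck)  # 153600 15872

# Parallel branch outputs are concatenated along the channel axis,
# so the module's output depth is the sum of the branch depths.
branch_channels = [64, 128, 32, 32]  # hypothetical branch widths
print(sum(branch_channels))  # 256
```

Under these assumed sizes the reduced branch needs roughly a tenth of the weights of the direct 5 $\times$ 5 convolution, which is how the module keeps computational cost low.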

3.3.5 ResNet50

ResNet50 [54] is a CNN architecture that utilises residual learning to enable the training of very deep models without accuracy degradation. The core component is the residual block, which learns a residual mapping using stacked 3 $\times$ 3 convolutional layers with batch normalisation and ReLU activations. Shortcut connections between the block inputs and outputs allow information to skip layers: the block input is added to the learned residual. The architecture begins with a 7 $\times$ 7 convolution and max pooling, followed by four sections of residual blocks with intermittent downsampling, ultimately flattening features for a fully connected layer and softmax classifier. This deep topology leverages feature reuse along shortcuts to ease optimisation and expand representation capacity, allowing networks of over 150 layers to be trained without degradation and delivering record-breaking image classification performance.
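The shortcut connection can be reduced to a single line: the block output is the learned transformation plus the unchanged input. The sketch below uses plain lists and two made-up stand-in functions instead of real convolutional layers, so it illustrates the idea only.

```python
def residual_block(x, transform):
    """Return transform(x) + x element by element (identity shortcut)."""
    fx = transform(x)
    if len(fx) != len(x):
        raise ValueError("an identity shortcut requires matching shapes")
    return [f + xi for f, xi in zip(fx, x)]

def zero_transform(x):
    """Stand-in for convolutional layers whose weights are all zero."""
    return [0.0 for _ in x]

def small_transform(x):
    """Stand-in for learned layers: a small ReLU-activated correction."""
    return [max(0.0, 0.1 * v) for v in x]

features = [1.0, -2.0, 3.0]

# With a zero residual branch the block reduces to the identity mapping,
# so adding such a block cannot make the representation worse.
identity_out = residual_block(features, zero_transform)

# Stacking blocks refines the input step by step instead of replacing it.
out = features
for _ in range(3):
    out = residual_block(out, small_transform)
print(identity_out, out)
```

Because a block only has to learn a correction to its input, very deep stacks remain easy to optimise.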

3.3.6 VGG16

VGG16 [55] is a CNN architecture consisting of a stack of convolutional layers arranged in blocks, each containing multiple convolutional layers using small 3 $\times$ 3 filters. This design allows the network to learn a hierarchy of features, from low-level edges to high-level semantics, while reducing the number of parameters. The spatial dimensions are gradually reduced through max-pooling while the number of feature maps increases. The extracted features are then flattened and fed into fully connected layers for high-level reasoning and classification, culminating in a softmax layer for class prediction. VGG16’s deep and wide structure, with a total of 16 layers, provides it with a remarkable capacity to learn complex visual representations and has served as a foundation for many subsequent CNN architectures.
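The parameter saving from small filters follows from simple counting: a stack of 3 $\times$ 3 convolutions covers the same receptive field as one larger filter while using fewer weights. The sketch below assumes equal input and output channel counts, ignores biases, and uses an illustrative width of 64 channels.

```python
def stacked_3x3(num_layers, channels):
    """Receptive field and weight count of num_layers stacked 3 x 3 convolutions."""
    receptive_field = 1 + 2 * num_layers
    weights = num_layers * 3 * 3 * channels * channels
    return receptive_field, weights

def single_conv(kernel, channels):
    """Weight count of one kernel x kernel convolution."""
    return kernel * kernel * channels * channels

channels = 64  # illustrative channel width
for layers in (2, 3):
    field, stacked = stacked_3x3(layers, channels)
    single = single_conv(field, channels)
    print(f"{layers} stacked 3x3 layers: field {field}x{field}, "
          f"{stacked} weights vs {single} for a single {field}x{field} layer")
```

Two stacked layers need 18 weights per channel pair against 25 for a single 5 $\times$ 5 filter, and three need 27 against 49 for a 7 $\times$ 7 filter, with an extra nonlinearity between the layers.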

3.4 Fine-tuning and performance metrics

The pre-trained deep learning models can be fine-tuned to detect the maturity levels in the tomato dataset. The initial convolutional layers that extract generic visual features are frozen, while the later higher-level feature layers are unfrozen and trained at a low learning rate on tomato images and their maturity labels to learn domain-specific patterns. The model performance will be evaluated using accuracy (Eq. (1)), precision (Eq. (2)), recall (Eq. (3)), F1-score (Eq. (4)), and area under the receiver operating characteristic curve (AUROC). Confusion matrices will be a key tool to assess model skill at tomato maturity stage classification, with rows corresponding to actual maturity labels and columns indicating predicted ripeness levels. The goal is to optimise the classifier for higher precision and better generalisation.

$$\text{Accuracy}=\frac{\text{TP}+\text{TN}}{\text{TP}+\text{TN}+\text{FP}+\text{FN}} \tag{1}$$

$$\text{Precision}=\frac{\text{TP}}{\text{TP}+\text{FP}} \tag{2}$$

$$\text{Recall}=\frac{\text{TP}}{\text{TP}+\text{FN}} \tag{3}$$

$$\text{F1 score}=\frac{2(\text{Precision}\times\text{Recall})}{\text{Precision}+\text{Recall}} \tag{4}$$

where TP represents true positives, TN indicates true negatives, FP denotes false positives and FN represents false negatives.
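For a three-class problem, these quantities are usually computed per class in a one-vs-rest fashion from the confusion matrix, with overall accuracy taken as the share of correct predictions. The sketch below follows that convention, which is an assumption about how Eqs. (1) to (4) were applied here; the confusion matrix is made up for illustration and is not a result from this study.

```python
def per_class_metrics(cm, k):
    """One-vs-rest precision, recall and F1 for class index k.

    cm[i][j] counts samples whose true class is i and predicted class is j.
    """
    tp = cm[k][k]
    fp = sum(cm[i][k] for i in range(len(cm))) - tp
    fn = sum(cm[k]) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def overall_accuracy(cm):
    """Fraction of samples on the diagonal of the confusion matrix."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

# Hypothetical confusion matrix (rows: actual, columns: predicted).
labels = ["immature", "mature", "partially mature"]
cm = [[48, 0, 2],
      [0, 45, 5],
      [3, 4, 43]]

print(f"accuracy: {overall_accuracy(cm):.4f}")
for k, name in enumerate(labels):
    precision, recall, f1 = per_class_metrics(cm, k)
    print(f"{name}: precision={precision:.4f}, recall={recall:.4f}, f1={f1:.4f}")
```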

4. Results and discussion

This research demonstrates tomato maturity stage classification utilising the ViT model and deep CNNs. These models were trained on an open-source dataset [56] comprising 4353 tomato images. This dataset was split into training, validation and testing sets, consisting of 3120, 780 and 453 images, respectively. Several performance metrics were used to assess the efficacy of the ViT and CNN models, including precision, recall, Area Under the Curve (AUC) scores from the receiver operating characteristic (ROC) curve, confusion matrices and test accuracies. PyTorch and Keras libraries were utilised to develop the ViT and CNN-based deep learning models, respectively. The training was performed on an NVIDIA GeForce GTX 1070 Ti GPU for 50 epochs. The ViT and CNN models were trained on the field-captured tomato image dataset to categorise the maturity stage into three classes: immature, partially mature and mature tomatoes.

4.1 Performance evaluation of ViT model

The ViT model was trained for 50 epochs on the newly created tomato dataset. The training accuracy (Fig. 4a) increased from an initial value of 0.43 to 0.94 within five epochs and then rose gradually to a final value of 0.9916 (99.16%), indicating the model’s improving ability to classify maturity classes in the training data correctly over time. Meanwhile, the validation accuracy curve (Fig. 4a) displayed a similar upward trajectory: it started at 0.69, reached 0.91 within five epochs, then gradually increased to 0.9875 (98.75%) and plateaued close to the training curve. This demonstrates that, while the model continued improving on the training data past epoch 5, it generalised without overfitting or underfitting. The training loss (Fig. 4b) decreased from an initial value of 1.2 to a final loss of 0.0831, indicating an improving ability of the model to distinguish between the maturity classes. Validation was carried out on held-out images after each training epoch. The validation loss curve (Fig. 4b) started at 0.9, fell to 0.12 within ten epochs, and then declined slowly towards its minimum at epoch 50, where it plateaued. Based on the loss and accuracy curves, training was halted at epoch 50 to deploy the best iteration of the model just before overfitting effects. The model parameter state at the optimal checkpoint will be saved for tomato maturity detection on new datasets.

Figure 4.

Accuracy and loss distribution of ViT model.

Figure 5.

Confusion matrix and ROC curve of ViT model.

The confusion matrix (Fig. 5a) was constructed on a test set of 453 tomato images classified into immature, partially mature and mature ripeness levels. The model achieved an overall accuracy of 98.67%, correctly predicting 153 out of 153 immature tomatoes, 154 out of 156 mature tomatoes, and 140 out of 144 partially mature tomatoes. This reliable identification capability is evidenced by the high diagonal counts of the confusion matrix, complemented by robust precision and recall metrics for all ripeness categories. The high precision and recall for all classes confirm the model’s capability in classifying tomato maturity from image samples. The ROC curves (Fig. 5b) show an AUC of 1 for the immature and partially mature classes, indicating perfect separation of these classes on the test set. Similarly, for the mature class, the AUC of 0.9996 is very high, illustrating the model’s strong capability to distinguish the mature category accurately.
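The reported overall accuracy can be reproduced from the per-class counts given above. The short check below uses only those counts; the off-diagonal entries of the confusion matrix are not given in the text, so per-class precision is not recomputed.

```python
# (correct predictions, class total) as reported for the ViT test run.
reported = {
    "immature": (153, 153),
    "mature": (154, 156),
    "partially mature": (140, 144),
}

correct = sum(c for c, _ in reported.values())
total = sum(t for _, t in reported.values())
accuracy = correct / total
per_class_recall = {name: c / t for name, (c, t) in reported.items()}

print(correct, total)  # 447 453
print(f"overall accuracy: {accuracy:.5f}")
for name, recall in per_class_recall.items():
    print(f"{name}: recall = {recall:.4f}")
```

The ratio 447/453 is approximately 0.98675, which agrees with the reported 98.67% up to rounding of the last digit.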

4.2 Performance evaluation of CNN models

This section presents a quantitative evaluation of the CNN models EfficientNetB1, EfficientNetB5, EfficientNetB7, InceptionV3, ResNet50, and VGG16 for classifying tomato ripeness from images. The models were trained for 50 epochs and are compared in terms of validation accuracy. Including the ViT model for reference, validation accuracy after 50 epochs ranged from 91.02% to 98.75%. In descending order, the values were 98.75% for ViT, 95.89% for ResNet50, 94.23% for VGG16, 93.3% for InceptionV3, 93.07% for EfficientNetB1, 92.56% for EfficientNetB5, and 91.02% for EfficientNetB7.

Figure 6.

Accuracy and loss distribution of CNN models in the training process.

Figure 7.

Accuracy and loss distribution of CNN models in the validation process.

4.2.1 EfficientNetB1

The EfficientNetB1 model steadily improves its training and validation metrics over 50 epochs. Starting from 0.73, training accuracy (Fig. 6a) climbs gradually to 0.99, while the training loss (Fig. 6b) drops from 0.94 to 0.15. Similarly, validation accuracy (Fig. 7a) rises steadily from an initial 0.86 to 0.9307, while the validation loss (Fig. 7b) decreases from 0.61 to 0.47. Taken together, these curves show the model learning patterns from the training data and transferring those gains to improved predictions on the unseen validation data.

Figure 8.

Confusion matrix and ROC curve of EfficientNetB1 model.

The confusion matrix in Fig. 8(a) displays the model’s categorisation across the three maturity classes: Immature, Mature, and Partially Mature. The model identifies the Mature class best, with an AUC of 0.9978 (Fig. 8b) reflecting a high true positive rate and few false positives. It also reliably distinguishes Immature instances, with an AUC of 0.9845 (Fig. 8b). However, some overlap exists between the Immature and Partially Mature classes, seen in the larger number of errors between the pair and a comparatively lower AUC of 0.9667 (Fig. 8b) for Partially Mature. The model reached a testing accuracy of 88.52%, indicating that it categorises maturity levels reasonably well from the available training data.
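The per-class AUC values quoted in this section can be read as one-vs-rest scores: each class in turn is treated as the positive class and the other two as negative. The sketch below illustrates that computation in plain Python, using the rank interpretation of AUC (the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one). The one-vs-rest scheme is an assumption, since the text does not state how the ROC curves were produced, and the toy labels and probabilities are hypothetical values, not results from the study.

```python
def binary_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(y_true, probs, classes):
    """Compute one AUC per class, treating that class as positive and the rest as negative."""
    result = {}
    for k, name in enumerate(classes):
        labels = [1 if y == name else 0 for y in y_true]
        scores = [row[k] for row in probs]
        result[name] = binary_auc(labels, scores)
    return result


CLASSES = ["immature", "mature", "partially mature"]

# Hypothetical toy data: true labels and predicted class probabilities (one row per image).
y_true = ["immature", "immature", "mature", "mature",
          "partially mature", "partially mature"]
probs = [
    [0.80, 0.05, 0.15],
    [0.50, 0.10, 0.40],
    [0.05, 0.90, 0.05],
    [0.10, 0.70, 0.20],
    [0.30, 0.10, 0.60],
    [0.55, 0.05, 0.40],
]

aucs = one_vs_rest_auc(y_true, probs, CLASSES)
for name, value in aucs.items():
    print(f"AUC[{name}] = {value:.4f}")
```

In this toy example the mature class is perfectly separated, while the immature and partially mature classes overlap, which mirrors the qualitative pattern described for the CNN models.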

4.2.2 EfficientNetB5

The EfficientNetB5 model demonstrates stable learning throughout training. Its initial training (Fig. 6a) and validation (Fig. 7a) accuracies are around 0.664 and 0.814, respectively. Rapid gains in the first ten epochs raise training accuracy to 0.97 and validation accuracy above 0.908. Both continue to improve gradually through the 50 epochs, ending at peak values of 0.986 for training accuracy and 0.9256 for validation accuracy. The training (Fig. 6b) and validation (Fig. 7b) loss curves decrease steadily, indicating that the model is learning effectively. The confusion matrix (Fig. 9a) shows reliable classification, with most examples assigned to their true classes, as seen in the strong diagonal. The ROC curves (Fig. 9b) also show good separation of the three classes, with AUCs of 0.9879, 0.986, and 0.9693 for immature, mature and partially mature, respectively. The model achieved 89.84% testing accuracy on unseen data.

Figure 9.

Confusion matrix and ROC curve of EfficientNetB5 model.

4.2.3 EfficientNetB7

The EfficientNetB7 model starts at a training accuracy of 0.64 (Fig. 6a) and gains rapidly to a final value of 0.981, while the training loss (Fig. 6b) descends smoothly from 1.11 to 0.156, indicating more gradual refinement. Validation accuracy (Fig. 7a) reaches 0.91 with a final validation loss of 0.38 (Fig. 7b), and the model achieved 91.16% testing accuracy. The confusion matrix (Fig. 10a) displays reliable classification with a strong diagonal and indicates minimal confusion between the partially mature and mature classes. The AUCs (Fig. 10b) of 0.9886, 0.9965, and 0.9708 for immature, mature and partially mature, respectively, show that the model separates all three classes well.

Figure 10.

Confusion matrix and ROC curve of EfficientNetB7 model.

4.2.4 InceptionV3

The training accuracy (Fig. 6a) of the InceptionV3 model rose from 0.72 to 0.97, showing that the model fitted the training data well. Over the same period, the training loss (Fig. 6b) fell sharply from 0.96 to 0.24 within the first 20 epochs before gradually descending to 0.14. The validation phase also started strongly, with accuracy rising from 0.86 to 0.926 over the initial 30 epochs. The metrics then plateaued, reaching a maximum accuracy of 0.9333 (Fig. 7a) and a corresponding loss of 0.422 (Fig. 7b). The confusion matrix (Fig. 11a) shows generally accurate categorisation, although the partially mature class introduced some confusion, resulting in an overall testing accuracy of 90.94%. The ROC analysis (Fig. 11b) shows that the model separates the classes effectively, with an average AUC of around 0.9873.

Figure 11.

Confusion matrix and ROC curve of InceptionV3 model.

4.2.5 ResNet50

The ResNet50 model exhibits rapid initial learning, with training accuracy (Fig. 6a) increasing from 0.82 and loss (Fig. 6b) decreasing to 0.83. Accuracy gains slow from 0.96 onwards, while loss reduction decelerates from 0.43. The validation phase reinforces this trend, with accuracy starting at 0.91 and loss at 0.65. After 50 epochs, validation accuracy (Fig. 7a) and loss (Fig. 7b) reached 0.95 and 0.34, respectively.

Figure 12.

Confusion matrix and ROC curve of ResNet50 model.

Figure 13.

Confusion matrix and ROC curve of VGG16 model.

The confusion matrix (Fig. 12a) shows a strong diagonal and few off-diagonal errors. The ROC analysis (Fig. 12b) shows strong class separation, with AUCs of 0.9998, 0.9974, and 0.9938 for immature, mature and partially mature, respectively, slightly below those of the ViT model. The ResNet50 model attained a testing accuracy of 93.15%, the highest among the CNN models, which indicates promise in its ability to classify maturity levels within a tomato dataset.

4.2.6 VGG16

The training process of the VGG16 model was unstable and showed pronounced fluctuations. Training accuracy (Fig. 6a) started at 0.726, improved to 0.954 around epoch 10, and eventually reached 0.998. The training loss (Fig. 6b) started at 3.35, the highest initial loss among all models, dropped sharply to 0.69 within five epochs, and ended at 0.18 after 50 epochs. The validation process (Fig. 7), which evaluates the model on held-out data, was even more unstable, with significant fluctuations in both accuracy and loss. Validation accuracy (Fig. 7a) began at 0.89, dipped, recovered, and then dropped again before reaching 0.94. The validation loss (Fig. 7b) was erratic and ended at 0.46. These fluctuations suggest that the model’s learning was not steady and that its performance varied considerably between epochs. Despite this, the model achieved a testing accuracy of 92.27%.

The model performs three-class classification (Fig. 13a), separating Immature, Mature, and Partially Mature examples. Performance is highest for the Mature class, with an AUC (Fig. 13b) of 0.9992, reflecting a near-perfect true positive rate with minimal false positives. The classifier also reliably distinguishes Immature items, with an AUC of 0.9885. However, the decision boundaries between the Immature and Partially Mature groups display some overlap, evident in a comparatively lower AUC of 0.9805 for Partially Mature.

Table 1
Performance of deep learning models on test data.

Model	Stages	F1 score	AUC	Recall	Precision	Overall accuracy (%)
ViT	Immature	0.99	1.00	1.00	0.97	98.67
	Mature	0.99	0.99	0.99	1.00
	Partially Mature	0.98	1.00	0.97	0.99
EfficientNetB1	Immature	0.91	0.98	0.91	0.88	88.52
	Mature	0.93	0.99	0.93	0.88
	Partially Mature	0.80	0.96	0.72	0.90
EfficientNetB5	Immature	0.93	0.98	0.94	0.92	89.84
	Mature	0.91	0.98	0.92	0.91
	Partially Mature	0.85	0.96	0.83	0.87
EfficientNetB7	Immature	0.92	0.98	0.87	0.99	91.16
	Mature	0.93	0.99	0.99	0.88
	Partially Mature	0.88	0.97	0.87	0.89
InceptionV3	Immature	0.95	0.99	0.96	0.97	90.94
	Mature	0.93	0.99	1.00	0.86
	Partially Mature	0.85	0.97	0.79	0.91
ResNet50	Immature	0.95	0.99	1.00	0.90	93.15
	Mature	0.95	0.99	0.99	0.90
	Partially Mature	0.86	0.99	0.76	0.99
VGG16	Immature	0.93	0.98	0.97	0.90	92.27
	Mature	0.97	0.99	0.99	0.95
	Partially Mature	0.88	0.98	0.83	0.95

The ViT model and six CNN architectures (EfficientNetB1, EfficientNetB5, EfficientNetB7, InceptionV3, ResNet50, and VGG16) were examined. The quantitative analysis revealed strong model performance for non-invasive tomato maturity evaluation. The ViT architecture outperformed the CNN models, achieving 98.67% testing accuracy in classifying the three ripeness stages. Table 1 also reports the F1 score, AUC, precision, and recall for each model and class. On these metrics, ViT presents an attractive solution for agricultural industry applications that must balance accuracy and efficiency.
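The F1 scores in Table 1 can be cross-checked against the precision and recall columns, since F1 is the harmonic mean of the two. The sketch below does this for the Partially Mature row of each model, the class that is hardest for the CNNs. The values are copied from Table 1. Because the table reports precision and recall rounded to two decimals, a tolerance of 0.01 is assumed when comparing the recomputed and reported F1 scores.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Partially Mature rows from Table 1: model -> (precision, recall, reported F1).
partially_mature = {
    "ViT": (0.99, 0.97, 0.98),
    "EfficientNetB1": (0.90, 0.72, 0.80),
    "EfficientNetB5": (0.87, 0.83, 0.85),
    "EfficientNetB7": (0.89, 0.87, 0.88),
    "InceptionV3": (0.91, 0.79, 0.85),
    "ResNet50": (0.99, 0.76, 0.86),
    "VGG16": (0.95, 0.83, 0.88),
}

TOLERANCE = 0.01  # assumed allowance for two-decimal rounding in the table

recomputed = {}
for model, (precision, recall, reported) in partially_mature.items():
    recomputed[model] = f1_score(precision, recall)
    consistent = abs(recomputed[model] - reported) <= TOLERANCE
    print(f"{model}: recomputed F1 = {recomputed[model]:.3f}, "
          f"reported = {reported:.2f}, consistent = {consistent}")
```

The check also makes the pattern in the table easier to see: for several CNNs the Partially Mature class combines high precision with noticeably lower recall, so its F1 score is pulled down by missed Partially Mature samples rather than by false alarms.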

5. Conclusions

This study proposes a ViT model for the non-invasive evaluation of tomato ripeness stages and shows that it surpasses CNNs. The ViT model achieved a testing accuracy of 98.67%, compared with CNN architectures such as EfficientNetB1 (88.52%), EfficientNetB5 (89.84%), EfficientNetB7 (91.16%), InceptionV3 (90.94%), ResNet50 (93.15%), and VGG16 (92.27%). The best-performing CNN, ResNet50, attained 93.15% accuracy, about 5.5 percentage points below the ViT model. The strong performance of the ViT model is attributed to its self-attention mechanism, which captures long-range dependencies and global relationships within the image more effectively than CNNs, and it highlights the value of transformer-based models for tomato maturity assessment. By combining the ability of transformer models to capture long-range dependencies in image data with transfer learning from large-scale pre-trained models, the ViT model could learn to classify the ripeness stages from limited training data. A transformer-based ripening stage classifier could therefore become an important tool for crop management in precision agriculture. Integrating it into technological platforms such as mobile applications, computer systems, and embedded devices opens up a wide range of possibilities for monitoring crop maturity throughout the agricultural supply chain. This could lead to significant reductions in food waste, higher profits, and improved decision-making by optimising harvesting time, supporting an efficient and sustainable agricultural system that benefits producers.
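The accuracy gap between ViT and each CNN can be computed directly from the reported test accuracies. The short sketch below expresses the gaps in percentage points, using only the values stated above. The dictionary and variable names are illustrative.

```python
# Reported test accuracies (%) for each model.
test_accuracy = {
    "ViT": 98.67,
    "EfficientNetB1": 88.52,
    "EfficientNetB5": 89.84,
    "EfficientNetB7": 91.16,
    "InceptionV3": 90.94,
    "ResNet50": 93.15,
    "VGG16": 92.27,
}

vit = test_accuracy["ViT"]
cnn_accuracy = {m: a for m, a in test_accuracy.items() if m != "ViT"}

# Gap in percentage points between ViT and each CNN.
gaps = {model: round(vit - acc, 2) for model, acc in cnn_accuracy.items()}

# The best CNN is the one with the highest test accuracy, hence the smallest gap.
best_cnn = max(cnn_accuracy, key=cnn_accuracy.get)

for model, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"ViT - {model}: {gap:.2f} percentage points")
print(f"best CNN: {best_cnn}")
```

The gap to the best CNN (ResNet50) is 5.52 percentage points, and the gap to the weakest CNN (EfficientNetB1) is 10.15 percentage points.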

