Abstract
Smart farming, also known as precision agriculture or digital farming, is an approach to agriculture that uses advanced technologies and data-driven techniques to optimize farming operations. One smart farming activity, fruit classification, has broad applications across agriculture, food production, health, research, and environmental conservation. Accurate and reliable fruit classification benefits stakeholders ranging from farmers and food producers to consumers and conservationists. In this study, we conduct a comparative analysis of a Convolutional Neural Network (CNN) model and four transfer learning models: VGG16, ResNet50, MobileNet-V2, and EfficientNet-B0. Each model is trained once on the Fruits360 benchmark dataset and once on a reduced version of it to study the effect of data size and image processing on fruit classification performance. On the original dataset, the models achieved accuracy scores of 95%, 93%, 99.8%, 65%, and 92.6%, respectively. Accuracy increased for three of the models when they were trained on the reduced dataset. This study provides insights into the performance of various deep learning models and dataset versions, offering guidance on model selection and data preprocessing strategies for image classification tasks.
Keywords
Introduction
The broad utility of classification algorithms across various domains and real-time applications draws attention to their usage in the agriculture domain. Intelligent farming, also known as smart farming or precision agriculture, refers to the use of advanced technology and data-driven approaches to enhance the efficiency, productivity, and sustainability of agricultural practices. It involves the integration of various technologies, such as sensors, data analytics, automation, and machine learning, to make informed decisions and optimize farming operations [1]. Intelligent farming helps farmers produce more with fewer resources while minimizing environmental impact.
Smart farming supports fruit classification in two key ways, both of which use computerized algorithms to categorize fruits accurately based on various attributes. The first is computer vision and image processing, which analyze images of fruits to identify different fruit varieties, shapes, sizes, colors, and defects. This technology is particularly useful for sorting and grading fruits based on visual characteristics. The second is machine learning and artificial intelligence algorithms trained on large datasets of fruit images and attributes. These algorithms can learn to classify fruits based on visual and spectral features, which enables automated fruit classification.
The methodology presented in this paper integrates both of these roles to detect fruits in agricultural settings. Correct fruit detection is important for several reasons. Automating the harvesting process increases harvesting efficiency and reduces labor requirements, which is particularly crucial for large-scale fruit production where manual harvesting can be time-consuming and costly. Marketability is another reason, as properly sorted and graded fruits are more marketable. Furthermore, accurate fruit detection is critical for quality control in the agricultural sector, ensuring that only high-quality fruits reach the market [2].
Image classification is a prominent task in the field of deep learning, wherein sophisticated mathematical operations are employed to analyze input images. The primary objective is to assign one or multiple labels to an image, indicating its membership in specific classes. In contrast to image classification, object detection tasks encompass not only labeling the image but also identifying the objects present within it and localizing their precise locations. This capability allows deep learning algorithms to effectively handle noisy photographs and accurately identify specific objects despite complex and cluttered backgrounds. The focus of this article centers around the classification task utilizing the Fruits-360 dataset, specifically curated to facilitate such tasks. The dataset’s diverse range of fruits and meticulous preprocessing efforts by researchers ensure a clear background and facilitate meaningful results within the scope of our investigation.
Artificial intelligence applications rely on the fundamental principle of pattern recognition to identify novel, previously unseen inputs. Similar to human cognition, these algorithms require a process of education, commonly referred to as learning. Learning can take various forms, and the present article focuses on two specific types. The first type explored in this study is supervised learning, where the classification and detection of fruits are accomplished using Convolutional Neural Networks [3]. The CNN, a technique specifically suited for visual tasks, operates by transforming input images into arrays of matrices, where each pixel or element within the image’s matrix is processed to derive meaningful information for classification purposes. These processed elements are then aggregated to generate distinct classes, serving as labels for each photograph [4]. In this paradigm, the algorithm is trained using labeled examples, whereby it learns to associate inputs with their corresponding correct classes. After exposure to a diverse range of labeled inputs, the algorithm becomes capable of accurately classifying new, unlabeled inputs.
In addition to CNNs, the implementation of various models led to the emergence of a novel learning paradigm known as Transfer Learning, which represents the second type of learning utilized in this study. Transfer learning involves training a model on a particular dataset and optimizing it until it attains an acceptable level of performance. Subsequently, the knowledge acquired through this training is transferred and applied to a different task, typically involving a new dataset that shares similarities with the previous one. This approach leverages the benefits of pre-trained models, allowing for improved efficiency and performance in handling new datasets. This research uses four models, VGG16, ResNet50, MobileNetV2, and EfficientNet-B0. Figure 1 shows a basic common description of the models.

Transfer learning.
The subsequent sections of this paper are organized as follows: Section 2 provides a review of related literature, highlighting the main contributions of this research. Section 3 presents the methodology and implementation workflow employed. Section 4 offers a detailed analysis of the employed models. Section 5 lists all the results with their analysis. Finally, Section 6 concludes the research and highlights future directions.
Incorporating artificial intelligence (AI) into the agricultural domain has demonstrated remarkable efficacy in the current technological era. Many researchers have focused on fruit detection, aiming to develop a robotic procedure that discriminates between different types of fruits. Previous studies have explored various ways to enhance model accuracy, including the application of transfer learning, variations in the number and type of images, and the influence of the number of epochs. Notably, in [5], utilizing a pre-trained model resulted in a twofold increase in classification accuracy. Furthermore, increasing the number of epochs and the quantity of data samples proved to be critical factors in improving accuracy. Conversely, including a higher number of classes in the classification scenario led to decreased accuracy, while a reduced number of classes yielded higher accuracy.
In a study by [6], a dataset comprising two categories of ripened and unripe grapes was collected. A simple convolutional neural network (CNN) was trained using HSV, morphological feature maps, and RGB inputs, and subsequently tested on a support vector machine (SVM) classifier. The CNN model achieved an accuracy of 74.49%, while the SVM model reached 69%. Another investigation focused on classifying two categories of cherries, employing an improved CNN architecture and hybrid pooling methods. The CNN model was compared to other models, including KNN, ANN, Fuzzy, and EDT. The CNN model achieved an impressive accuracy of 99% [7].
In [8], a novel model was proposed, combining the support vector machine classifier with the power of deep learning to classify 40 classes of fruits. This model was compared to six transfer learning models. The results demonstrated that the combination of SVM with deep learning, as well as the VGG16 model, outperformed GoogleNet, ResNet18, ResNet50, AlexNet, and VGG19 in terms of performance. Moreover, another study investigated the application of two transfer learning models, EfficientNet and MixNet, on a real-world dataset. The results indicated an increase in accuracy compared to other models [9].
Furthermore, a convolutional neural network model augmented with enhancements using the DenseNet-201 pre-trained model achieved an impressive accuracy of 98.58% [10]. Moreover, researchers have explored various convolutional neural network architectures for the classification of tomato diseases, with CNN consistently outperforming traditional machine learning models [11].
In [12], a dataset of Oudemansiella raphanipes, a fungus found in China, was collected and labeled into five classes. Various models including VGG16, ResNet50, InceptionV3, NasNet-Mobile, EfficientNet, and MobileNetV2 were trained on this dataset. MobileNetV2, with certain improvements, exhibited superior performance with an accuracy of 98.75%. The task of fruit quality classification was addressed in [13], utilizing a dataset comprising six classes with varying levels of quality. The ResNet50 model achieved the highest accuracy, despite employing a larger number of parameters compared to MobileNetV2 and EfficientNetB0, achieving an accuracy of over 95%.
The Fruit-360 dataset was utilized in [14] to evaluate the performance of models such as VGG16, ResNet50, InceptionV3, DenseNet, and InceptionResNetV2. InceptionResNetV2 exhibited exceptional accuracy, reaching 99%. Another study explored classification and detection tasks using different models and multiple datasets. The findings revealed that the construction of models from scratch is commonly employed in classification tasks, while modified pre-trained models are prevalent in detection tasks. Additionally, the authors recommended the collection of a substantial number of real images for real-time applications [15]. Investigations involving CNN and ResNet50 V2 models on a fruit dataset consisting of 41 classes concluded that both models exhibited resistance to overfitting. Notably, ResNet50 V2 outperformed CNN with an accuracy of 98.89% [16].
In summary, these studies collectively contribute valuable insights into the integration of AI and deep learning techniques in agricultural applications. They highlight the effectiveness of transfer learning, dataset curation, and model selection for achieving accurate and efficient fruit classification systems.
In this research, we aim to conduct a comprehensive comparison between a Convolutional Neural Network (CNN) and commonly used transfer learning models in a real-world implementation environment. Specifically, we will evaluate their performance using the fruits360 benchmark dataset, which consists of 131 classes. The inclusion of this extensive dataset poses a challenge in terms of computational time, making it an important aspect to consider.
Furthermore, this research takes into account the significance of big data in assessing the true performance of these models on large-scale datasets. By utilizing the Fruits360 dataset, we can gain insights into the models’ capabilities when dealing with substantial amounts of data. Notably, this research also pays close attention to the time factor, accurately monitoring and recording the running time of each model. To the best of our knowledge, this aspect has not been extensively explored in previous research articles, making it a unique and valuable contribution of this study.
Methodology
The main objective of this research is fruit recognition as a pre-step to reach a completely automated harvesting process. The datasets used contain not only different types of fruits but are extended to include subtypes. The proposed model incorporates artificial intelligence and machine learning for this goal using two types of learning. First is the supervised learning model using the CNN algorithm and second is transfer learning illustrated in four models, VGG16, ResNet50, MobileNet-V2, and EfficientNet-B0. This research paper presents a comprehensive examination of multiple transfer learning models, focusing on their accuracy and training time for the fruit classification task.
Datasets
The models introduced in this research are evaluated using two datasets, one of which is a reduced version of the other. Fruits360 is the name of the dataset family, which comprises ten versions, the first released in 2017 and the last in 2021 [17].
Fruits360-Original version
This is a dataset of images of different types and subtypes of fruits and vegetables. It was originally introduced in 2017 after which it became a benchmark dataset in the field of fruit classification. Images were collected using a low-speed motor to capture all sides of each fruit. A white background was added to remove noise from the image and algorithms were used to extract the fruit from the background to eliminate the effect of different lighting conditions as shown in Fig. 2. Dataset images are 100x100 pixels with 131 classes, and 90483 samples, split into 67692 samples for training, and 22688 for testing.

Sample of Fruits-360 dataset.
Fruits360-Reduced version
This reduced version is the last available version of the Fruits360 dataset, introduced in 2021. Choosing a smaller version of the dataset reduces training time and cost, which makes it easier to try the models with more parameter modifications and gives a deeper understanding of each model’s performance. The reduced version (Fruits360-v10) contains 24 classes with original-size fruit images. It comprises about 13K images that are separated into training, validation, and test sets with percentages of 50%, 25%, and 25%, respectively. Figure 3 shows the training distribution of the reduced dataset across its twenty-four classes. All images undergo a preprocessing phase in which they are resized to 100x100 pixels, since this is the smallest size that maintains good quality of the image features while keeping the training cost low.

Classes distribution of the training dataset in the Fruits360_reduced version.
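As a rough illustration of the 50%/25%/25% partition described above, the following Python sketch shuffles a list of image paths and slices it into three subsets. The function name `split_dataset`, the fixed seed, and the placeholder file names are assumptions made for illustration only; the sketch does not claim to reproduce how the published split was actually generated.

```python
import random


def split_dataset(items, train_frac=0.50, val_frac=0.25, seed=42):
    """Shuffle items and split them into train/validation/test lists.

    Whatever remains after the train and validation slices goes to the test set.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local RNG keeps the split repeatable
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Hypothetical file names standing in for roughly 13K images.
paths = [f"img_{i:05d}.jpg" for i in range(13000)]
train, val, test = split_dataset(paths)
print(len(train), len(val), len(test))  # 6500 3250 3250
```

In practice a stratified split, which preserves the class proportions in each subset, may be preferable when classes are unevenly populated.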
The five models’ implementation begins with the initial step of data augmentation, which is a technique commonly used in machine learning and computer vision tasks to artificially expand the training dataset by applying various transformations to the existing data. These transformations include rotations, translations, flips, zooms, and changes in brightness or contrast. The purpose of data augmentation is to increase the diversity and variability of the training data, which helps improve the model’s generalization and robustness [18]. In the proposed models, shear range and rotation parameters were altered. Subsequently, the workflow is divided into two distinct sections based on the types of models employed, as illustrated in Fig. 4.

Proposed approach workflow.
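To make the augmentation step more concrete, the sketch below applies a horizontal flip and a horizontal shear to a tiny grayscale image stored as a list of rows. It is a simplified, dependency-free illustration: the helper names, the zero fill value, and the `shear_range` default are assumptions, and a real pipeline would normally rely on an image library and would also cover rotation and zoom.

```python
import random


def horizontal_flip(image):
    """Mirror a 2D image (list of rows) left to right."""
    return [row[::-1] for row in image]


def shear_x(image, factor, fill=0):
    """Shift each row horizontally in proportion to its row index."""
    width = len(image[0])
    sheared = []
    for y, row in enumerate(image):
        shift = int(round(factor * y))
        new_row = [fill] * width
        for x, value in enumerate(row):
            target = x + shift
            if 0 <= target < width:  # pixels shifted outside the frame are dropped
                new_row[target] = value
        sheared.append(new_row)
    return sheared


def augment(image, rng, shear_range=0.5):
    """Return a new image with a random flip and a random shear applied."""
    out = horizontal_flip(image) if rng.random() < 0.5 else [row[:] for row in image]
    factor = rng.uniform(-shear_range, shear_range)
    return shear_x(out, factor)


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
print(horizontal_flip(image))  # [[3, 2, 1], [6, 5, 4], [9, 8, 7]]
print(shear_x(image, 1.0))     # [[1, 2, 3], [0, 4, 5], [0, 0, 7]]
```

Each call to `augment` with a different random state yields a slightly different variant of the same fruit image, which is how the effective training set is enlarged.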
A CNN is a deep neural network that places convolutional feature-extraction layers in front of a classic fully connected network. It is widely used in computer vision tasks [19]. First, images are fed to the CNN, where kernels (filters) extract simple features. The resulting feature maps are passed to a pooling layer that reduces their spatial size, then to further kernels that identify more complex features, then to another pooling layer, and so on, depending on the information that must be extracted from the image. After this process, the feature maps are flattened and passed to the fully connected layers (the basic deep neural network). The weights of each layer are initialized randomly, and at the end of the first forward pass the difference between the actual and predicted values is calculated. This difference is used to adjust the weights in the backward pass, so that the network has new weights to operate with in the next forward pass. The network repeats this cycle of forward and backward propagation until it converges, as shown in Fig. 5.
The proposed CNN consists of a convolutional layer, where features are extracted from images, followed by a pooling layer whose main task is to reduce the size of the feature maps. This convolution-pooling pair is used three times. The output is then flattened and fed to a fully connected network that contains two layers with different numbers of nodes. Dropout is added to prevent overfitting, and two optimizers are tried in order to select the one that yields the better accuracy. Table 1 records all CNN parameters for the two datasets. The network is then trained.
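The convolution, pooling, and flattening stages described above can be sketched in plain Python on a toy image. This is a minimal illustration under assumed values: the 5x5 image, the vertical-edge kernel, and the helper names are hypothetical, and the kernels of a real CNN are learned during training rather than fixed by hand.

```python
def conv2d(image, kernel):
    """Valid (no padding) 2D cross-correlation with stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu_map(feature_map):
    """Apply ReLU element-wise to a feature map."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; incomplete border windows are dropped."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(feature_map[i * size + di][j * size + dj]
                for di in range(size) for dj in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def flatten(feature_map):
    """Turn a 2D feature map into the 1D vector fed to dense layers."""
    return [v for row in feature_map for v in row]


image = [
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
]
edge_kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
features = relu_map(conv2d(image, edge_kernel))
pooled = max_pool2d(features)
print(flatten(pooled))  # [3]
```

The kernel responds where intensity rises from left to right, and pooling keeps only the strongest response in each window, which is why the 5x5 input collapses to a single value here.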

Architecture of convolutional neural network.
CNN parameters: same number of layers for the two datasets with modifications in the number of kernels
Neural networks optimize their performance and reach convergence by starting from randomly initialized weights, calculating the loss between the actual and predicted outputs, and then updating the weights based on their previous values. Transfer learning instead starts by loading a pre-trained model. The final layer is then replaced, while the initial layers are frozen so that they keep the weights learned on the pre-training dataset. Resizing the input images to the size expected by the model is an important step for this kind of learning, after which the network is trained.
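The idea of freezing pre-trained layers and training only a new final layer can be illustrated with a deliberately tiny, dependency-free sketch. Everything here is an assumption for illustration: `BASE_WEIGHTS` stands in for a pre-trained feature extractor, the head is a simple logistic classifier, and the data are made up. It is not the architecture or code used in this study, where the base would be a network such as VGG16 loaded from a deep learning framework.

```python
import math

# Frozen "pre-trained" feature extractor: illustrative constants standing in
# for layers learned on an earlier dataset.
BASE_WEIGHTS = [[1.0, -1.0], [0.5, 0.5]]


def extract_features(x, base_weights):
    """Frozen base: linear map followed by ReLU. It is only read, never updated."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in base_weights]


def predict(features, head):
    """New trainable head: logistic regression on top of the frozen features."""
    z = head["bias"] + sum(w * f for w, f in zip(head["weights"], features))
    return 1.0 / (1.0 + math.exp(-z))


def train_head(samples, labels, base_weights, epochs=200, lr=0.5):
    """Train only the head with stochastic gradient descent."""
    head = {"weights": [0.0] * len(base_weights), "bias": 0.0}
    feats = [extract_features(x, base_weights) for x in samples]
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            error = predict(f, head) - y  # gradient of cross-entropy w.r.t. z
            head["weights"] = [w - lr * error * fi
                               for w, fi in zip(head["weights"], f)]
            head["bias"] -= lr * error
    return head


samples = [[2.0, 0.0], [3.0, 1.0], [0.0, 2.0], [1.0, 3.0]]
labels = [1, 1, 0, 0]
head = train_head(samples, labels, BASE_WEIGHTS)
predictions = [round(predict(extract_features(x, BASE_WEIGHTS), head))
               for x in samples]
print(predictions)   # [1, 1, 0, 0]
print(BASE_WEIGHTS)  # unchanged by training
```

Because gradients are applied only to `head`, the number of trainable parameters stays small, which is the main reason transfer learning can be cheaper than training a full network from scratch.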

VGG16 architecture for original version on the left and reduced version on the right.
VGG16 parameters

Architecture of residual network.
The last three models, EfficientNetB0, MobileNetV2, and ResNet50, share the same architecture for both datasets, which is shown in Fig. 8. The only exception is the number of neurons in the output layer (131 for the large dataset and 24 for the reduced dataset). The parameters employed for the three models on both datasets are recorded in Table 3.

Common Architecture of ResNet50, MobileNet-V2, and EfficientNet-B0.
ResNet50, MobileNet-V2, and EfficientNet-B0 parameters for both datasets
Activation function
An activation function determines whether a neuron is activated. It helps the network pass on useful information rather than spend computation on irrelevant information. Activation functions produce a neuron's output from its collection of inputs. In this article, two activation functions are used: ReLU and softmax.
ReLU
The Rectified Linear Unit (ReLU), given in Equation 1, does not activate all neurons at once: a neuron is deactivated only if its input is less than zero (Fig. 9). ReLU is computationally efficient and tends to accelerate the convergence of the loss function toward a minimum. The function is used in the convolutional and hidden layers.

Graphical representation of ReLU activation function.
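For reference, the standard definition of ReLU is f(x) = max(0, x). A one-function Python sketch (the function name is our own) shows the behavior on a few sample inputs:

```python
def relu(x):
    """Rectified Linear Unit: returns x for positive inputs and 0 otherwise."""
    return max(0.0, x)


print([relu(v) for v in (-2.0, -0.5, 0.0, 1.5)])  # [0.0, 0.0, 0.0, 1.5]
```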
Softmax
The softmax activation function calculates the probability of each class and can be viewed as a generalization of the sigmoid function to multiple classes. It is used in the output layer to return the class of the input, as shown in Fig. 10. Equations 2–4 show the calculations of the softmax, argmax, and sigmoid kernel functions, respectively.

Graphical representation of softmax activation function.
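The softmax and argmax steps can be sketched as follows, using the standard definition softmax(z)_i = exp(z_i) / sum_j exp(z_j). The example scores are arbitrary, and subtracting the maximum score is a common numerical-stability measure rather than something stated in this paper.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value, i.e. the predicted class."""
    return max(range(len(values)), key=lambda i: values[i])


probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
print(argmax(probs))                 # 0
```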
This research compares two well-known learning approaches, conventional supervised learning with a CNN and transfer learning, to achieve accurate fruit classification. The large dataset used first requires suitable hardware, so training was done on Google Colaboratory with a Tesla T4 GPU and 15 GB of memory.
Data augmentation in the form of rescaling, flipping, shearing, and zooming is applied to both the training and validation sets in all models. Models are trained on the training set and evaluated on the validation set. All models were trained for 50 epochs with a batch size of 64, using different optimizers with a learning rate of 0.001. The loss function is categorical cross-entropy because the task involves multiple fruit categories. The stopping criterion is the number of epochs, so that the performance of each model can be studied over the whole training process.
Evaluation metrics
The performance of the five proposed models is evaluated in two stages. The first stage compares the results obtained with the two versions of the Fruits360 dataset. In this stage, accuracy (Equation 5) is the chosen metric because it gives a direct, easily compared measure of how each model performs on each dataset version.
In the second stage, the metrics are extended to include precision, the area under the curve, and the categorical cross-entropy loss. Precision (Equation 6) is an important metric in this application because automated fruit classification is intended to compete with the true positive rate achieved by humans. The area under the curve (AUC) reflects the ability of the model to distinguish between classes. It is the area under the ROC curve, which plots the true positive rate against the false positive rate, and is given in Equation 7. In classification tasks, the cross-entropy loss (Equation 8) serves as a fundamental metric for assessing the performance of machine learning or deep learning models. It quantifies the disparity between the predicted probability and the intended outcome, making it the predominant choice of loss function in such scenarios.
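The four metrics named above can be written out in a few lines of Python using their standard textbook definitions. The function names and toy labels are illustrative assumptions, precision is computed for one class treated as positive, and the AUC helper covers only the binary case via pairwise ranking; a multi-class evaluation such as the one in this study would typically rely on a framework's built-in metrics.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def precision(y_true, y_pred, positive):
    """TP / (TP + FP) for one class treated as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    return tp / (tp + fp) if (tp + fp) else 0.0


def categorical_cross_entropy(y_true_onehot, y_prob, eps=1e-12):
    """Mean over samples of -sum(t * log(p)); eps guards against log(0)."""
    total = 0.0
    for target, probs in zip(y_true_onehot, y_prob):
        total -= sum(t * math.log(max(p, eps)) for t, p in zip(target, probs))
    return total / len(y_true_onehot)


def binary_auc(labels, scores):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = ["apple", "banana", "apple", "cherry"]
y_pred = ["apple", "banana", "banana", "cherry"]
print(accuracy(y_true, y_pred))                        # 0.75
print(precision(y_true, y_pred, "banana"))             # 0.5
print(binary_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
print(categorical_cross_entropy([[1, 0]], [[0.5, 0.5]]))
```

An AUC of 1 means every positive example is ranked above every negative one, while a value near 0.5 corresponds to ranking no better than chance.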
Comparing datasets
The two employed datasets, as mentioned earlier, are two versions of the same dataset. The original dataset is very large and took considerable time to train on, so the running time of the five models on the Fruits360-Original version was recorded and is shown in Fig. 11. The performance of the five models on the two datasets is given in Table 4.

Training time of the five models on the Fruits360-Original version dataset.
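Recording training time of the kind reported above can be done with a small wrapper around the training call. The sketch below is an assumption about how such timing might be implemented, not the procedure used in this study; `dummy_training` is a placeholder workload standing in for a real training routine.

```python
import time


def timed(train_fn, *args, **kwargs):
    """Run train_fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = train_fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def dummy_training(epochs):
    # Placeholder workload standing in for a real model training call.
    return sum(i * i for i in range(epochs * 10_000))


_, seconds = timed(dummy_training, epochs=50)
print(f"training took {seconds:.3f} s")
```

`time.perf_counter` measures wall-clock time with high resolution, which suits long-running jobs where waiting on the GPU dominates.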
Validation accuracies of the five proposed models on the two employed datasets
Table 4 shows that accuracy improved on the reduced dataset for CNN, VGG16, and MobileNetV2. This is an important result: training on the smaller dataset takes less time and, for these models, also improves performance. ResNet50 and EfficientNetB0 are deeper and more complex than simpler architectures such as the CNN and MobileNet, which may be a direct cause of their disappointing results. They have a larger number of parameters, which can lead to overfitting when the dataset is small. Additionally, deep and complex models such as ResNet50 and EfficientNetB0 typically require a larger amount of data to generalize well. With a smaller dataset, these models may not have enough examples to learn meaningful representations, leading to lower accuracy.
Because this version of the dataset is small compared to the original, it makes it easier to monitor and record more metrics for the five employed models. All results are recorded on the training and validation sets, as previously mentioned. The CNN model was trained from scratch with 418K trainable parameters and randomly initialized weights. L2 regularization with a rate of 0.01 and a dropout of 20% were used to obtain the best possible results. The CNN model recorded a precision of 98.86%, a loss of 1.8%, and an AUC of 1, as shown in Fig. 12.

Training and validation results of the CNN model when trained on the Fruits-360 Reduced version.
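The two regularization techniques reported for the CNN, an L2 rate of 0.01 and 20% dropout, can be illustrated as follows. This is a sketch under common conventions (penalty equal to the rate times the sum of squared weights, and inverted dropout that rescales surviving units); the weights, activations, and `data_loss` value are made up for illustration.

```python
import random


def l2_penalty(weights, rate=0.01):
    """L2 regularization term: rate * sum of squared weights."""
    return rate * sum(w * w for w in weights)


def dropout(activations, drop_prob=0.2, rng=None, training=True):
    """Inverted dropout: zero each unit with probability drop_prob and
    rescale the survivors by 1 / (1 - drop_prob)."""
    if not training or drop_prob == 0.0:
        return list(activations)  # dropout is disabled at inference time
    rng = rng or random.Random()
    keep = 1.0 - drop_prob
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


weights = [0.5, -1.0, 2.0]
data_loss = 0.30  # illustrative value only
print(data_loss + l2_penalty(weights))  # data loss plus the weight penalty
print(dropout([1.0] * 5, rng=random.Random(0)))
```

The penalty discourages large weights, while dropout forces the network not to depend on any single unit; both aim to reduce overfitting.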
On the validation dataset, the VGG16 model recorded a precision of 99.15%, a loss of 2%, and an AUC of 1. The ResNet50 model, with around 49K trainable parameters, recorded a precision of 95.58%, a loss of 2%, and an AUC of 0.9. MobileNet-V2, with 30K trainable parameters, recorded a precision of 99.54%, a loss of 1.7%, and an AUC of 1. With 30K trainable parameters, the EfficientNetB0 model recorded 0% precision, 3.2% loss, and an AUC of 0.6, which means that the model was unable to distinguish between the different classes. Figures 13–16 summarize the results during different epochs of the four transfer learning models VGG16, ResNet50, MobileNet-V2, and EfficientNetB0, respectively.

Training and validation results of the VGG16 model when trained on the Fruits-360 Reduced version.

Training and validation results of the ResNet50 model when trained on the Fruits-360 Reduced version.

Training and validation results of the MobileNetV2 model when trained on the Fruits-360 Reduced version.

Training and validation results of the EfficientNet-B0 model when trained on the Fruits-360 Reduced version.
The VGG16 model showed almost steady-state values for the three evaluation metrics when approaching the end of the 50 epochs. This suggests that the chosen number of epochs was sufficient. Overall, the VGG16 results are very promising and indicate that it can be a reliable model in such applications.
Similarly, the precision, loss, and AUC of the ResNet50 and MobileNet-V2 models approach convergence. The AUC values are excellent and indicate that the models have a high ability to distinguish between positive and negative examples. They suggest that the models' predictions have a high probability of correctly ranking the instances, which is crucial in classification and ranking tasks. The loss results indicate that the models' predictions are, on average, very close to the actual labels. Lower loss values are generally better, and a loss of 0 would represent a perfect match between predictions and actual labels.
As Precision is the ratio of true positive predictions to all positive predictions, a precision of around 0.99 means that almost all positive predictions made by the model are correct. This suggests that when the model predicts a positive outcome, it is highly likely to be correct, which is especially important in tasks where false positives are costly or undesirable.
The EfficientNet-B0 results were extremely unstable, as shown in Fig. 16. If two transfer learning models are trained with the same settings (same architecture, hyperparameters, and pre-trained weights) but produce extremely different results, there are several potential explanations for the discrepancy. One important and potentially misleading factor is random initialization. Even when using the same architecture and pre-trained weights, deep learning models typically involve some degree of randomness. Random weight initialization of the new layers, random data shuffling, or other stochastic processes can lead to different convergence paths during training. Small initial differences in weights or data order can compound over time and result in divergent training trajectories.
Another reason may be the weight initialization from pre-trained models. Even if the architectures and hyperparameters are the same, two models initialized from different pre-trained weights, or from checkpoints taken at different points during training, can produce divergent outcomes.
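The effect of random initialization and data shuffling discussed above can be controlled by fixing random seeds. The sketch below uses only Python's standard `random` module with hypothetical helper names; deep learning frameworks provide their own seeding utilities, and full determinism on a GPU may require additional settings.

```python
import random


def init_weights(n, seed=None):
    """Draw n small random weights; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)
    return [rng.uniform(-0.05, 0.05) for _ in range(n)]


def shuffled_order(n, seed=None):
    """Return a shuffled sample order; a fixed seed makes it repeatable."""
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    return order


print(init_weights(3, seed=7) == init_weights(3, seed=7))      # True
print(shuffled_order(5, seed=7) == shuffled_order(5, seed=7))  # True
print(init_weights(3, seed=1) == init_weights(3, seed=2))      # False
```

Repeating a run with several different seeds and reporting the spread of results is one way to tell a genuinely unstable model from an unlucky initialization.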
The present study presents a comprehensive comparative analysis, evaluating the performance of a convolutional neural network (CNN) model in conjunction with four transfer learning models: VGG16, ResNet50, MobileNet-V2, and EfficientNet-B0. The accuracies achieved by these models, using the Fruit-360 benchmark original dataset, were recorded as 95%, 93%, 99.8%, 65%, and 92.6%, respectively.
Notably, during the testing phase, the validation accuracy of the ResNet50 model surpassed its training accuracy. This behavior is likely attributable to similarities between the validation and training data samples. No evidence of overfitting was found, as indicated by the validation loss being lower than the training loss. To examine this behavior further, the application and evaluation of a cross-validation method are recommended.
Furthermore, the original version of the Fruits360 dataset was compared with its most recent reduced version. Training on the original version took far longer than on the reduced version, while the CNN, VGG16, and MobileNetV2 models recorded higher validation accuracies on the reduced version.
The CNN model performed very well despite being trained for only one hour and having the highest number of trainable parameters. It achieved good accuracy on both employed datasets. This suggests that it is still worth trying a traditional approach, such as a CNN trained from scratch, before moving directly to the more recent transfer learning approach.
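The cross-validation recommended above can be sketched as a k-fold index generator. This is a minimal illustration with an assumed function name and contiguous folds; in practice the samples would be shuffled (and possibly stratified by class) before being partitioned, and each fold would involve retraining the model.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, val_indices) for each of k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


for fold, (train_idx, val_idx) in enumerate(k_fold_indices(10, k=5)):
    print(fold, val_idx)
```

Averaging the validation metric over all folds gives a less split-dependent estimate than a single train/validation partition.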
Future research endeavors will focus on optimizing the hyperparameters of the aforementioned models using the same dataset. Additionally, the inclusion of novel models such as YOLO (You Only Look Once) and SSD (Single Shot Detector) will be explored to enable comparative evaluations of their performance. Furthermore, efforts will be directed toward developing a software program to translate the findings of this study into a practical real-world application.
