Plant diseases can cause significant losses in agricultural productivity; therefore, their early prediction is essential. Many machine learning-based plant disease prediction models have been proposed so far, but these models suffer from noisy class labels in the datasets, which degrade their performance. Class label noise results from the improper assignment of positive class labels to negative data samples or vice versa. Hence, a precise and noise-free plant disease model is required for better prediction. The current study proposes noise reduction-based hybridized classifiers for plant disease prediction. One tomato and four soybean disease datasets have been selected to conduct the proposed research. The Adaptive Sampling-based Class Label Noise Reduction (AS-CLNR) method has been used along with the Support Vector Machine (SVM) approach for noise reduction. The noise-minimized datasets have been fed into the Extreme Learning Machine (ELM), Decision Tree (DT), and Random Forest (RF) classifiers, whose parameters are optimized using a Genetic Algorithm (GA), to develop plant disease prediction models. The performance of these models, viz. Hybrid SVM-GA-ELM, Hybrid SVM-GA-DT, and Hybrid SVM-GA-RF, has been evaluated using the Accuracy, Area under ROC Curve, and F1-Score metrics. Further, these classifiers have been ranked using the Friedman test, in which the Hybrid SVM-GA-RF classifier performed the best. Lastly, the Nemenyi test has been performed to determine whether significant differences exist between the classifiers. It was found that 33.33% of the total pairs of hybrid classifiers differ significantly in performance.
The agricultural sector plays a key role in the socio-economic development of a country. However, the occurrence of various plant diseases can cause major economic and production losses in the agricultural and farming industry. The latest report of the Food and Agriculture Organization (FAO) of the United Nations claims that plant diseases can cause global economic losses of up to $220 billion every year [1]. Therefore, the detection of plant diseases has become an indispensable research domain, and a wide range of research has been carried out in this field over the last two decades. Earlier, crop monitoring was used to identify diseases in plants, with agricultural experts adopting a naked-eye approach [2]. However, this approach demands continuous monitoring by the experts, which can be very costly for a large farm field. Thus, a cheap, automatic, fast, and more accurate approach is required for plant disease detection [3]. Consequently, agricultural scientists have started using machine learning techniques to develop automatic prediction models for forecasting plant diseases. Prior studies show that many machine learning and deep learning algorithms are extensively used to predict plant diseases [4, 5, 6, 7, 8, 9, 10, 11, 12]. Classification is an essential task in machine learning. It identifies the class labels of data samples that are described by a set of features (attributes) in a dataset. The primary goal of classification is to correctly identify the labels of those data samples whose attributes are known but whose class labels are unknown [13]. However, the classification task suffers from the problem of dirty datasets [14]. Here, a dirty dataset refers to data with noisy class labels, i.e., a dataset that has been mislabeled through the improper assignment of a negative label to a positive data sample or vice versa.
Noisy datasets can degrade the performance of machine learning algorithms, so suitable noise removal techniques become necessary for the classification task. Therefore, this study applies one of the most prominent methods for handling noisy datasets, the Adaptive Sampling-based Class Label Noise Reduction (AS-CLNR) technique, while developing plant disease prediction models. The Random Over Sampling (ROS) technique has also been used to deal with imbalanced datasets. A dataset with a large mismatch between the numbers of data samples belonging to its different classes is termed an imbalanced dataset.
The current study proposes three hybridized techniques for plant disease prediction. These include Support Vector Machine (SVM) [15] for noise reduction with the help of the AS-CLNR method followed by classification using three different classifiers, one at a time, namely, Decision Tree (DT) [16], Extreme Learning Machine (ELM) [17], and Random Forest (RF) [18] classifiers. The parameters of these classifiers are optimized using a well-known optimization algorithm, i.e., Genetic Algorithm (GA) [19].
Hence, the three proposed hybrid classifiers are as follows: Hybrid SVM-GA-DT, Hybrid SVM-GA-ELM, and Hybrid SVM-GA-RF. Five plant disease datasets have been used in this study: the Tomato Powdery Mildew Disease (TPMD) dataset and four variants of the soybean disease dataset, viz. Soybean Alternaria Leaf Spot Disease (SB-ALSD), Soybean Phytophthora Rot Disease (SB-PRD), Soybean Frog Eye Leaf Spot Disease (SB-FLSD), and Soybean Brown Spot Disease (SB-BSD). All these datasets are initially imbalanced, so the ROS technique has been used in this study to balance the data. Further, the results obtained from the proposed classifiers have been assessed and compared using three performance metrics, namely, Accuracy, Area under ROC Curve (AUC), and F1-Score (F), for each of the five datasets. Furthermore, the paper provides statistical evidence for the comparison of the different techniques, which strengthens the conclusions of this work.
The remainder of the paper is structured as follows: Section 2 highlights previous studies related to plant disease prediction. Section 3 presents the details of the preliminaries used in this study. Section 4 elaborates the proposed approach adopted in this paper. Experimental results are presented in Section 5, followed by the conclusion and future scope in Section 6.
Related work
So far, many researchers have used machine learning approaches for plant disease forecasting. In 2006, Kaundal et al. developed an SVM-based online server to predict rice blast disease. They also claimed that SVM performed better than back-propagation neural network, multiple regression, and generalized regression neural network algorithms to forecast plant diseases [5].
SVM was further used in 2010 by Rumpf et al. for the early detection of diseases in sugar beet plants. They conducted the experiment on hyperspectral data to classify healthy and infected sugar beet leaves with 97% accuracy [20]. Further, in 2013, Liu et al. proposed a multiclass prediction model to classify non-diseased wheat leaves and leaves infected with stripe rust, leaf blight, powdery mildew, and leaf rust diseases. They used image-based wheat disease data in their study, classified by the Radial Basis Function (RBF)-SVM approach. Based on image processing techniques and the RBF-SVM algorithm, the authors obtained a classification accuracy of up to 96% [21].
Later, in 2016, Sabrol and Kumar proposed an intensity-based feature extraction technique and the DT approach for tomato plant disease diagnosis. The authors claimed that their prediction model was 76% accurate [16]. In the same year, Chung et al. used GA with the SVM approach to predict Bakanae disease in rice crops. The proposed model was able to distinguish diseased and healthy rice seedlings with 87.9% accuracy [22]. Afterward, in 2017, Fuentes et al. proposed deep-learning-based prediction models for tomato disease diagnosis. They used three Convolutional Neural Network (ConvNet) based meta-architectures, viz. Single Shot Multibox Detector, Faster Region-based ConvNet, and Region-based Fully ConvNet. Each of these architectures was combined with Residual Neural Network and VGGnet based feature extractors. The proposed models detected various pests and diseases in tomato plants [6].
A year later, Verma et al. presented a survey of multiple machine learning and image processing techniques used for tomato plant disease forecasting [23]. Afterward, in 2019, they extended their research by constructing a deep learning-based mobile application to diagnose diseases in tomato plants [24]. In 2020, Verma et al. further developed three deep learning-based CNN models to identify the severity of late blight disease in tomato plants [25]. In the same year, Verma et al. also explored the use of capsule networks in potato disease classification. Their proposed model achieved an accuracy of 91.83% [26].
In 2020, Bhatia et al. used ELM with various resampling techniques, i.e., Synthetic Minority Over-sampling (SMOTE), ROS, and Importance Sampling (IMPS), to predict powdery mildew disease in tomato plants using the TPMD dataset. They found that IMPS-ELM outperformed all other models with 89.91% accuracy [27]. However, none of these studies used a noise reduction-based approach for plant disease prediction. The exception is Bhatia et al., who in 2020 proposed a noise reduction-based Hybrid SVM-LR algorithm for powdery mildew disease forecasting in tomato plants [28].
Further, as an extension to this study, three hybrid classifiers viz. Hybrid SVM-GA-ELM, Hybrid SVM-GA-DT, and Hybrid SVM-GA-RF have been proposed in this paper for plant disease prediction. Also, the parameters of ELM, DT, and RF have been optimized with the help of GA in these hybrid classifiers. An extensive statistical comparison has also been done amongst all the three proposed classifiers.
Materials and methods
Datasets
In this paper, tomato and soybean disease datasets have been used to develop a plant disease forecasting system. The current study focuses on binary class datasets, including information about the conduciveness of a particular disease for a specific plant on the basis of weather-based parameters and plants’ local and global attributes. The following subsections describe these datasets in detail.
TPMD dataset
The TPMD dataset shows whether a tomato plant is prone to powdery mildew disease with respect to a few standard weather parameters, namely Temperature (T), Global Radiation (GR), Leaf Wetness (LW), Wind Speed (WS), and Relative Humidity (RH) [29]. This dataset contains 244 data samples, of which 217 belong to the Non-Conducive (NC) class and only 27 belong to the Conducive (C) class. As shown in Table 1, the TPMD dataset contains seven attributes, of which T, GR, LW, WS, and RH are taken as the independent variables. Class is the target variable; it shows whether the climatic condition of a specific day is favorable for powdery mildew disease development in tomato plants.
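Sample of TPMD dataset

| Date | T | RH | LW | WS | GR | Class |
|---|---|---|---|---|---|---|
| 9/20/2006 | 26.6 | 26.6 | 22 | 1 | 56 | NC |
| 9/21/2006 | 27.9 | 68 | 23 | 2 | 55 | NC |
| 9/22/2006 | 26.5 | 72 | 25 | 2 | 54 | NC |
| 9/23/2006 | 25.3 | 76 | 24 | 1 | 56 | C |
| 9/24/2006 | 25.8 | 80 | 22 | 1 | 55 | C |
| 9/25/2006 | 26.6 | 83 | 25 | 1 | 58 | C |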
Representation of ROS technique.
Soybean disease datasets
Here, four variants of the soybean disease dataset, namely, SB-ALSD, SB-PRD, SB-FLSD, and SB-BSD, have been derived from the Soybean Large (SBL) dataset available in the University of California at Irvine (UCI) Repository of Machine Learning Databases [30]. The SBL dataset is a multiclass dataset with 683 samples containing information about 19 soybean diseases based on different climatic factors and the plants’ global and local attributes. It includes 35 attributes and one target disease class, as shown in Table 2. Overall, the SBL dataset contains 23,905 entries (683 × 35), of which 2,337 are missing values; these were handled using the random forest imputation method, which trains a random forest on the observed values and uses it to predict the missing ones [31]. Further, for the proposed approach, the SBL dataset has been converted into a binary-class dataset for each of the 19 soybean diseases. Each binary dataset is formed by relabeling the samples of the chosen disease as class ‘C’ (conducive) and all other samples as class ‘NC’ (non-conducive). In this study, only four of the 19 soybean disease datasets have been selected, viz. SB-ALSD, SB-PRD, SB-FLSD, and SB-BSD. The SB-BSD dataset contains 92 ‘C’ and 591 ‘NC’ samples, whereas the SB-PRD dataset contains 88 ‘C’ and 595 ‘NC’ samples. Similarly, the SB-FLSD and SB-ALSD datasets contain 91 ‘C’ and 592 ‘NC’ samples each. All four datasets include 683 data samples with 35 features and one target attribute, i.e., Class.
SBL dataset description

| Attribute name | Possible values |
|---|---|
| Precipitation | Less than, equal or greater than normal |
| Temperature | Less than, equal or greater than normal |
| Hail | Yes, no |
| Plant stand | Normal, less than normal |
| Time of incidence | April to October |
| Crop-hist | Same last year, different last year, same last two years, same last seven years |
| Area-damaged | Scattered, whole agricultural fields, groups of plants in highlands areas, groups of plants in lowlands areas |
| Severity | Severe, potentially severe, minor |
| Seed Germination | Less than 80%, 80–89%, 90–100% |
| Seed Treatment | Fungicide, other, none |
| Plant Growth | Abnormal, normal |
| Leaves | Abnormal, normal |
| Leaf spots-halo | With or without yellow halos |
| Leaf spots-margin | Does not apply, with or without water soaked margin, with water soaked margin |
| Leaf spot-size | Does not apply, greater than 1/8”, less than 1/8” |
| Leaf-shredding | Present, absent |
| Leaf malformation | Present, absent |
| Leaf mildew growth | Lower surface, upper surface, absent |
| Stem | Abnormal, normal |
| Presence of Lodging | Yes, no |
| Canker lesion color | Brown, tan, dark-brown or black |
| Stem cankers | Lower than soil line, above second node, absent, at or marginally above soil line |
| External decay | Watery, dry, absent |
| Mycelium on stem | Present, absent |
| Fungal fruiting body on stem | Present, absent |
| Sclerotia – internal or external | Present, absent |
| Internal discoloration | Brown, black, none |
| Fruits Pods | Does not apply, normal, few present, diseased |
| Fruits Spots | Does not apply, distorted, brown spots with black specks, colored, absent |
The ROS [32] technique is used for balancing and resampling the imbalanced datasets. This technique randomly adds replicas of existing minority-class samples to the original dataset until it is balanced with the majority class. Many other resampling techniques can also be used for data balancing, such as Adaptive Synthetic (ADASYN) [33], Synthetic Minority Over-sampling (SMOTE) [34], and borderline-SMOTE [35]. However, these techniques have several drawbacks, such as high computational complexity, severe distortion of the minority class distribution, and low computational efficiency. Thus, in this study, the ROS technique has been selected, which not only overcomes these drawbacks but also generates appropriate additional samples from the minority classes for data balancing [36]. Figure 1 shows a diagrammatic representation of the ROS technique for a binary class dataset.
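As an illustration of the balancing step, the following minimal Python sketch replicates randomly chosen minority-class samples until every class matches the size of the largest class. It is not the implementation used in this study; the function name `random_over_sample`, the fixed seed, and the placeholder feature rows are assumptions made for the example. Only the class sizes (27 ‘C’ and 217 ‘NC’) are taken from the TPMD dataset.

```python
import random
from collections import Counter


def random_over_sample(samples, labels, seed=0):
    """Replicate random minority-class samples until all classes
    have as many samples as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        # Pool of existing samples that belong to this class.
        pool = [s for s, y in zip(samples, labels) if y == cls]
        for _ in range(target - n):
            out_samples.append(rng.choice(pool))  # exact replica
            out_labels.append(cls)
    return out_samples, out_labels


# Class sizes of the TPMD dataset before balancing.
labels = ["C"] * 27 + ["NC"] * 217
samples = list(range(len(labels)))  # placeholder feature rows (assumption)

balanced_samples, balanced_labels = random_over_sample(samples, labels)
print(Counter(balanced_labels))
```

Because ROS only copies existing rows, the balanced dataset contains no new feature values; every added sample is a duplicate of an original minority-class sample.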
Adaptive sampling technique (AST)
The Adaptive Sampling Technique (AST) [37] can be used to minimize class label noise (incorrect label assignment) in a dataset. AST acts as a wrapper around different conventional classifiers such as k-Nearest Neighbor, Logistic Regression (LR), SVM, and Linear Discriminant Analysis. The method provides, for each data sample, the probabilities of belonging to the positive and negative classes, derived by classifying the training set with a conventional classifier. These probabilities are then used to obtain noise-minimized training data, which reduces the chance of picking mislabeled data when developing the prediction model and consequently delivers a well-trained, accurate, and robust model.
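The idea of using class probabilities to obtain noise-minimized training data can be sketched as below. This is only an illustrative simplification and not the AS-CLNR procedure itself: it assumes that a probabilistic classifier (for example an SVM with probability outputs) has already produced the probability of the positive class for each training sample, and it simply drops samples whose assigned label receives a probability below a threshold. The function name, the threshold of 0.5, and the probability values are all hypothetical.

```python
def filter_label_noise(samples, labels, prob_positive, threshold=0.5):
    """Keep only samples whose assigned label is supported by the
    classifier's estimated probability (illustrative rule, assumption)."""
    kept_samples, kept_labels = [], []
    for x, y, p in zip(samples, labels, prob_positive):
        # Probability that the classifier gives to the label the sample carries.
        p_assigned = p if y == 1 else 1.0 - p
        if p_assigned >= threshold:
            kept_samples.append(x)
            kept_labels.append(y)
    return kept_samples, kept_labels


# Hypothetical probabilities of the positive class for four training samples.
samples = ["s1", "s2", "s3", "s4"]
labels = [1, 1, 0, 0]
prob_positive = [0.9, 0.2, 0.1, 0.8]

clean_samples, clean_labels = filter_label_noise(samples, labels, prob_positive)
print(clean_samples, clean_labels)
```

In this toy input, `s2` and `s4` carry labels that the classifier finds unlikely, so they are treated as probable label noise and excluded from the training data.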
Support vector machine (SVM)
SVM is a prominent machine learning algorithm proposed by Chervonenkis and Vapnik in 1963 [38]. In this algorithm, each sample of the dataset is plotted as a point in an $n$-dimensional feature space. Here, $n$ specifies the number of features present in the data, and every individual axis in the space represents a specific feature. The SVM algorithm can be used for both regression and classification; in this study, it is used for classification. The SVM classifier tries to identify an ideal hyperplane for the accurate classification of two labeled classes. The hyperplane with the maximum margin is termed the optimal hyperplane. Assume $D = \{(x_i, y_i)\}_{i=1}^{N}$ is the training data with $N$ samples, where each data sample $x_i$ has $n$ features and a class label $y_i$ with one of two values ($y_i \in \{-1, +1\}$). SVM attempts to find a splitting hyperplane with the maximum margin between the two classes, measured along a line perpendicular to that hyperplane. For instance, in Fig. 2, the two classes NC and C are completely divided by a dotted hyperplane whose equation is written as:

$$w \cdot x + b = 0 \tag{1}$$

where $w$ is the weight vector normal to the hyperplane and $b$ is the bias.
Splitting hyperplane in SVM.
The main aim is to find a splitting hyperplane such that the distance between the two margin hyperplanes $H_1: w \cdot x + b = -1$ and $H_2: w \cdot x + b = +1$ is maximized. The distance between $H_1$ and $H_2$ is $2/\lVert w \rVert$. Maximizing $2/\lVert w \rVert$ is the same as minimizing $\lVert w \rVert^2/2$, so this problem can be rewritten as:

$$\min_{w,\,b} \ \frac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1, \quad i = 1, \dots, N \tag{2}$$

In the above equation, the constraint $y_i\,(w \cdot x_i + b) \ge 1$ indicates that:

$$w \cdot x_i + b \le -1 \ \ \text{if } y_i = -1, \qquad w \cdot x_i + b \ge +1 \ \ \text{if } y_i = +1 \tag{3}$$

In other words, the data in class NC should lie on the left-hand side of $H_1$, whereas the data in class C must lie on the right-hand side of $H_2$. Figure 2 shows a linearly separable classification problem. For problems that are not linearly separable, SVM uses the kernel trick, where the kernel converts the problem into a linearly separable one by adding new dimensions to it. The linear, polynomial, and radial basis function kernels are the most extensively used kernels with SVM.
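The margin condition can be checked numerically with a small Python sketch. It assumes the standard hard-margin formulation, in which a hyperplane is described by a weight vector `w` and a bias `b`, class labels are encoded as -1 and +1, every sample must satisfy y(w·x + b) ≥ 1, and the margin width is 2/‖w‖. The hyperplane and the four points below are hypothetical values chosen for the illustration, not data from this study.

```python
import math


def decision_value(w, b, x):
    """Signed score w.x + b of a point x with respect to the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def margin_width(w):
    """Distance between the two margin hyperplanes: 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


def satisfies_margin(w, b, points, labels):
    """True if every sample satisfies y * (w.x + b) >= 1."""
    return all(y * decision_value(w, b, x) >= 1
               for x, y in zip(points, labels))


# Hypothetical separating hyperplane x1 = 3 in a two-feature space.
w, b = [1.0, 0.0], -3.0
points = [[1.0, 2.0], [2.0, 0.0], [4.0, 1.0], [5.0, 3.0]]
labels = [-1, -1, +1, +1]

print(margin_width(w))                           # width of the margin
print(satisfies_margin(w, b, points, labels))    # all constraints hold?
```

Here the points at x1 = 2 and x1 = 4 lie exactly on the two margin hyperplanes, so they play the role of support vectors for this toy example.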
Extreme learning machine (ELM)
Huang et al. first introduced ELM as a Single Layer Feed Forward Neural Network (SL-FFNN) [39]. ELM is built from a single layer of hidden nodes, in which the weights between the inputs and the hidden nodes are assigned randomly and do not change during the training and prediction phases. Moreover, the weights that connect the hidden nodes to the output can be trained very fast [40]. Assume $D = \{(x_i, t_i)\}_{i=1}^{N}$ is the training data with $N$ data samples and $n$ input features, and that a network is given with $n$ input units, $L$ hidden-layer neurons, and $m$ outputs. The output of the ELM training model can be written as follows:

$$f(x) = \sum_{j=1}^{L} \beta_j h_j(x) = h(x)\,\beta \tag{4}$$

In Eq. (4), $\beta = [\beta_1, \dots, \beta_L]^T$ indicates the weight vector that links the hidden neurons to the output neuron, and $h(x)$ denotes the vector of the hidden neurons’ outputs for a specific input data sample $x$. So, $h(x)$ can be written as Eq. (5):

$$h(x) = [\,G(w_1 \cdot x + b_1), \dots, G(w_L \cdot x + b_L)\,] \tag{5}$$

In Eq. (5), $b_j$ indicates the bias of the $j$th hidden neuron, $w_j$ denotes the weight vector of the $j$th hidden neuron, and $G$ represents an activation function. After $w_j$ and $b_j$ are provided, the hidden-layer output matrix $H$ is formed in the next step. It is a matrix whose $i$th row is the hidden-layer output vector $h(x_i)$. The weight matrix $\beta$ is then calculated using the well-known Moore-Penrose pseudo-inverse [41], as shown in Eq. (6):

$$\beta = H^{\dagger}\,T \tag{6}$$

In the above equation, $T$ indicates the target matrix; its $i$th row is the actual target vector $t_i$. Based on Eq. (6), the class label for a new data sample $x_{new}$ can be found as follows:

$$f(x_{new}) = h(x_{new})\,\beta \tag{7}$$
Decision tree (DT)
DT [42] is also a supervised machine learning algorithm, which can be viewed as a rooted directed tree. In this tree, each internal node represents a test on an attribute and has outgoing edges. The rule on which this test is based is known as the split criterion. All nodes other than the internal nodes are leaves or terminal nodes, and each terminal node is assigned to a class or to a probability of that class. Accordingly, a DT is made up of a number of terminal nodes and internal nodes. Assume that each internal (test) node performs a test on a feature (attribute) and routes a new data sample $s$ either to the left child node $c_L$ or to the right child node $c_R$. If the split criterion is based on a single attribute $a$ and a threshold value $\theta$, then:

$$\text{child}(s) = \begin{cases} c_L, & \text{if } s_a < \theta \\ c_R, & \text{if } s_a \ge \theta \end{cases} \tag{8}$$

According to Eq. (8), the new data sample $s$ will be routed to the left child node $c_L$ if the value $s_a$ of attribute $a$ of $s$ is smaller than the threshold value $\theta$. However, it will be assigned to the right child node $c_R$ if the value of attribute $a$ of $s$ is greater than or equal to the threshold value $\theta$. All the terminal nodes contain votes for the outcome variable $y \in \{1, \dots, K\}$; here, $K$ represents the number of classes present in the outcome variable $y$. In each step, the best split is determined by a split criterion. The most widely used split criteria of DT are Entropy (Eq. (9)) and Gini Index (Eq. (10)):

$$\text{Entropy} = -\sum_{k=1}^{K} p_k \log_2 p_k \tag{9}$$

$$\text{Gini} = 1 - \sum_{k=1}^{K} p_k^2 \tag{10}$$

where $p_k$ is the proportion of samples of class $k$ at the node being split.
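The two impurity measures and the threshold-based routing rule can be illustrated with a short Python sketch. It uses the standard definitions of entropy and Gini index over class proportions at a node. The six rows are the RH and Class values from the TPMD sample table; the attribute name key `"RH"` and the threshold of 75 are assumptions chosen only to demonstrate a split, not values reported in this study.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy of the class distribution at a node (log base 2)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini index of the class distribution at a node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split(rows, labels, attribute, threshold):
    """Route samples left if attribute < threshold, otherwise right."""
    left = [y for r, y in zip(rows, labels) if r[attribute] < threshold]
    right = [y for r, y in zip(rows, labels) if r[attribute] >= threshold]
    return left, right


# RH and Class values of the six sample rows of the TPMD dataset.
rows = [{"RH": 26.6}, {"RH": 68}, {"RH": 72}, {"RH": 76}, {"RH": 80}, {"RH": 83}]
labels = ["NC", "NC", "NC", "C", "C", "C"]

left, right = split(rows, labels, "RH", 75)  # threshold 75 is hypothetical
print(entropy(labels), gini(labels))         # impurity before the split
print(entropy(left), gini(right))            # impurity of the children
```

For these six rows, the parent node is maximally impure for two classes, while both children produced by the hypothetical threshold are pure, which is the situation a split criterion rewards.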
Random forest (RF)
RF [43] is an ensemble supervised learning algorithm that consists of multiple DTs. The RF classifier combines two common approaches, namely, bootstrapping and aggregation. Here, bootstrapping signifies that each decision tree is trained in parallel on a different subgroup of the training data, using a subset of the existing attributes. Bootstrapping ensures that each DT is distinct, which minimizes the overall variance of the RF algorithm. Lastly, the RF classifier aggregates the decisions of all DTs. This procedure is termed aggregation, and it makes the RF classifier more robust and generalizable than other conventional algorithms. Figure 3 shows an example of a simple implementation of an RF classifier on a training dataset with four attributes and two classes, 1 and 2.
Example of RF classifier implementation.
k-fold cross-validation
$k$-fold cross-validation [44] is a widely used machine learning approach for estimating a classifier’s performance. In this validation, the dataset is randomly divided into $k$ equal parts or folds, and $k-1$ of them are taken as the training data. The remaining fold is taken as the testing data for validation. As shown in Eq. (11), $\overline{Acc}$ is the average value of the testing accuracies ($Acc_i$) resulting from the $k$-fold cross-validation:

$$\overline{Acc} = \frac{1}{k} \sum_{i=1}^{k} Acc_i \tag{11}$$

In the current study, 10-fold cross-validation has been used. Figure 4 shows an example of the 10-fold cross-validation procedure.
Example of 10-fold cross-validation.
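The fold construction and the averaging of fold accuracies can be sketched in Python as follows. The function names, the fixed seed, and the interleaved way of forming folds are assumptions made for the example; when the number of samples is not divisible by the number of folds, the folds differ in size by at most one sample. The sample count of 434 corresponds to the balanced TPMD dataset.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Interleaved slicing gives k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def mean_accuracy(fold_accuracies):
    """Average of the per-fold testing accuracies."""
    return sum(fold_accuracies) / len(fold_accuracies)


# 10-fold split of 434 samples (size of the balanced TPMD dataset).
splits = list(k_fold_indices(434, k=10))
print([len(test) for _, test in splits])
```

Each sample appears in exactly one test fold, so every sample is used for testing once and for training nine times.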
Genetic algorithm (GA)
Genetic algorithm (GA) is an optimization algorithm that was first introduced by Goldberg and Holland using the concepts of natural selection and genetics [45, 46]. It is a search-based optimization technique used to determine the optimal or near-optimal solution of a complex problem that would otherwise take an impractically long time to solve. In the current study, GA helps in selecting the optimal values of the classifier parameters such that the accuracy is maximized, in order to achieve the best prediction rate for plant disease classification. The fitness function forms the basis of GA: it takes a candidate solution as input and returns the fitness (suitability) of that solution as output. Various combinations of parameters are tested to check the fitness of the solution. The best-fit solution can be obtained with the help of the following steps:
Parent Selection: Parent selection is the first phase of GA. In this phase, individuals from the initial population of parameters are selected as parents for mating. Here, the population refers to a subset of all probable solutions of a particular problem. The initial population can be generated using random selection or heuristic algorithms. The parents then mate and recombine to produce their children or off-spring. A proper selection of parents is essential to obtain the best solution for the problem concerned. The process of parent selection is crucial to the convergence rate of the GA, as good parents drive individuals toward better and fitter solutions. There are various selection schemes, such as Truncation Selection, Proportionate Selection, Tournament Selection [47], Linear Ranking Selection, Exponential Ranking Selection [48], Roulette Wheel Selection, and Boltzmann Selection [49], which can be used for parent selection in GA. In the current study, Exponential Ranking Selection has been used for this purpose.
Crossover Operation: In this phase of GA, a crossover operator is used on the selected parents’ variables to produce children using parents’ genetic properties. Various crossover operators can be used during this operation viz. ring crossover, multi-point crossover, shuffle crossover, one-point crossover, uniform crossover, and many more. Several permutations of parents’ chromosomes are performed to achieve the off-spring (child) chromosomes using one of these crossover operators. The number of times a crossover occurs for chromosomes in one generation is known as crossover rate. 100% crossover rate implies that each offspring is created by crossover. If it is 0%, then the whole new generation of individuals is to be entirely duplicated from the ancestral population, except those that were a result of the mutation process. The range of the crossover rate is between 0 and 1. In the current study, one point crossover has been used with the crossover rate of 0.5.
Mutation Operation: Mutation refers to a slight adjustment in off-spring chromosomes to obtain a new off-spring chromosome. This operation broadens the search space by inducing diversity in the genetic population. Various mutation operators can be used during this operation, viz. bit-flip mutation, inversion mutation, random resetting mutation, swap mutation, and scramble mutation. The proportion of chromosomes that are mutated in one generation is termed the mutation rate. The mutation rate ranges between 0 and 1. In the current study, inversion mutation has been used with a mutation rate of 0.5.
The procedure of GA starts with population initialization using random or heuristic algorithms. From this population, parent chromosomes (parameters) are selected for mating, and the value of the fitness function is calculated for these selected parents. Afterward, crossover and mutation operations are applied to the parent chromosomes to obtain off-spring chromosomes. Finally, these off-spring chromosomes replace the parent chromosomes in the population, and the process repeats until the termination criterion is reached. The termination criterion is a crucial part of GA, as it breaks the loop with the best solution. Some conditions that can be used as the termination criterion are: the maximum number of iterations is reached, the population size becomes equal to the number of chromosomes validated, or the best fitness value becomes equal to the mean of the fitness values over all iterations. Figure 5 depicts the basic structure of the GA.
Basic structure of GA.
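The loop described above can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in this study: chromosomes are assumed to be bit strings, the fitness function `toy_fitness` is a hypothetical stand-in for a classifier's accuracy, and the population size, number of generations, ranking base of 0.9, and seed are assumptions. The selection scheme (exponential ranking), one-point crossover with rate 0.5, and inversion mutation with rate 0.5 follow the choices stated in the text, and a maximum number of generations is used as the termination criterion.

```python
import random


def exponential_ranking_select(population, fitnesses, rng, base=0.9):
    """Pick one parent; selection weight decays exponentially with rank."""
    ranked = sorted(zip(fitnesses, population), key=lambda t: t[0])  # worst first
    n = len(ranked)
    weights = [base ** (n - 1 - i) for i in range(n)]  # best individual: weight 1
    return rng.choices([ind for _, ind in ranked], weights=weights, k=1)[0]


def one_point_crossover(a, b, rng):
    """Swap the tails of two parents at a random cut point."""
    point = rng.randint(1, len(a) - 1)
    return a[:point] + b[point:], b[:point] + a[point:]


def inversion_mutation(chrom, rng):
    """Reverse the order of the genes between two random positions."""
    i, j = sorted(rng.sample(range(len(chrom)), 2))
    return chrom[:i] + chrom[i:j + 1][::-1] + chrom[j + 1:]


def genetic_algorithm(fitness, n_bits=12, pop_size=20, generations=30,
                      crossover_rate=0.5, mutation_rate=0.5, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)  # best solution seen so far
    for _ in range(generations):         # termination: maximum iterations
        fitnesses = [fitness(ind) for ind in population]
        offspring = []
        while len(offspring) < pop_size:
            p1 = exponential_ranking_select(population, fitnesses, rng)
            p2 = exponential_ranking_select(population, fitnesses, rng)
            if rng.random() < crossover_rate:
                c1, c2 = one_point_crossover(p1, p2, rng)
            else:
                c1, c2 = p1[:], p2[:]
            for child in (c1, c2):
                if rng.random() < mutation_rate:
                    child = inversion_mutation(child, rng)
                offspring.append(child)
        population = offspring[:pop_size]  # off-spring replace the parents
        best = max(population + [best], key=fitness)
    return best


def toy_fitness(chrom):
    """Hypothetical fitness: number of 1-bits (stand-in for accuracy)."""
    return sum(chrom)


best = genetic_algorithm(toy_fitness)
print(best, toy_fitness(best))
```

In a parameter-tuning setting, each chromosome would instead encode classifier parameter values, and the fitness function would train the classifier with those values and return its validation accuracy.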
Confusion matrix.
Performance evaluation metrics
In the current study, the results of the plant disease prediction models developed using the proposed classifiers are evaluated using the Accuracy, AUC, and F1-Score (F) metrics, which are discussed in Subsections 3.10.1, 3.10.2, and 3.10.3, respectively. All the performance metrics are computed from a table known as the ‘confusion matrix’ [50]. For a binary class dataset, the confusion matrix is a 2 × 2 matrix, as shown in Fig. 6. In this matrix, the correct and incorrect predictions for each class are represented by their counts in the form of True Negative (TN), True Positive (TP), False Positive (FP), and False Negative (FN). TP is the case when the original class of a data sample is 1 (Positive Class), and the predicted class is also 1 (Positive Class). FN is the case when the original class of a data sample is 1 (Positive Class), but the predicted class is 0 (Negative Class). TN is the case when the original class of a data sample is 0 (Negative Class), and the predicted class is also 0 (Negative Class). FP is the case when the original class of a data sample is 0 (Negative Class), but the predicted class is 1 (Positive Class).
Accuracy
The accuracy of a classifier is defined as the percentage of correctly classified (predicted) samples [51]. It covers both the positive and the negative classes. Equation (12) shows the formula for the accuracy metric:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100 \tag{12}$$
Area under ROC curve.
Area under ROC curve (AUC)
AUC is a significant performance metric used for the evaluation of various prediction models. It measures the area under the ROC curve, which is plotted with the False Positive Rate (FPR) on the x-axis and the True Positive Rate (TPR) or recall on the y-axis, as shown in Fig. 7. Equations (13) and (14) present the formulas for FPR and TPR, respectively. A higher value of AUC signifies better prediction capability of a classification model [52].

$$FPR = \frac{FP}{FP + TN} \tag{13}$$

$$TPR = \frac{TP}{TP + FN} \tag{14}$$
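As a minimal sketch of how the area under the ROC curve can be obtained, the Python code below sweeps a decision threshold over the predicted scores, records the (FPR, TPR) point after each distinct score, and integrates the curve with the trapezoidal rule. The labels and scores are hypothetical values, and the function names are assumptions for the example; positive samples are encoded as 1 and negative samples as 0.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) points obtained by lowering the threshold score by score."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    ordered = sorted(zip(scores, y_true), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ordered):
        j = i
        # Process all samples that share the same score together.
        while j < len(ordered) and ordered[j][0] == ordered[i][0]:
            if ordered[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical labels (1 = positive class) and classifier scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc(roc_points(y_true, scores)))
```

For this toy input, eight of the nine positive-negative pairs are ordered correctly by the scores, so the area equals 8/9; a classifier that ranks every positive sample above every negative sample reaches an AUC of 1.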
F1-score (F)
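F is also one of the important performance metrics used for the analysis of classification models [51]. It is the harmonic mean of the TPR (recall) (Eq. (14)) and Precision (Eq. (15)), as formulated in Eq. (16).

$$\text{Precision} = \frac{TP}{TP + FP} \tag{15}$$

$$F = \frac{2 \times \text{Precision} \times TPR}{\text{Precision} + TPR} \tag{16}$$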
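The confusion-matrix-based metrics can be computed with a few lines of Python. The sketch below uses the standard definitions of accuracy, precision, recall (TPR), FPR, and F1-score, with accuracy returned as a fraction rather than a percentage; the label vectors are hypothetical and the function names are assumptions for the example.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels, 1 = positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def evaluate(tp, tn, fp, fn):
    """Accuracy, precision, recall (TPR), FPR, and F1 from the counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "fpr": fp / (fp + tn),
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical true and predicted labels for eight test samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
results = evaluate(*counts)
print(counts, results)
```

The sketch does not guard against empty denominators (for example, a classifier that never predicts the positive class), which a complete implementation would need to handle.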
Statistical tests
The current study uses the Friedman Test along with the Nemenyi post-hoc analysis to estimate the predicting capability of the proposed classifiers.
Friedman test
The Friedman test is a non-parametric statistical test that ranks different techniques based on the differences among their performances [53]. It is a distribution-free test, used here to rank the Hybrid SVM-GA-DT, Hybrid SVM-GA-ELM, and Hybrid SVM-GA-RF classifiers for each plant disease dataset. Before applying the Friedman test, the following hypotheses have been formed:
Null Hypothesis ($H_0$): There is no notable difference among the performances of the different hybrid classifiers used during this experimentation.
Alternative Hypothesis ($H_a$): There exists a notable difference among the performances of the different hybrid classifiers used during this experimentation.
The formula shown in Eq. (17) is used to evaluate the Friedman measure:

$$\chi_F^2 = \frac{12N}{k(k+1)} \left[ \sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4} \right] \tag{17}$$

where $k$ indicates the number of classifiers considered for the ranking, $N$ represents the number of datasets, and $R_j$ stands for the average rank of the $j$th participant classifier. After placing the values of these parameters in Eq. (17), the resultant value, i.e., $\chi_F^2$, is compared with the tabulated value $\chi_{tab}^2$ from the chi-square distribution table. If $\chi_F^2$ lies in the critical region, which means that $\chi_F^2$ is greater than $\chi_{tab}^2$, the null hypothesis is rejected and the alternative hypothesis is accepted. Otherwise, the null hypothesis is accepted and the alternative hypothesis is rejected. Subsequently, Friedman’s Individual Rank (FIR) formula is also used to rank the participant classifiers, as shown in Eq. (18):

$$FIR = \frac{CR}{N} \tag{18}$$

In Eq. (18), $N$ stands for the number of datasets, while $CR$ indicates the cumulative rank of a classifier over all datasets. Here, the classifier with the lowest value of FIR is considered the best performer, whereas the classifier with the highest value of FIR is taken as the worst performer. If the mean ranks of the classifiers obtained from the FIR formula are found to differ notably, it becomes essential to apply the Nemenyi test to investigate whether the differences in mean ranks are statistically significant.
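The ranking procedure can be made concrete with a short Python sketch. It assumes the commonly used average-rank form of the Friedman statistic, 12N/(k(k+1)) · [Σ R_j² − k(k+1)²/4], where k is the number of classifiers, N the number of datasets, and R_j the average rank of classifier j, with rank 1 given to the best score on each dataset and tied scores sharing the mean of their ranks. The score table is entirely hypothetical (it is not a result of this study), and the 0.05 significance level used for the comparison is an assumption.

```python
def average_ranks(scores):
    """scores[d][c] is the score of classifier c on dataset d.
    Rank 1 is the best (highest) score; ties share the mean rank."""
    k = len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda c: -row[c])
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            mean_rank = (i + j) / 2 + 1
            for t in range(i, j + 1):
                totals[order[t]] += mean_rank  # cumulative rank
            i = j + 1
    # Cumulative rank divided by the number of datasets.
    return [total / len(scores) for total in totals]


def friedman_statistic(avg_ranks, n_datasets):
    """Chi-square form of the Friedman statistic from average ranks."""
    k = len(avg_ranks)
    return (12 * n_datasets / (k * (k + 1))
            * (sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4))


# Hypothetical accuracies of three classifiers (columns) on five datasets (rows).
scores = [
    [0.80, 0.85, 0.90],
    [0.82, 0.81, 0.88],
    [0.75, 0.79, 0.83],
    [0.90, 0.86, 0.95],
    [0.70, 0.78, 0.81],
]
ranks = average_ranks(scores)
statistic = friedman_statistic(ranks, n_datasets=len(scores))
# 5.991 is the chi-square critical value for 2 degrees of freedom at 0.05.
print(ranks, statistic, statistic > 5.991)
```

With these made-up scores the statistic exceeds the critical value, so the null hypothesis of equal performance would be rejected and a post-hoc test would follow.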
Nemenyi test
The Nemenyi test is a post-hoc test that compares the performance of various classifiers to find out whether any statistically significant difference exists among them [54]. First, the Critical Difference (CD) is computed using Eq. (19) from $k$ (the number of classifiers) and $N$ (the number of datasets):

$$CD = q_{\alpha} \sqrt{\frac{k(k+1)}{6N}} \tag{19}$$

In the above equation, $q_{\alpha}$ stands for the critical value, obtained from Demsar’s studentized range statistics [55]. Subsequently, the difference between the mean ranks is calculated for every pair of classifiers. If a pair has a rank difference greater than or equal to CD, it can be inferred that the performances of the two classifiers in that pair differ significantly. Otherwise, the difference in performance of that pair is not considered statistically significant.
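The critical difference and the pairwise comparison can be sketched in Python as follows. The sketch assumes the usual form CD = q_α · sqrt(k(k+1)/(6N)) with k classifiers and N datasets. The value q_α = 2.343 is the tabulated critical value for three classifiers at a significance level of 0.05; that significance level, the classifier labels "A", "B", "C", and their mean ranks are hypothetical inputs chosen for the illustration, not results of this study.

```python
import math
from itertools import combinations


def critical_difference(q_alpha, k, n_datasets):
    """Nemenyi critical difference for k classifiers and N datasets."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n_datasets))


def significant_pairs(mean_ranks, cd):
    """Pairs of classifiers whose mean ranks differ by at least CD."""
    return [(a, b) for a, b in combinations(sorted(mean_ranks), 2)
            if abs(mean_ranks[a] - mean_ranks[b]) >= cd]


# Hypothetical mean ranks of three classifiers over five datasets.
mean_ranks = {"A": 2.6, "B": 2.4, "C": 1.0}
cd = critical_difference(q_alpha=2.343, k=3, n_datasets=5)
pairs = significant_pairs(mean_ranks, cd)
print(round(cd, 4), pairs)
```

In this toy input only the pair ("A", "C") has a rank difference of at least CD, so it is the only pair whose performance would be declared significantly different.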
Proposed approach
The proposed approach introduces a robust and efficient methodology for obtaining a noise-free classifier for plant disease prediction. The noise here refers to class label noise, where data samples are labeled with wrong classes due to insufficient information, encoding errors, lack of understanding, or experts’ mistakes. The class label noise present in a plant disease dataset can adversely affect the classification performance of a forecasting model. Thus, one of the objectives of this study is to find a noise-free prediction model for plant disease forecasting. A diagrammatic representation of the proposed approach is presented in Fig. 8. R-Studio Version 1.1.463 has been used for the implementation of the proposed approach. The implementation has been carried out under the guidance of various experts from the agriculture and data science domains. In the initial phase of this study, i.e., during data collection and pre-processing, great attention has been paid to integrating the domain knowledge of several experts from the Indian Agricultural Research Institute (IARI). In addition, the advice of various professors of Guru Gobind Singh Indraprastha University has been taken during the development of the hybrid classifiers for plant disease prediction. The overall methodology of this study is described in the following subsections.
Fig. 8. Block diagram of the proposed approach.
Data preprocessing
This phase of the proposed approach includes two steps. The first step balances the plant disease datasets. This addresses the class imbalance problem of the selected datasets, which in turn improves the prediction capability of the forecasting models. The second step divides each dataset into training and test data for developing and testing the prediction models. Since the current study focuses on the weather-based detection of plant diseases using machine learning algorithms, five plant disease datasets, viz. TPMD, SB-ALSD, SB-PRD, SB-FLSD, and SB-BSD, have been selected for conducting this study. All these datasets exhibit a wide gap between the number of samples in the majority and minority classes. The imbalance ratio [56] (size of the majority class divided by the size of the minority class) for each plant disease dataset is given in Table 3. The five datasets, i.e., TPMD, SB-PRD, SB-FLSD, SB-BSD, and SB-ALSD, have imbalance ratios of 8.04, 6.76, 6.51, 6.42, and 6.51, respectively. A well-known resampling technique, ROS, has been used in the current study to balance these datasets. The class distribution of these datasets before and after applying ROS is also shown in Table 3. Afterward, each dataset has been divided into training and test sets in a 70:30 ratio. Table 4 shows the number of samples in the training and test sets for each of the five datasets.
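Table 3. Distribution of classes before and after ROS for plant disease datasets

| Dataset | Imbalance ratio | C (before ROS) | NC (before ROS) | # of samples (before ROS) | C (after ROS) | NC (after ROS) | # of samples (after ROS) |
|---|---|---|---|---|---|---|---|
| TPMD | 8.04 | 27 | 217 | 244 | 217 | 217 | 434 |
| SB-PRD | 6.76 | 88 | 595 | 683 | 595 | 595 | 1190 |
| SB-FLSD | 6.51 | 91 | 592 | 683 | 592 | 592 | 1184 |
| SB-BSD | 6.42 | 92 | 591 | 683 | 591 | 591 | 1182 |
| SB-ALSD | 6.51 | 91 | 592 | 683 | 592 | 592 | 1184 |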
Table 4. Number of training and testing samples for each plant disease dataset

| Dataset | Total number of samples | # of samples in train-set | # of samples in test-set |
|---|---|---|---|
| TPMD | 434 | 303 | 131 |
| SB-PRD | 1190 | 833 | 357 |
| SB-FLSD | 1184 | 828 | 356 |
| SB-BSD | 1182 | 827 | 355 |
| SB-ALSD | 1184 | 828 | 356 |
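The two preprocessing steps can be sketched in Python as follows. This is an illustrative sketch rather than the authors' R code. The feature vectors are placeholders, only the TPMD class counts are taken from Table 3, and the function names and the unstratified shuffle are assumptions.

```python
import random


def imbalance_ratio(n_major: int, n_minor: int) -> float:
    """Size of the majority class divided by the size of the minority class."""
    return n_major / n_minor


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority samples until the classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        out_x.extend(xs + extra)
        out_y.extend([y] * target)
    return out_x, out_y


def split_train_test(samples, labels, train_percent=70, seed=0):
    """Shuffle once, then cut at the integer floor of the requested percentage."""
    rng = random.Random(seed)
    order = list(range(len(samples)))
    rng.shuffle(order)
    n_train = len(order) * train_percent // 100
    train = [(samples[i], labels[i]) for i in order[:n_train]]
    test = [(samples[i], labels[i]) for i in order[n_train:]]
    return train, test


# Placeholder feature vectors with the TPMD class counts from Table 3.
labels = ["C"] * 27 + ["NC"] * 217
samples = [[float(i)] for i in range(len(labels))]

balanced_x, balanced_y = random_oversample(samples, labels)
train, test = split_train_test(balanced_x, balanced_y)
print(round(imbalance_ratio(217, 27), 2), len(balanced_y), len(train), len(test))
```

With the TPMD counts, the sketch gives the same sizes as Tables 3 and 4: 434 samples after balancing, split into 303 training and 131 test samples.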
Noise removal using the AS-CLNR method
In this phase, the AS-CLNR method has been used to remove class label noise from the training data. In the AS-CLNR method, AST has been applied with SVM algorithm to obtain the positive and negative class probabilities. Subsequently, based on these probabilities, noise-minimized training data has been obtained using predicted class adjustment criteria, as shown in Table 5. Figure 9 shows the flowchart of the AS-CLNR method.
Fig. 9. Flowchart of the AS-CLNR method.
Detailed steps of AS-CLNR
The complete procedure of the AS-CLNR method is described as follows. Let us assume a noisy binary-class training dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$, where each sample $x_i$ in $D$ has $d$ attributes (i.e., $x_i \in \mathbb{R}^d$) and $y_i$ is the target value of $x_i$, with $y_i \in \{0, 1\}$. It is also assumed that, under the random classification noise model in which each label is flipped independently with some probability [57], class-specific noise rates $\rho_0$ and $\rho_1$ are associated with the data samples belonging to the 0 (negative) and 1 (positive) classes, respectively. Note that $\rho_0$ and $\rho_1$ are aggregated statistics of the corresponding noise rates of the individual data samples.
To obtain noise-free training data, a probabilistic classification model must be developed on the noisy dataset, in which the class label prediction for each data sample yields a posterior probability $P(y_i = 1 \mid x_i)$, as in Eq. (22). Here, the SVM classifier has been used to develop the probabilistic classification model. The posterior probability has been calculated using a radial kernel SVM based on Platt’s method [58], as shown in Eq. (23):

$$P(y_i = 1 \mid x_i) = \frac{1}{1 + \exp\left(A\, f(x_i) + B\right)}, \qquad f(x_i) = \sum_{j \in SV} \alpha_j K(x_j, x_i) + b \tag{23}$$

In the above equation, $SV$ is the support vector set, $K$ is the radial kernel, $\alpha_j$ and $b$ are the signed coefficients and the bias learned by the SVM, and $A$ and $B$ are the parameters of the sigmoidal link function that transforms the SVM’s output into a probability. In the proposed approach, AST has been used in conjunction with the SVM algorithm to develop a probabilistic model in the presence of class label noise. As illustrated in Fig. 10, AST wraps around an SVM model and iteratively updates the training dataset $D^{t}$ (where $t$ is the iteration number) by weighted sampling from $D$, with an updated probability of mislabeling for each data sample. AST stops according to the termination criterion of Eq. (24), which compares the predicted probabilities of successive iterations.
The termination criterion of Eq. (24) summarizes the predictions for all the data samples of $D$. The AST process terminates when the difference between the probabilities of the current iteration $t$ and the previous iteration $t-1$ is less than $\epsilon$. Here, $\epsilon$ has been set to 0.01. The final probabilities of the positive (1) and negative (0) classes (as per Eqs (25) and (26)), obtained after the final iteration of AST, have then been used to build the noise-minimized training set using the predicted class adjustment criteria shown in Table 5. Here, POS represents the probability of the positive class, whereas NEG represents the probability of the negative class.
Table 5. Predicted class adjustment criteria

| Comparison of probabilities | Class adjustment |
|---|---|
| POS > NEG | 1 (Conducive) |
| POS < NEG | 0 (Non-Conducive) |
| POS = NEG | No change (same as original training data) |
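The building blocks of this phase can be sketched in Python as follows. This is a hedged illustration, not the authors' R implementation. It assumes that the stopping rule is the mean absolute change in probabilities, that NEG equals 1 minus POS, and that the three criteria of Table 5 are POS greater than NEG, POS less than NEG, and POS equal to NEG. All function names are hypothetical.

```python
import math


def platt_probability(decision_value: float, a: float, b: float) -> float:
    """Sigmoid link that maps an SVM decision value to P(y = 1 | x)."""
    return 1.0 / (1.0 + math.exp(a * decision_value + b))


def has_converged(prev_probs, curr_probs, eps=0.01):
    """Assumed stopping rule: mean absolute change in probabilities below eps."""
    change = sum(abs(c - p) for c, p in zip(curr_probs, prev_probs)) / len(curr_probs)
    return change < eps


def adjust_labels(labels, pos_probs):
    """Relabel each sample by comparing POS with NEG (assumed to be 1 - POS)."""
    adjusted = []
    for y, pos in zip(labels, pos_probs):
        neg = 1.0 - pos
        if pos > neg:
            adjusted.append(1)   # Conducive
        elif pos < neg:
            adjusted.append(0)   # Non-Conducive
        else:
            adjusted.append(y)   # Tie: keep the original label
    return adjusted


# Toy labels and final positive-class probabilities (illustrative values only).
noisy_labels = [0, 1, 1, 0]
final_pos = [0.9, 0.2, 0.5, 0.5]
print(adjust_labels(noisy_labels, final_pos))
```

In this toy run the first two labels are flipped because the model disagrees with them, while the two tied samples keep their original labels.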
Fig. 10. AST framework for handling class label noise.
Optimization and validation of classifiers
This phase of the proposed method covers the optimization and validation of the hybrid classifiers, viz. Hybrid SVM-GA-ELM, Hybrid SVM-GA-DT, and Hybrid SVM-GA-RF. Each of these hybrid models combines three techniques. The SVM algorithm is used within the AS-CLNR method to minimize noise in the training datasets, as explained in Section 4.2. The ELM, DT, and RF classifiers are used for classification, and their parameters are optimized using GA.
Optimization of Hybrid SVM-GA-ELM
Three prominent parameters of ELM determine the performance of the Hybrid SVM-GA-ELM algorithm: the number of hidden layer neurons, the input weights, and the activation function. Here, the activation function has been set to the Rectified Linear Unit (ReLU), and the input weights have been randomly selected. The remaining parameter, i.e., the number of hidden layer neurons, has been optimized using GA. The optimization process for this parameter is described in Algorithm 1.
Algorithm 1: Optimization of Hybrid SVM-GA-ELM Classifier
Input: Population size, maximum number of iterations, lower bound, and upper bound
Output: Optimal value of the number of hidden layer neurons
Initialize: Set the input parameters for the number of hidden layer neurons as: population size = 50; maximum number of iterations = 100; lower bound = 30; upper bound = 2000
begin
1. Generate a random population of the number of hidden layer neurons within its lower and upper bounds.
2. Compute the fitness function value of each individual in the initial population.
3. On the basis of the fitness values, select the best-fit individuals, i.e., those with the maximum fitness.
4. Apply the crossover operator to generate a new population from parent chromosomes chosen by comparing their fitness.
5. Perform mutation on the new population by randomly changing the genes.
6. Compute the fitness for the new population.
7. Repeat steps 4–6 until the stopping criterion is reached.
8. Return the last optimized value of the number of hidden layer neurons.
end
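The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the authors' R implementation. The fitness function is a hypothetical stand-in for the model's validation score, and the tournament selection, blend crossover, and mutation rate are assumed operator choices that the study does not specify.

```python
import random


def toy_fitness(n_hidden: int) -> float:
    """Hypothetical stand-in for a validation score; peaks at 1200 neurons."""
    return -abs(n_hidden - 1200)


def tournament(population, fitness, rng):
    """Pick two random individuals and keep the fitter one."""
    a, b = rng.choice(population), rng.choice(population)
    return a if fitness(a) >= fitness(b) else b


def genetic_search(fitness, lower, upper, pop_size=50, max_iter=100,
                   mutation_rate=0.1, seed=0):
    rng = random.Random(seed)
    # Step 1: random initial population within the bounds.
    population = [rng.randint(lower, upper) for _ in range(pop_size)]
    # Steps 2-3: evaluate fitness and remember the fittest individual.
    best = max(population, key=fitness)
    for _ in range(max_iter):
        children = []
        while len(children) < pop_size:
            # Step 4: crossover of two fitness-selected parents (random blend).
            p1 = tournament(population, fitness, rng)
            p2 = tournament(population, fitness, rng)
            w = rng.random()
            child = round(w * p1 + (1.0 - w) * p2)
            # Step 5: mutation replaces the gene with a random value.
            if rng.random() < mutation_rate:
                child = rng.randint(lower, upper)
            children.append(min(upper, max(lower, child)))
        population = children
        # Step 6: re-evaluate fitness and keep the best value seen so far.
        candidate = max(population, key=fitness)
        if fitness(candidate) > fitness(best):
            best = candidate
    return best


# Population size, iterations, and bounds follow the settings of Algorithm 1.
best_hidden = genetic_search(toy_fitness, lower=30, upper=2000)
print(best_hidden)
```

The same search applies to Algorithms 2 and 3 by running it over the bounds of each DT or RF parameter, with the toy fitness replaced by the score of the corresponding classifier.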
Algorithm 2: Optimization of Hybrid SVM-GA-DT Classifier
Input: Population size, maximum number of iterations, lower bound, and upper bound
Output: Optimal values of maxdepth and minsplit
Initialize: Set the input parameters:
for maxdepth as: population size = 50; maximum number of iterations = 100; lower bound = 1; upper bound = 30
for minsplit as: population size = 50; maximum number of iterations = 100; lower bound = 1; upper bound = 20
begin
1. Generate a random population of maxdepth and minsplit within their lower and upper bounds.
2. Compute the fitness function value of each individual in the initial population of maxdepth and minsplit.
3. On the basis of the fitness values, select the best-fit individuals, i.e., those with the maximum fitness.
4. Apply the crossover operator to generate a new population from parent chromosomes chosen by comparing their fitness.
5. Perform mutation on the new population by randomly changing the genes.
6. Compute the fitness for the new population.
7. Repeat steps 4–6 until the stopping criterion is reached.
8. Return the last optimized values of maxdepth and minsplit.
end
Optimization of Hybrid SVM-GA-DT
The performance of the Hybrid SVM-GA-DT algorithm depends on three prominent parameters of DT, i.e., maxdepth (the longest path from the root node to a leaf node), minsplit (the minimum number of samples that a node must contain), and the split criterion. Here, the split criterion has been set to the Gini Index, whereas maxdepth and minsplit have been optimized using GA. The optimization process for these parameters is described in Algorithm 2.
Optimization of Hybrid SVM-GA-RF
The predictive capability of the Hybrid SVM-GA-RF algorithm depends on the parameters of RF, i.e., ntree (the number of trees), mtry (the number of attributes available for splitting at each node), and maxnode (the maximum number of nodes that a tree can contain). All of these parameters have been optimized using GA. The optimization process for these parameters is described in Algorithm 3.
Algorithm 3: Optimization of Hybrid SVM-GA-RF Classifier
Input: Population size, maximum number of iterations, lower bound, and upper bound
Output: Optimal values of ntree, mtry, and maxnode
Initialize: Set the input parameters:
for ntree as: population size = 50; maximum number of iterations = 100; lower bound = 100; upper bound = 1000
for mtry as: population size = 50; maximum number of iterations = 100; lower bound = 1; upper bound = 5
for maxnode as: population size = 50; maximum number of iterations = 100; lower bound = 2; upper bound = 16
begin
1. Generate a random population of ntree, mtry, and maxnode within their lower and upper bounds.
2. Compute the fitness function value of each individual in the initial population of ntree, mtry, and maxnode.
3. On the basis of the fitness values, select the best-fit individuals, i.e., those with the maximum fitness.
4. Apply the crossover operator to generate a new population from parent chromosomes chosen by comparing their fitness.
5. Perform mutation on the new population by randomly changing the genes.
6. Compute the fitness for the new population.
7. Repeat steps 4–6 until the stopping criterion is reached.
8. Return the last optimized values of ntree, mtry, and maxnode.
end
Afterward, the noise-minimized training data is used to construct plant disease classification models based on all three hybrid classifiers. 10-fold cross-validation is used throughout the training phase to reduce validation bias. This method splits the data samples into ten subsets, of which nine are used to train the model and the remaining one validates it. The procedure is repeated so that each subset serves once as the validation set, which yields a disease prediction for every data sample.
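The fold construction can be sketched in Python as follows. This is an illustration only. The function name is hypothetical, the shuffle is not stratified by class, and the sample count of 303 is the TPMD training-set size from Table 4.

```python
import random


def k_fold_indices(n_samples: int, k: int = 10, seed: int = 0):
    """Yield (training, validation) index lists so each fold validates exactly once."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # Deal the shuffled indices into k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


# 303 is the size of the TPMD training set reported in Table 4.
splits = list(k_fold_indices(303, k=10))
print(len(splits), sorted({len(v) for _, v in splits}))
```

Every sample appears in exactly one validation fold, so collecting the validation predictions of all ten folds gives one prediction per training sample.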
Performance evaluation and statistical analysis
After training, every prediction model is assessed using the testing data. Subsequently, the three performance metrics explained in Section 3.10, viz. Accuracy, AUC, and F1-Score, have been used to evaluate the performance. The variations in the performance of the hybrid classifiers have further been evaluated for significance using the statistical Friedman test. Later, the Friedman test results have also been validated using the Nemenyi test, as explained in Section 3.11.
Results and discussions
This section presents the experimental results obtained on the selected plant disease datasets. Initially, all the datasets have been balanced using the ROS technique. Subsequently, each balanced dataset has been divided into 70% training and 30% testing data. Plant disease prediction models have then been developed on the training data using the proposed Hybrid SVM-GA-ELM, Hybrid SVM-GA-DT, and Hybrid SVM-GA-RF classifiers, with 10-fold cross-validation during the training phase. In the Hybrid SVM-GA-ELM classifier, the number of hidden layer neurons has been searched in the range of 30 to 2000, and the Rectified Linear Unit has been used as the activation function of ELM. The DT parameters for Hybrid SVM-GA-DT comprise maxdepth, minsplit, and the split criterion. The values of maxdepth and minsplit have been searched in the ranges of 1 to 30 and 1 to 20, respectively, and the Gini Index has been used as the split criterion. Similarly, for Hybrid SVM-GA-RF, three RF parameters have been considered, viz. ntree, mtry, and maxnode, searched in the ranges of 100 to 1000, 1 to 5, and 2 to 16, respectively. Here, GA has been used to determine the optimal value of each parameter within its predefined range. Table 6 shows the optimized parameter values of each hybrid classifier for all the plant disease datasets.
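Table 6. Optimized parameters’ values of hybrid classifiers for plant disease datasets

| Dataset | Hybrid SVM-GA-ELM: hidden layer neurons | Hybrid SVM-GA-DT: maxdepth | Hybrid SVM-GA-DT: minsplit | Hybrid SVM-GA-RF: ntree | Hybrid SVM-GA-RF: mtry | Hybrid SVM-GA-RF: maxnode |
|---|---|---|---|---|---|---|
| TPMD | 1718 | 29 | 3 | 517 | 1 | 16 |
| SB-ALSD | 1582 | 10 | 4 | 782 | 5 | 15 |
| SB-PRD | 257 | 12 | 5 | 107 | 5 | 16 |
| SB-FELSD | 1029 | 6 | 7 | 483 | 3 | 16 |
| SB-BSD | 1310 | 11 | 13 | 420 | 5 | 16 |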
Further, Tables 7–9 provide the Accuracy, AUC, and F1-Score values of all three classifiers for each of the five plant disease datasets. It is clear from these tables that, for the SB-BSD dataset, Hybrid SVM-GA-DT performed the best (marked in bold) and Hybrid SVM-GA-ELM performed the worst (marked as underlined) with respect to all the performance metrics. In contrast, for all the other datasets, Hybrid SVM-GA-RF performed the best (marked in bold) and Hybrid SVM-GA-ELM performed the worst (marked as underlined).
Table 7. Accuracy of hybrid classifiers for plant disease datasets

| Techniques | TPMD | SB-ALSD | SB-PRD | SB-FELSD | SB-BSD |
|---|---|---|---|---|---|
| Hybrid SVM-GA-ELM | 93.13% | 85.39% | 93.84% | 79.49% | 86.48% |
| Hybrid SVM-GA-RF | 95.42% | 96.63% | 100% | 96.91% | 94.65% |
| Hybrid SVM-GA-DT | 94.66% | 95.79% | 99.44% | 96.35% | 95.49% |
Table 8. AUC value of hybrid classifiers for plant disease datasets

| Techniques | TPMD | SB-ALSD | SB-PRD | SB-FELSD | SB-BSD |
|---|---|---|---|---|---|
| Hybrid SVM-GA-ELM | 0.9340 | 0.8552 | 0.9402 | 0.8027 | 0.8675 |
| Hybrid SVM-GA-RF | 0.9552 | 0.9667 | 1 | 0.9695 | 0.9473 |
| Hybrid SVM-GA-DT | 0.9481 | 0.9581 | 0.9946 | 0.9642 | 0.9549 |
Table 9. F1-Score value of hybrid classifiers for plant disease datasets

| Techniques | TPMD | SB-ALSD | SB-PRD | SB-FELSD | SB-BSD |
|---|---|---|---|---|---|
| Hybrid SVM-GA-ELM | 0.9280 | 0.8680 | 0.9402 | 0.8094 | 0.8702 |
| Hybrid SVM-GA-RF | 0.9571 | 0.9655 | 1 | 0.9707 | 0.9471 |
| Hybrid SVM-GA-DT | 0.9497 | 0.9573 | 0.9945 | 0.9652 | 0.9565 |
Although all the classifiers have been successfully implemented for each plant disease dataset, it is still difficult to conclude which classifier should be used for plant disease prediction, because no single classifier is the best on every dataset. The results are inconclusive because of the dissimilar structure of the datasets. Hence, the performances of the hybrid classifiers are evaluated using the Friedman test based on the accuracy metric. The results of the Friedman test are shown in Table 10. The table indicates that the Hybrid SVM-GA-RF classifier is the best performer, with the lowest mean rank (marked in bold). This test helped to determine that there exists a classifier that can be considered the most appropriate and reliable technique for developing a plant disease prediction model. The favorable behavior of Hybrid SVM-GA-RF is due to the ensemble nature of the RF technique. RF is based on the bagging algorithm, which not only reduces the variance but also lessens the possibility of over-fitting, which in turn improves the accuracy of the prediction model. In terms of average accuracy across the five datasets, Hybrid SVM-GA-RF (96.72%) shows a relative improvement of 10.33% over Hybrid SVM-GA-ELM (87.67%) and of 0.39% over Hybrid SVM-GA-DT (96.35%).
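The improvement figures quoted above can be checked against the accuracies of Table 7 with a short Python sketch. It assumes that the percentages are relative improvements in mean accuracy, an interpretation that reproduces both reported values. The helper name is hypothetical.

```python
# Accuracies (%) from Table 7, ordered TPMD, SB-ALSD, SB-PRD, SB-FELSD, SB-BSD.
accuracy = {
    "Hybrid SVM-GA-ELM": [93.13, 85.39, 93.84, 79.49, 86.48],
    "Hybrid SVM-GA-RF": [95.42, 96.63, 100.0, 96.91, 94.65],
    "Hybrid SVM-GA-DT": [94.66, 95.79, 99.44, 96.35, 95.49],
}

mean_accuracy = {name: sum(vals) / len(vals) for name, vals in accuracy.items()}


def relative_improvement(new: float, old: float) -> float:
    """Percentage by which `new` exceeds `old`, relative to `old`."""
    return 100.0 * (new - old) / old


rf = mean_accuracy["Hybrid SVM-GA-RF"]
gain_over_elm = relative_improvement(rf, mean_accuracy["Hybrid SVM-GA-ELM"])
gain_over_dt = relative_improvement(rf, mean_accuracy["Hybrid SVM-GA-DT"])
print(round(gain_over_elm, 2), round(gain_over_dt, 2))
```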
Table 10. Mean ranking of classifiers on applying the Friedman test

| Techniques | Hybrid SVM-GA-ELM | Hybrid SVM-GA-RF | Hybrid SVM-GA-DT |
|---|---|---|---|
| Mean Rank | 3.00 | 1.20 | 1.80 |
The critical value for a significance level of 5% and 2 degrees of freedom ($k - 1$, where $k$ is the total number of classifiers used in this study) has also been evaluated using Eq. (17). The value of $\chi^2$ is read from the chi-square table corresponding to the 95% confidence level and 2 degrees of freedom, which gives 5.991. At the 0.05 level of significance, the Friedman statistic computed over the five plant disease datasets (8.400) lies in the critical region. Therefore, the null hypothesis is rejected in favor of the alternative hypothesis, and it is inferred that a significant difference exists between the performances of the hybrid classifiers. Table 11 shows the test statistics of the Friedman test applied to rank the classifiers based on the accuracy metric. After the Friedman test, post-hoc analysis using the Nemenyi test has been performed to check whether the differences between the performances of the hybrid classifiers, based on the FIR values, are statistically significant. The differences in classifiers’ ranks that are greater than or equal to CD are shown in bold in Table 12. It is observed that 1 out of 3 pairs (Hybrid SVM-GA-RF and Hybrid SVM-GA-ELM), i.e., 33.33% of the total pairs of classifiers, has a difference greater than or equal to CD. The performance of this pair is therefore significantly different according to the Nemenyi test, whereas the differences between the performances of all other pairs are not significant.
Table 11. Test statistics of Friedman test

| Statistical parameters | N | Chi-square | df |
|---|---|---|---|
| Value | 5 | 8.400 | 2 |
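The mean ranks of Table 10 and the chi-square value of Table 11 can be reproduced from the accuracies of Table 7 with the following Python sketch. It uses the standard Friedman chi-square formula, which is assumed to match the equation used in this study, and the function names are hypothetical.

```python
def rank_descending(scores):
    """Rank 1 goes to the highest score; tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        average = (pos + end) / 2.0 + 1.0
        for j in range(pos, end + 1):
            ranks[order[j]] = average
        pos = end + 1
    return ranks


def friedman(table):
    """table[i][j] is the score of classifier j on dataset i."""
    n, k = len(table), len(table[0])
    rank_rows = [rank_descending(row) for row in table]
    mean_ranks = [sum(row[j] for row in rank_rows) / n for j in range(k)]
    chi2 = 12.0 * n / (k * (k + 1)) * (
        sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4.0
    )
    return mean_ranks, chi2


# Accuracy (%) per dataset from Table 7; columns are ELM, RF, DT.
accuracy_table = [
    [93.13, 95.42, 94.66],   # TPMD
    [85.39, 96.63, 95.79],   # SB-ALSD
    [93.84, 100.0, 99.44],   # SB-PRD
    [79.49, 96.91, 96.35],   # SB-FELSD
    [86.48, 94.65, 95.49],   # SB-BSD
]

mean_ranks, chi2 = friedman(accuracy_table)
CHI2_CRITICAL = 5.991  # chi-square table value for alpha = 0.05 and df = 2
print([round(r, 2) for r in mean_ranks], round(chi2, 3), chi2 > CHI2_CRITICAL)
```

The sketch yields mean ranks of 3.00, 1.20, and 1.80 for ELM, RF, and DT and a chi-square value of 8.4, which exceeds the tabulated critical value and so supports rejecting the null hypothesis.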
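Table 12. Results of Nemenyi test (differences between mean ranks)

| Techniques | Hybrid SVM-GA-RF | Hybrid SVM-GA-DT | Hybrid SVM-GA-ELM |
|---|---|---|---|
| Hybrid SVM-GA-RF | – | 0.60 | 1.80 |
| Hybrid SVM-GA-DT | | – | 1.20 |
| Hybrid SVM-GA-ELM | | | – |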
Time complexity of proposed models
Time complexity describes the running time of an algorithm as a function of the size of its input. It is necessary to assess this complexity when selecting the best prediction model from the available models. In the current study, three hybrid models, i.e., Hybrid SVM-GA-ELM, Hybrid SVM-GA-DT, and Hybrid SVM-GA-RF, have been proposed to predict plant diseases using various plant disease datasets. The time complexity of each of these models depends on the following three components:
1. The SVM algorithm, which has been used for noise reduction from the plant disease datasets. Its time complexity is a function of the size of the training dataset [59].
2. GA, which has been used for optimizing the parameters of the particular classifier. The time complexity of GA depends on the population size, the maximum number of generations (iterations), and the individual parameters whose optimal values have to be found. For Hybrid SVM-GA-ELM, this parameter is the number of hidden layer neurons. For Hybrid SVM-GA-DT, the parameters are maxdepth and minsplit, whereas ntree, mtry, and maxnode are the three parameters in the case of Hybrid SVM-GA-RF.
3. The classifiers, which have been used for the classification of plant diseases. In the current study, the ELM, DT, and RF classifiers have been used for predicting plant diseases. The time complexity of ELM depends on the number of hidden layer neurons and on N, the size of the training dataset [60]. The time complexities of the decision tree [61] and random forest [62] classifiers depend on the number of attributes, the size of the training data, and, for the random forest, the number of decision trees it contains.
Therefore, the overall time complexity of each of Hybrid SVM-GA-ELM, Hybrid SVM-GA-DT, and Hybrid SVM-GA-RF combines the complexities of these three components. Table 13 shows the time complexity and computation time of the proposed models. The average running (computation) times of Hybrid SVM-GA-ELM, Hybrid SVM-GA-DT, and Hybrid SVM-GA-RF are 1.20 seconds, 6.60 seconds, and 299.60 seconds, respectively, across the plant disease datasets. It is clear from Table 13 that Hybrid SVM-GA-ELM is the most time-efficient, whereas Hybrid SVM-GA-RF is the most time-consuming of the three proposed classifiers. However, in terms of the performance metrics and the statistical test (as shown in Tables 7 to 10), Hybrid SVM-GA-RF performed the best, whereas Hybrid SVM-GA-ELM performed the worst. These findings suggest a trade-off between the performance and the computational time of the proposed models: the best-performing model is the least time-efficient, and vice versa. Further, Fig. 11 shows the running time of the proposed methods on the TPMD, SB-ALSD, SB-PRD, SB-FELSD, and SB-BSD datasets.
Table 13. Time complexity and computation time of proposed models

| Datasets | Hybrid SVM-GA-ELM | Hybrid SVM-GA-DT | Hybrid SVM-GA-RF |
|---|---|---|---|
| TPMD | 2 Sec. | 1 Sec. | 33 Sec. |
| SB-ALSD | 1 Sec. | 7 Sec. | 205 Sec. |
| SB-PRD | 1 Sec. | 7 Sec. | 270 Sec. |
| SB-FELSD | 1 Sec. | 9 Sec. | 441 Sec. |
| SB-BSD | 1 Sec. | 9 Sec. | 549 Sec. |
| Avg. running time | 1.20 Sec. | 6.60 Sec. | 299.60 Sec. |
| Time complexity | | | |
Comparison with previous studies
In 1997, Guzman-Plazola developed a spray prediction model for tomato powdery mildew disease based on a well-known machine learning classifier, Linear Discriminant Analysis (LDA) [4]. His model was later validated on a sensor-based meteorological dataset, the TPMD dataset, which was collected by Bakeer et al. in 2013 [29]. In 2020, Bhatia et al. used the TPMD dataset to develop a prediction model with the ELM algorithm. They balanced the dataset using several resampling techniques, i.e., Random Under Sampling (RUS), Importance Sampling (IMPS), ROS, and SMOTE. Their results show that the IMPS-ELM approach outperformed all other techniques in terms of AUC and Accuracy [27]. In the same year, they also proposed a new Hybrid SVM-LR classifier for powdery mildew disease prediction using the TPMD dataset and analyzed its performance using three metrics: Accuracy, AUC, and F1-Score. These studies did not use any optimization technique or statistical test [28]. Table 14 compares the proposed methods with the existing Hybrid SVM-LR and IMPS-ELM algorithms in terms of Accuracy and AUC. It is evident from Table 14 that all the proposed classifiers, i.e., Hybrid SVM-GA-RF, Hybrid SVM-GA-ELM, and Hybrid SVM-GA-DT, perform better than the existing algorithms in terms of both Accuracy and AUC. The soybean datasets cannot be compared with previous research. To meet the requirements of the proposed approach, the multiclass SBL dataset has been divided into 19 binary-class soybean datasets, of which only four have been used in this research, and no previous study has used datasets of this type. Therefore, a fair comparison is not possible in this regard, and only TPMD has been considered for comparison.
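Table 14. Comparison with existing approaches

| Category | Techniques | Accuracy | AUC |
|---|---|---|---|
| Proposed Techniques | Hybrid SVM-GA-RF | 95.42% | 0.9552 |
| Proposed Techniques | Hybrid SVM-GA-ELM | 93.13% | 0.9340 |
| Proposed Techniques | Hybrid SVM-GA-DT | 94.66% | 0.9481 |
| Existing Techniques | IMPS-ELM | 89.91% | 0.8857 |
| Existing Techniques | Hybrid SVM-LR | 92.37% | 0.9270 |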
Fig. 11. Computation (running) time of proposed methods.
Conclusion and future scope
This study’s objective was to find a noise reduction-based classifier that can be used efficiently for plant disease prediction. An extensive comparison of three hybrid classifiers on five plant disease datasets was conducted, and the performances of these classifiers were compared using the Accuracy, AUC, and F1-Score metrics. The variation between these classifiers’ performances motivated the authors to use the Friedman test to check the statistical significance of the applied methods. Further, the Nemenyi test was conducted to find out whether a statistical difference exists between the performances of different pairs of hybrid classifiers. The following points highlight the main findings of this study:
The work concludes that, based on all the performance metrics, Hybrid SVM-GA-DT performed the best for the SB-BSD dataset, whereas for all the other datasets, Hybrid SVM-GA-RF was the best performer.
The Friedman test, applied to the accuracy metric across all five datasets, ranked the Hybrid SVM-GA-RF classifier as the best performer overall.
Finally, the post-hoc analysis showed that only 33.33% of the total pairs of hybrid classifiers exhibit a significantly different performance.
In the future, this study can be extended using more advanced resampling and noise reduction techniques. The authors are also collecting real-life plant disease datasets to validate this work further. Other evolutionary and meta-heuristic techniques can also be explored for better results. The proposed classifier would help farmers to analyze whether a specific weather condition can lead to diseases in plants, so that they can take the necessary preventive measures well in advance.
Acknowledgments
This work is financially supported by Department of Science and Technology (DST) under a project with reference number “DST/Reference.No.T-319/2018-19”. We are grateful to them for their immense support.
References
1.
Food & Agricultural Organization (FAO) of the United Nation, New standards to curb the global spread of plant pests and diseases, 2020. http://www.fao.org/news/story/en/item/1187738/icode.
2.
WeizhengS.YachunW.ZhanliangC. and HongdaW., Grading method of leaf spot disease based on image processing, in: 2008 International Conference on Computer Science and Software Engineering, Hubei, China, IEEE, December 12–14, 2008. doi: 10.1109/CSSE.2008.1649.
3.
Al BashishD.BraikM. and Bani-AhmadS., Detection and classification of leaf diseases using K-means-based segmentation and neural-networks-based classification, Information Technology Journal10(2) (2011), 267–275. doi: 10.3923/itj.2011.267.275.
4.
Guzman-PlazolaR.A., Development of a spray forecast model for tomato powdery mildew (Leveillula Taurica (Lev). Arn.), University of California, Davis, 1997.
5.
KaundalR.KapoorA.S. and RaghavaG.P.S., Machine learning techniques in disease forecasting: A case study on rice blast prediction, BMC Bioinformatics7(1) (2006), 485. doi: 10.1186/1471-2105-7-485.
6.
FuentesA.YoonS.KimS. and ParkD., A robust deep-learning-based detector for real-time tomato plant diseases and pests recognition, Sensors17(9) (2017), 2022. doi: 10.3390/s17092022.
7.
VermaS.BhatiaA.ChugA. and SinghA.P., Recent Advancements in Multimedia Big Data Computing for IoT Applications in Precision Agriculture: Opportunities, Issues, and Challenges, in: Multimedia Big Data Computing for IoT Applications, 2020, pp. 391–416. doi: 10.1007/978-981-13-8759-3_15.
8.
BhatiaA.ChugA. and SinghA.P., Plant disease detection for high dimensional imbalanced dataset using an enhanced decision tree approach, International Journal of Future Generation Communication and Networking13(4) (2020), 71–78. doi: 10.33832/ijfgcn.2020.13.4.07.
9.
BhatiaA.ChugA. and SinghA.P., Statistical analysis of machine learning techniques for predicting powdery mildew disease in tomato plants, International Journal of Intelligent Engineering Informatics9(1) (2021), 24–58. doi: 10.1504/IJIEI.2021.116087.
10.
SahuP.ChugA.SinghA.P.SinghD. and SinghR.P., Deep learning models for beans crop diseases: Classification and visualization techniques, International Journal of Modern Agriculture10(1) (2021), 796–812.
11.
SahuP.ChugA.SinghA.P.SinghD. and SinghR.P., Deep Learning Models for Crop Quality and Diseases Detection, in: Proceedings of the International Conference on Paradigms of Computing, Communication and Data Sciences, 2021, pp. 843–851.
12.
SahuP.ChugA.SinghA.P.SinghD. and SinghR.P., Implementation of CNNs for crop diseases classification: A comparison of pre-trained model and training from scratch, International Journal of Computer Science and Network Security (IJCSNS)20(10) (2020), 206. doi: 10.22937/IJCSNS.2020.20.10.26.
13.
ChaudharyA.KolheS. and KamalR., An improved random forest classifier for multi-class classification, Information Processing in Agriculture3(4) (2016), 215–222. doi: 10.1016/j.inpa.2016.08.002.
14.
HernándezM.A. and StolfoS.J., Real-world data is dirty: Data cleansing and the merge/purge problem, Data Mining and Knowledge Discovery2(1) (1998), 9–37. doi: 10.1023/A:1009761603038.
15.
DalalS., A comparative study and analysis on the classification of ECG signals, PhD Thesis, 2016.
16.
SabrolH. and KumarS., Intensity based feature extraction for tomato plant disease recognition by classification using decision tree, International Journal of Computer Science and Information Security14(9) (2016), 622.
17.
VishwakarmaV.P. and DalalS., Neuro-fuzzy hybridization using modified S membership function and kernel extreme learning machine for robust face recognition under varying illuminations, EAI Endorsed Transactions on Scalable Information Systems, 2020, 1–11. doi: 10.4108/eai.13-7-2018.163575.
18.
RameshS.HebbarR.NivedithaM.PoojaR.ShashankN. and VinodP.V., Plant disease detection using machine learning, in: 2018 International Conference on Design Innovations for 3Cs Compute Communicate Control (ICDI3C), 2018, pp. 41–45. doi: 10.1109/ICDI3C.2018.00017.
19.
DalalS. and VishwakarmaV.P., GA based KELM optimization for ECG classification, Procedia Computer Science167 (2020), 580–588. doi: 10.1016/j.procs.2020.03.322.
20.
RumpfT.MahleinA.-K.SteinerU.OerkeE.-C.DehneH.-W. and PlümerL., Early detection and classification of plant diseases with support vector machines based on hyperspectral reflectance, Computers and Electronics in Agriculture74(1) (2010), 91–99. doi: 10.1016/j.compag.2010.06.009.
21.
Zhong LiuL.ZhangW.Bao ShuS. and JinX., Image recognition of wheat disease based on RBF support vector machine, in: 2013 International Conference on Advanced Computer Science and Electronics Information (ICACSEI 2013), 2013, pp. 307–310. doi: 10.2991/icacsei.2013.77.
22.
ChungC.-L.HuangK.-J.ChenS.-Y.LaiM.-H.ChenY.-C. and KuoY.-F., Detecting bakanae disease in rice seedlings by machine vision, Computers and Electronics in Agriculture121 (2016), 404–411. doi: 10.1016/j.compag.2016.01.008.
23.
VermaS.ChugA. and SinghA.P., Prediction Models for Identification and Diagnosis of Tomato Plant Diseases, in: 2018 International Conference on Advances in Computing, Communications and Informatics (ICACCI), IEEE, 2018, pp. 1557–1563. doi: 10.1109/ICACCI.2018.8554842.
24.
VermaS.ChugA.SinghA.P.SharmaS. and RajvanshiP., Deep Learning-Based Mobile Application for Plant Disease Diagnosis: A Proof of Concept With a Case Study on Tomato Plant, in: Applications of Image Processing and Soft Computing Systems in Agriculture, IGI Global, 2019, pp. 242–271. doi: 10.4018/978-1-5225-8027-0.ch010.
25.
VermaS.ChugA. and SinghA.P., Application of convolutional neural networks for evaluation of disease severity in tomato plant, Journal of Discrete Mathematical Sciences and Cryptography23 (2020), 273–282. doi: 10.1080/09720529.2020.1721890.
26.
VermaS.ChugA. and SinghA.P., Exploring capsule networks for disease classification in plants, Journal of Statistics and Management Systems23(2) (2020), 307–315. doi: 10.1080/09720510.2020.1724628.
27.
BhatiaA.ChugA. and SinghA.P., Application of extreme learning machine in plant disease prediction for highly imbalanced dataset, Journal of Statistics and Management Systems23(6) (2020), 1059–1068. doi: 10.1080/09720510.2020.1799504.
28.
BhatiaA.A.ChugA. and SinghA.P., Hybrid SVM-LR classifier for powdery mildew disease prediction in tomato plant, in: 2020 7th International Conference on Signal Processing and Integrated Networks (SPIN-2020), IEEE, 2020, pp. 218–223. doi: 10.1109/SPIN48934.2020.9071202.
29.
BakeerA.R.T.Abdel-LatefM.A.E.AfifiM.A. and BarakatM.E., Validation of tomato powdery mildew forecasting model using meteorological data in egypt, International Journal of Agriculture Sciences5(2) (2013), 372. doi: 10.9735/0975-3710.5.2.372-378.
30.
DuaD. and GraffC., UCI Machine Learning Repository, 2017. Available at: http://archive.ics.uci.edu/ml.
31.
StekhovenD.J., missForest: Nonparametric missing value imputation using random forest, Astrophysics Source Code Library, 2015, ascl–1505.
32.
Batista G.E., Prati R.C. and Monard M.C., A study of the behavior of several methods for balancing machine learning training data, ACM SIGKDD Explorations Newsletter 6(1) (2004), 20–29. doi: 10.1145/1007730.1007735.
33.
He H., Bai Y., Garcia E.A. and Li S., ADASYN: Adaptive synthetic sampling approach for imbalanced learning, in: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IEEE, 2008, pp. 1322–1328. doi: 10.1109/IJCNN.2008.4633969.
34.
Chawla N.V., Bowyer K.W., Hall L.O. and Kegelmeyer W.P., SMOTE: Synthetic minority over-sampling technique, Journal of Artificial Intelligence Research 16 (2002), 321–357. doi: 10.1613/jair.953.
35.
Han H., Wang W.-Y. and Mao B.-H., Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning, in: International Conference on Intelligent Computing, Springer, 2005, pp. 878–887. doi: 10.1007/11538059_91.
36.
Zhang J. and Chen L., Clustering-based undersampling with random over sampling examples and support vector machine for imbalanced classification of breast cancer diagnosis, Computer Assisted Surgery 24(sup2) (2019), 62–72. doi: 10.1080/24699322.2019.1649074.
37.
Yang P., Ormerod J.T., Liu W., Ma C., Zomaya A.Y. and Yang J.Y.H., AdaSampling for positive-unlabeled and label noise learning with bioinformatics applications, IEEE Transactions on Cybernetics 49(5) (2018), 1932–1943. doi: 10.1109/TCYB.2018.2816984.
38.
Suykens J.A.K. and Vandewalle J., Least squares support vector machine classifiers, Neural Processing Letters 9(3) (1999), 293–300. doi: 10.1023/A:1018628609742.
39.
Huang G.B., Zhu Q.Y. and Siew C.K., Extreme learning machine: a new learning scheme of feedforward neural networks, in: 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No. 04CH37541), Vol. 2, IEEE, 2004, pp. 985–990. doi: 10.1109/IJCNN.2004.1380068.
40.
Vishwakarma V.P. and Dalal S., A novel approach for compensation of light variation effects with KELM classification for efficient face recognition, in: Advances in VLSI, Communication, and Signal Processing, Springer, Singapore, 2020, pp. 1003–1012. doi: 10.1007/978-981-32-9775-3_89.
41.
Dalal S., Vishwakarma V.P. and Sisaudia V., ECG classification using kernel extreme learning machine, in: 2018 2nd IEEE International Conference on Power Electronics, Intelligent Control and Energy Systems (ICPEICES), IEEE, 2018, pp. 988–992. doi: 10.1109/ICPEICES.2018.8897416.
Breiman L., Random forests, Machine Learning 45(1) (2001), 5–32. doi: 10.1023/A:1010933404324.
44.
Wong T.-T., Performance evaluation of classification algorithms by k-fold and leave-one-out cross validation, Pattern Recognition 48(9) (2015), 2839–2846. doi: 10.1016/j.patcog.2015.03.009.
45.
Goldberg D.E. and Holland J.H., Genetic algorithms and machine learning, Machine Learning 3(2–3) (1988), 95–99. doi: 10.1023/A:1022602019183.
46.
Dalal S. and Vishwakarma V.P., A novel approach of face recognition using optimized adaptive illumination-normalization and KELM, Arabian Journal for Science and Engineering 45 (2020), 9977–9996. doi: 10.1007/s13369-020-04566-8.
47.
Thierens D. and Goldberg D., Convergence models of genetic algorithm selection schemes, in: International Conference on Parallel Problem Solving from Nature, Springer, 1994, pp. 119–129. doi: 10.1007/3-540-58484-6_256.
48.
Shukla A., Pandey H.M. and Mehrotra D., Comparative review of selection techniques in genetic algorithm, in: 2015 International Conference on Futuristic Trends on Computational Analysis and Knowledge Management (ABLAZE), IEEE, 2015, pp. 515–519. doi: 10.1109/ABLAZE.2015.7154916.
49.
Gandhi S., Khan D. and Solanki V.S., A comparative analysis of selection scheme, International Journal of Soft Computing and Engineering 2(4) (2012), 131–134.
50.
Bhati B.S. and Rai C.S., Analysis of support vector machine-based intrusion detection techniques, Arabian Journal for Science and Engineering 45(4) (2020), 2371–2383. doi: 10.1007/s13369-019-03970-z.
51.
Haque R.U., Mridha M.F., Hamid M.A., Abdullah-Al-Wadud M. and Islam M.S., Bengali stop word and phrase detection mechanism, Arabian Journal for Science and Engineering 45 (2020), 3355–3368. doi: 10.1007/s13369-020-04388-8.
52.
Hong H., Naghibi S.A., Dashtpagerdi M.M., Pourghasemi H.R. and Chen W., A comparative assessment between linear and quadratic discriminant analyses (LDA-QDA) with frequency ratio and weights-of-evidence models for forest fire susceptibility mapping in China, Arabian Journal of Geosciences 10(7) (2017), 167. doi: 10.1007/s12517-017-2905-4.
53.
Friedman M., A comparison of alternative tests of significance for the problem of m rankings, The Annals of Mathematical Statistics 11(1) (1940), 86–92. doi: 10.1214/aoms/1177731944.
54.
Lessmann S., Baesens B., Mues C. and Pietsch S., Benchmarking classification models for software defect prediction: A proposed framework and novel findings, IEEE Transactions on Software Engineering 34(4) (2008), 485–496. doi: 10.1109/TSE.2008.35.
55.
Demšar J., Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research 7(Jan) (2006), 1–30.
56.
Ahlawat K., Chug A. and Singh A.P., Benchmarking framework for class imbalance problem using novel sampling approach for big data, International Journal of System Assurance Engineering and Management 10(4) (2019), 824–835. doi: 10.1007/s13198-019-00817-6.
57.
Natarajan N., Dhillon I.S., Ravikumar P.K. and Tewari A., Learning with noisy labels, in: Advances in Neural Information Processing Systems, Vol. 26, 2013, pp. 1196–1204.
58.
Platt J., Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods, Advances in Large Margin Classifiers 10(3) (1999), 61–74.
59.
Yang Y., Li J. and Yang Y., The research of the fast SVM classifier method, in: 2015 12th International Computer Conference on Wavelet Active Media Technology and Information Processing (ICCWAMTIP), IEEE, 2015, pp. 121–124. doi: 10.1109/ICCWAMTIP.2015.7493959.
60.
Iosifidis A., Tefas A. and Pitas I., On the kernel extreme learning machine classifier, Pattern Recognition Letters 54 (2015), 11–17. doi: 10.1016/j.patrec.2014.12.003.
61.
Sani H.M., Lei C. and Neagu D., Computational complexity analysis of decision tree algorithms, in: International Conference on Innovative Techniques and Applications of Artificial Intelligence, Springer, 2018, pp. 191–197. doi: 10.1007/978-3-030-04191-5_17.
62.
Roy S.S., Dey S. and Chatterjee S., Autocorrelation aided random forest classifier-based bearing fault detection framework, IEEE Sensors Journal 20(18) (2020), 10792–10800. doi: 10.1109/JSEN.2020.2995109.