Abstract
In the paper we will try to generate fuzzy rule based systems (FRBSs) that follow three objectives: accuracy, interpretability and robustness. Accuracy is based on the Number of Patterns that are Correctly Classified (NCP), interpretability is measured as the Number of Rules (NR) and Sum of the Rules Length (SRL) and Robustness is measured as Sum of the accuracy, NR and SRL Standard Deviations (Sum of SDs) in successive runs. The algorithms that have high quality results with low SDs and in each run, the output is not being different, are called robust algorithms. Our algorithm consists of two stages based on the Krill Herd (KH) evolutionary algorithm; in the first stage the candidate rules are generated intelligently so in the second stage, those rules will be selected that get us closer to our objectives. Stage 2 of our algorithm can be used as a post processing algorithm on other algorithms and converts those to robust algorithms. The results show that, our algorithm has zero SDs with high accuracy and we have been successful in improving those three objectives that were in conflict.
Keywords
Introduction
Since we have three objectives in conflict and improvement in each objectives, deteriorate the others, we have to use multi-objective methods that are able to perform trade-off between the objectives. Using of the multi-objective evolutionary algorithms (MOEAs) is one of the options. Since FRBSs are interpretable, we apply a MOEA to learn relevant FRBS that satisfy our objectives like [1–4].
In [5], a Genetic Based Machine Learning (GBML) method is suggested to assess the alternative project portfolios in a real-world financial services institution. In this study, two objectives are considered: maximizing the accuracy and minimizing the complexity. In [6], the rule bases are generated by exploiting a rule and condition selection strategy, which selects a reduced number of rules from candidate rules during the evolutionary process. Also in this study, the accuracy and complexity are major objectives. In [7], the authors suggest a novel objective measure to quantify the similarity of the rules. The other objectives in this study are average support value and accuracy. In [8], the authors suggest a subgroup discovery technique based on multi-objective evolutionary fuzzy systems that follow the accuracy and interpretability objectives on influenza A virus. In [9–12] authors proposed Genetic Algorithms (GA) that consists of two stages. Algorithms in the first stage generate candidate rules and in the second stage perform a selection from candidate rules; how to generate candidate rules and how to select the rules has significant impact on the end results.
In most FRBS learning studies, improving the accuracy, minimization of the NR and the SRL are major criteria. But in addition of these measures, we considered the Sum of SDs in consecutive runs as a new measure that other studies do not attention to these variances directly according to our knowledge. We call this approach, Robust Fuzzy Rule Base Learning (Robust-FRBL). Robustness is the one of the most important measures in data mining, because people can trust to mining algorithms in which results do not confuse people.
Our algorithm includes two stages. In the first stage, a binary krill Herd (binary KH) is applied in order to generate high quality candidate rules. In the generation process, binary KH uses a random pool that contains random rules (number of random rules is an input parameter), and so selects best combination of them in the stage 2. Stage 1 is performed Number of Runs (that is a input parameter) times and all resulted rules are accumulated in the Main Pool (MP). MP will not change and rules will maintain their information until stage 2. Mentioned information includes the Number of Correctly Classified Patterns (NCP) and the Number of Misclassified Patterns (NMP) that is used in the stage 2 of algorithm. In the stage 2, binary KH is applied again with the difference that, in this stage candidate rules are used instead of random rules. Binary KH in stage 2 will be applied to select best combination of the candidate rules that are generated in the stage 1. In this regards, the Rule Set Analyzer (RSA) and MP clustering ideas will be used in order to reduce output variance. An algorithm that, it’s output in successive runs does not vary, is called a robust algorithm. Because of the obtained accuracy in the first stage will never be worse in the second stage, even better accuracy is also more likely. Any other good algorithms can be used instead of the KH in the first stage and accuracy in stage 2 will be increased. Our main goal is to offer some heuristics that increases the accuracy of others algorithms and turns their algorithms to a robust algorithm. These heuristics can be implemented by another evolutionary algorithm, but because of swarm-genetic based nature of KH, binary selection of rules has performed well.
The paper is structured as follows: in the next section, the preliminaries are explained. Preliminaries include a description of datasets, fuzzy rules reasoning method and KH algorithm. In the Section 3, we explain our two-stage proposed method. Results of each stage are shown and analyzed in separated subsections. Comparison with others is demonstrated in Section 4 and in the Section 5, we offer our conclusion.
Preliminaries
In this section the preliminaries are explained. Preliminaries include a description of datasets, fuzzy rules reasoning method and KH algorithm.
Datasets
People in the medicine usually do not have a great knowledge of data mining. They need the output of a data mining tool to be near natural language, and to have high quality and robust, so that no confusion takes place. Because our method has a high quality and is interpretable and robust, we have chosen the medical data.
In the paper, four UCI medical datasets are used (PIMA, Breast cancer, Heart, CMC). We try to extract the fuzzy rules to determine the presence or absence of diabetes in a way to minimize variance, on the PIMA dataset. Despite the fact that many forms of heart diseases can be prevented or treated with healthy lifestyle choices, we will try to take a step in the early detection of the heart diseases. In fact, every human is likely to control his breeding and we try to extract mining fuzzy rules that suggest better contraceptive methods with low variance. Most doctors feel that early detection of breast cancer saves thousands of lives each year and we are happy to do our work in this area. Table 1 shows our datasets with their characteristics that are used in this paper. All of the features are numeric in the paper.
Fuzzy reasoning method
For n-dimensional pattern classification problems, the following fuzzy if–then rules are applied in the design of our robust algorithm:
To improve interpretability of fuzzy rules we used linguistic variables in this study. Variable X i has a linguistic set U = {VL, L, M, H, VH} that is illustrated in Fig. 1; each value of X i uniformly shows 1/5 of domain [0, 1]. We also used don’t care value as all of those values. The total number of if - then rules with n features is 6 n and we decided to select a best rule set from them. Selecting a rule set enlarges the search space and solving this problem with large space computationally is very expensive.
Our work is a c-class problem in the n-dimensional space with numeric attributes. Also we have m real instances X p = (Xp1, Xp2, … , X pn ) , p = 1, 2, 3, … m are given as training patterns.
In this classification system, the result C
j
and certainty grade of rule CF
j
will be calculated by the following: first we calculate the compatibility of each training pattern with the fuzzy if - then rule R
j
by the following then we calculate the whole compatibility grades for each class:
If there is more than one class with maximum value for βclass h (R
j
), we do not assign any consequent class for R
j
. Then we can calculate certainty grade of R
j
and for an input pattern X
p
= (Xp1, Xp2, …, X
pn
) the winner rule R
j
will be calculated by following:
Thus, we can identify the consequent class of input pattern with the aforementioned formula.
We apply Krill Herd (KH) [13] algorithm as our evolutionary algorithm that is suggested more recently. KH has high exploration and exploitation ability on the search space. The position of the krill individuals is formulated by three main actions: movement induced by the presence of other individuals, foraging activity, and random diffusion [13]. KH has the particle swarm nature with the genetic characteristics. We explain more about KH in the Section 3.1. We modify the KH to suggest a robust learning algorithm to learn high quality FRBSs. Unfortunately, the KH is offered to optimize the continuous problems and not applied in the discrete problems and data mining context, especially on learning FRBSs, until now. In this paper we propose a binary KH for extraction fuzzy rules in the medical domain according to the binary particle swarm optimization concepts [14] with innovation.
In this section we explain the behavior of KH and in the Section 3 we introduce a novel binary KH to generate fuzzy rule sets in which it follows our measures: Maximization of accuracy and minimization of SRL, NR, and standard deviations of these measures.
The general policy of the KH is that krills are moving toward food and spaces with more congestion. In the natural system, the fitness of each individual is supposed to be a combination of the distance from the food and from the highest density of the krill swarm [13]. The time-dependent position of an individual krill in 2D surface is explained by the following three main actions [15]: for ith krill we have
A. Movement induced by other krill individuals (N)
According to theoretical arguments, the krill tries to keep a high density and move due to their mutual effects [15]. The direction of motion induced, is calculated from local effect of swarm density, a target effect of swarm density, and a repulsive effect [15].
B. Foraging activity (F)
First, we define a virtual food and then we calculate the foraging motion in terms of two main effective parameters. The first one is the food location and the second one is the previous experience about the food location [13].
The food effect is defined in terms of its location. The center of food should be found at first and then try to formulate food attraction. This cannot be determined but can be estimated.
C. Random diffusion (D)
The physical diffusion of the krill individuals is considered to be a random process [13]. This motion can be express in terms of a maximum diffusion speed and a random directional vector. The following Lagrangian model is generalized to an n dimensional decision space:
After new positions of krill are determined, genetic operators (crossover and mutation) are applied. For more information about KH algorithm, especially the parameters descriptions, you can refer to [13]. It should be noted that everyone can be use own algorithm rather than KH algorithm in stage 1 and 2 of our proposed method that is suggested to extracting low number of the high quality rules in robust way.
This section consists of two subsections. In the first subsection we propose our approach to intelligently generate the candidate rules. Generating the candidate rules in our approach is a learning process by using binary KH. Binary KH runs several times in parallel, and each parallel thread gives a rule set as output. All of the output rules are accumulated in a pool that we call “Main Pool”. Rules of the MP are used in the second stage as candidate rules. Working of the candidate rule generating will be described and the results will be shown in the first subsection. In the second subsection binary KH will be used again for selecting the best rule set from the MP in which our objectives are satisfied. In the end of this subsection, the results are analyzed based on two important parameters (number of clusters and minimum NCP). The comparison with others is demonstrated in Section 4. The results show that we have been successful in simultaneously improving three criteria (accuracy, interpretability and robustness) that were in conflict.
Generating candidate rules (Stage 1)
Here, we will explain binary KH and describe that how MP is filled. Binary KH algorithm to generate high quality candidate rules is illustrated in the Fig. 2, step by step. In step 1, a Random Pool (RP) is created that contains a fixed number of fuzzy if-then rules for each class in a random way. Each antecedent in if-then rules accept one of the fuzzy linguistic terms as very-low, low, medium, high, very high and don’t care. Fuzzy linguistic terms are randomly distributed into rules in a way that don’t care probability is higher than others. In the step 2 krills are initialized. The number of krills dimension is equal to number of rules of the random pool. 1 for ith binary dimension of krill, shows that ith rule in random pool is used in the krill and 0 shows that ith rule in the random pool is not used in the krill. So, each krill shows a rule set. After initialization in step 3, krills are evaluated and NCP and NMP of the krills are determined. Each krill represents a rule set and each rule set has some rules and rules have different NCPs in current rule set; The event that takes place in step 4 (RSA step) is that krills are analyzed using NCPs. In analyzing of a rule set, these three groups of rules are eliminated: first, the rules that their NCPs are lower than min-ncp (min-ncp is the one of the important input parameters) on validation dataset. The second group is consisted of a percent of worst rules in the rule set (percent-of-elimination is another input parameter) on validation dataset. The worst rules are determined by sorting the rules by their fitness. Fitness is calculated as follows:
RSA eliminate N (N = percent-of-elimination * number of rules in rule set) worst rules. It is possible to have duplicated rules in the rule set, and so, the third group of rules that should be eliminated, are the rules that are duplicated. After eliminating these three groups of rules, rule set is evaluated again. After an elimination process, the found rule set changes and should be evaluated again since elimination of a rule affects the entire rule set. RSA will be run until no pruning occurs. In step 5, best krill is determined. Determination of the best krill has a huge effect on results and objectives. Many modes of the best krill detection are examined and finally, we decide to use a policy that says: the krill that has best fitness, is the best krill; if there are two krill with equal fitness, the krill have lower NR will be selected and if NRs of them are equal too, the krill has lower SRL will be selected and if SRLs are equal, one of them are selected as best krill in random way. This policy gives the best accuracy, interpretability and robustness from other policies like giving any weight to each of the objectives in classic multi-objectivemethods.
Induced Motion, Foraging Motion, Physical diffusion (6, 7 and 8) steps of approach are similar to KH, but krill values are binary (1 or 0). In step 9, genetic operators are used. Crossover and Mutation are applied on each dimension of the problem. In Crossover, the approach uses tournament selection [16] and for Mutation in each dimension, mutation probability in changing 1 to 0 is higher than changing a 0 to 1; with this policy we have low NR and SRL. In the last step, new position of the krill should be determined. In the binary KH, the positions are updated as in continuous version. The major difference between binary KH with continuous version is that X i (t + Δt) is normalized by sigmoid function, then a random number between [0.5, 1] (that is experimentally obtained) is generated, if sigmoid (X i (t + Δt)) is higher than random number, 1, and else 0 is applied. Binary KH for generating the candidate rules is performed several times in a parallel way and each thread of parallelism generates a rule set. Rule sets accumulated in the MP and are sent to second stage of our algorithm. We should not change NCPs of rules in MP, until the first step of Stage 2 is finished.
We set global parameters of binary KH for generating the candidate rules according to follows: Population Size = 50; Number of parallel threads = 30; Max-Iteration = 300; w-inertia = 0.5; D-max = 0.002; N-max = 0.01; V-Food = 0.02; P-crossover = 0.9. All of the parameters are chosen by trial and error. These parameters are explained in [13]. The parameter values after tuning by trial and error are listed in Table 2.
First column in Table 2 shows dataset name; the second column shows number of the training patterns; the third column shows number of validation patterns; the fourth column shows number of the test patterns and the fifth column shows minimum NCP for each individual’s rules where, I t is index of ith execution thread. I t makes it possible to have elected rules that come from entire search space, since in each parallel thread we have a different value for min-NCP. The sixth column shows percent of individual rules that should be eliminated; next column shows the number of random rules in the initial and three lost columns shows probabilities of the mutation. Probability of the cross over on all datasets is set to 0.8.
We run 30 threads of binary KH for generating candidate rules on each four dataset, and results that are obtained are shown in Table 3. A-SD, SRL-SD and NR-SD shows standard deviations of accuracy, sum of rules length (SRL) and number of rules (NR) respectively. The number of MP rules shows the number of candidate rules that are obtained from all threads.
By minimizing Sum of SDs, our robust algorithm gives robust results that people can trust on it. In fact, our main innovation is in this stage, and we suggest a robust method that get the MP and give a rule set as a result that not only has high accuracy and interpretability, but also has low variance on results.
Stage 2 description
In this subsection we explain our method and how it can select a rule set from MP, to satisfy our objectives. After finishing the first stage of the algorithm, MP is sent to the second stage. In second stage of our algorithm, QR (shows number of Qualified Rules that is an input parameter) high quality rules are selected and divided to NC (shows number of Clusters that is another input parameter) clusters. Binary KH is executed on these clusters in a manner which will be described. After sorting the MP by policy of Equation 7, we need to separate quality rules and remove low quality rules. We assume that QR is equal to 150 in all datasets. Mechanism of Stage 2 is illustrated in Fig. 3. MP is sorted by Equation 7 and in step 2, QR rules are selected, this selection procedure works as follows: let us assume that we have 500 candidate rules in the MP and 150 rules are to be selected. First 50 best rules are picked up from sorted rules, and then 450 remaining rules are added to 50 best rules, one at a time. Each resulting 51 rules are separately evaluated and 450 fitness values are resulted from 450 rule sets (each of 450 rules with first 50 rules establishing a rule set that has 51 rules). These fitness values are saved and sorted, so 50 rules from the 450 rules that have highest fitness value along with the first 50 rules, are picked. These new 50 rules are added to first 50 rules.
Again, the remaining 400 rules are added to 100 selected rules up to now, one at a time, and then the rule sets are evaluated and sorted. Then the 50 rules with highest fitness values are picked again. These 50 rules added to 100 selected rules. Picking mechanism continues until we have QR high quality rules. This mechanism has some benefits: after continuing this mechanism, first, we have high quality rules in terms of fitness values, and second, we have rules that come from different spaces of search space; in fact we pick up rules that classify patterns that have not been classified until now and this means that selected rules have low dependencies. This characteristic of picking mechanism has significant effect on our objectives. After picking QR rules, in step 3 rules are divided to NC clusters equally, in which first clusters contain first rules in QR selected candidate rules. In step 4 binary KH is used to select a rule set from first cluster. Binary KH in this stage is similar to stage 1 approach, except that in this stage candidate rules that are generated in stage 1 are applied rather than the random rules pool. The resulting rule set from step 5 (illustrated in Fig. 3) that have the best fitness grouped with results from step 4 are sent to the next step. In step 5, the selected rule set by binary KH, should have the best performance together with step 4 result. Finally in the last step, binary KH is performed on cluster NC (the last cluster) and selects the rule set grouped with the results from NC - 1 that has the best performance. The last result is evaluated with the test data set. Number of clusters (NC) and min-ncp are two important parameters that should be tuned for each dataset. The process illustrated in Fig. 3 is performed with different number of clusters (NC) and different values for min-ncp in the RSA step of binary KH. Following pseudo code is provided for demonstrating the parameter setting of Stage 2. (Fig. 4).
NC parameter contains the number of clusters values in each run of Stage 2. We have chosen values for NC that almost have all the cases of clustering. Min-ncp-value contains values for min-ncp parameter in RSA step of binary KH. Overall Stage 2 of our algorithm was run for 1450 times, 15 runs for each values of NC, in which in each run, 6 states of min-ncp are considered, and each combination of NC and min-ncp are executed 15 times for calculate SDs. In the end, we show significant results of these runs and the effect of NC and min-ncp will be analyzed.
Experimental results of the Stage 2
In this subsection, results of Stage 2 in 15 runs are shown. In Table 4, second column specifies how many clusters are applied and third column specify what min-ncp value is used. The fourth column shows average of the accuracy, and fifth column shows average of SRL, the sixth column shows average of NR, the penult column shows the average sum of SDs and last column shows average spent time for execution.
Results and discussion
Impact of RSA, Clustering, and MP on accuracy, NR, SRL and sum of SDs (SD of accuracy, NR and SRL) are analyzed in this section. Figure 5 contains 8 charts; in all charts vertical axis shows the number of used clusters and the value of min-ncp. For example 1_2 shows that one cluster is used and the value of min-ncp is equal to 2. In each chart, a straight line depicts a linear trend in the data by function y and correlation coefficient (R2) depicts dependency between vertical axis and horizontal axis. The charts are obtained after 15 runs on heart dataset.
Impact of RSA: charts a, b, c and d show the impact of RSA. We assume that we have one cluster and by changing in the value of min-ncp, changes are shown. The main impact of RSA is concluded on NR and SRL. Since RSA reduces the number of rules, subsequently the SDs are reduced too. Chart 7-a shows impact of RSA on NR. As can be seen, by increasing the value of min-ncp, the number of rules is reduced significantly. Chart 7-b, shows that by increasing in value of min-ncp, the sum of rules lengths are reduced significantly. The decreases in NR and SRL are due to removing the low quality rules in each iterations ofalgorithm.
Because of reducing in NR and SRL, normally our SDs are be reduced, and we show this event in chart 7-c. chart 7-d indicates that increasing in value of min-ncp, necessarily does not result in better accuracy. In order to obtain better accuracy, we have to find an appropriate value for min-ncp.
Impact of the clustering: charts e, f, g and h show the impact of the number of clusters. We assume that the values of min-ncp are equal to one and the charts represent the changing in the number of clusters, and accordingly the changes in results. The main impact of the number of clusters is decreasing in SDs. The number of clusters effect on SDs illustrated in chart 7-e. As can be seen, by increasing in the clusters the sum of SDs has been reduced. Chart 7-f shows that, with the increasing number of clusters, also somewhat accuracy have been increased since the search space is reduced and best rules are picked in the early selection process. Impact of the number of clusters on NR and SRL are shown in chart 7-g and 7-h. the charts show that more clusters cause less number of the rules and less sum of the rules length. It seems that, the fragmentation of MP is because of this event. In fact the search method in small search spaces searches better.
Impact of MP: keeping up accuracy besides of improving interpretability and robustness is one of the major benefits of MP. In fact generated the best rule sets in Stage 1 are accumulated in the MP as candidate rules and algorithm tries to select the best combination of candidate rules, and best rules result in best accuracy. Just by having MP, we are able to think about minimizing the SDs, NR and SRL.
By analyzing the effects of MP, RSA and clustering, we can say that we are close to our objectives. The robustness, interpretability and generating high quality rules that were contradicting each other, are improved, but this improvement has its costs, such as high complexity of the process and the time consumed.
Comparison with others
To compare our results with other studies, we have used KEEL software [30, 31]. We have downloaded the latest version of the software (V2014-01-29) from http://www.keel.es. 15 recent evolutionary rule learning algorithms are used, in which 5 of them learn crisp rules and 10 of them learn fuzzy rules in an evolutionary way. we tried to set the best parameters to them. Each algorithm on each dataset, has been run 15 times and so the accuracy average and the standard deviation of the accuracy is evaluated. The results are listed in Table 5. We have tried to apply the same parameters on KEEL algorithms as used in our method. We just compare our results in terms of accuracy and SD of accuracy, because the number of rules (NR) in KEEL algorithms is fixed parameter and is an input parameter. Since we set it ourselves, we do not compare KEEL algorithms with our algorithm, in aspects of NR and SRL. To set this parameter, we set it with the mean value of NR in our algorithm (Robust-EFRL). As you can see, our two-stage algorithm worked better than others. Our emphasis is that if there is a better rule learning algorithm that the results of it is better than our algorithm, if we use it in first stage of our algorithm, not only results will be improved but also that algorithm will be turned to a robust algorithm (as shown in Table 5 our results at the end of the algorithm are improved significintly in related to Stage 1 of our algorithm). In fact Stage 2 of our algorithm can be used as a post proccessing algorithm on each algorithm. Comparing results of the Stage 1 and Stage 2 of our algorithm shows that we have been successful in improving three objectives that were in conflict.
It was one of the main objectives of the paper to reduce the sum of SDs. This value equals to zero here. The generated rules are fuzzy, and their count and length is reasonable. The accuracy of classification in our method is more than other methods that compared here.
Conclusion
In the paper we tried to generate FRBSs that follow three major objectives: Accuracy, Interpretability and robustness. The rules are extracted by a robust method; results are analyzed and compared with other works. In compare with other algorithms, our algorithm is the best on each 4 datasets. On Breast dataset, the Robust-EFRL got % 99.22 ±0 accuracy with acceptable NR and SRL; on CMC dataset, the Robust-EFRL got % 55.33 ±0 accuracy with 18 rules; On heart dataset, the Robust-EFRL, got % 88.0 ±0 accuracy with 9 rules; On Pima dataset, the Robust-EFRL got % 80.82 ±0 with 12 rules that are acceptable. The results show that our algorithm has zero SDs with high accuracy and we have been successful in improving three objectives that were in conflict.
