Abstract
Recently, the topic of data mining has attracted considerable attention from both academia and industry. Data mining is the process of extracting useful knowledge from large amounts of data. Among the types of knowledge to be mined, classification knowledge is the most widely exploited in engineering applications. A variety of methods exist for inductive learning of crisp classification knowledge. This paper presents a new inductive learning algorithm called FuzzyRULES that extracts fuzzy classification rules from a database of examples. The use of fuzzy sets and fuzzy logic methods not only provides a powerful, flexible approach to handling vagueness and uncertainty, but also increases the expressive power and comprehensibility of the induced classification knowledge. An example involving the induction of process planning rules is used to illustrate the operation of FuzzyRULES. The algorithm has also been compared against conventional crisp and fuzzy rule induction algorithms on several benchmark data sets. The results obtained have shown that FuzzyRULES induces more compact and more accurate classification rule sets.
Introduction
Data mining is the process of discovering useful knowledge from data [1–3]. The goal is to analyse the data to reveal regularities and relationships that are non-trivial and previously unknown. Data mining consists of a series of steps including data preparation, data modelling and model evaluation. The most important step in data mining concerns applying appropriate mining algorithms to the prepared data. Classification learning is the most common data mining technique. It employs a set of pre-categorised instances to develop a model that can classify new instances from the same population. Classification is a problem frequently encountered in engineering [4–8]. Among the various learning approaches developed for classification, inductive learning is perhaps the most commonly adopted in real-world applications [9, 10]. Inductive learning techniques are simple and fast. Another advantage is that they generate models that are easy to understand. Finally, inductive learning classifiers are more accurate compared with other classification techniques.
Inductive learning techniques can be divided into two main categories, namely, decision tree induction and rule induction. There are a variety of algorithms for building decision trees [11–13]. These learningsystems are categorised as “divide-and-conquer” inductive systems. The knowledge induced by these systems is represented as decision trees which can be transformed to a set of If-Then rules. As with decision tree learning, there are many rule induction algorithms [14–17], which can be categorised as separate-and-conquer inductive systems. Unlike decision tree learning, rule induction directly generates If-Then rules.
Fuzzy logic has been increasingly used in data mining systems performing classification, due to its similarity with some aspects of human reasoning [18, 19]. A key feature of fuzzy logic models is that they can handle vagueness and uncertainty. Fuzzy logic allows one to deal with fuzzy and high-level notions (e.g. being small) while processing low-level information (e.g. size is 84.5 mm) [20].
Various learning methods have been developed for inducing fuzzy decision trees [21–28]. Fuzzy rules can be indirectly learned by converting fuzzy decision trees to fuzzy rule sets [29], or by extracting fuzzy rules from neural networks [30–35]. Genetic algorithms [36–40] and Bayesian classifiers [41–43] were also employed to obtain fuzzy rule sets.
There exist very few fuzzy rule learners that directly induce fuzzy classification rules from data [44–48]. Hühn and Hüllermeier [47] introduced a fuzzy rule-based classification method called FURIA, which extends the RIPPER crisp rule induction algorithm [15]. FURIA learns fuzzy unordered rule sets and employs a rule stretching method to deal with the “no match” problem that occurs when using the induced rules to infer the class of a new instance. Experimental results demonstrated that FURIA significantly outperforms the original RIPPER, as well as other classifiers such as C4.5 [12], in terms of classification accuracy. However, the improved performance of FURIA comes at the price of an increase in the execution time. One problem with FURIA is that it first generates crisp rules and then fuzzifies these rules by replacing the crisp intervals of the numerical attributes appearing in the original rules by fuzzy intervals. Therefore, much work is wasted in learning crisp rules and subsequently fuzzifying these rules. Also, fuzzifying rules increases the number of covered instances, which entails the repetition of the whole induction process. This is because, in rule learning, the induction of the second rule is based on the instances remaining after the removal of the instances covered by the first rule. If the first rule is to be fuzzified, this would affect the induction of the second rule. Another problem is that FURIA fuzzifies rules in a greedy fashion, and uses the rule purity metric to measure the quality of fuzzification. Greedy search does not guarantee to find the best fuzzification of each rule. The purity metric obtains its optimal values when no negative instances are covered. Also, it does not aim to cover many positive instances. As a result, the purity metric tends to select very specific rules covering only a small number of instances. This is undesirable since rules covering few instances are unreliable, especially where there is noise in the data.
RULES-F Plus is a fuzzy rule induction algorithm for continuous output handling that is based on the structure of the RULES-5 Plus crisp rule learner [16, 48]. RULES-F Plus iteratively employs a specific rule-forming process in order to create a complete rule set covering all instances. One of the main problems of RULES-F Plus is that it learns a complete and consistent rule set that tries to cover all of the positive and none of the negative training instances. In the case of noisy data, specific rules based on a few training instances tend to be selected. Although these rules perform perfectly on the training instances, their predictive accuracy on future test instances is often lower because rules formed on the basis of small numbers of instances are susceptible to noise. Consequently, the rule set is both large and not of the highest accuracy.
This paper introduces a novel algorithm, FuzzyRULES (Fuzzy RULe Extraction System), for the induction of fuzzy rules based on the principles of fuzzy sets and fuzzy logic. The use of fuzzy sets increases the expressiveness of the rules, allowing semantically more natural conditions to be described. Furthermore, fuzzy sets provide a natural mechanism for dealing with continuous data. Finally, they offer a mechanism for reasoning in the presence of uncertainty, vagueness and ambiguity. One of the advantages of FuzzyRULES compared with most existing fuzzy induction algorithms is that it integrates the fuzzy set techniques into the rule induction process and allows the automatic creation of membership functions. FuzzyRULES also employs a fast and noise-tolerant search method for fuzzy rules generation. Finally, it uses simple and effective methods for fuzzy rule evaluation. The outcome is an efficient algorithm that produces accurate and compact fuzzy rule sets.
The remainder of this paper is organised as follows. Section 2 reviews some basic concepts of fuzzy logic relevant to fuzzy rule generation. Section 3 describes the proposed FuzzyRULES algorithm. An example illustrating the learning process of FuzzyRULES is described in Section 4. Section 5 presents an experimental evaluation of FuzzyRULES. Conclusions and future work are provided in Section 6.
Basic fuzzy set concepts
Introduced by Zadeh [49], fuzzy logic enables the creation of mathematical models for domains with boundaries that are not sharp. This section covers the basic concepts of fuzzy sets needed to develop the FuzzyRULES algorithm.
Let U be the universe of discourse of a variable u. A fuzzy set A in U is defined by a membership function μ A (u) ∈ [0, 1] which describes the degree to which u belongs to the fuzzy set A. For u ∈ U, μ A (u) =0 means that u does not belong to the fuzzy set and μ A (u) ∈ ]0, 1] means that u belongs to the fuzzy set with a degree of membership μ A (u).
Let A and B be two fuzzy sets in U with membership functions μ A (u) and μ B (u), respectively. The following fuzzy operators and measures can be defined [50–52].
FuzzyRULES is based on the learning strategy used in the RULES-6 crisp rule induction algorithm [53], previously developed by the author. RULES-6 belongs to the RULES family of simple inductive learning algorithms which have proven useful in several manufacturing and engineering applications [54–56]. This section describes the proposed FuzzyRULES algorithm and highlights its distinctive features.
Representation
FuzzyRULES extracts fuzzy If-Then rules directly from a set of instances called the training set T. Each instance I is described by its nominal class value C I and by a vector of n a attributes (A1, …, A i , …, A n a ). Each attribute value in instance I is either nominal or continuous. An instance I can therefore be formally defined asfollows:
A fuzzy rule R is described by a conjunction of conditions on each attribute () and the value (C t ) of an associated target class (the class to be learned). It can be represented as follows:
For the ith attribute A i , a fuzzy condition has the form [A i is ], where is the linguistic value (term) of the ith attribute in rule R. Each linguistic value represents a fuzzy set for which every instance in the training set has a corresponding membership value. For continuous attributes, linguistic values can be obtained by defining membership functions on the domains of the attributes. Nominal attributes values can be converted to linguistic terms by crisp functions with membership degrees of either 0 or 1.
Due to the notion of fuzziness, an instance I will have a particular “degree of match” (μ R (I)) with each rule R. The degree of match is obtained by first assessing the membership degree of each attribute value () of the instance with regard to the corresponding linguistic term () in a rule R. Using these membership degrees, μ R (I) can be computed using the intersection operator (Equation (2)) as follows:
During the rule induction process, an evaluation function is used to score rules directly based on the membership of instances to them, and this has an important effect on the search for good rules. In practice, many of these instances can match a rule antecedent to a small degree, and the summation of all these small degrees of membership can give the false impression that the rule is good, which can undermine the reliability of the evaluation function. To prevent such instances from being covered, a user-defined α-cut threshold can be applied to the instance memberships. The membership of an instance to a rule, μ R (I), is defined to be zero when it lies below the α-cut threshold value. After the generation of rules, defuzzification must be performed when the rules are used to classify instances. For this process, the α-cut threshold can be applied again, i.e. a rule fires when the antecedent membership is above the threshold value.
When the value of μ
R
(I) is greater than or equal to the specified value of α, rule R is said to α-cover instance I. The set of instances in the training set T that are α-covered by a rule R can be defined using the α-cut operator (Equation (3)) as follows:
A rule set is a disjunction of a number of rules {R1, …… , R l , …… , R n r }, and is denoted as RS = {R1 ∨ …… ∨ R l ∨ …… ∨ R n r } . The degree of cover of instance I by the rule set RS can thus be evaluated using the union operator (Equation (1)) as follows:
A simplified description of the FuzzyRULES algorithm is given in Fig. 1. The algorithm induces fuzzy rules in an iterative fashion. In each iteration, it takes a “seed” instance not covered by previously created rules to form a new rule, as outlined in Fig. 2. Having found a rule, FuzzyRULES marks those instances that are α-covered by the rule and appends the new rule to its rule set. This process continues until all instances in the training set are covered. Note that the instances covered by previously formed rules are only marked in order to stop FuzzyRULES from repeatedly finding the same rule. However, these instances continue to be used to guide the rule specialisation process (i.e. the process of adding conditions to an existing conjunction of conditions) and to assess the accuracy and generality of newly formed rules. Also, FuzzyRULES does not work on a class-per-class basis. The class of the selected seed instance is considered “positive” (i.e. the target class) and all the remaining classes are regarded as “negative”.
The generated rules are simplified using a post pruning procedure based on the minimum description length (MDL) principle [57]. The pruning procedure is aimed at finding the rule set for each class that minimises the total coding length. It first examines each rule in turn starting from the last rule added to see whether it could be omitted or not. Then, if a rule cannot be omitted, it is pruned using a greedy strategy by repeatedly deleting a single condition starting from the last condition, if this improves (decreases) the total coding length, until no further improvements in the coding length is possible.
To construct a rule, FuzzyRULES performs a pruned general-to-specific beam search. The search aims to generate rules which cover as many instances from the target class and as few instances from the other classes as possible, while ensuring that the seed instance remains covered. As a consequence, simpler rules that are not consistent, but are more accurate for unseen data, can be learned. To increase the efficiency of the learning process, several search-space pruning tests are used to reduce the size of the search space and to stop further specialisation of a candidate rule when it cannot improve the classification performance of the rule set. The pruning tests can be implemented by associating two sets of linguistic terms, ValidTerms and InvalidTerms, to each candidate rule. The first set contains the valid linguistic terms that may still be used to specialise a rule. Initially, this set includes the linguistic terms constructed for all attributes values in the selected seed instance. The second set contains the invalid terms that will not lead to improved rules.
The learning of single rules starts with the most general rule whose body is empty. Specialisation of the rule involves incrementally adding conditions to its body. Possible conditions are linguistic terms defined on the attributes values of the selected seed instance. In the case of continuous attributes, conditions of the form [A i is ] are created, where is the linguistic term corresponding to the continuous value () of attribute A i in the selected seed instance s. Membership functions have to be designed in order to specify the degree to which each attribute value () belongs to the corresponding linguistic term () of that attribute. A method that can automatically generate fuzzy membership functions for continuous attributes is described in Section 3.5. In the case of nominal attributes, the above fuzzy representation is also employed. However, the values of nominal attributes are converted to linguistic terms by assigning crisp functions with membership degrees of either 0 or 1. For each condition, a new ChildRule is formed by adding the condition to the current ParentRule. The term used to form the appended condition is removed from the ValidTerms set of the ChildRule to prevent repetition of specialisation effort. The score or quality measure of each new ChildRule is then computed and the rule with the best score is remembered.
The new ChildRule is examined to determine if it can be pruned away. A ChildRule that does not cover a minimum number of positive instances(i.e. instances belonging to the class defined by the rule) is removed from the search since no improved rules will be obtained from additional specialisations of this rule. Also, a ChildRule that does not exclude any new negative instances (i.e. instances not belonging to the class defined by the rule) is deemed ineffective since either it leaves out positive instances only, or it keeps the covered instances unchanged. A ChildRule that has become consistent is prevented from further specialisation as this will only decrease the number of positive instances covered by this rule and therefore yield lower values for the quality measure. If the ChildRule is discarded, the last linguistic term used to form it is added to the InvalidTerms set of its ParentRule so as to ensure that it will be removed from other specialisations of the same ParentRule. Otherwise, the ChildRule is added to a set of newly formed expressions representing fragments of potential rules, the ChildRules set. Thus, the ChildRules set only contains useful expressions that can be employed for further specialisation. This process is repeated until there are no rules remaining to be specialised in the ParentRules set which is the complete set of partial rules.
Another test that allows sections of the search space to be pruned away is now applied to each expression in the ChildRules set after the best expression overall in the current specialisation step is identified. Expressions are removed from the ChildRules set if their optimistic values of the quality measure cannot improve on that of the current BestRule. A simple optimistic value is obtained by determining the quality measure of a rule with the same positive cover as the current rule but with zero negative cover. The last linguistic terms used to generate these rules are added to the InvalidTerms set of their parents. All InvalidTerms are then deleted from the corresponding set of ValidTerms for each expression in the ChildRules set. Such terms cannot lead to a viable specialisation from any point in the search space that can be reached via identical sets of specialisations and thus excluding them will prevent the unnecessary construction of ineffective specialisations at subsequent specialisation steps.
After all duplicate rules have been eliminated, the best w expressions from the ChildRules set are chosen to replace all expressions in the ParentRules set. The comparison between expressions is based on the quality measure defined in the next subsection. If two expressions have equal quality measures, the simpler rule, in other words, the one with fewer conditions is selected. If both the quality measure and the number of conditions of the expressions are the same, the more general expression which covers more instances is chosen. The specialisation process is then repeated until the ParentRules set becomes empty. During specialisation, the best expression obtained is stored and returned at the end of the procedure.
Fuzzy computation of the rule quality measure
Given that the rule induction process could be conceived as a search process, a metric is needed to estimate the quality of rules found in the search space and to direct the search towards the best rule. The rule quality measure is the most influential bias in rule induction. In real-world applications, a typical objective of a learning system is to find rules that optimise a rule quality criterion that takes both training accuracy and rule coverage into account so that the rules learned are both accurate and reliable.
A quality measure must be estimated from the available data. All common measures are based on the number of positive and negative instances covered by a rule. Several different metrics are used in existing algorithms [58–60]. For some of these, fuzzy variations were designed and used to score rules [61, 62]. A rule quality measure that performs very well, especially in the presence of noise, is the m-probability-estimate [63]. For a rule R, this measure can be defined as follows:
FuzzyRULES employs a fuzzy version of them-probability-estimate to score the performance of rules on the training set during the learning process, and to select the set of best candidates for further exploration. Let A be the antecedent of rule R and C its consequent. Then the fuzzy version of them-probability-estimate function can be defined using the α-cut operator (Equation (3)) and the cardinality measure (Equation (4)) as follows:
In FuzzyRULES, m is set to n
c
, the number of classes, and the a priori probability P0 (C
t
) is assumed to be equal to the training accuracy of the empty rule that predicts the target class, namely:
When there are both continuous and nominal attributes in a data set, most rule induction techniques discretise continuous attributes into intervals and the discretised intervals are treated in a similar way to nominal values during induction. In the crisp case, discretisation results in “hard” intervals and an attribute value either belongs to a certain interval or not. In the fuzzy case, however, an attribute value belongs to an interval to a certain degree. Discretised intervals, therefore, should be fuzzily interpreted. An intuitive way to obtain fuzzy intervals for each continuous attribute is to discretise its domain into several crisp intervals and fuzzify these intervals accordingly. This section describes how the fuzzy intervals of continuous attributes are defined and how membership functions are constructed for every associated fuzzy interval.
A simple method of crisp discretisation is to divide the domain of each attribute into intervals with equal length or equal frequency [64]. This method, however, generally results in poor system performance. On the other hand, it is also difficult and time consuming to find an optimal discretisation [65]. An efficient method is to find several possible cut points and perform partitioning for a limited number of times. Fayyad and Irani [66] proved that with instances sorted in ascending order by the values of a continuous attribute, possible cut points exist in the boundary where two adjacent instances belong to different classes. A cut point with the highest information gain [67] is then used to split the range of the attribute. This process is iteratively applied until astopping criterion based on the MDL principle [68] is met. This method is employed in this study to obtain crisp intervals for each continuous attribute in the data being mined. A crisp interval can then be fuzzified by associating an appropriate fuzzy membership function with it. An attribute value can then be classified into different intervals with different degrees at the same time. Membership functions can be assigned by domain experts or derived from statistical data [69]. Fuzzy clustering can also be used to determine membership functions [70]. This section proposes a new method that generates membership functionsautomatically.
Typical shapes of membership functions are triangular, trapezoidal, and bell-shaped. Triangular membership functions are chosen in this study as they are simple and often used in fuzzy sets. These functions can be calculated from the boundary points of the crisp intervals using the algorithm detailed below. The example given in Fig. 3 is used to illustrate the steps of the algorithm. Figure 3(a) shows the cut points obtained by applying the crisp discretisation method described above. These points can be used to form a set of disjoint intervals which can be described by means of crisp membership functions as shown in Fig. 3(b). The value 1 is assigned to each membership function if the attribute value is placed inside the corresponding interval. The remaining membership values are equal to 0. In Fig. 3(c), overlapping intervals described by fuzzy membership functions are shown. A point near to a cut point belongs to two fuzzy sets with a membership degree less than 1 and greater than 0 for both membership functions. Assuming that the sum of two adjacent membership functions is always one and the crossing points of these functions coincide with the cut point in the interval partition, the triangular membership functions can be generated as follows.
where v is a value in the domain of a continuous attribute A, L j is the jth fuzzy linguistic term associated with attribute A and μ L j (v) is a fuzzy membership function specifying the degree to which the original value v of attribute A belongs to the respective linguistic term of the attribute.
In this case, the only parameters that need to be determined are the a
j
(j = 1, …, n
l
) values. Those values can be obtained from the set of c
k
(k = 1, …, n
l
- 1) cut points as follows:
When the induced rule set is used to classify a new instance, three outcomes are possible: One rule or multiple rules having the same consequent covers the new instance, More than one rule covers the new instance and the matched rules have different consequents, or No rules cover the new instance.
Each case requires a different classification procedure to predict a label for the new instance. The FuzzyRULES algorithm applies its fuzzy rule set to classify new instances in the same way as the classical rule sets created by RULES-6. The difference is that the degree of match between each instance I and rule R (Equation (7)), and the α-cut threshold (Equation (8)) will be used when computing the coverage of rules. In the first case, FuzzyRULES simply assigns the class predicted by the matched rule(s) to the new instance. To resolve the conflict between rules in the second case, FuzzyRULES selects the rule with the highest membership value to classify the new instance (Equation (9)). In the third case, the generally adopted solution is to use a default rule which simply assigns the most frequent class in the entire training set to the new instance to be classified, independently of its attributes. This method is appropriate in cases where the number of classes in a training set is small and one of the classes contains a predominant number of instances. However, the results deteriorate when the number of classes grows, and instances are more evenly distributed. FuzzyRULES classifies the new instance by assigning to it the class of the nearest rule in the rule set. To find the nearest rule, a simple distance measure introduced by Bigot [71] is employed.
Illustrative example
This section provides an example of applying the FuzzyRULES algorithm to support process planning by proposing feasible process routes to produce a part. All possible technological routes in a given manufacturing environment are stored in a database. When it is required to select a route for a new part, the values of the part’s attributes are entered and an appropriate route can be retrieved from the database. The selected route can be further modified ifnecessary.
Table 1 shows a sample training data used for the development of process routes. Each instance is described by four attributes of the part with the following values: Heat Treatment (Yes, No), Material (Steel_3135, Aluminium, Steel_1045), Tolerance (6.5, 8, 8.5, 10, 11, 13, 14, 15), and Surface Finish (Low, Medium, High). There are four classes with values (R1, R2, R3, R4) denoting the possible technological routes to produce the part.
FuzzyRULES starts inducing rules for the first seed instance (Heat Treatment = Yes, Material =Steel_1045, Tolerance = 8, Surface Finish = Medium, Route = R1). The prespecified values of the algorithm parameters MinPositives (the minimum number of positive instances covered), MinNegatives (the minimum number of negative instances excluded), w (the beam width), and the α-cut threshold are 1, 1, 1, and 0.5, respectively. The post pruning procedure is disabled.
Figure 4 shows the entire set of rules that could be derived from the selected seed instance. Each node represents a rule, where the terms present in the rule are shown in square brackets. The first two values in the parentheses are the numbers of positive and negative instances covered by the rule, respectively. The third value in the parentheses is the evaluation obtained by the respective rule. The circled nodes indicate the rules examined by FuzzyRULES and the bold ones show the best rules chosen in each specialisation level.
The search process starts with the default most general rule having null condition “If [No Condition] THEN Route is R1”. This rule has a score of 0.145 as computed by the m-probability-estimate measure defined in Equation (11). The default rule is used to initialise the BestRrule and the ParentRules set. The ChildRules set is initialised to empty.
Linguistic terms are then used to create new child rules by specialising the ChildRule(s) in the ParentRules set. Four linguistic terms are constructed for the values of the attributes in the selected seed instance. These are [Heat Treatment is Yes], [Material is Steel_1045], [Tolerance is ], and [Surface Finish is Medium]. The membership functions generated by FuzzyRULES to define the linguistic terms of the numerical attribute Tolerance are shown in Fig. 5. The values 9.25 and 12 in the figure are the cut points generated by the crisp discretisation method of FuzzyRULES to split the range of the Tolerance attribute. From these values, the parameters a j which define the boundaries of the membership functions can be computed using Equation (16). The linguistic term is chosen to represent the value 8 of the Tolerance attribute in the chosen seed instance since the membership degree is maximum. can be computed using Equation (13) and its value is 0.95.
The formed linguistic terms are used to generate the following four child rules: ChildRule 1: IF [Heat Treatment is Yes] THEN Route is R1, ChildRule 2: IF [Material is Steel_1045] THEN Route is R1, ChildRule 3: IF [Tolerance is ] THEN Route is R1, ChildRule 4: IF [Surface Finish is Medium] THEN Route is R1.
These four rules have better scores than the BestRule and the first three rules have the same evaluation. ChildRule 1 is generated first and it is therefore chosen to replace the BestRule. The new child rules are then examined to determine if they can be pruned away based on the search-space pruning tests. Since the new child rules are not consistent, their MinPositives and MinNegatives values are higher than the predefined values, and their optimistic scores are better than the current BestRule, they are added to the ChildRules set for further specialisation.
The best w rules from the ChildRules set are chosen to replace all rules in the ParentRules set. Since w is equal to one, only the ChildRule “IF [Heat Treatment is Yes] THEN Route is R1” with the best evaluationis chosen, and the specialization process is then repeated. New linguistic terms except the term “[Heat Treatment is Yes]” are added to the current rule. The following three child rules are generated: ChildRule 1: IF [Heat Treatment is Yes] AND [Material is Steel_1045] THEN Route is R1, ChildRule 2: IF [Heat Treatment is Yes] AND [Tolerance is ] THEN Route is R1, ChildRule 3: IF [Heat Treatment is Yes] AND [Surface Finish is Medium] THEN Route is R1.
ChildRules 1 and 2 obtained the same scores and their evaluation is higher than that of the BestRule. Again, ChildRule 1 is selected to replace the BestRule as it is generated first. ChildRules 1 and 2 are consistent as they cover no negative instances. Also, ChildRule 3 does not exclude (uncover) more negative instances. Although FuzzyRULES can create further specialisations to these rules, this can only cause their evaluation to become worse because only the same number or fewer positive instances will be covered. Therefore, ChildRules 1, 2 and 3 are removed from the search process and they are not added to the ChildRules set. As the ChildRules set does not have any new rules to specialise, FuzzyRULES halts the search and returns the BestRule.
The result of the process is the following rule “IF [Heat Treatment is Yes] AND [Material is Steel_1045] THEN Route is R1”. This rule is added to the rule set, and the training instances that are α-covered by it are marked as covered. Using Equations (7) and (8), it can be easily determined that the generated rule α-covers instance 7 in addition to instance 1. Since there are still instances in the training set that are not covered by the formed rule, the learning procedure must generate new rules to cover these instances. The four rules produced by the FuzzyRULES algorithm are listed in Fig. 6. The classification accuracy of these rules on the training data is 100% .
Experimental results
This section compares FuzzyRULES against other crisp and fuzzy rule learning algorithms. FuzzyRULES was compared first with the RULES-6 crisp rule induction algorithm on which the fuzzy version was based. FuzzyRULES was compared also against the well-known crisp inductive learner C5.0 [13] which is probably the best performing induction algorithm commercially available. Finally, FuzzyRULES was compared against the recently developed RULES-F Plus fuzzy rule learning algorithm [48].
Three criteria were used to evaluate the performance of the tested algorithms, namely, classification accuracy, rule set complexity and number of rules evaluated during the search process. The complexity of a rule set is measured by the total number of rules or total number of conditions in that rule set. In order to draw reliable conclusions about the behaviour of the learning algorithms, 30 data sets (listed in Table 2) were considered. All data sets were obtained from the University of California at Irvine (UCI) repository of machine learning databases [73]. In the experiments conducted in this study, the hold-out approach was used to partition the data into training and test data [74, 75]. Each large data set with more than 1000 instances was randomly divided once into a training set with two-thirds of the data and a test set with the remaining one-third. For small data sets with fewer than 1000 instances, this partitioning procedure and the induction process were repeated ten times, and the results were averaged. RULES-6, C5.0, RULES-F Plus, and FuzzyRULES each has a number of parameters whose values determine the quality of their induced rule sets. The default parameters of RULES-6, and C5.0 were used. For RULES-F Plus, the experiments reported in [48] set the beam width to 2 and the noise level (NL) to 0.1, 0.2,or 0.3.
Therefore, these parameter values were used in this study. For FuzzyRULES, the α-cut threshold was set to 0.5.
Comparison with RULES-6
Table 3 shows the results obtained. As can be seen from the table, the performance obtained by FuzzyRULES was impressive. There were considerable improvements in the classification accuracy for 17 data sets. The accuracy degraded for 9 data sets. For the remaining 4 data sets, equivalent results were obtained. It can also be noted from the table that FuzzyRULES produced much more compact rule sets than RULES-6. The total number of rules decreased by 25.5 per cent (from 781 to 582). Also, the total number of conditions dropped by 19.3 per cent (from 2332 to 1883). The capability of the FuzzyRULES algorithm to produce simple and accurate rules can be attributed to the use of fuzzy linguistic terms in its rules antecedents, which supports the generation of more natural and more comprehensible rules, and makes FuzzyRULES more robust in tolerating imprecise, conflict, and missing information. Despite using a more complex rule-forming process, the fewer and more general rules created by the FuzzyRULES algorithm made it much faster than RULES-6 as indicated in Table 3. The total number of evaluations fell by 15.7 per cent (from 176369 to 148704). These results confirm that FuzzyRULES obtains rules that are less complex, but more accurate than those extracted by RULES-6.
Comparison with C5.0
FuzzyRULES was compared with C5.0 for the data sets listed in Table 2. C5.0 has a facility to generate a set of pruned production rules from a decision tree. Table 4 presents the results for each algorithm on each data set. In each case, the accuracy of the test data and the complexity of the resulting rule sets are given. The number of rules was taken as a measure of the complexity of the rule set. A complexity of 1 was assigned to the default rule. It is clear from Table 4 that the accuracy obtained by FuzzyRULES was, in total, higher than that of C5.0. In addition, FuzzyRULES achieved higher accuracies for 19 out of 30 data sets, whereas C5.0 yielded better accuracies for 11 out of 30 data sets. It is also clear from the table that, in total, FuzzyRULES created fewer rules than C5.0. The number of rules was lower for 25 data sets and higher for 5 data sets for FuzzyRULES when compared with C5.0. Taking into account both the number of rules created and the test accuracy obtained, it can be concluded that the FuzzyRULES algorithm outperformed C5.0 in the classification experiments conducted. Although in most situations the C5.0 decision tree induction method works well, it has a number of problems due to its representation of rules and its strategy for induction. First, it often happens that identical subtrees have to be learned at various places in a decision tree. Second, by minimising the average entropy of a set of instances, C5.0 disregards the fact that some attributes or attribute values may be irrelevant to a particular classification. Finally, C5.0 performs a kind of hill-climbing search for rules, by exploring just one alternative at a time, which makes it very sensitive to the problem of local maxima. The FuzzyRULES algorithm, on the other hand, does not suffer from these problems. It uses attribute-value pairs instead of attributes when inducing rules and thus avoids generating inappropriate and overspecialised models. Also, FuzzyRULES has the advantage of being less sensitive to the local maxima problem, since it explores w alternatives at a time, where w is the beam width.
The problems of the C5.0 algorithm can be demonstrated by applying it to the data of the process planning problem. Figure 7 displays the decision tree constructed from this data and Fig. 8 shows the rules derived from this tree. C5.0 generated rules that are more complex than those of FuzzyRULES. Some of the rules and conditions produced by C5.0 are redundant such as the first condition in the first rule, and the second rule which does not cover anyinstances.
Comparison with RULES-F Plus
The performance of FuzzyRULES has been compared against that of RULES-F Plus. Table 5 presents the results obtained. It is clear from Table 5 that FuzzyRULES achieved a substantially lower complexity without sacrificing accuracy for most data sets. The accuracy obtained by FuzzyRULES over all the data sets was in total higher than that produced by RULES-F Plus with the three noise levels considered. Also, it produced significantly fewer rules in total than RULES-F Plus for all noise levels. Moreover, FuzzyRULES gives better performance results than the best results obtained by RULES-F Plus with the optimal noise level. In terms of classification accuracy, FuzzyRULES outperformed RULES-F Plus for 20 out of 30 data sets. On 3 data sets, FuzzyRULES developed rule sets with accuracies similar to RULES-F Plus. Only on 7 data sets, did RULES-F Plus outperform FuzzyRULES. In terms of the number of rules created, FuzzyRULES obtained fewer rules 13 times and RULES-F Plus 15 times. Both algorithms produced equal number of rules for the remaining 2 data sets. However, RULES-FPlus created only one rule (“IF anything THEN predicts the class with the largest number of training instances”) for the data sets Anneal (NL = 0.3), German (NL = 0.3), Hepatitis (NL = 0.3), Hypothyroid (NL = 0.1, 0.2, and 0.3), Shuttle (NL = 0.3), and Sick-euthyroid (NL = 0.1, 0.2, and 0.3). This demonstrates a deficiency of the RULES-F Plus algorithm as it fails to learn any interesting pattern in the above-mentioned data sets. It could therefore be concluded that FuzzyRULES gives the best performance when considering both the classification accuracy and the number of rules. The improved performance of the FuzzyRULES algorithm compared to the RULES-F Plus algorithm can be attributed to the ability of FuzzyRULES to handle noise in the data, which is achieved by employing a search method that tolerates inconsistency in the rule specialisation process. Also, FuzzyRULES adopts a very simple criterion for evaluating the quality of rules and a robust method for handling attributes with continuous values, which further improves the performance of the algorithm.
Conclusions and future work
This paper has presented FuzzyRULES, a novel algorithm for inducing fuzzy classification rules from data. FuzzyRULES uses an approach similar to that employed by the RULES-6 crisp rule induction algorithm. However, unlike RULES-6 and other crisp rule induction procedures, FuzzyRULES is able to handle fuzzy data, and thus has the all benefits associated with fuzzy sets. Continuous attributes, for example,are processed in a much more natural way, and vagueness and ambiguity can be readily handled. The operation of FuzzyRULES has been illustrated in an industrial application concerning the automatic identification of possible process routes to produce a part. FuzzyRULES has also been compared against RULES-6 and C5.0 crisp inductive learners, as well as the recently developed RULES-F Plus fuzzy learning algorithm. The results obtained have clearly shown the strong performance of the new algorithm.
In this research, all output values were of the discrete, nominal type. However, it is common for real-world data to have continuous values. Future work should focus on developing fuzzy rule induction algorithms that can predict continuous output values. Also, this study only considered the triangular membership function and one simple method for describing it. Future work should investigate other membership functions such as trapezoidal functions and other methods of determining the cut points including those based on the density of attributevalues such as clustering techniques.
Footnotes
Acknowledgments
The author wishes to thank the Industrial Engineering Department at Zagazig University, Egypt for providing a good environment, facilities and financial means to complete this paper.
