Selection of decision rules based on attribute ranking

Abstract

Rule classifiers enhance understanding of data and enable to represent learned knowledge in an explicit structural form. To further improve this understanding and generalisation properties of predictors, the process of selection of decision rules can be executed. The paper presents a post-processing approach to this task. In a proposed research framework there were exploited rankings of attributes the rules refer to, thus transferring the established importance of features to the rules based on them. The attributes were weighted by selected statistical measures and machine learning algorithms. The rule classifiers were constructed in Dominance-Based Rough Set Approach and employed in the domain of stylometry for the task of authorship attribution.

Keywords

Rule classifier selection of rules attribute ranking stylometry authorship attribution DRSA

1 Introduction

Construction of a rule classifier reminds the process of describing an object by references to its characteristics, qualities, evaluated measures, or distinguishing features [31]. A decision algorithm consists of elementary rules specifying particular conditions on all involved variables, which, when met, point to a certain class or classes of objects. In this context these characteristic features of objects are called attributes, conditional if they appear in the premise part of a rule, decision when they indicate class label or value.

To achieve high classification accuracy, out of all possible or available rules those with the highest quality should be chosen, which means firstly discovering this quality, however it is defined. The properties of objects always need to be expressed by conditions on attributes, yet they can be combined in many ways, and their efficiency reflects on relevance of inferred rules.

Rule quality can be questioned in three approaches:

in pre-processing, when not rules but data is modified, for example by attribute and instance selection, which leads to generating such rules that would not be inferred otherwise, or their subset;

during the induction stage, and only rules meeting constraints with respect to length, support, coverage [32], or other measures are constructed [19];

in post-processing, where some of the generated rules are selected while others rejected, with possible re-calculation of conditions [20].

Various algorithms for induction of decision rules exist, and whether they infer only a limited number of rules or all possible rules on examples, once a set of rules becomes available, its elements can be analysed further, and some of them chosen basing on additional selection criteria, regardless of earlier filtering. Standard approaches to rule selection use their parameters, such as length or support. There can also be considered conditional attributes the rules refer to, or some inherent mechanisms employed, such as relative reducts in rough set processing [23], one of methodologies applied in cases of incomplete or uncertain data [14].

In the paper there is presented a post-processing approach to rule filtering, employing algorithms dedicated to attribute ranking [30]. Rankings belong with filters and they are typically used in pre-processing of features [7]. To all variables some weights are assigned, and their values lead to the specific order reflecting relevance. Thus in the proposed framework the importance of induced rules is studied from the perspective of relevance of characteristic features, evaluated by means independent on the classifier employed.

The first step of the framework involves preparation of input datasets and calculation of decision algorithms. The parameters of induced rules are analysed to discover their values and ranges. In the second step for all features some rankings are found by selected statistical measures and machine learning algorithms [13]. In the third step from the set of available rules there are filtered out those with conditions on attributes selected from the obtained rankings. The starting point in the search for rules is the empty set, to which the groups of rules are added with each new attribute being considered. The constructed rule classifiers are tested and the overall trends in their performance observed.

The experiments indicate that the features regarded as higher ranking recall rules leading to decision algorithms with fewer constituent rules while preserving the satisfactory classification accuracies. Thus the proposed approach proves to be useful, however, it cannot guarantee obtaining the best results, which is shown in comparison with the selection of rules based on other attribute orderings. Yet to make certain that the performance is the best, an exhaustive search through all possible subsets of rules would have to be executed and it is impractical even for short decision algorithms.

In all tests the problem to solve was a binary classification with balanced data, with the task of authorship attribution for studied text samples. The recognition of authorship was based on linguistic preferences and habits of writers observable in their works, captured by lexical and syntactic textual descriptors, calculated for the analysed texts parts [8].

The paper is organised as follows. Section 2 presents the background, Section 3 explains stages of the proposed framework, the test results are discussed in Section 4, and Section 5 concludes the paper.

2 Background

The research presented in this paper involved rule classifiers, selection of decision rules, attribute ranking methodologies, and stylometric analysis of texts.

2.1 Construction of rule classifiers with DRSA

Mining of data results in learned knowledge that can be an internal and implicit part of a predictor, inseparable from it, as in artificial neural networks, for which weights assigned to interconnections are learnt, or it can be expressed explicitly, as in rule classifiers, where inferred rules directly list conditions on attributes. The straightforward manner of expression of decision rules has an advantage of enhancing understanding of underlying patterns discovered in data [31].

Rough set processing is one of data mining approaches leading to induction of decision rules. Its classical version was invented by Z. Pawlak [14] and Dominance-Based Rough Set Approach (DRSA) is its modification supporting multi-criteria decision making [21]. In the rule induction process the notions of preference and dominance are used. All values are considered either as equally or less preferred, or equally or more preferred, and for all involved variables partial orderings in value sets are required.

The conditions on attributes, learned from the available instances, use these weak preferences — the attributes taking on either equal or lower values, or equal or higher values than the calculated thresholds point out to the decision class or classes [11], as the rules classify to at most or at least some decision class Cl _t for a single decision attribute ( Cl = {Cl _t}, for t = 1, …, K , all classes ordered according to increasing preference): ${Cl}_{t}^{\geq} = ⋃_{s \geq t} {Cl}_{s} and {Cl}_{t}^{\leq} = ⋃_{s \leq t} {Cl}_{s} .$ (1)

Inferred rules are characterised by their parameters, that is specific conditions included, lengths, and supports. The length of a rule equals to the number of conditions present in the premise part of this rule [25], while the support of the rule indicates the part of the training instances that validate the rule, that is with the matching conditions and the same decision.

In approaches to induction of decision rules there can be a number of reasons for particular way of processing. The aim can be that of obtaining rules that are interesting from some point of view [3], or with specific properties such as length or support [24]. Another goal can be to find specific number of rules, such as a minimal cover of the learning samples [32], or all rules that can be generated. At the end the result of application can be considered as the most important and telling factor of rule quality, which leads to measures based on the system performance [19].

When only few rules are inferred they are found relatively quickly, but others can exist, which would be better for some intended purpose. When a high number of rules is generated, the time needed for processing can be significant, this high number of induced rules can cause some problems in classification, and also makes recognition of rule qualities harder. On the other hand, with more rules there are more options to be chosen from. It is not necessarily the most efficient, yet if it can be afforded, post-processing of many rules enables the widest context, the most complex analysis of inferred rules, and selection of those that provide the best fit to requirements, whatever theyare [27].

2.2 Selection of decision rules

Once some rules are induced, whichever approach to this process is employed, the decision needs to be made whether in construction of the final algorithm all inferred rules are used, or just their selection.

When the decision is in favour of rule filtering, it can be executed by specifying some firm requirements to be met, or by a systematic search through the space of available dimensions. In the latter case the starting point can be an empty set, to which then single rules or their groups, chosen from those generated, are added in the forward manner [16], or the filtering commences with the set of all inferred rules from which elements are rejected backwards, which is called pruning [19]. The stopping point can be defined by reaching certain cardinality of the rule set, by its quality indicated through predictive accuracy, or other measures.

One of the approaches to rule selection is to exploit the direct characteristics, such as length or support. Rules with just few conditions frequently are of higher quality as they possess better descriptive and generalisation properties [25]. In contrast, the longer rules are often cases of overfitting, which are very close descriptions of patterns present in the training data, so close that the generalisation property is limited. A high support means that a rule captures the pattern present in many learning samples, more likely to appear also in unknown cases [24]. The low value of support indicates that the pattern described by it occurs rarely, and it cannot be expected to match the testing data.

When the two parameters are used as constraints for a decision algorithm, typically the rules are expected to have lengths not higher than some set threshold (thus have some acceptable maximal lengths), and supports not lower than the specified value (that is some required minimal support), to be selected for inclusion in the final decision algorithm.

The notion of support is also employed in definitions of several rule interestingness measures [3], which can be used as the basis for rule filtering. These measures are also combined together by operators of addition or multiplication (possibly with some weights) to built new complex measures [4].

Since construction of a decision rule starts with selection of attributes, yet another way to look at rules is through conditional attributes they refer to. We can prune rules either by analysing or re-defining conditions [20], or exploiting embedded mechanisms such as relative reducts from rough set processing — minimal subsets of attributes preserving the quality of classification of the complete set of features [29].

As the attributes are responsible for rule quality in such significant part, we can reasonably expect that observations on importance of characteristic features in the input space can be translated to the space of induced rules, referring to these variables. The notion of attribute relevance leads to the idea of ordering them by this relevance, thus constructing a ranking, with elements weighted by the score assigned through some of the well-known approaches to feature ranking.

In the proposed research framework the rule selection process starts with the empty set to which elements are added, depending on the number of attributes selected from their ranking, and the decision rules referring to these attributes. The stopping point for the procedure can be reached when a certain number of variables are selected from a ranking, or a certain number of rules, when some satisfactory classification accuracy is achieved, or when all available features are selected, which means obtaining the complete set of induced rules. The latter approach can be considered as too costly in terms of processing and time needed, yet it enables observation of overall trends in performance, which could be hard to detect otherwise.

2.3 Attribute ranking by statistical measures and machine learning approaches

Most classification systems do not perform well when they have to relay on characteristic features that are repetitive, excessive, or irrelevant [7]. Reduction of dimensionality and gaining information on relevance of attributes can be achieved with the help of ranking algorithms that belong with filter category of feature selection approaches.

Filters base on general characteristics of data and they can be treated as pre-processing engines since they work independently on any classification system. Their general nature is their strong trait, as they are applicable in any domain or to any methodology. On the other hand, generality is also a weak point — they allow for close tailoring to specifics of neither task nor technique, which means that they can beoutperformed.

The search for features can be executed in forward or backward direction [27]. The stopping point for the procedures depends on the specific goal, it can be reached when a certain number of features are selected, when all possibilities are exhausted, or when some defined measures of importance for features match the required thresholds.

When a feature selection algorithm is applied to the entire set of available variables their ranking can be obtained, in which each attribute is assigned some weight, or importance or relevance score that leads to their ordering [29]. If the ranking function works independently on a predictor employed in pattern recognition, it plays the role of a filter [30]. Once the ranking is established, some subset of variables can be chosen, limited by their number, or with the scores higher than the required minimum.

Some of attribute ranking algorithms are able to detect irrelevant or redundant attributes and discard them, while in others each variable is given some score and they all are considered as relevant, although the degrees of their importance can significantly vary.

Statistical feature ranking methods [13] refer to concepts from information theory, such as entropy, defined by Claude E. Shannon [18] for random variable x as: $H (x) = - \sum_{i} P (x_{i}) {log}_{2} (P (x_{i})),$ (2) with P (x _i) giving the probability of x _i. For the decision attribute Cl with K classes the entropy is: $H (Cl) = - \sum_{i = 1}^{K} P ({Cl}_{i}) {log}_{2} (P ({Cl}_{i})),$ (3) and for a feature x _f it equals: $H (x_{f}) = - \sum_{x} P (x_{f} = x) {log}_{2} (P (x_{f} = x)) .$ (4) Such form has the logical sense for discrete features while for continuous ones either discretisation is needed or integral should replace the sum.

P (x _i, y _j) is the probability of the joint occurrence of x _i and y _j for the two events x and y, for which the entropy is calculated as: $H (x, y) = - \sum_{i, j} P (x_{i}, y_{j}) {log}_{2} (P (x_{i}, y_{j})) .$ (5)

P (x _i|y _i) gives the posterior probability of x given values of y, and the entropy of x while observing values of y, that is conditional entropy can be found by: $H (x | y) = H (x, y) - H (y) .$ (6)

By employing the concept of entropy for attributes and classes, Information Gain coefficient (InfoGain) is defined as: $InfoGain (Cl, x_{f}) = H (Cl) - H (Cl | x_{f}) .$ (7)

Gain Ratio (GainR) is the result of normalisation of Information Gain by the entropy of a feature x _f: $GainR (Cl, x_{f}) = \frac{H (Cl) - H (Cl | x_{f})}{H (x_{f})} .$ (8)

Symmetrical uncertainty (SymmU) estimates the worth of an attribute x _f with respect to the class Cl: $SymmU (Cl, x_{f}) = 2 [\frac{H (Cl) - H (Cl | x_{f})}{H (Cl) + H (x_{f})}] .$ (9)

Relief algorithm is dedicated to filtering of features, applicable for both binary and multi-class recognition [9]. It executes feature ranking basing on instances. In this algorithm, with the pseudo-code shown in Algorithm 1, attributes are assigned a score depending on their ability of discerning decision classes.

In the iterative process the set of instances is sampled and the function diff(a, x ₁, x ₂) calculates the difference between two instances, x ₁ and x ₂, with respect to values of attribute a, with the return of 1 for different values and 0 for the same in case of nominal attributes, and normalised actual difference for numerical attributes. The results are averaged from contributions of k closest neighbours within the same class (called near-hits H) and the closest neighbours in other classes (near-misses M), with weighting misses by the prior probability of each class. The distances between these tested samples weighted by the number of iterations and the number of neighbours accumulate to the final scores given to features [28].

OneR attribute evaluation employs the accuracy measure from the OneR classifier [6] that exploits the fact that short rules often result in high classification accuracy. The pseudo-code is given in Algorithm 2. In the first stage of the algorithm for each attribute a single rule is chosen from the set of candidate rules, constructed for all attribute values and the classes they most frequently classify to. The qualities of rules are evaluated by establishing their classification accuracies, and the highest leads to the selection of the best rule. In the second stage from these chosen best rules a single one, with the lowest error, is selected as 1-rule. To obtain the ranking, all attributes are weighted by the scores corresponding to errors calculated for rules.

Support vector machines (SVM) base on an algorithm that finds the maximum margin hyperplane, such hyperplane that gives the greatest separation between classes [12]. The instances in minimum distance from the maximum margin hyperplane are support vectors and their set uniquely defines the hyperplane by: $f (X) = b + \sum_{iissupportvector} α_{i} {Cl}_{i} X (i) \cdot X .$ (10)

Cl _i corresponds to a class value (1 if the instance is in this class and -1 otherwise) for the training sample x _i, X (i) · X a scalar product of the training sample and the testing sample, while b and α _i are numerical parameters with values established in the learning process. The weights α _i are equal 0 for the training patterns that are not support vectors, and non-zero otherwise. When their weights are higher than 0 but lower than a certain defined threshold, the support vectors are marginal and their average gives the bias value b.

When linear SVM is used in selection or ranking of attributes [31], they are eliminated recursively basing on the size of their coefficients, calculating their squared weights. After each reduction, in which the highest-ranking feature is eliminated, the learning step is repeated as long as the remaining subset of variables is not empty thus employing the backward approach.

2.4 Stylometric textual attributes

Stylometric analysis of texts relies on the fundamental assumption that the ways in which people formulate their thoughts and ideas while putting them down in writing are unique. When sufficient numbers of representative writing samples are available, a writing style can be characterised, styles can be compared to find similarities and differences, and authorship can be recognised with certain level of reliability [15].

Stylometric studies have origins in the historic past, when analysis of texts with respect to style required comparisons of manuscripts while looking for some linguistic patterns which could be detected with the careful but bare eye. Accessibility of modern computers simplifies this task as employing their high computational capabilities makes possible referring to quantitative (instead of qualitative) descriptors, which are less subjective, and considering common parts of speech [10]. In authorship attribution lexical and syntactic descriptors are most popularly employed, reflecting frequencies of usage for selected function words and punctuation marks.

As the styles are unique to their authors, so are the sets of expressing them specific linguistic markers, which means that there is no universal rule how to construct such sets and it is a common practice to transfer the task of finding relevant features to the area of data mining techniques applied [8]. These approaches typically belong with statistic oriented computations [17], or artificial intelligence area [24], where authorship attribution can be treated as a classification task, with categorisation of text samples by their authors.

3 Post-processing framework of rule selection

The framework for research described in this paper consisted of the following stages:

First stage—pre-processing which encompassed:

construction of input datasets,

induction of all rules on examples decision algorithm within DRSA methodology,

analysis of parameters for generated decision rules and pruning by constraints on these parameters,

Second stage—calculation of selected rankings of all considered features,

Third stage—construction of modified classifiers by filtering of decision rules while following attribute rankings,

Fourth stage—execution of tests and analysis of obtained results,

explained in more detail in the following subsections.

3.1 Input datasets

Two groups of input datasets were involved in the experiments performed, learning and testing for two male (T. Hardy and H. James), and two female (E. Wharton and J. Austen) writers. Their novels are relatively long, so they were divided into smaller parts to allow for some variations of styles, increase the numbers of samples, and keep similar lengths of texts.

Since in one text, even with significant length, the differences in a writing style are less likely to appear than when it is compared to some other text, the learning and testing text parts were taken from completely separate writings. Otherwise the resulting classification accuracy could be falsely high.

For all groups of text samples there were calculated frequencies of usage of 17 lexical and 8 syntactic markers, the first chosen from the list of the most common function words in English language, and the second the popularly used punctuation marks, which gave the total of 25 characteristic features. As occurrence frequencies are continuous numbers, a data mining technique to be applied either needs to be capable of dealing with this kind of data (as DRSA), or some discretisation approach is required [1].

3.2 Induction of decision algorithms

To proceed, DRSA approach requires definitions of preference orders, either gain or cost, for all considered attributes. The gain type says that the higher values of attributes the higher class label should be indicated, while the cost preference happens when the lower values of attributes point to the higher class label. These preferences need to be assigned arbitrarily, established through tests, or some other applied algorithm [26].

To arrive at the final decision a preliminary test of preferences was executed, by checking the performance of the generated minimal cover algorithms for both datasets, for both preferences. For female writer dataset the better result was obtained for preferences of gain type, while for male writer dataset the cost preference resulted in higher classification accuracy. These two preference orders were employed in the induction of all decision rules on the training examples.

The induction procedures returned the full algorithms with 46,191 rules for male, and 62,383 for female writer dataset. Classification with so high numbers of rules for both testing sets resulted only in ambiguous decisions, as so many rules matched samples that the groups of them pointed to different classes. When the only type of decision is ambiguous one, it is best to introduce some additional constraints on rules, that is to prune some of them in order to keep those which are the best, which is typically understood as providing the highest classification accuracy.

In the described experiments the selection of rules was executed in two ways. Firstly, to provide a reference point for comparison, they were filtered basing on their explicit parameters which is a standard approach, secondly by exploiting calculated rankings of attributes, as explained in the following sections.

3.3 Pruning of rules driven by their parameters

For the two full algorithms induced the distributions of rules with respect to lengths and supports are given in Table 1. In both cases the ranges of rule lengths were the same, from 1 to 10. Rules recognising female writers had the numbers of supporting samples from 1 to 77, while for male writers it was from 1 to 64.

The significantly high parts of those decision rules (over 34% for female, and over 45% for male writers) had support equal 1, which means that they were validated by a single training sample. Such rules are unlikely to match the testing samples, even if they are short which would suggest their generality, but for this level of support the average rule length was the highest (4.67 for female, and 5.17 for male dataset). As supports of rules increased, lengths of rules on average decreased. At the highest support values the rules from both algorithms had just single or pairs of conditions.

The rules from full algorithms were next filtered in the space of two dimensions of length and support, to construct new decision algorithms containing rules with lengths not higher than, and supports at least that high as required, in the following manner. When the maximal rule length was set to 1, only rules with a single condition were selected. When the threshold for length was 2, the rules with single conditions or their pairs were filtered, and so on. Minimal support equal 1 means selecting all rules, as the support lower than 1 is not possible to occur. Minimal support equal at least 2 filtered all rules validated by 2 or more samples, etc.

The performance of all decision algorithms constructed by pruning rule supports and lengths is displayed in Fig. 1. It should be noted that in the paper in all tables support values are listed as numbers of validating examples, however, they could also be given in relation to the total numbers of available samples. Then for female writers the maximal support of 77 could be given as 77/100, and 64 for male writers as 64/90, which was recalculated for all graphs in Fig. 2. All presented results specify only the percentage of correctly classified cases. The ambiguous decisions were treated as incorrect and rejected.

Analysis of performance for rule classifiers yields observations on trends and predictive accuracies as they differ depending on a point in space of filteringparameters. For example for female writers for support equal at least 33/100 and the rules with at most 2 conditions the classification was 80.00% . Male writer classifier with rules not longer than 3 and support equal at least 28/90 recognised 70.00% of the testing samples.

As could be expected, in most cases rejection of longer rules increased performance, and in the similar way worked discarding of rules with lower supports. When maximal rule length was 3 or higher the performance was mainly driven by support values. For the maximal rule length equal to 7 or more, even though the numbers of retrieved rules differed, as can be seen in Table 1, the classification accuracy depended solely on constraints on support.

The maximal classification accuracy discovered for the female writer dataset was 86.67% , for minimal support in the range from 61 to 66, with or without constraints on maximal rule length. For male writers it was 76.67% for support equal at least 40 or 41, with maximal rule length not lower than 4. Actually, at this level of support there were no rules with lengths higher than 4. These two levels of classification accuracy were used as reference points in comparisons with rule selection governed by attribute rankings.

3.4 Attribute rankings

Two rules with the same values of length and support may still vary in quality because of features they depend on. Thus the sets of conditional attributes the rules refer to are also their parameters. Specific conditions can be analysed, but their total number is several times as high as that of the induced rules. Another way of processing is by application of some algorithms for feature selection, in particular attribute rankings, which is the key notion of the proposed framework.

Filters claim to be able to discover the importance of attributes basing on calculated values of statistical measures or machine learning approaches, and a selection of those (listed in Section 2.3) was used in the presented research to obtain rankings of attributes, assigning scores to considered elements, and putting them in certain orders. It is generally assumed that a ranking of attributes is organised in such a way that as first there are listed variables given the highest scores at the weighting step, then those less and lessimportant.

Rankings of features were established with WEKA workbench [5], with default parameters for procedures and executing 10-fold cross validation. As for both female and male writer datasets the distributions of attribute values for considered classes were different, even though the available attributes were the same, there were similarities as well as differences among their respective rankings, as can be seen in Table 2. The right-most column of the table shows a random ordering of attributes used in tests for comparisonpurposes.

For all approaches the feature established as the highest ranking was the same, namely the frequency of usage of “not” for female writers, and the frequency of "and" for male. These observations confirm some findings on differences between female and male writers: reportedly women often use negative forms of speech, while men like making lists or enumerations [22].

The decisions with respect to the least ranking features were not so uniform, for female writers and statistical approaches it was "from", but Relief, SVM and OneR each gave a different answer. It was even worse for male writers, where "of" and a comma appeared twice, and a fullstop and "on" once. For women the frequencies of usage of punctuation marks were found as higher ranking, while for men rather lower ranking.

In order to exploit a ranking, the first K features are selected from the list of N available, K being the desired number of attributes (which may even equal N). The number of variables selected can depend on some imposed constraints, the entire set can be tested to observe the overall trends in performance in relation to the number of used variables, or the procedure can be stopped when some other criteria are satisfied.

3.5 Rule selection governed by attribute rankings

Using a ranking to construct rule classifiers corresponds to filtering decision rules with conditions only on those attributes that were selected from the ranking. If, among all conditions of a rule, there exists at least one, which refers to some variable that is not selected from the ranking, the rule is not included in the set of recalled rules, regardless of its other conditions. In other words, for the rule to be selected its conditions need to correspond to a subset (not necessarily a proper subset and certainly not the empty set) of any cardinality of the set of attributes chosen from the ranking.

For example, for the ranking based on information gain for female writers, taking 4 top ranking features means considering “not”, a semi-colon, a colon, and a comma. To use them for rule filtering, from the complete set of available rules there were chosen those with any conditions only on these 4 attributes. There were 79 such rules, and they gave base to construction of the new classifier. For the ranking returned by SVM for male writers, 5 most important attributes were “and”, “that”, “by”, “not”, and “but”. The full algorithm contained 134 rules referring to these features or their combinations and these rules constituted the new algorithm. In the similar manner all presented rankings with varying numbers of attributes were used.

Only testing over the set of all available features can show whether the classifier, established at some moment as the best performing, is in fact the best. So, if it can be afforded, it is best to complete the entire search path, starting with the empty set of attributes, then adding to it and recalling rules that refer to these conditional attributes, till the full sets of variables and rules are tested. This is why for each ranking the whole range of selected features was tested, from filtering of rules referring to just 1 attribute to the complete set of rules referring to all 25 attributes, as shown in Table 3.

The subsets of selected rules were naturally overlapping with increasing the number of considered features, because the procedure of filtering worked forwards. With each step more and more of highest ranking attributes were chosen, and with them the rules referring to those features. As attributes appeared with varying frequencies in the induced rules, it was reflected in the numbers of retrieved rules.

For the constructed decision algorithms quality can be evaluated in many approaches, by calculation of coverage, or comparing numbers of filtered rules. As the main aim of induced rule classifiers was to recognise patterns of writing for as many samples as possible, testing with the separate testing sets was executed, which is commented and discussed in the next section.

4 Discussion of the obtained results

The detailed results for the decision algorithms built while exploiting selected attribute rankings are listed in Table 3. Out of all parameters presented, the number of retrieved rules is the most direct answer to the selection process. Yet it is not the most telling from the application point of view. Alas, neither few nor many recalled rules guarantee high predictive accuracies.

For further improvement the standard approach of rule pruning by their support can be employed, and treated as an additional constraint on the constructed rule classifiers. Imposed hard constraints caused many of the best algorithms to be exactly the same, as the same good rules were kept while others were rejected.

For all considered rankings for female writer dataset the trends in classification accuracy of the new rule classifiers were clearly monotonic, namely increasing. From the point of view of performance they were very close to each other, yet one can be stated as the best: GainR led to achieving the maximal classification accuracy of 86.67% at the third step of feature and rule selection, while at the second step it was 82.22% . When all results of rule selection for this dataset were compared it became apparent that just 3 features were required to obtain the best performance: frequencies of usage for “not”, a semi-colon, and a comma. Once the rules referring to these 3 attributes were filtered out, the constructed decision algorithms gave the best performance when supports of rules were limited.

For male writers the classification results were generally worse. To get close to the best performance at least 4, or 5 variables were needed. The best outcome was given by OneR ranking, and none of rankings resulted in a strictly monotonic trend in performance.

For comparison there were executed tests with selection of rules while following reversed Relief ranking and a random ordering of attributes listed in Table 2. For the random order it cannot be stated that a variable is considered as highest- or lowest-ranking. Therefore, both this ordering of variables and its reverse were exploited in forward selection of decision rules for the two datasets (the same ordering for both datasets this time), with the results shown in Table 4.

The comparison of performance for the random order and reversed Relief with the best previously found, GainR for female and OneR for male writer dataset, is shown in Fig. 2. At first glance the two graphs could be found rather surprising, as for both datasets in several cases the classification accuracy was higher than the maxima established before. Some random combinations of features and rules including them resulted in construction of far better decision algorithms. For the reversed Relief ranking the recognition ratios rose slowly and gradually, and many more variables and rules were needed to arrive at some satisfactory level.

Deeper analysis of this last group of classifiers confirmed some of the earlier findings and expectations. Firstly, it should be noted that in both plots only the maximal predictive accuracy is displayed, obtained by limiting the set of rules not only by their included conditions, which is the result of exploited variable ordering, but also taking only subsets of rules with specific supports, ensuring the best recognition.

Secondly, comparisons of sets of features that led to construction of these better algorithms revealed that they contained some of the higher ranking attributes, present in rules with relatively high supports. If, as first there were selected features appearing in conditions of highly supported rules, the retrieved good rules constituted the solid base for all constructed classifiers, further enhanced by adding more rules along with more variables. When, in the initial steps of the filtering process, there were chosen low ranking attributes, the rules selected with them gave poor classification accuracies, even with additional constraints on support, as most of them had low support values. The performance improved only when many more rules were retrieved, with attributes of at least some average rank. Thus the rankings established previously and their usefulness in rule selection were validated by application.

On the other hand, it is clearly visible that the classifiers based on all tested rankings were outperformed, yet such possibility was actually expected, as the rankings were based on dataset characteristics, without taking into account any specifics of the classification systems applied. Also, both rankings and rule filtering algorithms did not execute exhaustive search procedures and could not guarantee obtaining the best results.

5 Conclusions

The paper presents a research framework dedicated to filtering of decision rules, incorporating attribute ranking algorithms in the post-processing of induced rule classifiers. The first step of the framework was focused on pre-processing, which consisted of preparation of input datasets and generation of all rules on examples decision algorithms by Dominance-Based Rough Set Approach. In the second step the rankings of all available characteristic features were established through several selected approaches.

In the next stage of processing, the defined rankings of attributes were exploited to recall rules from the sets of all available elements in the forward manner. The rules with conditions on the highest ranking features were filtered to form new decision algorithms, the performance of which was then tested. The process can be stopped when a certain number of attributes is included in considerations, when the number of retrieved rules reaches some specified level, when the predictive accuracy of classifiers satisfies requirements, or when all features and all rules are studied. Completing the search path, which starts with the empty sets of variables and rules, and ends with the complete sets, enables to observe overall trends and offers a better chance of finding maximal predictive accuracy.

All tests were executed for binary authorship attribution with balanced classes, for two datasets, one for female and another for male writers. Frequencies of usage of selected function words and punctuation marks were employed as characteristic features.

The test results confirmed the expectations and validated the usefulness of the proposed framework. The selection of rules referring to highly ranking attributes led to decision algorithms with good generalisation properties and sufficiently high support, while lower ranking attributes recalled mainly poor rules. Thus the relevance of attributes present in rule conditions was translated to importance of rules retrieved by them. However, as the proposed approach relied on algorithms for filtering of features, it could not guarantee obtaining the best classification accuracy.

Filters disregard the characteristics of a predictor employed, thus the systems based on them can be expected to be outperformed by other approaches. An example of such situation was shown for decision algorithms constructed while exploiting a random ordering of features and reversed Relief ordering. Even in these cases it could be observed that higher ranking features contributed the most to the quality of the classifier through the rules in which they appeared, while when dealing with lower ranking attributes many more were needed to arrive at some satisfactory performance.

Acknowledgments

DRSA processing was executed with 4eMka Software [2, 21]. The research described in the paper was performed within the project BK/RAu2/2015 at the Silesian University of Technology, Gliwice, Poland.

References

10.

11.

12.

13.

14.

15.

16.

17.

18.

19.

20.

21.

22.

23.

24.

25.

26.

27.

28.

29.

30.

31.

32.