Evaluating pattern restrictions for associative classifiers

Abstract

Associative classification is a pattern recognition approach that integrates classification and association rule discovery to build accurate classification models. These models are formed by a collection of contrast patterns that fulfill some restrictions. In this paper, we present an experimental comparison of the impact of different restrictions on classification accuracy. To the best of our knowledge, this is the first time such an analysis has been performed, and it yields some interesting findings about how restrictions affect the classification results. Contrasting these results with previously published papers, we found that their conclusions could be unintentionally biased by the restrictions they used. We found, for example, that the jumping restriction can severely damage pattern quality in the presence of dataset noise. We also found that the minimal support restriction affects the accuracy of two associative classifiers differently, so deciding which one is best depends on the support value. This paper opens some interesting lines of research, mainly the creation of new restrictions and of new pattern types obtained by combining different restrictions.

Keywords

Associative classifiers, restrictions, type of patterns, classifier evaluation

1. Introduction

One capability usually associated with learning is the ability to analyze a large number of specific observations in order to extract and retain the important common features that characterize classes of these observations. In 1982, Mitchell [42] formalized this as the problem of generalization: determining the most important generalizations that allow a collection of positive objects to be differentiated from a collection of negative objects.

To select important generalizations, a predicate formed by a collection of restrictions is generally used. A restriction is a simple Boolean expression that can be evaluated on each observation, and ranges from the simple evaluation of a numerical measure to complex relations between observations. Depending on the type of restrictions used, these generalizations take particular names such as subgroups [54], emerging patterns [15], contrast sets [10], supervised descriptive rules [46], contrast patterns [14], and discriminative patterns [39]. In this paper, we refer to all of them as contrast patterns.

Associative classification (AC) is a pattern recognition approach that integrates classification and association rule discovery to build classification models (classifiers). These models are always formed by a collection of contrast patterns mined from the training sample. Associative classifiers are used in very different tasks like sentiment analysis [45], photovoltaic technology concentration [29], music analysis [12], prediction of chronic kidney disease [48], and heart disease prediction[50].

The high diversity of contrast pattern types poses some challenges when selecting the appropriate one for a given problem. Firstly, the relation between the model accuracy and the type of contrast pattern used in a problem is unknown. Published experimental comparisons only derive conclusions about the comparative behavior of different types over dataset collections. Secondly, every type of contrast pattern is defined by a particular subset of restrictions. Nevertheless, comparative studies evaluate types of contrast patterns, obscuring the contribution of each individual restriction. Finally, some types of patterns are introduced as novel although they are composed of already known restrictions or differ only slightly from previously defined ones.

Several authors have reported comparisons among classifiers on different datasets, arriving at conclusions about which classifier is the best. As shown in this paper, such comparisons are meaningless unless the imposed restrictions and their values are considered. One algorithm can be significantly better than another under some restrictions and, at the same time, significantly worse under different restrictions on the same datasets. For example, we show that the classifier iCAEP [56], which is frequently considered superior to CAEP [16], is inferior for lower values of the minimal support restriction.

A similar problem arises in research comparing types of contrast patterns. For example, contrast patterns including the jumping restriction, which forbids the patterns to appear in other classes, are considered the most discriminative [20]. Nevertheless, we found that dataset noise degrades the quality of jumping patterns. As another example, the minimal support restriction is usually considered very important to guarantee the quality of contrast patterns. Nevertheless, we found that, on some classifiers, increasing the minimal support deteriorates the classifier accuracy.

In 2018, García-Borroto[24] performed a theoretical study about types of contrast patterns from the perspective of their restrictions. For a better understanding, he grouped restrictions based on intuitions, which are insights about how a good contrast pattern should be. In order to provide experimental support to the conclusions in [24], González-Méndez et al. [30] performed some experiments. Nevertheless, their experiments have some important limitations.

In this paper, we extend the experimental evaluation performed in [30], with the following changes:

(1) To avoid the bias that a particular miner could introduce, we mine all the frequent patterns with the FP-Growth algorithm after discretization. Since FP-Growth cannot deal with numerical attributes, we use the Entropy discretizer [21].
(2) We evaluate a larger collection of restrictions, covering all the intuitions introduced in [24].
(3) We use three different associative classifiers, with different algorithms for contrast pattern selection and vote aggregation.
(4) To avoid the impact of imbalanced datasets on associative classifiers [41], we only use balanced datasets.
(5) We perform statistical analysis when appropriate.

To the best of our knowledge, no similar study has been published, neither theoretical nor experimental. This study, along with its two previous papers, shifts the focus from types of patterns to restrictions. Focusing on restrictions could help to better understand the behavior of existing types of contrast patterns and could guide the creation of new types of patterns that improve the accuracy of associative classifiers.

The structure of the paper is the following. First, Section 2 introduces associative classifiers, including some formal definitions and the similarities and differences between them and other pattern-based classifiers. Then, Section 3 presents a review of different approaches to compare associative classifiers. Section 4 formalizes the restrictions based on different intuitions, including a review of their usage in different types of patterns. Section 5 presents the experiments performed together with the discussion of the results. Finally, Section 6 presents the conclusions and future work.

2. Associative classification

Let $D$ be a dataset formed by the $m$ instances $\left\{O_{1},O_{2},\ldots,O_{m}\right\}$ split into $c$ classes $D=C_{1}\cup C_{2}\cup\ldots\cup C_{c}$ . Every $O\in D$ belongs to a single class $\textit{class}(O)$ .

The universe of all possible, maybe infinite, items is denoted $I=\left\{i_{1},i_{2},\ldots\right\}$ . Every item $i$ has the structure $\textit{Feature}\otimes\textit{ValueOrSet}$ , where ValueOrSet is a value or subset of values of Feature and $\otimes$ is a relational operator compatible with Feature. For example, for the nominal feature Color we can have the items $\textit{Color}=\textit{red}$ or $\textit{Color}\in\left\{\textit{green},\textit{black}\right\}$ , while for the numerical feature Age we can have the items $\textit{Age}>23$ or $\textit{Age}\in[12,35]$ .

A pattern $P$ is a subset of $I$ . A pattern $P$ matches an instance $O$ , denoted as $\textit{match}(P,O)$ , if $O$ satisfies all the properties in the pattern's items. The set of instances in a set $S$ that match the pattern $P$ is denoted $S(P)$ . A discriminative pattern $P$ has an associated class, to which the majority of the objects in $S(P)$ belong. In the case of transactional databases, where instances are subsets of a finite set of transactions $T$ , $I\subseteq T$ and every pattern $P\subset T$ .

An important measure related to patterns is the support, defined as the ratio of instances matched by the pattern in a given set $S$ . Patterns with support above a given threshold are named frequent patterns.
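
The definitions of matching and support can be illustrated with a minimal Python sketch. The encoding of an item as a (feature, predicate) pair and the names `matches` and `support` are assumptions made for this illustration; they are not notation from the paper.

```python
from typing import Callable, Dict, List, Tuple

# Assumed encoding: an item is a (feature, predicate) pair,
# a pattern is a list of items, an instance is a feature -> value dict.
Item = Tuple[str, Callable[[object], bool]]
Pattern = List[Item]
Instance = Dict[str, object]


def matches(pattern: Pattern, instance: Instance) -> bool:
    """True when the instance satisfies every item of the pattern."""
    return all(pred(instance[feat]) for feat, pred in pattern)


def support(pattern: Pattern, instances: List[Instance]) -> float:
    """Fraction of the given instances matched by the pattern."""
    if not instances:
        return 0.0
    return sum(matches(pattern, o) for o in instances) / len(instances)


# Toy pattern: Color = red AND Age > 23
pattern = [("Color", lambda v: v == "red"), ("Age", lambda v: v > 23)]
data = [
    {"Color": "red", "Age": 30},
    {"Color": "red", "Age": 20},
    {"Color": "green", "Age": 40},
    {"Color": "red", "Age": 25},
]
frequent = support(pattern, data) >= 0.25  # frequent for a threshold of 0.25
```

In this toy sample the pattern matches two of the four instances, so its support is 0.5.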

A contrast pattern is a pattern that appears significantly more in a class $C_{i}$ than in the remaining problem classes. Different equations have been proposed to measure the significance of a pattern.

Like most trainable classifiers, associative classifiers operate in two stages: model building and model exploitation [2, 1]. During model building, a training sample formed by objects whose class is known is used to extract contrast patterns from each class $C_{i}$ . A contrast pattern is equivalent to a decision rule, having the structure $A\Rightarrow C_{i}$ . In the training stage, patterns can be sorted, filtered or grouped, depending on the particular algorithm.

During model exploitation, a query object $q$ is presented to the trained classifier. Then, the classifier uses the built model to return the most probable class for $q$ . The whole procedure can be formally described in the following way:

Model building
The model is built from the information present in a training sample:

(a) Find contrast patterns in the training sample. Depending on the type of pattern miner, this process might be performed independently for each problem class or in a single step for all classes.
(b) Prune uninteresting patterns based on particular criteria.

Model exploitation
The model is used to predict the class of a query object $q$ . The following steps are usually performed:

(a) From the contrast patterns mined in model building, find those matching $q$ . The matching can be full (if $q$ fulfills all the information of the pattern) or partial (if $q$ fulfills only a portion of the pattern).
(b) Sort and filter the found contrast patterns according to their relation to $q$ . Note that this step is different from the filtering step in model building, because here we use the information of $q$ .
(c) Aggregate the information of the selected contrast patterns, usually by some form of vote aggregation; each contrast pattern emits a vote according to some pattern measure.
(d) Assign the most probable class to $q$ .
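
The exploitation steps can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: patterns are sets of (feature, value) pairs, only full matching is used, the sorting and filtering of step (b) is omitted, and the function name `classify` and the `vote` callback are hypothetical.

```python
from collections import defaultdict


def classify(query, patterns, vote):
    """Return the class with the largest aggregated vote, or None.

    query:    set of (feature, value) pairs describing the object q.
    patterns: iterable of (items, cls, stats), with items a frozenset of
              (feature, value) pairs and stats any per-pattern statistics.
    vote:     function mapping stats to a non-negative number.
    """
    scores = defaultdict(float)
    for items, cls, stats in patterns:
        if items <= query:               # step (a): full matching only
            scores[cls] += vote(stats)   # step (c): vote aggregation
    if not scores:
        return None                      # no pattern matches the query
    return max(scores, key=scores.get)   # step (d): most voted class


patterns = [
    (frozenset({("Color", "red")}), "A", {"support": 0.6}),
    (frozenset({("Color", "red"), ("Size", "big")}), "B", {"support": 0.3}),
    (frozenset({("Size", "big")}), "B", {"support": 0.4}),
    (frozenset({("Shape", "round")}), "A", {"support": 0.9}),
]
query = {("Color", "red"), ("Size", "big")}
predicted = classify(query, patterns, vote=lambda s: s["support"])
```

Here class B collects the votes of two matching patterns and class A of one, so B is returned. Real associative classifiers differ mainly in how the vote is defined and in how patterns are filtered before voting.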

Examples of associative classifiers are CBA [4], CAEP [16], BCEP [18], CSM [35], the fuzzy classifier FEPM [27], iCAEP [56], and classifiers specialized for imbalanced datasets such as PBC4cip [41].

Associative classifiers are similar to other pattern-based classifiers, but they have the following differences:

1. Rule-based classifiers [53]. Rules are obtained by sequential covering algorithms, so most of the resulting rules are local and not representative of the whole dataset. Then, to classify a query object, a single rule or a small rule subset is selected.
2. Lazy associative classifiers [52]. A limited model is built for classifying a query object $q$ , instead of building a global model during model building. These classifiers are useful for problems where the training set changes frequently, so that building a new model each time might be costly. Additionally, they can be applied to large problems where the cost of building the whole model is prohibitively high.
3. Decision trees [47]. A decision tree is a tree where every path from a leaf to the root node represents a contrast pattern. In this way, it represents a hierarchical decomposition of the objects in the training dataset using a set of properties selected by a greedy algorithm. In these classifiers, a single contrast pattern matches the query object.

3. Comparing associative classifiers

To solve a real problem using an associative classifier, users need to make some decisions that have a significant impact on the quality of the result: the classifier to use, the type of contrast pattern, and the discretization procedure if it is necessary,¹ among others. In this section, we present published experimental comparisons evaluating different components or parameters.

¹ Some contrast pattern miners require that all the features are discrete.

Associative classifier selection. Studies compare the behavior of some classifiers evaluated on a set of repository datasets. Then, based on measures like accuracy or AUC [34], conclusions are derived using some statistical measures or techniques. For example, [52] compares 11 lazy learners using 9 datasets, and uses the average accuracy to suggest the best one. Similarly, [6] compares 6 associative classifiers using 13 datasets. In this case, the classifier that achieved the highest accuracy on most datasets was selected as the best one.

Quality measure used for pattern selection. To select which patterns are useful for classification, quality measures are usually used. In [25], the authors compare 10 different quality measures using 25 datasets. Each quality measure was used to sort the whole pattern collection, and a subset of fixed size was selected to build the classifier. Finally, the classifier accuracy was used to evaluate the quality measure. In that paper, all patterns were mined using a particular quality measure, so the results could be biased. Those results were later extended in [40] to include 33 quality measures and 61 datasets. A different approach was followed in [26], where the authors correlated the values of a large collection of quality measures with a quality value estimated using a different object subset. In rule-based systems, where patterns are mined using sequential covering algorithms,² a similar study [53] compared 10 quality measures using 34 datasets.

² At each step, the mining procedure is influenced by all the patterns mined before.

Parameters of a particular classifier. Some associative classifiers have particular parameters that also need to be selected or tuned. For example, some miners extract contrast patterns from a set of diversely generated decision trees, and in [28] different diversity generation procedures are compared. That paper compares the number of mined patterns and their quality using 65 datasets. As another example, rule-based classifiers use rule pruning techniques to reduce the (usually large) collection of mined rules, and in [43] the authors compared different pruning methods. The methods were evaluated on some datasets according to accuracy and reduction ratio. A final example concerns a limitation of many contrast pattern miners: the inability to deal with numerical features. In [5], the authors study the effect of 9 discretization methods on the performance of one particular associative classifier.

In 2018, García-Borroto[24] performed a theoretical study about types of contrast patterns from the perspective of their restrictions. For a better understanding, he grouped restrictions based on intuitions, which are insights about how a good contrast pattern should be. Although some interesting conclusions were suggested, no experimental evidence was provided to support them.

In order to provide experimental support to the conclusions in [24], González-Méndez et al. [30] performed some experiments. Nevertheless, their experiments have some important limitations. Firstly, only a single measure was evaluated per intuition, using two classifiers. Secondly, contrast patterns were mined using an algorithm designed to mine a particular type of pattern, which could bias the result. The miner used did not discretize the dataset, which was good for avoiding information loss, but it did not mine the whole set of patterns. That can also introduce some implicit bias that is very hard to detect. Thirdly, the datasets selected for the experiments included imbalanced datasets, but the evaluation was performed using accuracy, a measure that can be unreliable on imbalanced datasets. Finally, no statistical analysis was performed.

This study, together with its two previous papers, shifts the focus from types of patterns to restrictions, which could help to better understand the behavior of existing combinations of restrictions and to guide the creation of new combinations that improve the accuracy of associative classifiers.

4. Restrictions in associative classifiers

To build an associative classifier, we need to mine useful contrast patterns. The definition of a useful contrast pattern usually includes a set of restrictions like minimum support or confidence. These restrictions are used for three main purposes: speeding up the mining process, avoiding noisy patterns, and avoiding redundant patterns. In [24], the author groups these restrictions according to three intuitions or insights about what a good contrast pattern is, which appear implicitly in many related papers.

In this section, we present some restrictions grouped by intuitions, using the following notation. The probability of finding an object with a given contrast pattern $P$ is denoted by $p(P)$ , while the probability of not finding an object with the pattern is denoted as $p(\neg{P})=1-p(P)$ . With respect to a given class $C$ , the probabilities of finding an object of that class and of a different class are denoted as $p(C)$ and $p(\neg{C})$ , respectively. Joint probabilities are then denoted as $p(PC)$ , $p(P\neg{C})$ , and so on.

Intuition 1, Frequent. Each pattern must be frequent enough to guarantee it is not due to chance.

This intuition is commonly measured using the pattern support $p(P)>\mu$ (Rst.A). In some cases, even the support in the negative class is forced to be greater than a threshold $p(P\neg{C})>\mu$ (Rst.B).

Intuition 2, Contrast. Each pattern must be distinctive of its representative class, so it must contrast one class with the others. Contrast is evaluated using a quality measure and a threshold value: contrast patterns having a quality value below the threshold are considered non-important. Some of the quality measures used to evaluate this intuition appear here:³

³ We omit the original references here because they are not relevant for this paper. Readers can find them in [23].

• Perhaps the first and most used measure to estimate contrast is confidence $p(C|P)$ (Rst.A). Confidence contrasts the probability of the pattern in the class with respect to the global probability of the pattern. Since confidence estimation might fail for patterns with low support, the Laplace correction $\frac{p(PC)+1/N}{p(P)+2/N}$ (Rst.B) is used. Additionally, confidence has been extended to the fuzzy case (Rst.C).

• Contrasting the probability of finding the pattern in both classes, estimated by division in Emerging Patterns (Rst.D) $\frac{p(P|C)}{p(P|\neg{C})}$ and Subgroups (Rst.E) $\frac{p(PC)}{p(P\neg{C})}$ , or by difference in Contrast Sets (Rst.F) $p(P|C)-p(P|\neg{C})$ .

• Restricting the appearance of the pattern in the negative class (Rst.G) $p(P\neg{C})\leqslant\mu$ . A special case of this restriction is known as the Jumping restriction (Rst.H), which forbids the pattern to appear in other classes, $p(P\neg{C})=0$ .

• Contrasting the probability of finding and not finding the pattern in the positive class, using for example (Rst.I) Relative Risk $\frac{p(C|P)}{p(C|\neg{P})}$ .

• Testing for dependency between the pattern and the positive class, using different quality measures like (Rst.J) WRACC $p(CP)-p(P)p(C)$ , (Rst.K) Lift $\frac{p(CP)}{p(P)p(C)}$ , (Rst.L) $\chi^{2}$ , (Rst.M) Pearson correlation, (Rst.N) Odds Ratio, or (Rst.O) NetConf.
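
Several of these measures can be estimated from four counts taken on the training sample. The following Python sketch shows one way to do it; the function name, the argument names, and the reading of $N$ in the Laplace correction as the number of training instances are assumptions of this illustration. It also assumes that the pattern matches at least one instance and that both the class and its complement are non-empty.

```python
def contrast_measures(n_pc: int, n_p: int, n_c: int, n: int) -> dict:
    """Estimate contrast measures from counts.

    n_pc: instances of class C matched by the pattern
    n_p:  instances matched by the pattern (n_p > 0)
    n_c:  instances of class C (0 < n_c < n)
    n:    total number of instances
    """
    p_pc = n_pc / n                            # p(PC)
    p_p = n_p / n                              # p(P)
    p_c = n_c / n                              # p(C)
    p_pnc = (n_p - n_pc) / n                   # p(P not-C)
    p_p_given_c = n_pc / n_c                   # p(P|C)
    p_p_given_nc = (n_p - n_pc) / (n - n_c)    # p(P|not-C)
    return {
        "confidence": p_pc / p_p,
        "laplace": (p_pc + 1 / n) / (p_p + 2 / n),
        # The growth rate is taken as infinite when the pattern never
        # appears outside the class (a jumping pattern).
        "growth_rate": (p_p_given_c / p_p_given_nc
                        if p_p_given_nc > 0 else float("inf")),
        "support_difference": p_p_given_c - p_p_given_nc,
        "wracc": p_pc - p_p * p_c,
        "lift": p_pc / (p_p * p_c),
        "jumping": p_pnc == 0,
    }


# Pattern matching 30 of 100 instances, 24 of them in a class of 50.
m = contrast_measures(n_pc=24, n_p=30, n_c=50, n=100)
passes_gr = m["growth_rate"] >= 3   # a threshold-based contrast restriction
```

For these counts the confidence is 0.8 and the growth rate is 4, so the pattern passes a growth rate threshold of 3 but is not a jumping pattern.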

Intuition 3, Minimal. In this intuition, restrictions are evaluated on a pattern with respect to other patterns. They are based on the intuition that shorter, more general patterns are preferred, because they contain the irreducible relations among attributes that determine the class. They are necessary in problems where many contrast patterns are almost identical, with only small changes among them.

The simplest restriction based on this intuition is (Rst.A) to remove all non-minimal patterns (with respect to the item subset inclusion). Like other restrictions in this intuition, it must be applied together with other restrictions. Other restrictions allow non-minimal patterns, but only when they are better than all their subsets according to some criterion:

• (Rst.B) Positive growth rate (GR) improvement, where $\textit{rateimp}(P)=\min_{P^{\prime}\subset P}\{\operatorname{GR}(P)-\operatorname{GR}(P^{\prime})\}$ [57]

• (Rst.C) Relative growth rate improvement greater than 1, where $\textit{relrateimp}(P)=\min_{P^{\prime}\subset P}\{\operatorname{GR}(P)/\operatorname{GR}(P^{\prime})\}$ [57]

• (Rst.D) Positive coverage improvement, where $\forall P^{\prime}\subset P:\operatorname{Support}(P)>\operatorname{Support}(P^{\prime})$ [57]

• (Rst.E) Statistically different than sub-patterns. $\left(|P|=1\right)\vee\left(|P|>1\wedge\forall P^{\prime}\subset P,|P^{\prime}|=|P|-1\Rightarrow\textit{chiTest}(P,P^{\prime})\geqslant\eta\right)$ , where $\eta=3.84$ is a minimum chi-value threshold

• (Rst.F) Positive strength improvement. $\forall P^{\prime}\subset P,\operatorname{Support}(P^{\prime})\geqslant\mu\wedge\operatorname{GR}(P^{\prime})\geqslant\rho\Rightarrow\operatorname{Strength}(P)>\operatorname{Strength}(P^{\prime})$

• (Rst.G) Productive restriction. For every pair of patterns $P_{1}$ and $P_{2}$ formed by a partition of the items in $P$ , $p(P|C)>p(P_{1}|C)p(P_{2}|C)$ and $p(P|\neg{C})>p(P_{1}|\neg{C})p(P_{2}|\neg{C})$ .

• (Rst.H) Conditional discriminative, where $\min_{P^{\prime}\subset P}\{\operatorname{SupDif}(P^{\prime})\}\geqslant\mu$ [57]

• (Rst.I) Correlation improvement, if $\forall P^{\prime}\subset P:|\operatorname{Pearson}(P)-\operatorname{Pearson}(P^{\prime})|\geqslant\gamma$
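
As an illustration of how restrictions of this intuition compare a pattern with its sub-patterns, the following Python sketch implements the minimality filter (Rst.A) and the growth rate improvement of Rst.B. The function names are hypothetical, patterns are represented as sets of hashable items, the empty pattern is ignored, and only sub-patterns whose growth rate is available in the given dictionary are considered; all of these are assumptions of the sketch.

```python
from itertools import combinations


def minimal_patterns(patterns):
    """Keep the patterns with no proper subset in the collection (Rst.A)."""
    pats = [frozenset(p) for p in patterns]
    return [p for p in pats if not any(q < p for q in pats)]


def proper_subsets(pattern):
    """Yield every non-empty proper subset of the pattern."""
    items = sorted(pattern)
    for size in range(1, len(items)):
        for combo in combinations(items, size):
            yield frozenset(combo)


def rate_improvement(pattern, growth_rate):
    """min over known sub-patterns P' of GR(P) - GR(P') (used by Rst.B).

    growth_rate: dict mapping frozenset patterns to their growth rate.
    Returns +inf when no sub-pattern is known.
    """
    pattern = frozenset(pattern)
    diffs = [growth_rate[pattern] - growth_rate[s]
             for s in proper_subsets(pattern) if s in growth_rate]
    return min(diffs) if diffs else float("inf")


gr = {
    frozenset({"a"}): 2.0,
    frozenset({"b"}): 3.0,
    frozenset({"a", "b"}): 5.0,
}
kept = minimal_patterns([{"a"}, {"a", "b"}, {"c", "d"}])
improves = rate_improvement({"a", "b"}, gr) > 0   # Rst.B holds for {a, b}
```

In the toy data, the pattern {a, b} is discarded by Rst.A because {a} is also present, but it satisfies Rst.B because its growth rate exceeds that of both of its sub-patterns.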

Table 1

Restrictions used in different types of contrast patterns. The three rightmost columns (I1, I2, I3) correspond to the three intuitions [24]

Year	Type of contrast pattern	I1	I2	I3
1982	Consistent Generalization[42]		rst:Jumping
1997	Unusual subgroup[54]	rst:Support	rst:Subgroup
1998	CAR[38]	rst:Support	rst:Conf
1999	Emerging Pattern (EP)[15]		rst:Emerging
1999	Jumping EP (JEP) [15]		rst:Jumping, rst:Emerging
2000	Constrained EP(ConsEP) [57]		rst:Jumping	rst:PostivGRImprov or rst:RelGRImprovAboveOne, rst:PositSuppImprov
2001	Contrast set(CS)[11]		rst:SupDiff , rst:ChiBounded
2002	Interesting subgroup[22]		rst:SebagVersion
2002	Essential JEP (eJEP)[17]	rst:Support	rst:Jumping	rst:Minimal
2003	Constrained EP (CEP)[9]	rst:Support	rst:NegativeSupport
2003	Interesting EP[19]	rst:Support	rst:Emerging	rst:PostivGRImprov, rst:StatDifThanSubpat
2003	Predictive AR[55]	rst:Support	rst:Laplace
2004	Interesting subgroup[36]		rst:Subgroup
2004	Strong Frequent Pattern[51]	rst:Support	rst:Conf	rst:PositSuppImprov
2005	Chi EP (Chi-EP)[49]	rst:Support	rst:Emerging	rst:PositiveStrengthImprov , rst:StatDifThanSubpat
2005	Relative Risk Pattern[37]		rst:RelativeRisk
2005	Odds Ratio Pattern[37]		rst:OddsRatio
2006	Strong JEP[20]	rst:Support	rst:Jumping	rst:Minimal
2006	Noise tolerant EP(NEP)[20]	rst:Support	rst:NegativeSupport	rst:Minimal
2007	Contrast set in CIGAR[33]	rst:Support	rst:SupDiff, rst:ChiBounded, rst:PearsonBounded	rst:PearsonImprovement
2009	Interesting subgroup[7]		rst:Subgroup,rst:Lift
2011	Fuzzy EP[27]		rst:FuzzySup
2012	CAR-NF[32]	rst:Support	rst:Conf,rst:NetConf
2015	Nofong EP[44]	rst:Support, rst:SupportNegClass	rst:Emerging
2015	Productive EP[44]	rst:Support, rst:SupportNegClass	rst:Emerging	rst:Productive
2017	Conditional Discrim. Pattern[31]		rst:SupDiff	rst:ConditDiscrim

Table 1 presents a collection of existing types of contrast patterns [24]. Each of the three rightmost columns is associated with a particular intuition, and each cell contains the restrictions used in the particular case. Based on the table, the author in [24] arrives at the following conclusions:

• As expected, all types of contrast patterns include a contrast restriction. Many different quality measures have been used, despite the high correlations reported between them [23].

• Most types of contrast patterns with a minimal pattern restriction also include a minimal support restriction. This is somewhat unexpected, because minimal patterns usually have the highest supports.

• Only half of the contrast pattern types use a restriction based on support. The author hypothesizes that this is because finding a good threshold for minimal support is hard and cannot be accurately estimated a priori using dataset characteristics.

• The same restriction appears in different types of patterns. For example, restriction 1A appears 15 times and 2D appears 7 times. Additionally, different pattern types contain almost identical subsets of restrictions, which makes it harder to select the appropriate one for a given problem.

5. Experimental evaluation

In this section, we present the experimental evaluation of some selected restrictions belonging to the three described intuitions. First, Section 5.1 presents the experimental setup, including datasets and evaluation protocol. Then, Section 5.2 introduces the associative classifiers used in the experiments. Since all types of contrast patterns include a contrast restriction, Section 5.3 evaluates those restrictions in order to select the best one. Restrictions belonging to the remaining two intuitions are evaluated in Sections 5.4 and 5.5.

5.1 Experimental setup

Table 2
Dataset description: number of instances, number of features, number of classes, minimal support used to mine frequent patterns, and number of frequent patterns mined. Since cross-validation was used, the last column is the average over the 10 folds

Db	Instances	Features	Classes	Support	Patterns
abalone-3-class-version.arff	4177	9	3	0.001	5384
analcatdata_germangss.arff	400	6	4	0.001	5157
banknote-authentication.arff	1372	5	2	0.001	1531
confidence.arff	72	4	6	0.001	272
iris.arff	150	5	3	0.001	962
monks-problems-1.arff	556	7	2	0.001	4045
monks-problems-3.arff	554	7	2	0.001	3859
seeds.arff	210	8	3	0.001	20702
tae.arff	151	6	3	0.001	2286
teachingAssistant.arff	151	7	3	0.001	5768
australian.arff	690	15	2	0.01	19776
heart-statlog.arff	270	14	2	0.01	20354
Mcredit.a.arff	690	16	2	0.01	36386
segment.arff	2310	20	7	0.01	133142
seismic-bumps.arff	2584	19	2	0.01	471790
sonar.arff	208	61	2	0.01	140815
vehicle.arff	846	19	4	0.01	981
wine.arff	178	14	3	0.01	1686
credit.g.arff	1000	21	2	0.03	230804
ionosphere.arff	351	35	2	0.03	110899
pasture.arff	36	23	3	0.05	126409

We selected 21 datasets from the UCI Machine Learning Repository [8] described in Table 2. From all the available datasets in that repository, we selected those with the following characteristics:

• Classes must have a balanced number of instances, to avoid dealing with imbalanced datasets.

• Datasets must have fewer than a million frequent patterns at very low support, to evaluate the minimal support restrictions.

• Datasets must have a large collection of contrast patterns at a relatively high minimal support threshold, to evaluate different contrast restrictions and how their behavior is influenced by the minimal support threshold.

Statistical tests for comparing classifiers were performed according to [13]. First, we performed a Friedman test with the null hypothesis that the classifiers' results are identical. Then, if the p-value was below 0.05, we rejected the null hypothesis and performed an all-vs-all Holm post-hoc test to find out which classifiers actually differ. Classifier pairs with an adjusted p-value below 0.05 are considered different.

Since all datasets are balanced, all comparisons are based on accuracy, estimated using 10-fold cross-validation. Ranks are calculated as in the Friedman test, assigning rank 1 to the best procedure, so lower rank values are better.
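
Two pieces of this protocol are simple enough to sketch in plain Python: the average ranks used by the Friedman test and the Holm adjustment of a set of p-values. The function names are hypothetical, and the raw p-values of the Friedman test and of the pairwise comparisons are assumed to come from a statistics package; they are not computed here.

```python
def average_ranks(accuracy_rows):
    """Average rank of each method over all datasets.

    accuracy_rows[d][k] is the accuracy of method k on dataset d.
    Rank 1 is the best (highest accuracy); tied methods share the mean rank.
    """
    n_methods = len(accuracy_rows[0])
    totals = [0.0] * n_methods
    for row in accuracy_rows:
        order = sorted(range(n_methods), key=lambda k: -row[k])
        i = 0
        while i < n_methods:
            j = i
            # Extend the group while the next method ties with the current one.
            while j + 1 < n_methods and row[order[j + 1]] == row[order[i]]:
                j += 1
            mean_rank = (i + j) / 2 + 1
            for k in order[i:j + 1]:
                totals[k] += mean_rank
            i = j + 1
    return [t / len(accuracy_rows) for t in totals]


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for pos, i in enumerate(order):
        # Multiply the pos-th smallest p-value by (m - pos), keep monotone.
        running_max = max(running_max, (m - pos) * p_values[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted


ranks = average_ranks([[0.9, 0.8, 0.7], [0.6, 0.6, 0.5], [0.7, 0.9, 0.8]])
adjusted = holm_adjust([0.01, 0.04, 0.03])
different = [p < 0.05 for p in adjusted]
```

With the toy p-values above, only the first comparison remains below 0.05 after the adjustment.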

5.2 Associative classifiers used in experiments

In this section we describe the three associative classifiers used in the experiments. They were selected because they are frequently used in comparisons and the source code is available.

CAEP [16] classifies an object $q$ by aggregating the votes of all the contrast patterns that match $q$ . The aggregated vote for each class is calculated by Eq. (1). Aggregated votes are then normalized by the mean score calculated using the training sample. Finally, the class with the largest aggregated score is selected.

$\displaystyle\textit{vote}(q,C_{i})=\sum_{P\in\textit{CP}_{C_{i}}}\textit{sup}(P,C_{i})\frac{\operatorname{GR}(P)}{\operatorname{GR}(P)+1}$ (1)

where $q$ is the query object, $\textit{CP}_{C_{i}}$ are the contrast patterns of class $C_{i}$ that match $q$ , $\textit{sup}(P,C_{i})$ is the support of $P$ in class $C_{i}$ , and $\operatorname{GR}(P)$ is the growth rate of contrast pattern $P$ .
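
A minimal Python sketch of the vote in Eq. (1) follows. The function name, the representation of patterns as sets of items, and the handling of infinite growth rates (the factor $\operatorname{GR}/(\operatorname{GR}+1)$ is replaced by its limit 1) are assumptions of this illustration, and the normalization step is not included.

```python
def caep_vote(query, class_patterns):
    """Aggregated vote of one class for the query object, as in Eq. (1).

    query:          set of items describing the object q.
    class_patterns: iterable of (items, support_in_class, growth_rate)
                    for the contrast patterns of the class.
    """
    total = 0.0
    for items, sup, gr in class_patterns:
        if items <= query:  # only patterns matching q contribute
            # GR / (GR + 1) tends to 1 for jumping patterns (GR = inf).
            weight = 1.0 if gr == float("inf") else gr / (gr + 1.0)
            total += sup * weight
    return total


query = {"a", "b"}
patterns_c1 = [
    (frozenset({"a"}), 0.5, 4.0),           # matches: 0.5 * 4/5
    (frozenset({"b"}), 0.2, float("inf")),  # matches: 0.2 * 1
    (frozenset({"c"}), 0.9, 9.0),           # does not match q
]
score_c1 = caep_vote(query, patterns_c1)
```

The two matching patterns contribute 0.4 and 0.2, respectively, while the third pattern is ignored because the query does not contain the item c.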

The second classifier, iCAEP [56], employs the minimum encoding inference approach to classify an object, based on the assumption that a good model is the one leading to a concise total description. To materialize this idea, iCAEP builds a model per class by iteratively adding contrast patterns until all object properties are covered. Contrast patterns are selected from those mined in the class after sorting them using a particular criterion. Finally, the class that minimizes the encoding length in Eq. (2) is returned.

$\displaystyle L(q||C_{i})=-\sum^{p}_{k=1}\log_{2}p(P_{k}|C_{i}),\quad P_{k}\in E_{q}^{C_{i}}$ (2)

where $p(P_{k}|C_{i})$ is the probability of finding contrast pattern $P_{k}$ in class $C_{i}$ , estimated for a pattern $P$ by $p(P|C_{i})=\frac{\textit{sup}(P\wedge C_{i})+2\cdot\frac{\textit{sup}(P)}{N}}{\textit{sup}(C_{i})+2}$ , $N$ is the number of training instances, $E_{q}^{C_{i}}$ is the set of contrast patterns selected to model object $q$ in class $C_{i}$ , and $p$ is the number of patterns in $E_{q}^{C_{i}}$ .

BCEP[18] uses contrast patterns matching the query object $q$ to derive a product approximation of $p(q,C_{i})$ . Then, the class with the highest probability is returned.

5.3 Contrast restrictions

Table 3
Candidate thresholds used for each selected quality measure

Quality measure	Threshold 1	Threshold 2	Threshold 3
Growth rate	3	5	7
$\textit{chi}^{2}$	0.5	0.7	0.9
Odds ratio	2.5	5	10
Strength	0.02	0.04	0.06
WRACC	0.01	0.03	0.05

The most important restrictions for associative classifiers are those in the Contrast intuition [24]. These restrictions are composed of a quality measure and a threshold value. Since many quality measures produce highly correlated rankings, the results obtained with one or another are very similar [23, 40]. In this paper, we used the groups of correlated measures created in [40], selecting one measure per group: growth rate (gr), odds ratio (or), weighted relative accuracy (wracc), chi-squared (chi2), and strength. Selecting proper threshold values is a complex task with a direct impact on the quality of the result. In this study, we selected three candidate thresholds based on the histogram of each quality measure (Table 3). Then, results are compared taking into account the highest, median, and lowest accuracy obtained per quality measure, regardless of the ordering of the threshold values. Figure 1 presents the ranks per classifier considering the highest accuracy value (Fig. 1a), the median value (Fig. 1b), and the lowest value (Fig. 1c).

Figure 1.

Rank of the contrast metrics considering the maximum, median, and minimum accuracies obtained by different cut values.

Figure 1a shows that the best ranks are achieved using the growth rate (gr) and the odds ratio (or) in the three classifiers. Other measures like wracc perform well with some classifiers, but not so well with the others. Figures 1b and 1c show that gr is usually equal to or better than the odds ratio. Based on these results, we conclude that gr is the quality measure with the best performance.

Since the gr restriction has the best performance, we use the restriction $gr(P)\geqslant 5$ in the evaluation of restrictions that need a contrast restriction.

5.3.1 Jep restriction

The jumping restriction is frequently used in many different types of contrast patterns, so it deserves a particular analysis. In [30], an experiment shows the impact of the noise level in the dataset on the quality of the patterns mined with this restriction. In this subsection, we expand those results.
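To illustrate why this restriction is expected to be sensitive to noise, the sketch below assumes that a jumping pattern is one with zero support in all but one class. The per-class counts and the helper name `is_jumping` are hypothetical.

```python
def is_jumping(class_counts):
    """True when the pattern covers objects of exactly one class."""
    return sum(1 for count in class_counts.values() if count > 0) == 1


# Objects covered by a hypothetical pattern, counted per class.
clean = {"pos": 40, "neg": 0}
# The same pattern after a single covered object is relabeled by noise.
noisy = {"pos": 39, "neg": 1}

print(is_jumping(clean))  # True
print(is_jumping(noisy))  # False
```

Under this assumption, a single mislabeled object among those covered is enough for the pattern to stop fulfilling the restriction, whereas its growth rate would only decrease.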

In this experiment, we randomly insert noise into all datasets at levels from 0.02 to 0.12, in increments of 0.02. With higher noise levels, the accuracy drops so significantly that no analysis is possible. To add the noise, random objects are selected and the class of each one is changed to a different random class.
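The noise insertion procedure can be sketched as follows. We assume here that the noise level is the fraction of objects whose class is changed; the function name, the seed parameter, and the toy labels are ours.

```python
import random


def add_label_noise(labels, noise_level, seed=None):
    """Return a copy of `labels` where a fraction `noise_level` of the objects
    gets a different, randomly chosen class."""
    classes = sorted(set(labels))
    if len(classes) < 2:
        raise ValueError("at least two classes are needed to change a label")
    rng = random.Random(seed)
    noisy = list(labels)
    n_changed = round(noise_level * len(noisy))
    # Select distinct random objects, then move each one to another class.
    for i in rng.sample(range(len(noisy)), n_changed):
        alternatives = [c for c in classes if c != noisy[i]]
        noisy[i] = rng.choice(alternatives)
    return noisy


labels = ["a"] * 50 + ["b"] * 30 + ["c"] * 20
for level in (0.02, 0.04, 0.06, 0.08, 0.10, 0.12):
    noisy = add_label_noise(labels, level, seed=0)
    changed = sum(1 for old, new in zip(labels, noisy) if old != new)
    print(level, changed)
```

Because the replacement class is drawn only from the other classes, every selected object effectively changes its label.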

Figure 2.

Rank comparison of the accuracy using and not using the jumping restriction with different levels of noise.

Figure 2 shows the ranks of each of the tested classifiers using two different sets of patterns: the patterns with the jumping restriction (jep) and all the patterns (nojep). Results for CAEP (Fig. 2a) and iCAEP (Fig. 2b) are similar: when no noise is inserted, patterns with the jumping restriction are better, and the more noise is inserted, the more the quality of the jumping patterns deteriorates. In the case of BCEP (Fig. 2c), using all the patterns always leads to better results than using jumping patterns. Nevertheless, after $\textit{noise}=0.04$, the quality of jumping patterns is reduced less than the quality of all the patterns. We have no explanation for that behavior. Differences are all statistically significant starting from some noise level for each classifier (specified in the figure).

To look for an explanation of the accuracy deterioration of jumping patterns, Fig. 3 shows the number of patterns found by the miner, both jumping and total. Since jumping patterns are a subset of the total pattern set, we show the ratio of the number of patterns with respect to the maximum number of patterns found when no noise is introduced. As can be seen, in most datasets, after some noise level, the number of jumping patterns decreases more rapidly than the number of total patterns, which could be a factor in the accuracy drop.

Figure 3.

Number of patterns, relative to the number of patterns found without noise, while increasing the noise level from 0 to 0.12. The black line corresponds to using the jep restriction, and the gray line to not using it.

5.4 Minimal support restriction

In this section, we evaluate the minimal support restriction ($p(\textit{PC})>\mu$). Since we are using supervised classifiers, it is used together with the contrast restriction $gr(P)\geqslant 5$. A minimal support equal to 0 means that all mined patterns are considered.
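The combination of both restrictions can be sketched as a simple filter over the mined patterns. The `Pattern` structure, its field names, and the sample values are hypothetical; only the two conditions come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    items: frozenset
    support: float      # p(PC): joint support of the pattern and its class
    growth_rate: float  # gr(P)


def select(patterns, mu, min_gr=5.0):
    """Keep the patterns with p(PC) > mu and gr(P) >= min_gr."""
    return [p for p in patterns if p.support > mu and p.growth_rate >= min_gr]


# Invented patterns, for illustration only.
mined = [
    Pattern(frozenset({"a=1"}), 0.20, 8.0),
    Pattern(frozenset({"a=1", "b=0"}), 0.02, 12.0),
    Pattern(frozenset({"c=3"}), 0.25, 2.0),  # fails the contrast restriction
]

print(len(select(mined, mu=0.0)))   # 2: no support filtering
print(len(select(mined, mu=0.05)))  # 1: the low-support pattern is removed
```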

The minimal support restriction is used mainly for the following reasons:

• Control the mining time and the space consumed by the mined patterns. This can be critical, because even small datasets can contain very large collections of contrast patterns.
• Control the classification time. This is important, because most associative classifiers need to use the whole pattern collection in order to classify query objects.
• Improve the accuracy of the classifier. Since contrast patterns with low support can appear due to chance, removing them is expected to increase the quality of the resultant classifier.

Figure 4.
Accuracy of classifiers CAEP (black, dotted), BCEP (gray, dashed) and iCAEP (light gray, continuous) per dataset with respect to the minimal support value. Values on the $x$ axes are minimal supports from 0 to 0.3, and values on the $y$ axes range from 0 to 1.

In this section, we focus on the third reason, which is the one related to classifier accuracy. The remaining two reasons might apply in some cases even if they imply a drop in accuracy. Figure 4 shows the accuracy (between 0 and 1) on all the tested datasets, while increasing the value of the minimal support threshold.

Results differ among the tested classifiers. For CAEP (black line), increasing the minimal support threshold usually deteriorates the classifier accuracy. A possible explanation is that objects on the class border usually match only contrast patterns with low support. If we remove those patterns, such objects match no pattern or just a few, so vote aggregation is no longer a good class estimate.
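The explanation above can be illustrated with a toy coverage computation: the fraction of query objects that still match at least one pattern after applying the minimal support restriction. The patterns, objects, and function names are invented; this is not the procedure used to produce Fig. 4.

```python
def covers(pattern_items, object_items):
    """A pattern matches an object when all its items appear in the object."""
    return pattern_items <= object_items


def coverage(patterns, objects, mu):
    """Fraction of objects matched by at least one pattern with support > mu."""
    kept = [items for items, support in patterns if support > mu]
    matched = sum(1 for obj in objects if any(covers(items, obj) for items in kept))
    return matched / len(objects)


# One general, well-supported pattern and one specific, low-support pattern.
patterns = [
    (frozenset({"a=1"}), 0.30),
    (frozenset({"b=2", "c=0"}), 0.03),
]
objects = [
    frozenset({"a=1", "b=0", "c=0"}),  # typical object
    frozenset({"a=0", "b=2", "c=0"}),  # border object, matched only by the rare pattern
]

print(coverage(patterns, objects, mu=0.0))   # 1.0
print(coverage(patterns, objects, mu=0.05))  # 0.5
```

In this toy case, the border object loses its only matching pattern once the threshold exceeds 0.03, so no vote can be aggregated for it.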

BCEP (gray line) behaves similarly to CAEP, for a similar reason: removing patterns with low support can make the product approximation of the class posterior probability of border objects very unreliable or even impossible to calculate.

Finally, the behavior of iCAEP (light gray) is different in some datasets. This is because iCAEP iteratively builds a classification model starting with the contrast patterns that have the largest number of items. If contrast patterns with low support are allowed, they will be used with high probability in the final classification. Since patterns with low support are usually of low quality, the accuracy might deteriorate.

Figure 5.
Average ranking of the classifiers using different minimal support restrictions.

To compare the accuracies of these classifiers, Fig. 5 shows their rank comparison using different minimal support thresholds. For lower minimal support values, the CAEP classifier is better than the other two classifiers by a large margin. Increasing the minimal support threshold, on the other hand, favors iCAEP, which is better for higher thresholds. The statistical analysis, performed independently for each threshold value, reveals that differences were significant only for some values of the minimal support threshold: 0, 0.1, and 0.25.
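A rank comparison of this kind can be sketched as follows, assuming that the classifiers are ranked by accuracy on each dataset (rank 1 is the best, ties share the average rank) and that ranks are then averaged over the datasets. The accuracies are invented for illustration, and the exact ranking procedure behind Fig. 5 is not restated here.

```python
def ranks_desc(scores):
    """Rank a {name: score} dict so the highest score gets rank 1.

    Tied scores share the average of the ranks they span.
    """
    order = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared = (i + 1 + j + 1) / 2  # average of the 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def average_ranks(results):
    """results: one {classifier: accuracy} dict per dataset."""
    totals = {}
    for scores in results:
        for name, rank in ranks_desc(scores).items():
            totals[name] = totals.get(name, 0.0) + rank
    return {name: total / len(results) for name, total in totals.items()}


# Hypothetical accuracies on two datasets for one minimal support threshold.
results = [
    {"CAEP": 0.90, "BCEP": 0.80, "iCAEP": 0.80},
    {"CAEP": 0.70, "BCEP": 0.75, "iCAEP": 0.80},
]
print(average_ranks(results))
```

Repeating the computation for each minimal support threshold gives one point per classifier and threshold.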

This result could explain why iCAEP is usually reported as superior to CAEP: all comparative studies use a minimal support threshold and do not evaluate the influence of that threshold on the tested classifiers. For example, in [56], iCAEP is reported to be better than CAEP using a minimal support threshold of 0.1. This value is close to the point where we find that iCAEP starts to outperform CAEP. The influence of the minimal support threshold on the behavior of a supervised classifier is frequently reported in other classification problems [3].

5.5 Minimal patterns

Restrictions in the minimal intuition are used for the following reasons:

• To reduce pattern set redundancy, because in many datasets there are large amounts of very similar patterns, and using all of them could induce errors. Removing the redundancy also yields smaller models and faster classifiers.
• To make the classifier tolerant to noise, because the most general patterns are usually those with the largest supports, and their quality measures are less affected by wrongly labeled objects.
• To substantially reduce mining time in generative methods, because once the algorithm finds a contrast pattern, more particular patterns do not need to be generated and tested.

In this section, we evaluate the simplest restriction in the minimal intuition: to reject all the contrast patterns that are not minimal. We do not evaluate other restrictions because each one is used only in a single paper. We compare the accuracy results using three different pattern sets: all the patterns (all), those fulfilling the minimal pattern restriction (minimal), and those not fulfilling it (non-minimal). We also explore the influence of the classifier and the minimal support threshold on the result, as can be seen in Fig. 6.
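The three pattern sets can be sketched as follows, assuming that a contrast pattern is minimal when no proper subset of its items is also a contrast pattern in the mined collection. Patterns are represented as item sets; the function name and the sample patterns are hypothetical.

```python
def split_minimal(patterns):
    """Split a collection of item sets into (minimal, non_minimal)."""
    patterns = set(patterns)
    # For frozensets, q < p tests whether q is a proper subset of p.
    minimal = {p for p in patterns if not any(q < p for q in patterns)}
    return minimal, patterns - minimal


all_patterns = {
    frozenset({"a=1"}),
    frozenset({"a=1", "b=0"}),         # more particular than {a=1}
    frozenset({"a=1", "b=0", "c=2"}),  # more particular than both above
    frozenset({"d=4", "e=1"}),         # no contained contrast pattern
}

minimal, non_minimal = split_minimal(all_patterns)
print(len(all_patterns), len(minimal), len(non_minimal))  # 4 2 2
```

This quadratic check is written for clarity; a miner that stops specializing a pattern as soon as it fulfills the contrast restriction obtains the minimal set without generating the rest.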

Figure 6.
Ranks of the collections of minimal patterns, non-minimal patterns, and all patterns, using different minimal support levels.

We can extract some conclusions. On the one hand, removing minimal patterns deteriorates the classifier in almost all configurations, so minimal patterns are important. On the other hand, using only minimal patterns deteriorates the quality for lower minimal supports. For higher minimal supports, the results depend on the classifier, but tend to favor minimal patterns. Differences between non-minimal and the other two sets are statistically significant for most values of min_sup (details appear in the sub-figure headers), but no statistically significant difference was found between minimal patterns and all patterns. These results could explain, to some extent, a conclusion of the previous paper: most types of contrast patterns that include a minimal restriction also include a minimal support restriction.

In summary, using a minimal pattern restriction could be beneficial if a minimal support restriction is included. In those cases, the decrease in the mining time with no drop in accuracy can be particularly beneficial.

6. Conclusions

In this paper, we extend two previous papers that evaluate, theoretically and experimentally, several of the most important restrictions used in associative classifiers. We conducted experiments to evaluate the impact of using those restrictions on the accuracy of three different associative classifiers. This section presents the conclusions we extracted and future work.

First, we evaluated restrictions in the Contrast intuition, which are formed by a quality measure and a threshold value. We grouped quality measures according to their similarities, selecting one per group. For each selected measure, we found three candidate threshold values using the histogram of its values. Experimental results show that the restriction based on growth rate achieves the best results for all tested classifiers, so we selected the restriction $gr(P)\geqslant 5$ to be included in the remaining experiments where a contrast restriction is needed.

Second, we evaluated a particular contrast restriction used in many papers, the jumping restriction. This restriction forces the contrast patterns to have zero support in all but one class, so it should be particularly affected by the level of noise in the dataset. We therefore introduced increasing noise levels, comparing the accuracy obtained with jumping patterns against the accuracy obtained with the whole pattern collection. We find that the accuracy of all classifiers significantly deteriorates. We also provide a candidate explanation for that accuracy drop, based on the drop in the number of patterns.

Third, we tested the impact of using the minimal support threshold, finding that for classifiers CAEP and BCEP it usually deteriorates the result quality. For these classifiers, the restriction should therefore be added only to reduce the classification time and the memory consumption of the model. For iCAEP, a minimal support threshold might be necessary to improve the classification accuracy. Consistent with those results, the accuracy comparison among the three classifiers depends on the minimal support threshold used: for lower support thresholds CAEP is better, while for larger thresholds iCAEP is better. This influence of the minimal support threshold on the quality of the results deserves more attention in papers where similar comparisons are performed, because it could be biasing the conclusions.

Finally, we tested the most commonly used restriction in the minimal intuition. We find that minimal patterns are important, because removing them deteriorates the classifier accuracy in almost all configurations. Additionally, we found that using minimal patterns is similar to using all the patterns for higher minimal support values. Since mining minimal patterns could be significantly faster than mining all the patterns, using this restriction could make sense in most cases.

This paper opens some interesting lines of research. First, more experiments are necessary to evaluate other existing restrictions, some of them perhaps from new intuitions. Second, combinations of restrictions not yet explored could be tested in order to find new and better types of contrast patterns. Finally, experimental evaluations of associative classifiers could be performed in order to find the relations between their behavior and the restrictions used.

References

1. Abdelhamid and Thabtah, Associative classification approaches: Review and comparison, Journal of Information & Knowledge Management 13(3) (2014), 1–30.

2. Abrar, Tze and Sim, Effects of pruning on accuracy in associative classification, Journal of Informatics and Mathematical Sciences 9(4) (2017), 1047–1051.

3. Acosta-Mendoza, Morales-González, Gago-Alonso, García-Reyes E.B. and Medina-Pagola J.E., Image classification using frequent approximate subgraphs, in: Alvarez, Mejail, Gomez and Jacobo, editors, Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, Berlin, Heidelberg, Springer Berlin Heidelberg, 2012, pp. 292–299.

4. Agrawal and Srikant, Fast algorithms for mining association rules, In Proc. 20th Int. Conf. Very Large Data Bases-VLDB, 1994, pp. 487–499.

5. Ali, Comparative study of discretization methods on the performance of associative classifiers, In International Frontiers of Information Technology, 2016, pp. 0–5.

6. Ali, Shahzad and Shahzad S.K., A review on comparative performance analysis of associative classifiers, International Journal of Advanced and Applied Sciences 4(6) (2017), 96–103.

7. Atzmueller and Lemmerich, Fast subgroup discovery for continuous target concepts, In ISMIS 2008, 2009, pp. 35–44.

8. Bache and Lichman, UCI Machine Learning Repository, 2013.

9. Bailey, Manoukian and Ramamohanarao, Classification using constrained Emerging Patterns, In AIM 2003, 2003, pp. 226–237.

10. Bay S.D. and Pazzani M.J., Detecting change in categorical data: mining contrast sets, In Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’99, New York, NY, USA, ACM, 1999, pp. 302–306.

11. Bay S.D. and Pazzani M.J., Detecting group differences: Mining contrast sets, Data Mining and Knowledge Discovery 5(3) (2001), 213–246.

12. David, Computational music analysis, 2016.

13. Demšar, Statistical comparisons of classifiers over multiple data sets, J Mach Learn Res 7 (Dec. 2006), 1–30.

14. Dong and Bailey, Contrast Data Mining: Concepts, Algorithms, and Applications, Taylor & Francis, 2013.

15. Dong and Li, Efficient mining of emerging patterns: discovering trends and differences, In Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’99, New York, NY, USA, ACM, 1999, pp. 43–52.

16. Dong, Zhang, Wong and Li, CAEP: Classification by Aggregating Emerging Patterns, In Arikawa and Furukawa, editors, Discovery Science, volume 1721 of Lecture Notes in Computer Science, Springer Berlin/Heidelberg, 1999, p. 737.

17. Fan and Ramamohanarao, An efficient single-scan algorithm for mining essential jumping emerging patterns for classification, In 6th Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD2002), Taipei, Taiwan, China, 2002, pp. 456–462.

18. Fan and Ramamohanarao, A Bayesian Approach to Use Emerging Patterns for Classification, 2003.

19. Fan and Ramamohanarao, Efficiently Mining Interesting Emerging Patterns, In WAIM 2003, 2003, pp. 189–201.

20. Fan and Ramamohanarao, Fast discovery and the generalization of strong jumping emerging patterns for building compact and accurate classifiers, IEEE Transactions on Knowledge and Data Engineering 18(6) (2006), 721–737.

21. Fayyad U.M. and Irani K.B., On the handling of continuous-valued attributes in decision tree generation, Mach Learn 8(1) (Jan. 1992), 87–102.

22. Gamberger and Lavrač, Expert-guided subgroups discovery: methodology and applications, Journal of Artificial Intelligence Research 17 (2002), 501–527.

23. García-Borroto, Loyola-González, Martínez-Trinidad J.F. and Carrasco-Ochoa J.A., Evaluation of quality measures for contrast patterns by using unseen objects, Expert Syst Appl 83(C) (Oct. 2017), 104–113.

24. García-Borroto, A Restriction-Based Approach to Generalizations, in: Hernández Heredia, Milián Núñez and Ruiz Shulcloper, editors, Progress in Artificial Intelligence and Pattern Recognition, Cham, 2018. Springer International Publishing, pp. 239–246.

25. García-Borroto, Loyola-González, Martínez-Trinidad J.F. and Carrasco-Ochoa J.A., Comparing Quality Measures for Contrast Pattern Classifiers, volume 8258 LNCS. 2013.

26. García-Borroto, Loyola-González, Martínez-Trinidad J.F. and Carrasco-Ochoa J.A., Evaluation of quality measures for contrast patterns by using unseen objects, Expert Systems with Applications 83 (2017), 104–113.

27. García-Borroto, Martínez-Trinidad J.F. and Carrasco-Ochoa J.A., Fuzzy emerging patterns for classifying hard domains, Knowledge and Information Systems 28(2) (2011), 473–489.

28. García-Borroto, Martínez-Trinidad J.F. and Carrasco-Ochoa J.A., Finding the best diversity generation procedures for mining contrast patterns, Expert Systems with Applications 42(11) (2015), 4859–4866.

29. García-Vicó, Montes, Aguilera, Carmona and del Jesús, Analysing Concentrating Photovoltaics Technology Through the Use of Emerging Pattern Mining, In International Joint Conference SOCO’16-CISIS’16-ICEUTE’16, volume Advances i, 2017, pp. 334–344.

30. González-Médez, Martín-Rodríguez and García-Borroto, Evaluating Restrictions in Pattern Based Classifiers, in: Nyström, Hernández Heredia and Milián Núñez, editors, Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, Cham, 2019. Springer International Publishing, pp. 439–448.

31. Zhao, Liu and Wang, Conditional discriminative pattern mining: concepts and algorithms, Information Sciences, 2017.

32. Hernández-León, Fco, C.-O.J.M.-T.J., Hernández-Palancar and Hern, CAR-NF: A Classifier based on Specific Rules with High Netconf, Intelligent Data Analysis 16(1) (2012), 150–158.

33. Hilderman R.J. and Peckham, Statistical methodologies for mining potentially interesting contrast sets, In Studies in Computational Intelligence (SCI), volume 43, Springer-Verlag, 2007, pp. 153–177.

34. Huang and Ling C.X., Using auc and accuracy in evaluating learning algorithms, IEEE Transactions on Knowledge and Data Engineering 17(3) (2005), 299–310.

35. Kralj, Lavrač, Gamberger and Krstačić, Contrast set mining for distinguishing between similar diseases, In Artificial Intelligence in Medicine, 2007, pp. 109–118.

36. Lavrac and Flach, Subgroup discovery with CN2-SD, Journal of Machine Learning Research 5 (2004), 153–188.

37. Wong, Feng and Tan Y.-P., Relative Risk and Odds Ratio: A Data Mining Perspective, In PODS 2005, 2005, pp. 368–377.

38. Liu, Hsu and Ma, Integrating classification and association rule mining, In KDD 1998, 1998.

39. Liu, Wang and He, Discriminative pattern mining and its applications in bioinformatics, Briefings in Bioinformatics 16(16) (2015), 884–900.

40. Loyola-González, García-Borroto, Martínez-Trinidad J.F. and Carrasco-Ochoa J.A., An empirical comparison among quality measures for pattern based classifiers, Intelligent Data Analysis 18(6) (Jan 2014), S5–S17.

41. Loyola-González, Medina-Pérez M.A., Martínez-Trinidad J.F., Carrasco-Ochoa J.A., Monroy and García-Borroto, PBC4cip: A new contrast pattern-based classifier for class imbalance problems, Knowledge-Based Systems 115 (2017), 100–109.

42. Mitchell T.M., Generalization as Search, Artificial Intelligence 18 (1982), 203–226.

43. Mittal, Aggarwal and Mahajan, Efficient pruning methods for obtaining compact associative classifiers with enhanced classification accuracy rate, in: Gani A.B., Das P.K., Kharb and Chahal, editors, Information, Communication and Computing Technology, Singapore, Springer Singapore, 2019, pp. 294–311.

44. Nofong V.M., Mining productive emerging patterns and their application in trend prediction, In 13-th Australasian Data Mining Conference (AusDM 2015), 2015, pp. 109–117.

45. Norambuena B.K. and Villegas C.M., An extension to association rules using a similarity-based approach in semantic vector spaces, Intell Data Anal 23 (2019), 587–607.

46. Novak P.K., Lavrač and Webb G.I., Supervised descriptive rule discovery: A unifying survey of contrast set, emerging pattern and subgroup mining, Journal of Machine Learning Research 10 (Jun 2009), 377–403.

47. Quinlan J.R., Induction of decision trees, Machine Learning 1 (1986), 81–106.

48. Rajesh, Prediction of chronic kidney disease using weighted associative classifier (wac), IJRAR-International Journal of Research and Analytical Reviews (IJRAR) 6(2) (2019), 149–151.

49. Ramamohanarao, Bailey and Fan, Efficient Mining of Contrast Patterns and Their Applications to Classification, 2005, pp. 1–9.

50. Siddique Ibrahim S.P. and Sivabalakrishnan, An Evolutionary Memetic Weighted Associative Classification Algorithm for Heart Disease Prediction, Springer Singapore, Singapore, 2020, pp. 183–199.

51. Sucahyo Y.G. and Gopalan R.P., Building a More Accurate Classifier Based on Strong Frequent Patterns, in: Webb and Yu, editors, LNAI 3339, Springer-Verlag, 2004, pp. 1036–1042.

52. Tamrakar S.I., Tamrakar and Sp S.I., Comparative study of different lazy learning associative classification methods, Procedia Computer Science 165(2019) (2020), 370–376.

53. Wróbel Ł., Sikora and Michalak, Rule quality measures settings in classification, regression and survival rule induction – an empirical approach, Fundamenta Informaticae 149 (July 2015), 419–449, 2016.

54. Wrobel, An algorithm for multi-relational discovery of subgroups, In 1st European Conference on Principles of Data Mining and Knowledge Discovery, 1997, pp. 78–87.

55. Yin and Han, Cpar: Classification based on predictive association rules, in: Barbar and Kamath, editors, Proceedings of the SIAM Int. Conf. on Data Mining, SIAM, 2003, pp. 331–335.

56. Zhang, Dong and Ramamohanarao, Information-based classification by aggregating emerging patterns, 2010, pp. 48–53.

57. Zhang X.X., Dong and Ramamohanarao, Exploring Constraints to Efficiently Mine Emerging Patterns from Large High-dimensional Datasets, KDD 2000, 2000, pp. 310–314.

1 Some contrast pattern miners require that all the features are discrete.

3 We omit here the original references because they are irrelevant for this paper. Readers can find them in [23].


Table 2 Dataset description according to number of instances, number of features, number of classes, minimal support used to mine frequent items, and number of frequent patterns mined. Since cross-validation was used, the last column is an average between the 10 folds
