Mining class association rules on imbalanced class datasets

Abstract

Existing class association rule mining algorithms often struggle to discover good rule sets from imbalanced class datasets, because the rules they generate tend to belong to the dominant classes. For example, in medical applications, some symptoms of illness are rare, yet doctors are very interested in the rules associated with them. This paper proposes a novel approach for mining class association rules (CARs) in imbalanced class datasets. First, assuming there are n given classes, the training dataset is split into n corresponding groups. Each group is clustered by the k-means algorithm into k clusters, where the value of k is equal to the number of records in the smallest group. Second, we combine all records from the groups after clustering and use the CAR-Miner-Diff algorithm to mine all CARs. We also propose an iterative method to obtain a highly accurate classifier. Experiments show that the proposed approach outperforms existing algorithms while maintaining a large number of useful rules in the classifier.

Keywords

Class association rules, associative classification, imbalanced class dataset, clustering, data mining

1 Introduction

Classification is one of the most important topics in data mining and knowledge discovery. Rule-based classification is a kind of classification that offers high accuracy and easy interpretation. Association rule mining (ARM) is the process of generating all rules in a dataset whose supports satisfy a minimum support threshold (minSup) and whose confidences satisfy a minimum confidence threshold (minConf). Classification rule mining (CRM) aims to mine a small set of association rules whose consequents each contain a value of the class attribute and which together form an accurate classifier.

Many methods for rule-based classification have been proposed in recent years, such as CART [4], decision trees [25, 26], ILA [31, 32], associative classification (AC) [2, 36], weighted CARs [1, 23], imbalanced classification [37], and mining CARs with imbalanced attributes [40]. Among them, associative classification, which integrates association rule mining and classification [9, 29], is an efficient classification approach that has attracted a lot of research interest. Its purpose is to mine a subset of association rules (denoted by CARs) whose right-hand sides are restricted to values of the class attribute (or target attribute). Currently, there are many methods in associative classification, including predictive association rules [38], multiple association rules [11], associations [7, 14], multi-class, multi-label associative classification [30], multi-class classification based on association rules [13], a greedy method to build classifiers [29], class association rule mining based on equivalence classes [17, 33], mining class association rules in updated datasets [18, 20], and mining fuzzy class association rules in big data [27].

Although many methods have been proposed to improve the accuracy of classifiers based on class association rules, most of them use the mined CARs directly to build classifiers. One of the challenges of classification based on association rules is how to determine the value of minSup to achieve good performance. If the value of minSup is too high, a class with a small number of samples contains only infrequent itemsets and cannot generate any rules, which harms the prediction of new cases. Otherwise, the number of rules for that class is very small, which also affects the class prediction stage. The problem becomes more challenging on imbalanced class datasets.

To address this challenge, we propose a method for mining CARs on imbalanced class datasets. First, we divide the dataset containing m classes into m sub-datasets, where the i-th sub-dataset contains the samples that belong to the i-th class. Then, for each sub-dataset with more than k records (k is determined from the number of records in the smallest sub-dataset), we use the k-means algorithm to cluster it into k groups and select one representative sample from each group. Thus, the number of records in each such sub-dataset is k. Finally, we combine the m sub-datasets and use the CAR-Miner-Diff algorithm to mine CARs.

2 Preliminaries for class association rule mining

In this section, we briefly introduce the definition of the associative classification problem in data mining.

Let D be a training dataset with n attributes A₁, A₂, …, A_n and a class attribute C that takes its values from the set C = {c₁, c₂, …, c_k}, where k ≤ |D|. For 1 ≤ i ≤ k, each value c_i ∈ C represents a different class in D.

For example, Table 1 shows a training dataset D with 8 records, in which the attributes are {A, B, C}, Class denotes the class attribute, and Class = {0, 1} (i.e., there are 2 classes).

Table 1
An example of training dataset

OID	A	B	C	Class
1	a1	b1	c1	0
2	a1	b2	c1	1
3	a2	b2	c1	1
4	a3	b3	c1	0
5	a3	b1	c2	1
6	a3	b3	c1	0
7	a1	b3	c2	0
8	a2	b2	c2	1

Definition 1. An itemset is a set of pairs (a, v) where a is an attribute from A, and v is its value. Any attribute from A only appears at most once in the itemset.

For example, Table 1 yields many itemsets, such as {(A, a3)} and {(B, b3), (C, c1)}.

Definition 2. A class association rule r is in the form of X⟶ c, where X is an itemset and c is a class.

Definition 3. The support of an itemset X, denoted by Supp(X), is the number of records in D containing X.

Definition 4. The support of a class association rule r = X⟶ c_i, denoted by Supp(r), is the number of records in D containing X and (C, c_i).

Definition 5. The confidence of a class association rule r = X⟶ c_i, denoted by Conf(r), is defined as: $Conf (r) = \frac{Supp (r)}{Supp (X)} .$

Next, using the dataset in Table 1 as an example, we briefly illustrate these measures for a candidate rule. Let r: (A, a1) ⟶ 0, with X = {(A, a1)} and c_i = 0. We have Supp(X) = 3 and Supp(r) = 2, because there are three records with A = a1 and two of them have class 0, so $Conf (r) = \frac{Supp (r)}{Supp (X)} = \frac{2}{3}$.

In addition, we use the list of identifiers of the records that contain an itemset, denoted as Obidset, to calculate the support of candidate rule items.
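Definitions 3 to 5 and the Obidset idea can be checked on Table 1 with a few lines of Python. This is only a minimal sketch: the function names (`obidset`, `supp_itemset`, `supp_rule`, `conf_rule`) and the dictionary layout are assumptions made for illustration and are not taken from any of the cited algorithms.

```python
from fractions import Fraction

# Table 1: OID -> (attribute values, class label)
TABLE1 = {
    1: ({"A": "a1", "B": "b1", "C": "c1"}, 0),
    2: ({"A": "a1", "B": "b2", "C": "c1"}, 1),
    3: ({"A": "a2", "B": "b2", "C": "c1"}, 1),
    4: ({"A": "a3", "B": "b3", "C": "c1"}, 0),
    5: ({"A": "a3", "B": "b1", "C": "c2"}, 1),
    6: ({"A": "a3", "B": "b3", "C": "c1"}, 0),
    7: ({"A": "a1", "B": "b3", "C": "c2"}, 0),
    8: ({"A": "a2", "B": "b2", "C": "c2"}, 1),
}


def obidset(data, itemset, cls=None):
    """OIDs of the records that contain every (attribute, value) pair of
    `itemset` and, if `cls` is given, also carry that class label."""
    return {
        oid
        for oid, (values, label) in data.items()
        if all(values[a] == v for a, v in itemset)
        and (cls is None or label == cls)
    }


def supp_itemset(data, itemset):
    """Definition 3: number of records containing the itemset."""
    return len(obidset(data, itemset))


def supp_rule(data, itemset, cls):
    """Definition 4: number of records containing the itemset and the class."""
    return len(obidset(data, itemset, cls))


def conf_rule(data, itemset, cls):
    """Definition 5: Supp(r) / Supp(X), kept exact as a fraction."""
    return Fraction(supp_rule(data, itemset, cls), supp_itemset(data, itemset))


X = [("A", "a1")]
print(sorted(obidset(TABLE1, X)))   # [1, 2, 7]
print(conf_rule(TABLE1, X, 0))      # 2/3
```

The Obidset of X = {(A, a1)} is {1, 2, 7}, so Supp(X) = 3; restricting it to class 0 leaves {1, 7}, which gives Supp(r) = 2 and Conf(r) = 2/3, as in the worked example.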

3 Related work

3.1 Mining class association rules

CAR mining was first introduced by Liu et al. [12]. They proposed CBA-RG (the rule generator of CBA) for mining CARs and the CBA-CB algorithm for building a classifier from the mined CARs. CBA-CB uses a heuristic to select the strongest rules to form a classifier. CBA as a whole includes two phases: (1) the first phase generates rules, where all CARs are mined using CBA-RG, an extension of the Apriori algorithm; and (2) the second phase builds a classifier through the following steps. First, the set of generated rules is sorted in descending order of precedence. Then, rules are selected from the CARs in the sorted order to form the classifier. Finally, the rules that do not improve the accuracy of the classifier are discarded. The authors in [11] proposed the CMAR (Classification based on Multiple Association Rules) algorithm, an FP-tree-based approach, for mining CARs. To predict a new case r, CMAR finds the set of rules R in the mined rule set whose conditions are satisfied by r and divides the rules in R into m groups corresponding to the m classes in R. A weighted χ² is calculated for each group, and the class with the highest weighted χ² is assigned to the record. This approach prunes redundant rules after mining CARs. In 2004, the MMAC method was proposed [30]; it uses multi-class, multi-label associative classification to mine CARs. Vo and Le proposed the ECR-CARM (Equivalence Class Rule – Class Association Rule Mining) algorithm [33] for fast mining of CARs with one dataset scan. They built the ECR-tree to store all frequent itemsets and, based on this tree, proposed an algorithm to mine CARs. Nguyen et al. proposed the CAR-Miner algorithm to mine CARs [22]. Each value is stored in one node of the tree to save the time of checking supersets of itemsets. The authors also modified the ECR-tree [33] into the MECR-tree and, based on this tree, developed theorems to prune infrequent itemsets as early as possible and to avoid joining two itemsets that have the same attributes. Nguyen et al. also proposed CAR-Miner-Diff [17], which saves time by replacing the intersection of two Obidsets with the difference between two Obidsets. In addition, mining CARs with constraints [19, 21] and mining CARs from updated datasets [18, 20] have been proposed in recent years.
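The set identity that makes such a replacement possible can be illustrated on Table 1. The sketch below shows only the identity |Obidset(X) ∩ Obidset(Y)| = |Obidset(X)| − |Obidset(X) \ Obidset(Y)|; it does not reproduce CAR-Miner-Diff itself, whose tree traversal and pruning are described in [17]. The function names are assumptions chosen for illustration.

```python
def support_by_intersection(obid_x, obid_y):
    """Support of the joined itemset, computed from the intersection."""
    return len(obid_x & obid_y)


def support_by_difference(obid_x, obid_y):
    """Same support, computed from the (often smaller) difference set."""
    diff = obid_x - obid_y          # records that contain X but not Y
    return len(obid_x) - len(diff)


# Obidsets read off Table 1
obid_a3 = {4, 5, 6}   # records with A = a3
obid_b3 = {4, 6, 7}   # records with B = b3

print(support_by_intersection(obid_a3, obid_b3))  # 2
print(support_by_difference(obid_a3, obid_b3))    # 2
```

Here the difference set is {5}, a single identifier, whereas the intersection {4, 6} holds two; on dense data the difference sets tend to be much smaller than the intersections, which is where the time saving is expected to come from.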

3.2 Clustering

The k-means algorithm [15], proposed by MacQueen in 1967, is one of the simplest and best-known unsupervised learning algorithms [34]. At the beginning of the algorithm, k initial centroids are chosen, where k is given by the user. Then, each point of the dataset is assigned to its nearest centroid using a particular proximity measure (e.g., Euclidean distance). Next, the centroid of each cluster is recomputed as the mean of all data points in that cluster. These two steps are repeated until the centroids no longer change. k-means can be seen as a greedy algorithm. It was listed among the top ten algorithms in data mining (nominated by prestigious data mining researchers at IEEE ICDM 2006) [35]. The convergence of k-means and its proof were presented in [15].
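The assignment and update steps described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the initial centroids are drawn at random from the data points with a fixed `seed`, the loop stops when the assignments (and therefore the centroids) stop changing, and an empty cluster simply keeps its previous centroid. None of these choices is prescribed by the text.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Return (centroids, assignment), where assignment[i] is the index of
    the cluster that points[i] belongs to."""
    rng = random.Random(seed)
    centroids = [tuple(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point moved, so the centroids would not change either
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its cluster.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignment = kmeans(points, k=2)
print(assignment)
```

On this toy input the two well-separated groups end up in different clusters; which group receives index 0 depends on the random initialization.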

Instead of using the mean point as the center of a cluster, Kaufman and Rousseeuw proposed PAM (Partitioning Around Medoids) [10], a k-medoids-based clustering algorithm that uses an actual point in the cluster to represent it and is therefore less susceptible to noise and outliers than k-means. PAM minimizes the AEC (Absolute Error Criterion) rather than the SSE (Sum of Squared Errors). Like k-means, PAM alternates between data assignment and medoid update steps until each representative object is the medoid of its cluster. This algorithm works well with small or medium datasets but has performance problems on large datasets. A more powerful approach based on the k-medoids algorithm, also proposed by Kaufman and Rousseeuw, is CLARA. The strength of CLARA is that it can deal with larger datasets than PAM can. However, it still has some weaknesses because its efficiency depends on the sample size. The CLARANS algorithm (A Clustering Algorithm based on Randomized Search) [16] was proposed for clustering objects in spatial data mining; it draws samples of neighbors dynamically. This technique can be viewed as searching a graph in which each node is a potential solution, that is, a set of k medoids. It is more efficient and scalable than both PAM and CLARA. BIRCH [39] is a hierarchical clustering algorithm that uses a CF-Tree data structure to store cluster summaries. It builds the tree as it encounters the data points and inserts them based on one or more distance metrics. BIRCH can incrementally and dynamically cluster incoming objects with a single data scan. In addition, the CURE algorithm [28] is robust to outliers and identifies clusters with non-spherical shapes and wide variances in size. The distance between clusters is calculated using a group of representative points. The algorithm chooses a number c of well-scattered points from a cluster and shrinks them toward the centroid of the cluster by a predetermined fraction. An improved k-medoids clustering algorithm [5] uses a CF-Tree based on the clustering features of the BIRCH algorithm. This approach preserves all training samples in the tree structure and then uses k-medoids to cluster the leaf nodes of the tree. It then scans from the root of the CF-Tree to obtain k clusters.

4 Mining class association rules based on clustering

4.1 AC-Cluster algorithm

Our proposed approach has three steps. In step 1, the training dataset is split into m sub-datasets corresponding to the m classes. In step 2, for each sub-dataset that has more than k records (k is determined from the number of records in the smallest sub-dataset), the k-means algorithm is used to cluster its records into k groups, and our algorithm selects from each group one representative record, called the kernel record of the group. Each such sub-dataset is therefore reduced to k records. In step 3, CARs are mined from the reduced sub-datasets.

For the clustering in step 2, algorithms such as k-means or k-medoids can be used. In our experiments, the k-means clustering algorithm was used because of its effectiveness. The Euclidean distance is used to measure the distance between two objects. The record closest to the center of each cluster is selected as its representative record.

In step 3, any class association rule mining algorithm, such as CBA [12], CMAR [11], MMAC [30], or CPAR [38], can be used to mine CARs.
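The splitting, clustering and representative-selection steps can be sketched as below. This is a hedged illustration rather than the authors' implementation: the names `balance_by_clustering` and `kmeans` are hypothetical, the compact k-means inside is a generic one with random initialization, and taking k as `round(c * size of the smallest class)` is an assumption, since the text only states that k is derived from the smallest sub-dataset.

```python
import math
import random
from collections import defaultdict


def kmeans(points, k, rng, max_iter=100):
    """Generic k-means; returns centroids and the cluster index of each point."""
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels


def balance_by_clustering(records, c=1.0, seed=0):
    """records: list of (feature_tuple, class_label).
    Classes with more than k records are reduced to one representative
    per cluster; smaller classes are kept unchanged."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for features, label in records:            # step 1: split by class
        by_class[label].append(features)
    # Assumption: k = c * (size of the smallest class), rounded.
    k = max(1, round(c * min(len(v) for v in by_class.values())))
    balanced = []
    for label, points in by_class.items():
        if len(points) <= k:
            balanced.extend((p, label) for p in points)
            continue
        centroids, labels = kmeans(points, k, rng)   # step 2: cluster
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                # representative = actual record closest to the centroid
                rep = min(members, key=lambda p: math.dist(p, centroids[j]))
                balanced.append((rep, label))
    return balanced


# Table 2: (A, B, C, D), CLASS
TABLE2 = [((7, 5, 7, 6), 0), ((6, 3, 6, 5), 0), ((7, 4, 6, 6), 0),
          ((8, 8, 7, 6), 0), ((8, 5, 7, 5), 0),
          ((4, 3, 4, 4), 1), ((6, 4, 6, 5), 1)]
balanced = balance_by_clustering(TABLE2, c=1.0)
print(balanced)
```

With c = 1 the five class-0 records are reduced to two representatives while the two class-1 records are kept, giving four records in total. Which class-0 records are chosen depends on the k-means initialization, so a particular run need not select the same representatives as another.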

4.2 An illustration of AC-Cluster

Following is an illustration for the steps of the algorithm in Fig. 1.

Fig.1

The steps for mining class association rules based on clustering.

Step 1.1: First, we divide the dataset in Table 2 into 2 sub-datasets corresponding to the two classes 0 and 1. Assume that c = 1. Because the sub-dataset containing class 1 is the smallest (2 records), k = 2 × c = 2.

Step 1.2: We use k-means to cluster the 5 records in the sub-dataset with class 0 into 2 clusters. After that, one representative record is chosen from each cluster, which gives the result in Table 3 (after reordering OIDs).

Table 2

An example of imbalanced dataset (5 records belong to class 0 and 2 records belong to class 1)

OID	A	B	C	D	CLASS
1	7	5	7	6	0
2	6	3	6	5	0
3	7	4	6	6	0
4	8	8	7	6	0
5	8	5	7	5	0
6	4	3	4	4	1
7	6	4	6	5	1

Table 3

Dataset D’ after clustering and reordering OIDs

OID’	A	B	C	D	CLASS
1	8	8	7	6	0
2	8	5	7	5	0
3	4	3	4	4	1
4	6	4	6	5	1

Step 1.3: We use CAR-Miner-Diff [17] to mine CARs with minSup = 25% and minConf = 60% from the dataset in Table 3; the results are shown in Table 4.

Table 4

Class association rules from dataset D’

ID	Generated Rules	Support (%)	Confidence (%)
R1	If A = 4 then class = 1	25	100
R2	If A = 4 and B = 3 then class = 1	25	100
R3	If A = 4 and B = 3 and C = 4 then class = 1	25	100
R4	If A = 4 and B = 3 and C = 4 and D = 4 then class = 1	25	100
R5	If A = 4 and B = 3 and D = 4 then class = 1	25	100
R6	If A = 4 and C = 4 then class = 1	25	100
R7	If A = 4 and C = 4 and D = 4 then class = 1	25	100
R8	If A = 4 and D = 4 then class = 1	25	100
R9	If A = 6 then class = 1	25	100
R10	If A = 6 and B = 4 then class = 1	25	100
R11	If A = 6 and B = 4 and C = 6 then class = 1	25	100
R12	If A = 6 and B = 4 and C = 6 and D = 5 then class = 1	25	100
R13	If A = 6 and B = 4 and D = 5 then class = 1	25	100
R14	If A = 6 and C = 6 then class = 1	25	100
R15	If A = 6 and C = 6 and D = 5 then class = 1	25	100
R16	If A = 6 and D = 5 then class = 1	25	100
R17	If A = 8 then class = 0	50	100
R18	If A = 8 and B = 3 then class = 0	25	100
R19	If A = 8 and B = 3 and C = 7 then class = 0	25	100
R20	If A = 8 and B = 3 and C = 7 and D = 6 then class = 0	25	100
R21	If A = 8 and B = 3 and D = 6 then class = 0	25	100
R22	If A = 8 and B = 8 then class = 0	25	100
R23	If A = 8 and B = 8 and C = 7 then class = 0	25	100
R24	If A = 8 and B = 8 and C = 7 and D = 6 then class = 0	25	100
R25	If A = 8 and B = 8 and D = 6 then class = 0	25	100
R26	If A = 8 and C = 7 then class = 0	50	100
R27	If A = 8 and C = 7 and D = 5 then class = 0	25	100
R28	If A = 8 and C = 7 and D = 6 then class = 0	25	100
R29	If A = 8 and D = 4 then class = 0	25	100
R30	If A = 8 and D = 6 then class = 0	25	100
R31	If B = 3 and C = 4 then class = 1	25	100
R32	If B = 3 and C = 4 and D = 4 then class = 1	25	100
R33	If B = 3 and C = 7 then class = 0	25	100
R34	If B = 3 and C = 7 and D = 5 then class = 0	25	100
R35	If B = 3 and D = 4 then class = 1	25	100
R36	If B = 3 and D = 5 then class = 0	25	100
R37	If B = 4 then class = 1	25	100
R38	If B = 4 and C = 6 then class = 1	25	100
R39	If B = 4 and C = 6 and D = 5 then class = 1	25	100
R40	If B = 4 and D = 5 then class = 1	25	100
R41	If B = 8 then class = 0	25	100
R42	If B = 8 and C = 7 then class = 0	25	100
R43	If B = 8 and C = 7 and D = 6 then class = 0	25	100
R44	If B = 8 and D = 6 then class = 0	25	100
R45	If C = 4 then class = 1	25	100
R46	If C = 4 and D = 4 then class = 1	25	100
R47	If C = 6 then class = 1	25	100
R48	If C = 6 and D = 5 then class = 1	25	100
R49	If C = 7 then class = 0	50	100
R50	If C = 7 and D = 5 then class = 0	25	100
R51	If C = 7 and D = 6 then class = 0	25	100
R52	If D = 4 then class = 1	25	100
R53	If D = 6 then class = 0	25	100
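How support and confidence thresholds filter candidate rules on Table 3 can be checked with a brute-force enumeration. The sketch below is deliberately naive (it enumerates every itemset of every record, which is exponential in the number of attributes) and is not CAR-Miner-Diff; the function name `mine_cars` and the data layout are assumptions made for illustration.

```python
from itertools import combinations

ATTRS = ("A", "B", "C", "D")
TABLE3 = [  # (A, B, C, D), CLASS
    ((8, 8, 7, 6), 0),
    ((8, 5, 7, 5), 0),
    ((4, 3, 4, 4), 1),
    ((6, 4, 6, 5), 1),
]


def mine_cars(data, min_sup, min_conf):
    """Return {(itemset, class): (support, confidence)} for every rule whose
    relative support and confidence reach the given thresholds."""
    n = len(data)
    itemset_count = {}  # itemset -> number of records containing it
    rule_count = {}     # (itemset, class) -> number of matching records
    for values, label in data:
        items = tuple(zip(ATTRS, values))
        for size in range(1, len(items) + 1):
            for itemset in combinations(items, size):
                itemset_count[itemset] = itemset_count.get(itemset, 0) + 1
                key = (itemset, label)
                rule_count[key] = rule_count.get(key, 0) + 1
    rules = {}
    for (itemset, label), count in rule_count.items():
        support = count / n
        confidence = count / itemset_count[itemset]
        if support >= min_sup and confidence >= min_conf:
            rules[(itemset, label)] = (support, confidence)
    return rules


rules = mine_cars(TABLE3, min_sup=0.25, min_conf=0.60)
print(rules[((("A", 8),), 0)])            # (0.5, 1.0)
print(((("D", 5),), 0) in rules)          # False
```

The first lookup corresponds to R17 (A = 8 ⟶ class 0, support 50%, confidence 100%). The itemset {(D, 5)} occurs once in each class, so its confidence is only 50% and no rule is generated from it alone.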

5 An improved method

One of the weaknesses of k-means is that the clusters may differ between runs, which affects the accuracy of the resulting classifiers. To improve the accuracy, we propose an improved method that runs AC-Cluster n times on the training dataset D to create n classifiers and chooses the classifier with the highest accuracy on D.

Definition 6. (Rule with higher precedence) [14] Given two rules r₁ and r₂, r₁ is said to take precedence over r₂, denoted by r₁ ∝ r₂, if one of the following conditions is true:

1. Conf(r₁) > Conf(r₂).

2. Conf(r₁) = Conf(r₂) and Supp(r₁) > Supp(r₂).

3. Conf(r₁) = Conf(r₂), Supp(r₁) = Supp(r₂), and r₁ is generated earlier than r₂.

Definition 7. (Sub-rule) [33] Assume there are two rules r₁ and r₂, where $r_1 = \langle (A_{i_1}, a_{i_1}), \ldots, (A_{i_u}, a_{i_u}) \rangle \rightarrow c_1$ and $r_2 = \langle (B_{j_1}, b_{j_1}), \ldots, (B_{j_v}, b_{j_v}) \rangle \rightarrow c_2$. Rule r₁ is called a sub-rule of r₂ if the two following criteria are met:

1. u ≤ v.

2. $\forall k \in [1, u]: (A_{i_k}, a_{i_k}) \in \{ (B_{j_1}, b_{j_1}), \ldots, (B_{j_v}, b_{j_v}) \}$.

Definition 8. (Redundant rule) A rule r₁ in D is called a redundant rule if there is another rule r₂ in the set of CARs from D such that r₂ is a sub-rule of r₁, and r₂ ∝ r₁.
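Definitions 6 to 8 translate into a few comparisons. The following sketch is an illustration under stated assumptions: the `Rule` class and its `order` field (standing for the order in which rules are generated) are hypothetical, and ties in both confidence and support are broken by generation order, which is the usual reading of the third precedence condition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    items: frozenset   # set of (attribute, value) pairs on the left-hand side
    cls: int           # class on the right-hand side
    supp: int          # support as a record count
    conf: float        # confidence in [0, 1]
    order: int         # generation order (smaller = generated earlier)


def precedes(r1, r2):
    """Definition 6: does r1 take precedence over r2?"""
    if r1.conf != r2.conf:
        return r1.conf > r2.conf
    if r1.supp != r2.supp:
        return r1.supp > r2.supp
    return r1.order < r2.order


def is_sub_rule(r1, r2):
    """Definition 7: every item of r1 also appears in r2 (hence u <= v)."""
    return r1.items <= r2.items


def remove_redundant(rules):
    """Definition 8: drop each rule that has a sub-rule of higher precedence."""
    return [
        r for r in rules
        if not any(o is not r and is_sub_rule(o, r) and precedes(o, r)
                   for o in rules)
    ]


# Three rules from Table 4 (supports given as record counts out of 4)
r17 = Rule(frozenset({("A", 8)}), 0, 2, 1.0, 17)
r22 = Rule(frozenset({("A", 8), ("B", 8)}), 0, 1, 1.0, 22)
r26 = Rule(frozenset({("A", 8), ("C", 7)}), 0, 2, 1.0, 26)
kept = remove_redundant([r17, r22, r26])
print([r.order for r in kept])   # [17]
```

R17 is a sub-rule of both R22 and R26. It precedes R22 through higher support and R26 through earlier generation, so only R17 is kept.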

The details of the improved method are shown in Fig. 2.

Fig.2

An improved version of AC-Cluster.

Remarks for the improved version of AC-Cluster:

The value of n is set by the user: the larger n is, the higher the accuracy, but the longer the runtime. A suitable value of n should therefore be selected as a tradeoff between runtime and accuracy.

To build the classifier in Step 3, we can use CBA [12] or CMAR [11].
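The selection loop of the improved method can be sketched as follows, based only on the textual description (run AC-Cluster n times and keep the classifier that is most accurate on the training data). The callables `build_classifier` and `accuracy` are hypothetical placeholders for one AC-Cluster run and for an accuracy measurement; they are assumptions of this sketch, not functions defined in the paper.

```python
def improved_ac_cluster(train, n, build_classifier, accuracy):
    """Run the (randomized) classifier construction n times and return the
    classifier with the highest accuracy on `train`, with that accuracy.

    build_classifier(train, seed) -> classifier   (hypothetical callable)
    accuracy(classifier, train)   -> float        (hypothetical callable)
    """
    best, best_acc = None, -1.0
    for run in range(n):
        clf = build_classifier(train, seed=run)  # a different clustering per run
        acc = accuracy(clf, train)
        if acc > best_acc:                       # ties keep the earlier run
            best, best_acc = clf, acc
    return best, best_acc
```

The cost grows linearly with n because every run repeats the clustering and the mining, which is the runtime and accuracy tradeoff noted in the remarks.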

6 Experimental studies

The algorithms were implemented in C# 2012 on a PC with an Intel Core i3-350 2.26 GHz CPU, 4 GB of RAM, and a 320 GB disk. The experimental datasets were downloaded from http://mlearn.ics.uci.edu and their characteristics are shown in Table 5.

Table 5
Experimental datasets

Dataset	#Attrs	#Classes	#Records	Details
Breast	10	2	699	Class 0:458 (65.5%) Class 1:241 (34.5%)
Iono	34	2	351	Class 0:225 (64.1%) Class 1:126 (35.9%)
Iris	4	3	150	Class 0:50 (33.33%) Class 1:50 (33.33%) Class 2:50 (33.33%)
Zoo	16	7	101	Class 0:8 (7.92%) Class 1:41 (40.6%) Class 2:20 (19.8%) Class 3:13 (12.87%)
				Class 4:5 (4.95%) Class 5:10 (9.9%) Class 6:4 (3.96%)

Breast and Iono have two classes, and class 0 is nearly twice as large as class 1. Iris has three classes and is a balanced dataset (we use it to show that AC-Cluster does not affect balanced class datasets). Zoo has 7 classes; its smallest class accounts for 3.96% of the records while its largest class accounts for 40.6%. Breast, Iono, and Zoo are imbalanced class datasets.

We choose these datasets for our experiments because of the following reasons:

We want to have both imbalanced datasets (Breast, Iono and Zoo) and a balanced dataset (Iris) in our experiments.

Among the imbalanced datasets, we want to have both two-class datasets (Breast, Iono) and a multi-class dataset (Zoo with 7 classes).

We also want to cover datasets in which the ratios among classes are small (such as Breast and Iono) and large (such as Zoo).

6.1 Results

Experimental results were evaluated based on the four datasets from Table 5. Classifiers are built based on mined CARs using CAR-Miner-Diff and Improved-AC-Cluster with c = 1 and 1.2. We choose CAR-Miner-Diff to mine CARs because it is a state-of-the-art algorithm for mining CARs.

We compare the accuracy of the two algorithms with a fixed minConf = 60% for all datasets and n is set to 4. Algorithms are tested using 10-fold cross-validation.

Figures 3 to 6 compare the accuracy of CAR-Miner-Diff and Improved-AC-Cluster (n = 4).

Fig.3

Comparison of the accuracy between Improved-AC-Cluster and CAR-Miner-Diff in Breast dataset.

Figure 3 presents the comparison between Improved-AC-Cluster and CAR-Miner-Diff in Breast dataset.

We can see that Improved-AC-Cluster has a higher accuracy than CAR-Miner-Diff. For example, with minSup = 10%, CAR-Miner-Diff has the accuracy of 78.25% while Improved-AC-Cluster with c = 1.0 has the accuracy of 90.84%.

As minSup decreases, the accuracy of CAR-Miner-Diff increases, but the accuracy of Improved-AC-Cluster is always better. The highest accuracy of CAR-Miner-Diff is 87.68% at minSup = 1% while the lowest accuracy of Improved-AC-Cluster with c = 1.0 is 90.06% at minSup = 1%.

When we change c from 1.0 to 1.2, the accuracy of Improved-AC-Cluster changes only slightly.

One of the strong points of Improved-AC-Cluster is that it can achieve a high accuracy at a high minSup. Improved-AC-Cluster improves the performance of both steps of classification: (1) building the classifier and (2) prediction.

Figure 4 presents the comparison between Improved-AC-Cluster and CAR-Miner-Diff in Iono dataset. Improved-AC-Cluster also has a higher accuracy than CAR-Miner-Diff. The results are similar to those on the Breast dataset.

Fig.4

Comparison of the accuracy between Improved-AC-Cluster and CAR-Miner-Diff in Iono dataset.

Figure 5 compares the accuracy of Improved-AC-Cluster and CAR-Miner-Diff in Iris, a balanced class dataset. There is no difference in accuracy between Improved-AC-Cluster and CAR-Miner-Diff.

Fig.5

Comparison of the accuracy between Improved-AC-Cluster and CAR-Miner-Diff in Iris dataset.

Figure 6 shows the accuracy results in Zoo dataset. We can analyze the results as follows:

Fig.6

Comparison of the accuracy between Improved-AC-Cluster and CAR-Miner-Diff in Zoo dataset.

Improved-AC-Cluster is significantly better than CAR-Miner-Diff. For example, with minSup = 10%, CAR-Miner-Diff has the accuracy of 72% while Improved-AC-Cluster with c = 1.0 has the accuracy of 78% and with c = 1.2 it has the accuracy of 78.4%.

As minSup decreases, the accuracy of CAR-Miner-Diff increases, but the accuracy of Improved-AC-Cluster is always better. The highest accuracy of CAR-Miner-Diff is 76% at minSup = 4% while the lowest accuracy of Improved-AC-Cluster with c = 1.0 is 78% at minSup = 10%.

When we change c from 1.0 to 1.2, the accuracy of Improved-AC-Cluster improves more than on Breast and Iono. This result suggests that when the ratio between the majority class and the minority class is large, increasing c increases the accuracy.

7 Conclusions and future work

In this paper, we proposed a new method for classification based on association rules in imbalanced class datasets. The proposed method uses clustering to balance the dataset and CAR-Miner-Diff to mine CARs in the balanced dataset. We also proposed an improved method to increase the accuracy of the classifier. Experimental results showed that our method has a higher accuracy than the method that does not use clustering.

One of the weaknesses of our method is that it removes many records when the number of records in the smallest sub-dataset is small (and c is small). In the future, we will extend our method to choose a suitable value of c. We will also try other clustering methods on imbalanced datasets. In addition, this method will be combined with other classification methods such as decision trees, ILA, and SVM. We will also study how to fill in data [24] for incomplete datasets to achieve better classification.

References

1. Alwidian, Hammo B.H. and Obeid, WCBA: Weighted classification based on association rules algorithm for breast cancer disease, Applied Soft Computing 62 (2018), 536–549.

2. Azmi and Berrado, Class-association rules pruning using regularization, in Proc. of 2016 IEEE/ACS 13th International Conference of Computer Systems and Applications (AICCSA) (2016), pp. 1–7.

3. Bechini, Marcelloni and Segatori, A MapReduce solution for associative classification of big data, Information Sciences 322 (2016), 33–55.

4. Breiman, Friedman J.H., Olshen R.A. and Stone C.J., Classification and Regression Trees. Wadsworth, Belmont, CA: Republished by CRC Press 1984.

5. Cao and Yang, An improved k-medoids clustering algorithm, in Proc of Computer and Automation Engineering (ICCAE), 2010.

6. Chen, Wang, Li, Wu and Tian, Principal Association Mining: An efficient classification approach, Knowledge-Based Systems 67 (2014), 16–25.

7. Coenen, Leng and Zhang, Threshold tuning for improved classification association rule mining, in Proc of PAKDD 2005, LNAI 3518, (2005), pp. 216–225.

8. Hadi, Issa and Ishtaiwi, ACPRISM: Associative classification based on PRISM algorithm, Information Sciences 417 (2017), 287–300.

9. Hu, Lu, Zhou and Shi, Integrating classification and association rule mining: A concept lattice framework, in Proc of the International Workshop on New Directions in Rough Sets, Data mining, and Granular-Soft Computing (1999), pp. 443–447.

10. Kaufman and Rousseeuw P.J., Clustering by Means of Medoids, 1987.

11. Li, Han and Pei, CMAR: Accurate and efficient classification based on multiple class-association rules, in Proc of 1st IEEE international conference on Data mining (2001), pp. 369–376.

12. Liu, Hsu and Ma, Integrating classification and association rule mining, in Proc of the 4th International Conference on Knowledge Discovery and Data Mining (1998), pp. 80–86.

13. Liu Y.Z., Jiang Y.C., Liu and Yang S.L., CSMC: A combination strategy for multiclass classification based on multiple association rules, Knowledge-Based Systems 21(8) (2008), 786–793.

14. Liu, Ma and Wong C.K., Improving an association rule based classifier, in Proc of the 4th European Conference on Principles of Data Mining and Knowledge Discovery (2000), pp. 80–86.

15. MacQueen J.B., Some methods for classification and analysis of multivariate observations, in Proc of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (1967), pp. 281–297.

16. Ng R.T. and Han, CLARANS: A Method for Clustering Objects for Spatial Data Mining, IEEE Transactions on Knowledge and Data Engineering 14(5) (2002), 1003–1016.

17. Nguyen L.T.T. and Nguyen N.T., An improved algorithm for mining class association rules using the difference of Obidsets, Expert Systems with Applications 42(9) (2015), 4361–4369.

18. Nguyen L.T.T. and Nguyen N.T., Updating mined class association rules for record insertion, Applied Intelligence 42(4) (2015), 707–721.

19. Nguyen, Nguyen L.T.T., Vo and Hong T.P., A novel method for constrained Class-association rule mining, Information Sciences 320 (2015), 107–125.

20. Nguyen L.T.T., Nguyen N.T., Vo and Nguyen H.S., Efficient method for updating class association rules in dynamic datasets with record deletion, Applied Intelligence 48(6) (2018), 1491–1505.

21. Nguyen, Nguyen L.T.T., Vo and Pedrycz, Efficient mining of class association rules with the itemset constraint, Knowledge-Based Systems 103 (2016), 73–88.

22. Nguyen L.T.T., Vo, Hong T.P. and Thanh H.C., CAR-Miner: An efficient algorithm for mining class-association rules, Expert Systems with Applications 40(6) (2013), 2305–2311.

23. Nguyen L.T.T., Vo, Mai and Thanh-Long Nguyen, A Weighted Approach for Class Association Rules, in Proc of ACIIDS 2018, pp. 213–222.

24. Qin, Ma, Herawan and Zain J.M., Data Filling Approach of Soft Sets under Incomplete Information, in Proc of ACIIDS (2) 2011, pp. 302–311.

25. Quinlan J.R., C4.5: Program for machine learning, Morgan Kaufmann 1992.

26. Quinlan J.R., Introduction of decision tree, Machine Learning 1(1) (1986), 81–106.

27. Segatori, Bechini and Ducange, A distributed fuzzy associative classifier for big data, IEEE Transactions on Cybernetics 48(9), 2656–2669.

28. Sudipto, Rajeev and Kyuseok, CURE: An Efficient Clustering Algorithm for Large Databases, in Proc. of the 1998 ACM SIGMOD international conference on Management of data (1998), pp. 73–84.

29. Thabtah F.A., A review of associative classification mining, Knowledge Engineering Review 22(1) (2007), 37–65.

30. Thabtah, Cowling and Peng, MMAC: A new multi-class, multi-label associative classification approach, in Proc of the 4th IEEE International Conference on Data Mining, Brighton, UK (2004), pp. 217–224.

31. Tolun M.R. and Abu-Soud S.M., ILA: An inductive learning algorithm for production rule discovery, Expert Systems with Applications 14(3) (1998), 361–370.

32. Tolun M.R., Sever, Uludag and Abu-Soud S.M., ILA-2: An inductive learning algorithm for knowledge discovery, Cybernetics and Systems 30(7) (1999), 609–628.

33. Vo and Le, A novel classification algorithm based on association rule mining, in Proc of the 2008 Pacific Rim Knowledge Acquisition Workshop (Held with PRICAI’08) (2008), pp. 61–75.

34. Wu, Advances in K-means clustering: A data mining thinking, in Springer Science & Business Media (2012), pp. 17–35.

35. Wu et al., Top 10 algorithms in data mining, Knowledge and Information Systems 14(1) (2008), 1–37.

36. Wu C.-H. and Wang J.-Y., Associative classification with a new condenseness measure, Journal of the Chinese Institute of Engineers 38(4) (2015), 458–468.

37. Xu, Wang, Pang and Tian, Maximum margin of twin spheres machine with pinball loss for imbalanced data classification, Applied Intelligence 48(1) (2018), 23–34.

38. Yin and Han, CPAR: Classification based on predictive association rules, in SIAM International Conference on Data Mining (SDM’03) (2003), pp. 331–335.

39. Zhang, Ramakrishnan and Livny, BIRCH: An efficient data clustering method for very large databases, in Proc. of ACM SIGMOD Conference Management of data (1996), pp. 103–114.

40. Zhang, Zhao, Cao and Zhang, Class association rule mining with multiple imbalanced attributes, in Proc of Australasian Joint Conference on Artificial Intelligence 2007, LNAI 4830, (2007), pp. 827–831.
