An improved fuzzy classifier for imbalanced data

Abstract

Choosing a model that balances the recognition rate of the “large” class against that of the “small” class in imbalanced data often involves a serious trade-off. Most approaches emphasize the accuracy of the “large” class. The drawback is that the potentially informative “small” class may be overlooked, and the resulting model may even overfit. In this paper, we propose an alternative approach based on a fuzzy system for classification problems with imbalanced data, called the receive feedback model (RFM). It starts with a maximal attribution ratio probability that includes all observations for each class, and then gradually reclassifies “unlabeled” samples if they succeed in the minimal risk evaluation of a certain class. To support the use of the RFM in classification problems, we further establish that the model is probably approximately correct (PAC) and that our procedure converges. Extensive experiments on public data sets and the results of statistical tests show that the proposed RFM significantly outperforms other approaches in terms of an appropriate trade-off between the recognition rates of the “large” class and the “small” class.

Keywords

Imbalanced data, classification, fuzzy number, probably approximately correct, fuzzy rule

1 Introduction

Classifying objects into different categories is one of the key problems in pattern recognition and data mining, and it has received much attention over the past decades [1–8]. For classification problems, approaches that handle uncertainty, such as fuzzy sets [9–11], rough sets [13] and Bayesian methods [14–16], are widely used. In the paper How Good Are Fuzzy If-Then Classifiers? [17], Kuncheva argues that fuzzy rule-based classifiers achieve high classification accuracy and are fully comprehensible. For example, they can be built into tree-based classifiers or fuzzy rule-based regression models. Fuzzy rule-based classifiers are a popular counterpart of fuzzy control systems, and numerous studies discuss the practical design of such classifiers, among them neuro-fuzzy models [18, 19] and fuzzy systems constructed using genetic algorithms [20]. Fuzzy rules may also be of interest in their own right, as they can characterise distinctions between models in a simple and interpretable way. Unlike traditional decision trees, in which only a single feature is taken into account at each node, a tree-based fuzzy classifier uses at each node a fuzzy rule that involves multiple features.

A typical tree-based fuzzy classifier such as FRDT [21], which builds on decision trees [22], works by building a decision tree based on fuzzy rules. However, decision trees presuppose a relatively balanced class distribution. Although the FRDT approach exhibits better performance in terms of the trade-off between accuracy and the size of the produced tree, its drawback is a lower recognition rate on imbalanced data, and it may even produce an overfitting model. Recently, classification in input spaces pervaded by imbalanced data [23, 24] has become a hot research topic, and classification is of interest as a tool for analysing this type of data in several fields. Examples include using inspection data to check for a patient’s disease [25], automatic text classification [26] and identifying malicious calls [27].

In this paper, we consider classification problems with imbalanced data. Our aim is to enhance the recognition rate for the “small” class of an imbalanced sample without producing an overfitting model. For this purpose, a fuzzy rule (used to calculate an attribute ratio probability) defined on the input space is computed from the training samples. The maximum of the attribute ratio probabilities is used to predict a label. “Unlabeled” samples are then gradually reclassified if they succeed in the minimal risk evaluation of a certain class. Our method thus classifies according to the principle of “maximum attribute ratio and minimum risk evaluation”. Extensive experiments on public data sets and the results of statistical tests show that the proposed RFM significantly outperforms other approaches in terms of an appropriate trade-off between the recognition rates of the “large” class and the “small” class.

This paper presents a novel method for imbalanced data, called the receive feedback model. Rather than learning a model from the samples directly, our method looks for fuzzy rules consisting of finitely many fuzzy numbers and computes the attribute ratio for each class in parallel. In Section 2, we propose the receive feedback model and apply it to data classification. Furthermore, we establish the probably approximately correct property of the model and the convergence of our procedure. Numerical examples are given in Section 3. Finally, we conclude with a brief discussion and future work in Section 4.

2 Receive feedback model

2.1 An overall architecture of the RFM

In this section, we consider the overall architecture of the proposed approach for classification problems with imbalanced data. We suppose that the data can be written in the form (X_i, Y_i) for observation i = 1, 2, ..., n, where X_i = {x_i1, x_i2, ..., x_im} is the set of active predictors for observation i (out of a total of m predictors) and Y_i is the class label. An important example of this type of problem is automatic text classification, where X_i is a set of frequently appearing words for document i, and Y_i indicates whether the document belongs to a certain class. In this case, although the size of a text can be of the order of several thousand words or more, the useful or interesting information accounts for only a small part of it, and its size is significantly smaller than that of the samples. More generally, if samples with an imbalanced class distribution are available, they can be divided into the “large” class, which includes the majority of patterns (data points), and the remaining “small” class.

Our aim here is to develop a classification model for imbalanced data. More precisely, we are interested in discovering the attribution ratio probability and the risk evaluation without requiring any hypothesis that the class distribution is balanced.

Let us summarize the notation used throughout the paper. Let X be an n × m matrix and let x_ij be the j-th feature value of the pattern X_i. The sample size is n, with one pattern per row, and the number of features is m, with one feature per column. We have $X = \begin{bmatrix} X_{1} \\ X_{2} \\ \vdots \\ X_{n} \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \dots & x_{1m} \\ x_{21} & x_{22} & \dots & x_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \dots & x_{nm} \end{bmatrix}, \quad Y = \begin{bmatrix} 1 \\ 2 \\ \vdots \\ c \end{bmatrix},$ (1) Following [21], we assume that the number of fuzzy numbers defined for each feature f_j equals the number of classes c. That is, let f_k,j ∈ F denote the k-th (k = 1, 2, ..., c) fuzzy number formed for the feature f_j. Each trapezoidal fuzzy number f_k,j is characterized by two parameters: for k = 1, the trapezoidal fuzzy number f_1,j has the parameters m_1,j and m_2,j, and for k = c, the trapezoidal fuzzy number f_c,j has the parameters m_c-1,j and m_c,j. The boundaries of these trapezoidal fuzzy numbers are $\frac{m_{1, j} + m_{2, j}}{2}$ and $\frac{m_{c - 1, j} + m_{c, j}}{2}$, respectively. Each triangular fuzzy number f_k,j is characterized by three parameters m_k-1,j, m_k,j and m_k+1,j, and its boundaries are $\frac{m_{k - 1, j} + m_{k, j}}{2}$ and $\frac{m_{k, j} + m_{k + 1, j}}{2}$. In [21], the parameters m_k,j are taken as the mean values of all patterns falling within the class C_k with respect to the feature f_j. Let m_k,j take the values given by Equation (2) and let $m_{k, j}^{'}$ be the same values sorted in ascending order. $m_{k, j} = \frac{\sum_{X_{i} \in C_{k}} x_{ij}}{| C_{k} |},$ (2) A fuzzy rule is a subset $F_{R}^{(k)} \subseteq F = \{f_{k, j} : k = 1, 2, \dots, c;\ j = 1, 2, \dots, m\}$ of all fuzzy numbers. The fuzzy rule chosen for a class is the one that has higher fuzzy confidence than the other fuzzy rules.
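
As an illustration of Equation (2) and of the boundary midpoints described above, the following minimal Python sketch computes the class means m_k,j and the midpoints between consecutive sorted means for one feature. The function names `class_means` and `boundaries` and the toy data are hypothetical and are not taken from [21]; the sketch only covers the parameters, not the membership functions themselves.

```python
from collections import defaultdict


def class_means(X, y):
    """Eq. (2): means[k][j] is the mean of feature j over the patterns of class k."""
    sums = {}
    counts = defaultdict(int)
    for row, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(row)
        for j, value in enumerate(row):
            sums[label][j] += value
        counts[label] += 1
    return {k: [s / counts[k] for s in sums[k]] for k in sums}


def boundaries(means, j):
    """Midpoints between consecutive sorted means m'_{k,j} of feature j."""
    sorted_means = sorted(m[j] for m in means.values())
    return [(a + b) / 2 for a, b in zip(sorted_means, sorted_means[1:])]


# Toy data (hypothetical): four patterns, two features, two classes.
X = [[1.0, 10.0], [3.0, 14.0], [7.0, 2.0], [9.0, 6.0]]
y = [1, 1, 2, 2]
means = class_means(X, y)
print(means)                 # class 1 -> [2.0, 12.0], class 2 -> [8.0, 4.0]
print(boundaries(means, 0))  # [5.0]
```

With c classes each feature yields c sorted means and c - 1 boundaries, which matches the two boundaries per triangular fuzzy number and one boundary per outer trapezoidal fuzzy number described above.
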
For simplicity, suppose there are only two classes with the label set {-1, 1}. Let a threshold β (0 < β < 1) be given; in general, β is set to 0.02 (see [21]). Our goal is to find optimal subsets $F_{R}^{(- 1)}$ and $F_{R}^{(1)}$ (with as high a correlation as possible) for the two classes, where $F_{R}^{(- 1)} = \max \{F_{l} \mid (A - B) > β\},$ (3) $A = Fconf (F_{l} \Rightarrow - 1),$ (4) $B = Fconf (F_{l + 1} \Rightarrow - 1),$ (5) and $F_{R}^{(1)}$ is defined analogously. The attribution ratio probability of the class C_k is $P_{AR}^{(k)} = \frac{F_{k}}{F_{1} + F_{2} + \dots + F_{k} + \dots + F_{c}},$ (6) that is, $P_{AR}^{(k)} = \frac{\sum_{p = 1}^{l^{(k)}} f_{k, j}^{p}}{\sum_{k = 1}^{c} \sum_{p = 1}^{l^{(k)}} f_{k, j}^{p}}, \quad (f_{k, j}^{p} \in F_{R}^{(k)}),$ (7) where $f_{k, j}^{p}$ is the p-th fuzzy number of the fuzzy rule $F_{R}^{(k)}$ of the class C_k. In particular, $P_{AR}^{(- 1)} = \frac{\sum_{p = 1}^{l^{(- 1)}} f_{- 1, j}^{p}}{\sum_{p = 1}^{l^{(- 1)}} f_{- 1, j}^{p} + \sum_{p = 1}^{l^{(1)}} f_{1, j}^{p}} .$ (8) Here l^(k) satisfies 1 ≤ l^(k) ≤ min (MaxL, c × m). Here and throughout the paper, we use the subscript AR to indicate that the probabilities are attribution ratio probabilities. For example, for c ∈ {-1, 1}

$\begin{matrix} P_{AR} (x \in X, F_{R}^{(c)} \subset F | Y = c) \\ = \max \{P_{AR}^{(- 1)}, P_{AR}^{(1)}\} . \end{matrix}$ (9) This is a parallel computation. “Unlabeled” samples are then gradually reclassified if they succeed in the minimal risk evaluation of a certain class. The risk evaluation is as follows: $L_{i} = \arg \min_{k} \left\{ \frac{\sqrt{M_{1} + M_{2} + \dots + M_{l^{(k)}}}}{2} \right\},$ (10) $M_{l^{(k)}} = (x_{i, j_{l^{(k)}}} - m_{k, j_{l^{(k)}}})^{2},$ (11) where $m_{k, j_{l^{(k)}}}$ is the l^(k)-th parameter of the fuzzy rule $F_{R}^{(k)}$ of the class C_k. The fuzzy rule of the k-th class of interest is $F_{R}^{(k)} = \{f_{k, j_{1}}, f_{k, j_{2}}, \dots, f_{k, j_{p}}\}$ (p is the size of the fuzzy rule).
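
To make Equations (7) and (9)–(11) concrete, here is a minimal Python sketch. It assumes that the membership degrees of a pattern in the fuzzy numbers of each rule are already available as plain numbers (the membership functions themselves are not reproduced here), and that a tie in the maximum of Equation (9) marks a pattern as “unlabeled”. All function names and the toy values are hypothetical.

```python
import math


def attribute_ratio(memberships):
    """Eq. (7): memberships[k] lists the membership degrees of one pattern in
    the fuzzy numbers of rule F_R^(k); returns P_AR^(k) for every class k."""
    totals = {k: sum(values) for k, values in memberships.items()}
    denom = sum(totals.values())
    if denom == 0:
        # Assumption: a pattern with no support in any rule gets ratio 0 everywhere.
        return {k: 0.0 for k in totals}
    return {k: t / denom for k, t in totals.items()}


def receive_label(p_ar):
    """Eq. (9): the class with maximal P_AR, or None on a tie ("unlabeled")."""
    best = max(p_ar.values())
    winners = [k for k, p in p_ar.items() if p == best]
    return winners[0] if len(winners) == 1 else None


def risk(x, means, rule_features):
    """Eqs. (10)-(11): half the Euclidean distance over the rule's features."""
    return math.sqrt(sum((x[j] - means[j]) ** 2 for j in rule_features)) / 2


def feedback_label(x, class_means, rules):
    """Eq. (10): the class with minimal risk evaluation."""
    return min(class_means, key=lambda k: risk(x, class_means[k], rules[k]))


# Toy values (hypothetical) for the two-class case with labels -1 and 1.
p = attribute_ratio({-1: [0.25, 0.5], 1: [0.125, 0.125]})
print(p, receive_label(p))                  # {-1: 0.75, 1: 0.25} -1

tied = attribute_ratio({-1: [0.5], 1: [0.5]})
print(receive_label(tied))                  # None, so the pattern goes to risk evaluation
print(feedback_label([1.0, 2.0],
                     {-1: [1.0, 6.0], 1: [4.0, 2.0]},
                     {-1: [0, 1], 1: [0, 1]}))  # 1 (risk 1.5 versus 2.0)
```

The two stages are kept as separate functions so that the “maximum attribute ratio” decision and the “minimum risk evaluation” decision can be inspected independently.
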

A pure brute-force search examines each fuzzy rule $F_{R}^{(k)}$ of finite length to check whether it satisfies (3). We restrict the length of a fuzzy rule $F_{R}^{(k)}$ in order to control appropriately the trade-off between the recognition rates of the “large” class and the “small” class.

In this section, we present an alternative approach based on a fuzzy system for classification problems with imbalanced data, shown in Fig. 1. The classification process involves three steps: first, it starts with a maximal attribution ratio probability that includes all samples for each class (the “Receive” stage); then, “unlabeled” samples are gradually reclassified if they succeed in the minimal risk evaluation of a certain class (the “Feedback” stage); finally, the model is expanded hierarchically at the “Update” stage.

Above all, the model is a novel architecture for a fuzzy classifier on imbalanced data. In Fig. 1, the root node “Data” denotes the sample of interest. We classify the sample by the attribute ratio probability based on fuzzy rules, which yields the nodes “Class 1” to “Class n”, one for each class. The samples that are not classified (the “unlabeled” samples) by the attribute ratio probability of the first layer are collected in a “Risk Evaluation” node (which contains a mixture of data from different classes), and the “unlabeled” samples are then reclassified. If the “Risk Evaluation” node is not empty, the RFM grows (the update process in Fig. 1) by expanding the “Risk Evaluation” node, as illustrated in Fig. 2.

2.1.1 Receive-stage

Let AR_t be the attribute ratio of the t-th layer and let $f_{k, j}^{p}$ be the p-th fuzzy number of the feature f_j for the class C_k. Let $P_{AR}^{(k)}$ denote the attribution ratio probability for the class C_k. More formally, we describe the sample as follows.

Definition 1. Samples X indexed by the subscript AR are called attribution ratio uncertain samples, and $P_{{AR}_{t}}^{(k)}$ (Equation (6)) is the attribution ratio probability for the class C_k at the t-th layer. If $P_{{AR}_{t}} (x \in X, F_{R}^{(k)} \subset F | Y = k) = \max \{P_{{AR}_{t}}^{(k)}\},$ (12) then, for every such x ∈ X, the class label of x is k.

2.1.2 Feedback-stage

After the “Receive” stage, the given samples have been assigned to a class of interest. However, some samples are not classified, or are assigned to several different classes. That is, the samples that are not classified (the “unlabeled” samples) by the attribute ratio probability of this layer are collected in a “Risk Evaluation” node and then reclassified.

Definition 2. Let X be a sample of interest, given as an n × m matrix. There is a one-to-one correspondence between a fuzzy number and its parameter, and $f_{k, j}^{p}$ is an element of the fuzzy rule $F_{R}^{(k)}$, that is, $f_{k, j}^{p} \in F_{R}^{(k)} \subset F$.

Definition 3. The set RE_t (AR₁, AR₂, ..., AR_nt) ⊆ RE_t-1 of samples that are not classified at layer t, or for which $P_{AR}^{(k_{1})} = P_{AR}^{(k_{2})}$ for some $k_{1} \neq k_{2}$, is called the “unlabeled” sample set of RE_t-1.

Once the risk evaluation of Equation (10) has been carried out, the “unlabeled” samples are assigned to a class of interest. In general, this process terminates when every pattern has been classified. The “Receive” and “Feedback” stages are therefore the two fundamental steps of the method.

2.1.3 Update-stage

This stage modifies the structural model (RFM) as a dynamic finite process. The update stage is crucial for obtaining accurate models when the measured structural responses (of the receive feedback model) are available. Its inevitable disadvantage is that the size of the model grows.

2.2 The algorithm of RFM

2.2.1 The algorithm of RFM

Our method searches for the “maximum attribute ratio and minimum risk evaluation” by looking at the fuzzy rules of the given classes. We start with the samples of each class as a block and compute the parameters of the corresponding fuzzy rules. In each step of finding a fuzzy rule, we keep only the fuzzy rules that differ from the others. We then repeat this process until all fuzzy rules have been found. If a pattern X_i (i = 1, 2, ..., n) has a high attribute ratio probability for a certain class in the first layer (t = 1) of the RFM, that is, $P_{AR} (x \in X, F_{R}^{(c)} \subset F | Y = c)$ is maximal, it is assigned to that class. If instead an “unlabeled” sample X_i has the lowest risk evaluation for a certain class in the first layer of the model, it is reclassified into that class, that is, the class of interest with the minimum risk evaluation.

To describe the details of our algorithm, we introduce a parameter of the model that will be needed later. Let RE^(T) denote the T-th update of the model.

Algorithm 1. Receive Feedback Model

RE⁽⁰⁾ ← X, T ← 0, i ← 1;

while RE⁽ᵀ⁾ ≠ ∅ do

k ← the number of classes in the sample RE⁽ᵀ⁾

if k = 1 then break; end if

▹ otherwise continue with the following procedure

$F_{R}^{(k)} \leftarrow$ call the AREA algorithm [21]

compute $P_{{AR}_{t}}^{(k)}$ (k from 1 to c) in parallel according to Equation (6)

find the patterns X_i whose label is empty or that are assigned to different classes

▹ these are the “unlabeled” samples

compute the risk evaluation according to Equation (10)

if the set of “unlabeled” samples is empty then break; end if

i ← i + 1, T ← T + 1 ▹ update: repeat the above process

end while

Algorithm 1 describes a general version of the receive feedback model procedure. The process is repeated until one of the following three conditions is satisfied:

The set of “unlabeled” samples is empty, that is, all of the samples can be classified by Algorithm 1;

The remaining “unlabeled” samples all belong to the same class of interest;

None of the remaining “unlabeled” samples can be classified by the attribute ratio of the current layer, in which case these samples are discarded.
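
The control flow of Algorithm 1 and its three stopping conditions can be sketched in Python as follows. This is only one possible reading of the procedure, not the authors' implementation: the AREA algorithm of [21] is not reproduced, so rule construction is delegated to a hypothetical callback `fit_layer`, and `toy_fit_layer` is a one-dimensional stand-in (class means with a fixed radius) used solely to make the example runnable.

```python
def rfm_layers(samples, labels, fit_layer):
    """Layered Receive/Feedback/Update loop (one reading of Algorithm 1).

    fit_layer(samples, labels) is a hypothetical callback returning two
    functions, receive(x) and feedback(x); each maps a pattern to a class
    label, or to None when it cannot decide.
    """
    pending = list(range(len(samples)))   # RE^(T): indices still "unlabeled"
    assigned = {}                         # index -> predicted class
    layers = 0                            # number of layers built (T)
    while pending:
        classes = {labels[i] for i in pending}
        if len(classes) == 1:             # condition 2: one class remains
            only = next(iter(classes))
            assigned.update((i, only) for i in pending)
            break
        receive, feedback = fit_layer([samples[i] for i in pending],
                                      [labels[i] for i in pending])
        layers += 1
        remaining = []
        for i in pending:
            label = receive(samples[i])         # Receive stage
            if label is None:
                label = feedback(samples[i])    # Feedback stage
            if label is None:
                remaining.append(i)
            else:
                assigned[i] = label
        if len(remaining) == len(pending):      # condition 3: no progress, discard
            break
        pending = remaining                     # Update stage: expand next layer
    return assigned, layers


def toy_fit_layer(xs, ys):
    """Toy 1-D stand-in for the fuzzy rules: class means plus a fixed radius."""
    means = {k: sum(x for x, y in zip(xs, ys) if y == k) / ys.count(k)
             for k in set(ys)}

    def receive(x):
        near = [k for k, m in means.items() if abs(x - m) < 2.0]
        return near[0] if len(near) == 1 else None

    def feedback(x):
        ranked = sorted(means, key=lambda k: abs(x - means[k]))
        tie = abs(x - means[ranked[0]]) == abs(x - means[ranked[1]])
        return None if tie else ranked[0]

    return receive, feedback


assigned, layers = rfm_layers([0.0, 1.0, 5.0, 9.0, 10.0],
                              [-1, -1, 1, 1, 1], toy_fit_layer)
print(assigned, layers)   # {0: -1, 1: -1, 2: 1, 3: 1, 4: 1} 1
```

In this toy run the patterns 5.0 and 10.0 are left undecided by the “Receive” stage and are assigned by the “Feedback” stage, so a single layer suffices; condition 1 corresponds to the loop ending because no index is pending.
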

We first analyze the time complexity of our algorithm. The maximum length of a fuzzy rule generated for each class by the AREA algorithm [21] is MaxL; therefore, the time complexity of the AREA algorithm is O (MaxL * c * n). The RFM algorithm is a non-iterative process; in the worst case the number of layers is n, which occurs when each layer of the learned tree classifies only a single training pattern. Thus, the total time complexity of the RFM method is O (MaxL * c * n * n).

2.3 The theoretical results of RFM

In supervised learning, we define 𝒳 as the input space and {1, 2, ..., c} as the set of labels. Suppose that the training samples are randomly drawn from an underlying probability distribution P. We assume that the training data are an independent and identically distributed (i.i.d.) sample from P: S_n = {(X_i, Y_i)} ∈ {𝒳 × {1, 2, ..., c}} ⁿ, X_i ∈ 𝒳 ⊂ ℝ^m. The task is to estimate a decision function $f : 𝒳 \to ℝ$ such that the sign of f (X) provides an accurate prediction of the unknown label associated with the input X under the probability distribution P.

Assumption. For the probability distribution of the training samples, there exists a positive constant ɛ > 0 such that $P (\{i \in \{1, 2, \dots, n\} : ɛ \leq P_{AR}^{(k)} \leq 1 - ɛ\}) > 0,$ (13) where $P_{AR}^{(k)}$ is the attribute ratio probability (see Definition 1) of the label k for the input X_i.

Remark. By Equation (6), we have $P_{AR}^{(k)} = \frac{F_{k}}{F_{1} + F_{2} + \dots + F_{k} + \dots + F_{c}}$. Applying the convergence of sequences of functions, we have $P_{AR}^{(k)} \to c$, where c (0 ≤ c ≤ 1) is a constant (here c denotes the limit, not the number of classes).

Theorem 1. Let X_i (i = 1, 2, ..., n) be an independent and identically distributed (i.i.d.) sample, let μ_n be the number of occurrences of a certain class among the n samples, so that $\frac{μ_{n}}{n}$ is its empirical frequency, and let 𝒫 be the probability that this class occurs in the sample. Then, for any ɛ > 0, we have $\lim_{n \to \infty} P \{| \frac{μ_{n}}{n} - 𝒫 | < ɛ\} = 1 .$ (14)

Proof. Suppose that μ_n ∼ b (n, 𝒫). The mathematical expectation and variance of $\frac{μ_{n}}{n}$ are $E (\frac{μ_{n}}{n}) = 𝒫, \quad Var (\frac{μ_{n}}{n}) = \frac{𝒫 (1 - 𝒫)}{n}.$ (15) By the Chebyshev inequality, we have $1 \geq P \{| \frac{μ_{n}}{n} - 𝒫 | < ɛ\} \geq 1 - \frac{Var (\frac{μ_{n}}{n})}{ɛ^{2}} = 1 - \frac{𝒫 (1 - 𝒫)}{n ɛ^{2}} .$ (16) As n → ∞, the lower bound satisfies $\lim_{n \to \infty} (1 - \frac{𝒫 (1 - 𝒫)}{n ɛ^{2}}) = 1,$ (17) hence $\lim_{n \to \infty} P \{| \frac{μ_{n}}{n} - 𝒫 | < ɛ\} = 1 .$ (18) □
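
The lower bound in Equation (16) is easy to evaluate numerically. The short Python sketch below does so for a hypothetical class probability 𝒫 = 0.25 and tolerance ɛ = 0.05 (both values are chosen only for illustration); the bound is clipped at 0 because it is vacuous for small n.

```python
def chebyshev_lower_bound(p, n, eps):
    """Lower bound 1 - p(1-p)/(n eps^2) on P(|mu_n/n - p| < eps), clipped at 0."""
    return max(0.0, 1.0 - p * (1.0 - p) / (n * eps ** 2))


# Hypothetical values: class probability 0.25, tolerance 0.05.
for n in (10, 100, 1000, 10000):
    print(n, round(chebyshev_lower_bound(0.25, n, 0.05), 4))
```

The bound is 0 for n = 10 and rises to roughly 0.25, 0.925 and 0.9925 for n = 100, 1000 and 10000, which illustrates the convergence to 1 stated in Equation (18).
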

Our goal is to construct a prediction function with small loss over D: E_(X,Y)∼D (f (X) ≠ Y). Let $A$ denote the collection of all RFM learning algorithms, and let $\hat{f} = A (S_{n})$ be the prediction rule learned from the training sample S_n = (X_i, Y_i) _i=1,2,...,n. Then the training error of a classification algorithm is as follows:

$\begin{matrix} TRAINING error (f) \\ = \frac{1}{n} \sum_{i = 1}^{n} count (f (X_{i}) \neq Y_{i}) . \end{matrix}$ (19)

The average prediction performance of the decision function f, that is, the performance of a learning algorithm on test samples, is evaluated by the loss over D: $TEST error (f) = E_{(X, Y) \sim D} [count (f (X) \neq Y)] .$ (20)
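
The training error of Equation (19) is simply the fraction of misclassified training patterns. A minimal Python sketch, with a hypothetical threshold classifier standing in for the learned rule f:

```python
def training_error(predict, X, Y):
    """Eq. (19): fraction of training patterns with predict(X_i) != Y_i."""
    mistakes = sum(1 for x, y in zip(X, Y) if predict(x) != y)
    return mistakes / len(X)


def sign_rule(x):
    """Hypothetical classifier: label 1 for positive inputs, -1 otherwise."""
    return 1 if x > 0 else -1


print(training_error(sign_rule, [-2, -1, 1, 2], [-1, 1, 1, 1]))  # 0.25
```

The test error of Equation (20) is the expectation of the same indicator under D, so in practice it is estimated by applying the same function to held-out data.
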

To assess how efficient our approach is as a learning algorithm and to justify it theoretically, we minimize the training error as follows: $A^{*} = \arg \min_{A} \sum_{i = 1}^{n} count (P_{AR}^{(k)} > k \text{ or } P_{AR}^{(k)} < k) .$ (21)

The training error of $A^{*}$ can be observed from Equation (21). We now give a probably approximately correct bound by estimating the test error.

Theorem 2. (Probably Approximately Correct) Let $A$ be a finite set. For an $A^{*} \in A$ that may depend on the training samples,

$\begin{matrix} | TEST error (A^{*}) - TRAINING error (A^{*}) | \\ \leq \frac{1}{n} \ln \frac{| A |}{η} ≜ ɛ (n, A, η), \end{matrix}$ (22) where n is a positive integer denoting the sample size. For all n, the above inequality holds with probability at least 1 - η.

Proof. Let $l (A^{'}) = P \{(X, Y) : A^{'} (X) \neq Y\}$ denote the test error of $A^{'} \in A$, and let B_ɛ be a subset of $A$. For a given ɛ ∈ (0, 1), define $B_{ɛ} = \{A^{'} \in A : l (A^{'}) > ɛ\} .$ (23) By Equation (13) (see the Assumption), we hence have $B_{ɛ} = \{A^{'} \in A : P \{(X, Y) : A^{'} (X) \neq Y\} > ɛ\} .$ (24) For any $A^{'} \in B_{ɛ}$, we have

$\begin{array}{l} P \{(X, Y) : A^{'} (X) = Y\} \\ = 1 - P \{(X, Y) : A^{'} (X) \neq Y\} \end{array}$ (25)

≤ 1 - ɛ, or, since the samples are i.i.d., $P \{S_{n} = \{(X_{1}, Y_{1}), (X_{2}, Y_{2}), \dots, (X_{n}, Y_{n})\} : A^{'} (X_{i}) = Y_{i}\} \leq (1 - ɛ)^{n} \leq e^{- ɛ n},$ (26) where i = 1, 2, ..., n. Let $A (S_{n}) = \{A^{'} : A^{'} (X_{i}) = Y_{i}, i = 1, 2, \dots, n, A^{'} \in A\} .$ (27) Equation (26) states that the probability that a given element of B_ɛ belongs to $A (S_{n})$ is at most e^-ɛn. Since B_ɛ contains at most $| A |$ elements, the probability that some $A^{*} \in B_{ɛ}$ lies in $A (S_{n})$ is at most $| A | e^{- ɛ n}$. In other words, $P \{S_{n} : \exists A^{*} \in A (S_{n}), l (A^{*}) > ɛ\} \leq | A | e^{- ɛ n} .$ (28) Hence, for any $A^{*}$ of $A (S_{n})$, we have $l (A^{*}) \leq ɛ .$ (29)

This holds for any randomly selected S_n with probability at least $1 - | A | e^{- ɛ n}$. To estimate the value of ɛ, we only need to require

$1 - | A | e^{- ɛ n} \geq 1 - η,$ (30) that is, $| A | e^{- ɛ n} \leq η,$ (31) or $\ln | A | - ɛ n \leq \ln η .$ (32)

It suffices to take $ɛ = \frac{1}{n} \ln \frac{| A |}{η},$ (33) and as a result the algorithm is PAC. ■
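
To get a feel for the magnitude of the bound in Equation (33), the following Python sketch evaluates ɛ for a given sample size and, conversely, the sample size required for a target ɛ. The size of the hypothesis set (1000) and the confidence parameter η = 0.05 are hypothetical values chosen for illustration; n = 306 matches the size of the haberman data set used in Section 3.

```python
import math


def pac_epsilon(n, num_hypotheses, eta):
    """Eq. (33): eps = (1/n) * ln(|A| / eta)."""
    return math.log(num_hypotheses / eta) / n


def samples_needed(eps, num_hypotheses, eta):
    """Smallest integer n with |A| * exp(-eps * n) <= eta, cf. Eq. (31)."""
    return math.ceil(math.log(num_hypotheses / eta) / eps)


# Hypothetical values: |A| = 1000 candidate models, eta = 0.05.
print(round(pac_epsilon(306, 1000, 0.05), 4))   # 0.0324
print(samples_needed(0.05, 1000, 0.05))         # 199
```

Because |A| enters only through its logarithm, enlarging the hypothesis set by a factor of ten adds only ln 10 ≈ 2.3 to the numerator.
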

3 Numerical examples

In this section, we give an example and results on several public data sets (Table 1 and Fig. 4) to provide further insight into the performance of our method. We first use an example to walk through our method step by step.

Example 1. We consider a public data set (haberman, from the UCI Repository of machine learning [28]) as an illustrative example. The data set consists of 306 patterns, of which 81 come from class 1 and 225 belong to class 2. The sample is shown in Fig. 3. Let X = [x_ij] be the 306 × 3 matrix representing the sample; thus X_i (i = 1, 2, ..., 306) denotes the i-th pattern and f_j (j = 1, 2, 3) denotes the j-th feature of X.

Step 1. “Receive stage”: Since the number of classes is k = 2, there are two fuzzy numbers for each feature: f_1,1 and f_2,1 for feature f₁, f_1,2 and f_2,2 for feature f₂, and f_1,3 and f_2,3 for feature f₃. We then compute the parameters m_k,j from Equation (2).

In this step, the values of the parameters m_k,j are taken as the mean values of all patterns falling within the class C_k with respect to the feature f_j. From Equation (6), we compute the attribute ratio of each class. Finally, patterns are classified by the maximum attribute ratio probability at the first level.

Step 2. “Feedback stage”: After the first stage, the haberman data set has been classified, but some patterns are not classified or are assigned to several different classes. That is, the patterns that are not classified (“unlabeled” samples) by the attribute ratio probability of this layer are collected in a “Risk Evaluation” node. In Step 2, we reclassify the “unlabeled” samples. Finally, each pattern is assigned a class of interest.

Specifically, we compare our method with some well-known classifiers and with the fuzzy rule based decision tree [21]. We test the performance of the algorithms on data sets from the UCI Repository of machine learning. The results are given in Tables 2 and 3 and Figs. 5 and 6.

Tables 2 and 3 show the test errors of BFT, C4.5, LAD, SC, NBT, FRDT and the proposed learning method, averaged over ten runs of 10-fold cross-validation. The tables show that the RFM has a higher recognition ratio for imbalanced samples (the entries in red font). Overall, the proposed method is better than the other learning algorithms. Indeed, the difference between them and RFM is statistically significant. On the other hand, balanced data do not significantly affect the experimental results. Hence, the revision of the fuzzy rule based decision tree can improve the prediction accuracy of learning on imbalanced data.
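
The evaluation protocol, ten runs of 10-fold cross-validation, can be sketched in Python as follows. The paper does not state whether the folds were stratified; stratification is an assumption made here because it keeps the “small” class represented in every fold. The functions `stratified_folds` and `repeated_cv_error` and the callback `evaluate` are hypothetical names.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split indices into k folds, dealing each class round-robin after shuffling."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, i in enumerate(indices):
            folds[pos % k].append(i)
    return folds


def repeated_cv_error(labels, evaluate, repeats=10, k=10):
    """Mean of evaluate(train_idx, test_idx) over `repeats` runs of k-fold CV."""
    errors = []
    for r in range(repeats):
        for test_idx in stratified_folds(labels, k, seed=r):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(labels)) if i not in held_out]
            errors.append(evaluate(train_idx, test_idx))
    return sum(errors) / len(errors)


# Class sizes of the haberman data set from Example 1: 81 and 225 patterns.
labels = [1] * 81 + [2] * 225
folds = stratified_folds(labels)
print([len(f) for f in folds])
```

Here `evaluate` would train a classifier on `train_idx` and return its test error on `test_idx`; each run uses a different seed so that the ten partitions differ.
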

In this experiment, we carry out a comparative analysis using the above traditional classifiers. The results presented in Figs. 5 and 6 offer several insights into the performance of the algorithms; the best scores are shown as a “red” dotted line. The misclassified test data and standard deviations for the experiments on several different samples are given in Tables 2 and 3. The fuzzy classifier for imbalanced data sets, that is, RFM, produces high accuracy and outperforms the other approaches on the test data.
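
Since the discussion turns on the recognition rates of the “large” and the “small” class rather than on overall accuracy alone, the following Python sketch shows how the two can diverge. The function name `recognition_rates` and the toy predictions are hypothetical and are not taken from the experiments.

```python
from collections import Counter


def recognition_rates(y_true, y_pred):
    """Per-class recognition rate: correctly recognised patterns / patterns of that class."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {k: hits[k] / totals[k] for k in totals}


# Toy example: class 1 is the "small" class (2 patterns), class 2 the "large" one (8).
y_true = [1, 1] + [2] * 8
y_pred = [1, 2] + [2] * 8
rates = recognition_rates(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(rates, accuracy)   # {1: 0.5, 2: 1.0} 0.9
```

An overall accuracy of 0.9 here hides the fact that only half of the “small” class is recognised, which is the trade-off the RFM is designed to address.
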

In Section 2, we proved the probably approximately correct property of the model and the convergence of our procedure. The numerical experiments described in this section indicate that the learning algorithm derived from the revised fuzzy rule based decision tree is an alternative for solving classification problems involving imbalanced data. Finally, we will extend the current results to fuzzy dynamic models within the distributed parameter framework [29, 30], and then to the underlying systems in network-based environments with time delays, packet dropouts, and/or quantization [31, 32].

4 Conclusion

We studied an improved fuzzy classifier for imbalanced data in classification problems. As is well known, most approaches emphasize the accuracy of the “large” class. The drawback is that the potentially informative “small” class may be overlooked, and the resulting model may even overfit. In this paper, we proposed a receive feedback model, which enhances the recognition rate on imbalanced data based on the principle of “maximum attribute ratio and minimum risk evaluation”. The algorithm modifies the structural model (RFM) as a dynamic finite process. The update stage is crucial for obtaining accurate models when the measured structural responses (of the receive feedback model) are available. Its inevitable disadvantage is that the size of the model grows. Numerical experiments showed that the learning algorithm derived from the revised fuzzy rule based decision tree is an alternative method for solving classification problems involving imbalanced data. We believe that developments along these lines (for imbalanced data) will prove to be fruitful directions for future research. Further, we also plan to generalise the idea to categorical and sparse predictor variables, for example in application areas such as image classification and text classification.

Acknowledgments

The authors thank the editors and the anonymous reviewers for their helpful comments and suggestions. The research was supported by the National Natural Science Foundation of China (Grant Nos. 61573266 and 11301408).

References

1. Aha and Kibler, Instance-based learning algorithms, Machine Learning 6 (1991), 37–66.

2. Agrawal, Imielinski and Swami, Database mining: A performance perspective, IEEE Transactions on Knowledge and Data Engineering 5 (1993), 914–925.

3. Keerthi S.S., Shevade S.K., Bhattacharyya and Murthy K.R.K., Improvements to Platt’s SMO algorithm for SVM classifier design, Neurocomputing 11 (1993), 637–649.

4. Cohen W.W., Fast effective rule induction, in: Proceedings of the 12th International Conference on Machine Learning, 1995, pp. 115–123.

5. Holte R.C., Very simple classification rules perform well on most commonly used datasets, Machine Learning 11 (1993), 63–91.

6. Pernkopf and Wohlmayr, Stochastic margin-based structure learning of Bayesian network classifiers, Pattern Recognition 46 (2013), 464–471.

7. Bounhas, Hamed M.G., Prade, Serrurier and Mellouli, Naive possibilistic classifiers for imprecise or uncertain numerical data, Fuzzy Sets and Systems 239 (2014), 137–156.

8. Wang, Xing, Li, Hua, Dong and Pedrycz, A study on relationship between generalization abilities and fuzziness of base classifiers in ensemble learning, IEEE Transactions on Fuzzy Systems 3 (2015), 1638–1654.

9. Zadeh L.A., Fuzzy sets, Information and Control 8 (1965), 338–353.

10. Zadeh L.A., A fuzzy-algorithm approach to the definition of complex or imprecise concepts, Internat J Man-Machine Stud 8 (1976), 249–291.

11. Zadeh L.A., Fuzzy logic - a personal perspective, Fuzzy Sets and Systems 281 (2015), 4–20.

12. Hassanien A.E., Rough set approach for attribute reduction and rule generation: A case of patients with suspected breast cancer, Journal of the American Society for Information Science and Technology 55 (2004), 954–962.

13. Pawlak, Rough sets: Theoretical aspects of reasoning about data, Springer Science and Business Media, 1991.

14. Yang and Wu, On the properties of concept classes induced by some multiple-valued Bayesian networks, Information Sciences 184 (2012), 155–165.

15. Yang and Wu, Inner product space and concept classes induced by Bayesian networks, Acta Applicandae Mathematicae 106 (2009), 337–348.

16. Yang and Wu, VC dimension and inner product space induced by Bayesian networks, International Journal of Approximate Reasoning 50 (2009), 1036–1045.

17. Kuncheva L.I., How good are fuzzy if-then classifiers? IEEE Transactions on Systems 30 (2000), 501–509.

18. Masulli, Casalino and Vannucci, Bayesian properties and performances of adaptive fuzzy systems in pattern recognition problems, in: ICANN’94, Sorrento, Italy, 1994, pp. 189–192.

19. Nauck and Kruse, A neuro-fuzzy method to learn fuzzy classification rules from data, Fuzzy Sets and Systems 89 (1997), 277–288.

20. Ishibuchi, Nozaki, Yamamoto and Tanaka, Construction of fuzzy classification systems with rectangular fuzzy rules using genetic algorithms, Fuzzy Sets and Systems 65 (1994), 237–253.

21. Wang, Liu, Pedrycz and Zhang, Fuzzy rule based decision trees, Pattern Recognition 48 (2015), 50–59.

22. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, California, 1993.

23. Antonelli, Ducange and Marcelloni, An experimental study on evolutionary fuzzy classifiers designed for managing imbalanced datasets, Neurocomputing 146 (2014), 125–136.

24. Sun, Song, Zhu, Sun, Xu and Zhou, A novel ensemble method for classifying imbalanced data, Pattern Recognition 48 (2015), 1623–1637.

25. Chawla, Bowyer and Hall, SMOTE: Synthetic minority over-sampling technique, Journal of Artificial Intelligence Research 16 (2002), 321–357.

26. Sarker and Gonzalez, Portable automatic text classification for adverse drug reaction detection via multi-corpus training, Journal of Biomedical Informatics 53 (2015), 196–207.

27. Fawcett and Provost, Combining data mining and machine learning for effective user profiling, Proc of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, 1996.

28. Frank and Asuncion, UCI machine learning repository, Irvine, CA: University of California, School of Information and Computer Science, vol. 213, 2010.

29. Qiu J.B., Ding S.X., Gao H.J. and Yin, Fuzzy-model-based reliable static output feedback H_∞ control of nonlinear hyperbolic PDE systems, IEEE Transactions on Fuzzy Systems 99 (2015). doi: 10.1109/TFUZZ.2015.2457934.

30. Qiu J.B., Ding S.X., Li L.L. and Yin, Reliable fuzzy output feedback control of nonlinear parabolic distributed parameter systems with sensor faults, 29 (2015), 1197–1208.

31. Qiu J.B., Wei Y.L. and Karimi H.R., New approach to delay-dependent H_∞ control for continuous-time Markovian jump systems with time-varying delay and deficient transition descriptions, Journal of the Franklin Institute 352 (2015), 189–215.

32. Qiu J.B., Gao H.J. and Ding S.X., Recent advances on fuzzy-model-based nonlinear networked control systems: A survey, IEEE Transactions on Industrial Electronics 63 (2016), 1207–1217.