An instance selection algorithm for fuzzy K-nearest neighbor

Abstract

The condensed nearest neighbor (CNN) is a pioneering instance selection algorithm for 1-nearest neighbor. Many variants of CNN for K-nearest neighbor have been proposed by different researchers. However, few studies have been conducted on condensed fuzzy K-nearest neighbor. In this paper, we present a condensed fuzzy K-nearest neighbor (CFKNN) algorithm that starts from an initial instance set S and iteratively selects informative instances from the training set T, moving them from T to S. Specifically, CFKNN consists of three steps. First, for each instance x ∈ T, it finds the K nearest neighbors in S and calculates the fuzzy membership degrees of the K nearest neighbors using S rather than T. Second, it computes the fuzzy membership degrees of x using the fuzzy K-nearest neighbor algorithm. Finally, it calculates the information entropy of x and selects an instance according to the calculated value. Extensive experiments on 11 datasets are conducted to compare CFKNN with four state-of-the-art algorithms (CNN, edited nearest neighbor (ENN), Tomeklinks, and OneSidedSelection) regarding the number of selected instances, the testing accuracy, and the compression ratio. The experimental results show that CFKNN provides excellent performance and outperforms the other four algorithms.

Keywords

fuzzy membership degree; instance selection; information entropy

1 Introduction

K-nearest neighbor (K-NN) [1] is a simple classification algorithm that only calculates the distances between test instances and training instances and does not train a classification model. Although K-NN has been successfully applied in many fields [2–7], it has the following two drawbacks [8]:

(1) High computational complexity. For classifying a test instance, K-NN calculates the distances between this instance and all training instances. In addition, the entire training set must be loaded into memory.

(2) Sensitivity to noise. When the training set contains noisy or erroneous instances, the performance of the K-NN algorithm deteriorates seriously.

To overcome the first disadvantage of K-NN, researchers proposed instance selection based solutions, which select a subset S from the training set T, and replace T with S for classification. The pioneering work of instance selection for K-NN is the condensed nearest neighbor (CNN) algorithm proposed by Hart [9]. Based on CNN, many instance selection algorithms for K-NN have been proposed, which can be roughly classified into three categories [8]: incremental, decremental, and hybrid algorithms.

Starting from an initial set S, incremental algorithms gradually select important instances from the set T according to instance selection criteria and add them to S until the predefined halting conditions are met. The set S can be initialized as an empty set or with several instances randomly selected from T. CNN is an incremental algorithm that first initializes S with one instance randomly selected from T and then iteratively selects instances from T one by one. Specifically, if an instance x ∈ T is misclassified by 1-NN using the instances of S, then x is moved from T to S. In other words, the instances misclassified by 1-NN using the instances of S are considered important. When T becomes an empty set, or all remaining instances in T are classified correctly by 1-NN using the instances of S, the algorithm halts. Based on CNN, several improved incremental algorithms have been proposed by different authors. Tomek proposed two improved incremental algorithms [10]. The first is similar to CNN; the difference is that when an instance x is misclassified by 1-NN using S, instead of adding x to S, the algorithm first looks for its enemy nearest neighbor y (the instance closest to x with a different class label) and then finds the nearest neighbor z of y that belongs to the same class as x. Finally, z is added to S. The second does not select instances from the original training set T but from a subset F of T, which belongs to the same class as its nearest neighbor. Devi and Murty proposed an instance selection algorithm called modified CNN (MCNN) [11]. When an instance is misclassified, instead of adding it directly to S, the MCNN algorithm first marks the misclassified instance. When all the instances in T have been tested, a representative instance (class center instance) from each class is selected and added to S. Chang et al. [12] proposed generalized CNN (GCNN). In GCNN, each instance x casts a vote for its nearest neighbor of the same category as x, and the instance with the highest number of votes is selected. For each category, instances are selected until no instances remain in the category. Based on the Voronoi partition and the Voronoi enemy nearest neighbor, Angiulli [13] proposed a fast CNN algorithm that is order independent and whose time complexity is quadratic in the worst case. Based on MapReduce and a voting mechanism, Zhai et al. [14] proposed an instance selection algorithm for large datasets that has fast learning speed and a high compression ratio. Based on locality-sensitive hashing, Arnaiz-González et al. [15] proposed an instance selection algorithm with linear time complexity for big data.

Starting from the initialization S = T, decremental algorithms gradually remove unimportant instances from S based on some criteria until the predefined stop condition is met. Representative works include the reduced nearest neighbor (RNN) [16], the minimal consistent set (MCS) [17], and the decremental reduction optimization procedure (DROP) [18]. RNN [16] is an extension of CNN that starts from the set S selected by CNN from T. For each instance x ∈ S, if all instances in T can be classified correctly by S - {x}, then x is removed from S. The consistent subset is a core concept in CNN: it is a subset of T that classifies all instances in T correctly. The minimal consistent subset is the smallest among the consistent subsets. The goal of CNN is to find a minimal consistent subset S of T. However, the S selected by CNN is not guaranteed to be a minimal consistent subset. To address this problem, Dasarathy [17] proposed the MCS algorithm. DROP [18] is a family of instance selection algorithms, of which DROP3 is the best known; it applies a noise filter to remove all instances that are classified incorrectly by K-NN. Arnaiz-González et al. [19] proposed an adaptive DROP algorithm that extends instance selection from classification to regression.

Hybrid algorithms combine incremental and decremental algorithms. During instance selection, instances may be both added to and deleted from S. Intuitively, the performance of hybrid algorithms should be better than that of the incremental and decremental algorithms. However, the computational complexity of hybrid algorithms is high, since extra measurements of the importance of the instances in S are required. Therefore, researchers have paid less attention to hybrid algorithms.

To overcome the second disadvantage of K-NN, Keller et al. [20] introduced fuzzy set theory into K-NN and proposed fuzzy K-NN, which extends K-NN from deterministic scenarios to fuzzy scenarios. Since many real-world phenomena exhibit fuzziness, fuzzy set theory [21] has received increasing attention from researchers in a wide range of scientific areas and has been applied to a wide range of fields. Many methods have been proposed by different researchers; the details can be found in several reviews [22–24] published in recent years. Although fuzzy K-NN can effectively surmount the second drawback of K-NN, it does not overcome the first. To the best of our knowledge, only Zhai et al. [25] have conducted a preliminary study on instance selection for fuzzy K-NN. In this paper, we present a further investigation of this topic. The main contributions of this paper are threefold.

(1) We propose a simple yet effective instance selection algorithm called condensed fuzzy K-NN (CFKNN), which is more effective and efficient than the algorithm presented in [25]. The effectiveness is due to the use of dynamic thresholds, and the efficiency is due to the efficient calculation of the fuzzy membership degrees of the K-nearest neighbors using S rather than T.

(2) We find that there are significant differences between the suitable thresholds for the selection of instances from different datasets with the proposed algorithm.

(3) Extensive experiments are conducted to demonstrate the effectiveness of the proposed algorithm by comparing with four state-of-the-art algorithms using metrics of number of selected instances, compression ratio, and testing accuracy.

The rest of this paper is organized as follows. In Section 2, we review the related work. In Section 3, we describe the details of the proposed method. In Section 4, extensive experiments are carried out to verify the effectiveness of the proposed algorithm by comparing it with four state-of-the-art methods on three aspects: the number of selected instances, the compression ratio, and the testing accuracy. Finally, we conclude our work in Section 5.

2 Preliminaries

In this section, we briefly review the preliminaries related to our work, including K-nearest neighbor, condensed nearest neighbor, and fuzzy K-nearest neighbor.

2.1 K-nearest neighbor

The idea of K-nearest neighbor (K-NN) [1] is very simple. Given a training set T and a test instance x, the K-NN algorithm firstly finds K instances in T that are closest to x, then the class of x is determined by majority voting. The pseudo-code of the K-NN algorithm is given in Algorithm 1.

Algorithm 1 K-nearest neighbor algorithm

Input:

Training set T = {(x_i, y_i) |x_i ∈ R^d, y_i ∈ Y}, 1 ≤ i ≤ n, Y is the set of class labels, test instance x, and parameter K.

Output:

y ∈ Y.

1: For(i = 1 ; i ≤ n ; i = i + 1)

2: Calculate d (x, x_i) which is the distance between x and x_i;

3: Find K instances in T that are closest to x, denoted the set of K instances by N;

4: Count votes for each class by $y = \underset{l \in Y}{argmax} \sum_{x \in N} I (l = class (x))$ , I (·) is an indicator function;

5: return y.
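The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `knn_classify`, the Euclidean distance, and the toy data are assumptions introduced here.

```python
import math
from collections import Counter


def knn_classify(train, x, k):
    """Classify x by majority vote among its k nearest training instances.

    train is a list of (feature_vector, label) pairs (hypothetical layout).
    """
    # Steps 1-2: rank all training instances by Euclidean distance to x.
    by_distance = sorted(train, key=lambda pair: math.dist(x, pair[0]))
    # Step 3: keep the k closest instances (the set N).
    neighbors = by_distance[:k]
    # Step 4: count the votes per class and return the most frequent label.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Toy data, invented for illustration only.
train = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((1.0, 1.0), "b"),
         ((0.9, 1.1), "b"), ((1.2, 0.8), "b")]
print(knn_classify(train, (0.2, 0.1), k=3))
```

Sorting all n distances costs O(n log n) per query; it is used here only for brevity.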

2.2 Condensed nearest neighbor

The condensed nearest neighbor (CNN) [9] is an instance selection algorithm for the 1-nearest neighbor classifier. Let T be a training set and S be the set of selected instances. CNN first initializes S with an instance randomly selected from T; it then iteratively selects important instances from T and moves them from T to S. The important instances are those classified incorrectly by 1-NN using S. The pseudo-code of the CNN algorithm is given in Algorithm 2.

Algorithm 2 Condensed nearest neighbor algorithm

Input:

Training set T = {(x_i, y_i) |x_i ∈ R^d, y_i ∈ Y}, 1 ≤ i ≤ n, Y is the set of class labels.

Output:

S ⊆ T.

1: S =∅;

2: Randomly select an instance from T, and move it from T to S;

3: Repeat

4: For(∀x_i ∈ T)

5: For(∀x_j ∈ S)

6: Calculate the distance between x_i and x_j;

7: Find the nearest neighbor $x_{j}^{*}$ of x_i in S;

8: If( $y_{i} \neq y_{j}^{*}$ )

9: S = S ∪ {x_i};

10: T = T - {x_i};

11: until (T =∅ or all instances in T are classified correctly by the 1-nearest neighbor algorithm using S);

12: return S.
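Algorithm 2 can be sketched in Python as follows. The sketch assumes Euclidean distance and a list of (feature_vector, label) pairs; the name `condensed_nn` and the `seed` parameter are hypothetical and only serve the illustration.

```python
import math
import random


def condensed_nn(train, seed=0):
    """Hart-style condensing with the 1-NN rule; returns the selected set S."""
    rng = random.Random(seed)
    remaining = list(train)                     # working copy of T
    # Step 2: move one randomly chosen instance from T to S.
    selected = [remaining.pop(rng.randrange(len(remaining)))]
    changed = True
    while remaining and changed:                # steps 3 and 11
        changed = False
        for item in list(remaining):            # step 4
            x, y = item
            # Steps 5-7: nearest neighbor of x in the current S.
            _, nearest_label = min(selected,
                                   key=lambda p: math.dist(x, p[0]))
            if nearest_label != y:              # step 8: misclassified
                selected.append(item)           # step 9
                remaining.remove(item)          # step 10
                changed = True
    return selected


# Two well-separated toy clusters, invented for illustration only.
toy = [((0.0, 0.0), 0), ((0.1, 0.0), 0), ((0.0, 0.1), 0),
       ((1.0, 1.0), 1), ((1.1, 1.0), 1), ((1.0, 1.1), 1)]
subset = condensed_nn(toy, seed=1)
print(len(subset), "of", len(toy), "instances kept")
```

The loop stops when a full pass over the remaining instances adds nothing to S, which corresponds to the halting condition in step 11.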

2.3 Fuzzy K-nearest neighbor

The fuzzy K-nearest neighbor (fuzzy K-NN) [20] is an extension of K-NN to the fuzzy scenario that can overcome the second drawback of K-NN. Given a test instance x, the fuzzy membership degree of x belonging to the j^th class is determined by equation (1):

$μ_{j} (x) = \frac{\sum_{i = 1}^{K} μ_{ij} \left( \frac{1}{∥ x - x_{i} ∥^{\frac{2}{m - 1}}} \right)}{\sum_{i = 1}^{K} \left( \frac{1}{∥ x - x_{i} ∥^{\frac{2}{m - 1}}} \right)}$ (1)

where 1 ≤ j ≤ l, l is the number of classes, and μ_ij is given by equation (2):

$μ_{ij} = μ_{j} (x_{i}) = \frac{\frac{1}{∥ x_{i} - c_{j} ∥^{\frac{2}{m - 1}}}}{\sum_{t = 1}^{l} \left( \frac{1}{∥ x_{i} - c_{t} ∥^{\frac{2}{m - 1}}} \right)}$ (2)

where x_i is the i^th training instance, and c_j is the center of the j^th class. In equations (1) and (2), m is a hyperparameter that determines the weight of the distance when calculating each neighbor’s contribution to the membership value [20]. In our experiments, we set m = 2, as suggested in [20]. The exponent 2/(m - 1) then equals 2, so the contribution of each neighboring point is weighted by the reciprocal of its squared distance from the point being classified.

In fuzzy K-NN, the fuzziness plays three roles in improving the performance of K-NN.

(1) For a given test instance x, the fuzziness can well model the importance of the K nearest neighbors of x.

(2) For a given test instance x, the fuzziness can well model the possibilities of x belonging to different classes by membership degrees.

(3) Fuzziness can enhance the robustness of the classification algorithm to noise.

The pseudo-code of the fuzzy K-nearest neighbor algorithm is given in Algorithm 3.

Algorithm 3 Fuzzy K-nearest neighbor algorithm

Input:

Training set T = {(x_i, y_i) |x_i ∈ R^d, y_i ∈ Y}, 1 ≤ i ≤ n, Y is the set of class labels, test instance x, and parameter K.

Output:

μ_j (x) , 1 ≤ j ≤ l.

1: For(i = 1 ; i ≤ n ; i = i + 1)

2: Calculate d (x, x_i) which is the distance between x and x_i;

3: Find K instances x_i (1 ≤ i ≤ K) in T that are closest to x;

4: For(i = 1 ; i ≤ K ; i = i + 1)

5: For(j = 1 ; j ≤ l ; j = j + 1)

6: Calculate μ_ij with Eq. (2);

7: For(j = 1 ; j ≤ l ; j = j + 1)

8: Calculate μ_j (x) with Eq. (1);

9: return μ_j (x).
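Algorithm 3 can be sketched in Python as follows, using Eq. (2) for the neighbor memberships and Eq. (1) for the test instance. The helper names, the use of class means as the centers c_j, and the small constant `eps` that guards against a zero distance are assumptions of this sketch, not details given in the text.

```python
import math


def class_centers(train):
    """Mean feature vector of each class (taken as c_j in Eq. (2))."""
    sums, counts = {}, {}
    for x, y in train:
        acc = sums.setdefault(y, [0.0] * len(x))
        for d, value in enumerate(x):
            acc[d] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: tuple(v / counts[y] for v in acc) for y, acc in sums.items()}


def inv_dist(a, b, m=2.0, eps=1e-12):
    """1 / ||a - b||^(2/(m-1)); eps is an added guard against division by zero."""
    return 1.0 / (math.dist(a, b) ** (2.0 / (m - 1.0)) + eps)


def instance_memberships(x_i, centers, m=2.0):
    """Eq. (2): membership degrees of a training instance in every class."""
    weights = {j: inv_dist(x_i, c, m) for j, c in centers.items()}
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}


def fuzzy_knn(train, x, k, m=2.0):
    """Eq. (1): membership degrees of x from its k nearest neighbors."""
    centers = class_centers(train)
    # Steps 1-3: the k training instances closest to x.
    neighbors = sorted(train, key=lambda p: math.dist(x, p[0]))[:k]
    # Steps 4-6: memberships of each neighbor, Eq. (2).
    mu = [instance_memberships(x_i, centers, m) for x_i, _ in neighbors]
    # Steps 7-8: distance-weighted average of the neighbor memberships.
    weights = [inv_dist(x, x_i, m) for x_i, _ in neighbors]
    total = sum(weights)
    return {j: sum(w * u[j] for w, u in zip(weights, mu)) / total
            for j in centers}


# Toy data, invented for illustration only.
train = [((0.0, 0.0), "a"), ((0.2, 0.1), "a"), ((0.1, 0.3), "a"),
         ((1.0, 1.0), "b"), ((1.2, 0.9), "b"), ((0.9, 1.2), "b")]
print(fuzzy_knn(train, (0.1, 0.1), k=3))
```

The returned membership degrees sum to one, so they can be passed directly to an entropy computation.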

3 The proposed algorithm

In this section, we present the CFKNN algorithm, which can be viewed as an instance selection algorithm for fuzzy K-NN or an extension of CNN to the fuzzy scenario (see Figure 1).

Fig. 1

The technical route (solid line) of the proposed algorithm.

CFKNN differs from CNN in the following three aspects. (1) CFKNN is tailored for K-NN rather than the 1-nearest neighbor used in CNN, and it uses K instances randomly selected from each class to initialize the set S. (2) Given a test instance x, the fuzzy memberships of the K nearest neighbors of x are calculated using S rather than the original training set T, which significantly reduces the computational time complexity without degrading the performance of CFKNN. (3) A dynamic entropy threshold λ is introduced to select informative instances. Entropy is a measure of the class uncertainty of an instance: the larger the entropy of an instance, the more difficult it is to determine its class. Instances with larger entropy are therefore usually regarded as more informative. Let p_i be the prior probability of the i^th class; the entropy is defined by $E (class) = - \sum_{i = 1}^{l} p_{i} {log}_{2} p_{i}$ , where l is the number of classes. According to the maximum entropy principle [26], E (class) achieves its maximum, log_2 l, when $p_{i} = \frac{1}{l}, 1 \leq i \leq l$ . The relationship between the number of classes and the maximum entropy is illustrated in Figure 2. As observed in Figure 2, the maximum entropy grows with the number of classes, so it is reasonable to use dynamic entropy thresholds to select informative instances from datasets with different numbers of classes. The pseudo-code of CFKNN is given in Algorithm 4. Next, we analyze the computational time complexity of Algorithm 4.

Fig. 2

The relationship between the number of classes and the maximum entropy.
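The relationship in Figure 2 follows from evaluating the entropy at the uniform distribution p_i = 1/l, which gives log_2 l. A short Python check, with a hypothetical helper name `max_entropy`, is given below.

```python
import math


def max_entropy(num_classes):
    """Entropy (in bits) of the uniform distribution over num_classes classes."""
    p = 1.0 / num_classes
    return -sum(p * math.log2(p) for _ in range(num_classes))


for l in (2, 5, 10, 15):
    print(l, round(max_entropy(l), 2))
```

For 2, 5, 10, and 15 classes this gives 1.0, 2.32, 3.32, and 3.91 bits, so a threshold that is selective for a two-class dataset is far below the attainable maximum for a dataset with many classes.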

From the pseudo-code in Algorithm 4, we can see that the computational complexity of Algorithm 4 is determined by the “for” loop consisting of steps 2 to 6. The computational complexity of step 3 is O (n × s), where s is the number of instances in S. In the worst case, the computational complexity of step 3 is O (n × n). The computational complexities of step 4 and step 5 are both O (K × l), where l is the number of classes. Since K and l are both small numbers, the computational complexities of step 4 and step 5 can be viewed as O (1). The computational complexity of step 6 is O (l), which can also be viewed as O (1). Hence, in the worst case, the computational complexity of Algorithm 4 is O (n × n) + O (1) + O (1) + O (1) = O (n²).

Algorithm 4 CFKNN algorithm

Input:

Training set T = {(x_i, y_i) |x_i ∈ R^d, y_i ∈ Y}, 1 ≤ i ≤ n, Y is the set of class labels, and threshold λ.

Output:

S ⊂ T.

1: Initialization: randomly select K instances from each class to initialize the set S, and move these K instances from T to S;

2: For(each x ∈ T)

3: Find its K nearest neighbors in S;

4: Calculate the fuzzy membership degrees of the K nearest neighbors of x by Eq. (2);

5: Calculate the fuzzy membership degree of x by Eq. (1);

6: Calculate the entropy of x by $E (x) = - \sum_{i = 1}^{l} μ_{i} (x) {log}_{2} μ_{i} (x)$ ;

7: If(E (x) > λ)

8: S = S ∪ {x};

9: return S.
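Algorithm 4 can be sketched in Python as follows. The sketch makes several assumptions that the pseudo-code leaves open: class centers are the class means computed from the current S, S is updated inside the loop so that later instances see earlier selections, distances are Euclidean, and a small constant `eps` guards against a zero distance. All function and parameter names are hypothetical.

```python
import math
import random


def _centers(instances):
    """Class means of the given instances (taken as c_j in Eq. (2))."""
    sums, counts = {}, {}
    for x, y in instances:
        acc = sums.setdefault(y, [0.0] * len(x))
        for d, v in enumerate(x):
            acc[d] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: tuple(v / counts[y] for v in acc) for y, acc in sums.items()}


def _inv(a, b, m, eps=1e-12):
    """1 / ||a - b||^(2/(m-1)); eps is an added guard against zero distance."""
    return 1.0 / (math.dist(a, b) ** (2.0 / (m - 1.0)) + eps)


def memberships(x, selected, k, m):
    """Eqs. (1) and (2), evaluated on the selected set S only."""
    centers = _centers(selected)
    neighbors = sorted(selected, key=lambda p: math.dist(x, p[0]))[:k]
    num = dict.fromkeys(centers, 0.0)
    den = 0.0
    for x_i, _ in neighbors:
        w = _inv(x, x_i, m)
        u = {j: _inv(x_i, c, m) for j, c in centers.items()}
        total_u = sum(u.values())          # normalizer of Eq. (2)
        for j in centers:
            num[j] += w * u[j] / total_u   # numerator of Eq. (1)
        den += w                           # denominator of Eq. (1)
    return {j: v / den for j, v in num.items()}


def entropy(mu):
    """Entropy (in bits) of a membership vector; zero terms are skipped."""
    return -sum(p * math.log2(p) for p in mu.values() if p > 0.0)


def cfknn(train, k, lam, m=2.0, seed=0):
    """Return the selected set S for threshold lam."""
    rng = random.Random(seed)
    by_class = {}
    for item in train:
        by_class.setdefault(item[1], []).append(item)
    # Step 1: K random instances per class form the initial S.
    selected = []
    for items in by_class.values():
        selected.extend(rng.sample(items, min(k, len(items))))
    chosen = set(map(id, selected))
    remaining = [item for item in train if id(item) not in chosen]
    for x, y in remaining:                         # step 2
        mu = memberships(x, selected, k, m)        # steps 3-5
        if entropy(mu) > lam:                      # steps 6-7
            selected.append((x, y))                # step 8
    return selected


# Synthetic two-class data, invented for illustration only.
_rng = random.Random(3)
demo = ([((_rng.gauss(0, 0.3), _rng.gauss(0, 0.3)), 0) for _ in range(30)]
        + [((_rng.gauss(1, 0.3), _rng.gauss(1, 0.3)), 1) for _ in range(30)])
print(len(cfknn(demo, k=3, lam=0.9)), "instances selected")
```

With a threshold above the maximum entropy log_2 l, nothing beyond the initial K instances per class is selected; with a negative threshold, every instance is selected. Useful thresholds lie between these extremes.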

4 Experimental results and analysis

To verify the effectiveness of CFKNN, we conducted extensive experiments on 11 datasets. The experiments include two parts: (1) an investigation of the impact of the threshold on the performance of CFKNN, and (2) a comparison with four state-of-the-art algorithms: CNN [9], edited nearest neighbor (ENN) [12], Tomeklinks [10], and OneSidedSelection [27]. The 11 datasets include 1 artificial dataset, 2 real-world datasets [25], and 8 UCI datasets [28].

The artificial dataset is generated using two two-dimensional Gaussian distributions p (x|ω_i) ∼ N (μ_i, Σ_i) (i = 1, 2), which correspond to two classes. Their mean vectors and covariance matrices are given in Table 1. The 2 real-world datasets are the CT dataset and the RenRu dataset. The CT dataset was obtained by collecting 212 medical CT images from a local hospital in Baoding. All CT images are classified into 2 classes (i.e., normal and abnormal). The CT dataset has 170 normal instances and 42 abnormal instances. A total of 35 features are initially selected, including 10 symmetric features, 9 texture features, and 16 statistical features such as mean, variance, skewness, kurtosis, energy, and entropy. The RenRu dataset was created by the Key Laboratory of Machine Learning and Computational Intelligence of Hebei Province, China. It consists of 148 instances of the Chinese characters REN and RU in different typefaces, fonts, and sizes. There are 92 instances of REN and 56 instances of RU. Each character is described by 26 numerical features. The 8 UCI datasets are WDBC, Parkinsons, Pima, Skin, Iris, Glass, Fertility, and Survival (i.e., Haberman’s Survival). The basic information of the 11 datasets is listed in Table 2. All experiments are conducted on a platform with an Intel(R) Core(TM) i3-3120M CPU @ 2.50GHz, 8GB main memory, Windows 10, and Python 3.6.3.

Table 1
The parameters of Gaussian distribution of the artificial dataset

i	μ_i	Σ_i
1	(0.1597, 1.3541) ^T	$[\begin{matrix} 0.1726 & 0.0912 \\ 0.0912 & 0.1020 \end{matrix}]$
2	(1.1597, 1.4541) ^T	$[\begin{matrix} 0.1726 & 0.0912 \\ 0.0912 & 0.1020 \end{matrix}]$
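Sampling from the two distributions in Table 1 can be sketched with the Python standard library as follows. The number of instances per class and the random seed are not given in the text, so `n_per_class` and `seed` are assumptions, as are the function names; the sketch uses a Cholesky factor of the 2 × 2 covariance matrix to correlate two standard normal draws.

```python
import math
import random

# Parameters copied from Table 1.
MEANS = [(0.1597, 1.3541), (1.1597, 1.4541)]
COV = ((0.1726, 0.0912), (0.0912, 0.1020))


def cholesky_2x2(cov):
    """Lower-triangular factor L of a 2 x 2 covariance matrix (cov = L L^T)."""
    (a, b), (_, d) = cov
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(d - l21 * l21)
    return l11, l21, l22


def sample_class(mean, cov, n, rng):
    """Draw n points from N(mean, cov) by transforming standard normals."""
    l11, l21, l22 = cholesky_2x2(cov)
    points = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        points.append((mean[0] + l11 * z1, mean[1] + l21 * z1 + l22 * z2))
    return points


def make_dataset(n_per_class, seed=0):
    """Labelled two-class dataset; labels 1 and 2 follow the index i in Table 1."""
    rng = random.Random(seed)
    data = []
    for label, mean in enumerate(MEANS, start=1):
        data += [(p, label) for p in sample_class(mean, COV, n_per_class, rng)]
    return data


dataset = make_dataset(n_per_class=1000)
print(len(dataset), "instances generated")
```

Because the two classes share the same covariance matrix and their means differ mainly in the first coordinate, the classes overlap along that axis.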

Table 2

The basic information of datasets used in experiments

datasets	#Training instances	#Test instances	#Class
Gaussian	13333	6667	2
CT	154	48	2
RenRu	103	45	2
WDBC	388	167	2
Parkinsons	136	59	2
Pima	537	231	2
Skin	171539	73518	2
Fertility	70	30	2
Survival	245	61	2
Iris	105	45	3
Glass	112	48	6

The experiments on the artificial dataset serve three purposes: (1) to verify the feasibility of CFKNN; (2) to investigate the impact of the threshold λ on the results of instance selection; and (3) to elucidate the impact of λ more intuitively by visualizing the distribution of the original instances and of the instances selected with different thresholds. The distribution of the instances in the artificial dataset (Gaussian dataset) is illustrated in Figure 3. The distributions of the instances selected by CFKNN (k = 5) from the Gaussian dataset with λ = 0.5 and λ = 0.6 are given in Figures 4 and 5, respectively. It is observed from Figures 4 and 5 that, as with most instance selection algorithms, the instances selected by CFKNN are distributed near the classification boundary. The experimental results on the artificial dataset confirm that (1) CFKNN is feasible, and (2) the threshold λ has a significant influence on the experimental results. Next, we show the impact of the threshold λ on the performance of CFKNN.

Fig. 3

The distribution of instances in Gaussian dataset.

Fig. 4

The distribution of instances selected by CFKNN (k = 5) from Gaussian dataset with λ = 0.5.

Fig. 5

The distribution of instances selected by CFKNN (k = 5) from Gaussian dataset with λ = 0.6.

4.1 Impact of threshold λ on the performance of CFKNN

As illustrated in Figure 2, the maximum entropy differs for datasets with different numbers of classes. For instance, the maximum entropies are 1.0, 2.32, 3.32, and 3.91 for datasets with 2, 5, 10, and 15 classes, respectively. Accordingly, it is inappropriate to employ a static uniform threshold to select instances from datasets with different numbers of classes [25]. Based on this observation, we experimentally investigate the impact of the threshold λ on the performance of CFKNN in this section.

For the dataset Iris with 3 classes, the threshold λ is gradually increased from 0.7 to 1.0 with a step size of 0.05. The number of selected instances and the testing accuracy of CFKNN with K = 3 and K = 5 are reported in Table 12.

For the dataset Glass with 6 classes, the threshold λ is increased from 1.0 to 1.6 with a step size of 0.1. The number of selected instances and the testing accuracy of CFKNN with K = 3 and K = 5 are reported in Table 13.

In our experiments, for the datasets with 2 classes, the threshold λ is gradually increased from 0.5 to 0.95 with a step size of 0.05. The number of selected instances and the testing accuracy of CFKNN with K = 3 and K = 5 are listed in Tables 3–11. In the following tables, TA represents the testing accuracy, and #SI is the number of selected instances.

Table 3
The experimental results on dataset Gaussian.

λ	TA with K=3	TA with K=5	#SI with K=3	#SI with K=5
0.00(baseline)	0.9238	0.9385	13333	13333
0.50	0.9270	0.9274	7296	2100
0.55	0.9202	0.9282	6625	2118
0.60	0.9211	0.9276	6000	2051
0.65	0.9201	0.9273	5412	2094
0.70	0.9319(↑)	0.9264	4802	1966
0.75	0.9115	0.9274	4190	1970
0.80	0.9168	0.9285	3559	1924
0.85	0.8910	0.9292	2979	1670
0.90	0.9090	0.9292	2395	1510
0.95	0.9196	0.9301(↓)	1744	1254

Table 4

The experimental results on dataset CT

λ	TA with K=3	TA with K=5	#SI with K=3	#SI with K=5
0.00(baseline)	0.9403	0.9403	154	154
0.50	0.9254	0.9254(↓)	151	98
0.55	0.9403	0.9254(↓)	149	82
0.60	0.9403	0.9254(↓)	142	83
0.65	0.9403	0.8955	133	78
0.70	0.9552(↑)	0.8955	122	77
0.75	0.9104	0.9104	101	72
0.80	0.9403	0.9104	100	54
0.85	0.9403	0.8955	90	43
0.90	0.8955	0.9104	77	40
0.95	0.9254	0.9254(↓)	51	31

Table 5

The experimental results on dataset RenRu

λ	TA with K=3	TA with K=5	#SI with K=3	#SI with K=5
0.00(baseline)	0.8667	0.8444	103	103
0.50	0.8667(=)	0.8444(=)	103	77
0.55	0.8667(=)	0.7778	102	61
0.60	0.8667(=)	0.8000	103	58
0.65	0.8667(=)	0.8000	102	51
0.70	0.8667(=)	0.8000	102	50
0.75	0.8667(=)	0.7111	102	43
0.80	0.8667(=)	0.7556	79	31
0.85	0.6444	0.8222	32	28
0.90	0.8000	0.6667	22	18
0.95	0.8444	0.5333	19	12

Table 6

The experimental results on dataset WDBC

λ	TA with K=3	TA with K=5	#SI with K=3	#SI with K=5
0.00(baseline)	0.9341	0.9341	388	388
0.50	0.9341(=)	0.9341	142	108
0.55	0.9102	0.9281	70	104
0.60	0.9102	0.9042	72	86
0.65	0.8743	0.9162	67	73
0.70	0.9341	0.9042	77	83
0.75	0.8982	0.9162	55	62
0.80	0.6228	0.9162	50	75
0.85	0.9042	0.9461(↑)	25	73
0.90	0.9222	0.9102	26	71
0.95	0.9162	0.9102	24	33

Table 7

The experimental results on dataset Parkinsons

λ	TA with K=3	TA with K=5	#SI with K=3	#SI with K=5
0.00(baseline)	0.8983	0.9153	136	136
0.50	0.8983(=)	0.8983	136	52
0.55	0.8983(=)	0.9153(=)	131	64
0.60	0.8983(=)	0.7797	132	25
0.65	0.8983(=)	0.8814	130	73
0.70	0.8983(=)	0.7797	127	34
0.75	0.8983(=)	0.8475	121	44
0.80	0.8644	0.8475	107	53
0.85	0.8814	0.8644	81	31
0.90	0.7797	0.8644	23	24
0.95	0.7627	0.8644	24	20

Table 8

The experimental results on dataset Pima

λ	TA with K=3	TA with K=5	#SI with K=3	#SI with K=5
0.00(baseline)	0.6926	0.7489	537	537
0.50	0.7013	0.7576	529	380
0.55	0.6840	0.7446	527	368
0.60	0.7273(↑)	0.7403	499	361
0.65	0.7100	0.7056	503	207
0.70	0.7056	0.7489	487	275
0.75	0.7273(↑)	0.7532	470	325
0.80	0.7100	0.7446	460	307
0.85	0.7186	0.7403	402	315
0.90	0.7056	0.7619	294	265
0.95	0.6926	0.7792(↑)	303	248

Table 9

The experimental results on dataset Skin

λ	TA with K=3	TA with K=5	#SI with K=3	#SI with K=5
0.00(baseline)	0.9995	0.9995	171539	171539
0.50	0.9995(=)	0.9995(=)	1361	808
0.55	0.9995(=)	0.9995(=)	1058	805
0.60	0.9995(=)	0.9995(=)	827	795
0.65	0.9995(=)	0.9995(=)	824	767
0.70	0.9995(=)	0.9995(=)	821	791
0.75	0.9995(=)	0.9995(=)	735	692
0.80	0.9995(=)	0.9995(=)	735	667
0.85	0.9993	0.9994	663	602
0.90	0.9991	0.9993	529	544
0.95	0.9991	0.9991	507	433

Table 10

The experimental results on dataset Fertility

λ	TA with K=3	TA with K=5	#SI with K=3	#SI with K=5
0.00(baseline)	0.8333	0.8333	70	70
0.50	0.8333	0.8333	41	47
0.55	0.8667	0.8667	31	35
0.60	0.8333	0.8333	28	30
0.65	0.8333	0.8333	31	30
0.70	0.8333	0.9333(↑)	30	31
0.75	0.9000(↑)	0.9000	28	28
0.80	0.9000(↑)	0.8000	21	22
0.85	0.8333	0.8333	19	21
0.90	0.8667	0.8667	16	18
0.95	0.8667	0.8667	14	16

Table 11

The experimental results on dataset Survival

λ	TA with K=3	TA with K=5	#SI with K=3	#SI with K=5
0.00(baseline)	0.7868	0.7869	245	245
0.50	0.7377	0.7049	105	140
0.55	0.7540	0.7377	100	145
0.60	0.7377	0.7213	108	139
0.65	0.7049	0.7049	106	139
0.70	0.7213	0.7541(↑)	93	99
0.75	0.6230	0.7377	87	91
0.80	0.7377(↑)	0.7541(↑)	85	89
0.85	0.7213	0.7213	70	66
0.90	0.7049	0.7377	66	56
0.95	0.7213	0.7541(↑)	51	50

From the experimental results listed in Tables 3–13, it is observed that the optimal thresholds λ differ across datasets for both K = 3 and K = 5, even when the datasets have the same number of classes. Moreover, there are multiple optimal thresholds for most datasets. Compared with the baseline listed in the second row of each table, we also found that for K = 3 CFKNN shows competence preservation (marked with “=”) or competence enhancement (marked with “↑”) on almost all datasets except Iris. On the dataset Iris, CFKNN shows competence decrement (marked with “↓”), yet the testing accuracy decreases only slightly (from 0.9778 to 0.9556). For K = 5, CFKNN shows competence preservation or competence enhancement on almost all datasets except Gaussian and CT. On the datasets Gaussian and CT, CFKNN shows competence decrement, but the decrease in testing accuracy is again very small (from 0.9385 to 0.9301 and from 0.9403 to 0.9254, respectively).

4.2 Comparison with 4 state-of-the-art algorithms

To further verify the effectiveness of CFKNN, we experimentally compared CFKNN with four state-of-the-art methods, CNN [9], ENN [12], Tomeklinks [10], and OneSidedSelection [27], on three aspects: the number of selected instances (#SI), the compression ratio (CR), and the testing accuracy (TA). In this experiment, we set K = 5. The experimental results are listed in Tables 14–17. The testing accuracy and compression ratio are illustrated in Figures 6 and 7, respectively. The experimental results in Table 14 show that CFKNN outperforms CNN in testing accuracy on 8 datasets. On larger datasets, such as Gaussian and Skin, CFKNN achieves much higher compression ratios than CNN (0.9059 and 0.9960, respectively). The experimental results in Tables 15–17 indicate that the performance of CFKNN is on par with the state-of-the-art approaches ENN, Tomeklinks, and OneSidedSelection.

Table 12
The experimental results on dataset Iris

λ	TA with K=3	TA with K=5	#SI with K=3	#SI with K=5
0.00(baseline)	0.9778	0.9778	105	105
0.70	0.9556(↓)	0.8667	47	33
0.75	0.9556(↓)	0.9556	48	26
0.80	0.9556(↓)	0.9556	47	26
0.85	0.9556(↓)	0.9778 (=)	42	25
0.90	0.9556(↓)	0.9778 (=)	37	22
0.95	0.9333	0.9556	37	23
1.00	0.9556(↓)	0.9556	34	15

Table 13

The experimental results on dataset Glass

λ	TA with K=3	TA with K=5	#SI with K=3	#SI with K=5
0.00(baseline)	0.6250	0.6250	112	112
1.00	0.6250(=)	0.6042	109	64
1.10	0.6042	0.5833	95	61
1.20	0.6042	0.5833	104	63
1.30	0.6250(=)	0.5625	82	58
1.40	0.6250(=)	0.6250	106	54
1.50	0.6042	0.6250	92	47
1.60	0.6250(=)	0.6458(↑)	89	17

Table 14

Experimental results compared with CNN

datasets	CFKNN #SI	CFKNN TA	CFKNN CR (%)	CNN #SI	CNN TA	CNN CR (%)
Gaussian	1254	0.9301	90.59	2468	0.9262	81.49
CT	31	0.9254	79.87	52	0.9254	66.23
RenRu	77	0.8444	25.24	27	0.7333	73.79
WDBC	73	0.9461	81.19	64	0.9402	83.51
Parkinsons	64	0.9153	52.94	48	0.8476	64.71
Pima	248	0.7792	53.82	282	0.7403	47.49
Skin	692	0.9952	99.60	735	0.9971	99.57
Iris	22	0.9778	79.04	13	0.5111	87.62
Glass	17	0.6458	84.82	57	0.5000	49.11
Fertility	14	0.9333	80.00	16	0.9012	77.14
Survival	50	0.7541	79.59	69	0.7538	71.84

Table 15

Experimental results compared with ENN

datasets	CFKNN #SI	CFKNN TA	CFKNN CR (%)	ENN #SI	ENN TA	ENN CR (%)
Gaussian	1254	0.9301	90.59	10155	0.9355	23.84
CT	31	0.9254	79.87	83	0.8806	46.10
RenRu	77	0.8444	25.24	56	0.8667	45.63
WDBC	73	0.9461	81.19	309	0.9162	20.36
Parkinsons	64	0.9153	52.94	62	0.8644	54.41
Pima	248	0.7792	53.82	144	0.7359	73.18
Skin	692	0.9952	99.60	171319	0.9993	0.13
Iris	22	0.9778	79.04	89	0.9778	15.24
Glass	17	0.6458	84.82	38	0.6042	66.07
Fertility	14	0.9333	80.00	31	0.8658	55.71
Survival	50	0.7541	79.59	87	0.7540	64.49

Table 16

Experimental results compared with TomekLinks

datasets	CFKNN #SI	CFKNN TA	CFKNN CR (%)	TomekLinks #SI	TomekLinks TA	TomekLinks CR (%)
Gaussian	1254	0.9301	90.59	12503	0.9321	6.23
CT	31	0.9254	79.87	142	0.9403	7.79
RenRu	77	0.8444	25.24	101	0.8444	1.94
WDBC	73	0.9461	81.19	374	0.9341	3.61
Parkinsons	64	0.9153	52.94	128	0.9153	5.88
Pima	248	0.7792	53.82	459	0.7359	14.53
Skin	692	0.9952	99.60	171517	0.9995	0.01
Iris	22	0.9778	79.04	101	0.9778	3.81
Glass	17	0.6458	84.82	98	0.6667	12.50
Fertility	14	0.9333	80.00	57	0.8655	18.57
Survival	50	0.7541	79.59	215	0.7084	12.24

Table 17

Experimental results compared with OneSidedSelection

datasets	CFKNN #SI	CFKNN TA	CFKNN CR (%)	OneSidedSelection #SI	OneSidedSelection TA	OneSidedSelection CR (%)
Gaussian	1254	0.9301	90.59	12883	0.9279	3.38
CT	31	0.9254	79.87	143	0.9254	7.14
RenRu	77	0.8444	25.24	98	0.8000	4.85
WDBC	73	0.9461	81.19	307	0.9341	20.88
Parkinsons	64	0.9153	52.94	126	0.8983	7.36
Pima	248	0.7792	53.82	483	0.7359	10.06
Skin	692	0.9952	99.60	169896	0.9995	0.96
Iris	22	0.9778	79.04	51	0.6222	51.43
Glass	17	0.6458	84.82	34	0.6042	69.64
Fertility	14	0.9333	80.00	44	0.8322	37.14
Survival	50	0.7541	79.59	78	0.6646	68.16

Fig. 6

The comparison of testing accuracy

Fig. 7

The comparison of compression ratio
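The text does not state the formula for the compression ratio. The sketch below assumes that CR is the percentage of training instances discarded, i.e. CR = 100 × (1 − #SI / #training instances), an assumption that reproduces the CFKNN entries for Gaussian, CT, and RenRu in Table 14 from the training set sizes in Table 2. The function name is hypothetical.

```python
def compression_ratio(num_training, num_selected):
    """Percentage of training instances removed by instance selection."""
    return 100.0 * (1.0 - num_selected / num_training)


# (training set size from Table 2, #SI of CFKNN from Table 14)
for name, n_train, n_sel in [("Gaussian", 13333, 1254),
                             ("CT", 154, 31),
                             ("RenRu", 103, 77)]:
    print(name, round(compression_ratio(n_train, n_sel), 2))
```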

To further confirm the superiority of the proposed algorithm over the four state-of-the-art algorithms, we statistically analyzed the experimental results of #SI, TA, and CR using the paired t-test with a significance level of 0.05 [29]. First, for each dataset and algorithm, we run 5-fold cross-validation 5 times and obtain five 25-dimensional statistics denoted by X₁, X₂, X₃, X₄, and X₅, corresponding to CNN, ENN, TomekLinks, OneSidedSelection, and CFKNN, respectively. Next, the paired t-test is applied to the experimental results by calling the Python library function ttest_rel(·, ·). The results of the statistical analysis on #SI, TA, and CR are listed in Tables 18, 19, and 20, respectively. The p-values listed in the three tables show that, on most datasets, CFKNN statistically outperforms CNN, ENN, TomekLinks, and OneSidedSelection.
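The quantity behind the paired t-test can be illustrated with the standard library alone. The sketch below computes only the test statistic for two paired samples; it is not the authors' script, the function name is hypothetical, and the numbers are invented. A function such as `scipy.stats.ttest_rel` would additionally return the p-value, which requires the t distribution with n − 1 degrees of freedom.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """t statistic of the paired differences a[i] - b[i]."""
    diffs = [u - v for u, v in zip(a, b)]
    n = len(diffs)
    # Mean difference divided by its standard error.
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))


# Invented accuracies of two algorithms on the same five folds.
acc_a = [0.93, 0.92, 0.94, 0.95, 0.91]
acc_b = [0.90, 0.91, 0.90, 0.93, 0.90]
print(round(paired_t_statistic(acc_a, acc_b), 3))
```

In the setting described above, each sample would hold the 25 values obtained from 5 runs of 5-fold cross-validation, and the test would be repeated for each of the four competing algorithms against CFKNN.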

Table 18

The results of statistical analysis on #SI

datasets	p-value1	p-value2	p-value3	p-value4
Gaussian	6.150e-03	3.077e-01	8.407e-02	1.487e-02
CT	3.961e-02	3.795e-12	9.169e-02	1.207e-02
RenRu	1.332e-07	2.826e-02	2.731e-02	8.049e-05
WDBC	2.411e-02	2.994e-03	1.628e-02	2.730e-02
Parkinsons	6.844e-07	4.243e-09	4.104e-02	5.255e-03
Pima	1.023e-07	6.270e-09	3.324e-08	5.385e-04
Skin	5.863e-01	4.135e-01	5.374e-01	8.272e-01
Iris	1.252e-11	3.018e-01	2.421e-01	2.381e-11
Glass	9.369e-10	9.446e-05	7.760e-02	1.491e-03
Fertility	7.061e-10	1.092e-12	1.214e-12	3.061e-14
Survival	1.565e-01	3.649e-02	5.430e-03	6.836e-11

Table 19

The results of statistical analysis on TA

datasets	p-value1	p-value2	p-value3	p-value4
Gaussian	9.416e-21	1.536e-28	9.715e-10	2.485e-17
CT	9.360e-12	5.391e-16	6.921e-15	7.393e-18
RenRu	2.126e-11	8.477e-08	1.507e-09	4.759e-10
WDBC	1.303e-10	1.535e-17	1.126e-09	1.253e-11
Parkinsons	5.351e-08	5.521e-02	5.087e-12	1.612e-12
Pima	3.917e-11	3.108e-17	2.791e-18	1.121e-12
Skin	1.703e-11	3.386e-06	3.388e-06	3.391e-06
Iris	7.726e-07	7.153e-13	1.623e-12	1.190e-11
Glass	1.463e-10	2.638e-09	6.661e-14	2.772e-08
Fertility	5.633e-04	3.861e-09	8.862e-14	4.568e-12
Survival	2.130e-07	6.081e-11	1.191e-14	3.525e-08

Table 20

The results of statistical analysis on CR

datasets	p-value1	p-value2	p-value3	p-value4
Gaussian	2.230e-05	6.129e-15	3.059e-18	4.386e-18
CT	4.534e-11	4.924e-15	6.173e-18	6.608e-18
RenRu	4.637e-12	2.340e-08	2.597e-12	1.895e-11
WDBC	1.201e-01	1.161e-14	2.536e-18	7.049e-17
Parkinsons	6.455e-08	5.117e-01	2.557e-14	1.374e-16
Pima	5.178e-08	1.370e-11	1.205e-15	1.683e-16
Skin	7.357e-01	5.606e-19	3.516e-19	2.622e-19
Iris	5.589e-07	2.661e-14	6.967e-17	3.237e-13
Glass	9.671e-15	4.546e-11	7.127e-18	4.398e-11
Fertility	3.297e-05	6.936e-13	1.519e-16	1.736e-14
Survival	3.820e-08	5.794e-11	1.546e-15	3.858e-10

5 Conclusion

Motivated by the idea of CNN, we proposed an instance selection algorithm named CFKNN for fuzzy K-nearest neighbor. CFKNN can be viewed as an extension of CNN to the fuzzy scenario, and also as an improvement of our previous work presented in [25]. CFKNN has four advantages: (1) it is more generic: the hyperparameter K can take different values, whereas CNN only works for K = 1; (2) it is efficient, since it uses the subset S rather than the whole set T to calculate the fuzzy membership degrees of an instance; (3) it uses a dynamic threshold λ to select informative instances for different datasets; (4) it achieves a very high compression ratio without deteriorating testing accuracy, especially on larger datasets. In the future, we will investigate the scalability of CFKNN in big data scenarios.

Acknowledgments

This research is supported by the Key R&D Program of the Science and Technology Foundation of Hebei Province (19210310D), and by the Natural Science Foundation of Hebei Province (F2017201026).

Footnotes

The baseline corresponds to the case λ = 0.0, which means that no instances are removed from the training datasets.

The competence preservation means that the classification accuracy does not decrease, or decreases only slightly, when superfluous instances are removed.

The competence enhancement means that the classification accuracy is increased by removing certain instances.

References

1. Cover and Hart, Nearest neighbor pattern classification. IEEE Transactions on Information Theory 13(1) (1967), 21–27.

2. Noh Y.Y., Zhang B.T. and Lee D.D., Generative Local Metric Learning for Nearest Neighbor Classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(1) (2018), 106–118.

3. Mullick S.S., Datta and Das, Adaptive Learning-Based K-Nearest Neighbor Classifiers with Resilience to Class Imbalance. IEEE Transactions on Neural Networks and Learning Systems 29(11) (2019), 5713–5725.

4. Yang, Cai Z.W., et al., Top K representative: a method to select representative samples based on K nearest neighbors. International Journal of Machine Learning and Cybernetics 10(8) (2019), 2119–2129.

5. Basu and Murthy C.A., Towards enriching the quality of k-nearest neighbor rule for document classification. International Journal of Machine Learning and Cybernetics 5(6) (2014), 897–905.

6. Sun, Zhang X.Y., Qian Y.H., et al., Feature selection using neighborhood entropy-based uncertainty measures for gene expression data classification. Information Sciences 502 (2019), 18–41.

7. Sun, Zhang X.Y., Qian Y.H., et al., Joint neighborhood entropy-based gene selection method with fisher score for tumor classification. Applied Intelligence 49(4) (2019), 1245–1259.

8. Salvador, Joaquin, Jose R.C., et al., Prototype selection for nearest neighbor classification: taxonomy and empirical study. IEEE Transactions on Pattern Analysis and Machine Intelligence 34(3) (2012), 417–435.

9. Hart, The condensed nearest neighbor rule. IEEE Transactions on Information Theory 14(5) (1967), 515–516.

10. Tomek, Two Modifications of CNN. IEEE Transactions on Systems, Man and Cybernetics 6(11) (1976), 769–772.

11. Devi V.S. and Murty M.N., An incremental prototype set building technique. Pattern Recognition 35(2) (2002), 505–513.

12. Chang, Lin C.C., Lu C.J., et al., Adaptive prototype learning algorithms: theoretical and experimental studies. Journal of Machine Learning Research 7(4) (2006), 2125–2148.

13. Angiulli, Fast nearest neighbor condensation for large datasets classification. IEEE Transactions on Knowledge and Data Engineering 19(11) (2007), 1450–1464.

14. Zhai J.H., Wang X.Z. and Pang X.H., Voting-based Instance Selection from Large datasets with MapReduce and Random Weight Networks. Information Sciences 367 (2016), 1066–1077.

15. Arnaiz-González, Díez-Pastor J.F., Rodríguez J.J., et al., Instance selection of linear complexity for big data. Knowledge-Based Systems 107 (2016), 83–95.

16. Gates G.W., The reduced nearest neighbor rule. IEEE Transactions on Information Theory 18(3) (1972), 431–433.

17. Dasarathy B.V., Minimal consistent set identification for optimal nearest neighbor decision systems design. IEEE Transactions on Systems, Man, and Cybernetics 24(1) (1994), 511–517.

18. Wilson D.R. and Martínez T.R., Improved Heterogeneous Distance Functions. Journal of Artificial Intelligence Research 11(1) (1997), 1–34.

19. Arnaiz-González, Díez-Pastor J.F., Rodríguez J.J., et al., Instance selection for regression: Adapting DROP. Neurocomputing 201 (2016), 66–81.

20. Keller J.R., Gray M.R. and Givens J.A., A fuzzy k-nearest neighbor algorithm. IEEE Transactions on Knowledge and Data Engineering 21(9) (2009), 1263–1284.

21. Zadeh L.A., Fuzzy sets. Information and Control 8 (1965), 338–353.

22. Arji, Ahmadi, Nilashi, et al., Fuzzy logic approach for infectious disease diagnosis: A methodical evaluation, literature and classification. Biocybernetics and Biomedical Engineering 39 (2019), 937–955.

23. Melin and Castillo, A review on type-2 fuzzy logic applications in clustering, classification and pattern recognition. Applied Soft Computing 21 (2014), 568–577.

24. Ahmadi, Gholamzadeh, Shahmoradi, et al., Diseases diagnosis using fuzzy logic methods: A systematic and meta-analysis review. Computer Methods and Programs in Biomedicine 161 (2018), 145–172.

25. Zhai J.H., Li and Zhai M.Y., The condensed fuzzy k-nearest neighbor rule based on sample fuzzy entropy. Proceedings of the 2011 International Conference on Machine Learning and Cybernetics, Guilin (2011), 10–13. 1 282–286.

26. Cover T.M. and Thomas J.A., The elements of information theory (Second Edition). John Wiley & Sons, Incorporation, Hoboken, New Jersey, (2006).

27. Kubat and Matwin, Addressing the curse of imbalanced training sets: One-sided selection. In: Proceedings of the 14th International Conference on Machine Learning 97 (1997), 179–186.

28. Dua and Graff, UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science, (2019).

29. Janez, Statistical comparisons of classifiers over multiple datasets. Journal of Machine Learning Research 7(1) (2006), 1–30.

Table 1

The parameters of Gaussian distribution of the artificial dataset

i	μ_i	Σ_i
1	$(0.1597, 1.3541)^T$	$\begin{bmatrix} 0.1726 & 0.0912 \\ 0.0912 & 0.1020 \end{bmatrix}$
2	$(1.1597, 1.4541)^T$	$\begin{bmatrix} 0.1726 & 0.0912 \\ 0.0912 & 0.1020 \end{bmatrix}$