In the conventional fuzzy k-NN classification rule, the vote cast by each nearest neighboring known (labelled) sample on the class membership grades of the unknown (unlabelled) sample is formed by weighting the nearest neighbor’s class membership grades by the inverse of the nearest neighbor’s distance to the unknown sample. This paper proposes a modification of the weight (distance) used for each nearest neighbor by employing the geometrical relation among the nearest neighbor, its most informative known neighbor of the same class and the unknown sample. It is also proposed that this modification be only (conditionally) applied when the feature vector of the unknown sample lies outside the convex hull of the feature vectors of the known samples of each class. Results on a large number of datasets from the UCI and KEEL repositories and synthetically generated datasets show that, in return for a modest increase in classification complexity over the original fuzzy k-NN rule, the proposed fuzzy k-NN rule offers a better classification accuracy than the accuracies of the original fuzzy kNN rule and most other nearest neighbor type algorithms.
Even though it was introduced nearly seven decades ago as a simple non-parametric learning approach, k-Nearest-Neighbor (k-NN) rule [1] surpasses common classification and regression algorithms in many application domains. Recently, the optimization of convolutional neural network features for k-NN classification [2] suggested that k-NN could be an integral classification component of deep learning architectures.
The k-NN classification rule estimates the a posteriori density from the samples in the local neighborhood and assigns the most frequently observed class label to the unknown sample (sample with unknown class label). The error rate of the rule is no worse than twice the Bayes rate for dense datasets and asymptotically approaches the Bayes rate for large k [3].
The pioneering fuzzy k-NN learning approach [4] extends the k-NN rule by assigning class membership grades to the samples and determines the class membership grades of an unknown sample by using its nearest neighbors’ known class membership grades. A sequence of triples of membership vectors for the training samples, k and error rate are iteratively determined in three phases [4]. In [5], an interpretation of this method is presented and its performance is compared against the performances of two classifiers based on the Fuzzy C-means algorithm [6]. In the more general approach of [7], a soft voting for the class membership grades of an unknown sample is facilitated by weighting the known membership grades of each nearest neighbor of the unknown sample by the nearest neighbor’s normalized inverse distance to the unknown sample. The fuzzy k-NN rules of [7] are adopted as the basis of the current work.
Let denote an unknown sample (its feature vector) and K denote the index set of its k nearest neighbors. Let μi (xj) and c (xj) denote the fuzzy membership grade of i-th class and the assigned class, respectively, of j-th nearest neighboring known sample xj. The grade μi (x) represents the possibility of sample x belonging to i-th class. In the fuzzy k-NN classification rule, the membership grade of in i-th class is estimated as
where dj =∥ vj ∥ and is the vector pointing from (the feature vector of) the j-th nearest neighbor xj to (the feature vector of) the unknown sample .
The indicator function mm outputs 1 if m = n and 0 otherwise. If xj has membership in only one class, say c (xj), but no others, i.e. μi (xj) = ic(xj), crisp initialization version of Equation 1 can be expressed as
In the above, Ki denotes the index set of the ki nearest neighbors of having assigned class i, where ∪iKi = K and ∑iki = k.
In [7], the fuzzy initialization version of Equation 1 is achieved by the fuzzification of the memberships of the known samples from assigned class c (xj) to fuzzy class membership grades μi (xj) as
where ni,j is the number of nearest neighbors (out of kinit) of xj having assigned class i. The soft voting schemes in Equation 1 and 2 additively pool the i-th class membership grade estimates from all the k nearest neighbors to form the final robust estimate .
Contribution and motivation
The main contribution of this paper is the use of a second sample alongside each nearest neighbor xj for determining the distance used to weight μi (xj) in order to obtain an estimate for .
Let S be the set of all known (training) samples and C (x) denote the random variable representing the class of generic sample x with its instantiation being the assigned class label c (x). Good estimates of need to exploit , the statistical dependence (mutual information) between and the classes of the samples in S. However, the fuzzy rules of Equation 1 and 2, that employ only the distance dj between xj and for estimating exploit for each nearest neighbor xj.
To get a better estimate for each xj, the current work aims to exploit , where, given , is the sample that maximizes . Towards this end, the proposed method introduces the distance derived from dj to form an estimate of for each xj. The new distance is based on the positions of both xj and relative to in the feature space. It is an approximate measure of the distance of from the set of samples of the same class as xj.
Let , Cj ≜ C (xj) and and denote Kullback-Leibner distance by D (. ∥ .). Then
While the estimate in the original fuzzy k-NN rule of Equation 1 attains pC|Cj beyond a priori knowledge pC through the use of dj, it is the goal of the current work to make the estimate of additionally attain beyond pC|Cj thru the use of .
As with Equation 1, the estimates formed using both dj and for all the k nearest neighboring samples are additively pooled together to get the final estimate.
Section 2 reviews other prior work. Section 3 covers the proposed class membership grade estimation method. Section 4 the enhancement of the basic method by its conditional application. Experimental results demonstrating the performance advantage of the proposed method are reported in Section 5. This section also discusses the complexity of the proposed method and gives insights gained from dimensionality reduction. Section 6 presents concluding remarks.
Prior work
The theoretical works of [8, 9] prove the performance bound and the convergence rate of the fuzzy generalized nearest neighbor rule.
In fuzzy rule based determination of the initial class membership grades of known samples [10], the parameters of the antecedent and consequent sets as well as feature weights are optimized with a genetic algorithm. In [11], the distance to a training sample is modified by dividing by the level of the subset to which the sample belongs and the class memberships are expressed as functions of numbers of neighbors of each class and the class memberships of the neighbors in a fuzzy decision rule. Assigning weights to neighbors based on initial membership values, distance as well as standard deviation of class membership values gives more credibility to neighbors located parallel to the vector connecting the class means [12]. The fuzzy k-NN voting scheme in [13] achieves the effect of automatically varying the effective number of nearest neighbors according to the local density of the known samples. An estimation theoretic approach [14] determines the weights used to emphasize the membership grades of the neighbors in fuzzy k-NN rule as the optimal weights of an unbiased linear mean squared error estimator for the feature vector of the unknown sample.
The intuitionistic fuzzy k-NN approach [15] thresholds both the membership and nonmembership functions to define a voting function as either the membership function, its negative, or zero. The voting functions for the k nearest neighbors are averaged and shifted to yield the class membership grades.
In the adaptable nearest neighbor selection method [16], prototype like neighbors (high membership in one class) are emphasized in deciding the final membership for the unknown sample. Pruning of the training set to a prototype set [17] has also been proposed to reduce the complexity of application of fuzzy k-NN. Similarly, the class membership grade estimates for each training vector can be computed by fuzzy k-NN and a training vector can be deleted if the difference of the actual and estimate vectors of the class membership grades has a large norm [18].
The uncertainty in the choice of initial k (no. nearest of neighbors) can be managed by extending membership grades for each known sample as interval type 2 fuzzy memberships [19]. Here, classification is performed by first assigning the weighted average of the interval type 2 fuzzy memberships of its nearest neighbors to the unknown sample and then performing type reduction. Another approach [20] is to optimize the kinit and p parameters by a genetic algorithm which has been extended [21] by allowing the kinit and p to assume sets of values where the set elements are determined by the CHCevolutionary optimization. This leads to intervals for both the class membership grades of nearest neighbors and the weights assigned to these grades. A lexicographic order is applied to the sums of intervals of the nearest neigbors for all classes to get the class label for the test sample.
In [22], basic belief masses are assigned to each subset of the set of classes for each of the k nearest neighbors of the unknown sample and combination of these masses using Dempster’s rule. Later, [23] optimizes the parameters of the classification rule of [22].
For numerical features, [24] generalized the fuzzy k-NN rule by modelling each feature as a linguistic variable using FCM [6] and strong fuzzy partitioning. Similarities of individual features and votes of training samples are then aggregated by using operators to construct the unknown sample’s membership grades.
In the literature, the notable fuzzy k-NN classifier application areas are bioinformatics, image processing and biomedical signal processing [25–28].
The Basic Method
Fig. 1 a) illustrates the essential mechanism of the proposed method on a 2-D example involving two linearly separable classes of known samples. The class boundaries are indicated by dotted lines and the unknown sample has an equal number of known nearest neighbors of each class (a total of k = 4) by design. Feature vectors of known data samples of each class are contained in an envelope (convex hull) that can be formed by interpolating these vectors. Rather than using the distance dj of j-th nearest neighboring known sample to the unknown sample, the proposed approach uses the distance, , of the envelope of c (xj) to the unknown sample. In the example of Fig. 1 a), the unknown sample is labelled as class B by the original fuzzy k-NN rule (crisp initialization) since it is closer to the two known nearest neighboring samples of class B than the two known nearest neighboring samples of class A. However, using in place of dj in the fuzzy rule changes the classification result in favor of the more meaningful label of class A. Had the set of samples been denser in the feature space, the original fuzzy rule would also have yielded label A as the classification result. Put differently, the geometric structures of class regions are exploited to compensate for the sparseness of the dataset by using in place of dj.
a) When is used in place of dj in the fuzzy k-NN rule, the unknown sample is assigned the label of the class with closer envelope (class A). b) is outside the convex hull of class C (xj). The derived distance is the shortest distance from to the line connecting xj to another sample of class C (xj) and inversely correlates with the confidence in the class. c) Contrary to (a) and (b), when is inside the convex hull, , the distance to boundary of class on the left (right) is small (large) indicating a lower (higher) confidence in the class.
As illustrated in Fig. 1 b), for an unknown sample outside the envelope of a particular class, where αj is the smallest angle that makes with a line segment connecting the nearest neighbor xj to another labelled sample of the same class. The proposed distance
replaces the original distance dj in the fuzzy rule.
The proposed distance is meaningful only if , the feature vector of the unknown sample, lies outside the envelope of each class (see Section 4). When the convex hull condition is satisfied, could be interpreted as the shortest distance of to a fictitious sample xP of the same class as xj and , assumed to lie on the line containing , i.e. . In order for to provide the largest information on the class identity of , it is required to be of the same class as xj even though it can be a far off neighbor of xj.
It is possible for to be large or . While the former necessitates interpolation of class identity using distant samples, the latter relies on extrapolation of the class identity from to xP if αj < π/2, and xj to xP if αj > π/2. To handle such cases with unreliable , is opted as a more robust estimate than .
In the general D dimensional case, the proposed fuzzy k-NN rule with crisp initialization is
where and αj is defined as in the 2-D example discussed above. The use of (1 + αj) in place of (1 + sin (αj)) effectively increases the weight given to neighbors with small angles since is a strictly decreasing function of αj. This modification of the proposed distance yields slightly more robust performance in the D dimensional feature space for D > 2. One can argue for this modification by referring to the 3-D drawing in Fig. 2. While for small αj, the shortest distance of the unknown sample’s feature vector to a vector interpolated from feature vectors of the same class as xj is possibly much shorter than that predicted by its distance to the line , for large αj, the difference between the two is negligible.
Left: If αj, the angle between vj and is large, the difference between and the magnitude of vj’s projection onto the normal of the shaded face (the shortest distance to a vector interpolated from feature vectors of the same class as xj) is relatively small. Right: If αj is small, it is likely that is much larger than the magnitude of vj’s projection onto the normal of the shaded face.
Two classes of samples uniformly distributed on [0,1]x[0,0.5] and [0,1]x[0.5,1]. Known samples: +, unknown samples: •, nearest neighbors: □. The colors indicate the classification decisions for k = 3 nearest neighbors. Left: Classification using dj, Right: Classification using . The blown up sections show the correction of a classification decision for one unknown sample by use of .
For the fuzzy k-NN rule with fuzzy initialization, each nearest neighbor xj could have multiple class memberships. Again, let . For each class i with μi (xj) >0, the smallest angle αj,i that vj makes with the line segment connecting xj to a sample of assigned class i, , is determined. Based on the proposed distance , the class membership grade of is determined as
In this case, the proposed distance depends on the class index as well as the nearest neighbor index.
Conditional application of the proposed distance
Assuming that the density of the known samples is fairly high, the feature space position of an unknown sample relative to the convex hull of the known samples of a certain class offers a simple means for assessing the confidence in employing the proposed distance.
Specifically, three cases can be identified.
i) ∀j where Si is the set of known samples of class i (as in Fig. 1 a), b)). In this case, correlates with the inverse of the confidence in c (xj).
ii) for one class i0 s.t. c (xj) = i0 ∀j even for a sparse distribution of samples in local neighborhood of . Since the unanimous voting in the fuzzy rule yields the same classification result when replace dj, need not replace dj.
iii) | {i : ∃ j s . t . c (xj) = i} |>1 and ∃i s.t. . Fig. 1 c) suggests that the derived distance need not correlate with the inverse of the confidence in the class c (xj). Hence, in this case, dj is employed instead of to determine the class of .
Therefore, the first case is distinguished from the other two by testing ∀j. If the test returns false, the proposed distance is not used and Equation 1 is employed with dj as a fallback solution.
Due to its convergence speed and runtime, Wolfe’s method [29] as described in [30] has been adopted for testing . is first subtracted from the feature vector of each known sample of a particular class to get a set of vectors denoted as P. The implementation in Algorithm 1 then tests .
In the case of crisp initialization, the conditionally modified distance replaces in Equation 6 where
For fuzzy initialization, for each class i that nearest neighbor xj has nonzero membership grade, the inclusion of in the convex hull of Si is tested separately. The conditionally modified distance replaces in Equation 7 where
Experimental results
In Section 5.1, a figure of merit called the grade advantage is reported alongside accuracy for each classification task. It reflects the confidence in the classification decision-making by comparing the grades of the correct class and the class through
where . Each unknown sample contributes a value between 0 and 1 to γ. The larger the γ is, the better the fuzzy k-NN rule is performing in terms of assigning grades to the classes. For a small size sample (small dataset), γ is a better predictor of the classifier performance on the population (larger dataset) than accuracy is.
In the next two subsections, all performance figures reported are based on five fold cross-validation, and the tested fuzzy k-NN rules employ unnormalized attributes and the Euclidean distance metric for selecting the nearest neighbors. The parameters of the fuzzy k-NN rules are set as kinit = k and p = 2.
Comparisons with original fuzzy k-NN rule
In the first experiment, two linearly separable classes of 450 unknown and 50 known samples each, have been synthetically generated to have feature vectors uniformly distributed on [0,1]x[0,0.5] and [0,1]x[0.5,1]. In Fig. 5.1, it is seen that more than 40% (14 out of 33) of the unknown samples that are incorrectly classified by the crisp initialized fuzzy k-NN rule when original distance dj is employed (left) are correctly classified when the proposed distance is employed (right).
Secondly, two-class Gaussian and truncated Gaussian datasets with independent features have been synthetically generated. Data sets G1-G3 have 0 mean for the features of both classes and standard deviation of 1 and 2 for the features of the first and second classes, respectively. Data sets G4-G10 have standard deviation of 1 for the features of both classes and means of 0 and 1 for the features of the first and second classes, respectively. Truncated Gaussian datasets TG3,TG6,TG9 have been obtained from G3,G6,G9, respectively, by discarding those samples with feature vectors having a distance from the class mean vector of greater than the feature std dev (TG6), or twice the feature std dev (TG3, TG9). First row of Table 1 also displays the dimension D and size n of each dataset as (D,n) under its label. Since the performance advantage is largely dependent on the few choices of parameter combinations, average performances are not reported. The results are summarized in Table 1.
Classification error ϵ%, grade advantage γ of fuzzy k-NN rule employing original and conditionally applied proposed distance (bold:winner)
Rule (init.)
k
Measure
G1
G2
G3
G4
G5
G6
G7
G8
G9
G10
TG3
TG6
TG9
(5, 102)
(5, 103)
(5, 104)
(2, 102)
(2, 103)
(2, 104)
(5, 102)
(5, 103)
(5, 104)
(15, 103)
(5, 104)
(2, 104)
(5, 104)
Original
5
ϵ%
22
20.8
19.25
26
28.8
29.57
20
16.5
16.06
4.1
9.58
8.12
7.08
(crisp)
γ
73.03
73.2
76.39
67.43
68.47
68.04
74.2
78.73
80.04
93.17
86.62
91.79
91.5
Cond.
5
ϵ%
14
19.2
17.96
25
28.65
29.54
19
16.3
15.9
4.1
8.67
8.05
6.46
(crisp)
γ
76.26
76.8
78.18
68.2
68.65
68.08
75.67
80.31
80.59
93.51
87.93
91.92
92.58
Original
5
ϵ%
39
22.6
17.94
29
25.85
27.03
19
15.9
14.74
4
10.4
7.95
6.81
(fuzzy)
γ
65.48
69.8
75.14
67.12
68.89
68.01
72.21
77.94
79.7
92.33
86.35
91.55
91.6
Cond.
5
ϵ%
26
18.4
16.61
29
25.75
27.04
18
15.5
14.63
4
7.65
7.85
6.28
(fuzzy)
γ
68.81
75.08
77.75
68.6
69.1
68.05
74.69
79.83
80.34
92.86
88.09
91.69
92.61
Original
10
ϵ%
36
22.3
17.95
29
27.95
28.65
17
15.4
14.6
3.6
9.85
8.1
6.83
(crisp)
γ
67.35
71.34
75.87
67.62
68.39
68.1
74.03
77.89
79.96
92.62
85.22
91.67
91.22
Cond.
10
ϵ%
26
18.1
16.86
29
27.9
28.63
18
15.3
14.56
3.8
7.5
8.05
6.19
(crisp)
γ
70.54
75.92
78.07
68.87
68.59
68.15
75.77
79.79
80.57
93.16
87.25
91.86
92.53
Original
10
ϵ%
47
24.7
17.59
28
25
26.05
18
14.9
13.82
3.5
11.38
8.17
6.26
(fuzzy)
γ
61.28
67.51
74.26
67.15
68.67
68.03
72.25
77.11
79.53
91.58
83.96
91.34
91.13
Cond.
10
ϵ%
41
17.5
16.24
28
24.95
26.06
19
14.8
13.81
3.4
7.34
8.15
5.86
(fuzzy)
γ
64.15
73.65
77.41
68.87
68.89
68.08
74.69
79.28
80.23
92.31
86.82
91.57
92.44
Although there is a conspicuous accuracy advantage with the proposed method for datasets G1-G3 with zero mean classes, the advantage is very slim for datasets G4-G10 with different class means since a larger fraction of test samples fall into case ii) of Section 4 for datasets G4-G10. Truncation of the class distributions leads to larger accuracy gains since, in truncated datasets, a much larger fraction of test samples that are located outside the convex hulls of the training samples benefit from the proposeddistance.
Ten datasets from the UCI repository have also been used to compare the classification performance of the proposed fuzzy k-NN rule against that of the original one. Table 2 provides results with unconditional and conditional application of the proposed distance.
Classification error ϵ%, grade advantage γ for fuzzy k-NN rule employing original and unconditionally/conditionally applied proposed distance
Rule (init.)
k
Measure
Pima
Ionosphere
Wine
Occupancy
Iris
Glass
Ecoli
Yeast
Waveform
Seeds
AVG.
Original
5
ϵ%
29.17
17.38
23.03
1.07
3.33
30.37
14.58
42.39
17.9
10.95
19.02
(crisp)
γ
66.69
81.97
72.51
98.7
95.67
68.39
81.08
53.93
77.39
89.97
78.63
Uncond.
5
ϵ%
28.78
16.24
22.47
0.99
4
27.1
13.69
42.05
17.84
11.43
18.46
(crisp)
γ
67.62
82.67
73.61
98.79
95.99
69.22
82.22
54.55
77.87
89.59
79.21
Cond.
5
ϵ%
27.22
16.24
21.91
1.06
4
26.16
13.39
41.24
17.84
10.95
18
(crisp)
γ
69.22
82.67
74.36
98.71
95.97
69.25
83.29
54.92
77.87
89.78
79.6
Original
5
ϵ%
27.99
18.8
28.09
0.99
3.33
31.78
14.29
40.57
17.04
11.9
19.48
(fuzzy)
γ
66.46
80.07
70.5
98.58
94.8
65.73
80.5
54.75
75.79
89.57
77.68
Uncond.
5
ϵ%
27.34
17.95
24.72
0.98
3.33
29.91
14.59
40.17
16.7
10
18.57
(fuzzy)
γ
67.48
80.92
71.67
98.65
95.58
66.75
81.96
55.33
76.45
89.04
78.39
Cond.
5
ϵ%
25.26
17.95
23.03
0.99
3.33
30.38
13.4
40.64
16.7
10
18.17
(fuzzy)
γ
69.41
80.92
72.65
98.59
95.64
66.77
83.14
55.41
76.45
89.27
78.83
Original
10
ϵ%
26.56
17.38
21.91
1.04
2.67
32.71
14.58
40.57
16.54
9.52
18.35
(crisp)
γ
67
80.14
72.32
98.65
94.9
66.45
80.1
54.96
76.96
89.1
78.06
Uncond.
10
ϵ%
25.52
17.1
24.16
0.96
3.34
28.97
14.28
40.03
16.32
8.57
17.93
(crisp)
γ
68.18
80.99
73.7
98.76
95.53
67.34
81.6
55.52
77.55
89.9
78.91
Cond.
10
ϵ%
25
17.1
23.03
1.03
3.34
30.84
14.28
40.57
16.32
8.57
18.01
(crisp)
γ
69.8
80.99
74.51
98.66
95.65
67.25
82.64
55.62
77.55
90.12
79.28
Original
10
ϵ%
26.69
19.37
24.72
1.02
4
32.71
14.29
40.36
16.14
10.95
19.03
(fuzzy)
γ
66.2
78.08
70.58
98.53
93.42
63.53
79.04
54.97
75.28
88.35
76.8
Uncond.
10
ϵ%
25.26
18.8
23.03
0.99
4.67
30.84
14.29
40.23
15.74
10
18.39
(fuzzy)
γ
67.41
79.11
71.93
98.61
94.56
64.62
80.82
55.49
76.04
89.07
77.77
Cond.
10
ϵ%
24.74
18.8
21.35
1.01
4.67
32.24
13.4
41.03
15.74
10
18.3
(fuzzy)
γ
69.31
79.12
72.83
98.54
94.81
64.48
81.96
55.48
76.04
89.37
78.19
It is observed that the unconditional or conditional application of the proposed distance provides advantages in error rates in 8 out of 9 datasets for k=5 nearest neighbors, and 7 out of 9 datasets for k=10 nearest neighbors. On the average, both strategies yield advantages in error rates for all combinations of initialization type and k value. The average percentage reductions in classification error rate due to the unconditional application of proposed distance are 2.94% and 4.67%, respectively, for crisp and fuzzy initialization when k=5. Conditional application of the proposed distance increases these reductions to 5.36% and 6.72%, respectively, for crisp and fuzzy initialized fuzzy k-NN. When k=10, the respective reductions in error rates are 2.29% and 3.36% due to the unconditional application of proposed distance and 1.85% and 3.84% due to its conditional application. The advantage of the use of proposed distance is less for k=10 than for k=5.
The advantage of conditional application is not immediately clear, since, for k=10 and crisp initialization, unconditional application of the proposed distance has a larger average percentage reduction in error rate than its conditional application. However, for all combinations of k with initialization types, conditional application of the proposed distance yields a larger average γ than its unconditional application.
For statistical comparison of two classifiers over multiple datasets, [31] recommends the Wilcoxon signed rank test over the paired t-test. The Wilcoxon test statistic w- is formed by taking the absolute differences of the error rates, ranking them and finally summing the ranks for which the error rate of the proposed fuzzy k-NN rule exceeds the error rate of the original fuzzy k-NN rule. From the exact p-values (p = Pr {W- ≤ w-}) reported in Table 3, it can be concluded that fuzzy initialization as well as k=5 nearest neighbors generally yield significantly lower error rates (higher classification accuracies).
Wilcoxon statistic and p-values (×10-2) calculated from Table 2H0 : accuracy of proposed rule = accuracy of original rule
Unconditionally applied
Conditionally applied
crisp k=5
fuzzy k=5
crisp k=10
fuzzy k=10
crisp k=5
fuzzy k=5
crisp k=10
fuzzy k=10
W-
12
2
15
5
3
1
12
11
pval
6.45
0.391
11.5
1.76
0.781
0.391
12.3
4.98
In the D dimensional feature space, the original fuzzy k-NN rule has a complexity of O (Dn). Since each known sample of the same class as a nearest neighbor could be its most informative neighbor, the unconditional application of the proposed distances has complexity O ((k + 1) Dn). For large n, the Wolfe’s algorithm complexity of O (n3 + Dn2) (see [30]) dictates the overall complexity in the conditional application of the proposed distance. For small n, the additional complexity needed to compute the proposed distance may actually be larger than the complexity of Wolfe’s algorithm so that it may be more advantageous to conditionally apply the proposed distance. In Table 4, average multiplicative runtime increase factors for the UCI datasets are reported for different scenarios. On the average, the complexity increase is the greatest for k=10 and fuzzy initialization, and the conditional application of the proposed distance yields a modest complexity increase over its unconditional application.
Average multiplicative run time increase factors
crisp k=5
fuzzy k=5
crisp k=10
fuzzy k=10
Unconditionally applied
2.3
3.7
4.4
5.5
Conditionally applied
6.1
9.3
8.6
11.9
Insights gained by dimensionality reduction
Linear Discriminant Analysis has renown as a preprocessing transformation for retaining the non-redundant information in the data and has been applied in this work to gain some insight into how the dimensionality of the data affects the performance gain of the proposed fuzzy k-NN rule over the original fuzzy k-NN rule. Table 5 shows on four datasets of Section 5.1 that when LDA is applied, the performance advantage of the proposed rule is largely reduced or nullified.
Effect of dimensionality reduction with LDA on error ϵ% (k=5)
Orig. crisp
Prop. crisp
Orig. fuzzy
Prop. fuzzy
w/o LDA
with LDA
w/o LDA
with LDA
w/o LDA
LDA LDA
w/o LDA
with LDA
Ion.
17.38
14.53
16.24
14.25
18.80
12.54
17.95
12.54
Pima
29.17
30.08
27.21
29.95
27.99
27.08
25.26
26.95
Wine
23.03
0.56
21.91
0.56
28.09
0.56
23.03
0.56
Glass
30.37
35.51
26.17
35.05
31.78
35.98
30.37
35.05
While any pair of classes may be separated by a hyperplane in a local region of a high dimensional space, such structure in the data may vanish for multimodal class distributions when it is projected to a lower dimensional space. Therefore, the proposed fuzzy k-NN rule may lose its power in a reduced dimensional space where test samples fall inside convex hulls of classes.
Competitiveness against common classifiers in a remote sensing image classification application
Forest Type Mapping dataset of 523 samples in UCI repository contains remote sensing data of a forested area in Japan. 4 different forest types are mapped based on the spectral characteristics using ASTER satellite imagery. Spectral measurements and the deviations of these measurements from their predictions are specified as features. Alongside the ten fold stratified cross validation classification accuracy results achieved by the custom implementations of the crisp and fuzzy initialized versions of the original and proposed fuzzy k-NN rules, the corresponding results achieved by Weka implementations of crisp initialized fuzzy k-NN classifier and the common classifiers, Support Vector Machines (SVM), Random Forests (RF), Naive Bayes (NB) are listed in Table 6. The optimal combination of regularization and radial basis width parameters for SVM and the number of features and tree depth for RF have been determined by grid search. Best k : k ∈ {1, 30} has been determined by exhaustive search for the original fuzzy k-NN rule with crisp initialization and has been used in the other fuzzy k-NN rules.
Classification errors ϵ% on the Forest Type Mapping dataset
Orig. crisp
Orig. fuzzy
Prop. crisp
Prop. fuzzy
Weka Orig. crisp
Weka SVM
Weka RF
Weka NB
9.94
9.75
9.75
9.37
10.51
10.13
10.71
13.96
In this application domain, the proposed fuzzy k-NN rule has a greater performance advantage over SVM, RF and NB than the original fuzzy k-NN rule and could be the classifier of choice if its higher computational complexity is not an issue.
Performance comparisons with fuzzy KNN type algorithms on Keel datasets
Experiments have also been conducted on the predesigned, 10-fold stratified cross validation (no covariate shift reduction) fold partitions of the 44 datasets from the KEEL-Dataset repository reported in [21]. The performances of the conditionally applied proposed and original fuzzy k-NN rules are compared against the performances of the 8 nearest neighbor type benchmark algorithms reported in Tables A.7-A.10 of [21]. Six of these benchmark algorithms other than EF-kNN-IVFS [21] and kNN, are the best performing subset of fuzzy nearest neighbor type classifier algorithms comparatively evaluated in [32].
Table 7 lists the maximum of accuracy values over {3, 5, 7, 9} neighbors for the classification of each dataset with each benchmark algorithm (taken from Tables A.7-A.10 of [21]) and the original and proposed fuzzy k-NN rules (kinit = 3, p = 2) with crisp and fuzzy initialization. A statistical analysis of these results has been performed by taking the proposed fuzzy rule (with either initialization) as the control treatment (algorithm) and comparing it against the original fuzzy rule (with the same initialization) and the other 8 algorithms in [21] by using hypothesis testing. In the first stage, the Friedman hypothesis test has been used as per [31] with the null hypothesis that the performances of m = 10 algorithms are indistinguishable. Let aij represent the accuracy of j-th treatment for i-th dataset. In a randomized block design, where each block represents one dataset, ranks, rij, of the accuracies, aij, are obtained by ranking the accuracies from 1 to m within each block. The average ranks for the algorithms are for b blocks. The Friedman statistic
Accuracy % / Kappa (×10-2) (maximums over k ∈ {3, 5, 7, 9} neighbors) for the classification of the Keel datasets
is chi-square distributed with m-1 degrees of freedom. For the m = 10 algorithms and b = 44 datasets with accuracies shown in Table 8, the statistic is determined to be 66.9583 and 73.221, for crisp and fuzzy initialization, respectively. The corresponding p-values (p = Pr {X > S}) are 5.3E-18 and 3.89E-21.
Pairwise accuracy comparisons (Wilcoxon signed rank test) between the 9 algorithms and the proposed fuzzy rule with crisp initialization
EF-kNN-IVFS
kNN
D-SKNN
IF-KNN
FENN
IT2FKNN
GAfuzzy-kNN
PFKNN
Orig. crisp
Statistic W-
378
162
200.5
167
183
286.5
237
151
120.5
z score
-0.913
-3.327
-2.637
-3.105
-3.351
-2.057
-2.501
-4.009
-3.178
unadjusted p
0.181
4.39E-4
4.18E-3
9.51E-4
4.03E-4
1.99E-2
6.19E-3
3.05E-5
7.43E-4
Holm corrected α
0.05
0.00714
0.0125
0.01
0.00625
0.025
0.0166
0.00556
0.00833
In the second stage, Wilcoxon signed rank test post-hoc test is used to identify those algorithms which have a statistically significantly lower accuracy than the accuracy of the control treatment. In accordance with [33], the mean ranks post-hoc test of [34] has not been used due to its inconsistency of power.
Normal approximation of the distribution of the statistic is used to calculate the p-value since the number of datasets is large. The z-score is determined as
The Familywise Error Rate (FWER) is bounded to 0.05 by using the Holm-Bonferroni correction. This correction is less conservative than the Bonferroni correction and it does not alter the significance levels of pairwise comparisons with strong treatments when weaker ones are later added to the comparison set.
The p-values of the 9 pairwise comparisons with the proposed fuzzy k-NN rule with the crisp initialization as the control treatment, and the proposed fuzzy k-NN rule with the fuzzy initialization as the control treatment, are tabulated against the Holm-Bonferroni corrected significance levels in Tables 8 and 9, respectively. The accuracy of the proposed fuzzy k-NN rule with either initialization is statistically significantly higher than the accuracy of all algorithms except EF-kNN-IVFS [21]. Post-hoc tests are inconclusive in showing the statistical significance of the advantage of either the proposed fuzzy k-NN rule or EF-kNN-IVFS. On the other hand, for the values of k giving the best results, while the ratio of the run time of EF-kNN-IVFS to that of the fuzzy k-NN rule with fuzzy initialization is 98.8, the ratios of the run times of proposed fuzzy k-NN rule with crisp and fuzzy initializations to those of their original counterparts are merely 9.9 and 13.0, respectively.
Pairwise accuracy comparisons (Wilcoxon signed rank test) between the 9 algorithms and the proposed fuzzy rule with fuzzy initialization
EF-kNN-IVFS
kNN
D-SKNN
IF-KNN
FENN
IT2FKNN
GAfuzzy-kNN
PFKNN
Orig. fuzzy
Statistic W-
417
190.5
231.5
178
145.5
220.5
217.5
134
92.5
z score
0.101
-3.104
-2.572
-3.414
-3.820
-2.358
-2.581
-3.836
-3.636
unadjusted p
0.540
9.56E-4
5.05E-3
3.21E-4
6.68E-5
9.18E-3
4.93E-3
6.26E-5
1.38E-4
Holm corrected α
0.05
0.01
0.0166
0.00833
0.00625
0.025
0.0125
0.00556
0.00714
Cohen’s Kappa statistic accounts for the chance agreement between actual and predicted labels. It is
where , pa is accuracy probability. For each algorithm, Table 10 also lists, the maximum κ over {3, 5, 7, 9} neighbors for each dataset and the average maximum κ over all datasets.
AUC statistics (maximum over k ∈ {3, 5, 7, 9} neighbors) for the classification of the two class Keel datasets
Data set
EF-kNN-IVFS
kNN
D-SKNN
IF-KNN
FENN
IT2FKNN
GAfuzzy-kNN
PFKNN
Orig. crisp
Prop. crisp
Orig. fuzzy
Prop. fuzzy
Appendicitis
0.822
0.817
0.795
0.837
0.843
0.846
0.825
0.825
0.827
0.834
0.839
0.846
Banana
0.959
0.963
0.951
0.967
0.965
0.966
0.960
0.959
0.958
0.958
0.964
0.962
Bands
0.645
0.728
0.749
0.727
0.688
0.746
0.731
0.712
0.761
0.765
0.756
0.754
Bupa
0.591
0.641
0.668
0.654
0.652
0.676
0.649
0.678
0.687
0.713
0.662
0.693
Haberman
0.667
0.629
0.618
0.653
0.689
0.665
0.650
0.637
0.636
0.625
0.686
0.673
Heart
0.845
0.856
0.851
0.857
0.853
0.866
0.856
0.867
0.867
0.866
0.856
0.863
Ionosphere
0.891
0.933
0.926
0.935
0.868
0.937
0.933
0.929
0.937
0.939
0.937
0.937
Mammographic
0.849
0.893
0.878
0.889
0.873
0.878
0.868
0.875
0.860
0.862
0.874
0.874
Monk-2
0.923
0.988
0.983
0.980
0.913
0.957
0.986
0.966
0.974
0.976
0.956
0.958
Phoneme
0.922
0.932
0.933
0.941
0.940
0.954
0.953
0.946
0.956
0.956
0.952
0.948
Pima
0.784
0.793
0.760
0.800
0.797
0.801
0.790
0.799
0.789
0.798
0.797
0.805
Ring
0.823
0.914
0.881
0.920
0.692
0.924
0.918
0.924
0.914
0.910
0.919
0.910
Sonar
0.899
0.909
0.917
0.927
0.885
0.924
0.96
0.878
0.930
0.936
0.921
0.921
Spambase
0.939
0.949
0.949
0.954
0.951
0.965
0.962
0.948
0.965
0.969
0.963
0.966
Spectfheart
0.835
0.798
0.788
0.797
0.813
0.812
0.816
0.792
0.797
0.799
0.812
0.813
Titanic
0.729
0.670
0.746
0.756
0.726
0.739
0.727
0.561
0.754
0.754
0.731
0.731
Twonorm
0.996
0.996
0.992
0.997
0.997
0.997
0.997
0.996
0.996
0.996
0.997
0.996
Wdbc
0.993
0.992
0.986
0.994
0.993
0.993
0.992
0.993
0.992
0.992
0.994
0.993
Wisconsin
0.986
0.991
0.991
0.992
0.984
0.992
0.991
0.993
0.990
0.990
0.992
0.991
AVERAGE
0.847
0.863
0.861
0.873
0.849
0.876
0.872
0.857
0.873
0.876
0.874
0.875
Receiver Operating Characteristic (ROC) is a plot of the True Positive Rate (TPR) vs. the False Positive Rate (FPR) as a function of the grade of the positive class. For each algorithm, Table 11 lists the Area Under Curve (AUC) of the maximum ROC (maximum TPR as a function of FPR) over {3, 5, 7, 9} neighbors for the two-class datasets (excluding Hepatitis which had too few samples for estimating ROC).
The average maximum κ values and AUC values in Table 10 suggest that IT2FKNN and GA-fuzzy-kNN, rather than EF-kNN-IVFS, strongly contend with the proposed method. For the Banana dataset, the proposed fuzzy k-NN rule is not better than the original fuzzy k-NN rule and is inferior to IT2FKNN in terms of AUC performance. A scatter plot of the 2-D feature vectors of this dataset reveals that most known boundary samples of each class are inside the convex hull of the other class so that the proposed distance is not employed. This is distinct from the linearly separable, two-class problem of Section 5.1 where the proposed fuzzy rule has a performance advantage.
Conclusions and future work
This paper has presented a new fuzzy k-NN rule based classifier. The soft voting process for the class of the unknown sample employs its shortest distance to the line passing through the nearest neighbor and a neighbor of this nearest neighbor of the same class. It is also proposed that the new distance be employed only when, for each nearest neighbor, the unknown sample lies outside of the convex hull of the set of known samples of the same class as the nearest neighbor. The proposed fuzzy k-NN rule has a significantly higher classification accuracy than the original fuzzy k-NN rule and most other fuzzy nearest neighbor type algorithms, albeit a modestly higher complexity.
In future work, it might be of interest to see if the proposed classifier can benefit from denoising and despeckling methods (e.g. [35, 36]) to improve its performance on noisy (image) data.
References
1.
E.Fix and J.L.Hodges, Discriminatory analysis: Small sample performance, USAF School of Aviation Medicine, Randolph Field, Tex, Project 21-49-004Rept. 11 (1952).
2.
W.Ren, Y.Yuy, J.Zhang and K.Huang, Learning Convolutional NonLinear Features for K Nearest Neighbor Image Classification, in: Intl Conf Pattern Recogn, 2014, 2014.
3.
T.M.Cover and P.E.Hart, Nearest neighbor pattern classification, IEEE Trans Inf TheoryIT-13 (1967), 21–27.
4.
A.Jozwik, A learning scheme for a fuzzy k-NN rule, Pattern Recogn Lett1 (1983), 287–289.
V.T.Kissiov, A fuzzy version of the K-NN method, Fuzzy Sets Syst49(3) (1992), 323–329.
12.
J.H.Han and Y.K.Kim, A fuzzy K-NN algorithm using weights from the variance of membership values, in: IEEE Comp Society Conf Comp Vision and Pattern Recognition, Ft Collins, CO, USA, 1999.
13.
H.B.Mitchell and P.A.Schaefer, A ’soft’ K-nearest neighbor voting scheme, Intl J Intel Syst16(4) (2001), 459–468.
14.
T.D.Pham, An Optimally Weighted Fuzzy k-NN Algorithm, in: Intl Conf Pattern Recogn Image Analysis: Pattern Recognition and Data Mining, Bath, UK,2005, pp. 239–247.
Y.K.Kim and J.H.Han, Fuzzy K-NN algorithm using modified K-selection, in: Intl Joint Conf of 4th IEEE International Conf on Fuzzy Syst and 2nd Intl Fuzzy Eng Symp, 1995.
17.
M.Arif, M.U.Akram, F.A.Afsar and F.A.A.Minhas, Pruned fuzzy k-nearest neighbor classifier for beat classification, J Biomedical Sci Eng3 (2010), 380–389.
18.
M.-S.Yang and C.-H.Chen, On the edited fuzzy K-nearest neighbor rule, IEEE Trans Syst Man Cybern Part B (Cybernetics)28(3) (1998), 461–466.
19.
F.C.-H.Rhee and C.Hwang, An interval type-2 fuzzy Knearest neighbor, in: 12th IEEE Intl Conf Fuzzy Syst, 2003 FUZZ ’03, 2003, pp. 25–28.
20.
X.Hu and C.Xie, Improving fuzzy k-nn by using genetic algorithm, J Comp Info Syst1(2) (2005), 203–213.
21.
J.Derrac, F.Chiclana, S.Garcia and F.Herrera, Evolutionary fuzzy k-nearest neighbors algorithm using interval-valued fuzzy sets, Info Sci329 (2016), 144–163.
22.
T.Denoeux, A k-nearest neighbor classification rule based on Dempster-Shafer theory, IEEE Trans Syst Man Cybern25(5) (1995), 804–813.
23.
L.M.Zouhal and T.Denoeux, An evidence-theoretic k-NN rule with parameter optimization, IEEE Trans Syst Man Cybern Part C (Appl and Reviews)28(2) (1998), 263–271.
24.
S.Ezghari, A.Zahi and K.Zenkouar, A new nearest neighbor classification method based on fuzzy set theory and aggregation operators, Expert Syst with Appl80 (2017), 58–74.
25.
S.Rasheeda, D.Stashuka and M.Kamel, Adaptive fuzzy k-NN classifier for EMG signal decomposition, Medical Engineering & Physics28(7) (2006), 694–709.
26.
Y.Huang and Y.Li, Prediction of protein subcellular locations using fuzzy k-NN method, Bioinformatics20(1) (2004), 21–28.
27.
A.Sengur, An expert system based on principal component analysis, artificial immune system and fuzzy k-NN for diagnosis of valvular heart diseases, Comp in Biology and Medicine38(3) (2008), 329–338.
28.
H.Nezamabadi-pour and E.Kabir, Concept learning by fuzzy k-NN classification and relevance feedback for efficient image retrieval, Expert Syst Appl36(3) (2009), 5948–5954.
29.
P.Wolfe, Finding the nearest point in a polytope, Math Programming11 (1976), 128–149.
30.
D.Chakrabarty, P.Jain and P.Kothari, Provable submodular minimization using Wolfe’s algorithm, in: Advances in Neural Information Proc Syst 27, Vol. 20, 2014, pp. 802–809.
31.
J.Demsar, Statistical comparisons of classifiers over multiple data sets, J Mach Learn Res7 (2006), 1–30.
S.Gai, B.Zhang, C.Yang and L.Yu, Speckle noise reduction in medical ultrasound image using monogenic wavelet and Laplace mixture distribution, Digital Signal Processing72 (2018), 192–207. http://dblp.uni-trier.de/db/journals/dsp/dsp72.html#GaiZYY18