Fuzzy k-NN classification with weights modified by most informative neighbors of nearest neighbors

Abstract

In the conventional fuzzy k-NN classification rule, the vote cast by each nearest neighboring known (labelled) sample on the class membership grades of the unknown (unlabelled) sample is formed by weighting the nearest neighbor’s class membership grades by the inverse of the nearest neighbor’s distance to the unknown sample. This paper proposes a modification of the weight (distance) used for each nearest neighbor by employing the geometrical relation among the nearest neighbor, its most informative known neighbor of the same class and the unknown sample. It is also proposed that this modification be only (conditionally) applied when the feature vector of the unknown sample lies outside the convex hull of the feature vectors of the known samples of each class. Results on a large number of datasets from the UCI and KEEL repositories and synthetically generated datasets show that, in return for a modest increase in classification complexity over the original fuzzy k-NN rule, the proposed fuzzy k-NN rule offers a better classification accuracy than the accuracies of the original fuzzy kNN rule and most other nearest neighbor type algorithms.

Keywords

k-nearest-neighbor fuzzy classification distance convex hull

1 Introduction

Even though it was introduced nearly seven decades ago as a simple non-parametric learning approach, k-Nearest-Neighbor (k-NN) rule [1] surpasses common classification and regression algorithms in many application domains. Recently, the optimization of convolutional neural network features for k-NN classification [2] suggested that k-NN could be an integral classification component of deep learning architectures.

The k-NN classification rule estimates the a posteriori density from the samples in the local neighborhood and assigns the most frequently observed class label to the unknown sample (sample with unknown class label). The error rate of the rule is no worse than twice the Bayes rate for dense datasets and asymptotically approaches the Bayes rate for large k [3].

The pioneering fuzzy k-NN learning approach [4] extends the k-NN rule by assigning class membership grades to the samples and determines the class membership grades of an unknown sample by using its nearest neighbors’ known class membership grades. A sequence of triples of membership vectors for the training samples, k and error rate are iteratively determined in three phases [4]. In [5], an interpretation of this method is presented and its performance is compared against the performances of two classifiers based on the Fuzzy C-means algorithm [6]. In the more general approach of [7], a soft voting for the class membership grades of an unknown sample is facilitated by weighting the known membership grades of each nearest neighbor of the unknown sample by the nearest neighbor’s normalized inverse distance to the unknown sample. The fuzzy k-NN rules of [7] are adopted as the basis of the current work.

1.1 Review of fuzzy k-NN rules of [7]

Let $\hat{x}$ denote an unknown sample (its feature vector) and K denote the index set of its k nearest neighbors. Let μ_i (x_j) and c (x_j) denote the fuzzy membership grade of i-th class and the assigned class, respectively, of j-th nearest neighboring known sample x_j. The grade μ_i (x) represents the possibility of sample x belonging to i-th class. In the fuzzy k-NN classification rule, the membership grade of $\hat{x}$ in i-th class is estimated as $μ_{i} (\hat{x}) = \frac{\sum_{j : j \in K} μ_{i} (x_{j}) d_{j}^{- p}}{\sum_{j : j \in K} d_{j}^{- p}}$ (1) where d_j =∥ v_j ∥ and $v_{j} = \hat{x} - x_{j}$ is the vector pointing from (the feature vector of) the j-th nearest neighbor x_j to (the feature vector of) the unknown sample $\hat{x}$ .

The indicator function $1$ _mm outputs 1 if m = n and 0 otherwise. If x_j has membership in only one class, say c (x_j), but no others, i.e. μ_i (x_j) = $1$ _{ic(x_j)}, crisp initialization version of Equation 1 can be expressed as $μ_{i} (\hat{x}) = \frac{\sum_{j : j \in K_{i}} d_{j}^{- p}}{\sum_{j : j \in K} d_{j}^{- p}}$ (2)

In the above, K_i denotes the index set of the k_i nearest neighbors of $\hat{x}$ having assigned class i, where ∪_iK_i = K and ∑_ik_i = k.

In [7], the fuzzy initialization version of Equation 1 is achieved by the fuzzification of the memberships of the known samples from assigned class c (x_j) to fuzzy class membership grades μ_i (x_j) as $μ_{i} (x_{j}) = 0.51 1_{i c (x_{j})} + (n_{i, j} / k_{i n i t}) * 0.49$ (3)

where n_i,j is the number of nearest neighbors (out of k_init) of x_j having assigned class i. The soft voting schemes in Equation 1 and 2 additively pool the i-th class membership grade estimates from all the k nearest neighbors to form the final robust estimate $μ_{i} (\hat{x})$ .

1.2 Contribution and motivation

The main contribution of this paper is the use of a second sample $x_{j}^{'}$ alongside each nearest neighbor x_j for determining the distance used to weight μ_i (x_j) in order to obtain an estimate for $μ_{i} (\hat{x})$ .

Let S be the set of all known (training) samples and C (x) denote the random variable representing the class of generic sample x with its instantiation being the assigned class label c (x). Good estimates of $μ_{i} (\hat{x})$ need to exploit $I (C (\hat{x}); {C (x) : x \in S} | S, \hat{x})$ , the statistical dependence (mutual information) between $C (\hat{x})$ and the classes of the samples in S. However, the fuzzy rules of Equation 1 and 2, that employ only the distance d_j between x_j and $\hat{x}$ for estimating $μ_{i} (\hat{x})$ exploit $I (C (\hat{x}); C (x_{j}) | x_{j}, \hat{x})$ for each nearest neighbor x_j.

To get a better estimate for each x_j, the current work aims to exploit $I (C (\hat{x}); C (x_{j}^{'}), C (x_{j}) | x_{j}, x_{j}^{'}, \hat{x})$ , where, given $I (C (\hat{x}); C (x_{j}) | x_{j}, \hat{x})$ , $x_{j}^{'}$ is the sample that maximizes $I (C (\hat{x}); C (x_{j}^{'}) | C (x_{j}), x_{j}, x_{j}^{'}, \hat{x})$ . Towards this end, the proposed method introduces the distance $d_{j}^{'}$ derived from d_j to form an estimate of $μ_{i} (\hat{x})$ for each x_j. The new distance $d_{j}^{'}$ is based on the positions of both x_j and $x_{j}^{'}$ relative to $\hat{x}$ in the feature space. It is an approximate measure of the distance of $\hat{x}$ from the set of samples of the same class as x_j.

Let $C ≜ C (\hat{x})$ , C_j ≜ C (x_j) and $C_{j}^{'} ≜ C (x_{j}^{'})$ and denote Kullback-Leibner distance by D (. ∥ .). Then $\begin{array}{l} I (C; C_{j}^{'}, C_{j} | x_{j}, x_{j}^{'}, \hat{x}) \\ = I (C; C_{j} | x_{j}, \hat{x}) + I (C; C_{j}^{'} | C_{j}, x_{j}^{'}, x_{j}, \hat{x}) \\ = E_{C_{j}} [D (p_{C | C_{j}} ‖ p_{C})] + E_{C_{j}, C_{j}^{'}} [D (p_{C | C_{j}, C_{j}^{'}} ‖ p_{C | C_{j}})] \end{array}$ (4)

While the estimate in the original fuzzy k-NN rule of Equation 1 attains p_{C|C_j} beyond a priori knowledge p_C through the use of d_j, it is the goal of the current work to make the estimate of $μ_{i} (\hat{x})$ additionally attain $p_{C | C_{j}, C_{j}^{'}}$ beyond p_{C|C_j} thru the use of $d_{j}^{'}$ .

As with Equation 1, the estimates formed using both d_j and $d_{j}^{'}$ for all the k nearest neighboring samples are additively pooled together to get the final estimate.

Section 2 reviews other prior work. Section 3 covers the proposed class membership grade estimation method. Section 4 the enhancement of the basic method by its conditional application. Experimental results demonstrating the performance advantage of the proposed method are reported in Section 5. This section also discusses the complexity of the proposed method and gives insights gained from dimensionality reduction. Section 6 presents concluding remarks.

2 Prior work

The theoretical works of [8, 9] prove the performance bound and the convergence rate of the fuzzy generalized nearest neighbor rule.

In fuzzy rule based determination of the initial class membership grades of known samples [10], the parameters of the antecedent and consequent sets as well as feature weights are optimized with a genetic algorithm. In [11], the distance to a training sample is modified by dividing by the level of the subset to which the sample belongs and the class memberships are expressed as functions of numbers of neighbors of each class and the class memberships of the neighbors in a fuzzy decision rule. Assigning weights to neighbors based on initial membership values, distance as well as standard deviation of class membership values gives more credibility to neighbors located parallel to the vector connecting the class means [12]. The fuzzy k-NN voting scheme in [13] achieves the effect of automatically varying the effective number of nearest neighbors according to the local density of the known samples. An estimation theoretic approach [14] determines the weights used to emphasize the membership grades of the neighbors in fuzzy k-NN rule as the optimal weights of an unbiased linear mean squared error estimator for the feature vector of the unknown sample.

The intuitionistic fuzzy k-NN approach [15] thresholds both the membership and nonmembership functions to define a voting function as either the membership function, its negative, or zero. The voting functions for the k nearest neighbors are averaged and shifted to yield the class membership grades.

In the adaptable nearest neighbor selection method [16], prototype like neighbors (high membership in one class) are emphasized in deciding the final membership for the unknown sample. Pruning of the training set to a prototype set [17] has also been proposed to reduce the complexity of application of fuzzy k-NN. Similarly, the class membership grade estimates for each training vector can be computed by fuzzy k-NN and a training vector can be deleted if the difference of the actual and estimate vectors of the class membership grades has a large norm [18].

The uncertainty in the choice of initial k (no. nearest of neighbors) can be managed by extending membership grades for each known sample as interval type 2 fuzzy memberships [19]. Here, classification is performed by first assigning the weighted average of the interval type 2 fuzzy memberships of its nearest neighbors to the unknown sample and then performing type reduction. Another approach [20] is to optimize the k_init and p parameters by a genetic algorithm which has been extended [21] by allowing the k_init and p to assume sets of values where the set elements are determined by the CHCevolutionary optimization. This leads to intervals for both the class membership grades of nearest neighbors and the weights assigned to these grades. A lexicographic order is applied to the sums of intervals of the nearest neigbors for all classes to get the class label for the test sample.

In [22], basic belief masses are assigned to each subset of the set of classes for each of the k nearest neighbors of the unknown sample and combination of these masses using Dempster’s rule. Later, [23] optimizes the parameters of the classification rule of [22].

For numerical features, [24] generalized the fuzzy k-NN rule by modelling each feature as a linguistic variable using FCM [6] and strong fuzzy partitioning. Similarities of individual features and votes of training samples are then aggregated by using operators to construct the unknown sample’s membership grades.

In the literature, the notable fuzzy k-NN classifier application areas are bioinformatics, image processing and biomedical signal processing [25 –28].

3 The Basic Method

Fig. 1 a) illustrates the essential mechanism of the proposed method on a 2-D example involving two linearly separable classes of known samples. The class boundaries are indicated by dotted lines and the unknown sample has an equal number of known nearest neighbors of each class (a total of k = 4) by design. Feature vectors of known data samples of each class are contained in an envelope (convex hull) that can be formed by interpolating these vectors. Rather than using the distance d_j of j-th nearest neighboring known sample to the unknown sample, the proposed approach uses the distance, $d_{j}^{'}$ , of the envelope of c (x_j) to the unknown sample. In the example of Fig. 1 a), the unknown sample is labelled as class B by the original fuzzy k-NN rule (crisp initialization) since it is closer to the two known nearest neighboring samples of class B than the two known nearest neighboring samples of class A. However, using $d_{j}^{'}$ in place of d_j in the fuzzy rule changes the classification result in favor of the more meaningful label of class A. Had the set of samples been denser in the feature space, the original fuzzy rule would also have yielded label A as the classification result. Put differently, the geometric structures of class regions are exploited to compensate for the sparseness of the dataset by using $d_{j}^{'}$ in place of d_j.

Fig. 1

a) When $d_{j}^{'}$ is used in place of d_j in the fuzzy k-NN rule, the unknown sample is assigned the label of the class with closer envelope (class A). b) $\hat{x}$ is outside the convex hull of class C (x_j). The derived distance $d_{j}^{'} = ∥ v_{j}^{'} ∥$ is the shortest distance from $\hat{x}$ to the line $l_{j}^{'}$ connecting x_j to another sample $x_{j}^{'}$ of class C (x_j) and inversely correlates with the confidence in the class. c) Contrary to (a) and (b), when $\hat{x}$ is inside the convex hull, $d_{j}^{'}$ , the distance to boundary of class on the left (right) is small (large) indicating a lower (higher) confidence in the class.

As illustrated in Fig. 1 b), for an unknown sample outside the envelope of a particular class, $d_{j}^{'} = ∥ v_{j}^{'} ∥ = d_{j} \sin (α_{j})$ where α_j is the smallest angle that $v_{j} = \hat{x} - x_{j}$ makes with a line segment $l_{j}^{'}$ connecting the nearest neighbor x_j to another labelled sample $x_{j}^{'}$ of the same class. The proposed distance ${\tilde{d}}_{j} = 0.5 (d_{j} + d_{j}^{'})$ (5) replaces the original distance d_j in the fuzzy rule.

The proposed distance ${\tilde{d}}_{j}$ is meaningful only if $\hat{x}$ , the feature vector of the unknown sample, lies outside the envelope of each class (see Section 4). When the convex hull condition is satisfied, $d_{j}^{'}$ could be interpreted as the shortest distance of $\hat{x}$ to a fictitious sample x_P of the same class as x_j and $x_{j}^{'}$ , assumed to lie on the line containing $l_{j}^{'}$ , i.e. $x_{P} = x_{j} + {proj}_{\vec{x_{j} x_{j}^{'}}} v_{j}$ . In order for $x_{j}^{'}$ to provide the largest information on the class identity of $\hat{x}$ , it is required to be of the same class as x_j even though it can be a far off neighbor of x_j.

It is possible for $\min (∥ x_{P} - x_{j} ∥, ∥ x_{P} - x_{j}^{'} ∥)$ to be large or $x_{P} \notin l_{j}^{'}$ . While the former necessitates interpolation of class identity using distant samples, the latter relies on extrapolation of the class identity from $x_{j}^{'}$ to x_P if α_j < π/2, and x_j to x_P if α_j > π/2. To handle such cases with unreliable $d_{j}^{'}$ , ${\tilde{d}}_{j}$ is opted as a more robust estimate than $d_{j}^{'}$ .

In the general D dimensional case, the proposed fuzzy k-NN rule with crisp initialization is $μ_{i} (\hat{x}) = \frac{\sum_{j : j \in K_{i}} {\tilde{d}}_{j}^{- p}}{\sum_{j : j \in K} {\tilde{d}}_{j}^{- p}}$ (6)

where ${\tilde{d}}_{j} = 0.5 d_{j} (1 + α_{j})$ and α_j is defined as in the 2-D example discussed above. The use of (1 + α_j) in place of (1 + sin (α_j)) effectively increases the weight given to neighbors with small angles since $\frac{\sin (α)}{α}$ is a strictly decreasing function of α_j. This modification of the proposed distance yields slightly more robust performance in the D dimensional feature space for D > 2. One can argue for this modification by referring to the 3-D drawing in Fig. 2. While for small α_j, the shortest distance of the unknown sample’s feature vector to a vector interpolated from feature vectors of the same class as x_j is possibly much shorter than that predicted by its distance to the line $l_{j}^{'}$ , for large α_j, the difference between the two is negligible.

Fig. 2

Left: If α_j, the angle between v_j and $l_{j}^{'}$ is large, the difference between $d_{j}^{'} = ∥ v_{j}^{'} ∥$ and the magnitude of v_j’s projection onto the normal of the shaded face (the shortest distance to a vector interpolated from feature vectors of the same class as x_j) is relatively small. Right: If α_j is small, it is likely that $d_{j}^{'}$ is much larger than the magnitude of v_j’s projection onto the normal of the shaded face.

Fig. 3

Two classes of samples uniformly distributed on [0,1]x[0,0.5] and [0,1]x[0.5,1]. Known samples: +, unknown samples: •, nearest neighbors: □. The colors indicate the classification decisions for k = 3 nearest neighbors. Left: Classification using d_j, Right: Classification using ${\tilde{d}}_{j}$ . The blown up sections show the correction of a classification decision for one unknown sample by use of ${\tilde{d}}_{j}$ .

For the fuzzy k-NN rule with fuzzy initialization, each nearest neighbor x_j could have multiple class memberships. Again, let $v_{j} = \hat{x} - x_{j}$ . For each class i with μ_i (x_j) >0, the smallest angle α_j,i that v_j makes with the line segment $l_{j, i}^{'}$ connecting x_j to a sample of assigned class i, $x_{j, i}^{'}$ , is determined. Based on the proposed distance ${\tilde{d}}_{j, i} = d_{j} (1 + α_{j, i})$ , the class membership grade of $\hat{x}$ is determined as $μ_{i} (\hat{x}) = \frac{\sum_{j : j \in K} μ_{i} (x_{j}) {\tilde{d}}_{j, i}^{- p}}{\sum_{j : j \in K} {\tilde{d}}_{j, i}^{- p}}$ (7)

In this case, the proposed distance ${\tilde{d}}_{j, i}$ depends on the class index as well as the nearest neighbor index.

4 Conditional application of the proposed distance

Assuming that the density of the known samples is fairly high, the feature space position of an unknown sample relative to the convex hull of the known samples of a certain class offers a simple means for assessing the confidence in employing the proposed distance.

Specifically, three cases can be identified.

i) $\hat{x} \notin conv (S_{c (x_{j})})$ ∀j where S_i is the set of known samples of class i (as in Fig. 1 a), b)). In this case, $d_{j}^{'}$ correlates with the inverse of the confidence in c (x_j).

ii) $\hat{x} \in conv (S_{i_{0}})$ for one class i₀ s.t. c (x_j) = i₀ ∀j even for a sparse distribution of samples in local neighborhood of $\hat{x}$ . Since the unanimous voting in the fuzzy rule yields the same classification result when ${\tilde{d}}_{j}$ replace d_j, ${\tilde{d}}_{j}$ need not replace d_j.

iii) | {i : ∃ j s . t . c (x_j) = i} |>1 and ∃i s.t. $\hat{x} \in conv (S_{i})$ . Fig. 1 c) suggests that the derived distance $d_{j}^{'}$ need not correlate with the inverse of the confidence in the class c (x_j). Hence, in this case, d_j is employed instead of ${\tilde{d}}_{j}$ to determine the class of $\hat{x}$ .

Therefore, the first case is distinguished from the other two by testing $\hat{x} \notin conv (S_{c (x_{j})})$ ∀j. If the test returns false, the proposed distance ${\tilde{d}}_{j}$ is not used and Equation 1 is employed with d_j as a fallback solution.

Due to its convergence speed and runtime, Wolfe’s method [29] as described in [30] has been adopted for testing

\hat{x} \notin conv (S_{c (x_{j})})

\hat{x}

is first subtracted from the feature vector of each known sample of a particular class to get a set of vectors denoted as P. The implementation in Algorithm 1 then tests

\vec{0} \notin conv (P)

Algorithm 1
Wolfe’s algorithm implementation [30]

\vec{q_{1}} = \arg \min_{\vec{p} : \vec{p} \in P} ∥ \vec{p} ∥

, initialize λ_i = 1_1,i

U_{Dx 1} = {\vec{q_{1}}}

, Maintain

\vec{x} = \sum_{i : q_{i} \in U} λ_{i} \vec{q_{i}}

While

∥ \vec{x} ∥ > ϵ

\vec{q} = arg min_{\vec{p} : \vec{p} \in P} ∥ {proj}_{\vec{x}} \vec{p} ∥

∥ {proj}_{\vec{x}} \vec{q} ∥ \geq ∥ \vec{x} ∥

Declare

\vec{0} \notin conv (P)

else

Add new column

\vec{q}

to U_DxI as

{\vec{q}}_{I + 1}

While True do

Determine

\vec{y} = arg min_{z \in aff (U)} ∥ z ∥

(i.e.

\vec{y} = U \vec{α}

\vec{α} = (U^{T} U)^{- 1} \vec{1} / {\vec{1}}^{T} (U^{T} U)^{- 1} \vec{1}

)

ifα_i ≥ 0 then Break

else

θ = min {min_{i : α_{i} < λ_{i}} \frac{λ_{i}}{(λ_{i} - α_{i})}, 1}

λ_i = θα_i + (1 - θ) λ_i

Remove columns from U_DxI

for which λ_i = 0

end if

end while

\vec{x} \leftarrow \vec{y}

end if

end while

Declare

\vec{0} \in conv (P)

In the case of crisp initialization, the conditionally modified distance ${\hat{d}}_{j}$ replaces ${\tilde{d}}_{j}$ in Equation 6 where ${\hat{d}}_{j} = {\begin{matrix} {\tilde{d}}_{j} & if \hat{x} \notin conv (S_{c (x_{j})}) \\ d_{j} & otherwise \end{matrix}$ (8)

For fuzzy initialization, for each class i that nearest neighbor x_j has nonzero membership grade, the inclusion of $\hat{x}$ in the convex hull of S_i is tested separately. The conditionally modified distance ${\hat{d}}_{j, i}$ replaces ${\tilde{d}}_{j, i}$ in Equation 7 where ${\hat{d}}_{j, i} = {\begin{matrix} {\tilde{d}}_{j, i} & if \hat{x} \notin conv (S_{i}) \\ d_{j} & otherwise \end{matrix}$ (9)

5 Experimental results

In Section 5.1, a figure of merit called the grade advantage is reported alongside accuracy for each classification task. It reflects the confidence in the classification decision-making by comparing the grades of the correct class $c_{\hat{x}}^{*}$ and the class $c_{\hat{x}}^{†}$ through $γ = \sum_{\hat{x}} \frac{μ_{c_{\hat{x}}^{*}} (\hat{x})}{μ_{c_{\hat{x}}^{*}} (\hat{x}) + μ_{c_{\hat{x}}^{†}} (\hat{x})}$ (10) where $c_{\hat{x}}^{†} = arg max_{i : i \neg = c_{\hat{x}}^{*}} μ_{i} (\hat{x})$ . Each unknown sample contributes a value between 0 and 1 to γ. The larger the γ is, the better the fuzzy k-NN rule is performing in terms of assigning grades to the classes. For a small size sample (small dataset), γ is a better predictor of the classifier performance on the population (larger dataset) than accuracy is.

In the next two subsections, all performance figures reported are based on five fold cross-validation, and the tested fuzzy k-NN rules employ unnormalized attributes and the Euclidean distance metric for selecting the nearest neighbors. The parameters of the fuzzy k-NN rules are set as k_init = k and p = 2.

5.1 Comparisons with original fuzzy k-NN rule

In the first experiment, two linearly separable classes of 450 unknown and 50 known samples each, have been synthetically generated to have feature vectors uniformly distributed on [0,1]x[0,0.5] and [0,1]x[0.5,1]. In Fig. 5.1, it is seen that more than 40% (14 out of 33) of the unknown samples that are incorrectly classified by the crisp initialized fuzzy k-NN rule when original distance d_j is employed (left) are correctly classified when the proposed distance ${\tilde{d}}_{j}$ is employed (right).

Secondly, two-class Gaussian and truncated Gaussian datasets with independent features have been synthetically generated. Data sets G1-G3 have 0 mean for the features of both classes and standard deviation of 1 and 2 for the features of the first and second classes, respectively. Data sets G4-G10 have standard deviation of 1 for the features of both classes and means of 0 and 1 for the features of the first and second classes, respectively. Truncated Gaussian datasets TG3,TG6,TG9 have been obtained from G3,G6,G9, respectively, by discarding those samples with feature vectors having a distance from the class mean vector of greater than the feature std dev (TG6), or twice the feature std dev (TG3, TG9). First row of Table 1 also displays the dimension D and size n of each dataset as (D,n) under its label. Since the performance advantage is largely dependent on the few choices of parameter combinations, average performances are not reported. The results are summarized in Table 1.

Table 1
Classification error ϵ%, grade advantage γ of fuzzy k-NN rule employing original and conditionally applied proposed distance (bold:winner)

Rule (init.)	k	Measure	G1	G2	G3	G4	G5	G6	G7	G8	G9	G10	TG3	TG6	TG9
Rule (init.)	k	Measure	(5, 10²)	(5, 10³)	(5, 10⁴)	(2, 10²)	(2, 10³)	(2, 10⁴)	(5, 10²)	(5, 10³)	(5, 10⁴)	(15, 10³)	(5, 10⁴)	(2, 10⁴)	(5, 10⁴)
Original	5	ϵ%	22	20.8	19.25	26	28.8	29.57	20	16.5	16.06	4.1	9.58	8.12	7.08
(crisp)	γ	73.03	73.2	76.39	67.43	68.47	68.04	74.2	78.73	80.04	93.17	86.62	91.79	91.5
Cond.	5	ϵ%	14	19.2	17.96	25	28.65	29.54	19	16.3	15.9	4.1	8.67	8.05	6.46
(crisp)	γ	76.26	76.8	78.18	68.2	68.65	68.08	75.67	80.31	80.59	93.51	87.93	91.92	92.58
Original	5	ϵ%	39	22.6	17.94	29	25.85	27.03	19	15.9	14.74	4	10.4	7.95	6.81
(fuzzy)	γ	65.48	69.8	75.14	67.12	68.89	68.01	72.21	77.94	79.7	92.33	86.35	91.55	91.6
Cond.	5	ϵ%	26	18.4	16.61	29	25.75	27.04	18	15.5	14.63	4	7.65	7.85	6.28
(fuzzy)	γ	68.81	75.08	77.75	68.6	69.1	68.05	74.69	79.83	80.34	92.86	88.09	91.69	92.61
Original	10	ϵ%	36	22.3	17.95	29	27.95	28.65	17	15.4	14.6	3.6	9.85	8.1	6.83
(crisp)	γ	67.35	71.34	75.87	67.62	68.39	68.1	74.03	77.89	79.96	92.62	85.22	91.67	91.22
Cond.	10	ϵ%	26	18.1	16.86	29	27.9	28.63	18	15.3	14.56	3.8	7.5	8.05	6.19
(crisp)	γ	70.54	75.92	78.07	68.87	68.59	68.15	75.77	79.79	80.57	93.16	87.25	91.86	92.53
Original	10	ϵ%	47	24.7	17.59	28	25	26.05	18	14.9	13.82	3.5	11.38	8.17	6.26
(fuzzy)	γ	61.28	67.51	74.26	67.15	68.67	68.03	72.25	77.11	79.53	91.58	83.96	91.34	91.13
Cond.	10	ϵ%	41	17.5	16.24	28	24.95	26.06	19	14.8	13.81	3.4	7.34	8.15	5.86
(fuzzy)	γ	64.15	73.65	77.41	68.87	68.89	68.08	74.69	79.28	80.23	92.31	86.82	91.57	92.44

Although there is a conspicuous accuracy advantage with the proposed method for datasets G1-G3 with zero mean classes, the advantage is very slim for datasets G4-G10 with different class means since a larger fraction of test samples fall into case ii) of Section 4 for datasets G4-G10. Truncation of the class distributions leads to larger accuracy gains since, in truncated datasets, a much larger fraction of test samples that are located outside the convex hulls of the training samples benefit from the proposeddistance.

Ten datasets from the UCI repository have also been used to compare the classification performance of the proposed fuzzy k-NN rule against that of the original one. Table 2 provides results with unconditional and conditional application of the proposed distance.

Table 2

Classification error ϵ%, grade advantage γ for fuzzy k-NN rule employing original and unconditionally/conditionally applied proposed distance

Rule (init.)	k	Measure	Pima	Ionosphere	Wine	Occupancy	Iris	Glass	Ecoli	Yeast	Waveform	Seeds	AVG.
Original	5	ϵ%	29.17	17.38	23.03	1.07	3.33	30.37	14.58	42.39	17.9	10.95	19.02
(crisp)	γ	66.69	81.97	72.51	98.7	95.67	68.39	81.08	53.93	77.39	89.97	78.63
Uncond.	5	ϵ%	28.78	16.24	22.47	0.99	4	27.1	13.69	42.05	17.84	11.43	18.46
(crisp)	γ	67.62	82.67	73.61	98.79	95.99	69.22	82.22	54.55	77.87	89.59	79.21
Cond.	5	ϵ%	27.22	16.24	21.91	1.06	4	26.16	13.39	41.24	17.84	10.95	18
(crisp)	γ	69.22	82.67	74.36	98.71	95.97	69.25	83.29	54.92	77.87	89.78	79.6
Original	5	ϵ%	27.99	18.8	28.09	0.99	3.33	31.78	14.29	40.57	17.04	11.9	19.48
(fuzzy)	γ	66.46	80.07	70.5	98.58	94.8	65.73	80.5	54.75	75.79	89.57	77.68
Uncond.	5	ϵ%	27.34	17.95	24.72	0.98	3.33	29.91	14.59	40.17	16.7	10	18.57
(fuzzy)	γ	67.48	80.92	71.67	98.65	95.58	66.75	81.96	55.33	76.45	89.04	78.39
Cond.	5	ϵ%	25.26	17.95	23.03	0.99	3.33	30.38	13.4	40.64	16.7	10	18.17
(fuzzy)	γ	69.41	80.92	72.65	98.59	95.64	66.77	83.14	55.41	76.45	89.27	78.83
Original	10	ϵ%	26.56	17.38	21.91	1.04	2.67	32.71	14.58	40.57	16.54	9.52	18.35
(crisp)	γ	67	80.14	72.32	98.65	94.9	66.45	80.1	54.96	76.96	89.1	78.06
Uncond.	10	ϵ%	25.52	17.1	24.16	0.96	3.34	28.97	14.28	40.03	16.32	8.57	17.93
(crisp)	γ	68.18	80.99	73.7	98.76	95.53	67.34	81.6	55.52	77.55	89.9	78.91
Cond.	10	ϵ%	25	17.1	23.03	1.03	3.34	30.84	14.28	40.57	16.32	8.57	18.01
(crisp)	γ	69.8	80.99	74.51	98.66	95.65	67.25	82.64	55.62	77.55	90.12	79.28
Original	10	ϵ%	26.69	19.37	24.72	1.02	4	32.71	14.29	40.36	16.14	10.95	19.03
(fuzzy)	γ	66.2	78.08	70.58	98.53	93.42	63.53	79.04	54.97	75.28	88.35	76.8
Uncond.	10	ϵ%	25.26	18.8	23.03	0.99	4.67	30.84	14.29	40.23	15.74	10	18.39
(fuzzy)	γ	67.41	79.11	71.93	98.61	94.56	64.62	80.82	55.49	76.04	89.07	77.77
Cond.	10	ϵ%	24.74	18.8	21.35	1.01	4.67	32.24	13.4	41.03	15.74	10	18.3
(fuzzy)	γ	69.31	79.12	72.83	98.54	94.81	64.48	81.96	55.48	76.04	89.37	78.19

It is observed that the unconditional or conditional application of the proposed distance provides advantages in error rates in 8 out of 9 datasets for k=5 nearest neighbors, and 7 out of 9 datasets for k=10 nearest neighbors. On the average, both strategies yield advantages in error rates for all combinations of initialization type and k value. The average percentage reductions in classification error rate due to the unconditional application of proposed distance are 2.94% and 4.67%, respectively, for crisp and fuzzy initialization when k=5. Conditional application of the proposed distance increases these reductions to 5.36% and 6.72%, respectively, for crisp and fuzzy initialized fuzzy k-NN. When k=10, the respective reductions in error rates are 2.29% and 3.36% due to the unconditional application of proposed distance and 1.85% and 3.84% due to its conditional application. The advantage of the use of proposed distance is less for k=10 than for k=5.

The advantage of conditional application is not immediately clear, since, for k=10 and crisp initialization, unconditional application of the proposed distance has a larger average percentage reduction in error rate than its conditional application. However, for all combinations of k with initialization types, conditional application of the proposed distance yields a larger average γ than its unconditional application.

For statistical comparison of two classifiers over multiple datasets, [31] recommends the Wilcoxon signed rank test over the paired t-test. The Wilcoxon test statistic w^- is formed by taking the absolute differences of the error rates, ranking them and finally summing the ranks for which the error rate of the proposed fuzzy k-NN rule exceeds the error rate of the original fuzzy k-NN rule. From the exact p-values (p = Pr {W^- ≤ w^-}) reported in Table 3, it can be concluded that fuzzy initialization as well as k=5 nearest neighbors generally yield significantly lower error rates (higher classification accuracies).

Table 3

Wilcoxon statistic and p-values (×10^-2) calculated from Table 2 H₀ : accuracy of proposed rule = accuracy of original rule

	Unconditionally applied				Conditionally applied
	crisp k=5	fuzzy k=5	crisp k=10	fuzzy k=10	crisp k=5	fuzzy k=5	crisp k=10	fuzzy k=10
W ^-	12	2	15	5	3	1	12	11
p val	6.45	0.391	11.5	1.76	0.781	0.391	12.3	4.98

In the D dimensional feature space, the original fuzzy k-NN rule has a complexity of O (Dn). Since each known sample of the same class as a nearest neighbor could be its most informative neighbor, the unconditional application of the proposed distances has complexity O ((k + 1) Dn). For large n, the Wolfe’s algorithm complexity of O (n³ + Dn²) (see [30]) dictates the overall complexity in the conditional application of the proposed distance. For small n, the additional complexity needed to compute the proposed distance may actually be larger than the complexity of Wolfe’s algorithm so that it may be more advantageous to conditionally apply the proposed distance. In Table 4, average multiplicative runtime increase factors for the UCI datasets are reported for different scenarios. On the average, the complexity increase is the greatest for k=10 and fuzzy initialization, and the conditional application of the proposed distance yields a modest complexity increase over its unconditional application.

Table 4

Average multiplicative run time increase factors

	crisp k=5	fuzzy k=5	crisp k=10	fuzzy k=10
Unconditionally applied	2.3	3.7	4.4	5.5
Conditionally applied	6.1	9.3	8.6	11.9

5.2 Insights gained by dimensionality reduction

Linear Discriminant Analysis has renown as a preprocessing transformation for retaining the non-redundant information in the data and has been applied in this work to gain some insight into how the dimensionality of the data affects the performance gain of the proposed fuzzy k-NN rule over the original fuzzy k-NN rule. Table 5 shows on four datasets of Section 5.1 that when LDA is applied, the performance advantage of the proposed rule is largely reduced or nullified.

Table 5
Effect of dimensionality reduction with LDA on error ϵ% (k=5)

	Orig. crisp		Prop. crisp		Orig. fuzzy		Prop. fuzzy
	w/o LDA	with LDA	w/o LDA	with LDA	w/o LDA	LDA LDA	w/o LDA	with LDA
Ion.	17.38	14.53	16.24	14.25	18.80	12.54	17.95	12.54
Pima	29.17	30.08	27.21	29.95	27.99	27.08	25.26	26.95
Wine	23.03	0.56	21.91	0.56	28.09	0.56	23.03	0.56
Glass	30.37	35.51	26.17	35.05	31.78	35.98	30.37	35.05

While any pair of classes may be separated by a hyperplane in a local region of a high dimensional space, such structure in the data may vanish for multimodal class distributions when it is projected to a lower dimensional space. Therefore, the proposed fuzzy k-NN rule may lose its power in a reduced dimensional space where test samples fall inside convex hulls of classes.

5.3 Competitiveness against common classifiers in a remote sensing image classification application

Forest Type Mapping dataset of 523 samples in UCI repository contains remote sensing data of a forested area in Japan. 4 different forest types are mapped based on the spectral characteristics using ASTER satellite imagery. Spectral measurements and the deviations of these measurements from their predictions are specified as features. Alongside the ten fold stratified cross validation classification accuracy results achieved by the custom implementations of the crisp and fuzzy initialized versions of the original and proposed fuzzy k-NN rules, the corresponding results achieved by Weka implementations of crisp initialized fuzzy k-NN classifier and the common classifiers, Support Vector Machines (SVM), Random Forests (RF), Naive Bayes (NB) are listed in Table 6. The optimal combination of regularization and radial basis width parameters for SVM and the number of features and tree depth for RF have been determined by grid search. Best k : k ∈ {1, 30} has been determined by exhaustive search for the original fuzzy k-NN rule with crisp initialization and has been used in the other fuzzy k-NN rules.

Table 6
Classification errors ϵ% on the Forest Type Mapping dataset

Orig. crisp	Orig. fuzzy	Prop. crisp	Prop. fuzzy	Weka Orig. crisp	Weka SVM	Weka RF	Weka NB
9.94	9.75	9.75	9.37	10.51	10.13	10.71	13.96

In this application domain, the proposed fuzzy k-NN rule has a greater performance advantage over SVM, RF and NB than the original fuzzy k-NN rule and could be the classifier of choice if its higher computational complexity is not an issue.

5.4 Performance comparisons with fuzzy KNN type algorithms on Keel datasets

Experiments have also been conducted on the predesigned, 10-fold stratified cross validation (no covariate shift reduction) fold partitions of the 44 datasets from the KEEL-Dataset repository reported in [21]. The performances of the conditionally applied proposed and original fuzzy k-NN rules are compared against the performances of the 8 nearest neighbor type benchmark algorithms reported in Tables A.7-A.10 of [21]. Six of these benchmark algorithms other than EF-kNN-IVFS [21] and kNN, are the best performing subset of fuzzy nearest neighbor type classifier algorithms comparatively evaluated in [32].

Table 7 lists the maximum of accuracy values over {3, 5, 7, 9} neighbors for the classification of each dataset with each benchmark algorithm (taken from Tables A.7-A.10 of [21]) and the original and proposed fuzzy k-NN rules (k_init = 3, p = 2) with crisp and fuzzy initialization. A statistical analysis of these results has been performed by taking the proposed fuzzy rule (with either initialization) as the control treatment (algorithm) and comparing it against the original fuzzy rule (with the same initialization) and the other 8 algorithms in [21] by using hypothesis testing. In the first stage, the Friedman hypothesis test has been used as per [31] with the null hypothesis that the performances of m = 10 algorithms are indistinguishable. Let a_ij represent the accuracy of j-th treatment for i-th dataset. In a randomized block design, where each block represents one dataset, ranks, r_ij, of the accuracies, a_ij, are obtained by ranking the accuracies from 1 to m within each block. The average ranks for the algorithms are

{\bar{r}}_{j} = \frac{1}{b} \sum_{i = 1}^{b} r_{ij}

for b blocks. The Friedman statistic

S = \frac{12 b}{m (m + 1)} \sum_{j = 1}^{m} ({\bar{r}}_{j} - \frac{m + 1}{2})^{2}

(11)

Table 7

Accuracy % / Kappa (×10^-2) (maximums over k ∈ {3, 5, 7, 9} neighbors) for the classification of the Keel datasets

Data set

EF-kNN-IVFS [21]

kNN [3]

D-SKNN [22]

IF-KNN [15]

FENN [18]

IT2FKNN [19]

GAfuzzy-kNN[20]

PFKNN [17]

Original crisp

Proposed crisp

Original fuzzy

Proposed fuzzy

Appendicitis

87.91/51

87.91/57

87.00/55

87.91/57

Balance

89.28/76

89.44/80

88.63/79

89.27/80

88.95/80

89.60/81

87.52/77

89.44/80

90.40/82

88.96/80

90.40/82

Banana

89.83/77

89.89/80

89.91/80

90.23/80

89.62/79

89.28/78

89.57/79

88.96/78

88.94/78

89.42/79

Bands

71.85/19

71.46/38

71.02/36

69.40/30

70.35/32

69.63/32

69.67/28

73.63/42

73.39/42

71.31/35

72.46/37

Bupa

65.75/18

62.53/22

62.26/22

62.88/23

66.92/32

65.57/28

65.40/29

64.31/27

66.36/30

66.06/30

66.59/31

Cleveland

56.64/30

56.92/28

56.59/27

56.62/28

58.24/30

57.64/30

57.55/31

57.31/30

56.99/30

56.96/30

56.95/29

57.27/29

Dermatology

96.63/95

96.90/96

96.35/95

96.90/96

74.86/69

96.90/96

96.62/96

Ecoli

83.33/72

82.45/75

81.86/74

82.77/76

82.17/75

83.64/77

83.92/77

83.64/77

82.76/76

83.34/77

83.65/77

Glass

73.66/53

70.11/59

71.00/60

69.99/59

69.39/58

73.82/64

73.38/63

71.09/61

74.35/65

74.27/65

72.83/63

73.35/63

Haberman

71.25/15

71.55/14

71.23/14

72.56/14

74.84/14

68.98/9

70.30/13

70.59/9

68.30/13

68.29/12

68.97/13

68.65/10

Hayes-roth

66.88/31

30.0/-14

60.00/34

50.63/19

58.13/30

65.00/42

64.38/42

70.00/51

75.00/69

65.63/44

70.00/52

Heart

80.00/58

80.74/61

79.63/59

80.37/60

78.52/57

80.74/61

81.85/63

80.74/61

Hepatitis

86.51/52

89.19/44

89.19/64

85.02/53

84.67/53

84.32/52

86.09/55

87.76/49

85.08/46

84.67/53

Ionosphere

87.17/61

96.00/94

95.33/93

96.00/94

85.75/66

86.60/69

84.61/64

85.18/65

Iris

95.33/94

85.18/65

84.90/64

84.88/64

84.61/64

87.75/72

81.74/56

95.33/93

96.00/94

Led7Digit

71.60/69

45.20/39

73.00/70

72.60/69

71.60/68

71.40/68

65.20/61

71.20/68

71.20/60

71.60/68

71.40/68

Mammo.

80.12/54

81.71/63

81.34/63

80.86/62

80.62/61

80.25/60

79.53/59

80.37/61

79.74/59

79.99/54

80.37/61

80.97/62

Marketing

31.11/20

30.43/20

30.98/20

30.50/20

31.60/21

31.15/21

29.98/20

30.82/21

30.72/21

31.03/19

30.78/21

31.05/21

Monk-2

91.97/56

96.29/93

81.82/63

84.84/69

92.42/85

86.01/72

94.75/89

94.26/88

89.69/79

93.35/87

Movement

86.11/73

78.61/77

83.06/82

80.56/79

76.67/75

84.17/83

85.00/84

74.72/73

86.94/86

88.61/88

84.72/84

85.00/84

Newthyroid

96.80/85

95.37/90

94.44/87

94.00/86

95.87/91

96.34/92

92.58/82

97.23/94

96.77/93

96.32/92

96.77/93

Page-blocks

96.22/74

95.91/77

95.81/76

95.76/75

96.03/77

96.20/78

95.83/76

96.11/78

96.33/77

95.96/77

96.14/78

Penbased

99.38/99

99.32/99

99.34/99

99.33/99

99.31/99

99.34/99

99.33/99

98.89/99

99.40/99

99.44/99

99.34/99

99.38/99

Phoneme

90.49/68

88.49/72

88.56/72

88.45/72

88.29/71

89.95/75

89.62/74

89.49/74

90.64/77

90.69/77

89.64/74

89.54/74

Pima

74.49/40

73.19/38

74.10/40

74.23/38

74.23/40

74.36/40

72.92/37

73.06/38

74.62/41

73.45/38

75.01/42

Ring

73.77/6

71.82/43

64.38/28

56.24/12

62.41/24

66.32/32

58.81/17

71.84/43

72.59/45

63.07/26

66.61/33

Satimage

90.88/87

90.94/89

90.96/89

90.68/88

90.24/88

90.77/89

90.41/88

89.73/87

90.96/89

91.13/89

90.63/88

91.08/89

Segment

96.88/94

95.41/95

96.02/95

95.80/95

95.45/95

96.41/96

96.54/96

81.43/78

96.80/96

97.01/97

96.36/96

96.62/96

Sonar

86.00/59

83.10/66

82.10/64

81.64/63

82.57/65

85.52/70

79.24/58

85.00/69

83.55/66

Spambase

91.86/76

89.78/78

90.34/80

90.17/79

89.52/78

91.23/82

91.26/82

90.30/80

91.69/83

92.47/84

91.15/81

91.73/83

Spectfheart

78.30/25

77.58/32

77.15/26

78.28/18

78.29/26

78.69/30

77.18/24

77.58/32

78.69/27

78.69/25

Tae

71.00/19

45.79/19

53.75/31

51.08/26

68.33/52

70.25/55

67.00/50

67.04/48

67.67/51

66.33/49

Texture

99.00/98

98.75/99

98.76/99

98.62/98

98.49/98

98.78/99

98.91/99

88.91/88

98.93/99

99.00/99

98.75/99

98.85/99

Thyroid

93.99/29

94.00/40

93.96/38

93.92/36

93.75/32

93.97/37

93.93/38

93.78/32

94.07/41

94.50/45

93.97/37

94.33/40

Titanic

78.83/43

76.24/41

78.19/45

78.33/44

78.74/44

71.61/24

78.19/45

78.19/43

78.83/45

Twonorm

97.19/94

97.11/94

97.18/94

97.09/94

97.08/94

96.97/94

97.12/94

97.11/94

97.14/94

97.20/94

Vehicle

70.57/58

72.34/63

72.81/64

70.57/61

69.75/60

71.16/62

68.08/57

72.58/63

73.06/64

71.40/62

73.18/64

Vowel

99.39/96

97.78/98

97.88/98

97.47/97

96.36/96

98.48/98

99.09/99

92.22/91

98.79/99

98.89/99

98.38/98

98.48/98

Wdbc

97.18/92

97.18/94

97.01/94

96.83/93

97.18/94

97.01/93

97.36/94

97.18/94

97.01/93

97.18/94

Wine

96.05/95

96.63/95

97.19/96

96.60/95

96.08/94

97.19/96

Wine q.red

69.36/30

55.60/28

60.35/36

59.66/35

59.16/34

68.98/50

67.48/48

69.35/50

68.85/50

69.48/51

68.10/49

69.23/50

Wine q.wht.

68.54/31

51.04/25

59.51/38

57.35/35

58.76/36

68.38/52

66.74/49

68.74/52

68.03/51

68.64/52

68.27/51

68.58/51

Wisconsin

97.25/93

97.25/94

97.11/94

97.10/94

97.25/94

96.51/92

97.39/94

97.25/94

97.39/94

97.26/94

Yeast

59.91/43

57.62/45

59.57/47

59.04/47

59.64/47

60.25/48

58.29/46

60.45/48

59.37/47

59.10/47

59.98/48

59.78/47

AVERAGE

82.64/59

79.34/61

81.36/65

80.62/63

80.09/61

81.81/65

82.04/66

79.61/61

82.51/67

82.90/67

81.90/65

82.41/66

is chi-square distributed with m-1 degrees of freedom. For the m = 10 algorithms and b = 44 datasets with accuracies shown in Table 8, the statistic is determined to be 66.9583 and 73.221, for crisp and fuzzy initialization, respectively. The corresponding p-values (p = Pr {X > S}) are 5.3E-18 and 3.89E-21.

Table 8

Pairwise accuracy comparisons (Wilcoxon signed rank test) between the 9 algorithms and the proposed fuzzy rule with crisp initialization

	EF-kNN-IVFS	kNN	D-SKNN	IF-KNN	FENN	IT2FKNN	GAfuzzy-kNN	PFKNN	Orig. crisp
Statistic W^-	378	162	200.5	167	183	286.5	237	151	120.5
z score	-0.913	-3.327	-2.637	-3.105	-3.351	-2.057	-2.501	-4.009	-3.178
unadjusted p	0.181	4.39E-4	4.18E-3	9.51E-4	4.03E-4	1.99E-2	6.19E-3	3.05E-5	7.43E-4
Holm corrected α	0.05	0.00714	0.0125	0.01	0.00625	0.025	0.0166	0.00556	0.00833

In the second stage, Wilcoxon signed rank test post-hoc test is used to identify those algorithms which have a statistically significantly lower accuracy than the accuracy of the control treatment. In accordance with [33], the mean ranks post-hoc test of [34] has not been used due to its inconsistency of power.

Normal approximation of the distribution of the statistic is used to calculate the p-value since the number of datasets is large. The z-score is determined as $z = \frac{W^{-} - b (b + 1) / 4}{\sqrt{b (b + 1) (2 b + 1) / 24}}$ (12)

The Familywise Error Rate (FWER) is bounded to 0.05 by using the Holm-Bonferroni correction. This correction is less conservative than the Bonferroni correction and it does not alter the significance levels of pairwise comparisons with strong treatments when weaker ones are later added to the comparison set.

The p-values of the 9 pairwise comparisons with the proposed fuzzy k-NN rule with the crisp initialization as the control treatment, and the proposed fuzzy k-NN rule with the fuzzy initialization as the control treatment, are tabulated against the Holm-Bonferroni corrected significance levels in Tables 8 and 9, respectively. The accuracy of the proposed fuzzy k-NN rule with either initialization is statistically significantly higher than the accuracy of all algorithms except EF-kNN-IVFS [21]. Post-hoc tests are inconclusive in showing the statistical significance of the advantage of either the proposed fuzzy k-NN rule or EF-kNN-IVFS. On the other hand, for the values of k giving the best results, while the ratio of the run time of EF-kNN-IVFS to that of the fuzzy k-NN rule with fuzzy initialization is 98.8, the ratios of the run times of proposed fuzzy k-NN rule with crisp and fuzzy initializations to those of their original counterparts are merely 9.9 and 13.0, respectively.

Table 9

Pairwise accuracy comparisons (Wilcoxon signed rank test) between the 9 algorithms and the proposed fuzzy rule with fuzzy initialization

	EF-kNN-IVFS	kNN	D-SKNN	IF-KNN	FENN	IT2FKNN	GAfuzzy-kNN	PFKNN	Orig. fuzzy
Statistic W^-	417	190.5	231.5	178	145.5	220.5	217.5	134	92.5
z score	0.101	-3.104	-2.572	-3.414	-3.820	-2.358	-2.581	-3.836	-3.636
unadjusted p	0.540	9.56E-4	5.05E-3	3.21E-4	6.68E-5	9.18E-3	4.93E-3	6.26E-5	1.38E-4
Holm corrected α	0.05	0.01	0.0166	0.00833	0.00625	0.025	0.0125	0.00556	0.00714

Cohen’s Kappa statistic accounts for the chance agreement between actual and predicted labels. It is

κ = (p_{a} - p_{e}) / (1 - p_{e})

(13) where

p_{e} = \sum_{c} \Pr {C = c} \Pr {\hat{C} = c}

, p_a is accuracy probability. For each algorithm, Table 10 also lists, the maximum κ over {3, 5, 7, 9} neighbors for each dataset and the average maximum κ over all datasets.

Table 10

AUC statistics (maximum over k ∈ {3, 5, 7, 9} neighbors) for the classification of the two class Keel datasets

Data set	EF-kNN-IVFS	kNN	D-SKNN	IF-KNN	FENN	IT2FKNN	GAfuzzy-kNN	PFKNN	Orig. crisp	Prop. crisp	Orig. fuzzy	Prop. fuzzy
Appendicitis	0.822	0.817	0.795	0.837	0.843	0.846	0.825	0.825	0.827	0.834	0.839	0.846
Banana	0.959	0.963	0.951	0.967	0.965	0.966	0.960	0.959	0.958	0.958	0.964	0.962
Bands	0.645	0.728	0.749	0.727	0.688	0.746	0.731	0.712	0.761	0.765	0.756	0.754
Bupa	0.591	0.641	0.668	0.654	0.652	0.676	0.649	0.678	0.687	0.713	0.662	0.693
Haberman	0.667	0.629	0.618	0.653	0.689	0.665	0.650	0.637	0.636	0.625	0.686	0.673
Heart	0.845	0.856	0.851	0.857	0.853	0.866	0.856	0.867	0.867	0.866	0.856	0.863
Ionosphere	0.891	0.933	0.926	0.935	0.868	0.937	0.933	0.929	0.937	0.939	0.937	0.937
Mammographic	0.849	0.893	0.878	0.889	0.873	0.878	0.868	0.875	0.860	0.862	0.874	0.874
Monk-2	0.923	0.988	0.983	0.980	0.913	0.957	0.986	0.966	0.974	0.976	0.956	0.958
Phoneme	0.922	0.932	0.933	0.941	0.940	0.954	0.953	0.946	0.956	0.956	0.952	0.948
Pima	0.784	0.793	0.760	0.800	0.797	0.801	0.790	0.799	0.789	0.798	0.797	0.805
Ring	0.823	0.914	0.881	0.920	0.692	0.924	0.918	0.924	0.914	0.910	0.919	0.910
Sonar	0.899	0.909	0.917	0.927	0.885	0.924	0.96	0.878	0.930	0.936	0.921	0.921
Spambase	0.939	0.949	0.949	0.954	0.951	0.965	0.962	0.948	0.965	0.969	0.963	0.966
Spectfheart	0.835	0.798	0.788	0.797	0.813	0.812	0.816	0.792	0.797	0.799	0.812	0.813
Titanic	0.729	0.670	0.746	0.756	0.726	0.739	0.727	0.561	0.754	0.754	0.731	0.731
Twonorm	0.996	0.996	0.992	0.997	0.997	0.997	0.997	0.996	0.996	0.996	0.997	0.996
Wdbc	0.993	0.992	0.986	0.994	0.993	0.993	0.992	0.993	0.992	0.992	0.994	0.993
Wisconsin	0.986	0.991	0.991	0.992	0.984	0.992	0.991	0.993	0.990	0.990	0.992	0.991
AVERAGE	0.847	0.863	0.861	0.873	0.849	0.876	0.872	0.857	0.873	0.876	0.874	0.875

Receiver Operating Characteristic (ROC) is a plot of the True Positive Rate (TPR) vs. the False Positive Rate (FPR) as a function of the grade of the positive class. For each algorithm, Table 11 lists the Area Under Curve (AUC) of the maximum ROC (maximum TPR as a function of FPR) over {3, 5, 7, 9} neighbors for the two-class datasets (excluding Hepatitis which had too few samples for estimating ROC).

The average maximum κ values and AUC values in Table 10 suggest that IT2FKNN and GA-fuzzy-kNN, rather than EF-kNN-IVFS, strongly contend with the proposed method. For the Banana dataset, the proposed fuzzy k-NN rule is not better than the original fuzzy k-NN rule and is inferior to IT2FKNN in terms of AUC performance. A scatter plot of the 2-D feature vectors of this dataset reveals that most known boundary samples of each class are inside the convex hull of the other class so that the proposed distance is not employed. This is distinct from the linearly separable, two-class problem of Section 5.1 where the proposed fuzzy rule has a performance advantage.

6 Conclusions and future work

This paper has presented a new fuzzy k-NN rule based classifier. The soft voting process for the class of the unknown sample employs its shortest distance to the line passing through the nearest neighbor and a neighbor of this nearest neighbor of the same class. It is also proposed that the new distance be employed only when, for each nearest neighbor, the unknown sample lies outside of the convex hull of the set of known samples of the same class as the nearest neighbor. The proposed fuzzy k-NN rule has a significantly higher classification accuracy than the original fuzzy k-NN rule and most other fuzzy nearest neighbor type algorithms, albeit a modestly higher complexity.

In future work, it might be of interest to see if the proposed classifier can benefit from denoising and despeckling methods (e.g. [35, 36]) to improve its performance on noisy (image) data.

References

Fix and

J.L.

Hodges , Discriminatory analysis: Small sample performance, USAF School of Aviation Medicine, Randolph Field, Tex, Project 21-49-004 Rept. 11 (1952).

Ren ,

Yuy ,

Zhang and

Huang , Learning Convolutional NonLinear Features for K Nearest Neighbor Image Classification, in: Intl Conf Pattern Recogn, 2014, 2014.

T.M.

Cover and

P.E.

Hart , Nearest neighbor pattern classification, IEEE Trans Inf Theory IT-13 (1967), 21–27.

Jozwik , A learning scheme for a fuzzy k-NN rule, Pattern Recogn Lett 1 (1983), 287–289.

J.C.

Bezdek ,

S.K.

Chuah and

Leep , Generalized k-nearest neighbor rules, Fuzzy Sets Syst 18 (1986), 237–256.

J.C.

Bezdek ,

Ehrlich and

Full , FCM: The fuzzy c-means clustering algorithms, Comp Geosci 10 (1984), 191–203.

J.M.

Keller ,

Gray and

J.A.

Givens , A fuzzy K-nearest neighbor algorithm, IEEE Trans Syst Man Cybern SMC-15(4) (1985), 580–585.

Yang and

Chen , On strong consistency of the fuzzy generalized nearest neighbor rule, Fuzzy Sets Syst 60 (1993), 273–281.

Yang and

Chen , Convergence rate of the fuzzy generalized nearest neighbor rule, Comp Math Appl 27 (1994), 1–8.

10.

T.W.

Chua and

W.W.

Tan , A new fuzzy rule-based initialization method for K-Nearest neighbor classifier, in: FUZZIEEE, IEEE, 2009, pp. 415–420. ISBN ISBN 978-1-4244-3596-8. http://dblp.uni-trier.de/db/conf/fuzzIEEE/fuzzIEEE2009.html#ChuaT09.

11.

V.T.

Kissiov , A fuzzy version of the K-NN method, Fuzzy Sets Syst 49(3) (1992), 323–329.

12.

J.H.

Han and

Y.K.

Kim , A fuzzy K-NN algorithm using weights from the variance of membership values, in: IEEE Comp Society Conf Comp Vision and Pattern Recognition, Ft Collins, CO, USA, 1999.

13.

H.B.

Mitchell and

P.A.

Schaefer , A ’soft’ K-nearest neighbor voting scheme, Intl J Intel Syst 16(4) (2001), 459–468.

14.

T.D.

Pham , An Optimally Weighted Fuzzy k-NN Algorithm, in: Intl Conf Pattern Recogn Image Analysis: Pattern Recognition and Data Mining, Bath, UK, 2005, pp. 239–247.

15.

L.I.

Kuncheva , An intuitionistic fuzzy k-nearest neighbors rule, Notes Intuitionistic Fuzzy Sets 1 (1995), 56–60.

16.

Y.K.

Kim and

J.H.

Han , Fuzzy K-NN algorithm using modified K-selection, in: Intl Joint Conf of 4th IEEE International Conf on Fuzzy Syst and 2nd Intl Fuzzy Eng Symp, 1995.

17.

Arif ,

M.U.

Akram ,

F.A.

Afsar and

F.A.A.

Minhas , Pruned fuzzy k-nearest neighbor classifier for beat classification, J Biomedical Sci Eng 3 (2010), 380–389.

18.

M.-S.

Yang and

C.-H.

Chen , On the edited fuzzy K-nearest neighbor rule, IEEE Trans Syst Man Cybern Part B (Cybernetics) 28(3) (1998), 461–466.

19.

F.C.-H.

Rhee and

Hwang , An interval type-2 fuzzy Knearest neighbor, in: 12th IEEE Intl Conf Fuzzy Syst, 2003 FUZZ ’03, 2003, pp. 25–28.

20.

Hu and

Xie , Improving fuzzy k-nn by using genetic algorithm, J Comp Info Syst 1(2) (2005), 203–213.

21.

Derrac ,

Chiclana ,

Garcia and

Herrera , Evolutionary fuzzy k-nearest neighbors algorithm using interval-valued fuzzy sets, Info Sci 329 (2016), 144–163.

22.

Denoeux , A k-nearest neighbor classification rule based on Dempster-Shafer theory, IEEE Trans Syst Man Cybern 25(5) (1995), 804–813.

23.

L.M.

Zouhal and

Denoeux , An evidence-theoretic k-NN rule with parameter optimization, IEEE Trans Syst Man Cybern Part C (Appl and Reviews) 28(2) (1998), 263–271.

24.

Ezghari ,

Zahi and

Zenkouar , A new nearest neighbor classification method based on fuzzy set theory and aggregation operators, Expert Syst with Appl 80 (2017), 58–74.

25.

Rasheeda ,

Stashuka and

Kamel , Adaptive fuzzy k-NN classifier for EMG signal decomposition, Medical Engineering & Physics 28(7) (2006), 694–709.

26.

Huang and

Li , Prediction of protein subcellular locations using fuzzy k-NN method, Bioinformatics 20(1) (2004), 21–28.

27.

Sengur , An expert system based on principal component analysis, artificial immune system and fuzzy k-NN for diagnosis of valvular heart diseases, Comp in Biology and Medicine 38(3) (2008), 329–338.

28.

Nezamabadi-pour and

Kabir , Concept learning by fuzzy k-NN classification and relevance feedback for efficient image retrieval, Expert Syst Appl 36(3) (2009), 5948–5954.

29.

Wolfe , Finding the nearest point in a polytope, Math Programming 11 (1976), 128–149.

30.

Chakrabarty ,

Jain and

Kothari , Provable submodular minimization using Wolfe’s algorithm, in: Advances in Neural Information Proc Syst 27, Vol. 20, 2014, pp. 802–809.

31.

Demsar , Statistical comparisons of classifiers over multiple data sets, J Mach Learn Res 7 (2006), 1–30.

32.

Derrac ,

Garcia and

Herrera , Fuzzy nearest neighbor algorithms: Taxonomy, experimental anal ysis and prospects, Inf Sci 260 (2014), 98–119. http://dblp.uni-trier.de/db/journals/isci/isci260.html#DerracGH14.

33.

Benavoli ,

Corani and

Mangili , Should we really use post-hoc tests based on mean-ranks? J Mach Learn Res 17(1) (2016), 152–161.

34.

Nemenyi , Distribution-free Multiple Comparisons, PhD thesis, Princeton University, 1963.

35.

Gai , Multiresolution monogenic wavelet transform combined with bivariate shrinkage functions for color image denoising, CSSP 37(3) (2018), 1162–1176. http://dblp.uni-trier.de/db/journals/cssp/cssp37.html#Gai18.

36.

Gai ,

Zhang ,

Yang and

Yu , Speckle noise reduction in medical ultrasound image using monogenic wavelet and Laplace mixture distribution, Digital Signal Processing 72 (2018), 192–207. http://dblp.uni-trier.de/db/journals/dsp/dsp72.html#GaiZYY18

Fuzzy k-NN classification with weights modified by most informative neighbors of nearest neighbors

Abstract

Keywords

1 Introduction

1.1 Review of fuzzy k-NN rules of [7]

3 The Basic Method

Algorithm 1 Wolfe’s algorithm implementation [30]

Table 1 Classification error ϵ%, grade advantage γ of fuzzy k-NN rule employing original and conditionally applied proposed distance (bold:winner)

Table 5 Effect of dimensionality reduction with LDA on error ϵ% (k=5)

Table 6 Classification errors ϵ% on the Forest Type Mapping dataset

References

Algorithm 1
Wolfe’s algorithm implementation [30]

Table 1
Classification error ϵ%, grade advantage γ of fuzzy k-NN rule employing original and conditionally applied proposed distance (bold:winner)

Table 5
Effect of dimensionality reduction with LDA on error ϵ% (k=5)

Table 6
Classification errors ϵ% on the Forest Type Mapping dataset