Abstract
Keywords
Introduction
The
The Prototype Selection (PS) technique [6] is one of the main solutions for the above shortcomings of the KNN classifier. The prototype selection technique is a special case of data preprocessing. It can reduce the size of the training set by removing noise and redundant samples while keeping the classification accuracy. Hence, the prototype selection technique is usually regarded as a multi-objective optimization problem by weighing the classification accuracy (accuracy) and the reduction rate (
Condensing methods [7] and edition methods [8] are two typical representatives of prototype selection techniques. Condensing methods aim to search the minimal consistent subset by finding representatives. CNN (Condensing Nearest Neighbor) [7], Generalized Condensed Nearest Neighbor (GCNN) [9] and Multiedit CNN [10] are classical condensing methods. Edition methods intend to handle harmful samples, such as noise, outliers and borderline samples. ENN (Edited Nearest Neighbors) [8], NCNEdit (Nearest Centroid Neighbor Edition) [11] and ENaN (Edited Natural neighbor) [12] are instances of edition methods. Condensing methods can achieve a high reduction rate and rarely change the classification accuracy. By contrast, edition methods tend to improve the classification accuracy rather than the reduction rate.
Recently, hybrid prototype selection techniques combining condensing methods with edition methods have been developed and studied widely. Hybrid prototype selection techniques usually employ edition methods to delete noise and smooth the class boundary. Then, they use condensing methods to reduce the number of training samples dramatically. Representative examples are HMN-based IS (Hit Miss Networks-based Instance Selection) [13], CBP (Class Boundary Preserving Algorithm) [14], 1-FN [15], 1-EN [15], ISR (InstanceRank) [16], IRB (InstanceRank based on Borders) [17], BNNT (Binary Nearest Neighbor Tree) [18], ATISA (Adaptive Threshold-based Instance Selection Algorithm) [19], LSIR (Local Sets-based Instance Reduction) [20], MTRKNN (Modified Template Reduction for KNN) [12], NNGIR (Natural Neighborhood Graph-based Instance Reduction) [21], CNNIR (Constraint Nearest Neighbor for Instance Reduction) [22], LSNaNIR ( local Sets with Natural Neighbors for Instance Reduction) [23], etc. Although extensive empirical results have proved their effectiveness, they still have the following technical defects:
Adopted edition methods (e.g. ENN, ENaN and so on) in most hybrid prototype selection techniques (e.g. CBP, ISR, IRB, ATISA, MTKNN, NNGIR and CNNIR) are based on the majority voting and the local neighborhood of the tested sample. They are susceptible to harmful samples (noise and unsafe borderline samples) around the tested sample, when
Illustrating the drawbacks of edition methods in existing hybrid prototype selection techniques. Most of them (e.g. HMN-based IS, 1-FN, 1-EN, BNNT, ATISA, LSIR, NNGIR and CNNIR, LSNaNIR) retain many internal samples, which contributes little to the classification of KNN classifier and (or) leads to the low reduction rate. Hence, they are hard to achieve both high accuracy and high reduction. In fact, Borderline samples actually play a decisive role in the classification accuracy of the KNN classifier. Many methods rely on many parameters. CBP requires 3 parameters (

In order to solve the above technical defects and overcome the shortcomings of KNN classifiers, a hybrid prototype selection technique based on relative density and density peaks clustering (named PST-RD-DPC) is proposed. First, a new edition method based on relative density and distance (named EMRDD) is proposed to filter out harmful samples and smooth the class boundary. Second, a new condensing method based on relative density and density peaks clustering [24] (named CMRDDPC) is proposed to retain representative borderline samples. After that, the reduced training set is used to train the KNN classifier to achieve classification tasks. The main contributions are highlighted as follows:
A hybrid prototype selection technique based on relative density and density peaks clustering (PST-RD-DP) is proposed against the technical defects of existing hybrid prototype selection techniques and shortcomings of KNN classifiers. A new edition method based on relative density and distance (EMRDD) is proposed. Compared to existing edition methods [8, 11, 12], EMRDD can alleviate the negative effect of harmful samples around test samples. A new condensing method based on relative density and density peaks clustering (CMRDDPC) is proposed. Compared to existing condensing methods [7, 9, 10], CMRDDPC retains more representative borderline samples and is parameter-free.
The rest of our work is structured as follows. Related prototype selection techniques are reviewed in Section 2. The main ideas and algorithm process of PST-RD-DP are introduced in Section 3. Intensive experiments and experimental settings are shown in Section 4. Finally, conclusions are summarized and arrangements are made in Section 5.
The KNN classifier [1] is one of the most well-known instance-based supervised non-parametric classifications. However, it is sensitive to noise and has a large computational complexity and long response times. Prototype selection techniques have been postulated as a good solution. Scholars usually divide prototype selection techniques into three categories, i.e., condensing methods, edition methods and hybrid methods.
Condensing methods intend to search the minimal consistent subset by finding representatives. CNN [7] is one of the most classical condensing methods. CNN continuously adds a new sample whose nearest neighbor does not match in label with itself into the condensing set until all members have been absorbed. GCNN [9] and Multiedit CNN [10] are extensions of CNN. GCNN modifies the absorption criterion of CNN. GCNN checks whether all samples have been strongly absorbed. Multiedit CNN applies Multiedit to overcome the negative impacts of noise.
Edition methods focus on handling harmful samples, such as noise, outliers and borderline samples. ENN [8] is the earliest edition method. NCNEdit [11] and ENaN [12] are two representative improvements of ENN. ENN filter outs a harmful sample, when the harmful sample is misclassified by its
Hybrid prototype selection techniques combine condensing methods with edition methods. HMN-based IS [13] is a hit miss network (HMN)-based hybrid prototype selection technique. In HMN-based IS, a HMN-based edition method (HMN-E) is proposed to handle noise and a HMN-based condensing method (HMN-C) is proposed to remove samples without affecting the accuracy of 1-NN classifier. CBP [14] is a multi-stage hybrid prototype selection technique. CBP employs ENN to prune noise and harmful borderline samples. Then, it uses the geometric characteristics of the underlying distribution to retain borderline samples. 1-FN [15], 1-EN [15], ISR [16] and IRB [17] are rank-based hybrid prototype selection techniques. 1-FN and 1-EN employ the voting strategy with a threshold to edit and condense the training set. The farther neighbors receive votes from considered samples in 1-FN while the neighbors near their enemies receive votes from considered samples in 1-EN. ISR is inspired by PageRank. It uses the idea of PageRank to compute the rank value of each sample. IRB employs the similarity between each sample and its
The above hybrid prototype selection techniques have been proven to be effective in weighing the classification accuracy and the reduction rate. Yet, they still have the following technical drawbacks (as analyzed in Section 1): (a) adopted edition methods are susceptible to harmful samples around tested samples; (b) they retain too many internal samples, which contributes little to the classification of KNN classifier and leading to the low reduction; (c) they rely on many parameters. Hence, a novel competitive hybrid prototype selection technique based on relative density and density peaks clustering (PST-RD-DP) is proposed against the above issues.
Proposed hybrid prototype selection technique
Let
A flowchart of the proposed PST-RD-DP is shown in Fig. 2, which provides an overview. First, the proposed edition method based on relative density and distance (name EMRDD) is used to prune noise and smooth the class boundary. Hence, a training set without noise can be obtained. Second, the proposed condensing method based on relative density and density peaks clustering (CMRDDPC) is used to find the reduced set from the training set without noise. The reduced set contains representative borderline samples. After that, the reduced set is used to achieve the classification task of the KNN classifier.
Flowchart of PST-RD-DP.
In the following, the proposed edition method based on relative density (EMRDD) is introduced in Section 3.1. The proposed condensing method based on relative density and density peaks clustering (EMRDDPC) is introduced in Section 3.2. The pseudo-code and analysis of PST-RD-DP are described in Section 3.3.
Most existing hybrid prototype selection techniques use edition methods (e.g. ENN, ENaN and so on) based on the local neighborhood with the majority voting to prune harmful samples. However, these edition methods are susceptible to harmful samples around the tested sample, as analyzed in Section 1. Hence, a new edition method based on relative density and distance (EMRDD) is proposed in PST-RD-DP. The proposed EMRDD is inspired by two characteristics of noise, shown in Fig. 3. There are two different classes of samples in Fig. 3. Circles are class 1 and heptagons are class 2. Samples A and D are noise because they belong to class 1 and are surrounded by samples of class 2.
Describing two characteristics of noise.
Characteristic One: the density between noise and its homogeneous
In Eq. (1), Heter_NN
Characteristic Two: the distance between noise and the class center with the same class label as the noise is greater than the distance between the noise and the class center with a different class from the noise. In Fig. 3 (b), P1 and P2 are the class centers of class 1 and class 2, respectively. Noisy sample A is close to P2 and away from P1. The above characteristic is formulated in Eq. (2).
Let
In Eq. (3), the function
RelativeDistance(
If sample
The pseudo-code of the proposed EMRDD is described in Algorithm 1. At Line 2, the center for each class is calculated. The heterogeneous and homogeneous
Existing edition methods (ENN, ENaN, and so on) [8, 11, 12] employ the majority voting strategy based on the local neighborhood and are susceptible to harmful samples around the test sample. By contrast, the proposed EMRDD uses the relative density and the relative distance to analyze not only the local distribution but also the global distribution of test samples. Hence, EMRDD is less affected by harmful samples around the test sample. As shown and analyzed in Figs 2 and 3, noisy sample A is misjudged as a normal sample by ENN and noisy sample A is correctly identified by EMRDD.
The condensing methods in most existing hybrid prototype selection techniques retain many internal samples, which contributes little to the classification of the KNN classifier and (or) leads to a low reduction rate. Hence, a new condensing method based on relative density and density peaks clustering (CMRDDPC) is proposed in PST-RD-DP.
CMRDDPC is inspired by the characteristic of the relative density: if sample
The local density of each sample is denoted as
In Eq. (8), the distance between two samples is based on their relative density. The dc is a cutoff value and calculated by the Eq. (9).
Then, the DPC calculates the distance
Next, cluster centers are determined by the comprehensive value of the local density
Footnotes
Funding
The Natural Science Foundation of Chongqing, China (cstc2019jcyj-msxmX0694).



