Self-training algorithm based on density peaks combining globally adaptive multi-local noise filter

Abstract

The self-training algorithm can quickly train a supervised classifier from a small number of labeled samples and a large number of unlabeled samples. Despite its long-standing success, the self-training algorithm suffers from mislabeled samples. Local noise filters are designed to detect such samples. However, this kind of filter has two major problems: (a) current local noise filters do not treat the spatial distribution of the nearest neighbors in different classes in much detail; (b) they are at a disadvantage when mislabeled samples are located in overlapping areas of different classes. Here, we develop an integrated architecture, the self-training algorithm based on density peaks combining globally adaptive multi-local noise filter (STDP-GAMLNF), to improve detection efficiency. Firstly, the spatial structure of the data set is revealed by density peak clustering and is used to guide self-training in labeling unlabeled samples. Meanwhile, after each epoch of labeling, GAMLNF comprehensively judges from multiple classes whether a sample is mislabeled, which effectively reduces the influence of edge samples. Experimental results on eighteen UCI data sets demonstrate that GAMLNF is not sensitive to the value of the neighbor parameter $k$ and can adaptively find the appropriate number of neighbors for each class.

Keywords

Self-training algorithm, density peaks clustering, noise filter

1. Introduction

Supervised learning is a popular research topic in machine learning and data mining. Although supervised learning can effectively build a classifier from a wide variety of labeled samples, labeling samples usually takes a long time and costs a great deal of money. By contrast, semi-supervised learning (SSL) [1, 2] trains a classifier using a few labeled samples and a large number of unlabeled samples, and it has been successfully applied in many real applications, such as text classification [3], face recognition [4], medicine [5] and other fields. SSL algorithms include the self-training algorithm [6], deep generative models [7], the co-training algorithm [8] and graph-based semi-supervised learning [9]. The self-training algorithm is widely used because it makes no initial assumptions about the data set and is easy to implement.

In terms of implementation, the self-training algorithm contains three steps. (a) A small number of labeled samples are used to train an initial classifier. (b) The unlabeled samples with the highest confidence are selected and labeled by the classifier. (c) A noise filter is used to detect mislabeled samples, and the remaining relabeled samples are added to the training set. The three steps are repeated until the stop condition is reached. Researchers generally agree that mislabeled samples in the self-training algorithm reduce the accuracy of the trained classifier in practice. Hence, a diverse range of modified noise filters have been designed to reduce the impact of erroneously labeled samples on training. Their main ideas can be summarized as follows: a local noise filter either judges whether a relabeled sample is mislabeled by examining the spatial structure of the relabeled sample and its local adjacent samples, or computes the probability that the relabeled sample is mislabeled by conducting a hypothesis test. Although these local noise filters do increase the classification accuracy of the classifiers, they have a number of technical defects:

(a)
Current local noise filters only focus on the spatial distribution of the nearest neighbors within one class. Because samples are distributed differently in different classes, local noise filters need a different number $k$ of nearest neighbors for each class.
(b)
Local noise filters based on $k$ nearest neighbors are affected by noise samples, especially when the number of surrounding labeled samples is small. When the value of $k$ is very small, the local noise filters tend to neglect vital nearest neighbors. Conversely, if the value of $k$ is large, the local noise filters are influenced by relabeled samples of the wrong class. Evidently, it is unreasonable for the edited nearest neighbor (ENN) filter to use a fixed value of $k$ to detect relabeled samples. What is needed, therefore, is an adaptive selection rule.
(c)
The majority rule adopted as the decision rule of current local noise filters is less effective: it does not work when mislabeled samples are located in overlapping areas of different classes. A closer nearest neighbor should therefore contribute more to detection and receive a larger voting weight [10].

Based on the above analysis, a novel globally adaptive multi-local noise filter using harmonic mean distance (GAMLNF) is proposed. GAMLNF is grounded in the local mean-based $k$ nearest neighbor (LMKNN) [11]. Each class finds its corresponding local average vector, which takes the spatial distribution of the nearest neighbors within each class into account [12]. Firstly, GAMLNF utilizes a globally adaptive nearest neighbor selection rule [13], which calculates $r$ globally nearest neighbors, and GAMLNF dynamically finds out $k_{i}$ local nearest neighbors of each class. Secondly, $k_{i}$ multi-local mean vectors are calculated on the basis of $k_{i}$ local nearest neighbors, and the sole pseudo mean vector is calculated, which effectively reduces the influence of mislabeled samples. Finally, GAMLNF calculates the distance between the pseudo mean vector of each class and unlabeled samples by harmonic mean distance [14, 15]. Naturally, GAMLNF takes the influence of spatial distribution in each class into consideration. The strategies empower GAMLNF to improve the ability of detecting mislabeled samples, and the classification accuracy can be boosted.

In order to better detect the mislabeled samples in STDP, an integrated self-training algorithm based on density peaks combining globally adaptive multi-local noise filter (STDP-GAMLNF) is designed in this study. Firstly, STDP-GAMLNF adopts DPC to reveal the whole spatial structure of samples [11], according to which the unlabeled samples are relabeled. Next, through the novel noise filter, a mislabeled sample is comprehensively detected from multiple classes, and the influence of edge samples and mislabeled samples is decreased effectively. Finally, the mislabeled samples are deleted, which will corrupt the spatial structure revealed by DPC. Space structure restoration (SSR) technology with a doubly-linked list is used for storing spatial structure, and automatically connects the unlabeled “previous” samples with the nearest density peak one in the deleted samples. In this regard, SSR ensures that every unlabeled sample is precisely relabeled.

In short, the main works of this study are summarized as follows:

(a)
GAMLNF algorithm is proposed. It considers the global and local spatial structure simultaneously, and provides higher weight for the closer labeled samples by harmonic mean distance in the local detection voting process, which is not sensitive to the neighbor parameter $k$ value and is capable of adaptively finding the appropriate number of neighbors of each class.
(b)
The robust self-training algorithm STDP-GAMLNF, a valid solution for mislabeling in STDP, is verified.
(c)
SSR is proposed in STDP-GAMLNF, which effectively prevents spatial structure from being corrupted by the deletion of mislabeled samples and ensures that each unlabeled sample is labeled.
(d)
Compared with existing methods, the algorithm developed here remarkably increases the average accuracy and AUC value.

This paper is organized as follows. Section 2 presents related work. Section 3 describes preliminaries. Section 4 constructs STDP-GAMLNF. Section 5 evaluates the performance of STDP-GAMLNF through comparison experiments. Finally, Section 6 concludes the paper and outlines future plans.

2. Related work

The self-training algorithm is a semi-supervised classification method. Current studies of the self-training algorithm highlight two needs: selecting unlabeled samples with high confidence for labeling, and detecting mislabeled samples.

Extensive research has explored the selection of unlabeled samples with high confidence for labeling. The seminal work of Adankon and Cherie [16] pioneered a new approach to self-training, namely Help-training, where a support vector machine (SVM) is trained based on a naive Bayes (NB) classifier. Nevertheless, this method has a drawback: the performance of Help-training depends entirely on the number and distribution of labeled samples. Gan et al. [17] developed the semi-supervised fuzzy C-means algorithm (SFCM) to find unlabeled samples that have a strong discriminant ability in the local data structure, and these unlabeled samples are relabeled through the fuzzy C-means algorithm. However, the algorithm neither takes account of the global structure information of the data set nor captures the spatial structure of non-spherically distributed data. In view of this, Wu et al. [18] designed a self-training algorithm based on density peaks of data (STDP). The density peaks clustering (DPC) algorithm is adopted to discover the spatial structure of the data set, which enables the self-training algorithm to relabel representative unlabeled samples.

Although STDP considers the global spatial structure of data sets, it still produces some mislabeled samples during the training process. If the mislabeled samples are added to the training data set, they ultimately reduce the classification accuracy of the classifier [19]. Thus, numerous self-training algorithms based on local noise filters have been proposed, such as self-training based on density peaks clustering and cut edge weight statistic (STDP_CEWS) [20], self-training method based on density peaks clustering with extended parameter-free local noise filter (STDPNF) [21] and multi-label self-training with editing (MLSTE) [22]. STDP_CEWS identifies mislabeled samples by the cut edge weight statistic. MLSTE adopts edited nearest neighbor (ENN) to filter out noisy samples, while STDPNF uses natural neighbors to remove noisy samples.

Although the current noise filters increase the accuracy of the algorithm, existing local noise filters only take account of the spatial distribution of the nearest neighbors within one class. They are also particularly vulnerable to the influence of noise samples, so a novel globally adaptive multi-local noise filter using harmonic mean distance is proposed to solve the above problems. While DPC is used to find the spatial structure of the data, SSR technology is also introduced to prevent the spatial structure from being destroyed by the deletion of incorrectly labeled samples, so that each unlabeled sample has the opportunity to be labeled.

3. Preliminaries

In this section, notations used in this paper are first shown. In order to create a better noise filter to detect the mislabeled samples in STDP algorithm, we draw lessons from LMKNN. Meanwhile, the harmonic average distance is proved to be capable of giving more weight to the closer samples, improving the performance of LMKNN. Then, STDP, LMKNN and the harmonic mean distance are described concisely.

3.1 Notations

In this study, the notations are described in Table 1.

Table 1
The notations used in this paper

ID	Notations	Explanations
1	${\bm{T}}=\{(x_{i},y_{i})\mid 1\leqslant i\leqslant n\}$	$x_{i}$ is a training sample, $y_{i}$ is its corresponding label.
2	$n$	The size of data set.
3	${\bm{Y}}=\{y_{i}\mid y_{i}\in\{w_{j}\},\ j=1,2,\ldots,m\}$	The set of class labels; $m$ is the number of classes.
4	${\bm{L}}=\{l_{1},l_{2},\ldots,l_{u}\}$	The labeled data set.
5	${\bm{U}}=\{u_{1},u_{2},\ldots,u_{l}\}$	The unlabeled data set.
6	$w_{j}$	The $j$ th class of data set.
7	$\text{NN}_{j}^{r}$	The nearest neighbor set for class $w_{j}$ with $r$ nearest neighbors.
8	$m_{j}$	The local mean vector for class $w_{j}$ .
9	HMD( $\cdot$ )	The harmonic mean distance.
10	AMD( $\cdot$ )	The arithmetic mean distance.
11	$d(x_{t},x_{i})$	The Euclidean distance between $x_{t}$ and $x_{i}$ .
12	PD	The probability difference between the highest probability and the lowest one.
13	$\beta$	Local density.
14	$\delta$	The nearest local density sample.
15	$L_{j}$	The labeled samples set in class $w_{j}$
16	order	The spatial structure is revealed by DPC.
17	$r_{i}$	The number of nearest neighbors in class $w_{j}$ .

3.2 STDP algorithm

The STDP algorithm adopts DPC to reveal the whole spatial structure of the samples [18]. DPC is a density-based clustering algorithm that discovers the global spatial structure of data and automatically finds the number of clusters. DPC is based on two assumptions:

(a)
The highest density sample in a cluster is the center sample of the cluster.
(b)
There is a large distance between centroid samples.

Based on two assumptions, DPC first calculates the local density $\beta_{i}$ and the nearest local density sample $\delta_{i}$ for each sample $x_{i}$ . The spatial structure of the data is found by assigning $x_{i}$ to its corresponding sample $\delta_{i}$ , where $\beta_{i}$ and $\delta_{i}$ are computed as:

$\displaystyle\beta_{i}=\sum\limits_{j}{\rm X}(d_{ij}-d_{c}),\qquad{\rm X}(x)=\left\{\begin{array}{ll}1,&x<0\\ 0,&x\geqslant 0\end{array}\right.$ (1)

$\displaystyle\delta_{i}=\left\{\begin{array}{ll}\min_{j:\beta_{i}<\beta_{j}}(d_{ij}),&\text{if }\exists j,\ \beta_{j}>\beta_{i}\\ \max_{j}(d_{ij}),&\forall j,\ \beta_{i}\geqslant\beta_{j}\end{array}\right.$ (2)

Here $d_{ij}$ is the distance between $x_{i}$ and $x_{j}$ , and $d_{c}$ is the cutoff distance.
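
As an illustration, Eqs. (1) and (2) can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function name `dpc_structure`, the exclusion of the sample itself from the density count, and the returned `parent` list (the index of each sample's nearest denser sample, i.e. its “next” sample) are assumptions made for the example.

```python
import math

def dpc_structure(points, d_c):
    """Return (beta, delta, parent) for every point, following Eqs. (1)-(2)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Eq. (1): count the neighbours strictly closer than the cutoff d_c.
    # Assumption: the sample itself is not counted (this only shifts every
    # beta by the same constant and does not change the ordering).
    beta = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
            for i in range(n)]
    delta, parent = [], []
    for i in range(n):
        denser = [j for j in range(n) if beta[j] > beta[i]]
        if denser:
            # Eq. (2), first case: distance to the nearest denser sample.
            j = min(denser, key=lambda h: dist[i][h])
            delta.append(dist[i][j])
            parent.append(j)
        else:
            # Eq. (2), second case: a density peak has no denser sample.
            delta.append(max(dist[i]))
            parent.append(None)
    return beta, delta, parent

beta, delta, parent = dpc_structure([(0, 0), (0, 1), (1, 0), (0, 5)], d_c=1.2)
print(beta, delta, parent)
```

In this toy data set the first point is the only density peak, and every other point is attached to its nearest denser sample, which is the link that STDP follows when relabeling.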

STDP uses the global spatial structure revealed by DPC to select representative unlabeled samples for relabeling. The relabeling process is divided into two steps. Step 1 iteratively selects and relabels all the unlabeled “next” samples of the labeled samples. Step 2 iteratively selects and relabels all the unlabeled “previous” samples of the labeled samples. Figure 1 shows a general framework for STDP.

Figure 1.
The framework of STDP.

3.3 LMKNN algorithm

LMKNN algorithm is a very effective and simple classification algorithm in supervised learning [11]. It adopts a local mean vector and distance decision rule and considers the spatial distribution of the nearest neighbors of each class. The algorithm is immune to edge samples and significantly improves the accuracy of classification.

Suppose that ${\bm{L}}_{j}=\{(x_{i},y_{i})|(1\leqslant i\leqslant n_{j})\}$ is the set of labeled samples in class $w_{j}$ , $\bigcup_{j=1}^{m}{\bm{L}}_{j}={\bm{L}}$ , and the parameter $r$ represents the number of neighbors in each class. The unlabeled sample $x^{*}\in R^{d}$ is relabeled as follows:

Firstly, the $r$ samples nearest to the unlabeled sample $x_{i}$ in ${\bm{L}}_{j}$ are found by the local nearest neighbor (LNN) rule, which is presented in Algorithm 1. $\text{NN}_{j}^{r}=\{(x_{i},y_{i})|y_{i}=w_{j},1\leqslant i\leqslant r\}$ denotes the $r$ samples nearest to $x_{i}$ in class $w_{j}$ , sorted in ascending order of distance.

Algorithm 1: LNN rule
Input: ${\bm{L}}$ , $x_{i}$ .
Output: ${\textit{LNN}}_{j}$ (local nearest neighbor in class $w_{j}$ )
Process:
1:	Calculate the local Euclidean distances $\{d(x_{i},y_{i})\mid y_{i}\in{\bm{L}},1\leqslant i\leqslant n\}$ .
2:	Sort $\{d(x_{i},y_{i})\mid y_{i}\in{\bm{L}},1\leqslant i\leqslant n\}$ in ascending order.
3:	Select the first $r$ samples as the $r$ nearest neighbors in class $w_{j}$ based on the sorted distances.

After that, the local mean vector $m_{j}^{r}$ is calculated from the $r$ nearest samples in each class

$\displaystyle m_{j}^{r}=\frac{1}{r}\sum_{i=1}^{r}y_{i},y_{i}\in\text{NN}_{j}^{r}$ (3)

Finally, the local mean vector closest to the unlabeled sample $x_{i}$ is obtained, and $x_{i}$ is assigned the class of that local mean vector, as given by Eq. (4)

$\displaystyle w_{c}=\arg{\mathop{\min}\limits_{w_{{j}}}}\ d(x,m_{j}^{r})$ (4)

where Euclidean distance is used to find the local mean vector nearest to $x$ .
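
The three LMKNN steps above can be sketched as follows. This is only an illustrative sketch under stated assumptions: the names `local_mean` and `lmknn_predict` and the representation of labeled samples as `(vector, label)` pairs are hypothetical, and ties are broken by the first class encountered.

```python
import math

def local_mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(c) / len(vectors) for c in zip(*vectors)]

def lmknn_predict(x, labeled, r):
    """Classify x; labeled is a list of (vector, label), r is neighbours per class."""
    by_class = {}
    for vec, lab in labeled:
        by_class.setdefault(lab, []).append(vec)
    best_label, best_dist = None, math.inf
    for lab, vecs in by_class.items():
        # LNN rule: the r nearest samples of this class.
        nearest = sorted(vecs, key=lambda v: math.dist(x, v))[:r]
        m = local_mean(nearest)          # Eq. (3): local mean vector
        d = math.dist(x, m)
        if d < best_dist:                # Eq. (4): closest local mean wins
            best_label, best_dist = lab, d
    return best_label

labeled = [((0, 0), "a"), ((0, 1), "a"), ((10, 10), "a"),
           ((3, 3), "b"), ((4, 4), "b")]
print(lmknn_predict((1, 1), labeled, r=2))
```

Because only the $r$ nearest samples of each class enter the mean, the distant sample `(10, 10)` of class `a` has no effect on the decision in this example.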

3.4 The harmonic mean distance

The harmonic mean distance, denoted HMD( $\cdot$ ) [14, 15], is used to measure the distance between the unlabeled sample $x$ and the $n$ local mean vectors of labeled samples in class $w$ , $\text{LAM}_{w}(x)=\{m_{w}^{i}|1\leqslant i\leqslant n\}$ . The harmonic mean distance HMD( $x,\{m_{w}^{i}\}_{i=1}^{n}$ ) is defined as

$\displaystyle\text{HMD}(x,\{m_{w}^{i}\}_{i=1}^{n})=\frac{n}{\sum_{i=1}^{n}\frac{1}{d(x,m_{w}^{i})}}$ (5)

In order to explain the difference between harmonic average distance and arithmetic mean distance, the proof is given.

The arithmetic mean distance measures the distance between an unlabeled sample $x$ and the set of $n$ local mean vectors in class $w$ , $\text{LAM}_{w}(x)=\{m_{w}^{i}|1\leqslant i\leqslant n\}$ . It is denoted as AMD( $x,\{m_{w}^{i}\}_{i=1}^{n}$ ) and defined in Eq. (6). The contribution of each distance $d(x,m_{w}^{i})$ is measured by the partial derivative $\frac{\partial\text{AMD}(x,\{m_{w}^{i}\}_{i=1}^{n})}{\partial d(x,m_{w}^{i})}$ . Equation (7) shows that this derivative equals $1/n$ and is independent of $d(x,m_{w}^{i})$ , so every local mean vector carries the same weight and the arithmetic mean distance gives no extra weight to the nearest local mean vector.

$\displaystyle\text{AMD}(x,\{m_{w}^{i}\}_{i=1}^{n})=\frac{\sum_{i=1}^{n}d(x,m_{w}^{i})}{n}$ (6)

$\displaystyle\frac{\partial\text{AMD}(x,\{m_{w}^{i}\}_{i=1}^{n})}{\partial d(x,m_{w}^{i})}=\frac{\partial\left[\frac{\sum_{k=1}^{n}d(x,m_{w}^{k})}{n}\right]}{\partial d(x,m_{w}^{i})}=\frac{\partial[\sum_{k=1}^{n}d(x,m_{w}^{k})]}{\partial d(x,m_{w}^{i})}\times\frac{1}{n}=\frac{1}{n}$ (7)

Similarly, Eq. (8) demonstrates that $\frac{\partial\text{HMD}(x,\{m_{w}^{i}\}_{i=1}^{n})}{\partial d(x,m_{w}^{i})}$ is inversely proportional to the value of $d^{2}(x,m_{w}^{i})$ . Compared with the arithmetic average distance, the harmonic average distance pays more attention to the local average vector of the labeled samples which is the closest to the unlabeled sample $x$ .

$\displaystyle\frac{\partial\text{HMD}(x,\{m_{w}^{i}\}_{i=1}^{n})}{\partial d(x,m_{w}^{i})}=\frac{\partial\left[\frac{n}{\sum_{k=1}^{n}\frac{1}{d(x,m_{w}^{k})}}\right]}{\partial d(x,m_{w}^{i})}=n\times\left(\frac{1}{d(x,m_{w}^{i})\times\sum_{k=1}^{n}\frac{1}{d(x,m_{w}^{k})}}\right)^{2}=\frac{\text{HMD}^{2}(x,\{m_{w}^{i}\}_{i=1}^{n})}{n\times d^{2}(x,m_{w}^{i})}=\frac{1}{n}\times\frac{\text{HMD}^{2}(x,\{m_{w}^{i}\}_{i=1}^{n})}{d^{2}(x,m_{w}^{i})}$ (8)
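
The contrast between Eqs. (7) and (8) can be checked numerically. The sketch below is illustrative only: `amd`, `hmd` and `sensitivity` are hypothetical helper names, the distances are made-up values, and the derivative is estimated by a central finite difference rather than computed symbolically.

```python
def amd(dists):
    """Arithmetic mean distance, Eq. (6), from a list of distances."""
    return sum(dists) / len(dists)

def hmd(dists):
    """Harmonic mean distance, Eq. (5), from a list of positive distances."""
    return len(dists) / sum(1.0 / d for d in dists)

def sensitivity(f, dists, i, eps=1e-6):
    """Central finite-difference estimate of the derivative of f w.r.t. dists[i]."""
    up = list(dists)
    up[i] += eps
    down = list(dists)
    down[i] -= eps
    return (f(up) - f(down)) / (2 * eps)

d = [1.0, 2.0, 4.0]  # assumed distances from x to three local mean vectors
for i in range(len(d)):
    print(d[i], sensitivity(amd, d, i), sensitivity(hmd, d, i))
```

For these distances the AMD sensitivity is 1/3 for every vector, whereas the HMD sensitivity is roughly 0.98 for the closest vector and roughly 0.06 for the farthest one, in line with the inverse dependence on $d^{2}(x,m_{w}^{i})$ in Eq. (8).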

4. The proposed algorithm

Figure 2 shows the framework of STDP-GAMLNF, which is mainly divided into three steps. The first step in this process is to discover the spatial structure of the data set via DPC. The second step is to relabel the unlabeled “next” samples of labeled samples according to the spatial structure, and use GAMLNF to detect mislabeled samples. If it is detected as a mislabeled sample, SSR is adopted to repair the spatial structure to ensure that each unlabeled sample is relabeled. Otherwise, the unlabeled sample is added to ${\bm{L}}$ . Finally, the unlabeled “previous” samples of the labeled samples are relabeled according to the updated spatial structure. GAMLNF is established to detect mislabeled samples, and SSR is used to repair the spatial structure. After that, all unlabeled samples are relabeled as labeled samples which exert a great influence on training a better supervised classifier.

Figure 2.

The framework of STDP-GAMLNF.

4.1 Globally adaptive multi-local noise filter using harmonic mean distance

In the semi-supervised learning setting, GAMLNF adaptively finds the appropriate $k$ parameters to detect and delete the mislabeled samples, which significantly enhances the algorithm’s accuracy.

Inspired by GAKNN, a modified version of the LMKNN algorithm summarized in Section 3.3, we use GANN to find the labeled neighbors around relabeled samples. Specifically, given a labeled sample set ${\bm{L}}$ with $m$ classes, GANN adaptively finds $r_{i}$ nearest neighbors in each class around the relabeled sample $x_{i}$ , as shown in Algorithm 2.

Algorithm 2: GANN
Input: ${\bm{L}}$ , $x$ .
Output: $\text{GANN}_{j}$ (local nearest neighbors in class $w_{j}$ ), $r_{j}$ .
Process:
1:	Calculate the global Euclidean distances $\{d(x,y_{i})\mid y_{i}\in{\bm{L}},1\leqslant i\leqslant n\}$ .
2:	Sort $\{d(x,y_{i})\mid y_{i}\in{\bm{L}},1\leqslant i\leqslant n\}$ in ascending order.
3:	Obtain the $k$ labeled samples closest to the relabeled sample $x$ , where $k=r\times m$ .
4:	$i=1$
5:	$r_{j}=0$ for every class $w_{j}$
6:	while $i\leqslant k$
7:	If the class of the $i$ -th nearest labeled sample $y_{i}$ is $w_{j}$ , $y_{i}$ is regarded as the $(r_{j}+1)$ -th nearest neighbor in class $w_{j}$ .
8:	$r_{j}=r_{j}+1$
9:	$i=i+1$
10:	end while
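
The selection rule of Algorithm 2 can be sketched in Python as follows. This is a hedged illustration rather than the authors' code: the function name `gann`, the `(vector, label)` representation and the dictionary return value are assumptions, with the per-class neighbor count recovered as the length of each list.

```python
import math

def gann(x, labeled, r, m):
    """Group the k = r * m globally nearest labeled samples of x by class.

    labeled is a list of (vector, label); the result maps each class that
    occurs among the k nearest samples to its neighbours, nearest first.
    """
    k = r * m
    nearest = sorted(labeled, key=lambda s: math.dist(x, s[0]))[:k]
    per_class = {}
    for vec, lab in nearest:
        per_class.setdefault(lab, []).append(vec)
    return per_class

labeled = [((0, 0), "a"), ((1, 0), "a"), ((2, 0), "a"),
           ((0, 3), "b"), ((9, 9), "b")]
neighbours = gann((0, 0), labeled, r=2, m=2)
print({lab: len(vecs) for lab, vecs in neighbours.items()})
```

With $r=2$ and two classes, the four globally nearest samples are split unevenly here (three in class `a`, one in class `b`), which illustrates how the number of neighbors per class adapts to the local distribution instead of being fixed.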

After GANN has adaptively found the $r_{i}$ nearest neighbors in each class, the multi-local average vector set $\text{MLMV}_{j}^{r}=\{m_{j}^{s}|1\leqslant s\leqslant k\}$ of each class is calculated, where $m_{j}^{s}$ is the local average vector computed from the first $s$ nearest neighbors in class $w_{j}$ , as shown in Eq. (9).

$\displaystyle m_{j}^{s}=\frac{1}{s}\sum_{i=1}^{s}y_{i},y_{i}\in\text{GANN}_{j}^{s}$ (9)

A closer local average vector contributes more to detection, so it receives a larger voting weight. Thus, HMD( $x_{i}$ , $m_{j}^{r}$ ), $m_{j}^{r}\subseteq\text{MLMV}_{j}^{r}$ of Eq. (5) is used to calculate the pseudo average vector $m_{j}^{*}$ . In particular, if there are no neighbors in class $w_{j}$ , then $d(x_{i},w_{j})=\infty$ . Lastly, the pseudo average vector closest to the relabeled sample $x_{i}$ is found by the Euclidean distance. If the class of the relabeled sample does not agree with the class of this closest $m_{j}^{*}$ , the relabeled sample is considered a mislabeled sample and is deleted from ${\bm{U}}$ . Otherwise, it is added to the ${\bm{L}}$ set. GAMLNF is displayed in Algorithm 3.

The computational cost of GAMLNF arises mainly from the search for the global nearest neighbors (Stage 1), the pseudo average vector calculation (Stage 2), and the Euclidean distance calculation (Stage 3). In Stage 1, GAMLNF obtains $k$ global nearest neighbors from $n$ labeled samples according to Algorithm 2, so its computational complexity is $O(nk)$ .

In Stage 2, GAMLNF needs to calculate $m\times r$ local average vectors, and then calculate $m$ pseudo average vectors through the harmonic average distance. Hence, it is apparent that the computational complexity is $O(mr)$ according to Algorithm 3.

In Stage 3, GAMLNF obtains the closest pseudo-average vector to $x$ from the $m$ pseudo-average vectors and its complexity is $O(m)$ .

Algorithm 3: GAMLNF
Input: ${\bm{U}}^{re}$ (relabeled samples), ${\bm{L}}$ (labeled samples)
Output: ${\bm{U}}^{re}$ (updated relabeled samples), ${\bm{L}}^{re}$ (updated labeled samples)
Process:
1:	for each relabeled sample $x_{i}$ , $1\leqslant i\leqslant n$
2:	for each class $w_{j}$ , $1\leqslant j\leqslant m$
3:	Find the $\textbf{{GANN}}_{j}$ of $x_{i}$ in class $w_{j}$ by Algorithm 2.
4:	if $\textbf{{GANN}}_{j}\neq\emptyset$
5:	The multi-local average vector $\text{MLMV}_{j}^{r}$ is calculated in class $w_{j}$ , according to Eq. (9).
6:	$m_{j}^{r}\subseteq\text{MLMV}_{j}^{r}$ is used to calculate the pseudo average vector $m_{j}^{*}$ in class $w_{j}$ , according to HMD( $x_{i}$ , $m_{j}^{r}$ ).
7:	$d(x_{i},w_{j})$ is calculated in class $w_{j}$ , according to the Euclidean distance from $x_{i}$ to $m_{j}^{*}$
8:	else
9:	$d(x_{i},w_{j})=\infty$
10:	end if
11:	end for
12:	Sort $\{d(x_{i},w_{j})\mid 1\leqslant j\leqslant m\}$ in ascending order.
13:	Obtain the class $w^{*}$ with the shortest distance $d(x_{i},w_{j})$ from the sorted set.
14:	if $w^{*}\neq w^{re}$ , where $w^{re}$ is the class assigned to $x_{i}$ during relabeling
15:	Delete the sample $x_{i}$ from ${\bm{U}}$
16:	else Add the sample $x_{i}$ to ${\bm{L}}$
17:	end if
18:	end for
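
One possible reading of Algorithm 3 is sketched below in Python. It rests on an explicit assumption: the class distance $d(x_{i},w_{j})$ is taken to be the harmonic mean distance of Eq. (5) between $x_{i}$ and the multi-local mean vectors of Eq. (9), since the text does not spell out how the pseudo average vector itself is formed. The names `class_distance` and `is_mislabeled` are hypothetical.

```python
import math

def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(c) / len(vectors) for c in zip(*vectors)]

def class_distance(x, neighbours):
    """HMD between x and the multi-local mean vectors of one class.

    neighbours must be sorted by increasing distance to x; an empty list
    gives an infinite distance, as in Algorithm 3.
    """
    if not neighbours:
        return math.inf
    dists = []
    for s in range(1, len(neighbours) + 1):
        d = math.dist(x, mean(neighbours[:s]))   # Eq. (9), then distance to x
        if d == 0.0:
            return 0.0                           # x coincides with a local mean
        dists.append(d)
    return len(dists) / sum(1.0 / d for d in dists)   # Eq. (5)

def is_mislabeled(x, assigned_label, labeled, r, classes):
    """Return True when the closest class differs from the assigned label."""
    k = r * len(classes)
    nearest = sorted(labeled, key=lambda s: math.dist(x, s[0]))[:k]
    per_class = {c: [v for v, lab in nearest if lab == c] for c in classes}
    closest = min(classes, key=lambda c: class_distance(x, per_class[c]))
    return closest != assigned_label

labeled = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
           ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
print(is_mislabeled((0.5, 0.5), "b", labeled, r=2, classes=["a", "b"]))
```

A sample flagged by `is_mislabeled` would be removed from the relabeled set, while an unflagged sample would be added to the labeled set.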

4.2 Space structure restoration

GAMLNF is used to detect and delete the mislabeled samples. However, the deletion tends to corrupt the spatial structure revealed by DPC. In Fig. 3, the circles and triangles represent the real classes $w_{1}$ and $w_{2}$ , green and blue mark samples labeled as $w_{1}$ and $w_{2}$ , and colorless symbols represent unlabeled samples. As shown in Fig. 3a, each sample is designated to its nearest sample of higher local density in the spatial structure disclosed by DPC. The unlabeled “next” samples are relabeled according to this spatial structure. However, according to Fig. 3b, when the unlabeled “next” samples $x_{2}$ and $x_{3}$ are designated to a relabeled sample $x_{1}$ , there are no other “previous” samples between $x_{2}$ and $x_{3}$ . If $x_{1}$ is a mislabeled sample and is deleted by GAMLNF, the spatial structure is corrupted, and $x_{2}$ and $x_{3}$ will not be relabeled.

Figure 3.

Space structure restoration step diagram.

Accordingly, SSR algorithm is proposed. As shown in Fig. 3c, the doubly-linked list structure functions as the storage of the spatial structure. When $x_{1}$ is deleted by the noise filter, the nearest highest local density sample $x_{4}$ is obtained by the doubly-linked list structure, then $x_{2}$ and $x_{3}$ are designated to $x_{4}$ . When the unlabeled sample $x_{4}$ is relabeled, the unlabeled samples $x_{2}$ and $x_{3}$ get the chance to be relabeled. SSR algorithm is displayed in Algorithm 4.

Algorithm 4: SSR
Input: $x$ (deleted sample), ${\bm{L}}$ (labeled data set), ${\bm{U}}$ (unlabeled data set), order (spatial structure)
Output: order
Process:
1:	${\bm{P}}=$ order ( $x$ , previous)
2:	$k=$ length ( ${\bm{P}}$ )
3:	$i=1$
4:	while $i\leqslant k$
5:	$a=$ find( $P_{i}$ , ${\bm{U}}$ ), where $P_{i}$ is the $i$ -th sample in ${\bm{P}}$ . // returns 1 when $P_{i}$ is found in ${\bm{U}}$ , otherwise returns 0.
6:	$z=$ order( $P_{i}$ , previous)
7:	$j=$ find( $z,{\bm{L}}$ ) // returns 1 when $z$ is found in ${\bm{L}}$ , otherwise returns 0.
8:	if ( $a==1\ \&\&\ j==1$ )
9:	order( $P_{i}$ ,next) $=$ order( $x$ , next)
10:	end if
11:	$i=i+1$
12:	end while
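
The re-linking idea behind SSR can be illustrated with a simplified Python sketch. It is not Algorithm 4 itself: the membership checks against ${\bm{L}}$ and ${\bm{U}}$ are omitted, and the doubly-linked list is replaced by two hypothetical dictionaries, `next_of` (sample to its “next” sample) and `prev_of` (sample to the set of its “previous” samples).

```python
def remove_and_restore(x, next_of, prev_of):
    """Delete x and re-attach its 'previous' samples to x's 'next' sample."""
    target = next_of.pop(x, None)        # the sample that x was designated to
    orphans = prev_of.pop(x, set())      # samples that were designated to x
    if target is not None:
        prev_of.setdefault(target, set()).discard(x)
    for p in orphans:
        next_of[p] = target              # skip over the deleted sample
        if target is not None:
            prev_of.setdefault(target, set()).add(p)
    return next_of, prev_of

# Toy structure in the spirit of Fig. 3: x2 and x3 point to x1, x1 points to x4.
next_of = {"x1": "x4", "x2": "x1", "x3": "x1", "x4": None}
prev_of = {"x4": {"x1"}, "x1": {"x2", "x3"}}
remove_and_restore("x1", next_of, prev_of)
print(next_of)
print(prev_of)
```

After the deletion of `x1`, both `x2` and `x3` are designated to `x4`, so they can still be reached and relabeled once `x4` is labeled.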

4.3 Self-training algorithm based on density peaks combining globally adaptive multi-local noise filter

The whole algorithm is shown as Algorithm 5. In Step 1–Step 4, the DPC algorithm is used to reveal the spatial structure of the data set, and the spatial structure is stored in the doubly-linked list. At the same time, ${\bm{L}}$ is used to train the initial classifier C. Step 5–Step 11 iteratively train classifier C. To begin this process, the unlabeled “next” samples of the labeled samples are relabeled by the classifier C according to the order of the spatial structure. Then the relabeled samples are checked by GAMLNF. Before the relabeled samples are added to ${\bm{L}}$ , SSR is applied to repair the spatial structure, which the deletion of mislabeled samples may corrupt. Classifier C is then retrained on the updated ${\bm{L}}$ , and the process is repeated. Step 12–Step 18 iteratively train classifier C on the “previous” samples in the same way as Step 5–Step 11.

STDP-GAMLNF focuses on both the global spatial structure revealed by DPC and the spatial structure in each class found by the GANN algorithm. Simultaneously, GAMLNF adaptively finds $r_{i}$ local nearest neighbors in each class from the $k$ global nearest neighbors. GAMLNF adopts the harmonic average distance to allocate more voting weight to the nearest neighbor. The experiments in Section 5 confirm that STDP-GAMLNF successfully trains an effective supervised classifier.

Algorithm 5: STDP-GAMLNF
Input: ${\bm{L}},{\bm{U}}$
Output: The trained classifier C
Process:
1:	Calculate $\beta_{i}$ for each sample $x_{i}$ of ${\bm{L}}$ and ${\bm{U}}$ according to Eq. (1)
2:	Calculate $\delta_{i}$ for each sample $x_{i}$ of ${\bm{L}}$ and ${\bm{U}}$ according to Eq. (2)
3:	Establish the structure of the data space by designating each sample $x_{i}$ to its unique nearest sample $\delta_{i}$
4:	Train a classifier $C$ using ${\bm{L}}$
5:	Repeat until all the “next” samples of the samples of ${\bm{L}}$ are selected from ${\bm{U}}$
6:	${\bm{U}}^{re}=$ the samples selected from ${\bm{U}}$ , where each sample $x_{j}$ is a “next” sample of a sample of ${\bm{L}}$ according to the structure of the data space
7:	Update the current unlabeled dataset ${\bm{U}}={\bm{U}}-{\bm{U}}^{re}$
8:	Label the samples of ${\bm{U}}^{re}$ with the trained classifier $C$
9:	GAMLNF is used to detect falsely labeled samples of ${\bm{U}}^{re}$ , and SSR is used to repair the spatial structure
10:	Update the current labeled dataset ${\bm{L}}={\bm{L}}\cup{\bm{U}}^{re}$
11:	Retrain the classifier $C$ with ${\bm{L}}$
12:	Repeat until all the samples are selected from ${\bm{U}}$
13:	${\bm{U}}^{re}=$ the samples selected from ${\bm{U}}$ , where each sample $x_{j}$ is a “previous” sample of a sample of ${\bm{L}}$ according to the structure of the data space
14:	Update the current unlabeled dataset ${\bm{U}}={\bm{U}}-{\bm{U}}^{re}$
15:	Label the samples of ${\bm{U}}^{re}$ with the trained classifier $C$
16:	GAMLNF is used to detect falsely labeled samples of ${\bm{U}}^{re}$ , and SSR is used to repair the spatial structure
17:	Update the current labeled dataset ${\bm{L}}={\bm{L}}\cup{\bm{U}}^{re}$
18:	Retrain the classifier $C$ with ${\bm{L}}$
19:	Return the classifier $C$

5. Experiments

5.1 Data sets and setting of experiments

We run the experiments on a PC with 32 GB of memory, a Core i5 CPU and a 64-bit operating system to verify the efficiency of the proposed algorithm. MATLAB 2016b and PyCharm 2021 are used for programming.

To demonstrate the effectiveness of STDP-GAMLNF, 18 benchmark data sets are selected from the UCI repository [23]. The data sets are described in Table 2 and are chosen to cover different sizes, numbers of attributes and numbers of classes. To show the applicability of the algorithm to differently distributed data, data sets with flow pattern distribution and spherical distribution are both included. Missing values are replaced by the mean value. The available evidence supports the conclusion that our algorithm is immune to noise samples.

Table 2
The descriptions of UCI data sets

ID	Data sets	Size	Attributes	Class	Abbreviation
1	Australian_Credit_Approval	690	2	2	ACA
2	biodegradation	1055	41	2	BIO
3	Blood_Transfusion_Service_Center	748	4	2	BTSC
4	Avila	20867	12	12	AVI
5	Contraceptive_Method_Choice_Data_Set	1473	9	3	CMCDS
6	Gauss50	2000	50	2	GAU
7	German	1000	24	2	GER
8	Image_segmentation	2310	19	7	IS
9	Indian_Liver_Patient_Dataset	583	10	2	ILPD
10	mammographic_masses	961	5	2	MM
11	Segmentation	2130	19	7	SEG
12	Spambase	4601	57	2	SPA
13	Tic_Tac_Toe_Endgame	958	9	2	TTTE
14	Website_Phishing	1353	9	3	WP
15	WaveForm	5000	21	3	WAV
16	Wireless_Indoor_Localization	2000	7	4	WIL
17	Banana	8800	2	2	BAN
18	letter	15534	16	26	LET

The training set is randomly divided into the labeled sample set ${\bm{L}}$ and the unlabeled sample set ${\bm{U}}$ . To increase the reliability of the proposed algorithm, four experiments are conducted. 10-fold cross-validation is used in all the experiments. Concise descriptions are as follows:

(a)

Experiment 1 is a comparative experiment between STDP-GAMLNF and other representative algorithms. Mean accuracy, variance, ROC, AUC work as evaluation indexes, and the ratio of ${\bm{L}}$ is set at 20%. The CART functions as the base classifier.

(b)

Experiment 2 compares the effect of the noise filter with the representative noise filter in STDP. Kappa coefficient is used as an evaluation index, and the ratio of ${\bm{L}}$ is set at 20%. CART and KNN are selected as the base classifier.

(c)

Experiment 3 studies the influence of the $k$ value on the proposed algorithm. KNN is selected as the base classifier.

(d)

Experiment 4 studies the impact of the proportion of labeled samples on the noise filter. The evaluation indexes are accuracy and variance, and the ratio of ${\bm{L}}$ is set from 10% $\sim$ 90%. The base classifier is KNN.

5.2 Experiment 1: Comparisons between our algorithm and existing representative work

This experiment compares our algorithm with some representative algorithms to demonstrate the effectiveness of STDP-GAMLNF. To assess the ability of our noise filter to detect mislabeled samples, the following metrics are used: mean accuracy, variance, ROC, and AUC. The ROC curve for multi-class classification is calculated as follows.

Suppose the number of test samples is $n$ and the number of classes is $w$ . After training, the probability of each test sample belonging to each class is calculated, and a probability matrix ${\bm{P}}$ of size $n\times w$ is obtained. Each row holds the probability values of one test sample over the classes (sorted by class label). Correspondingly, the label of each test sample is converted into one-hot encoding, in which each position indicates whether the sample belongs to the corresponding class. This operation yields the label matrix ${\bm{L}}$ .

For each class, the probabilities of the $n$ test samples belonging to that class (one column of ${\bm{P}}$ ) are taken. From each column of the probability matrix ${\bm{P}}$ and the corresponding column of the label matrix ${\bm{L}}$ , the False Positive Rate (FPR) and the True Positive Rate (TPR) under each threshold are calculated to draw a ROC curve. In this way, $w$ ROC curves are drawn. Finally, the $w$ ROC curves are averaged to obtain the final ROC curve and AUC value.
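The averaging procedure described above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: the function names `roc_points` and `macro_roc` are hypothetical, and the $w$ curves are averaged by interpolating each one onto a common FPR grid, which is one common way to average ROC curves but is not spelled out in the text.

```python
import numpy as np

def roc_points(scores, labels):
    """FPR/TPR pairs for one class, using every distinct score as a threshold."""
    order = np.argsort(-scores, kind="mergesort")
    s, l = scores[order], labels[order]
    tp = np.cumsum(l)        # true positives when the top-i samples are called positive
    fp = np.cumsum(1 - l)    # false positives for the same cut
    last_of_tie = np.r_[np.diff(s) != 0, True]  # one ROC point per distinct score
    tpr = np.r_[0.0, tp[last_of_tie] / max(l.sum(), 1)]
    fpr = np.r_[0.0, fp[last_of_tie] / max((1 - l).sum(), 1)]
    return fpr, tpr

def macro_roc(P, y, n_classes, grid_size=101):
    """Average the per-class one-vs-rest ROC curves; return (fpr_grid, mean_tpr, auc).

    P: (n, w) matrix of class probabilities, y: (n,) integer class labels.
    Assumption: curves are averaged on a common FPR grid by linear interpolation.
    """
    L = np.eye(n_classes)[y]                 # one-hot label matrix, shape (n, w)
    grid = np.linspace(0.0, 1.0, grid_size)
    mean_tpr = np.zeros(grid_size)
    for c in range(n_classes):
        fpr, tpr = roc_points(P[:, c], L[:, c])
        mean_tpr += np.interp(grid, fpr, tpr)
    mean_tpr /= n_classes
    # Trapezoidal rule for the area under the averaged curve.
    auc = float(np.sum(np.diff(grid) * (mean_tpr[1:] + mean_tpr[:-1]) / 2.0))
    return grid, mean_tpr, auc
```

Averaging on a fixed FPR grid gives every class equal weight regardless of its size, which matches the description of averaging $w$ curves; weighting classes by frequency would be a different design choice.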

Table 3
Comparison algorithms and parameters in experiment 1

Abbreviation	Algorithms	Parameters
STHP	Self-training with hierarchical prototype	$h=$ 8, $\theta=\pi$ /2
MLSTE	Multi-label self-training with editing	$k=$ 10
ST_FCM	Self-training with SFCM	Threshold $\varepsilon_{1}=1/n$ ( $n$ is the number of classes)
STDP	Self-training with DPC	$P_{a}=$ 2. Please refer to the original literature
STDPNF	STDP combining PNF	$P_{a}=$ 2.
STDP-GAMLNF	Our algorithm	$P_{a}=$ 2. $r=k\times n$ ( $n$ is the number of classes)

The comparison algorithms and our algorithm are listed in Table 3. The dc parameter of the proposed algorithm is discussed in detail in STDP [18], so the dc of the algorithms is set as recommended in STDP. The comparison algorithms are introduced as follows:

(a)

The self-training hierarchical prototype-based approach (STHP) [24] performs semi-supervised classification by extending the recently introduced HP classifier with a self-training mechanism based on the widely used pseudo-label technique. After being primed with labeled samples, the STHP classifier continues to self-evolve its multi-granular structure from unlabeled samples via pseudo-labeling without human supervision.

(b)

Multi-label self-training with editing (MLSTE) [19] uses ENN to detect noisy samples during self-training. Although MLSTE is designed for multi-label tasks, it can be applied to single-label tasks as well.

(c)

Self-training with FCM (ST_FCM) [17] uses the FCM clustering algorithm to guide the self-training method in training an effective classifier.

(d)

Self-training with DPC (STDP) [25] integrates the structure of feature space revealed by DPC into the self-training process to train an effective classifier.

(e)

STDP combined with a parameter-free local noise filter (STDPNF) [6] removes mislabeled samples by exploiting the information of both labeled data and unlabeled data.

Table 4 shows that the accuracy of STDP-GAMLNF is better than that of the comparison algorithms on most data sets. Compared with STDPNF, it detects mislabeled samples more accurately and takes account of the spatial structure within each class. On 12 out of 18 data sets, the average accuracy of STDP-GAMLNF is better than that of the comparison algorithms, and the variance indicates good stability. However, on the AVI and MM data sets, the average accuracy of STDP-GAMLNF is lower than that of STDP. These two data sets are very evenly distributed, so correctly labeled samples are deleted during noise filtering, which removes a large number of samples from the ${\bm{L}}$ set and ultimately lowers the average accuracy. Meanwhile, according to Table 5, the AUC values on 9 out of 18 data sets are improved compared with STDP, which indicates that STDP-GAMLNF improves not only the average accuracy but also the AUC value, giving a better classification effect. Figures 4 and 5 show the ROC curves of the two-class data set BIO and the multi-class data set CMCDS, respectively. STDP-GAMLNF improves the ROC curve in each class and alleviates the data skew problem.

Table 4

Experimental results (MCA $\pm$ STD) of comparison algorithms

Data sets	ST_FCM		MLSTE		STHP		STDP		STDPNF		Ours
ACA	83.91	$\pm$ 5.48	67.82	$\pm$ 4.77	68.55	$\pm$ 8.02	83.91	$\pm$ 6.08	83.91	$\pm$ 6.08	84.05	$\pm$ 5.87
BIO	82.93	$\pm$ 2.89	71.07	$\pm$ 5.46	74.39	$\pm$ 5.36	83.03	$\pm$ 2.74	84.25	$\pm$ 2.76	83.78	$\pm$ 2.93
BTSC	78.74	$\pm$ 3.12	77.01	$\pm$ 2.53	79.27	$\pm$ 3.22	79.27	$\pm$ 2.59	79.01	$\pm$ 3.38	80.21	$\pm$ 2.94
AVI	89.83	$\pm$ 0.59	68.41	$\pm$ 0.8	70.34	$\pm$ 0.82	90.31	$\pm$ 0.85	90.02	$\pm$ 0.72	90.02	$\pm$ 0.54
CMCDS	52.53	$\pm$ 3.75	50.57	$\pm$ 3.75	51.72	$\pm$ 4.69	54.77	$\pm$ 4.46	53.35	$\pm$ 3.79	55.66	$\pm$ 3.15
GAU	74.9	$\pm$ 1.85	92.65	$\pm$ 1.39	88.05	$\pm$ 2.76	71.9	$\pm$ 3.26	73.95	$\pm$ 3.1	78.4	$\pm$ 3.09
GER	71.5	$\pm$ 3.92	68.9	$\pm$ 5.3	69.1	$\pm$ 4.12	72.9	$\pm$ 4.84	73.4	$\pm$ 4.52	74.6	$\pm$ 3.27
IS	91.99	$\pm$ 1.06	80.56	$\pm$ 2.95	87.22	$\pm$ 1.53	93.33	$\pm$ 1.4	93.76	$\pm$ 1.31	94.15	$\pm$ 0.77
ILPD	70.15	$\pm$ 5.16	56.76	$\pm$ 8.88	67.39	$\pm$ 7.6	68.93	$\pm$ 4.54	71.35	$\pm$ 4.77	72.21	$\pm$ 3.91
MM	83.55	$\pm$ 2.95	76.27	$\pm$ 5.11	78.98	$\pm$ 4.32	83.45	$\pm$ 4.7	83.56	$\pm$ 4.91	81.79	$\pm$ 5.47
SEG	91.60	$\pm$ 2.06	83.63	$\pm$ 2.36	86.06	$\pm$ 1.92	93.33	$\pm$ 1.99	93.42	$\pm$ 1.63	94.19	$\pm$ 1.54
SPA	89.28	$\pm$ 1.46	74.59	$\pm$ 2.41	74.01	$\pm$ 2.8	89	$\pm$ 1.65	89.39	$\pm$ 1.52	89.54	$\pm$ 1.01
TTTE	69.61	$\pm$ 4.92	60.54	$\pm$ 2.13	64.08	$\pm$ 4.56	70.02	$\pm$ 5.08	69.61	$\pm$ 4.92	71.07	$\pm$ 4.91
WP	86.98	$\pm$ 3.02	84.47	$\pm$ 4.17	84.92	$\pm$ 4.23	87.87	$\pm$ 2.73	87.57	$\pm$ 2.52	88.83	$\pm$ 2.86
WAV	78.38	$\pm$ 1.75	80.7	$\pm$ 1.46	83.48	$\pm$ 1.17	77.98	$\pm$ 1.55	78.38	$\pm$ 1.53	79.96	$\pm$ 1.41
WIL	93.55	$\pm$ 1.55	97.2	$\pm$ 1.13	97.7	$\pm$ 1.03	92.4	$\pm$ 1.82	94.8	$\pm$ 1.22	95.6	$\pm$ 1.34
BAN	90.22	$\pm$ 1.04	87.79	$\pm$ 1.19	89.95	$\pm$ 1.05	90.37	$\pm$ 0.82	90.85	$\pm$ 1.06	91.62	$\pm$ 0.90
LET	75.82	$\pm$ 0.95	75.11	$\pm$ 0.77	81.54	$\pm$ 0.88	77.28	$\pm$ 0.94	77.5	$\pm$ 0.56	81.26	$\pm$ 0.69

Table 5

Experimental results (AUC) of comparison algorithms

Data sets	ST_FCM	MLSTE	STHP	STDP	STDPNF	Ours
ACA	0.89	0.78	0.76	0.91	0.9	0.89
BIO	0.85	0.85	0.89	0.83	0.87	0.89
BTSC	0.79	0.78	0.73	0.74	0.78	0.79
AVI	0.96	0.88	0.89	0.96	0.96	0.95
CMCDS	0.65	0.59	0.6	0.63	0.67	0.71
GAU	0.79	0.92	0.83	0.75	0.78	0.84
GER	0.73	0.59	0.59	0.72	0.68	0.63
IS	0.96	0.94	0.96	0.96	0.97	0.98
ILPD	0.62	0.79	0.68	0.55	0.64	0.66
MM	0.81	0.82	0.84	0.86	0.86	0.86
SEG	0.98	0.94	0.95	0.98	0.98	0.98
SPA	0.98	0.92	0.91	0.98	0.98	0.98
TTTE	0.83	0.45	0.57	0.55	0.58	0.61
WP	0.87	0.85	0.80	0.88	0.87	0.88
WAV	0.89	0.91	0.92	0.88	0.88	0.89
WIL	0.97	0.98	0.98	0.97	0.97	0.98
BAN	0.92	0.92	0.95	0.94	0.94	0.93
LET	0.91	0.92	0.95	0.9	0.91	0.93

Table 6

Comparison algorithms and parameters in experiment 2

Abbreviation	Algorithms	Parameters
STDP	Self-training with DPC	$P_{a}=$ 2.
STDP_ENN	STDP combining ENN	$P_{a}=$ 2, $k=$ 3.
STDP_RENN	STDP combining RENN	$P_{a}=$ 2, $r=$ 3.
STDP_ALLKNN	STDP combining ALLKNN	$P_{a}=$ 2, $k=$ 5.
STDP_MENN	STDP combining MENN	$P_{a}=$ 2, $k=$ 3.
STDPNF	STDP combining PNF	$P_{a}=$ 2.
STDP_CEWS	STDP combining CEWS	$P_{a}=$ 2, $\theta=$ 0.1, significant level $\alpha=$ 0.05
STDP_GAMLNF	Our algorithm	$P_{a}=$ 2, $r=k\times n$ ( $n$ is the number of classes)

Table 7

Experimental results of kappa with KNN as base classifier

Data sets	STDP	STDP_ENN	STDP_RENN	STDP_ALLKNN	STDP_MENN	STDP_PNF	STDP_CEWS	Ours
ACA	0.3632	0.3466	0.3466	0.3467	0.3466	0.3645	0.3685	0.3761
BIO	0.4885	0.4806	0.4806	0.4823	0.4806	0.4996	0.4882	0.504
BTSC	0.2700	0.2605	0.2605	0.2605	0.2605	0.27	0.2641	0.2829
AVI	0.6246	0.6244	0.6248	0.6363	0.6244	0.6301	0.6234	0.6617
CMCDS	0.2598	0.24	0.24	0.238	0.2357	0.2535	0.2375	0.2638
GAU	0.7945	0.7993	0.7993	0.7702	0.7993	0.7945	0.7747	0.7886
GER	0.2754	0.2699	0.2699	0.2313	0.2699	0.2664	0.2655	0.2772
IS	0.8738	0.8753	0.8753	0.8819	0.8758	0.8738	0.8723	0.891
ILPD	0.2679	0.2312	0.2312	0.2795	0.2312	0.2665	0.2855	0.2632
MM	0.5771	0.5834	0.5834	0.5859	0.5954	0.5732	0.5928	0.5711
SEG	0.8546	0.8586	0.8586	0.8667	0.8586	0.8596	0.8606	0.8834
SPA	0.4455	0.4413	0.4418	0.4288	0.4417	0.4381	0.4409	0.4721
TTTE	0.1919	0.2573	0.2573	0.2414	0.2158	0.223	0.1894	0.2511
WP	0.7362	0.748	0.748	0.747	0.746	0.74	0.7363	0.7316
WAV	0.7554	0.7553	0.7553	0.7533	0.755	0.7578	0.7458	0.7692
WIL	0.9611	0.9618	0.9618	0.9625	0.9618	0.9618	0.9625	0.9718

Figure 4.

ROC of biodegradation data set.

Figure 5.

ROC of Contraceptive_Method_Choice data set.

5.3 Experiment 2: Comparisons between the proposed noise filter and existing ones

To evaluate the effectiveness of GAMLNF, we select six representative noise filters for comparison: ENN [26], RENN [27], ALLKNN [21], MENN [28], CEWS [20], and PNF [21]. To compare their abilities to handle mislabeling, the STDP algorithm is combined with each of the six noise filters (STDP_ENN, STDP_RENN, STDP_ALLKNN, STDP_MENN, STDP_CEWS, STDP_PNF). Table 6 shows the comparison algorithms and related parameters.

Experiment 2 uses the kappa coefficient to judge the quality of the noise filter. The kappa coefficient not only checks consistency and measures classification accuracy, but also reflects the ability to handle unbalanced classification [29]. It is calculated by Eq. (10), where $p_{o}$ is the classification accuracy. $p_{e}$ is calculated by Eq. (11), where the total number of samples is $n$ , the true number of samples in each class is $a_{1}$ , $a_{2}$ , $\ldots$ , $a_{c}$ , and the predicted number of samples in each class is $b_{1}$ , $b_{2}$ , $\ldots$ , $b_{c}$ .

$\displaystyle\text{kappa}=\frac{p_{o}-p_{e}}{1-p_{e}}$ (10)

$\displaystyle p_{e}=\frac{a_{1}\times b_{1}+a_{2}\times b_{2}+\ldots+a_{c}\times b_{c}}{n\times n}$ (11)
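Eqs. (10) and (11) translate directly into a few lines of NumPy. The sketch below is illustrative only; the function name `kappa` and the assumption that classes are encoded as integers $0,\ldots,c-1$ are not taken from the paper.

```python
import numpy as np

def kappa(y_true, y_pred, n_classes):
    """Kappa coefficient following Eqs. (10) and (11).

    Assumption: labels are integers in [0, n_classes).
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = y_true.size
    p_o = np.mean(y_true == y_pred)                # classification accuracy
    a = np.bincount(y_true, minlength=n_classes)   # true count per class
    b = np.bincount(y_pred, minlength=n_classes)   # predicted count per class
    p_e = float(np.dot(a, b)) / (n * n)            # chance agreement, Eq. (11)
    # Undefined when p_e == 1 (all samples in one class, truly and as predicted).
    return (p_o - p_e) / (1.0 - p_e)

# Worked example: 4 samples, 3 of them predicted correctly.
# p_o = 0.75, a = [2, 2], b = [3, 1], p_e = (2*3 + 2*1) / 16 = 0.5
print(kappa([0, 0, 1, 1], [0, 0, 1, 0], n_classes=2))  # 0.5
```

The example shows why kappa is stricter than accuracy: an accuracy of 0.75 corresponds to a kappa of only 0.5 once chance agreement is discounted.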

Tables 7 and 8 show the experimental results of GAMLNF in STDP and of the other comparison algorithms. The base classifiers in Tables 7 and 8 are KNN and CART, respectively.

From Tables 7 and 8, when the base classifier is KNN, the kappa coefficient of our algorithm is better than those of the comparison algorithms on 11 out of 16 data sets. When the base classifier is CART, it is better on 7 out of 16 data sets. In summary, the overall accuracy of STDP_GAMLNF is higher than that of the comparison algorithms, and it is most effective when the base classifier is KNN. However, our algorithm fails to perform well on the MM and TTTE data sets with either base classifier. In particular, all the noise filters reduce the accuracy on the TTTE data set. The spatial structure revealed by DPC is already sufficient to support the training of the self-training algorithm there, but the noise filter interferes with the relabeled samples and deletes too many unlabeled samples, which corrupts the complete spatial structure. By contrast, STDP_GAMLNF detects mislabeled samples through multi-local mean vectors and uses the harmonic average distance to focus on the local mean vectors close to the relabeled samples in each class, which gives it better detection ability. GAMLNF adopts a multi-local idea: by synthesizing multiple average vectors in different classes, the errors caused by edge samples are averaged out, which reduces the influence of edge samples on the algorithm.
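To make the multi-local mean and harmonic average distance idea concrete, the sketch below follows the k-harmonic nearest neighbor rule based on multi-local means [15], on which the description above appears to build. It is a simplified illustration, not the GAMLNF procedure itself: it uses the same fixed $k$ for every class and omits the globally adaptive choice of neighbors per class. The function names `harmonic_multi_local_label` and `is_mislabeled` are hypothetical.

```python
import numpy as np

def harmonic_multi_local_label(x, X, y, k):
    """Return the class whose local mean vectors are closest to x in harmonic mean distance.

    Assumption: for each class, the i-th local mean vector is the mean of the
    i nearest neighbours of x in that class (i = 1..k), as in [15].
    """
    best_class, best_dist = None, np.inf
    for c in np.unique(y):
        Xc = X[y == c]
        kc = min(k, len(Xc))
        d = np.linalg.norm(Xc - x, axis=1)
        nearest = Xc[np.argsort(d)[:kc]]
        # Running means of the 1, 2, ..., kc nearest neighbours.
        local_means = np.cumsum(nearest, axis=0) / np.arange(1, kc + 1)[:, None]
        dm = np.linalg.norm(local_means - x, axis=1)
        # The harmonic mean is dominated by the closest local mean vectors.
        hmd = kc / np.sum(1.0 / np.maximum(dm, 1e-12))
        if hmd < best_dist:
            best_class, best_dist = c, hmd
    return best_class

def is_mislabeled(x, assigned_label, X, y, k):
    """Flag a relabeled sample whose assigned label disagrees with the rule above."""
    return bool(harmonic_multi_local_label(x, X, y, k) != assigned_label)
```

Because the harmonic mean weights small distances most heavily, a single distant local mean vector (for example one pulled outward by an edge sample) has little effect on the decision.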

Table 8

Experimental results of kappa with CART as base classifier

Data sets	STDP	STDP_ENN	STDP_RENN	STDP_ALLKNN	STDP_MENN	STDP_PNF	STDP_CEWS	Ours
ACA	0.6698	0.6602	0.6836	0.6655	0.66	0.6698	0.6698	0.6727
BIO	0.6197	0.6408	0.6312	0.6096	0.6408	0.6412	0.6015	0.6338
BTSC	0.2895	0.2813	0.2813	0.2809	0.2813	0.2622	0.2755	0.2936
AVI	0.8739	0.8425	0.8327	0.8319	0.8425	0.8695	0.8772	0.8683
CMCDS	0.3009	0.3179	0.316	0.3017	0.3238	0.2778	0.3078	0.3125
GAU	0.4367	0.5616	0.5451	0.5487	0.5604	0.478	0.4769	0.5665
GER	0.3307	0.3556	0.3489	0.3439	0.3555	0.3597	0.3594	0.3367
IS	0.922	0.9250	0.926	0.9412	0.9235	0.9270	0.928	0.9316
ILPD	0.2278	0.2857	0.2429	0.2385	0.2974	0.2664	0.2599	0.2977
MM	0.6675	0.6971	0.6751	0.6679	0.6522	0.6703	0.672	0.6337
SEG	0.9219	0.9123	0.9245	0.9219	0.9123	0.9229	0.9235	0.9321
SPA	0.7684	0.7872	0.7858	0.8002	0.7955	0.7764	0.7648	0.7792
TTTE	0.357	0.3421	0.2522	0.3396	0.3303	0.3461	0.3546	0.3474
WP	0.7786	0.764	0.769	0.7775	0.7576	0.7742	0.779	0.7985
WAV	0.6691	0.7056	0.7177	0.7181	0.7056	0.675	0.6712	0.6989
WIL	0.8983	0.9391	0.9398	0.9411	0.9391	0.9304	0.9003	0.9411

Figure 6.

Variance results of different $K$ values.

Figure 7.

ACC of different algorithms with respect to different percentages of L on 6 data sets.

5.4 Experiment 3: Sensitivity of noise filters to the neighborhood $k$ value

In GAMLNF, $r$ depends on $k$ , so $k$ is the parameter that deserves particular attention. Therefore, it is only necessary to examine whether GAMLNF is sensitive to $k$ . The compared algorithms are the noise filters that have a $k$ value, namely ENN, RENN, ALLKNN, and MENN. To measure the sensitivity of each noise filter to the value of $k$ , $k$ is varied from 1 to 10, and the variance of the average accuracies obtained over these $k$ values is computed. The smaller the variance is, the less sensitive the noise filter is to $k$ .
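The sensitivity measure amounts to a variance over ten accuracy values. The following sketch assumes a hypothetical callable `mean_accuracy_for_k` that runs the cross-validated experiment for one noise filter at a given $k$ and returns its average accuracy; it also assumes the population variance, since the text does not say which variance estimator is used.

```python
import numpy as np

def k_sensitivity(mean_accuracy_for_k, ks=range(1, 11)):
    """Variance of the average accuracy over the tested k values (smaller = less sensitive).

    mean_accuracy_for_k: hypothetical callable mapping k to an average accuracy.
    Assumption: population variance (NumPy default, ddof=0).
    """
    accs = np.array([mean_accuracy_for_k(k) for k in ks], dtype=float)
    return float(accs.var())

# Synthetic stand-ins for real experiments, for illustration only.
flat = k_sensitivity(lambda k: 0.80)               # accuracy unaffected by k
drifting = k_sensitivity(lambda k: 0.80 + 0.01 * k)  # accuracy changes with k
print(flat < drifting)  # True
```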

The experimental results are shown in Fig. 6. The variance of GAMLNF is much lower than that of the other noise filters, and GAMLNF can adaptively and globally search for the appropriate $k_{i}$ for each class. Although GAMLNF is not as stable as ALLKNN on the ILPD data set, it has a better classification effect. A single pseudo-average vector is obtained from the $k_{i}$ local average vectors in each class, and the true class of each relabeled sample is determined from these pseudo-average vectors.

5.5 Experiment 4: The impact of ratio of labeled samples

The experimental parameters are consistent with Table 6. Because STDP_MENN, STDP_PNF and STDP_GAMLNF perform better than the other algorithms in Experiment 2, they are used as the compared algorithms. Figure 7 shows how the ratio of ${\bm{L}}$ (10% to 90%) affects the average accuracy on different data sets.

As shown in Fig. 7, GAMLNF performs better than the comparison algorithms on the ACA, BIO, IS, ILPD, SEG, and TTTE data sets. GAMLNF globally calculates the most suitable local average vector for each class, which improves the classification accuracy through multiple average centers. However, on the IS, ILPD, and TTTE data sets, when the ratio of ${\bm{L}}$ is 10%, the classification effect is significantly lower than that of the compared algorithms. GAMLNF needs to find a local average vector in each class. When the ratio of ${\bm{L}}$ is too small, GANN fails to find enough local neighbors in each class, and the global neighbors gather in only one class. It is worth noting that all algorithms increase in accuracy as the ratio of labeled samples increases, especially when the ratio of ${\bm{L}}$ rises from 10% to 50%.

6. Conclusions and future study

In this paper, we first explain in detail how we use GAMLNF to detect mislabeled samples. Second, we introduce how to repair the spatial structure damaged by the noise filter. Finally, we propose our STDP-GAMLNF algorithm framework. Specifically, the spatial structure of the data set is revealed by DPC, and this structure is then used to empower self-training to label unlabeled samples. Meanwhile, after each epoch of labeling, GAMLNF is used to detect whether each relabeled sample is mislabeled or not. SSR technology reconstructs the corrupted spatial structure. Together, the experiments confirm that (a) compared with mainstream methods, the proposed algorithm effectively improves the average accuracy and AUC value; (b) GAMLNF's ability to detect mislabeled samples is better than that of the comparison algorithms; (c) the presented algorithm is not sensitive to the value of the neighbor parameter $k$ and is able to adaptively find the appropriate number of neighbors of each class; and (d) the average accuracy of the algorithm improves when the ratio of ${\bm{L}}$ increases, and it is better than that of the comparison algorithms in most cases.

Future research should focus on how to modify the noise filter to improve the detection of mislabeled samples when there are very few labeled samples. Our study finds that increasing the number of labeled samples significantly improves the accuracy of the algorithm. We will therefore also explore how to relabel the erroneously labeled samples through the updated spatial structure, so that such samples are not simply deleted.

Compliance with ethical standards

Ethical approval: This article does not contain any studies with human participants or animals performed by any of the authors.

Funding details: This work is supported by the scientific and technological innovation project of double-city economic circle construction in Chengdu-Chongqing area (No. KJCX2020024), Chongqing University Innovation Research Group funding (No. CXQT20015).

Conflict of interest: The authors declare that they have no conflicts of interest.

Informed consent: Informed consent was obtained from all individual participants included in the study.

Authorship contributions

Shuaijun Li: Conceptualization, Methodology, Validation, Formal analysis, Data Curation, Writing-Original Draft. Jia Lu: Supervision, Writing-Review & Editing.

References

1. et al., Disentangled variational auto-encoder for semi-supervised learning, Information Sciences 482(12) (2019), 73–85.

2. Yuan et al., Semi-supervised stacked autoencoder-based deep hierarchical semantic feature for real-time fingerprint liveness detection, Journal of Real-Time Image Processing 17(1) (2020), 55–71.

3. Tran V.C., Nguyen N.T. and Fujita, A combination of active learning and self-learning for named entity recognition on twitter using conditional random fields, Knowledge-Based Systems 132 (2017), 179–187.

4. Liu et al., Boosting semi-supervised face recognition with noise robustness, IEEE Transactions on Circuits and Systems for Video Technology 99 (2021), 10–18.

5. Zhang et al., Traditional Chinese medicine knowledge service based on semi-supervised BERT-BiLSTM-CRF model, in: 2020 International Conference on Service Science, 2020, pp. 64–69.

6. and Zhu, Semi-supervised self-training method based on an optimum-path forest, IEEE Access 7(3) (2019), 36388–36399.

7. Pande and Awate S.P., Generative deep-neural-network mixture modeling with semi-supervised MinMax+EM learning, in: 25th International Conference on Pattern Recognition, 2021, pp. 5666–5673.

8. Slivka et al., A tool for flexible experimenting with co-training based semi-supervised algorithms, Knowledge-Based Systems 121(1) (2017), 2–8.

9. Anis et al., A sampling theory perspective of graph-based semi-supervised learning, IEEE Transactions on Information Theory 65(4) (2019), 2322–2342.

10. Mateos-García, García-Gutiérrez and Riquelme-Santos J.C., An evolutionary voting for k nearest neighbors, Expert Syst Appl 43 (2016), 9–14.

11. Mitani and Hamamoto, A local mean-based nonparametric classifier, Pattern Recognit Lett 27(10) (2006), 1151–1159.

12. Gou et al., Improved pseudo nearest neighbor classification, Knowl Based Syst 70 (2014), 361–375.

13. Pan et al., A new globally adaptive k-nearest neighbor classifier based on local mean optimization, Soft Computing 25(3) (2021), 1–15.

14. Pan, Wang and Ku, A new general nearest neighbor classification based on the mutual neighborhood information, Knowl Based Syst 121(1) (2017), 142–152.

15. Pan, Wang and Ku, A new k-harmonic nearest neighbor classifier based on the multi-local means, Expert Syst Appl 67(2) (2017), 115–125.

16. Adankon M.M. and Cheriet, Help-Training for semi-supervised support vector machines, Pattern Recognition 44(9) (2011), 2220–2230.

17. Gan et al., Using clustering analysis to improve semi-supervised classification, Neurocomputing 101(3) (2013), 290–298.

18. et al., A self-training semi-supervised classification algorithm based on density peaks of data and differential evolution, in: IEEE 15th International Conference on Networking Sensing and Control, 2018, pp. 1–6.

19. Wei, Wang and Zhao, Semi-supervised multi-label image classification based on nearest neighbor editing, Neurocomputing 119 (2013), 462–468.

20. Wei, Yang and Qiu, Improving self-training with density peaks of data and cut edge weight statistic, Soft Computing 24 (2020), 15595–15610.

21. Zhu and Wu, A self-training method based on density peaks and an extended parameter-free local noise filter for k nearest neighbor, Knowledge-Based Systems 184(3) (2019), 104–113.

22. Triguero et al., On the characterization of noise filters for self-training semi-supervised in nearest neighbor classification, Neurocomputing 132 (2014), 30–41.

23. Asuncion and Newman, UCI machine learning repository, 2007. Available: http://archive.ics.uci.edu/ml/datasets.php.

24. A self-training hierarchical prototype-based approach for semi-supervised classification, Information Sciences 535 (2020), 204–224.

25. Shang, Luo et al., Self-training semi-supervised classification based on density peaks of data, Neurocomputing 275 (2017), 180–191.

26. Wilson D.L., Asymptotic properties of nearest neighbor rules using edited data, IEEE Trans Syst Man Cybern 2(3) (1972), 408–421.

27. Tomek, An experiment with the edited nearest-neighbor rule, IEEE Trans Syst Man Cybern 6(6) (1976), 448–452.

28. Hattori and Takahashi, A new edited k-nearest neighbor rule in the pattern classification problem, Pattern Recognit 33(3) (2000), 521–528.

29. Pei et al., A threshold-free classification mechanism in genetic programming for high-dimensional unbalanced classification, in: IEEE Congress on Evolutionary Computation, 2020, pp. 1–8.
