An LVQ clustering algorithm based on neighborhood granules

Abstract

Learning Vector Quantization (LVQ) is a clustering method that uses supervised information and combines a simple structure with strong capability. LVQ assumes that the data samples are labeled, and the learning process uses the labels to assist clustering. However, LVQ is sensitive to initial values, which can lead to poor clustering. To overcome this shortcoming, a granular LVQ clustering algorithm is proposed that combines neighborhood granulation with LVQ. First, neighborhood granulation is applied to features of each sample in the data set, forming a neighborhood granular vector. Next, the size and operations of neighborhood granular vectors are defined, and the relative and absolute granular distances between granular vectors are proposed. Finally, these granular distances are proved to be metrics, and a granular LVQ clustering algorithm is designed. Experiments on several UCI data sets show that the granular LVQ clustering is better than the traditional LVQ clustering under suitable neighborhood parameters and distance measures.

Keywords

Supervised learning, granular computing, LVQ, clustering, neighborhood granules

1 Introduction

Granular computing is a branch of artificial intelligence that covers fuzzy sets, rough sets, and quotient space theory. Zadeh, the founder of fuzzy sets, argued that information granules exist in many fields [1] and that they differ from field to field. In 1982, Pawlak proposed rough set theory [2], describing the concepts of information granules and granular size [3]. Lin first proposed the concept of granular computing in 1996 [4] and applied it to data mining. Kang analyzed the structures of granules from the viewpoints of sets and formal concept analysis [5]. Liu defined a granular language and constructed a new model of logical reasoning based on information granules [6]. Chen and Zhang designed a granular classifier based on convolution operations [7]. Chen proposed a granular computing-based classification method built on an algebraic granule structure [8]. Miao studied the correlation between information entropy and granular computing and introduced information entropy into the field of granular computing [9, 10]. To further develop rough set theory, many scholars introduced the neighborhood relation into rough sets and proposed neighborhood granular computing [11, 12]. Yao proposed a neighborhood rough set model [13, 14], and Hu proposed a neighborhood granulation method to construct neighborhood classifiers [15]. Wang analyzed the connection between rough sets and granular computing and developed several granular computing models and applications [16].

Clustering is an important research topic in machine learning, data mining, and pattern recognition, and it plays an extremely important role in identifying the internal structure of data [17]. The goal of clustering is to group similar samples together and to separate samples of different categories [18–20]. Cluster analysis involves three similarity measures: similarity between samples, similarity between classes, and similarity between different clustering results [21]. LVQ is a prototype-based clustering algorithm. Like most supervised learning algorithms, the original LVQ needs class labels [22]. However, the LVQ clustering algorithm is highly sensitive to initial values: if the selected initial values deviate too much, the clustering is poor and the clustering accuracy is insufficient. Many scholars have therefore improved the LVQ algorithm. For example, Tapan proposed a data-structure-preserving variant of LVQ [23]. Cruz-Vega proposed an LVQ algorithm based on granular computing [24]. Shen proposed a novel OSSL method based on learning vector quantization [25]. Jatmiko proposed Adaptive Fuzzy-Neuro Generalized Learning Vector Quantization using the PI membership function (AFNGLVQ-PI) [26].

The theory of granular computing is relatively mature and is mainly used for classification in machine learning, while the literature on combining clustering with granular computing is comparatively limited [27]. In this paper, the LVQ clustering algorithm is combined with neighborhood granulation, and the Neighborhood Granular LVQ (NGLVQ) algorithm is proposed to improve the clustering of nonlinear fractal data. This method also lets each neighborhood granular vector carry global information and improves the convergence rate of clustering. A neighborhood granule is constructed on each feature of a sample, and the granules over the multidimensional features form a neighborhood granular vector. By defining the size and operation rules of neighborhood granular vectors, granular distance measures are obtained. Based on these measures, the NGLVQ clustering algorithm is designed. Finally, clustering experiments are performed on several UCI datasets. Experimental results show that NGLVQ is better than the traditional LVQ on some clustering indexes.

2 Neighborhood granulation and granular vectors

The neighborhood granulation technology refers to the neighborhood rough set model proposed by Yao [28] and the neighborhood classifier proposed by Hu [29]. To facilitate the measurement and calculation of neighborhood granules, the concept of neighborhood granular vector is proposed.

Set the information system as IS = (U, F), where the sample set is U = {x₁, x₂, . . . , x_n} and the attribute set is F = {a₁, a₂, . . . , a_m}. Given a sample x ∈ U, for any attribute a ∈ F, v (x, a) ∈ [0, 1] represents the normalized value of the sample x on the attribute a.

Set the information system as IS = (U, F), for samples x, y ∈ U, a single attribute a ∈ F, then the Manhattan distance between x and y on the single attribute a is: $S_{a} (x, y) = | v (x, a) - v (y, a) | .$ (1)

Define 1. Set the information system as IS = (U, F), for samples x, y ∈ U, a single attribute a ∈ F, given a neighborhood parameter δ, the neighborhood discriminant function of samples x, y is defined as: $φ (x, y) = \begin{cases} 0, & S_{a} (x, y) > δ \\ 1, & S_{a} (x, y) \leq δ \end{cases}$ (2)

When φ (x, y) =1, x, y are neighbors. If φ (x, y) =0, it indicates that x, y are not adjacent.

Define 2. Given that the information system is IS = (U, F), for any sample x ∈ U and any attribute a ∈ F, the neighborhood granulation of x on the attribute a produces the neighborhood granule defined as: $g_{a} (x) = \{r_{j}\}_{j = 1}^{n} = \{r_{1}, r_{2}, \dots, r_{n}\},$ (3) where r_j = φ (x, x_j) is the neighborhood discriminant function of samples x, x_j on attribute a, indicating whether they are adjacent, and n = |U| is the number of samples.

Define 3. Set the information system as IS = (U, F), for any sample x ∈ U, any attribute subset P ⊆ F, suppose P = {a₁, a₂, . . . , a_m}, then the neighborhood granular vector of x on the attribute subset P is defined as: $G_{P} (x) = {(g_{1} (x), g_{2} (x), \dots, g_{m} (x))}^{T},$ (4) where g_i (x) is the neighborhood granule of sample x on attribute a_i (i = 1, 2, . . . , m).

Neighborhood granular vector is composed of neighborhood granules, which are composed of 0 or 1 and represent the neighborhood relationship between samples. Neighborhood granules are ordered sets of zeros or ones. Thus, the element of the neighborhood granular vector is an ordered set, unlike the traditional vector, where the element is a real number.

Define 4. Set the information system as IS = (U, F), for any sample x ∈ U, any attribute a ∈ F, the size of neighborhood granule g_a (x) is defined as: $Size (g_{a} (x)) = | g_{a} (x) | = \sum_{j = 1}^{n} r_{j} .$ (5)

It is easy to see that the size of a neighborhood granule satisfies 1 ≤ |g_a (x)| ≤ n, since every r_j is 0 or 1 and x is always a neighbor of itself.

Define 5. Set the information system as IS = (U, F), for any sample x ∈ U, any attribute subset P ⊆ F, suppose P = {a₁, a₂, . . . , a_m}, then the size of neighborhood granular vector G_P (x) of x is defined as: $Size (G_{P} (x)) = | G_{P} (x) | = \sqrt{\sum_{i = 1}^{m} {| g_{i} (x) |}^{2}} .$ (6)

The size of neighborhood granular vector G_P (x) is also called the modulus of the neighborhood granular vector, and it is easy to see that it satisfies: $\sqrt{m} \leq | G_{P} (x) | \leq n \sqrt{m}$ .
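The two bounds each follow in one step. The short derivation below spells this out using only Definitions 1, 4 and 5; it assumes δ ≥ 0, so that S_a (x, x) = 0 ≤ δ and hence φ (x, x) = 1.

```latex
\begin{align*}
r_j \in \{0, 1\},\ \varphi(x, x) = 1
  &\;\Rightarrow\; 1 \le |g_i(x)| = \sum_{j=1}^{n} r_j \le n, \\
m \cdot 1^2 \le \sum_{i=1}^{m} |g_i(x)|^2 \le m \cdot n^2
  &\;\Rightarrow\; \sqrt{m} \le |G_P(x)| \le n\sqrt{m}.
\end{align*}
```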

Example 1. An information system IS = (U, F) is shown in Table 1, where U = {x₁, x₂, x₃, x₄} is the sample set and F = {a, b, c} is the attribute set. Set the neighborhood granulation parameter as δ = 0.1.

If neighborhood granulation is carried out according to attribute a, the neighborhood granules are: g₁ = g_a (x₁) = {1, 1, 0, 0}, g₂ = g_a (x₂) = {1, 1, 1, 0}, g₃ = g_a (x₃) = {0, 1, 1, 0}, g₄ = g_a (x₄) = {0, 0, 0, 1}.

If neighborhood granulation is carried out according to attribute b, the neighborhood granules are: g₅ = g_b (x₁) = {1, 0, 1, 1}, g₆ = g_b (x₂) = {0, 1, 0, 0}, g₇ = g_b (x₃) = {1, 0, 1, 0}, g₈ = g_b (x₄) = {1, 0, 0, 1}.

If neighborhood granulation is carried out according to attribute c, the neighborhood granules are: g₉ = g_c (x₁) = {1, 1, 0, 0}, g₁₀ = g_c (x₂) = {1, 1, 1, 1}, g₁₁ = g_c (x₃) = {0, 1, 1, 1}, g₁₂ = g_c (x₄) = {0, 1, 1, 1}.

Table 1

An information system

U	a	b	c
x ₁	0.1	0.2	0.1
x ₂	0.2	0.5	0.2
x ₃	0.3	0.3	0.3
x ₄	0.7	0.1	0.3

If P = {a, b, c}, then the neighborhood granular vector of x₁ on P is:

G_P (x₁) = (g_a (x₁) , g_b (x₁) , g_c (x₁)) ^T = ({1, 1, 0, 0} , {1, 0, 1, 1} , {1, 1, 0, 0}) ^T. The size of the neighborhood granular vector is: $| G_{P} (x_{1}) | = \sqrt{\sum_{i = 1}^{3} {| g_{i} (x_{1}) |}^{2}} = \sqrt{2^{2} + 3^{2} + 2^{2}} = \sqrt{17} \approx 4.123 .$

The neighborhood granular vector of x₂ on P is:

G_P (x₂) = (g_a (x₂) , g_b (x₂) , g_c (x₂)) ^T = ({1, 1, 1, 0} , {0, 1, 0, 0} , {1, 1, 1, 1}) ^T. The size of the neighborhood granular vector is: $| G_{P} (x_{2}) | = \sqrt{\sum_{i = 1}^{3} {| g_{i} (x_{2}) |}^{2}} = \sqrt{3^{2} + 1^{2} + 4^{2}} = \sqrt{26} \approx 5.099 .$

The neighborhood granular vector of x₃ on P is:

G_P (x₃) = (g_a (x₃) , g_b (x₃) , g_c (x₃)) ^T = ({0, 1, 1, 0} , {1, 0, 1, 0} , {0, 1, 1, 1}) ^T. The size of the neighborhood granular vector is: $| G_{P} (x_{3}) | = \sqrt{\sum_{i = 1}^{3} {| g_{i} (x_{3}) |}^{2}} = \sqrt{2^{2} + 2^{2} + 3^{2}} = \sqrt{17} \approx 4.123 .$

The neighborhood granular vector of x₄ on P is:

G_P (x₄) = (g_a (x₄) , g_b (x₄) , g_c (x₄)) ^T = ({0, 0, 0, 1} , {1, 0, 0, 1} , {0, 1, 1, 1}) ^T. The size of the neighborhood granular vector is: $| G_{P} (x_{4}) | = \sqrt{\sum_{i = 1}^{3} {| g_{i} (x_{4}) |}^{2}} = \sqrt{1^{2} + 2^{2} + 3^{2}} = \sqrt{14} \approx 3.742 .$
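Example 1 can be reproduced numerically. The following is a minimal NumPy sketch of Definitions 1 to 5 applied to Table 1, not the authors' implementation; the function names and the small floating-point tolerance `tol` are assumptions introduced here so that differences equal to δ are treated as neighbors despite rounding.

```python
import numpy as np

def granulate(V, delta, tol=1e-9):
    """Return R with R[i, k, j] = 1 if samples i and j are neighbors on attribute k.

    V is an (n, m) array of values normalized to [0, 1]; R[i] is the
    neighborhood granular vector of sample i (m granules of length n).
    `tol` is an assumed floating-point tolerance, not part of Definition 1.
    """
    V = np.asarray(V, dtype=float)
    # Per-attribute distance S_a(x_i, x_j) for all pairs, shape (n, n, m).
    diff = np.abs(V[:, None, :] - V[None, :, :])
    # Discriminant function phi: 1 where the distance is at most delta.
    phi = (diff <= delta + tol).astype(int)
    # Reorder to (sample, attribute, neighbor) so R[i, k] is the granule g_k(x_i).
    return phi.transpose(0, 2, 1)

def granule_sizes(R):
    """Definition 4: number of ones in each granule, shape (n, m)."""
    return R.sum(axis=2)

def vector_sizes(R):
    """Definition 5: modulus of each granular vector, shape (n,)."""
    return np.sqrt((granule_sizes(R) ** 2).sum(axis=1))

# Table 1: rows are x1..x4, columns are attributes a, b, c.
V = np.array([[0.1, 0.2, 0.1],
              [0.2, 0.5, 0.2],
              [0.3, 0.3, 0.3],
              [0.7, 0.1, 0.3]])
R = granulate(V, delta=0.1)

print(R[0].tolist())  # [[1, 1, 0, 0], [1, 0, 1, 1], [1, 1, 0, 0]]
for i, size in enumerate(vector_sizes(R), start=1):
    print(f"|G_P(x{i})| = {size:.3f}")  # 4.123, 5.099, 4.123, 3.742
```

Storing all granular vectors in one (n, m, n) array makes the O(m*n²) cost of granulation explicit: every sample is compared with every other sample on every attribute.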

3 Operations and distances of neighborhood granular vectors

Define 6. Set the information system as IS = (U, F), where the attribute set is F = {a₁, a₂, . . . , a_m}. For ∀x, y ∈ U, there exist two neighborhood granular vectors G_F (x) = (g₁ (x) , g₂ (x) , . . . , g_m (x)) ^T and G_F (y) = (g₁ (y) , g₂ (y) , . . . , g_m (y)) ^T on F, then the intersection, union, subtraction and xor operations of the two neighborhood granular vectors are defined as: $G_{F} (x) \land G_{F} (y) = (g_{1} (x) \land g_{1} (y), g_{2} (x) \land g_{2} (y), \dots, g_{m} (x) \land g_{m} (y))^{T};$ (7) $G_{F} (x) \lor G_{F} (y) = (g_{1} (x) \lor g_{1} (y), g_{2} (x) \lor g_{2} (y), \dots, g_{m} (x) \lor g_{m} (y))^{T};$ (8) $G_{F} (x) - G_{F} (y) = (g_{1} (x) - g_{1} (y), g_{2} (x) - g_{2} (y), \dots, g_{m} (x) - g_{m} (y))^{T};$ (9) $G_{F} (x) \oplus G_{F} (y) = (g_{1} (x) \oplus g_{1} (y), g_{2} (x) \oplus g_{2} (y), \dots, g_{m} (x) \oplus g_{m} (y))^{T} .$ (10)

Define 7. Set the information system as IS = (U, F), where the attribute set is F = {a₁, a₂, . . . , a_m}. For ∀x, y ∈ U, there exist two neighborhood granular vectors G_F (x) = (g₁ (x) , g₂ (x) , . . . , g_m (x)) ^T and G_F (y) = (g₁ (y) , g₂ (y) , . . . , g_m (y)) ^T on F, then the relative distance of the two neighborhood granular vectors is defined as: $d (G_{F} (x), G_{F} (y)) = \frac{1}{m} \sum_{i = 1}^{m} \frac{| g_{i} (x) \oplus g_{i} (y) |}{| g_{i} (x) \lor g_{i} (y) |},$ (11) where |F| = m.

It is easy to see that the relative distance of neighborhood granular vectors satisfies: 0 ≤ d (G_F (x) , G_F (y)) ≤ 1.

Define 8. Set the information system as IS = (U, F), where the attribute set is F = {a₁, a₂, . . . , a_m}. For ∀x, y ∈ U, there exist two neighborhood granular vectors G_F (x) = (g₁ (x) , g₂ (x) , . . . , g_m (x)) ^T and G_F (y) = (g₁ (y) , g₂ (y) , . . . , g_m (y)) ^T on F, then the absolute distance between the two neighborhood granular vectors is defined as: $h (G_{F} (x), G_{F} (y)) = \frac{1}{m * n} \sum_{i = 1}^{m} | g_{i} (x) \oplus g_{i} (y) |,$ (12) where |F| = m and |U| = n.

It is easy to see that the absolute distance of neighborhood granular vectors satisfies: 0 ≤ h (G_F (x) , G_F (y)) ≤ 1.
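To make Eqs. (11) and (12) concrete, the sketch below evaluates both distances for G_P (x₁) and G_P (x₂) from Example 1. It interprets ∨ and ⊕ element-wise on the 0/1 entries of each granule, which is an assumption, since the text defines the vector operations in terms of granule operations without spelling the latter out; the function names are illustrative.

```python
import numpy as np

def relative_distance(Gx, Gy):
    """Eq. (11): mean over attributes of |g xor g'| / |g or g'|."""
    Gx, Gy = np.asarray(Gx, dtype=bool), np.asarray(Gy, dtype=bool)
    xor_sizes = np.logical_xor(Gx, Gy).sum(axis=1)
    # The union is never empty for granules of real samples, because each
    # granule contains at least the sample itself.
    union_sizes = np.logical_or(Gx, Gy).sum(axis=1)
    return float(np.mean(xor_sizes / union_sizes))

def absolute_distance(Gx, Gy):
    """Eq. (12): number of differing entries divided by m * n."""
    Gx, Gy = np.asarray(Gx, dtype=bool), np.asarray(Gy, dtype=bool)
    m, n = Gx.shape
    return float(np.logical_xor(Gx, Gy).sum() / (m * n))

# Granular vectors of x1 and x2 from Example 1 (rows: attributes a, b, c).
G1 = [[1, 1, 0, 0], [1, 0, 1, 1], [1, 1, 0, 0]]
G2 = [[1, 1, 1, 0], [0, 1, 0, 0], [1, 1, 1, 1]]

print(round(relative_distance(G1, G2), 4))  # (1/3 + 4/4 + 2/4) / 3 = 0.6111
print(round(absolute_distance(G1, G2), 4))  # (1 + 4 + 2) / 12 = 0.5833
```

Under this reading, each term of the relative distance is the Jaccard distance between two granules viewed as sets, and the absolute distance is a normalized Hamming distance.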

Theorem 1. The relative distance between two neighborhood granular vectors is a distance measure, which satisfies the following three properties:

(1) Non-negative, 0 ≤ d (G_F (x) , G_F (y)) ≤1;

(2) Symmetry, d (G_F (x) , G_F (y)) = d (G_F (y) , G_F (x));

(3) Triangle inequality, d (G_F (x) , G_F (y)) + d (G_F (y) , G_F (z)) ≥ d (G_F (x) , G_F (z)).

Proof. (1) Since g_i (x) ⊕ g_i (y) = g_i (x) ∨ g_i (y) - g_i (x) ∧ g_i (y), we have $\frac{| g_{i} (x) \oplus g_{i} (y) |}{| g_{i} (x) \lor g_{i} (y) |} = \frac{| g_{i} (x) \lor g_{i} (y) - g_{i} (x) \land g_{i} (y) |}{| g_{i} (x) \lor g_{i} (y) |}$ , and hence $0 \leq \frac{| g_{i} (x) \oplus g_{i} (y) |}{| g_{i} (x) \lor g_{i} (y) |} \leq 1$ . Since F = {a₁, . . . , a_m}, |F| = m. Therefore, $0 \leq \sum_{i = 1}^{m} \frac{| g_{i} (x) \oplus g_{i} (y) |}{| g_{i} (x) \lor g_{i} (y) |} \leq m$ , and so $0 \leq \frac{1}{m} \sum_{i = 1}^{m} \frac{| g_{i} (x) \oplus g_{i} (y) |}{| g_{i} (x) \lor g_{i} (y) |} \leq 1$ . Hence 0 ≤ d (G_F (x) , G_F (y)) ≤ 1.

(2) Since g_i (x) ∨ g_i (y) = g_i (y) ∨ g_i (x) and g_i (x) ∧ g_i (y) = g_i (y) ∧ g_i (x), we have $\frac{1}{m} \sum_{i = 1}^{m} \frac{| g_{i} (x) \oplus g_{i} (y) |}{| g_{i} (x) \lor g_{i} (y) |} = \frac{1}{m} \sum_{i = 1}^{m} \frac{| g_{i} (y) \oplus g_{i} (x) |}{| g_{i} (y) \lor g_{i} (x) |}$ . Therefore, d (G_F (x) , G_F (y)) = d (G_F (y) , G_F (x)).

(3) From the literature [30], $\frac{| g_{i} (x) \oplus g_{i} (y) |}{| g_{i} (x) \lor g_{i} (y) |} + \frac{| g_{i} (y) \oplus g_{i} (z) |}{| g_{i} (y) \lor g_{i} (z) |} \geq \frac{| g_{i} (x) \oplus g_{i} (z) |}{| g_{i} (x) \lor g_{i} (z) |}$ for every i. Summing over i and dividing by m, $\frac{1}{m} \sum_{i = 1}^{m} \frac{| g_{i} (x) \oplus g_{i} (y) |}{| g_{i} (x) \lor g_{i} (y) |} + \frac{1}{m} \sum_{i = 1}^{m} \frac{| g_{i} (y) \oplus g_{i} (z) |}{| g_{i} (y) \lor g_{i} (z) |} \geq \frac{1}{m} \sum_{i = 1}^{m} \frac{| g_{i} (x) \oplus g_{i} (z) |}{| g_{i} (x) \lor g_{i} (z) |}$ holds. By the definition of the relative distance of neighborhood granular vectors, d (G_F (x) , G_F (y)) + d (G_F (y) , G_F (z)) ≥ d (G_F (x) , G_F (z)).

Theorem 2. The absolute distance between two neighborhood granular vectors is a distance measure, which satisfies the following three properties:

(1) Non-negative, 0 ≤ h (G_F (x) , G_F (y)) ≤1;

(2) Symmetry, h (G_F (x) , G_F (y)) = h (G_F (y) , G_F (x));

(3) Triangle inequality, h (G_F (x) , G_F (y)) + h (G_F (y) , G_F (z)) ≥ h (G_F (x) , G_F (z)).

Proof. (1) Since g_i (x) ⊕ g_i (y) = g_i (x) ∨ g_i (y) - g_i (x) ∧ g_i (y) and 1 ≤ |g_i (x)| ≤ n, we have 0 ≤ |g_i (x) ⊕ g_i (y)| ≤ n. Since F = {a₁, a₂, . . . , a_m}, |F| = m. Therefore, $0 \leq \sum_{i = 1}^{m} | g_{i} (x) \oplus g_{i} (y) | \leq m * n$ , and so $0 \leq \frac{1}{m * n} \sum_{i = 1}^{m} | g_{i} (x) \oplus g_{i} (y) | \leq 1$ . Hence 0 ≤ h (G_F (x) , G_F (y)) ≤ 1.

(2) Since g_i (x) ∨ g_i (y) = g_i (y) ∨ g_i (x) and g_i (x) ∧ g_i (y) = g_i (y) ∧ g_i (x), we have $\frac{1}{m * n} \sum_{i = 1}^{m} | g_{i} (x) \oplus g_{i} (y) | = \frac{1}{m * n} \sum_{i = 1}^{m} | g_{i} (y) \oplus g_{i} (x) |$ . Therefore, h (G_F (x) , G_F (y)) = h (G_F (y) , G_F (x)).

(3) For every i, |g_i (x) ⊕ g_i (y)| + |g_i (y) ⊕ g_i (z)| ≥ |g_i (x) ⊕ g_i (z)|. Summing over i and dividing by m * n gives $\frac{1}{m * n} \sum_{i = 1}^{m} | g_{i} (x) \oplus g_{i} (y) | + \frac{1}{m * n} \sum_{i = 1}^{m} | g_{i} (y) \oplus g_{i} (z) | \geq \frac{1}{m * n} \sum_{i = 1}^{m} | g_{i} (x) \oplus g_{i} (z) |$ . By the definition of the absolute distance of neighborhood granular vectors, h (G_F (x) , G_F (y)) + h (G_F (y) , G_F (z)) ≥ h (G_F (x) , G_F (z)).
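The per-attribute triangle inequalities used in step (3) of both proofs can be checked by brute force on small granules. The sketch below enumerates all 0/1 granules of length 4 and counts violations; it is a sanity check under the element-wise reading of ⊕ and ∨, not a substitute for the proofs. Treating the ratio as 0 when both granules are empty is an assumed convention (real granules are never empty).

```python
import itertools
import numpy as np

def jaccard_term(s, t):
    """Per-attribute term of the relative distance, Eq. (11)."""
    union = np.logical_or(s, t).sum()
    # Assumed convention: two empty granules are at distance 0.
    return np.logical_xor(s, t).sum() / union if union else 0.0

def hamming_term(s, t):
    """Per-attribute term of the absolute distance, Eq. (12)."""
    return np.logical_xor(s, t).sum() / len(s)

n = 4
granules = [np.array(bits, dtype=bool)
            for bits in itertools.product([0, 1], repeat=n)]

def violations(term, eps=1e-12):
    """Count triples (s, t, u) with term(s,t) + term(t,u) < term(s,u)."""
    count = 0
    for s, t, u in itertools.product(granules, repeat=3):
        if term(s, t) + term(t, u) < term(s, u) - eps:
            count += 1
    return count

print(violations(jaccard_term), violations(hamming_term))  # 0 0
```

Because both vector distances are averages of these per-attribute terms, an inequality that holds for every attribute carries over to d and h.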

4 The neighborhood granular LVQ clustering algorithm

The LVQ algorithm is a prototype-based clustering method: it initializes a set of prototypes and updates them iteratively to obtain their final parameters. LVQ thus tries to find a group of prototype vectors that describe the data, but it assumes that the samples are labeled before training, and the labels are used during training to assist clustering. The original LVQ algorithm is sensitive to the initial values, and the clustering may even fail. Since a granule contains global information about a sample, granular computing is introduced into LVQ clustering to reduce the sensitivity to initial values. Based on the neighborhood granular vectors and distance measures defined above, the Neighborhood Granular LVQ (NGLVQ) algorithm is designed. It clusters with the granular vector as the basic unit: it first initializes the center points of q granular clusters and then trains iteratively to find q new clusters.

4.1 The cluster principle of NGLVQ

The data are granulated before the NGLVQ clustering algorithm is trained. First, neighborhood granulation is carried out on the samples. After neighborhood granulation, each sample becomes a granular vector, and the label of each granular vector is the original category label. Assuming that the number of clusters is q, the learning objective is to find q prototype granular vectors P₁, P₂, . . . , P_q. In each iteration, a labeled granular vector is randomly selected from the granular vectors. Then, the closest prototype granular vector P_i is found according to the distance between granular vectors. The prototype granular vector is updated according to whether the labels of the two granular vectors are the same. After the maximum number of iterations is reached, the final prototype granular vectors P₁, P₂, . . . , P_q are obtained. Finally, the sample set is divided into q clusters according to the distance between granular vectors.

For a granular vector set GT = {G_F (x₁) , G_F (x₂) , . . . , G_F (x_n)}, the prototype granular vector is: $μ = \frac{1}{n} \sum_{i = 1}^{n} G_{F} (x_{i}) .$ (13)

The prototype granular vector is the centroid of the granular vectors in the same cluster. For q clusters, there are q prototype granular vectors, represented as (μ₁, μ₂, . . . , μ_q).
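As a small illustration of Eq. (13), the sketch below averages G_P (x₁) and G_P (x₂) from Example 1. Reading the sum and the factor 1/n element-wise is an assumption; under that reading a prototype generally has fractional entries, so it is no longer a 0/1 granular vector, and the operations ⊕ and ∨ need a real-valued interpretation before distances to a prototype can be computed.

```python
import numpy as np

def prototype(granular_vectors):
    """Eq. (13): element-wise mean of granular vectors, each of shape (m, n)."""
    return np.mean(np.asarray(granular_vectors, dtype=float), axis=0)

# Granular vectors of x1 and x2 from Example 1 (rows: attributes a, b, c).
G1 = np.array([[1, 1, 0, 0], [1, 0, 1, 1], [1, 1, 0, 0]])
G2 = np.array([[1, 1, 1, 0], [0, 1, 0, 0], [1, 1, 1, 1]])

mu = prototype([G1, G2])
print(mu[0].tolist())  # [1.0, 1.0, 0.5, 0.0]
```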

4.2 The NGLVQ clustering algorithm

Input: An information system IS = (U, F, T), where the sample set is U = {x₁, x₂, . . . , x_n}, its corresponding label set is T = {y₁, y₂, . . . , y_n}, and the attribute set is F = {a₁, a₂, . . . , a_m}; the number of clusters q, a neighborhood parameter δ, a maximum number of iterations N, and a learning rate η ∈ [0, 1];

Process: (1) The sample set U is granulated by neighbors to become GT = {G_F (x₁) , G_F (x₂) , . . . , G_F (x_n)};

(2) q neighborhood granular vectors are randomly selected from GT as the initial prototype granular vectors (μ₁, μ₂, . . . , μ_q). Suppose their corresponding labels are (y₁, y₂, . . . , y_q);

(3) For t = 1 to N

(3.1) A granulated sample (G_F (x_r) , y_r) is randomly selected from GT and T;

(3.2) Calculate the granular distance between the neighborhood granular vector G_F (x_r) and each prototype granular vector μ_j (j = 1, 2, . . . , q): d_rj = d (G_F (x_r) , μ_j) or d_rj = h (G_F (x_r) , μ_j); find the prototype granular vector $μ_{j^{*}}$ closest to G_F (x_r), where $j^{*} = \arg \min_{j \in \{1, 2, \dots, q\}} d_{rj}$;

(3.3) if $y_{r} = y_{j^{*}}$ then $μ^{'} = μ_{j^{*}} + η (G_{F} (x_{r}) - μ_{j^{*}})$ (14) else $μ^{'} = μ_{j^{*}} - η (G_{F} (x_{r}) - μ_{j^{*}})$ (15) end if

The prototype granular vector $μ_{j^{*}}$ is updated by μ′;

$η (t) = η (1) (1 - \frac{t}{N})$ .

(4) For i = 1, 2, . . . , n, calculate the granular distance between the neighborhood granular vector G_F (x_i) and each prototype granular vector μ_j (j = 1, 2, . . . , q): d_ij = d (G_F (x_i) , μ_j) or d_ij = h (G_F (x_i) , μ_j); mark x_i with the category λ_j corresponding to the smallest d_ij; finally, update $C_{λ_{j}} = C_{λ_{j}} \cup \{x_{i}\}$;

Output: A cluster partition C = (C₁, C₂, . . . , C_q).
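The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the input is taken to be already normalized to [0, 1]; only the absolute distance is used, extended to real-valued prototypes as the mean of |G - μ| over all entries (which coincides with Eq. (12) when both arguments are 0/1); and the parameters `init_idx`, `seed`, and `tol` are hypothetical additions for reproducibility.

```python
import numpy as np

def granulate(V, delta, tol=1e-9):
    """R[i, k, j] = 1.0 if samples i and j are neighbors on attribute k."""
    V = np.asarray(V, dtype=float)
    diff = np.abs(V[:, None, :] - V[None, :, :])
    return (diff <= delta + tol).astype(float).transpose(0, 2, 1)

def abs_dist(G, mu):
    """Assumed real-valued extension of Eq. (12): mean |G - mu| over all entries."""
    return np.abs(G - mu).mean()

def nglvq(V, y, q, delta, n_iter=200, eta0=0.2, init_idx=None, seed=0):
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    GT = granulate(V, delta)                       # step (1)
    n = len(GT)
    if init_idx is None:                           # step (2): random prototypes
        init_idx = rng.choice(n, size=q, replace=False)
    init_idx = np.asarray(init_idx)
    mu = GT[init_idx].copy()
    mu_labels = y[init_idx]
    for t in range(1, n_iter + 1):                 # step (3)
        eta = eta0 * (1 - t / n_iter)              # decaying learning rate
        r = rng.integers(n)                        # step (3.1)
        dists = [abs_dist(GT[r], mu[j]) for j in range(q)]  # step (3.2)
        j_star = int(np.argmin(dists))
        step = eta * (GT[r] - mu[j_star])
        # Step (3.3): attract on matching labels (Eq. 14), repel otherwise (Eq. 15).
        mu[j_star] += step if y[r] == mu_labels[j_star] else -step
    # Step (4): assign every sample to its nearest prototype.
    assign = np.array([int(np.argmin([abs_dist(G, mu[j]) for j in range(q)]))
                       for G in GT])
    return assign, mu_labels

# Toy data: two well-separated groups, already scaled to [0, 1].
V = np.array([[0.0, 0.1], [0.1, 0.0], [0.2, 0.2],
              [0.9, 1.0], [1.0, 0.8], [0.8, 0.9]])
y = np.array([0, 0, 0, 1, 1, 1])
assign, proto_labels = nglvq(V, y, q=2, delta=0.3, init_idx=[0, 5])
print(proto_labels[assign])
```

In the toy call, `init_idx` fixes one initial prototype per class. With purely random initialization, as in step (2), both prototypes may come from the same class, which is one way the sensitivity to initial values shows up in practice.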

Fig. 1

NGLVQ algorithm flow chart.

5 Experiments

In the experiments, the NGLVQ algorithm is used to cluster seven UCI datasets, and the dataset information is shown in Table 2.

Table 2
Seven UCI datasets

Datasets	Sample	Feature	Category
Iris	150	4	3
Haberman	306	3	2
Wine	178	13	3
Seeds	210	7	3
Pima-indians-diabetes	768	8	2
WDBC	569	30	2
CMC	1473	9	3

Because features differ in their value ranges, the data need to be pre-processed before clustering. For example, the numerical range of one feature may be [100, 1000], while that of another may be [-0.1, 0.1]. In distance calculation, such a large difference in numerical values distorts the results: features with large numerical values play a decisive role, while those with small numerical values may be ignored. To eliminate the influence of unit and scale differences between features, the features need to be normalized. In this paper, min-max normalization is adopted to transform the range of each feature into [0, 1], and its formula is as follows: $X_{norm} = \frac{X - X_{\min}}{X_{\max} - X_{\min}} .$ (16)
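A minimal NumPy sketch of Eq. (16), applied column by column, is given below. The handling of constant features (where X_max = X_min would cause a division by zero) is an assumed convention that the text does not address.

```python
import numpy as np

def min_max_normalize(X):
    """Scale each feature (column) of X into [0, 1] following Eq. (16)."""
    X = np.asarray(X, dtype=float)
    x_min = X.min(axis=0)
    span = X.max(axis=0) - x_min
    # Assumed convention: a constant feature is mapped to 0 instead of dividing by zero.
    span[span == 0] = 1.0
    return (X - x_min) / span

# Two features on very different scales, as in the example ranges above.
X = np.array([[100.0, -0.1],
              [550.0, 0.0],
              [1000.0, 0.1]])
print(min_max_normalize(X))  # rows become [0, 0], [0.5, 0.5], [1, 1]
```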

After the normalization of the data, neighborhood granulation is performed on these data to construct neighborhood granular vectors. The relative distance and absolute distance between neighborhood granular vectors are used for distance calculation. The experiment compares our clustering method based on the relative and absolute distances with the traditional LVQ clustering algorithm to verify the actual effects.

5.1 Data visualization

t-SNE was used for dimensionality reduction during the experiment, and the features of each sample were reduced to 2 dimensions for visual comparison. In the experiment, the original data and the clusterings produced by relative distance based NGLVQ, absolute distance based NGLVQ, and the traditional LVQ algorithm are compared. The data visualization is shown in Figs. 2–8.

Fig. 2

Iris data cluster results.

Fig. 3

Haberman data cluster results.

Fig. 4

Wine data cluster results.

Fig. 5

Seeds data cluster results.

Fig. 6

Pima-indians-diabetes data cluster results.

Fig. 7

WDBC data cluster results.

Fig. 8

CMC data cluster results.

For the Iris data set, the clusterings produced by NGLVQ based on relative distance and NGLVQ based on absolute distance are similar to the original data, while the traditional LVQ clustering is poor. For the Haberman and CMC data sets, there are significant differences between the results of NGLVQ, the traditional LVQ clustering algorithm, and the original data. For the Wine data set, NGLVQ with either distance roughly restores the original data, while the traditional LVQ clustering is poor. For the Seeds data set, NGLVQ based on absolute distance gives the better result. For the Pima-Indians-Diabetes and WDBC data sets, the cluster results of the NGLVQ algorithm are more consistent with the original data than those of the traditional LVQ algorithm; although traditional LVQ also divides the data into two categories, there is a large gap between its result and the original data.

5.2 Influence of neighborhood parameters

This section analyzes the parameters of the neighborhood granulation and studies the influence of neighborhood parameters by the experimental results. The original data of the experiment are labeled, and the classification accuracy is obtained by comparing the original data labels with the clustering labels. In the experiment, neighborhood parameters ranging from 0 to 1 were used to carry out experiments. The learning rate was fixed at 0.2, and the number of iterations was 200. The experimental results of different data sets are shown in Figs. 9–15.
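The text does not state how cluster ids are matched to class labels when accuracy is computed. The sketch below shows one common choice, labeled here as an assumption: take the best accuracy over all one-to-one relabelings of the cluster ids. It enumerates permutations, so it is only practical for the small q used in these experiments, and it assumes there are no more clusters than classes.

```python
from itertools import permutations
import numpy as np

def clustering_accuracy(y_true, y_pred):
    """Best accuracy over one-to-one relabelings of cluster ids (assumed matching rule)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(y_true)
    clusters = np.unique(y_pred)
    best = 0.0
    # Try every injective assignment of class labels to cluster ids.
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        mapped = np.array([mapping[c] for c in y_pred])
        best = max(best, float(np.mean(mapped == y_true)))
    return best

print(clustering_accuracy([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
print(clustering_accuracy([0, 0, 1, 1], [0, 1, 1, 1]))  # 0.75
```

For NGLVQ itself, each cluster inherits the label of its prototype, so a direct comparison of those labels with the original labels is also possible; the matching step matters mainly for unlabeled baselines.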

Fig. 9

Classification accuracy of different neighborhood parameters in Iris dataset.

Fig. 10

Classification accuracy of different neighborhood parameters in Haberman dataset.

Fig. 11

Classification accuracy of different neighborhood parameters in Wine dataset.

Fig. 12

Classification accuracy of different neighborhood parameters in Seeds dataset.

Fig. 13

Classification accuracy of different neighborhood parameters in Pima-indians-diabetes dataset.

Fig. 14

Classification accuracy of different neighborhood parameters in WDBC dataset.

Fig. 15

Classification accuracy of different neighborhood parameters in CMC dataset.

In the experiment of the Iris dataset, when the neighborhood parameter is 0.2, the classification accuracy of NGLVQ based on absolute distance reaches the maximum of 0.9733, and when the neighborhood parameter is 0.6, the classification accuracy of NGLVQ based on relative distance reaches the maximum of 0.9667. When the neighborhood parameters are 0.2 and 0.6, the NGLVQ performs better than the traditional LVQ.

In the experiment of the Haberman dataset, when the neighborhood parameter is 0 to 0.4, the classification accuracy of NGLVQ based on relative distance is slightly higher than that of traditional LVQ. When the neighborhood parameter is 0.15, the classification accuracy of NGLVQ based on relative distance reaches the maximum of 0.7614.

In the experiment of the Wine dataset, when the neighborhood parameter is in [0.05, 0.75], the classification accuracy of NGLVQ based on relative distance is higher than that of the traditional LVQ, with a maximum value of 0.8933. The classification accuracy of NGLVQ based on absolute distance reaches the maximum of 0.8989 when the neighborhood parameter is 0.3, and its clustering is better than that of traditional LVQ when the neighborhood parameter is in [0.15, 0.7].

In the experiment of the Seeds dataset, when the neighborhood parameter is 0.35, the classification accuracy of NGLVQ based on absolute distance is slightly higher than that of traditional LVQ. The classification accuracy under other neighborhood parameter values is lower than that of traditional LVQ.

In the experiment of the Pima-Indians-Diabetes data set, when the neighborhood parameter is 0.35, the classification accuracy of NGLVQ based on absolute distance reaches the maximum of 0.7227, and when the neighborhood parameter is 0.3, the classification accuracy of NGLVQ based on relative distance reaches the maximum of 0.7148. NGLVQ is superior to traditional LVQ in most neighborhood parameters.

In the experiment of the WDBC data set, when the neighborhood parameter is 0.35, the classification accuracy of NGLVQ based on absolute distance reaches the maximum of 0.9139, and when the neighborhood parameter is 0.3, the classification accuracy of NGLVQ based on relative distance reaches the maximum of 0.891. NGLVQ based on absolute distance is superior to traditional LVQ when the neighborhood parameter is in [0.1, 0.4] (except 0.3).

In the experiment of the CMC dataset, when the neighborhood parameter is 0.3, the classification accuracy of NGLVQ based on absolute distance is slightly higher than that of traditional LVQ. When the neighborhood parameter is 0.15, the classification accuracy of NGLVQ based on relative distance is slightly higher than that of traditional LVQ. The classification accuracy under other neighborhood parameter values is lower than that of traditional LVQ.

As can be seen from Figs. 9–15, both the data set and the neighborhood parameter have a great influence on the classification accuracy. For each experimental data set, an appropriate neighborhood parameter can be found that makes the classification accuracy of NGLVQ superior to that of traditional LVQ. The maximum accuracy of NGLVQ based on absolute distance is generally greater than that of NGLVQ based on relative distance.

5.3 Comparisons of clustering algorithms

The NGLVQ clustering algorithm based on relative distance and the NGLVQ clustering algorithm based on absolute distance are compared with the traditional LVQ clustering algorithm, the Agglomerative clustering algorithm, the Gaussian mixture algorithm, and K-means in terms of accuracy, ARI, and NMI on the 7 data sets. The value range of ARI (Adjusted Rand Index) is [-1, 1], and the larger the value is, the more consistent the clustering result is with the real situation. NMI (Normalized Mutual Information) is commonly used in clustering to measure the similarity of two clustering results. The value range of NMI is [0, 1]. The higher the value, the more accurate the partitioning.
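For reference, ARI can be computed from the contingency table of the two labelings using the standard pair-counting formula. The sketch below uses only the Python standard library; it is an illustration, and whether the reported scores were produced this way or by a library routine is not stated in the text. Returning 1.0 in the degenerate case is an assumed convention.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index from pair counts of the contingency table."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # assumed convention: nothing to compare
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)   # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # assumed convention for degenerate partitions
    return (sum_cells - expected) / (max_index - expected)

print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # about -0.5
```

The first call shows that ARI is invariant to renaming the clusters; the second shows that a partition that splits every true class evenly scores below zero, that is, worse than chance agreement.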

According to Tables 3–5, under the three evaluation indexes, both NGLVQ variants score higher than the other four algorithms on the Haberman, Pima-Indians-Diabetes, and CMC data sets. On the Iris data set, the NGLVQ algorithm based on absolute distance scores better than the other five algorithms on all three indexes; on the Seeds data set it is best on ARI and NMI, while traditional LVQ has a slightly higher accuracy (0.9048 versus 0.8972). On the Wine data set, the NGLVQ algorithm scores higher than the LVQ clustering algorithm, but both NGLVQ variants score lower than Agglomerative Cluster, Gaussian Mixture, and K-means. On the WDBC data set, the NGLVQ algorithm also scores higher than the LVQ clustering algorithm, but both variants score lower than Gaussian Mixture, and the relative-distance variant also scores lower than K-means. These experiments show that the NGLVQ algorithm has better clustering performance on data sets with a small number of features, where most of its results are better than those of the other four algorithms. The time complexity of the NGLVQ algorithm is O(N*Q), the granulation time complexity is O(m*n²), and the time complexity of a single cluster iteration is O(N), where N is the number of iterations, Q is the number of clusters, m is the number of features, and n is the number of samples. Meanwhile, for the NGLVQ algorithm, the absolute distance gives higher scores than the relative distance on most data sets (five of the seven by accuracy and by ARI).

Table 3
Comparison of accuracy of various data clustering algorithms

Datasets	NGLVQ (absolute distance)	NGLVQ (relative distance)	LVQ	Agglomerative Cluster	Gaussian Mixture	K-means
Iris	0.9680	0.9627	0.8933	0.8867	0.9667	0.8867
Haberman	0.7575	0.7582	0.7353	0.7353	0.7353	0.7353
Wine	0.9011	0.9213	0.6854	0.9775	0.9607	0.9438
Seeds	0.8972	0.8867	0.9048	0.8714	0.8429	0.8905
Pima-indians-diabetes	0.7318	0.7206	0.6888	0.6510	0.6510	0.6680
WDBC	0.9300	0.9248	0.8682	0.8682	0.9420	0.9279
CMC	0.4647	0.4544	0.4270	0.4297	0.4270	0.4345
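As a reading aid for Table 3, the sketch below shows one common way to compute clustering accuracy: cluster indices are matched to class labels by the one-to-one mapping that maximizes the number of agreements. Whether the experiments use exactly this convention is an assumption, and the function name `clustering_accuracy` is hypothetical. The brute-force search over mappings is adequate only for a small number of clusters, such as the two or three categories of the data sets used here.

```python
from itertools import permutations


def clustering_accuracy(true_labels, cluster_ids):
    """Accuracy under the best one-to-one mapping of clusters to classes."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    if len(clusters) > len(classes):
        raise ValueError("sketch assumes no more clusters than classes")
    best = 0
    # Try every assignment of distinct classes to the clusters.
    for assigned in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, assigned))
        hits = sum(mapping[c] == t for t, c in zip(true_labels, cluster_ids))
        best = max(best, hits)
    return best / len(true_labels)


truth = ["a", "a", "a", "b", "b", "c"]
found = [2, 2, 0, 0, 0, 1]
print(round(clustering_accuracy(truth, found), 4))  # 0.8333 (5 of 6 match)
```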

Table 4

Comparison of ARI of various data clustering algorithms

Datasets	NGLVQ (absolute distance)	NGLVQ (relative distance)	LVQ	Agglomerative Cluster	Gaussian Mixture	K-means
Iris	0.9076	0.8932	0.7323	0.7196	0.9039	0.7163
Haberman	0.1738	0.1868	-0.0233	-0.0021	0.1040	-0.0040
Wine	0.7208	0.7726	0.4312	0.9310	0.8975	0.8368
Seeds	0.7635	0.6994	0.7369	0.6752	0.6115	0.7049
Pima-indians-diabetes	0.1939	0.1759	0.0854	0.0727	0.0707	0.1024
WDBC	0.7385	0.7203	0.5338	0.5383	0.7802	0.7302
CMC	0.0495	0.0413	0.0058	0.0145	0.0058	0.0221
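The negative entries in Table 4 are possible because the adjusted Rand index (ARI) is corrected for chance: a partition that agrees with the true classes less often than a random one scores below zero. The following sketch computes the standard ARI from the contingency counts of the two partitions. It is given for illustration only, and the function name `adjusted_rand_index` is hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, cluster_ids):
    """Standard ARI of two partitions of the same samples (needs n >= 2)."""
    n = len(true_labels)
    cells = Counter(zip(true_labels, cluster_ids))   # contingency table
    rows = Counter(true_labels)                      # class sizes
    cols = Counter(cluster_ids)                      # cluster sizes
    index = sum(comb(v, 2) for v in cells.values())
    sum_rows = sum(comb(v, 2) for v in rows.values())
    sum_cols = sum(comb(v, 2) for v in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)      # chance agreement
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:                          # both partitions trivial
        return 1.0
    return (index - expected) / (maximum - expected)


truth = [0, 0, 1, 1]
print(round(adjusted_rand_index(truth, [1, 1, 0, 0]), 4))  # 1.0
print(round(adjusted_rand_index(truth, [0, 1, 0, 1]), 4))  # -0.5
```

The first call shows that ARI ignores how the clusters are numbered; the second shows a partition that splits every class and therefore scores below zero.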

Table 5

Comparison of NMI of various data clustering algorithms

Datasets	NGLVQ (absolute distance)	NGLVQ (relative distance)	LVQ	Agglomerative Cluster	Gaussian Mixture	K-means
Iris	0.9008	0.8723	0.7907	0.7837	0.8997	0.7419
Haberman	0.0776	0.0853	0.0181	0.0001	0.0704	0.0007
Wine	0.7238	0.7628	0.4302	0.9086	0.8770	0.8417
Seeds	0.7191	0.6799	0.7082	0.6900	0.6424	0.6742
Pima-indians-diabetes	0.1181	0.1031	0.0571	0.0357	0.0541	0.0517
WDBC	0.6478	0.6140	0.4980	0.4170	0.6680	0.6231
CMC	0.0481	0.0545	0.0125	0.0091	0.0125	0.0276
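For Table 5, the sketch below computes normalized mutual information (NMI) as the mutual information of the two partitions divided by the arithmetic mean of their entropies. Several normalizations are in use (arithmetic mean, geometric mean, minimum, maximum), and which one underlies the reported scores is not stated here, so the arithmetic mean is an assumption. The function name `normalized_mutual_info` is hypothetical.

```python
from collections import Counter
from math import log


def normalized_mutual_info(true_labels, cluster_ids):
    """NMI with arithmetic-mean normalization (assumed convention)."""
    n = len(true_labels)
    cells = Counter(zip(true_labels, cluster_ids))   # joint counts
    rows = Counter(true_labels)                      # class sizes
    cols = Counter(cluster_ids)                      # cluster sizes

    def entropy(counts):
        return -sum(c / n * log(c / n) for c in counts.values())

    mutual = sum(
        c / n * log(c * n / (rows[t] * cols[k]))
        for (t, k), c in cells.items()
    )
    mean_entropy = (entropy(rows) + entropy(cols)) / 2
    if mean_entropy == 0:                            # both partitions trivial
        return 1.0
    return mutual / mean_entropy


truth = [0, 0, 1, 1]
print(round(normalized_mutual_info(truth, [1, 1, 0, 0]), 4))  # 1.0
print(round(normalized_mutual_info(truth, [0, 1, 0, 1]), 4))  # 0.0
```

Unlike ARI, NMI is not corrected for chance and is never negative, which is why the weakest results in Table 5 are close to zero.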

On data sets with a large number of features, such as Wine and WDBC, the clustering performance of the NGLVQ algorithm is inferior to that of some of the compared algorithms: Agglomerative Cluster, Gaussian Mixture, and K-means on Wine, and Gaussian Mixture and, for relative distance, K-means on WDBC. However, its evaluation index scores are not much lower than theirs. Unlike the other clustering algorithms, the NGLVQ algorithm uses neighborhood granulation to change the structure of the data, so that the data are preprocessed before clustering begins. This improves the convergence speed and clustering performance of the algorithm, which therefore performs well on different types of data sets.

6 Conclusions

In this paper, the LVQ clustering algorithm is improved by neighborhood granulation. Granules and granular vectors are constructed from the data set, and the size and operations of neighborhood granular vectors are defined. Relative and absolute distance measures between neighborhood granular vectors are proposed and used to design the NGLVQ clustering algorithm. The experimental results show that the NGLVQ clustering algorithm clusters samples successfully and, with appropriate neighborhood parameters, achieves better clustering results and lower prediction errors than the traditional LVQ clustering algorithm. Compared with the other clustering algorithms, the NGLVQ clustering algorithm performs better on data sets with a small number of features. In future work, we will study new granulation methods to improve performance and apply them to other research fields. We will also study local granulation and granular measurement to improve the speed and performance of granular computing.

Acknowledgements

This research was supported by the National Natural Science Foundation of China under Grant 61976183 and by the Major Project of Industry-university-research Innovation Fund of Chinese Universities at New Generation Information Technology Innovation of China (2019ITA01011).

References

1. Zadeh L.A., Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic, Fuzzy Sets Syst 90(2) (1997), 111–127.
2. Pawlak, Rough sets, Int. J. Comput. Inf. Sci 11(1) (1982), 341–356.
3. Wang G.Y., Zhang Q.H. and Hu, An overview of granular computing, CAAI T Intell Syst 2(6) (2007), 8–26.
4. Lin T.Y., Granular computing on binary relations I: data mining and neighborhood systems, Rough Sets in Knowledge Discovery (1998), 165–166.
5. Kang X.P. and Miao D.Q., A study on information granularity in formal concept analysis based on concept-bases, Knowl.-Based Syst 105 (2016), 147–159.
6. Liu and Liu, Granules and applications of granular computing in logical reasoning, Journal of Computer Research and Development 4 (2004), 546–551.
7. Chen Y.M. and Zhang, Granule vectors and granular convolutional classifiers, IEEE Access 8 (2020), 2042–2051.
8. Chen L.S. and Zhao, A granular computing based classification method from algebraic granule structure, IEEE Access 9 (2021), 68118–68126.
9. Miao D.Q. and Wang, On the relationships between information entropy and roughness of knowledge in rough set theory, Pattern Recognition and Artificial Intelligence 1 (1998), 34–40.
10. Miao D.Q. and Wang, An information representation of concepts and operations in rough set theory, Journal of Software 10(2) (1999), 113–116.
11. Hu Q.H., Yu D.R. and Xie Z.X., Numerical attribute reduction based on neighborhood granulation and rough approximation, Journal of Software 19(3) (2008), 640–649.
12. Duan, Hu Q.H., Zhang L.J. and Qian Y.H., Feature selection for multi-label classification based on neighborhood rough sets, Journal of Computer Research and Development 52(1) (2015), 56–65.
13. Yao Y.Y., Information granulation and rough set approximation, Int. J. Intell. Syst. 16(1) (2001), 87–104.
14. Yao Y.Y., Zhang, Miao D.Q. and Xu F.F., Set-theoretic approaches to granular computing, Fundamenta Informaticae 115(2–3) (2012), 247–264.
15. Zhu and Hu Q.H., Adaptive neighborhood granularity selection and combination based on margin distribution optimization, Inf. Sci. 249 (2013), 1–12.
16. Wang G.Y., Zhang Q.H., Ma X.A. and Yang Q.S., Granular computing models for knowledge uncertainty, Journal of Software 22(4) (2011), 676–694.
17. Keyvan and Ebrahim, From clustering to clustering ensemble selection: A review, Eng Appl Artif Intel 104 (2021).
18. Mostafa S.M., Clustering algorithms: taxonomy, comparison, and empirical analysis in 2D datasets, Journal on Artificial Intelligence 2(4) (2020), 189–215.
19. Mostafa S.M., Towards improving machine learning algorithms accuracy by benefiting from similarities between cases, Journal of Intelligent & Fuzzy Systems 40(1) (2021), 947–972.
20. Algarni, Ragab, Alamri and Mostafa S.M., Towards improving predictive statistical learning model accuracy by enhancing learning technique, Computer Systems Science & Engineering 42(1) (2022), 303–318.
21. McNicholas P.D., Model-based clustering, Journal of Classification 33(3) (2016), 331–373.
22. David and Pablo, A review of learning vector quantization classifiers, Neural Computing and Applications 25 (2014), 511–524.
23. Tapan and Wang D.H., An improved LVQ algorithm with data-structure preserving visualization, Int J Innov Comput I 8(10A) (2012), 6959–6974.
24. Cruz-Vega and Escalante H.J., An online and incremental GRLVQ algorithm for prototype generation based on granular computing, Soft Computing 21(14) (2017), 3931–3944.
25. Shen Y.Y., Zhang Y.M., Zhang X.Y. and Liu C.L., Online semi-supervised learning with learning vector quantization, Neurocomputing 399 (2020), 467–478.
26. Jatmiko, Sunandar, Alvissalim M.S., Tawakal M.I. and Fukuda, Development of Adaptive Fuzzy-Neuro Generalized Learning-Vector Quantization Using PI Membership Function (AFNGLVQ-PI), IEEE Access 9 (2021), 47452–47480.
27. Kuo R.J., Lin, Zulvia F.E. and Lin C.C., Integration of cluster analysis and granular computing for imbalanced data classification: A case study on prostate cancer prognosis in Taiwan, Journal of Intelligent & Fuzzy Systems 32(3) (2017), 2251–2267.
28. Yao Y.Y., Relational interpretations of neighborhood operators and rough set approximation operators, Information Sciences 111(1) (1998), 239–259.
29. Hu Q.H., Yu D.R. and Xie Z.X., Neighborhood classifiers, Expert Systems With Applications 34(2) (2008), 866–876.
30. Chen Yumin, Qin Nan, Li Wei and Xu Feifei, Granule structures, distances and measures in neighborhood systems, Knowledge-Based Systems 165 (2019), 268–281.

Table 2

Seven UCI datasets

Datasets	Samples	Features	Categories
Iris	150	4	3
Haberman	306	3	2
Wine	178	13	3
Seeds	210	7	3
Pima-indians-diabetes	768	8	2
WDBC	569	30	2
CMC	1473	9	3