MD-SPKM: A set pair k-modes clustering algorithm for incomplete categorical matrix data

Abstract

To cluster incomplete categorical matrix data sets while accounting for the uncertain relationship between samples and clusters, a set pair k-modes clustering algorithm (MD-SPKM) is proposed. First, the theory of set pair information granules is introduced into k-modes clustering: by modifying the distance formula of the traditional k-modes algorithm, a set pair distance between incomplete matrix samples is defined. Second, to capture the uncertain relationship between a sample and a cluster, the intra-cluster average distance and a threshold formula for deciding whether a sample belongs to multiple clusters are defined; the resulting set pair clustering consists of a positive region, a boundary region and a negative region. Finally, experiments on three data sets against four comparison algorithms show that the set pair k-modes clustering algorithm handles incomplete categorical matrix data sets effectively and performs well in terms of Accuracy, Recall, ARI and NMI.

Keywords

Incomplete categorical matrix data, set pair information granule, k-modes, set pair distance, set pair k-modes

1. Introduction

Clustering is a basic research field and plays an important role in data analysis. Among prototype-based clustering algorithms, the k-means algorithm [1, 2] is very effective for processing large data sets, but it is only suitable for numerical data and cannot process categorical data. Huang therefore proposed the k-modes algorithm [3], which solves the clustering problem for categorical data by computing the distance between two samples through simple matching of their categorical values. Since then, scholars have improved the k-modes algorithm in many ways, with better initialization techniques or more efficient distance measures [4, 5, 6, 7], but these methods deal only with samples described by a single feature vector. Real databases also contain matrix samples described by multiple feature vectors [8]. In users' shopping records, for example, each user is a sample, but each user buys more than one item, and the type and quantity of the goods purchased differ, so categorical matrix samples arise. The traditional simple matching distance cannot be applied directly to such data sets. Moreover, owing to data omission, incompleteness, read restrictions and similar phenomena, a large number of values may be missing. Shopping records, for example, may have missing entries because of faulty data storage procedures or for the purpose of protecting user privacy. It is therefore necessary to study the clustering problem for incomplete matrix data sets.

The result of a traditional clustering algorithm is represented by sets with clear boundaries, which can express only two relations between samples and clusters: a sample either belongs to a cluster or does not. In many applications, however, there are three relations between samples and clusters: certainly belongs, may belong and does not belong. Because the information collected so far is limited, the relationship between some samples and clusters cannot be judged accurately, which gives rise to the third relation, may belong. For this reason, Yu introduced the idea of three-way decisions [9, 10] into traditional clustering and proposed a framework for three-way clustering [11]. Each cluster is represented by two sets, a typical sample set and a fringe sample set. However, most clustering algorithms that consider the three relations between samples and clusters are designed for numerical data, and there is less research on categorical data.

In the study of categorical data clustering, traditional distance measures can generally use only one feature vector to represent matrix data that are described by multiple feature vectors. This loses a large amount of raw information, so the data cannot be considered comprehensively. In addition, real data often contain missing values, which makes them incomplete matrix data. To solve the clustering problem for incomplete categorical matrix data sets and to account for the uncertain relationship between samples and clusters, a set pair k-modes clustering algorithm (MD-SPKM) for incomplete categorical matrix data is proposed. The main contributions are as follows. (1) To handle incomplete and uncertain matrix data sets, a set pair distance between incomplete matrix samples is proposed based on the theory of set pair information granules [12]. (2) To account for the uncertain relationship between samples and clusters, the intra-cluster average distance is defined to distinguish the samples that certainly belong to a cluster from those that may belong to it. In addition, because a sample may be related to multiple clusters, a threshold formula is given for determining whether a sample belongs to multiple clusters. (3) A set pair k-modes clustering algorithm for incomplete categorical matrix data is proposed, which produces a set pair clustering result composed of a positive region, a boundary region and a negative region. A sample in the positive region belongs to the cluster, a sample in the boundary region may belong to the cluster, and a sample in the negative region does not belong to the cluster. (4) Experimental comparison with four representative existing algorithms shows that the proposed algorithm deals effectively with incomplete categorical matrix data sets.

2. Related works

In clustering algorithms, measuring the distance between samples is a key step. For categorical data, scholars have proposed a variety of distance measures. For example, Huang et al. [13] propose a distance measure based on interdependence redundancy, which considers not only the difference in attribute values between two samples but also the interaction between attributes. Zhou et al. [14] propose a global relationship difference measure for the k-modes clustering algorithm, which considers the relationship between samples and all cluster modes as well as the differences among attributes, rather than relying on simple matching. These distance measures obtain good clustering results, but they are aimed at samples represented by a single feature vector. When they are applied to matrix samples with multiple sets of attribute features, only one set of features can be selected for the distance calculation, so not all of the data can be considered. Cao et al. therefore proposed the Set-Valued k-modes algorithm [15] and the fuzzy SV-k-modes algorithm [16] for processing set-valued samples; they defined a distance function between two samples with set-valued features and proposed a set-valued mode representation of cluster centers. Li et al. proposed an MD fuzzy k-modes algorithm for categorical matrix data [17] and presented a heuristic algorithm for updating cluster centers. Although these algorithms solve the clustering problem for matrix data, they are mainly aimed at complete data sets.

Missing values in a sample are difficult to treat properly: because a missing attribute value is itself uncertain, both deleting it and filling it in introduce errors. Set pair information granule theory combines set pair analysis with existing granular computing methods so that both the certainty and the uncertainty of the research object are considered in granular computation. This paper therefore introduces set pair information granules into cluster analysis to study the clustering problem for incomplete matrix data sets.

Because three relations exist between samples and clusters, scholars have proposed a variety of three-way clustering algorithms to characterize them better. For example, Wang et al. [18] proposed a three-way clustering framework based on the ideas of contraction and expansion in mathematical morphology. Zhang [19] extended rough k-means (RKM) with three-way weights and a three-way assignment method to obtain a three-way c-means clustering algorithm. However, these algorithms target numerical data, and there are few studies on categorical data. In addition, no existing work addresses the uncertain relationship between samples and clusters for incomplete categorical matrix data. Based on set pair information granule theory, this paper proposes a set pair k-modes clustering algorithm for incomplete categorical matrix data that can represent the relationship between samples and clusters.

3. Basic theory

3.1 Incomplete information system

An information system $S$ is expressed by the quadruple $S=(U,A,V,f)$, where $U=\left\{X_{1},X_{2},\ldots,X_{i},\ldots,X_{n}\right\}$ is a non-empty finite set of samples and $n$ is the number of samples; $A=\left\{A_{1},A_{2},\ldots,A_{m}\right\}$ is a non-empty finite attribute set and $m$ is the number of attributes; $V=\left\{V_{1},V_{2},\ldots,V_{m}\right\}$ is the set of value ranges of the attributes in $A$, and $V_{s}$ is the value range of attribute $A_{s}$; $f$ is an information function, $f:v_{is}=f\left(X_{i},A_{s}\right)\in V_{s}$, which indicates that the value of the sample $X_{i}$ on the attribute $A_{s}$ is $v_{is}$.

$X_{i}$ is the $i$-th sample in $U$ and has $\left|A\right|=m$ attribute values. When some attribute values $v_{is}\left({1\leqslant s\leqslant m}\right)$ of the sample $X_{i}$ are unknown, the information system $S$ becomes an incomplete information system.

Definition 1. (Incomplete information system for matrix samples): A matrix sample information system is expressed by the quadruple $S=(U,A,V,f)$, where $U=\left\{{X_{1},X_{2},\ldots,X_{i},\ldots,X_{n}}\right\}$ is a non-empty finite set of samples and $n$ is the number of matrix samples in $U$; $A=\left\{A_{1},A_{2},\ldots,A_{m}\right\}$ is a non-empty finite set of attributes that describes a matrix sample and $m$ is the number of attributes; $V=\left\{V_{1},V_{2},\ldots,V_{m}\right\}$ is the set of value ranges of the attributes in $A$, $V_{s}$ is the value range of attribute $A_{s}$, $V_{X_{i}}^{A_{s}}$ is the set of recorded values of the matrix sample $X_{i}$ under attribute $A_{s}$, and $\cup_{i=1}^{n}V_{X_{i}}^{A_{s}}=V_{s}$, $\cup_{s=1}^{m}V_{s}=V$; $f$ is an information function, $f:V_{X_{i}}^{A_{s}}=f\left({X_{i},A_{s}}\right)\in V_{s}$.

Here $X_{i}=\left\{{X_{i1},X_{i2},\ldots,X_{im}}\right\}$ and $X_{is}=\left[{v_{i1s},v_{i2s},\ldots,v_{ir_{i}s}}\right]^{\prime}$, where $r_{i}$ is the number of records of the matrix sample $X_{i}$ and $v_{ijs}$ is the $j$-th record value of the sample $X_{i}$ under the attribute $A_{s}$. $X_{i}$ is the $i$-th matrix sample in $U$ and has $m\times r_{i}$ attribute values. When the attribute values of some matrix samples are unknown, the information system $S$ is an incomplete information system of matrix samples, which is the research object of this paper.
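One possible in-memory representation of a matrix sample from Definition 1 is sketched below. It is only an illustration: the type alias `MatrixSample`, the helper names and the use of `None` for a missing record value are assumptions made here, not notation from the paper, and the sample values are illustrative.

```python
from typing import Dict, List, Optional

# Hypothetical representation: a matrix sample maps each attribute name A_s to
# its column X_is of record values; None marks a missing (unknown) value.
MatrixSample = Dict[str, List[Optional[str]]]


def record_count(sample: MatrixSample) -> int:
    """Return r_i, the number of records (rows) of the matrix sample."""
    lengths = {len(column) for column in sample.values()}
    if len(lengths) != 1:
        raise ValueError("all attribute columns must have the same length")
    return lengths.pop()


def is_incomplete(sample: MatrixSample) -> bool:
    """Return True if at least one record value of the sample is missing."""
    return any(value is None for column in sample.values() for value in column)


# Illustrative sample with m = 3 attributes and r_i = 5 records.
x1: MatrixSample = {
    "A1": ["1", "2", "3", "4", "2"],
    "A2": ["0", "1", "1", "0", "1"],
    "A3": ["a", "b", "b", None, "c"],
}
```

With this layout, a sample holds $m\times r_{i}$ attribute values, and a data set is incomplete as soon as `is_incomplete` holds for any of its samples.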

3.2 Set pair information granule

Set pair analysis [20] is a theory for studying systems that involve both certainty and uncertainty. By analyzing the characteristics of two things, it expresses their positive, difference and negative relations through the connection degree $\rho=a+bi+cj$, where $a$ denotes the positive degree, $b$ the difference degree and $c$ the negative degree. Here $i\in[-1,1]$ and $j=-1$ are the marker symbols of the difference degree and the negative degree, respectively, and are used to identify the direction and uncertainty of the classification information.

Definition 2. [12] Let the problem space be $W=\left({U,A,V}\right)$ and let $W_{0}=\left({U_{0},A_{0},V_{0}}\right)$, $A_{0}\subseteq A$, $V_{0}\subseteq V$, $R\subseteq A_{0}$. Define a pair of subsets on $W$: the certain information granule set $X^{C}=\left\{{X_{1}^{C},X_{2}^{C},\ldots,X_{m}^{C}}\right\}$ and the uncertain information granule set (difference granule set) $X^{U}=\left\{{X_{1}^{U},X_{2}^{U},\ldots,X_{n}^{U}}\right\}$. Then for information $x\in W_{0}$, there is a pair of maps:

$\displaystyle\tau X^{C}:W_{0}\to\left[{0,1}\right],x\to\tau X^{C}\left(x\right)=a_{R}+c_{R}$

$\displaystyle\tau X^{U}:W_{0}\to\left[{0,1}\right],x\to\tau X^{U}\left(x\right)=b_{R}$

Here $a_{R}+c_{R}$ and $b_{R}$ are called the degree of certainty of $x$ with respect to $X^{C}$ and the degree of uncertainty of $x$ with respect to $X^{U}$, respectively. $X_{i}^{C}\in X^{C}\left({1\leqslant i\leqslant m}\right)$ and $X_{j}^{U}\in X^{U}\left({1\leqslant j\leqslant n}\right)$ are called certain and uncertain information granules, respectively.

Definition 3. [12] Based on the certain information granule set $X^{C}$ and $R\subseteq A_{0}$, define a pair of subsets on $X^{C}$: the positive information granule set $X^{C_{s}}=\left\{{X_{1}^{C_{s}},X_{2}^{C_{s}},\ldots,X_{k}^{C_{s}}}\right\}$ and the negative information granule set $X^{C_{o}}=\left\{{X_{1}^{C_{o}},X_{2}^{C_{o}},\ldots,X_{l}^{C_{o}}}\right\}$. Then for information $x\in X^{C}$, there is a pair of maps:

$\displaystyle\tau X^{C_{s}}:X^{C}\to\left[{0,1}\right],x\to\tau X^{C_{s}}\left(x\right)=a_{R}$

$\displaystyle\tau X^{C_{o}}:X^{C}\to\left[{0,1}\right],x\to\tau X^{C_{o}}\left(x\right)=c_{R}$

Here $a_{R}$ is called the positive degree of $x$ with respect to $X^{C_{s}}$, and $c_{R}$ is called the negative degree of $x$ with respect to $X^{C_{o}}$. $X_{i}^{C_{s}}\in X^{C_{s}}\left({1\leqslant i\leqslant k}\right)$ and $X_{j}^{C_{o}}\in X^{C_{o}}\left({1\leqslant j\leqslant l}\right)$ are called positive and negative information granules, respectively.

The information table is shown in Table 1. The problem space $W=\left({U,A,V}\right)$ , where, $U=\left\{{1,2,3,4,5,6,7,8}\right\}$ , $A=\left\{\textit{Height, Age, Eye}\right\}$ .

Table 1
A table of information

Object	Height	Age	Eye
1	Short	Youth	Blue
2	Short	Youth	Brown
3	Tall	Middle age	Blue
4	Tall	Old age	Blue
5	Tall	Old age	Dark
6	Middle	Youth	Blond
7	Middle	Old age	Dark
8	Middle	Youth	Blond

If only the height of each object is currently known, then $R=\left\{\textit{Height}\right\}$, and $U/R=\left\{{\left\{{1,2}\right\},\left\{{3,4,5}\right\},\left\{{6,7,8}\right\}}\right\}$ is obtained. Let $W_{0}=\left({U_{0},A_{0},V_{0}}\right)$ be a subspace of $W$, where $U_{0}=\left\{{1,2,7,8}\right\}$.

For $U_{0}=\left\{{1,2,7,8}\right\}$ and the equivalence relation $R=\left\{\textit{Height}\right\}$, the positive information granule set, negative information granule set and difference granule set are $X_{R}^{C_{s}}=\left\{{\left\{{1,2}\right\}}\right\}$, $X_{R}^{C_{o}}=\left\{{\left\{{3,4,5}\right\}}\right\}$ and $X_{R}^{U}=\left\{{\left\{{6,7,8}\right\}}\right\}$, respectively: the block $\left\{{1,2}\right\}$ lies entirely inside $U_{0}$, the block $\left\{{3,4,5}\right\}$ lies entirely outside it, and the block $\left\{{6,7,8}\right\}$ only partly overlaps it. Accordingly, the positive degree is $a_{R}=\frac{2}{8}$, the negative degree is $c_{R}=\frac{3}{8}$ and the difference degree is $b_{R}=\frac{3}{8}$.
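The Table 1 example can be reproduced with a short sketch. The classification rule used here (a block of $U/R$ is positive if it is contained in $U_{0}$, negative if it is disjoint from $U_{0}$, and a difference granule otherwise) is an interpretation read off the example rather than a rule stated explicitly in the text, and the function names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

# Height column of Table 1, keyed by object identifier.
height = {1: "Short", 2: "Short", 3: "Tall", 4: "Tall", 5: "Tall",
          6: "Middle", 7: "Middle", 8: "Middle"}


def partition(values):
    """Group objects with equal attribute value, i.e. compute U/R."""
    blocks = defaultdict(set)
    for obj, value in values.items():
        blocks[value].add(obj)
    return list(blocks.values())


def granule_sets(blocks, u0):
    """Split the blocks into positive, difference and negative granule sets."""
    positive = [b for b in blocks if b <= u0]            # entirely inside U_0
    negative = [b for b in blocks if not (b & u0)]       # entirely outside U_0
    difference = [b for b in blocks if (b & u0) and not (b <= u0)]
    return positive, difference, negative


def degrees(blocks, u0):
    """Return (a_R, b_R, c_R) as exact fractions of the number of objects."""
    total = sum(len(b) for b in blocks)
    positive, difference, negative = granule_sets(blocks, u0)
    count = lambda group: sum(len(b) for b in group)
    return (Fraction(count(positive), total),
            Fraction(count(difference), total),
            Fraction(count(negative), total))


a_r, b_r, c_r = degrees(partition(height), {1, 2, 7, 8})
```

Under this reading the three degrees always sum to 1, because every block of $U/R$ falls into exactly one of the three granule sets.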

Definition 4. [21] There are two connection numbers, $\rho_{1}=a_{1}+b_{1}i+c_{1}j$ , $\rho_{2}=a_{2}+b_{2}i+c_{2}j$ , and the sum of them is still a connection number: $\rho=\rho_{1}+\rho_{2}=a+bi+cj$ , where $a=\left({a_{1}+a_{2}}\right)$ , $b=\left({b_{1}+b_{2}}\right)$ , $c=\left({c_{1}+c_{2}}\right)$ .

Inference 1: There are $n$ connection numbers, $\rho_{1}=a_{1}+b_{1}i+c_{1}j$ , $\rho_{2}=a_{2}+b_{2}i+c_{2}j$ , $\ldots$ , $\rho_{n}=a_{n}+b_{n}i+c_{n}j$ , and the sum of $n$ connection numbers is still a connection number: $\rho=\rho_{1}+\rho_{2}+\ldots+\rho_{n}=a+bi+cj$ , where $a=\left({a_{1}+a_{2}+\ldots+a_{n}}\right)$ , $b=\left({b_{1}+b_{2}+\ldots+b_{n}}\right)$ , $c=\left({c_{1}+c_{2}+\ldots+c_{n}}\right)$ .
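Definition 4 and Inference 1 amount to component-wise addition, which can be sketched as follows. The class name `ConnectionNumber` and the helper `total` are hypothetical, and exact fractions are used only to keep the arithmetic easy to check.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class ConnectionNumber:
    """A connection number rho = a + b*i + c*j, stored by its coefficients."""
    a: Fraction
    b: Fraction
    c: Fraction

    def __add__(self, other: "ConnectionNumber") -> "ConnectionNumber":
        # Definition 4: the coefficients are added component by component.
        return ConnectionNumber(self.a + other.a, self.b + other.b, self.c + other.c)


def total(numbers):
    """Inference 1: the sum of n connection numbers is a connection number."""
    result = ConnectionNumber(Fraction(0), Fraction(0), Fraction(0))
    for rho in numbers:
        result = result + rho
    return result
```

Note that the coefficients of a sum need not add up to 1, which is why Eq. (2) later divides the sum by the number of attributes.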

3.3 Traditional k-modes algorithm

K-modes algorithm [3] is an extension of the k-means algorithm, and the extended k-modes has the following characteristics: (1) Using simple matching distance measures. (2) Using modes instead of means to update clustering centers. (3) Using frequency correlation strategies to update modes in clustering. These extensions exclude numerical limitations in k-means algorithms so that clustering can be applied to categorical data sets.

Let $U=\left\{{X_{1},X_{2},\ldots,X_{n}}\right\}$ be a set of categorical data, where each sample $X_{i}=\left\{{X_{i1},X_{i2},\ldots,X_{im}}\right\}$ $\left({1\leqslant i\leqslant n}\right)$ is described by $m$ attributes $A_{1},A_{2},\ldots,A_{m}$. $Q_{j}\left({1\leqslant j\leqslant k}\right)$ is the mode of cluster $C_{j}$, and each cluster $C_{j}$ consists of $n_{j}$ samples, $C_{j}=\left\{{X_{1},X_{2},\ldots,X_{n_{j}}}\right\}$. For each attribute, the mode $Q_{j}$ takes the value with the highest frequency in the cluster.

Given any two samples $X_{i}$ and $X_{j}$, the distance measure is [3]:

$\displaystyle d\left({X_{i},X_{j}}\right)=\sum\limits_{r=1}^{m}{\delta\left({X_{ir},X_{jr}}\right)}$

$\displaystyle\delta\left({X_{ir},X_{jr}}\right)=\left\{{\begin{array}{ll}1,&X_{ir}\neq X_{jr}\\ 0,&X_{ir}=X_{jr}\\ \end{array}}\right.$

The objective function of the k-modes clustering algorithm and its constraints are:

$\displaystyle F\left(W,Q\right)=\sum\limits_{j=1}^{k}{\sum\limits_{i=1}^{n}w_{ij}}d\left(X_{i},Q_{j}\right)$

$\displaystyle w_{ij}\in\left\{0,1\right\},1\leqslant j\leqslant k,1\leqslant i\leqslant n;\sum\limits_{j=1}^{k}{w_{ij}}=1,1\leqslant i\leqslant n;0<\sum\limits_{i=1}^{n}{w_{ij}}<n,1\leqslant j\leqslant k$

Here $W=\left[{w_{ij}}\right]_{n\times k}$, $w_{ij}=\left\{{\begin{array}{cl}1,&X_{i}\in C_{j}\\ 0,&X_{i}\notin C_{j}\\ \end{array}}\right.$, $k$ is the number of clusters, $w_{ij}=1$ means that the $i$-th sample is assigned to the $j$-th cluster, and $Q_{j}$ is the center of the $j$-th cluster.
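The simple matching distance and the frequency-based mode can be sketched as follows for samples described by a single feature vector. The function names are illustrative; when several values are equally frequent, this sketch keeps the one that appears first, a tie-breaking rule that the text does not specify.

```python
from collections import Counter


def simple_matching_distance(x, y):
    """Count the attributes on which two categorical samples differ."""
    if len(x) != len(y):
        raise ValueError("samples must have the same number of attributes")
    return sum(1 for x_r, y_r in zip(x, y) if x_r != y_r)


def cluster_mode(samples):
    """Return the mode Q_j: the most frequent value of each attribute."""
    columns = zip(*samples)  # one tuple of values per attribute
    return tuple(Counter(column).most_common(1)[0][0] for column in columns)


cluster = [
    ("Short", "Youth", "Blue"),
    ("Short", "Youth", "Brown"),
    ("Tall", "Youth", "Blue"),
]
mode = cluster_mode(cluster)
cost = sum(simple_matching_distance(x, mode) for x in cluster)
```

Here `cost` is the contribution of this one cluster to the objective function $F\left(W,Q\right)$.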

4. Set pair k-modes clustering algorithm

4.1 Measurement of set pair distance between matrix samples

The traditional k-modes distance measure applies only to samples described by a single set of attribute features, and it is limited for categorical matrix data described by multiple sets of attribute features. To measure the distance between incomplete matrix samples, a distance formula based on the set pair connection degree is proposed on the basis of set pair information granule theory. It extends the distance between samples in the original clustering process to three dimensions: positive degree, difference degree and negative degree.

Definition 5. (The set pair connection degree under a single attribute): Given matrix samples $X_{i}$ and $X_{j}$, each described by $m$ attributes $A_{1},A_{2},\ldots,A_{m}$, a set pair connection degree is established between $X_{i}$ and $X_{j}$ under the attribute $A_{s}$: identical attribute values form the positive information granules $S$, missing values form the difference information granules $Q$, and all other values form the negative information granules $F$. Specifically, set pair analysis is applied to the record values of the two samples under attribute $A_{s}$. If a record value exists in both samples, both records are assigned to the positive information granules. If a record value is missing, it is assigned to the difference information granules. The remaining record values are assigned to the negative information granules. The set pair connection degree under a single attribute $A_{s}$ is then:

$\displaystyle\rho_{ij}^{A_{s}}=\frac{S}{N}+\frac{Q}{N}i+\frac{F}{N}j$ (1)

Here $S$ is the number of identical record values, $Q$ is the number of missing record values, $F$ is the number of other record values, and $N$ is the total number of record values of the two samples under the attribute, so that $S+Q+F=N$. Note that when the data set is complete and has no missing values, the number of difference information granules is 0.

Equation (1) can also be written as:

$\displaystyle\rho_{ij}^{A_{s}}=a+bi+cj$

Where, $a=\frac{S}{N}$ , $b=\frac{Q}{N}$ , $c=\frac{F}{N}$ , $a+b+c=1$ .

Definition 6. (The set pair connection degree under multiple attributes): Let the set pair connection degrees of matrix samples $X_{i}$ and $X_{j}$ under the attributes $A_{1},A_{2},\ldots,A_{m}$ be $\rho_{ij}^{A_{1}},\rho_{ij}^{A_{2}},\ldots,\rho_{ij}^{A_{m}}$, respectively. The comprehensive set pair connection degree of samples $X_{i}$ and $X_{j}$ is:

$\displaystyle\rho_{ij}=\frac{1}{m}\left({\rho_{ij}^{A_{1}}+\rho_{ij}^{A_{2}}+\ldots+\rho_{ij}^{A_{m}}}\right)$ (2)

A matrix sample data set with $X=\left\{{X_{1},X_{2}}\right\}$ and $A=\left\{{A_{1},A_{2},A_{3}}\right\}$ is shown in Table 2. The set pair connection degree between samples $X_{1}$ and $X_{2}$ is calculated as follows.

The set pair connection degree between samples $X_{1}$ and $X_{2}$ under attribute $A_{1}$ is:

$\displaystyle\rho_{12}^{A_{1}}=\frac{6}{9}+\frac{1}{9}i+\frac{2}{9}j=\frac{2}{3}+\frac{1}{9}i+\frac{2}{9}j$

The set pair connection degree between samples $X_{1}$ and $X_{2}$ under attribute $A_{2}$ is:

$\displaystyle\rho_{12}^{A_{2}}=\frac{6}{9}+0i+\frac{3}{9}j=\frac{2}{3}+0i+\frac{1}{3}j$

The set pair connection degree between samples $X_{1}$ and $X_{2}$ under attribute $A_{3}$ is:

$\displaystyle\rho_{12}^{A_{3}}=\frac{6}{9}+\frac{2}{9}i+\frac{1}{9}j=\frac{2}{3}+\frac{2}{9}i+\frac{1}{9}j$

The comprehensive set pair connection degree of samples $X_{1}$ and $X_{2}$ is:

$\displaystyle\rho_{12}=\frac{1}{3}\left({\rho_{12}^{A_{1}}+\rho_{12}^{A_{2}}+\rho_{12}^{A_{3}}}\right)=\frac{2}{3}+\frac{1}{9}i+\frac{2}{9}j$

Table 2

Matrix sample data sets

Sample	Attribute $A_{1}$	Attribute $A_{2}$	Attribute $A_{3}$
$X_{1}$	1	0	a
	2	1	b
	3	1	b
	4	0	Blank
	2	1	c
$X_{2}$	2	0	a
	1	0	Blank
	4	0	b
	Blank	1	c
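The values computed for Table 2 can be checked with the sketch below. The matching rule is an assumption inferred from the worked example: record values are paired one to one (a multiset intersection), so the second `2` of $X_{1}$ under $A_{1}$ finds no partner in $X_{2}$ and counts as an "other" value. The function names and the use of `None` for `Blank` are likewise illustrative.

```python
from collections import Counter
from fractions import Fraction


def connection_degree(col_i, col_j):
    """Eq. (1): return (a, b, c) for two columns of record values."""
    n = len(col_i) + len(col_j)                      # N: all record values
    q = sum(v is None for v in col_i) + sum(v is None for v in col_j)
    known_i = Counter(v for v in col_i if v is not None)
    known_j = Counter(v for v in col_j if v is not None)
    # Each one-to-one match contributes one record from each sample to S.
    s = 2 * sum((known_i & known_j).values())
    f = n - s - q                                    # F: remaining values
    return Fraction(s, n), Fraction(q, n), Fraction(f, n)


def comprehensive_degree(x_i, x_j):
    """Eq. (2): average the per-attribute connection degrees."""
    per_attribute = [connection_degree(x_i[name], x_j[name]) for name in x_i]
    m = len(per_attribute)
    return tuple(sum(rho[t] for rho in per_attribute) / m for t in range(3))


# Table 2, with None standing for "Blank".
x1 = {"A1": [1, 2, 3, 4, 2], "A2": [0, 1, 1, 0, 1], "A3": ["a", "b", "b", None, "c"]}
x2 = {"A1": [2, 1, 4, None], "A2": [0, 0, 0, 1], "A3": ["a", None, "b", "c"]}

rho_12 = comprehensive_degree(x1, x2)
```

Because the two samples may have different numbers of records (five and four here), the degrees are normalized by the combined count $N=9$ rather than by the length of either sample.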

Definition 7. (Potential function): For a three-element connection number $\rho=a+bi+cj$, its potential function is denoted as:

$\displaystyle\textit{Shi}\left(\rho\right)=\frac{e^{a}}{e^{c}}$ (3)

Generally speaking, the larger the value of the potential function $\frac{e^{a}}{e^{c}}$, the smaller the changing trend of the distance between the two samples.

Definition 8. (Set pair distance measure): Let the three-element set pair connection degree between two samples $X_{i},X_{j}$ be $\rho=a+bi+cj$. The distance between the two samples is measured from the positive degree $a$ of the set pair connection degree and the potential function $\textit{Shi}\left(\rho\right)$ of the connection number:

$\displaystyle d\left({X_{i},X_{j}}\right)=\frac{1}{a+\textit{Shi}\left(\rho\right)}$ (4)

Here $a+\textit{Shi}\left(\rho\right)$ represents the similarity between two matrix samples: the larger its value, the more similar the two samples. $d\left({X_{i},X_{j}}\right)$ represents the distance between two matrix samples: the smaller its value, the closer the two samples. $a$ measures the degree to which the attribute values of the two samples are the same: the larger $a$ is, the more the two samples agree. $\textit{Shi}\left(\rho\right)$ measures the trend of change of the distance between the two samples: the larger its value, the smaller the changing trend of the distance.
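Equations (3) and (4) translate directly into code. The sketch below uses hypothetical function names and takes the coefficients $a$, $b$ and $c$ as plain numbers; the difference degree $b$ is accepted only for symmetry, since Eq. (4) does not use it directly.

```python
import math


def potential(a: float, c: float) -> float:
    """Eq. (3): Shi(rho) = e^a / e^c."""
    return math.exp(a) / math.exp(c)


def set_pair_distance(a: float, b: float, c: float) -> float:
    """Eq. (4): d = 1 / (a + Shi(rho)); b does not enter the formula."""
    return 1.0 / (a + potential(a, c))


# Comprehensive connection degree of X_1 and X_2 from Table 2.
d_12 = set_pair_distance(2 / 3, 1 / 9, 2 / 9)
```

For the Table 2 example this gives a distance of roughly 0.449. Since $a+b+c=1$, the distance defined by Eq. (4) ranges from $1/(1+e)$ for two identical complete samples ($a=1$) to $e$ for two complete samples with nothing in common ($c=1$), so it never reaches zero.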

4.2 Set pair k-modes clustering result representation

Three kinds of information granule sets are defined in the set pair granule space: the positive granule set, the difference granule set and the negative granule set. The positive and negative granule sets consist of certain information granules, and the difference granule set consists of uncertain information granules. These three kinds of granule sets correspond exactly to the three relations between samples and clusters, which is why the theory of set pair information granules is introduced into k-modes clustering. Based on the three kinds of information granule sets, a set pair clustering result represented by a positive region $C_{s}$, a boundary region $C_{u}$ and a negative region $C_{o}$ is proposed. The samples in the positive region certainly belong to the cluster, the samples in the boundary region may belong to the cluster, and the samples in the negative region do not belong to the cluster. The three regions have the following properties:

(i) $C_{s}\left({C_{i}}\right)\neq\emptyset$
(ii) $\bigcup\limits_{i=1}^{k}{\left({C_{s}\left({C_{i}}\right)\cup C_{u}\left({C_{i}}\right)}\right)}=U$
(iii) $C_{s}\left({C_{i}}\right)\cap C_{s}\left({C_{j}}\right)=\emptyset,i\neq j$

Property (i) states that the positive region of each cluster is non-empty, that is, it contains at least one sample. Property (ii) states that every sample must be assigned to some cluster, and property (iii) states that the positive regions of any two clusters are disjoint.

4.3 Design idea and implementation of algorithm

For incomplete categorical matrix samples, a set pair distance measure based on set pair information granules has been presented above. Applying this measure to k-modes clustering yields a set pair k-modes clustering algorithm for incomplete categorical matrix data. In traditional k-modes algorithms, a cluster is represented by a set with clear boundaries, which considers only two relations between samples and clusters: belonging and not belonging. However, assigning uncertain samples to a single cluster reduces the accuracy of the clustering results. The proposed set pair k-modes clustering algorithm therefore improves on k-modes by taking the uncertain relationship between samples and clusters into account. Each cluster is represented by a positive region $C_{s}$, a boundary region $C_{u}$ and a negative region $C_{o}$. The relevant definitions are as follows:

Definition 9. (The intra-cluster average distance): Suppose the cluster $C_{j}\left({1\leqslant j\leqslant k}\right)$ contains $n$ samples and has cluster center $\mu_{j}$. The distance between any two samples $X_{i}$, $X_{j}$ is $d\left({X_{i},X_{j}}\right)=\frac{1}{a+\textit{Shi}\left(\rho\right)}$. The average distance within the cluster $C_{j}$ is then:

$\displaystyle\bar{d}_{j}=\frac{1}{n}\sum\limits_{i=1}^{n}{d\left({X_{i},\mu_{j}}\right)}$ (5)

Where, $\bar{d}_{j}$ represents the average of the distance between all samples in the cluster $C_{j}$ and the cluster center $\mu_{j}$ .

Definition 10. (Threshold $\alpha_{j}$): Suppose there are $n$ samples in the boundary region of the cluster $C_{j}\left({1\leqslant j\leqslant k}\right)$, and let $\mu_{j}$ denote the cluster center of $C_{j}$. The distance between the sample $X_{i}$ and the cluster center $\mu_{j}$ is $d_{ij}$. The minimum distance between the sample $X_{i}$ and the cluster centers of the other clusters is $d_{ih}\left({1\leqslant h\leqslant k,h\neq j}\right)$. The threshold $\alpha_{j}\left({1\leqslant j\leqslant k}\right)$ for cluster $C_{j}$ is:

$\displaystyle\alpha_{j}=\frac{1}{n}\sum\limits_{i=1}^{n}{\left({d_{ih}-d_{ij}}\right)}$ (6)

Here $\alpha_{j}$ is used to determine whether the sample $X_{i}$ is related to multiple clusters.

The MD-SPKM algorithm divides the clustering process into the following parts:
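Equations (5) and (6) can be sketched as two small helpers. The names and the data layout are assumptions: `distances_to_center` holds $d\left({X_{i},\mu_{j}}\right)$ for the members of one cluster, and each row of `boundary_rows` holds the distances of one boundary sample of $C_{j}$ to all $k$ cluster centers.

```python
def intra_cluster_average(distances_to_center):
    """Eq. (5): mean distance from the members of a cluster to its center."""
    if not distances_to_center:
        raise ValueError("the cluster must contain at least one sample")
    return sum(distances_to_center) / len(distances_to_center)


def threshold(boundary_rows, j):
    """Eq. (6): mean of d_ih - d_ij over the boundary samples of cluster j."""
    if not boundary_rows:
        raise ValueError("the boundary region must contain at least one sample")
    gaps = []
    for row in boundary_rows:
        d_ij = row[j]
        # d_ih: smallest distance to the center of any other cluster.
        d_ih = min(d for h, d in enumerate(row) if h != j)
        gaps.append(d_ih - d_ij)
    return sum(gaps) / len(gaps)


# Two boundary samples of cluster 0 in a problem with k = 3 clusters.
alpha_0 = threshold([[0.5, 0.6, 0.9], [0.5, 0.9, 0.7]], j=0)
```

In this toy case the gaps are about 0.1 and 0.2, so $\alpha_{0}$ is about 0.15. The first sample, whose gap lies below the threshold, is the one that would be treated as related to a second cluster.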

4.3.1 Division of initial samples

Suppose $U=\left\{{X_{1},X_{2},\ldots,X_{n}}\right\}$ is a set of $n$ matrix samples. Randomly select $k$ samples from $U$ as the initial cluster centers $\left\{{\mu_{1},\mu_{2},\ldots,\mu_{k}}\right\}$. Establish the set pair connection degree between each sample $X_{i}\left({1\leqslant i\leqslant n}\right)$ and each cluster center $\mu_{j}$ according to Eqs (1) and (2). Then obtain the distance $d\left({X_{i},\mu_{j}}\right)$ between each sample and each cluster center from the set pair distance, Eq. (4). Select the nearest cluster center to determine the cluster marker $\lambda_{i}=\arg\min_{j\in\left\{{1,2,\ldots,k}\right\}}d\left({X_{i},\mu_{j}}\right)$ of the sample $X_{i}$, and assign the sample to the corresponding cluster: $C_{\lambda_{i}}=C_{\lambda_{i}}\cup\left\{{X_{i}}\right\}$.
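The initial division can be sketched as below. The functions are illustrative and take the distance as a parameter, so the set pair distance of Eq. (4) or any stand-in can be supplied; the optional `seed` argument is an addition made here for reproducibility.

```python
import random


def pick_initial_centers(samples, k, seed=None):
    """Randomly select k distinct samples as the initial cluster centers."""
    rng = random.Random(seed)
    return rng.sample(samples, k)


def assign_to_nearest(samples, centers, distance):
    """Return the cluster markers lambda_i and the resulting clusters.

    clusters[j] lists the indices of the samples assigned to center j.
    """
    clusters = [[] for _ in centers]
    markers = []
    for index, sample in enumerate(samples):
        dists = [distance(sample, mu) for mu in centers]
        # arg min over j; ties go to the center with the smallest index.
        marker = min(range(len(centers)), key=dists.__getitem__)
        markers.append(marker)
        clusters[marker].append(index)
    return markers, clusters


# Stand-in data and distance, used only to exercise the assignment step.
toy_samples = [1, 2, 10, 11]
markers, clusters = assign_to_nearest(toy_samples, [1, 10], lambda x, mu: abs(x - mu))
```

The toy numbers and the absolute-difference distance are placeholders; in MD-SPKM the samples would be matrix samples and the distance would be Eq. (4).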

Figure 1.

Schematic diagram of set pair k-modes clustering.

4.3.2 Cluster center update

The representation of cluster centers is crucial in the clustering process, so the update of cluster centers is discussed next. Given a set $X=\{X_{1},X_{2},\ldots,X_{n}\}$ of matrix samples, let $Q=\{Q_{1},Q_{2},\ldots,Q_{m}\}$ be a matrix sample described by $m$ attributes. If $Q$ minimizes the objective function $D\left({X,Q}\right)=\sum_{i=1}^{n}{d\left({X_{i},Q}\right)}$, then $Q$ is the cluster center of the set $X$. The cluster centers are updated after each sample is assigned. After all samples have been assigned to clusters, the distance between each sample and the cluster centers is recalculated. If a sample is closer to the center of another cluster, it is reassigned to that cluster and the centers of the two clusters involved are updated. This is repeated until no sample changes its cluster, which yields the preliminary clustering result $\left\{{C_{1},C_{2},\ldots,C_{k}}\right\}$.
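The text does not spell out how the minimizer $Q$ is found. As a deliberately simplified stand-in, the sketch below restricts the candidates to the members of the cluster and returns the one with the smallest summed distance (a medoid). This is a substitute technique chosen for illustration, not the update rule of MD-SPKM, and the function name is hypothetical.

```python
def medoid_center(cluster, distance):
    """Return the cluster member Q that minimizes D(X, Q) = sum_i d(X_i, Q).

    Only members of the cluster are considered as candidates, which is a
    simplification: the true minimizer need not be one of the samples.
    """
    if not cluster:
        raise ValueError("the cluster must contain at least one sample")
    return min(cluster, key=lambda q: sum(distance(x, q) for x in cluster))


# Stand-in numeric data and distance, used only to exercise the function.
center = medoid_center([1, 2, 3, 4, 10], lambda x, q: abs(x - q))
```

Restricting the search to existing samples keeps the center a valid matrix sample and costs a number of distance evaluations quadratic in the cluster size.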

4.3.3 Set pair clustering results generation

Next, the preliminary clustering results are subdivided; the main task is to construct the positive region and the boundary region. The average distance $\bar{d}_{j}\left({1\leqslant j\leqslant k}\right)$ of each cluster is calculated according to Eq. (5) and then compared with the distance $d\left({X_{i},\mu_{j}}\right)$ between each sample and the cluster center. If $d\left({X_{i},\mu_{j}}\right)\leqslant\bar{d}_{j}$, the similarity between the sample and the center of cluster $C_{j}$ is high, and the sample is placed in the positive region. If $d\left({X_{i},\mu_{j}}\right)>\bar{d}_{j}$, the similarity between the sample and the center of cluster $C_{j}$ is relatively low, and the sample is placed in the boundary region. A sample in the positive region can belong to only one cluster, whereas a sample in the boundary region may belong to more than one cluster. Each sample in the boundary region is therefore tested against the threshold of Eq. (6) to decide whether it is related to multiple clusters. If $\left({d_{ih}-d_{ij}}\right)<\alpha_{j}$, the sample is assigned to the boundary regions of both $C_{j}$ and $C_{h}$. If $\left({d_{ih}-d_{ij}}\right)\geqslant\alpha_{j}$, the sample $X_{i}$ keeps its current assignment and no further operation is performed. Finally, a set pair clustering result containing the positive region $C_{s}$ and the boundary region $C_{u}$ is formed. The result is expressed as:

$\displaystyle C=\left\{{\left\{{C_{s}\left({C_{1}}\right)\cup C_{u}\left({C_{1}}\right)}\right\},\left\{{C_{s}\left({C_{2}}\right)\cup C_{u}\left({C_{2}}\right)}\right\},\ldots,\left\{{C_{s}\left({C_{k}}\right)\cup C_{u}\left({C_{k}}\right)}\right\}}\right\}$

Where, $C_{s}\left({C_{k}}\right)$ is the positive region of cluster $C_{k}$ , $C_{u}\left({C_{k}}\right)$ is the boundary region of cluster $C_{k}$ . The positive region and boundary region together form a cluster. The schematic diagram of the clustering results is shown in Fig. 1.
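The construction of the two regions can be sketched as follows, under the assumption that the distances to all cluster centers are already available. The function name, the matrix layout `dist[i][j]` $=d\left({X_{i},\mu_{j}}\right)$ and the toy numbers are illustrative only.

```python
def build_regions(dist, labels, k):
    """Split preliminary clusters into positive and boundary regions.

    dist[i][j] is the distance of sample i to center j, and labels[i] is the
    preliminary cluster of sample i. Returns two lists of index sets.
    """
    members = [[i for i, lab in enumerate(labels) if lab == j] for j in range(k)]
    positive = [set() for _ in range(k)]
    boundary = [set() for _ in range(k)]

    # Compare each member with the intra-cluster average distance, Eq. (5).
    for j, idxs in enumerate(members):
        if not idxs:
            continue
        d_bar = sum(dist[i][j] for i in idxs) / len(idxs)
        for i in idxs:
            (positive if dist[i][j] <= d_bar else boundary)[j].add(i)

    # Share boundary samples with their nearest other cluster, Eq. (6).
    own_boundary = [sorted(b) for b in boundary]  # snapshot before sharing
    for j, idxs in enumerate(own_boundary):
        if not idxs or k < 2:
            continue
        nearest = {i: min((h for h in range(k) if h != j), key=lambda h: dist[i][h])
                   for i in idxs}
        gaps = {i: dist[i][nearest[i]] - dist[i][j] for i in idxs}
        alpha = sum(gaps.values()) / len(idxs)
        for i in idxs:
            if gaps[i] < alpha:
                boundary[nearest[i]].add(i)
    return positive, boundary


dist = [[0.5, 1.5], [0.5, 1.5], [1.0, 1.125], [1.0, 2.0], [1.5, 0.5], [1.5, 0.5]]
positive, boundary = build_regions(dist, labels=[0, 0, 0, 0, 1, 1], k=2)
```

In this toy run, samples 2 and 3 fall into the boundary region of the first cluster, and sample 2, whose gap of 0.125 is below the threshold of 0.5625, is additionally placed in the boundary region of the second cluster. Because the smallest member distance never exceeds the average, each non-empty preliminary cluster keeps at least one sample in its positive region, in line with property (i).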

The implementation flow of the algorithm is as follows:

Algorithm: MD-SPKM

Input: Sample set $U=\left\{X_{1},X_{2},\ldots,X_{n}\right\}$, number of clusters $k$

1: $k$ samples are randomly selected as the initial cluster centers $\left\{\mu_{1},\mu_{2},\ldots,\mu_{k}\right\}$;

2: $C_{j}=\emptyset\left(1\leqslant j\leqslant k\right)$;

3: For $i=1,2,\ldots,n$

4: Compute the distance $d\left(X_{i},\mu_{j}\right)$;

5: $\lambda_{i}$ is a cluster marker, $\lambda_{i}=\arg\min_{j\in\left\{1,2,\ldots,k\right\}}d\left(X_{i},\mu_{j}\right)$;

6: $C_{\lambda_{i}}=C_{\lambda_{i}}\cup\left\{X_{i}\right\}$;

7: Update the cluster center to minimize $D\left(X,Q\right)=\sum\limits_{i=1}^{n}d\left(X_{i},Q\right)$;

8: $Q$ is the updated cluster center;

9: End for

10: Repeat

11: For $i=1,2,\ldots,n$

12: Recalculate the distance $d\left(X_{i},\mu_{j}\right)$;

13: If $d\left(X_{i},\mu_{j}\right)>d\left(X_{i},\mu_{l}\right)\left(1\leqslant j\leqslant k,1\leqslant l\leqslant k\right)$ then

14: $X_{i}$ is redistributed to $C_{l}$, and $\mu_{j}$ and $\mu_{l}$ are updated;

15: Else

16: Keep the current allocation state;

17: End if

18: End for

19: Until no sample changes its cluster;

20: Calculate the average distance $\bar{d}_{j}\left(1\leqslant j\leqslant k\right)$ of each cluster;

21: Compare $\bar{d}_{j}\left(1\leqslant j\leqslant k\right)$ with $d\left(X_{i},\mu_{j}\right)$;

22: If $d\left(X_{i},\mu_{j}\right)\leqslant\bar{d}_{j}\left(1\leqslant j\leqslant k\right)$ then

23: $C_{s}\left(C_{j}\right)=C_{s}\left(C_{j}\right)\cup\left\{X_{i}\right\}$;

24: Else

25: $C_{u}\left(C_{j}\right)=C_{u}\left(C_{j}\right)\cup\left\{X_{i}\right\}$;

26: End if

27: Calculate the value $\left(d_{ih}-d_{ij}\right)$ of each sample in the boundary region;

28: Calculate the threshold $\alpha_{j}\left(1\leqslant j\leqslant k\right)$;

29: If $\left(d_{ih}-d_{ij}\right)<\alpha_{j}$ then

30: $C_{u}\left(C_{j}\right)=C_{u}\left(C_{j}\right)\cup\left\{X_{i}\right\}$ and $C_{u}\left(C_{h}\right)=C_{u}\left(C_{h}\right)\cup\left\{X_{i}\right\}$;

31: Else

32: Keep the current allocation state of sample $X_{i}$;

33: End if

Output: Set pair clustering results: $C=\left\{\left\{C_{s}\left(C_{1}\right)\cup C_{u}\left(C_{1}\right)\right\},\left\{C_{s}\left(C_{2}\right)\cup C_{u}\left(C_{2}\right)\right\},\ldots,\left\{C_{s}\left(C_{k}\right)\cup C_{u}\left(C_{k}\right)\right\}\right\}$
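The refinement stage (steps 20 to 33) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the distances $d\left(X_{i},\mu_{j}\right)$ are already available as a matrix, that $\bar{d}_{j}$ is the mean distance of the members of $C_{j}$ to $\mu_{j}$, and that the thresholds $\alpha_{j}$ are supplied by the caller, since Eqs. (5) and (6) are defined elsewhere in the paper. The difference is taken as $d_{ih}-d_{ij}$, as in the text of Section 4.3.3, and the name `refine_regions` is hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def refine_regions(
    dist: Sequence[Sequence[float]],
    labels: Sequence[int],
    alpha: Sequence[float],
) -> Tuple[List[Set[int]], List[Set[int]]]:
    """Split hard clusters into positive and boundary regions.

    dist[i][j] is d(X_i, mu_j), labels[i] is the cluster index of X_i,
    and alpha[j] is the threshold of cluster j (assumed to be given).
    """
    k = len(alpha)

    # Step 20: average distance of each cluster's members to its own center
    # (assumed form of the intra-cluster average distance).
    totals, counts = [0.0] * k, [0] * k
    for i, j in enumerate(labels):
        totals[j] += dist[i][j]
        counts[j] += 1
    d_bar = [totals[j] / counts[j] if counts[j] else 0.0 for j in range(k)]

    positive: List[Set[int]] = [set() for _ in range(k)]
    boundary: List[Set[int]] = [set() for _ in range(k)]
    for i, j in enumerate(labels):
        # Steps 21-26: samples close to their center go to the positive region.
        if dist[i][j] <= d_bar[j]:
            positive[j].add(i)
            continue
        boundary[j].add(i)
        # Steps 27-33: a boundary sample also joins C_h when mu_h is
        # almost as close as its own center mu_j.
        for h in range(k):
            if h != j and dist[i][h] - dist[i][j] < alpha[j]:
                boundary[h].add(i)
    return positive, boundary
```

With five samples, two clusters and thresholds of 0.2, the third sample lies far from its own center and almost as close to the other one, so it ends up in the boundary regions of both clusters.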

4.4 Algorithm complexity analysis

In most practical application scenarios the data have a large number of attributes, which affects the efficiency of the algorithm. Let the number of samples be $n$, the number of attributes be $m$, and the number of clusters be $k$. The complexity of the algorithm comes mainly from two parts. The first part calculates the distance between each sample and each cluster center, with complexity $O(tkmn)$, where $t$ is the number of iterations. The second part refines the initial clustering results and assigns each sample to the positive region or the boundary region of the corresponding cluster, with time complexity $O(kmn)$. The time complexity of the whole MD-SPKM algorithm is therefore $O(tkmn)+O(kmn)$, which is $O(tkmn)$ because $t\geqslant 1$.

5. Experimental evaluation

A series of experiments are carried out in this section to evaluate the clustering performance of the proposed algorithm MD-SPKM. The experimental comparison is carried out by selecting four representative algorithms. The first is the classical k-modes algorithm proposed by Huang [3]. The second is the k-mw-modes algorithm based on matrix samples proposed by Cao et al. [8]. The third is the MD fuzzy k-modes algorithm proposed by Li et al. [17]. The fourth is the fuzzy k-modes algorithm based on rough set proposed by Saha et al. [22], which is denoted as RFKMd algorithm.

The algorithm proposed in this paper can cluster both incomplete and complete matrix data. To demonstrate the validity of the MD-SPKM algorithm, two aspects are evaluated experimentally. First, on the complete matrix data sets, the proposed algorithm is compared with the selected contrast algorithms. Second, on the incomplete matrix data sets, MD-SPKM is compared with the four contrast algorithms under different missing rates.

The proposed algorithm and the contrast algorithms are implemented on a DELL computer (Windows 10, Intel(R) Core(TM) i5-8300H CPU @ 2.30 GHz). The programming platform is Python 3.7.

5.1 Data sets

Since matrix sample data with label information are relatively rare, the given data need to be preprocessed. In this paper, the Multidimensional Scaling method is used for this purpose. The goal of this method is to obtain the distribution of $n$ points in a $p$-dimensional space; this paper sets $p=2$. Two similar samples are represented by two points that are close to each other in this space, so the distribution of the data can be observed by visualizing the $n$ points. Because the distribution of real data sets is usually disordered, some sample points are deleted on the basis of the visualization to obtain a relatively clear data structure.
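As an illustration of this step, the sketch below embeds $n$ samples in $p=2$ dimensions from a precomputed dissimilarity matrix. The paper does not state which Multidimensional Scaling variant was used, so classical (Torgerson) scaling with NumPy is assumed here, and the function name `classical_mds` is hypothetical. The dissimilarity matrix would hold the pairwise distances between matrix samples.

```python
import numpy as np


def classical_mds(dissim: np.ndarray, p: int = 2) -> np.ndarray:
    """Embed n samples in p dimensions from an n-by-n dissimilarity matrix."""
    n = dissim.shape[0]
    # Double-center the squared dissimilarities: B = -1/2 * J * D^2 * J.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dissim ** 2) @ centering
    # eigh returns eigenvalues in ascending order; keep the p largest.
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:p]
    # Negative eigenvalues (non-Euclidean input) are clipped to zero.
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale
```

Each row of the returned array gives the $(x, y)$ coordinates of one sample, which can then be plotted to inspect the structure of the data set.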

In this experiment, three data sets were selected to evaluate the algorithm: Musk, Microsoft Web and MovieLens. The Musk data set was downloaded from the UCI database and contains 476 records for 92 samples; each record has 167 attributes. The Microsoft Web data set was also downloaded from the UCI database and contains 165257 records of 32711 users for 294 websites. Each record has two attributes: user ID and web ID. The MovieLens data set was created in 2018 and downloaded from the MovieLens website. This paper uses the ratings data, which contain 1000209 rating scores for 3900 movies by 6040 users. Each record has four attributes: user ID, movie ID, rating and timestamp.

Figure 2. Visualization of Microsoft Web data sets after preprocessing.

Figure 3. Visualization of MovieLens data sets after preprocessing.

The Musk data set has good data characteristics, with no noise data or missing values, so it can be clustered directly and is not preprocessed. For the Microsoft Web data set, samples with fewer than 7 records are deleted first; the data set is then visualized by the Multidimensional Scaling method, and the samples at positions $x<-0.05$ or $x>0.05$ in the coordinate system are selected to obtain the preprocessed data set. A visualization of the processed Microsoft Web data set is shown in Fig. 2. For the MovieLens data set, the timestamp attribute, which does not affect the clustering results, is deleted first; the data set is then visualized in the coordinate system, and the samples at positions ($x<-0.1$, $y<-0.1$), ($x<-0.1$, $y>0.1$), ($x>0.1$, $y<-0.1$) or ($x>0.1$, $y>0.1$) are selected to obtain the preprocessed data set. A visualization of the processed MovieLens data set is shown in Fig. 3. After preprocessing, the Microsoft Web and MovieLens data contain no noise data or missing values. The final data sets after preprocessing are described in Table 3.
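The two selection rules amount to discarding the points near the axes of the two-dimensional embedding. A minimal sketch, assuming the embedded coordinates are available as $(x, y)$ pairs; the function names are hypothetical and only the margins 0.05 and 0.1 come from the text:

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def select_microsoft_web(points: Sequence[Point], margin: float = 0.05) -> List[int]:
    """Indices of samples with x < -margin or x > margin."""
    return [i for i, (x, _) in enumerate(points) if abs(x) > margin]


def select_movielens(points: Sequence[Point], margin: float = 0.1) -> List[int]:
    """Indices of samples lying in one of the four corner regions."""
    return [
        i for i, (x, y) in enumerate(points)
        if abs(x) > margin and abs(y) > margin
    ]
```

The first rule leaves two groups of points separated along the $x$ axis and the second leaves four groups, one per quadrant, which is consistent with the numbers of clusters listed in Table 3.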

5.2 Evaluation indicators

The evaluation of clustering, also called clustering validity, is the key process for assessing the performance of a clustering algorithm. The following evaluation indicators are commonly used for this purpose. Let the clustering result of the MD-SPKM algorithm be $C=\left\{\left\{C_{s}\left(C_{1}\right)\cup C_{u}\left(C_{1}\right)\right\},\ldots,\left\{C_{s}\left(C_{k}\right)\cup C_{u}\left(C_{k}\right)\right\}\right\}$, where the number of clusters is $k$, and let the real clustering result be $C^{*}=\left\{C_{1}^{*},C_{2}^{*},\ldots,C_{k^{*}}^{*}\right\}$, where the number of real clusters is $k^{*}$.

(1) Accuracy

$\displaystyle\textit{ACC}=\frac{1}{n}\sum\limits_{j=1}^{k}{\theta_{j}}$

where $n$ is the total number of samples and $\theta_{j}$ is the number of samples correctly assigned in the cluster $C_{j}$.
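A minimal sketch of this indicator follows. The text does not state how a sample is judged to be correctly assigned, so the sketch assumes that the per-cluster count $\theta$ is the number of members carrying the majority true label of that cluster. It also assumes a single cluster index per sample, and the function name is hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def clustering_accuracy(pred: Sequence[int], truth: Sequence[Hashable]) -> float:
    """ACC = (1/n) * sum of theta, with theta taken as the majority-label count."""
    members: Dict[int, List[Hashable]] = {}
    for cluster, label in zip(pred, truth):
        members.setdefault(cluster, []).append(label)
    # theta for one cluster: size of its largest true-label group (assumption).
    correct = sum(Counter(labels).most_common(1)[0][1] for labels in members.values())
    return correct / len(pred)
```

For a set pair clustering result, the same function could be applied either to the positive regions alone or to the positive and boundary regions together, as in Section 5.3.1.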

Table 3
Description of matrix data sets

ID	Data set	Number of samples	Number of features	Records	Number of clusters
1	Musk	92	167	476	2
2	Microsoft Web	2602	2	24102	2
3	MovieLens	1038	3	30308	4

(2) Recall

$\displaystyle\textit{Recall}=\frac{\textit{TP}}{\textit{TP}+\textit{FN}}$

where TP is the number of samples that are actually positive and are also predicted as positive, and FN is the number of samples that are actually positive but are predicted as negative.

(3) Adjusted Rand Index (ARI)

$\displaystyle\textit{ARI}=\frac{\sum\nolimits_{ij}\binom{\left|C_{i}\cap C_{j}^{*}\right|}{2}-\left[\sum\nolimits_{i}\binom{\left|C_{i}\right|}{2}\sum\nolimits_{j}\binom{\left|C_{j}^{*}\right|}{2}\right]\Big/\binom{n}{2}}{\frac{1}{2}\left[\sum\nolimits_{i}\binom{\left|C_{i}\right|}{2}+\sum\nolimits_{j}\binom{\left|C_{j}^{*}\right|}{2}\right]-\left[\sum\nolimits_{i}\binom{\left|C_{i}\right|}{2}\sum\nolimits_{j}\binom{\left|C_{j}^{*}\right|}{2}\right]\Big/\binom{n}{2}}$

$\textit{ARI}\in[-1,1]$; the larger the value, the more consistent the clustering result is with the real partition.
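The ARI can be computed directly from the contingency counts between the two partitions. The following sketch uses only the Python standard library (`math.comb` requires Python 3.8 or later) and assumes one cluster label per sample and at least two samples. The function name is hypothetical, and returning 1.0 in the degenerate case is a convention chosen for this sketch.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence


def adjusted_rand_index(pred: Sequence[Hashable], truth: Sequence[Hashable]) -> float:
    """Adjusted Rand Index between a clustering and the real partition."""
    n = len(pred)
    # Number of sample pairs inside each intersection of C_i and C_j^*.
    sum_ij = sum(comb(c, 2) for c in Counter(zip(pred, truth)).values())
    sum_i = sum(comb(c, 2) for c in Counter(pred).values())
    sum_j = sum(comb(c, 2) for c in Counter(truth).values())
    expected = sum_i * sum_j / comb(n, 2)
    max_index = 0.5 * (sum_i + sum_j)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are trivial
    return (sum_ij - expected) / (max_index - expected)
```

Two partitions that agree up to a relabeling of the clusters give an ARI of 1, while a clustering that cuts across the real partition can give a negative value.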
(4) Normalized mutual information (NMI)

$\displaystyle\textit{NMI}=\frac{\sum\limits_{i=1}^{k}\sum\limits_{j=1}^{k^{*}}\left|C_{i}\cap C_{j}^{*}\right|\log\left(\frac{n\cdot\left|C_{i}\cap C_{j}^{*}\right|}{\left|C_{i}\right|\cdot\left|C_{j}^{*}\right|}\right)}{\sqrt{\left(\sum\limits_{i=1}^{k}\left|C_{i}\right|\log\frac{\left|C_{i}\right|}{n}\right)\left(\sum\limits_{j=1}^{k^{*}}\left|C_{j}^{*}\right|\log\frac{\left|C_{j}^{*}\right|}{n}\right)}}$

NMI ranges from 0 to 1 and equals 1 only if the detected clusters and the real clusters are completely consistent.

Here, $\left|{C_{i}}\right|$ is the number of samples in cluster $C_{i}$, $\left|{C_{j}^{*}}\right|$ is the number of samples in the real cluster $C_{j}^{*}$, and $\left|{C_{i}\cap C_{j}^{*}}\right|$ is the number of samples shared by the cluster $C_{i}$ and the real cluster $C_{j}^{*}$.
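The NMI formula translates into a few lines of standard-library Python. The sketch below assumes one cluster label per sample. The function name is hypothetical, and returning 0.0 when one of the partitions consists of a single cluster is a convention chosen for this sketch, since the denominator vanishes in that case.

```python
from collections import Counter
from math import log, sqrt
from typing import Hashable, Sequence


def normalized_mutual_information(
    pred: Sequence[Hashable], truth: Sequence[Hashable]
) -> float:
    """NMI between a clustering and the real partition (square-root normalization)."""
    n = len(pred)
    joint = Counter(zip(pred, truth))  # sizes of the intersections
    size_pred = Counter(pred)          # sizes of the detected clusters
    size_true = Counter(truth)         # sizes of the real clusters
    mutual = sum(
        c * log(c * n / (size_pred[i] * size_true[j]))
        for (i, j), c in joint.items()
    )
    h_pred = sum(c * log(c / n) for c in size_pred.values())
    h_true = sum(c * log(c / n) for c in size_true.values())
    denom = sqrt(h_pred * h_true)  # both sums are <= 0, so the product is >= 0
    return mutual / denom if denom > 0 else 0.0
```

Only non-empty intersections appear in the joint counts, so the logarithm is never evaluated at zero.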
5.3 Experimental results and analysis

5.3.1 MD-SPKM algorithm performance analysis

To better show the clustering effect of the MD-SPKM algorithm, the change of the evaluation indexes during iteration is given, taking the Microsoft Web data set as an example. Since each cluster is composed of a positive region and a boundary region, the four indexes are calculated separately for the positive region alone and for the combination of the positive region and the boundary region. The specific results are shown in Fig. 4.

The MD-SPKM algorithm takes into account the uncertain relationship between a sample and a cluster: it assigns the samples that certainly belong to the cluster to the positive region and the samples that may belong to the cluster to the boundary region. Figure 4a gives the clustering results when only the positive regions are considered, and Fig. 4b gives the clustering results when the positive regions and the boundary regions are combined. Both graphs show that the four indicators rise steadily as the number of iterations increases. Under all indicators, however, the results for the combination of positive and boundary regions are higher than those for the positive regions alone, because each cluster is divided into two parts, the positive region and the boundary region, and the two sets are obtained by contraction or expansion of the two-way clustering result. The best clustering results in Fig. 4b are Accuracy 0.9297, Recall 0.9552, ARI 0.7384 and NMI 0.6405. When only the positive region is considered (Fig. 4a), the Accuracy still reaches 0.8981, the ARI reaches 0.7, and the NMI reaches 0.5856. Only the Recall decreases appreciably, to 0.5994, but the overall clustering results are still good.

5.3.2 Comparative analysis of experiments under complete data set

On the complete matrix data set, four algorithms are selected to compare with the proposed algorithm MD-SPKM, which are k-modes [3], k-mw-modes [8], MD fuzzy k-modes [17] and RFKMd [22], respectively. The experimental contrast results obtained through four evaluation indicators are shown in Tables 4–6, and the best performance of each dataset has been highlighted in bold.

Table 4
Comparison of five algorithms on Musk data sets

Algorithms	ACC	Recall	ARI	NMI
k-modes	0.5378	0.5431	0.0035	0.0053
k-mw-modes	0.6703	0.4318	0.1081	0.1122
MD fuzzy k-modes	0.6413	0.4090	0.0670	0.0678
RFKMd	0.4810	0.5091	$-0.0033$	0.0003
MD-SPKM	0.6956	0.7446	0.1399	0.1479

Figure 4. Iterative clustering of Microsoft Web data sets.

Table 5

Comparison of five algorithms on Microsoft Web data sets

Algorithms	ACC	Recall	ARI	NMI
k-modes	0.5507	0.5032	0.0011	0.0002
k-mw-modes	0.9147	0.7297	0.5609	0.5254
MD fuzzy k-modes	0.8674	0.7696	0.5396	0.4350
RFKMd	0.5527	0.4993	$-0.00029$	0.0013
MD-SPKM	0.9297	0.9552	0.7384	0.6405

Table 6

Comparison of five algorithms on MovieLens data sets

Algorithms	ACC	Recall	ARI	NMI
k-modes	0.2528	0.2389	$-0.0021$	0.0072
k-mw-modes	0.6006	0.5852	0.2588	0.3230
MD fuzzy k-modes	0.5340	0.5513	0.1759	0.1954
RFKMd	0.2741	0.2467	$-0.00002$	0.0082
MD-SPKM	0.6225	0.6030	0.2596	0.3256

Table 7

Clustering results of Musk data sets at four missing rates

Algorithms	Missing rates/%	ACC	Recall	ARI	NMI
k-modes	5	0.4159	0.4337	0.0250	0.0140
	10	0.4852	0.5156	$-0.0044$	0.0009
	15	0.4432	0.5041	$-0.0022$	0.0006
	20	0.4391	0.4976	0.0007	0.0001
k-mw-modes	5	0.6703	0.5909	0.0660	0.0822
	10	0.6593	0.5454	0.0611	0.0650
	15	0.6373	0.5227	0.0608	0.0501
	20	0.6043	0.5037	0.0302	0.0494
MD fuzzy k-modes	5	0.6373	0.5227	0.0656	0.0562
	10	0.6156	0.5129	0.0601	0.0551
	15	0.5997	0.5095	0.0564	0.0418
	20	0.5824	0.4935	0.0307	0.0319
RFKMd	5	0.4601	0.4777	0.0031	0.0015
	10	0.5210	0.5422	$-0.0017$	0.0059
	15	0.4517	0.4937	0.0010	0.0002
	20	0.4601	0.4978	$-0.0007$	0.000002
MD-SPKM	5	0.6813	0.6740	0.1243	0.0825
	10	0.6766	0.6739	0.0625	0.0658
	15	0.6149	0.6382	0.0510	0.0515
	20	0.6077	0.6170	0.0312	0.0266

As can be seen from Tables 4–6, on the Musk, Microsoft Web and MovieLens data sets the values of the proposed MD-SPKM algorithm are higher than those of the contrast algorithms under all four evaluation indexes. On the Musk and Microsoft Web data sets the results improve under all four indexes, with Recall improving the most. On the Musk data set, the best Recall among the contrast algorithms is 0.5431, whereas MD-SPKM reaches 0.7446. On the Microsoft Web data set, the best Recall among the contrast algorithms is 0.7696, whereas MD-SPKM reaches 0.9552. On the MovieLens data set the improvement is small, but MD-SPKM is still slightly higher than the four contrast algorithms under all four indicators.

5.3.3 Comparative analysis of experiments with different missing rates

To simulate data sets containing missing values, the attribute values of some samples were deleted at random. In the experiment, missing rates of 5%, 10%, 15% and 20% were used to generate the incomplete matrix data sets. To eliminate the influence of the structure or distribution of the missing data, several different data sets were generated at each missing rate, and the experimental results of the multiple runs were averaged. The Musk, Microsoft Web and MovieLens data sets were evaluated at the different missing rates with the four evaluation indicators. The results are shown in Tables 7–9. The best performance on each data set at each missing rate has been highlighted in bold.
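One way to generate such incomplete data sets is sketched below. It assumes that each matrix sample is a list of records, that each record is a list of categorical values, that the values to delete are drawn uniformly at random over all cells, and that a deleted value is marked with `None`. The function name and the seeding scheme are hypothetical; the paper does not describe its procedure in this detail.

```python
import random
from typing import List, Optional, Sequence

Record = List[Optional[str]]


def inject_missing(
    samples: Sequence[Sequence[Sequence[str]]], rate: float, seed: int
) -> List[List[Record]]:
    """Return a copy of the matrix samples with a fraction `rate` of values set to None."""
    rng = random.Random(seed)
    # Deep copy so that the complete data set is left unchanged.
    copy: List[List[Record]] = [[list(record) for record in sample] for sample in samples]
    # Enumerate every (sample, record, attribute) cell of the data set.
    cells = [
        (s, r, a)
        for s, sample in enumerate(copy)
        for r, record in enumerate(sample)
        for a in range(len(record))
    ]
    # Delete a uniformly chosen subset of cells of the requested size.
    for s, r, a in rng.sample(cells, round(rate * len(cells))):
        copy[s][r][a] = None
    return copy
```

Calling the function with several different seeds at the same rate gives the different incomplete data sets whose results are then averaged.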

Table 8
Clustering results of Microsoft Web data sets at four missing rates

Algorithms	Missing rates/%	ACC	Recall	ARI	NMI
k-modes	5	0.5488	0.5009	0.0002	0.00001
	10	0.5527	0.5046	0.0017	0.0004
	15	0.5561	0.5078	0.0030	0.0015
	20	0.5534	0.5047	0.0018	0.0006
k-mw-modes	5	0.8977	0.8270	0.5348	0.5348
	10	0.8789	0.8012	0.5739	0.5183
	15	0.8704	0.7203	0.5493	0.4763
	20	0.8466	0.6712	0.4796	0.4569
MD fuzzy k-modes	5	0.8289	0.8115	0.4918	0.4076
	10	0.8166	0.8018	0.4581	0.3907
	15	0.7993	0.7908	0.4230	0.3666
	20	0.7817	0.7915	0.4205	0.3488
RFKMd	5	0.5507	0.4981	$-0.0008$	0.0007
	10	0.5526	0.5013	0.0005	0.00011
	15	0.5538	0.5024	0.0009	0.0004
	20	0.5377	0.4893	$-0.0037$	0.0028
MD-SPKM	5	0.9135	0.9475	0.6838	0.5863
	10	0.9031	0.9466	0.6499	0.5566
	15	0.8885	0.9423	0.6047	0.5158
	20	0.8835	0.9308	0.5882	0.5006

Table 9

Clustering results of MovieLens data sets at four missing rates

Algorithms	Missing rates/%	ACC	Recall	ARI	NMI
k-modes	5	0.2770	0.2563	0.0017	0.0079
	10	0.3011	0.2523	0.0003	0.0065
	15	0.3026	0.2536	0.0007	0.0102
	20	0.2998	0.2511	0.00004	0.0103
k-mw-modes	5	0.5955	0.5800	0.2499	0.3155
	10	0.5891	0.5717	0.2410	0.3018
	15	0.5642	0.5681	0.2218	0.2845
	20	0.5571	0.5577	0.2158	0.2841
MD fuzzy k-modes	5	0.5287	0.5493	0.1711	0.1917
	10	0.5249	0.5314	0.1647	0.1890
	15	0.5119	0.5226	0.1657	0.1882
	20	0.4971	0.5147	0.1557	0.1620
RFKMd	5	0.2997	0.2502	$-0.00026$	0.0025
	10	0.2997	0.2686	0.0051	0.0083
	15	0.2997	0.2503	$-0.00026$	0.0025
	20	0.2955	0.2570	0.0035	0.0049
MD-SPKM	5	0.6100	0.5912	0.2547	0.3164
	10	0.5816	0.5830	0.2416	0.3006
	15	0.5711	0.5733	0.2374	0.3019
	20	0.5647	0.5624	0.2370	0.2904

Tables 7–9 show that in most cases the MD-SPKM algorithm has better clustering performance than the k-modes, k-mw-modes, MD fuzzy k-modes and RFKMd algorithms. On the Microsoft Web data set in particular, all indexes are higher than those of the contrast algorithms at all four missing rates, which shows the superiority of the MD-SPKM algorithm. The experimental results on the Musk data set are given in Table 7. The MD-SPKM algorithm outperforms the contrast algorithms at missing rates of 5% and 10%. At a missing rate of 15% it is not the best algorithm in Accuracy, but its Accuracy is close to that of the best contrast algorithm.

Table 9 shows the experimental results on the MovieLens data set at different missing rates. When the missing rate is 5%, 15% or 20%, the results of the MD-SPKM algorithm under all four indexes are higher than those of the contrast algorithms. At a missing rate of 10%, the highest Accuracy among the contrast algorithms is 0.5891; the MD-SPKM algorithm is not the best here, but it still reaches 0.5816. Overall, the MD-SPKM algorithm can handle the clustering of incomplete categorical matrix data and also obtains a better clustering effect.

6. Conclusions

In categorical data clustering, the data encountered are not only matrix data but may also contain missing values, that is, they are incomplete matrix data. To cluster such data while considering the uncertain relationship between samples and clusters, a set pair k-modes clustering algorithm for incomplete categorical matrix data is proposed. Firstly, a processing method for the missing values of the data set is given, and the samples are granulated using the theory of set pair information granules, which divides them into positive granules, difference granules and negative granules. A set pair distance measurement method between matrix samples is proposed, in which the distance between samples is extended to a definition with three dimensions: positive degree, difference degree and negative degree. The samples are then divided into clusters to form the preliminary clustering results. Secondly, considering the uncertain relationship between a sample and a cluster, the intra-cluster average distance is defined and used to distinguish the samples that certainly belong to a cluster from the samples that may belong to it. A threshold calculation formula is given to determine whether a sample belongs to multiple clusters, and the preliminary clustering results are subdivided to form set pair clustering results comprising the positive region $C_{s}$, the boundary region $C_{u}$ and the negative region $C_{o}$. Finally, the experimental evaluation on three data sets against four contrast algorithms shows that the proposed MD-SPKM algorithm can effectively handle incomplete matrix data sets and outperforms the contrast algorithms under multiple evaluation indicators.

Acknowledgments

This research is supported by the Natural Science Foundation of Hebei Province (F2018209374 and F2016209344). The authors also gratefully acknowledge the helpful comments and suggestions of their tutor, which have improved the presentation.

References

1. Wang T.X. and Gao J.Y., An improved k-means algorithm based on kurtosis test, Journal of Physics: Conference Series 1267 (2019), 012027.

2. Luchi, Santos, Rodrigues et al., Genetic sampling k-means for clustering large data sets, Lecture Notes in Computer Science 9423 (2015), 691–698.

3. Huang Z.X., Extensions to the k-means algorithm for clustering large data sets with categorical values, Data Mining and Knowledge Discovery 2 (1998), 283–304.

4. Shi Z.Q. and Chen S.P., An improved k-modes clustering algorithm, Operations Research and Management Science 28 (2019), 112–117.

5. Jiang, Liu G.Z. and Du J.W., Initialization of k-modes clustering using outlier detection techniques, Information Sciences 332 (2016), 167–183.

6. Peng L.W. and Liu Y.G., Attribute weights-based clustering centres algorithm for initialising K-modes clustering, Cluster Computing 22 (2019), 6171–6179.

7. Wang and Tang, Improved cluster center initialization method for clustering categorical data, Journal of Computer Applications 38 (2018), 73–76.

8. Cao F.Y., L.Q. and Huang Joshua Z.X. et al., k-mw-modes: An algorithm for clustering categorical matrix-object data, Applied Soft Computing 57 (2017), 605–614.

9. Yao Y.Y., Three-way decisions and cognitive computing, Cognitive Computation 8 (2016), 1–2.

10. Yao Y.Y., Three-way decision and granular computing, International Journal of Approximate Reasoning 103 (2018), 107–123.

11. Three-way cluster analysis, Peak Data Science (2016), 31–35.

12. Zhang C.Y., Wang L.Y. and Li M.X. et al., Model of three-way decision based on the space of set pair information granule and its application, Journal on Communications (2016), 15–24.

13. Huang Y.H., Hao Z.F., Cai R.C. et al., K-modes algorithm based on interdependence redundancy measure, Journal of Chinese Computer Systems 37 (2016), 1790–1793.

14. Zhou H.F., Zhang Y.H. and Liu Y.B., A global-relationship dissimilarity measure for the k-modes clustering algorithm, Computational Intelligence and Neuroscience 2017 (2017), 1–7.

15. Cao F.Y., Huang Joshua Z.X., Liang J.Y. et al., An algorithm for clustering categorical data with set-valued features, IEEE Transactions on Neural Networks and Learning Systems 29 (2018), 4593–4606.

16. Cao F.Y., Huang Joshua Z.X., Liang J.Y. et al., A fuzzy SV-k-modes algorithm for clustering categorical data with set-valued attributes, Applied Mathematics and Computation 295 (2017), 1–15.

17. Li S.Y., Zhang M.M. and Cao F.Y., A MD fuzzy k-modes algorithm for clustering categorical matrix-object data, Journal of Computer Research and Development 56 (2019), 1325–1337.

18. Wang P.X. and Yao Y.Y., CE3: A three-way clustering method based on mathematical morphology, Knowledge-Based Systems 155 (2018), 54–65.

19. Zhang, A three-way c-means algorithm, Applied Soft Computing Journal 82 (2019), 1568–4946.

20. Zhao K.Q., Set pair analysis and its preliminary application, Exploration of Nature 1 (1994), 67–72.

21. Huang D.C., Zhao K.Q. and Lu Y.Z., The fundamental operation of arithmetic on connection number $a+bi+cj$ and its application, Mechanical & Electrical Engineering Magazine 17 (2000), 81–84.

22. Saha, Sarkar J.P. and Maulik, Rough set based fuzzy k-modes for categorical data, Swarm, Evolutionary, and Memetic Computing 7677 (2012), 323–330.
