Abstract
In order to solve the clustering problem with incomplete and categorical matrix data sets, and considering the uncertain relationship between samples and clusters, a set pair k-modes clustering algorithm is proposed (MD-SPKM). Firstly, the correlation theory of set pair information granule is introduced into k-modes clustering. By improving the distance formula of traditional k-modes algorithm, a set pair distance measurement method between incomplete matrix samples is defined. Secondly, considering the uncertain relationship between the sample and the cluster, the definition of the intra-cluster average distance and the threshold calculation formula to determine whether the sample belongs to multiple clusters is given, and then the result of set pair clustering is formed, which includes positive region, boundary region and negative region. Finally, through the selected three data sets and four contrast algorithms for experimental evaluation, the experimental results show that the set pair k-modes clustering algorithm can effectively handle incomplete categorical matrix data sets, and has good clustering performance in Accuracy, Recall, ARI and NMI.
Keywords
Introduction
Clustering is a basic research field and plays an important role in data analysis. In prototype clustering algorithm, k-means clustering algorithm [1, 2] is very effective for processing large data sets, but it is only suitable for numerical data sets and not for processing categorical data sets. Therefore, Huang proposed k-modes algorithm [3] to calculate the distance between two samples by simply matching the categorical samples to solve the clustering problem of categorical data. After that, scholars have made many improvements to the k-modes algorithm in terms of better initialization techniques or more efficient distance measurement [4, 5, 6, 7], but deal with samples described by one feature vector. In real databases, there are matrix sample data described by multiple features vectors [8]. For example, the user’s shopping record, each user is a sample. However, each user will buy more than one item, and the type and quantity of goods purchased are different, so the categorical matrix sample data will be generated. Traditional simple matching methods for calculating the distance between two samples can not be directly used to deal with such data sets. Due to the phenomena of data omission, incomplete, reading restriction and so on, a large number of missing values will exist. For example, the user’s shopping records, due to improper data stored procedures and for the purpose of protecting the privacy of users, there will be some missing records. Therefore, it is necessary to study the clustering problem of incomplete matrix data sets.
The result of traditional clustering algorithm is represented by a set with clear boundary, which can only represent two kinds of relations between samples and clusters, that is, samples belong to the clusters, or samples do not belong to the clusters, but in many applications, there are three kinds of relations between samples and clusters: certainly belong, may belong and does not belong. Because the information collected at present has limitations, the relationship between some samples and clusters can not be accurately judged, then the third relationship is given: may belong to. For this reason, by introducing the idea of three-way decisions [9, 10] into the traditional clustering, Yu propose a framework for three-way clusters [11]. The clustering results are represented by two sets, which are typical sample sets and fringe sample sets, respectively. However, most of the clustering algorithms considering the three relationships between samples and clusters are designed for numerical data, and the research on categorical data is less.
In the study of categorical data clustering, traditional distance measurement methods generally can only use one feature vector to represent matrix data with multiple features vectors descriptions. However, this will lead to the loss of a large amount of raw information, which can not achieve comprehensive consideration of all the data. In addition, the data usually faced with missing values is an incomplete matrix data. In order to solve the above clustering problem with incomplete, categorical matrix data sets and consider the uncertain relationship between samples and clusters, a set pair k-modes clustering algorithm (MD-SPKM) for incomplete categorical matrix data is proposed. (1) In order to solve the problem of incomplete and uncertain matrix data sets, based on the correlation theory of set pair information granules [12], a set pair distance between incomplete matrix samples is proposed. (2) Considering the uncertain relationship between samples and clusters, the definition of the intra-cluster average distance is given to distinguish the samples that certainly belong to clusters and the samples that may belong to clusters. In addition, because the sample may be related to multiple clusters, the threshold calculation formula for determining whether the sample belongs to multiple clusters is given. (3) A set pair k-modes clustering algorithm for incomplete categorical matrix data is proposed, and a set pair clustering result composed of positive region, boundary region and negative region is formed. The sample in the positive region belongs to the cluster, the sample in the boundary region may belong to the cluster, and the sample in the negative region does not belong to the cluster. (4) Experimental comparison and analysis with the existing four representative algorithms, the results show that the proposed algorithm can effectively deal with incomplete categorical matrix data sets.
Related works
In clustering algorithm, distance measurement between samples is a key step. For categorical data, scholars have proposed a variety of distance measurement methods. Such as, Huang et al. [13] propose a distance measurement method based on interdependence redundancy, not only consider the difference of attribute values between the two samples, but also consider the interaction between attributes. Zhou et al. [14] propose a global relationship difference measurement method for k-modes clustering algorithm, which not only considers the relationship between samples and all clustering modes, but also considers the differences of various attributes, rather than simple matching. However, these distance measurement methods are aimed at samples represented by only one feature vector, which can obtain better clustering results. But, when measuring the distance of matrix samples with multiple sets of attribute features, only one set of features can be selected for distance calculation, and all data can not be considered. Therefore, Cao et al. proposed Set-Valued k-modes algorithm [15] and fuzzy SV-k-modes algorithm [16] for processing set-valued samples, defined the distance function between two samples with set-valued features, and proposed set-valued pattern representation of clustering centers. Li et al. proposed a MD fuzzy k-modes algorithm for categorical matrix data [17] and presented a heuristic updating algorithm for clustering centers. Although the above algorithm solves the clustering problem of matrix data, its algorithm is mainly aimed at complete data sets.
The missing value in the sample can not be properly treated, because the missing attribute value itself is uncertain, whether it is deleted or filled, it will cause certain errors. The set pair information granule theory is proposed by combining the set pair correlation theory on the existing granule calculation methods in order to consider the certain and uncertain of the research object in the granule calculation. Therefore, this paper introduces the knowledge of set pair information granule into clustering analysis to study the clustering problem of incomplete matrix data sets.
Due to the existence of three relationships between samples and clusters, scholars have proposed a variety of three-way clustering algorithms in order to better characterize this relationship. Such as Wang et al. [18] proposed three-way clustering frameworks based on the idea of contraction and expansion in mathematical morphology. Zhang [19] extend the rough k-means (RKM) according to the three-way weight and the three-way assignment method, and get a three-way c-means clustering algorithm and so on. However, the research object of the above algorithm is numerical data, and there are few studies on categorical data. In addition, there is no research work on the uncertain relationship between samples and clusters for incomplete categorical matrix data. Based on the set pair information granule theory, this paper proposes a set pair k-modes clustering algorithm for incomplete categorical matrix data, which can represent the relationship between samples and clusters.
Basic theory
Incomplete information system
Information system
Where,
Set pair information granule
Set pair analysis [20] is a new theory to study the system of certainty and uncertainty. By analyzing the characteristics of two things and getting the expression of positives, differences and negatives contact degree is as follows:
Where,
Where,
The information table is shown in Table 1. The problem space
A table of information
A table of information
If only the height of each object is currently known, there is
Under the
K-modes algorithm [3] is an extension of the k-means algorithm, and the extended k-modes has the following characteristics: (1) Using simple matching distance measures. (2) Using modes instead of means to update clustering centers. (3) Using frequency correlation strategies to update modes in clustering. These extensions exclude numerical limitations in k-means algorithms so that clustering can be applied to categorical data sets.
Let
Given any two sample
The objective functions of k-modes clustering algorithm are:
Where,
Set pair k-modes clustering algorithm
Measurement of set pair distance between matrix samples
Traditional k-modes distance measurement method can only be applied to samples with only one set of attribute features description, but there are some limitations for matrix categorical data with multiple sets of attribute features description. In order to solve the problem of distance measurement of incomplete and matrix samples, based on the theory of set pair information granules, a distance formula based on set pair connection degree is proposed, which is to extend the distance between different samples in the original clustering process to include three dimensions: positive degree, difference degree and negative degree.
Where, the
Equation (1) can also be reduced to:
Where,
A matrix sample data set is shown in Table 2,
The set pair connection degree between sample
The set pair connection degree between sample
The set pair connection degree between sample
The comprehensive set pair connection degree of sample
Matrix sample data sets
s Generally speaking, the larger the value the potential function
Where,
Three kinds of information granule sets are defined in set pair granule space: positive granule set, difference granule set and negative granule set. Positive granule set and negative granule set belong to certain information granule, difference granule set belong to uncertain information granule, and their three kinds of information granule sets correspond exactly to the three relations existing in the cluster, so the correlation theory of set pair information granule is introduced into k-modes cluster. Based on the concept of three kinds of information granule sets, the set pair clustering results represented by positive region
Properties (i) indicate that the positive regions of each cluster are not empty sets, with at least one sample in the positive regions. Properties (ii) indicate that each sample must be partitioned, and properties (iii) indicate that the positive regions of any two clusters intersect as empty sets.
Aiming at the characteristics of incomplete and categorical matrix samples, a set pair distance measurement method based on set pair information granule is presented. Applying the set pair distance measurement method to k-modes clustering, a set pair k-modes clustering algorithm for incomplete categorical matrix data is proposed. In traditional k-modes algorithms, a cluster is represented by a set with clear boundaries, which only considers two relationships between samples and clusters: belonging and not belonging. However, assigning uncertain samples to a cluster reduces the accuracy of clustering results. Therefore, this paper improves on k-modes clustering algorithm. the proposed set pair k-modes clustering algorithm takes into account the uncertain relationship between samples and clusters. Each cluster is represented by positive region
Where,
Where,
MD-SPKM algorithm of this paper divides the clustering process into the following parts:
Suppose
Schematic diagram of set pair k-modes clustering.
The representation of cluster centers is crucial in the clustering process. For this reason, the update problem of cluster centers will be discussed next. Given a set
Set pair clustering results generation
Next, the preliminary clustering results are subdivided, the main task is to construct the positive region and boundary region. The average distance
Where,
The implementation flow of the algorithm is as follows:
Algorithm complexity analysis
Since in most practical application scenarios, the data has a large number of attributes, which will have a certain impact on the efficiency of the algorithm, so let the number of samples in the region is
Experimental evaluation
A series of experiments are carried out in this section to evaluate the clustering performance of the proposed algorithm MD-SPKM. The experimental comparison is carried out by selecting four representative algorithms. The first is the classical k-modes algorithm proposed by Huang [3]. The second is the k-mw-modes algorithm based on matrix samples proposed by Cao et al. [8]. The third is the MD fuzzy k-modes algorithm proposed by Li et al. [17]. The fourth is the fuzzy k-modes algorithm based on rough set proposed by Saha et al. [22], which is denoted as RFKMd algorithm.
The algorithm proposed in this paper can solve the problem of incomplete matrix data clustering and complete matrix data clustering. In order to prove the validity of MD-SPKM algorithm, two aspects will be evaluated experimentally. First, select the complete matrix data set, compare and analyze the proposed algorithm and the selected contrast algorithm. Second, select the incomplete matrix data set, under the condition of different missing rate, compare and analyze the proposed algorithm MD-SPKM with the four contrast algorithms.
This algorithm and contrast algorithm are implemented on a DELL computer (Windows 10, Intel (R) Core (TM) i5-8300H, CPU @2.30GHz 2.30GHz). The programming platform is Python3.7.
Data sets
Since the matrix sample data with label information is relatively rare, it is necessary to preprocess the given data. In this paper, the Multidimensional Scaling method is used to process the data. The goal of this method is to obtain the distribution of
In this experiment, three data sets were selected to evaluate the algorithm, Musk, Microsoft Web and MovieLens, respectively. Musk downloaded from the UCI database containing 476 record values for 92 samples. Each sample has 167 attributes. The Microsoft Web dataset was also downloaded from the UCI database, containing 165257 record values of 32711 users for 294 websites. Each record has two properties, user ID and web ID. MovieLens data set was created in 2018 and downloaded from the MovieLens website. This paper uses the ratings data, which contains 1000209 rating scores for 3900 movies by 6040 users. Each record value has four attributes, user ID, movie ID, rating and timestamp.
Visualization of Microsoft Web data sets after preprocessing.
Visualization of MovieLens data sets after preprocessing.
Musk the data characteristics of the data set are good, there is no noise data and missing value, which can be clustered directly, so it is not preprocessed. For the Microsoft Web data set, it is necessary to delete samples with fewer than 7 record value bars first, then visualize the data set by Multidimensional Scaling method, and select the samples in the
The evaluation of clustering, also called clustering validity, is the key process to evaluate the performance of clustering algorithm. The following evaluation indicators are often used to evaluate the performance of clustering algorithms. Let the clustering result of MD-SPKM algorithm is
Accuracy
Where, the Description of matrix data sets
Recall
Where, TP indicates that the actual is positive and the prediction is also positive. FN indicates that the actual is positive and the prediction is negative.
Adjusted Rand Index (ARI)
Normalized mutual information (NMI)
NMI values range from 0 to 1 and NMI are equal to 1 only if the detected cluster and the real cluster are completely consistent.
Where,
MD-SPKM algorithm performance analysis
To better show the clustering effect of the MD-SPKM algorithm, taking the Microsoft Web data set as an example, the change of evaluation index during iteration is given. Since each cluster is composed of positive regions and boundary regions, the results of four indexes are calculated only for the positive region and the combined of the positive region and boundary region, respectively. The specific results are shown in Fig. 4.
MD-SPKM algorithm can take into account the uncertain relationship between the sample and the cluster, assign the samples that certainly belong to the cluster into the positive region, and assign the samples that may belong to the cluster into the boundary region. Figure 4a gives the clustering results that only consider positive regions, and Fig. 4b gives the clustering results that consider the combination of positive regions and boundary regions. From these two graphs, it can be seen that the results of the four indicators are constantly rising as the number of iterations increases. However, under all indicators, the results of the combination of positive regions and boundary regions are higher than those considering only positive regions, because each cluster is divided into two parts: positive region and boundary region. The two sets are obtained by contraction or expansion based on two-way clustering. The optimal clustering results obtained in Fig. 4b are Accuracy 0.9297, Recall 0.9552, ARI 0.7384 and NMI 0.6405. While only positive region is considered in Fig. 4a, the Accuracy still reaches 0.8981, the ARI reaches 0.7, and the NMI reaches 0.5856. Only the Recall rate decreases slightly to 0.5994, but the overall clustering results are still good.
Comparative analysis of experiments under complete data set
On the complete matrix data set, four algorithms are selected to compare with the proposed algorithm MD-SPKM, which are k-modes [3], k-mw-modes [8], MD fuzzy k-modes [17] and RFKMd [22], respectively. The experimental contrast results obtained through four evaluation indicators are shown in Tables 4–6, and the best performance of each dataset has been highlighted in bold.
Comparison of five algorithms on Musk data sets
Comparison of five algorithms on Musk data sets
Iterative clustering of Microsoft Web data sets.
Comparison of five algorithms on Microsoft Web data sets
Comparison of five algorithms on MovieLens data sets
Clustering results of Musk data sets at four missing rates
As can be seen from Tables 4–6, under the Musk, Microsoft Web and MovieLens data sets, the values of the proposed algorithm MD-SPKM under the evaluation indexes are higher than those of the contrast algorithm. Compared with other algorithms, the results of Musk and Microsoft Web data sets under four evaluation indexes were significantly improved, among which the recall rate was the most significant. The optimal recall rate of Musk data set in contrast algorithm is 0.5431, while the optimal recall rate is 0.7446 under MD-SPKM algorithm. The optimal recall rate of the Microsoft Web dataset in the contrast algorithm is 0.7696, while the optimal rate is 0.9552 under MD-SPKM algorithm in this paper. Although the improvement of MovieLens data set under four indicators is not obvious, it also implemented slightly higher than four contrast algorithm.
In order to simulate the data set containing missing values, the attribute values of some samples were deleted randomly. In the experiment, the missing rate of 5%,10%,15%,20% was selected to generate the incomplete matrix data set randomly. In order to eliminate the influence of missing data structure or distribution, at each missing rate, multiple different data sets were generated, and then the average value of the experimental results obtained from multiple runs was taken. The Musk, Microsoft Web and MovieLens data sets were experimentally evaluated at different missing rates through four evaluation indicators. The results are shown in Tables 7–9. The best performance of each data set at different missing rates has been highlighted in bold.
Clustering results of Microsoft Web data sets at four missing rates
Clustering results of Microsoft Web data sets at four missing rates
Clustering results of MovieLens data sets at four missing rates
From Tables 7–9, it can be seen that in most cases, MD-SPKM algorithm has better clustering performance than k-modes, k-mw-modes, MD fuzzy k-modes and RFKMd algorithms, especially in the Microsoft Web dataset, all the indexes are higher than the contrast algorithm under the four missing rates, which also shows the superiority of the MD-SPKM algorithm. The experimental results of the Musk dataset are given in Table 7. MD-SPKM algorithm outperforms the contrast algorithm under the condition that the missing rate is 5% and 10%. At the missing rate of 15%, although it is not the best algorithm, the Accuracy is also close to the best contrast algorithm.
Table 9 shows the experimental results of MovieLens data set at different missing rates. As can see that when the missing rate is 5%,15% and 20%, the results of the MD-SPKM algorithm under four indexes are higher than that of the contrast algorithm. In addition, under the condition that the missing rate is 10%, the highest Accuracy of the contrast algorithm reaches 0.5891, although MD-SPKM algorithm is not optimal, it also reaches 0.5816. All in all, the MD-SPKM algorithm not only can deal with incomplete categorical matrix data clustering problem, but also get better clustering effect.
For the study of categorical data clustering, the data facing is not only matrix data, but also with missing values and it is an incomplete matrix data. In order to cluster such data and consider the uncertain relationship between samples and clusters, a set pair k-modes clustering algorithm for incomplete categorical matrix data is proposed. Firstly, the missing value of the data set is given the processing method, and the sample is granulated by using the relevant theory of set pair information granules. The sample is divided into positive granules, difference granules, negative granules. A set pair distance measurement method between matrix samples is proposed, and the distance between different samples is extended to include the definition of distance with three dimensions of positive degree, difference degree and negative degree. Thus, the samples are divided into various clusters to form preliminary clustering results. Secondly, considering the uncertain relationship between the sample and the cluster, the definition of the intra-cluster average distance is given, which is used to distinguish the samples that certainly belong to the cluster from the samples that may belong to the cluster. And the threshold calculation formula is given to determine whether the sample belongs to multiple clusters, and the preliminary clustering results are subdivided to form set pair clustering results including positive region
Footnotes
Acknowledgments
This research is supported by The Natural Science Foundation of Hebei Province (F2018209374) and The Natural Science Foundation of Hebei Province (F2016209344).The authors also gratefully acknowledge the helpful comments and suggestions of tutor, which have improved the presentation.
