A Hybrid KNN algorithm with Sugeno measure for the personal credit reference system in China

Abstract

An increasing number of ordinal variables are being collected by the Personal Credit Reference System in China. However, the system struggles to analyze this kind of data, because differences between ordinal values are not meaningfully measured by Euclidean distance. In this study, we put forward a hybrid KNN algorithm based on the Sugeno measure and show that its error is smaller than that of Euclidean distance. Furthermore, we use real data obtained from the Personal Credit Reference System to perform experiments and obtain an initial user portrait. Through comparisons with the K-means algorithm and with other distance measures in the KNN algorithm, we find that the hybrid KNN algorithm is more suitable for clustering personal credit data.

Keywords

Hybrid KNN clustering, personal credit reference system, Sugeno measure, user’s portrait

1 Introduction

In 2013, China promulgated regulations for the management of the credit industry, which expedited the introduction of the Credit System. In 2014, China issued “the outline of credit system construction plan” (2014–2020), which proposed to build the society’s entire credit system on the basis of information resource sharing. In 2015, eight institutions were permitted to conduct personal credit investigation services. By the end of June of 2017, there were 92.6 million personal credit reports in the Credit System. Between January and May of 2017, the average daily number of personal credit report queries was 3.43 million.

Basic consumer data, mortgage data, and credit data are the main foundation of the Credit Reference, and the development of big data provides the possibility of exploring deeper data applications. With increasing demands for credit information, user classification and user portraits are keys to the practical application of credit information. Therefore, it can be said that user classification is the first step in the credit information process, which restricts the application value of the information as a whole.

Research on user personas in the Personal Credit Reference System is limited but increasing, and a clustering algorithm is often used to classify users. Clustering algorithms are based on the measure of distance, and can be divided into four categories: hierarchical clustering, segmented clustering, mixed clustering, and grid clustering. For large datasets, KNN hierarchical clustering is shown to be the most widely used algorithm. Current research focuses on selection of the initial point and algorithm performance enhancements, but it is clear that the KNN algorithm always deals with numeric attributes. For frequency attributes, this algorithm is shown to be invalid, which greatly limits its application.

Meanwhile, there are many frequency attributes in the Credit Reference, such as the number of loans, the number of credit cards, and the number of mortgages. These variables express differences in counts and cannot be effectively measured as continuous values, making this a problem for a Hybrid Intelligent System [1]. Similarly, combining demographic similarity and cosine similarity between users is a real problem for Intelligent Systems [2]. Thus, according to fusion theory, KNN clustering based on the Euclidean distance measure does not provide an effective solution for such customer classification problems. The traditional KNN is based on Euclidean distance, which is calculated from variance and measures the numeric difference between data values, whereas the actual frequency difference cannot be effectively described by variance. For example, for a variable named “the number of queries”, we can calculate the Euclidean distance between samples from its mean and variance, but the result is meaningless: the records show that this variable takes only non-negative integer values, and the intuitive difference between five queries and zero queries is 5, not the Euclidean distance.

To address this research gap, we explored the Sugeno measure under two sets of sequences, and then constructed a KNN cluster based on the Sugeno distance. We defined the set upper limit as the threshold for cluster combination. Theoretical analysis shows that this new algorithm has a smaller distance error when dealing with ordered variables. Compared with clustering based on Euclidean distance, Cityblock distance, and Correlation distance, this hybrid algorithm is found to be more accurate for exploring user personas.

This article is divided into six parts. This introductory part has explained the problem. The next part summarizes the research related to this topic. The third part focuses on the calculation of distance under the Sugeno measure and designs a dynamic clustering algorithm. The fourth part carries out experiments on the Personal Credit Reference System using the Sugeno distance and the KNN dynamic algorithm. In the fifth part, the clustering results of this hybrid algorithm are compared with K-means, and preliminary robustness checks with different distances are summarized. The last part discusses the findings of this study and provides a conclusion.

2 Literature review

Current research on the Personal Credit Reference System is relatively scarce. Many previous research efforts focused on the description of consumer credit behavior, including studies by Mikhed and Vogan [3]. According to the findings of Mikhed and Vogan, a clustering algorithm is often used to analyze consumer characteristics.

Clustering algorithms can be divided into four categories: hierarchical clustering, segmented clustering, mixed clustering, and grid clustering, as discussed by Bonchi, Garcia-Soriano, and Liberty [4], who found KNN hierarchical clustering to be the most widely used algorithm. Current research focuses on the selection of initial settings [5–10] and algorithm performance enhancements [11–15]. Scholars usually consider the KNN algorithm to be less able to deal with sample sets that have large density differences. Bhattacharya, Ghosh, and Chowdhury [16] proposed a similarity function, used this affinity function to classify the test patterns, and then used KNN to conduct clustering. They gave a basic idea for improvement, which can be adopted by using different distances and separating the clustering process. Deng, et al. [17] argued that KNN is a lazy learning algorithm, so they proposed first conducting a KNN clustering to separate the whole dataset into several parts and then letting KNN learn on these separate parts; their experiments show better performance on big data sets. Later, Chen, et al. [18] found that the KNN algorithm has the ability to deal with the density difference problem. Gou, et al. [19] proposed a local mean representation-based k-nearest neighbor classifier, which they consider better for small training samples and more robust to different densities, but they only conducted experiments with UCI data. Zhang, et al. [20] used a correlation matrix to reconstruct the KNN algorithm, using training data to assign different k values to different test data points. They also found that the KNN algorithm can only deal with numeric attributes, and that for frequency attributes the algorithm is ineffective, greatly limiting its application. Jo and Hwang [21] found that KNN can classify text through string vectors, which perform well in text summarization.

Sugeno [22] proposed a special fuzzy interval set, named the Sugeno set, in his doctoral thesis. Klement, Mesiar, and Pap [23] designed an integral operation on monotone sequences according to this set, but the Sugeno integral they proposed can only deal with the horizontal order set. Agahi [24] discussed the operation of the partial inclusion sequence in his research. Smrek [25] proposed a general construction method for aggregation operators with the integrals of Choquet, Sugeno, and Shilkret. Halaš, Mesiar, and Pócs [26] introduced a characteristic of the Sugeno integral, namely congruences, and stressed that it can be used in multi-criteria decision making. Halaš, Mesiar, and Pócs [27] studied compatible aggregation functions on a general bounded distributive lattice of the Sugeno integral. Dubois, et al. [28] generalized the concept of the Sugeno integral, which can be split into two functions, one based on a general multiple-valued conjunction and one based on a general multiple-valued implication. Brabant and Couceiro [29] discussed the use of this parametrized notion in preference aggregation and learning. Shubair, Ramadass, and Altyeb [30] proposed kENFIS, which partitions the input space into clusters using a newly designed KNN-based evolving fuzzy clustering method and organizes the rule base using the Takagi-Sugeno method; they deployed this method for worm detection. Wang, et al. [31] argued that the fuzzy fractal dimension can reflect the dynamic structure and topological structure of a complex network and can be used by a first-order Sugeno fuzzy approximator to reduce dimension. An increasing number of research efforts have begun to explore the usage of the Sugeno integral, including works such as [32–36].

Generally speaking, clustering with categorical variables can be challenging, and most clustering methods have difficulties detecting clusters of hugely different densities in a dataset. According to the research of Agahi [24], Sugeno integrals are aggregation operations involving a criterion weighting scheme based on the use of set functions. However, the categorical variables cannot always be clearly bounded, as found by Ezequiel López-Rubio [1]. Therefore, how to control the combination of clusters is the key problem in the application of Sugeno integrals.

3 Method

Because the choice of distance and the selection of the initial point are the key factors affecting a clustering algorithm, we use the Sugeno measure to explore the clustering. This method focuses on monotonic sequences and gives the distance within fuzzy sets.

The Sugeno measure uses sequences to calculate the distances between different element sequences. Thus, in our research we used distance as the basis for clustering. If a set sequence {A_n} meets A₁ ⊆ A₂ ⊆ … ⊆ A_n, it is called a monotonic sequence, and we denote it as A_n↑. This definition can be found in Sugeno [22].

3.1 Method

Definition 1. If λ ∈ (-1, +∞) and h_λ : 𝒜 → [0, 1], defined on a family 𝒜 of subsets of X, meets the following conditions:

h_λ (X) = 1;

h_λ (A ∪ B) = h_λ (A) + h_λ (B) + λh_λ (A) h_λ (B) for disjoint A and B (A ∩ B = ∅);

$A_{n} \uparrow \Rightarrow \lim_{n \to \infty} h_{\lambda} (A_{n}) = h_{\lambda} (\lim_{n \to \infty} A_{n})$

then h_λ is defined as the Sugeno set.
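As a small illustration of the second condition in Definition 1, the following sketch applies the λ-rule to two disjoint sets. The function name `sugeno_union` and the numeric values are hypothetical and serve only as an example; they are not taken from the paper's data.

```python
def sugeno_union(h_a: float, h_b: float, lam: float) -> float:
    """Measure of A ∪ B for disjoint A and B under the lambda-rule of Definition 1."""
    if lam <= -1:
        raise ValueError("lambda must lie in (-1, +inf)")
    return h_a + h_b + lam * h_a * h_b


# Illustrative values only.
additive = sugeno_union(0.3, 0.4, 0.0)       # lambda = 0 reduces to ordinary addition
superadditive = sugeno_union(0.3, 0.4, 2.0)  # lambda > 0 adds an interaction term
```

With λ = 0 the rule is plain additivity; a positive λ makes the union weigh more than the sum of its parts, and a negative λ makes it weigh less.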

Definition 2. If $(X, \mathcal{A}, m)$ is defined as the fuzzy space, $A \in \mathcal{A}$, and h : X → [0, 1], then $\int_{A} h \circ m = \sup_{\lambda \in [0, 1]} \min \{\lambda, m (A \cap h_{\lambda})\}$, which is the integral of this set.

When the values are ordered so that h (x_i) ⩽ h (x_{i+1}), the Sugeno integral can be represented as $\int_{A} h \circ m = \max_{i \in \{1, 2, \ldots, n\}} \min \{h (x_{i}), m (A \cap X_{i})\}$, which reflects the measurement of ordered sets. Therefore, the distance between two levels of dependent ordered sets satisfies Theorem 1, which can be found in the work of Smrek [25].
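The discrete form of the Sugeno integral can be sketched in a few lines. This is a minimal illustration under stated assumptions: A is taken to be the whole set X, X_i is taken to be {x_i, …, x_n} after sorting by h (the usual convention, not spelled out in the text), and the measure `uniform` (m(S) = |S|/n) is a hypothetical stand-in for whatever fuzzy measure is actually used.

```python
def sugeno_integral(values, measure):
    """Discrete Sugeno integral: max over i of min(h(x_i), m(X_i)).

    values  : dict mapping each element to h(x) in [0, 1]
    measure : function mapping a frozenset of elements to m(.) in [0, 1]
    """
    ordered = sorted(values, key=values.get)  # h(x_1) <= ... <= h(x_n)
    best = 0.0
    for i, x in enumerate(ordered):
        upper_set = frozenset(ordered[i:])    # assumed X_i = {x_i, ..., x_n}
        best = max(best, min(values[x], measure(upper_set)))
    return best


h = {"a": 0.2, "b": 0.6, "c": 0.9}            # illustrative values
uniform = lambda s: len(s) / len(h)           # hypothetical measure m(S) = |S| / n
result = sugeno_integral(h, uniform)
```

For these values the candidates are min(0.2, 1), min(0.6, 2/3), and min(0.9, 1/3), so the integral equals 0.6.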

Theorem 1. Let X = {x₁, x₂, …, x_n} be an ordered set in which every element x_i has m characteristic indexes, and let x_ij denote the j-th characteristic index of the i-th element. In the Sugeno measure, the distance between two ordered sets can be defined by the formula $r_{ij} = \begin{cases} 1, & i = j \\ \frac{1}{M} \sum_{k = 1}^{m} x_{ik} \cdot x_{jk}, & i \neq j \end{cases}$ , with $M = \max_{i \neq j} (\sum_{k = 1}^{m} x_{ik} \cdot x_{jk})$ . This means that under the Sugeno measure, the distance between the two sets can be measured by the feature elements.
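The formula of Theorem 1 can be computed directly from a table of characteristic indexes. The sketch below is one straightforward reading of the formula; the function name `r_matrix` and the three sample rows are hypothetical.

```python
def r_matrix(rows):
    """r_ij = 1 if i == j, else (sum_k x_ik * x_jk) / M,
    where M is the largest off-diagonal sum of products."""
    n = len(rows)
    dots = [[sum(a * b for a, b in zip(rows[i], rows[j])) for j in range(n)]
            for i in range(n)]
    m_max = max(dots[i][j] for i in range(n) for j in range(n) if i != j)
    if m_max == 0:
        raise ValueError("all off-diagonal products are zero; M is undefined")
    return [[1.0 if i == j else dots[i][j] / m_max for j in range(n)]
            for i in range(n)]


x = [[1, 2, 0], [2, 1, 1], [0, 1, 3]]   # illustrative rows, m = 3 indexes each
r = r_matrix(x)
```

Here the off-diagonal sums of products are 4, 2, and 4, so M = 4 and the entry for rows 1 and 3 is 0.5.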

So, the clustering algorithm can be designed as follows:

Step 1: Data standardization

For X = {x₁, x₂, …, x_n}, we calculate the mean and variance of each index, and then transform each element to $x'_{ij} = \frac{x_{ij} - \bar{x}_{j}}{\sigma_{j}}$ , with $\bar{x}_{j} = \frac{1}{n} \sum_{i = 1}^{n} x_{ij}$ , $\sigma_{j}^{2} = \frac{1}{n} \sum_{i = 1}^{n} (x_{ij} - \bar{x}_{j})^{2}$ , i = 1, 2, …, n, j = 1, 2, …, m, thus obtaining the ordered standard set {x_i}. This method can be found in the research of Han, Han, and Zhao [37].
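Step 1 can be sketched as follows, using the population variance (division by n) exactly as in the formula above. The function name `standardize`, the sample data, and the handling of constant columns (mapped to 0) are assumptions made for illustration.

```python
import math


def standardize(rows):
    """Column-wise standardization: (x_ij - mean_j) / sigma_j, with sigma_j^2 = (1/n) * sum of squares."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    sigmas = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
              for j in range(m)]
    # Assumption: a constant column (sigma = 0) is mapped to 0 instead of dividing by zero.
    return [[(r[j] - means[j]) / sigmas[j] if sigmas[j] > 0 else 0.0
             for j in range(m)] for r in rows]


data = [[0, 1], [2, 3], [4, 5]]   # illustrative values
z = standardize(data)
```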

Step 2: Extract the characteristic elements

The method for extracting the characteristic elements is orthogonalization, which is described as follows: z₁ = x₁, $z_{2} = x_{2} - \frac{x_{2}^{'} z_{1}}{z_{1}^{'} z_{1}} z_{1}$ , ... , $z_{s} = x_{s} - \sum_{i = 1}^{s - 1} \frac{x_{s}^{'} z_{i}}{z_{i}^{'} z_{i}} z_{i}$ . With these elements, we obtain the initial fuzzy based set as C_d and the clustering cores are the most frequent items which are defined as C_i = {z_ij} , i = 1, ⋯ , N, j = 1, ⋯ , s.
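The orthogonalization in Step 2 is the classical Gram-Schmidt recursion. A minimal sketch is given below; dropping vectors whose residual is numerically zero is an assumption about how redundant variables are filtered out, and the name `gram_schmidt` and the sample vectors are hypothetical.

```python
def gram_schmidt(vectors, tol=1e-12):
    """z_s = x_s - sum_i (x_s . z_i) / (z_i . z_i) * z_i for the previously kept z_i."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    basis = []
    for x in vectors:
        z = list(x)
        for b in basis:
            coef = dot(x, b) / dot(b, b)
            z = [zi - coef * bi for zi, bi in zip(z, b)]
        # Assumption: a vector that is a linear combination of earlier ones is skipped.
        if dot(z, z) > tol:
            basis.append(z)
    return basis


cols = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [2.0, 1.0, 1.0]]  # third = first + second
z = gram_schmidt(cols)
```

In this example the third vector carries no new information, so only two orthogonal characteristic elements remain.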

Step 3: Construct the distance matrix

Using $r_{ij} = \begin{cases} 1, & i = j \\ \frac{1}{M} \sum_{k = 1}^{m} z_{ik} \cdot z_{jk}, & i \neq j \end{cases}$ to define the distance between the two sets, and with the elements in C_d, we can calculate r as the distance between two cores.

Step 4: Clustering

Mark the farthest records between two clusters as z_d = {z_{d,1}, …, z_{d,s}}, z_{n-d} = {z_{n-d,1}, …, z_{n-d,s}}, and so on.

Calculate the distances between the farthest records; when r_ij is within the maximum range of the threshold λ, we combine the classes as C_d′ = C_d ∪ {z_{n-d}}.

This process is repeated until all elements have been clustered.

Step 5: Cluster termination

Termination follows when two conditions are satisfied: one is that all of the elements have been clustered, and the other is that all of the distances are greater than the threshold λ. Then, the clustering results can be controlled by the threshold λ.
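One possible reading of Steps 4 and 5 is a greedy merge that compares the farthest records of two clusters with the threshold λ. The sketch below follows that reading on a precomputed distance matrix; it is an illustration under this assumption, not the exact procedure used in the experiments, and the name `merge_by_threshold` and the toy matrix are hypothetical.

```python
def merge_by_threshold(dist, lam):
    """Repeatedly merge the two clusters whose farthest pair of records is closest,
    and stop once every such farthest-pair distance exceeds lam."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                far = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or far < best[0]:
                    best = (far, a, b)
        if best[0] > lam:   # all remaining distances are greater than the threshold
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


d = [
    [0.0, 0.5, 3.0, 3.2],
    [0.5, 0.0, 2.9, 3.1],
    [3.0, 2.9, 0.0, 0.4],
    [3.2, 3.1, 0.4, 0.0],
]
groups = merge_by_threshold(d, lam=2.0)
```

With λ = 2.0 the four toy records end in two clusters; raising λ above 3.2 would merge everything into one, which is how the threshold controls the clustering result.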

Algorithm:

Input: data set.

Output: cluster label of each sample.

Steps:

1. Read the data set.

Data = Read(); C(i) = Data.col(); k(i) = mean(C(i)); s(i) = std(C(i));

x(i) = (x(i)-k(i))/s(i);

2. Get the initial core of cluster.

for (i = 1; z(i) != null; i++)

z(i) = x(i) - (x(i)*z(i-1)) / (z(i-1)*z(i-1)) * z(i-1);

end

3. Calculate the distance

for (i = 1, i++,j = i+1)

m(ci,cj) = max(z(i)-z(j));

C(ij) = (z(i)-z(j))/m(ci,cj);

end

for (k <= C(i))

Sc = sum(C(ij));

end

4. Clustering

def knn(x_test, x_train, y_train, k):
    # get_distance is assumed to return the distance C(ij) of step 3 between two records
    distances = []
    y_kind = {}  # save the label counts
    for i in x_train:
        distances.append(get_distance(x_test, i))
    tmp = list(enumerate(distances))  # pair each distance with its training index
    tmp.sort(key=lambda x: x[1])
    min_k_dis = tmp[:k]  # get the nearest k classes
    for j in min_k_dis:
        t_key = y_train[j[0]]
        if t_key in y_kind:
            y_kind[t_key] += 1
        else:
            y_kind[t_key] = 1
    # labels sorted by number of votes, most frequent first
    t = sorted(y_kind.items(), key=lambda x: x[1], reverse=True)
    return t

3.2 Theoretical analysis

For a sample, if the information contained in multiple indexes is described by fuzzy numbers, the sample can be regarded as a fuzzy sample. For M fuzzy samples under N indexes, each fuzzy sample in the fuzzy sample set $\tilde{S} = \{\tilde{s}_{i}\}_{i = 1}^{M}$ can be treated as a multi-fuzzy-vector in which each element is a fuzzy number; accordingly, the i-th fuzzy sample can be expressed as $\tilde{s}_{i} = [FN_{i}^{(j)}]_{j = 1}^{N}$ .

The Euclidean distance between the fuzzy numbers of two fuzzy samples ( $\tilde{s}_{r}$ and $\tilde{s}_{t}$ ) under the j-th index, denoted as $A_{r}^{(j)}$ and $A_{t}^{(j)}$ , can be calculated as: $d_{rt}^{(j)} = \left[\int_{a^{(j)}}^{b^{(j)}} | A_{r}^{(j)} (x) - A_{t}^{(j)} (x) |^{2} dx\right]^{1 / 2}$ (1) where the domain of discourse is $X = \mathbb{R}$ , and $[a^{(j)}, b^{(j)}]$ denotes the common integral field of the membership functions of the fuzzy numbers $A_{r}^{(j)}$ and $A_{t}^{(j)}$ of the fuzzy samples $\tilde{s}_{r}$ and $\tilde{s}_{t}$ under the j-th index ( $[a^{(j)}, b^{(j)}] = [a_{r}^{(j)}, b_{r}^{(j)}] \cup [a_{t}^{(j)}, b_{t}^{(j)}]$ ). $[a_{r}^{(j)}, b_{r}^{(j)}] \subset \mathbb{R}$ represents the integral field on which the membership function of the fuzzy number $A_{r}^{(j)}$ , denoted as $A_{r}^{(j)} (x)$ , can be expressed, while $[a_{t}^{(j)}, b_{t}^{(j)}] \subset \mathbb{R}$ represents the integral field on which the membership function of the fuzzy number $A_{t}^{(j)}$ , denoted as $A_{t}^{(j)} (x)$ , can be expressed. In other words, the membership functions of the two fuzzy numbers $A_{r}^{(j)}$ and $A_{t}^{(j)}$ in $\mathbb{R}$ can only be expressed in $[a^{(j)}, b^{(j)}]$ .

Intuitively, to calculate the distance $d_{rt}$ between two fuzzy samples $\tilde{s}_{r}$ and $\tilde{s}_{t}$ , the distance between the M fuzzy samples should be calculated separately under each of the N indexes and then aggregated by a certain operator. Aggregating Equation (1) with the operator ∑ gives the following expression: $d_{rt} = \sum_{j = 1}^{N} d_{rt}^{(j)} = \sum_{j = 1}^{N} \left[\int_{a^{(j)}}^{b^{(j)}} | A_{r}^{(j)} (x) - A_{t}^{(j)} (x) |^{2} dx\right]^{1 / 2}$ (2)
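Equations (1) and (2) can be evaluated numerically once membership functions are fixed. The sketch below assumes triangular fuzzy numbers and a midpoint-rule integral purely for illustration; the names `triangular`, `index_distance`, and `sample_distance` are hypothetical.

```python
import math


def triangular(left, peak, right):
    """Membership function of a triangular fuzzy number (an illustrative choice)."""
    def mu(x):
        if left < x <= peak:
            return (x - left) / (peak - left)
        if peak < x < right:
            return (right - x) / (right - peak)
        return 0.0
    return mu


def index_distance(mu_r, mu_t, a, b, steps=10_000):
    """Equation (1): L2 distance between two membership functions on [a, b] (midpoint rule)."""
    width = (b - a) / steps
    total = 0.0
    for k in range(steps):
        x = a + (k + 0.5) * width
        total += (mu_r(x) - mu_t(x)) ** 2
    return math.sqrt(total * width)


def sample_distance(sample_r, sample_t, intervals):
    """Equation (2): sum of the per-index distances."""
    return sum(index_distance(r, t, a, b)
               for r, t, (a, b) in zip(sample_r, sample_t, intervals))


# One index, two non-overlapping triangular fuzzy numbers on the common field [0, 4].
d_rt = sample_distance([triangular(0, 1, 2)], [triangular(2, 3, 4)], [(0.0, 4.0)])
```

For this pair the squared memberships integrate to 2/3 each, so the distance is the square root of 4/3.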

Although Equation (2) is easy to understand, the calculated distance can increase the error. Here, a new distance, named the Sugeno distance, is defined, which can lead to a smaller error. This conclusion will be proved in detail later. Next, the definition of the improved Euclidean distance is described. For the domain of discourse $X = \mathbb{R}$ , $[a^{(j)}, b^{(j)}]$ is a bounded interval. For N indexes, an interval $[a, b] = [\land_{j = 1}^{N} a^{(j)}, \lor_{j = 1}^{N} b^{(j)}]$ can be determined. Assuming that ∨ and ∧ represent maximizing and minimizing in the Zadeh operator, i.e., $\land_{j = 1}^{N} a^{(j)} = \min \{a^{(j)}\}_{j = 1}^{N}$ and $\lor_{j = 1}^{N} b^{(j)} = \max \{b^{(j)}\}_{j = 1}^{N}$ , the range [a, b] represents the maximum common integral field. In addition, if the common integral field of P indexes (1 ⩽ P ⩽ N) is a bilateral or unilateral unbounded interval, the unbounded interval can be regarded as the unlimited extension of a bounded interval. Taking a bilateral unbounded interval as an example, when $a^{(j)} \to - \infty$ and $b^{(j)} \to + \infty$ , $[a^{(j)}, b^{(j)}] \to (- \infty, + \infty)$ . To cover all cases, $[a^{(j)}, b^{(j)}]$ denotes the common integral field of two fuzzy samples under the j-th index, and the improved Euclidean distance between two fuzzy samples ( $\tilde{s}_{r}$ and $\tilde{s}_{t}$ ), denoted as $d_{rt}$ , can be calculated as: $d_{rt} = \left[\sum_{j = 1}^{N} \int_{a}^{b} | A_{r}^{(j)} (x) - A_{t}^{(j)} (x) |^{2} dx\right]^{1 / 2}$ (3)

Equation (3) takes full consideration of the distance between two fuzzy samples ( $\tilde{s}_{r}$ and $\tilde{s}_{t}$ ) under N indexes, i.e., the distance between finite countable fuzzy numbers. As a result, no boundary is set for the clusters, which confuses the inherent attribute characteristics of the category.

Moreover, let $e_{\Sigma} (i, j)$ be the total error, $e_{\Sigma}^{(S)} (i, j)$ the system error, and $e_{\Sigma}^{(R)} (i, j)$ the random error. Then we get $e_{\Sigma} (i, j) = e_{\Sigma}^{(S)} (i, j) + e_{\Sigma}^{(R)} (i, j)$ . For discrete fuzzy numbers, the corresponding errors are denoted as $e_{SD} (i, j)$ , $e_{SD}^{(S)} (i, j)$ and $e_{SD}^{(R)} (i, j)$ . It is easy to see that $e_{\Sigma}^{(R)} (i, j) = e_{SD}^{(R)} (i, j)$ . Since $[a, b] = [\land_{k = 1}^{N} a^{(k)}, \lor_{k = 1}^{N} b^{(k)}]$ and $\forall k, [a^{(k)}, b^{(k)}] \subseteq [a, b]$ , we can infer that $e_{\Sigma}^{(S)} (i, j) = e \left\{\int_{a}^{b} | A_{i}^{(k)} (x) - A_{j}^{(k)} (x) |^{2} dx\right\} > e \left\{\int_{a^{(k)}}^{b^{(k)}} | A_{i}^{(k)} (x) - A_{j}^{(k)} (x) |^{2} dx\right\} = e_{SD}^{(S)} (i, j)$ . So it can be said that the error between fuzzy sets is smaller than that of traditional data.

In summary, using the fuzzy distance can better distinguish the key features of categories, and the fuzzy distance has less total error than the Euclidean distance.

4 Experiment

4.1 Data preprocessing

In this experiment, we used a credit card database as the sample, as provided by the Personal Credit Reference System in China. The data set ranges from 2004 to 2009, with a total number of 65,536 accounts and 31 variables.

First, we explored the basic information of the accounts. In this sample, more than 40% of users hold a bachelor's degree or above, and nearly 80% of users are married. Users in the Personal Credit Reference System are mainly from Beijing, Jiangsu, Shanghai, Shandong, Zhejiang, and Guangdong. All of these regions in China are developed, and we can state that the regional distribution of this sample is reasonable and representative of all users.

Next, we examined the development of accounts. We examined trends in the ratio of debit accounts, the ratio of credit accounts, and the number of effective accounts, as shown in Fig. 1.

Fig. 1

The trend of accounts and the ratio of credit/debit.

From Fig. 1, we can see that the number of accounts showed rapid growth with an almost exponential trend until 2008. The growth then slowed, and the trend approached stability. The ratio of credit accounts changed rapidly at first, and then varied between 30% and 35%. The ratio of debit accounts changed less than the ratio of credit accounts; it has remained around 10% and never exceeded 16% in the last five years. Therefore, it can be said that the users in this system have become stable, and we can use these samples to perform clustering.

Furthermore, in order to check whether there were any great changes in the sampled accounts, we took average annual income and average total debt to explore the trend of economic attributes in individual accounts, as shown in Fig. 2.

Fig. 2

The trend of average annual income and average total debt.

From Fig. 2, we can see that between 2004 and 2005 the average total consumer debt changed rapidly. After 2006, the average total user debt growth became relatively stable. Meanwhile, it can be seen that the average annual income increased, almost following a line, and its speed of growth was much less than that of the average total debt. Nonetheless, it can be inferred that average individual annual income and average individual total debt have become stable in recent years, and these samples can be used to represent all consumers in the system, and so the clusters are reasonable and stable. The sample for analysis comprises consumers in the Personal Credit Reference System with dates after 2006, and there are 53,892 records.

To make the distances of the variables uniform, we segmented the continuous variables; the details are shown in Table 1. Variables marked “Delete” in the table were removed, for the reason stated in each row.

Table 1

Variables and data preprocessing

Variables	Data preprocessing
ID number	Delete because of privacy
Name	Delete because of privacy
Query time	Only used with year
The number of queries
Gender	1 for male, 2 for female
Age	1 for 20–30; 2 for 31–40; 3 for 41–50; 4 for 51–60; 5 for over 60
Marital status	1 for married; 2 for unmarried
Telephone	Delete because of privacy
Affiliation	Delete because of privacy
Work phone	Delete because of too much missing value
Home phone	Delete because of too much missing value
Work place	Delete because of privacy
Place	1 for living place matches mailing place; 2 for other
Education degree	1 for without schooling; 2 for junior high school and below; 3 for high school and secondary school; 4 for undergraduate or college; 5 for master; 6 for doctor
Region	Zip code
Place of birth	Zip code
Annual income	1 for 1 wan yuan and below; 2 for 1–5 wan yuan; 3 for 6–10 wan yuan; 4 for 11–20 wan yuan; 5 for 21–50 wan yuan; 6 for 51–100 wan yuan; 7 for more than 100 wan yuan
Industry	Following GB/T 4754-2017, total 21 kinds
Occupation	Following GB6565-86, total 8 kinds
The number of loans
First time of loan	Only used with year
Total debt	1 for 10 wan yuan and below; 2 for 10–50 wan yuan; 3 for 51–100 wan yuan; 4 for 101–300 wan yuan; 5 for 301–500 wan yuan; 6 for 501–1000 wan yuan; 7 for more than 1000 wan yuan
Debt balance	1 for 10 wan yuan and below; 2 for 10–50 wan yuan; 3 for 51–100 wan yuan; 4 for 101–300 wan yuan; 5 for 301–500 wan yuan; 6 for 501–1000 wan yuan; 7 for more than 1000 wan yuan
Repayment monthly	1 for 1 wan yuan and below; 2 for 1.1–2.0 wan yuan; 3 for 2.1–5.0 wan yuan; 4 for 5.1–10 wan yuan; 5 for more than 10 wan yuan
The number of credit cards
First time of credit card	Only used with year
Credit card amount	1 for 1 wan yuan and below; 2 for 1–3 wan yuan; 3 for 3–5 wan yuan; 4 for 5–10 wan yuan; 5 for 10–30 wan yuan; 6 for more than 30 wan yuan
Credit balance	1 for 1000 yuan and below; 2 for 1001–3000 yuan; 3 for 3001–5000 yuan; 4 for 5001–10000 yuan; 5 for more than 1 wan yuan
Debit balance	1 for 100 yuan and below; 2 for 101–500 yuan; 3 for 501–1000 yuan; 4 for 1001–3000 yuan; 5 for more than 3000 yuan
The number of guarantees
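The segmentation rules of Table 1 translate into simple lookups. As an illustration, the following sketch encodes the age bands of Table 1; the helper name `encode_age` and the constant `AGE_UPPER_BOUNDS` are hypothetical, and ages below 20 (not covered by the table) are simply mapped to code 1 here.

```python
from bisect import bisect_left

# Upper edges of codes 1-4 in Table 1 (20-30, 31-40, 41-50, 51-60); code 5 is "over 60".
AGE_UPPER_BOUNDS = [30, 40, 50, 60]


def encode_age(age: int) -> int:
    """Map an age in years to the ordinal code used in Table 1."""
    return bisect_left(AGE_UPPER_BOUNDS, age) + 1


codes = [encode_age(a) for a in (25, 30, 31, 60, 61)]
```

The same pattern applies to the other segmented variables by swapping in their band edges.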

Then, we used min-max normalization to map all of these records into the interval [0, 1], so that the variables share the same scale when the distance matrix of the sample is calculated.
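Min-max normalization of a single variable can be sketched as below. The function name `min_max_scale` and the sample codes are hypothetical, and mapping a constant column to 0 is an assumption made to avoid division by zero.

```python
def min_max_scale(column):
    """Rescale values to [0, 1] via (v - min) / (max - min)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]   # assumption: constant column maps to 0
    return [(v - lo) / (hi - lo) for v in column]


codes = [1, 3, 7, 4]                    # e.g. ordinal codes of one variable
scaled = min_max_scale(codes)
```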

4.2 Dynamic clustering

According to the clustering algorithm, we explored the orthogonal process to find the characteristic elements in 31 variables, as described in step 2 of the algorithm. It was found that there were 10 variables in the results, which were gender and marital status (x1), the number of queries(x2), education degree (x3), industry (x4), the number of guarantees(x5), the number of credit cards (x6), the number of loans (x7), credit balances (x8), total debt(x9), and annual income(x10), listed by their variance.

With the 53,892 records and initial 10 variables, we used the maximal number of occurrences to determine the initial number of clusters and used the mode of the final cluster to determine the initial record core.

We explored the results by constructing a diagram to show the relationship between cluster number and Sugeno distance, as shown in Fig. 3.

Fig. 3

The number of clusters and the distance.

From Fig. 3, it can be seen that for distances between 13 and 18, the number of clusters is always 30, which is most representative of the experimental results and is the flattest area in the figure. Therefore, we set the number of initial clusters to 30.

The clustering growth process was performed using MATLAB 2016a, with the 30 initial classes. The clustering process is shown in Fig. 4.

Fig. 4

The combination process of clusters.

As shown in Fig. 4, there are five prominent peaks in the graph, and so we inferred that the number of final clusters was five. Furthermore, we summarized the growth process in Table 2.

Table 2

The summary of dynamic clustering

Category	Records	Max distance within category	Nearest category
1	11,483	1.0926	3
2	8682	1.0451	1
3	10,720	1.0218	1
4	2595	1.8278	5
5	20,412	1.0917	4

The final distance was found to be λ = 2, and the numbers of records in the clusters were not equal. Category 5 had the most records, and category 4 had the fewest. The maximum distances within categories 1, 2, 3, and 5 were almost equal, so it can be said that the records were concentrated within classes. Category 4 had the largest distance, which was evidently different from the others. According to the nearest category of each cluster, we found that categories 2 and 3 had the tendency to combine with category 1, and category 4 had the tendency to combine with category 5. We also explored the modes of the 10 core variables within these categories, and the results are listed in Table 3.

Table 3

The modes of 10 variables with 5 categories

Mode	1	2	3	4	5
Gender and marital status (x1)	11	21	11	12	11
Number of query (x2)	3	4	5	5	7
Education degree (x3)	4	3	4	3	5
Industry (x4)	3	5	4	9	10
Number of guarantee (x5)	3	1	2	2	3
Number of credit card (x6)	1	0	3	3	5
Number of loans (x7)	1	1	3	2	4
Credit balances (x8)	2	1	3	4	4
Total debt (x9)	2	2	4	4	5
Annual income (x10)	3	2	3	4	5

Using the χ² test, it was found that 8 of the 10 variables were significantly different at the 95% confidence level between category 1 and category 5; gender and marital status (x1) and the number of guarantees (x5) failed the test. Five of the 10 variables were significantly different at the 95% confidence level between category 1 and category 3; gender and marital status (x1), education degree (x3), industry (x4), total debt (x9), and annual income (x10) failed the test. Seven of the 10 variables were significantly different at the 95% confidence level between category 3 and category 5; gender and marital status (x1), the number of credit cards (x6), and total debt (x9) failed the test.
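One plausible way to run such a test is to cross-tabulate the levels of a variable against category membership and compute Pearson's χ² statistic. The sketch below does this by hand for a 2×2 table; the counts are invented for illustration and are not the paper's data, and 3.841 is the standard critical value for one degree of freedom at the 0.05 level.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


CRITICAL_95_DF1 = 3.841        # chi-square critical value, 1 degree of freedom, alpha = 0.05

# Hypothetical counts: rows are two categories, columns are two levels of one variable.
counts = [[30, 10], [10, 30]]
stat, dof = chi_square(counts)
differs = dof == 1 and stat > CRITICAL_95_DF1
```

Variables with more than two levels give larger tables and more degrees of freedom, so the critical value has to be looked up for the corresponding degrees of freedom.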

5 Robust check

5.1 Comparison with K-means

In order to compare the new algorithm with other algorithms, we used the same sample to perform experiments under Euclidean distance using K-means. The experiment was carried out using the same software, MATLAB 2016a. The algorithm finished with 13 categories, but the results were confused and the maximum distance was larger than that of the dynamic clustering algorithm. The details are shown in Tables 4 and 5.

Table 4
The cluster summary of K-means algorithm

Category	Records	Max distance within category	Nearest category
1	2542	2.863	4
2	1639	2.632	4
3	1851	2.193	1
4	7259	3.325	6
5	5281	2.961	4
6	2649	2.762	2
7	8201	3.342	6
8	3372	2.887	1
9	6219	3.116	7
10	942	1.971	1
11	1393	1.984	1
12	4281	2.974	3
13	8263	3.427	10

Table 5

The modes of 10 variables with 13 categories

Category	Gender and marital status (x1)	Number of queries (x2)	Education degree (x3)	Industry (x4)	Number of guarantees (x5)	Number of credit cards (x6)	Number of loans (x7)	Credit balances (x8)	Total debt (x9)	Annual income (x10)
1	11	1	2	3	1	3	1	0	1	2
2	21	1	3	2	0	0	1	1	2	2
3	21	3	3	4	2	0	0	1	2	3
4	22	3	3	5	0	2	0	2	1	4
5	22	7	4	6	3	1	1	2	3	4
6	21	5	4	5	1	6	1	1	3	3
7	11	6	4	3	2	4	2	1	4	2
8	11	4	4	9	0	2	0	0	4	5
9	12	1	4	9	5	1	4	4	3	2
10	12	2	4	9	0	0	0	1	1	3
11	12	2	3	10	5	3	3	5	2	4
12	21	3	3	10	6	4	2	2	3	5
13	11	4	3	10	3	5	5	2	2	5

From Table 4, it can be seen that the distances within categories are larger than those of the Sugeno KNN. Moreover, Table 5 shows that the modes are not effectively separated among the different categories, so the features are quite confused compared with the Sugeno KNN.

5.2 Comparison with other distances

For a hierarchical clustering algorithm, the setting of the initial point and the selection of the distance are the key factors affecting the whole clustering result. Zhang, et al. [20] argue that the setting of the initial point can be adjusted during the learning process; thus, the distance may be the most important factor for the overall performance of KNN.

Based on this, we conducted a few robustness check experiments using different distances with the KNN algorithm. Here we chose a typical dataset, “German Credit Card”, which can be obtained from http://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data). In the data set, there are 20 predictors and 1 response variable.

Because there are only a few continuous variables in this data set, namely age and income, we divide them into discrete segments. Because there is no text variable, we only compare Euclidean distance, Cityblock distance, Correlation distance, and Sugeno distance. The resulting confusion matrices can be found in Fig. 5; they are constructed with 70% training samples and 30% validation samples.

Fig. 5

Confusion matrix with different KNN algorithms.
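The evaluation protocol (a 70/30 split followed by a confusion matrix) can be sketched as follows. This is a generic illustration, not the code used for Fig. 5: the helper names `split_70_30` and `confusion_matrix`, the fixed random seed, and the toy labels are all assumptions.

```python
import random


def split_70_30(records, seed=0):
    """Shuffle a copy of the records and split it into 70% training and 30% validation."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def confusion_matrix(actual, predicted, labels):
    """matrix[i][j] counts samples whose actual label is labels[i] and predicted label is labels[j]."""
    index = {label: k for k, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


train, valid = split_70_30(list(range(10)))
# Placeholder labels 1 and 2 stand for the two classes of the response variable.
cm = confusion_matrix([1, 1, 2, 2, 2], [1, 2, 2, 2, 1], labels=[1, 2])
```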

Using a series of different thresholds, we chose the ROC curve to conduct a further comparison. The ROC results can be found in Fig. 6.

Fig. 6

ROC curves between different distances.

Fig. 6 shows that the accuracy of the Sugeno KNN algorithm improves faster as the number of samples grows. Its Area Under the Curve (AUC) is also the largest among the compared algorithms. To some extent, this indicates that the Sugeno KNN algorithm performs better on the clustering of discrete data.
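For reference, the construction of an ROC curve and its AUC can be sketched as follows. The sketch assumes that each validation sample has a score, for example the fraction of its k nearest neighbours that belong to the positive class; this scoring rule is an assumption, since the rule used for Fig. 6 is not stated. The scores and labels in the example are made up.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the threshold step by step.

    labels are 1 for positive and 0 for negative samples; both classes
    must be present.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples tied at the same score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under a piecewise-linear curve, by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
area = auc(curve)
print(curve)
print(area)
```

In this example 8 of the 9 positive-negative pairs are ranked correctly, so the AUC equals 8/9.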

6 Conclusion

In academic research, clustering algorithms are often used for user classification. A clustering algorithm is based on a distance measure between samples. For large datasets, the most commonly used algorithm is the KNN hierarchical clustering algorithm, which uses Euclidean distance. Current research focuses on the selection of the initial cluster centers and on performance improvement. KNN has been shown to provide good results with numerical attributes, but it is not valid for ordinal variables. The Personal Credit Reference System in China contains many ordinal variables, such as the number of queries, the number of credit cards, and the number of loans. The difference between values of these variables cannot be measured by Euclidean distance, so the standard KNN algorithm cannot solve this problem.

By using the partial order integral under the Sugeno measure, we constructed the Sugeno distance between two class sets, and then reconstructed the dynamic clustering algorithm by controlling the merging of classes through a cut-off threshold. The main advantage of this algorithm is that it can handle ordinal variables, for which Euclidean distance has no meaning, so it is better suited to databases that contain a large number of ordinal variables.
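For readers unfamiliar with the Sugeno integral, its standard discrete form on a finite set can be sketched as below. This is only the textbook integral, not the Sugeno distance constructed in this paper, and the cardinality-based measure is a hypothetical choice made for illustration.

```python
def sugeno_integral(values, mu):
    """Discrete Sugeno integral of values (each in [0, 1]) with respect to mu.

    mu maps a frozenset of indices to [0, 1]; it is assumed to be monotone
    with mu(empty set) = 0 and mu(all indices) = 1.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    best = 0.0
    for rank, i in enumerate(order):
        # Indices from this rank upward in the ascending order of values.
        upper_set = frozenset(order[rank:])
        best = max(best, min(values[i], mu(upper_set)))
    return best

def cardinality_measure(n):
    """Hypothetical symmetric measure: mu(A) = |A| / n."""
    return lambda subset: len(subset) / n

scores = [0.2, 0.5, 0.9]
result = sugeno_integral(scores, cardinality_measure(len(scores)))
print(result)
```

Here the candidates are min(0.2, 1), min(0.5, 2/3) and min(0.9, 1/3), so the integral is 0.5. Only the order of the values and the measure of the upper sets enter the computation, which is why the integral remains meaningful for ordinal data.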

We used real data obtained from the Personal Credit Reference System to perform the analyses: we unified the metric variables under the Sugeno distance, selected the preliminary category variables, ran threshold tests to design the initial setting, and finally ran the dynamic clustering algorithm through a consolidation process.

Through experimentation, it was found that credit users can be divided into five categories. The three main categories can be summarized as middle-income conservatives (category 1), middle-income activists (category 3), and high-income high leverage activists (category 5). Furthermore, it can be seen that in the Personal Credit Reference System, user portraits can be drawn from the two basic dimensions, which are income and credit.

However, the Sugeno measure is based on the finite order. If we deal with infinite order, the result is not clear. In this study, all of the continuous data have been segmented, so the segments have a great effect on the final result. Furthermore, just as for other clustering algorithms, the initial settings also affect the final result. Therefore, further discussions are necessary to elucidate the segmentation of continuous intervals and the initial settings. Moreover, the theoretical error of this algorithm requires future study.

In summary, user classification is the first step in credit information extraction, which restricts the application of the Personal Credit Reference System. There are a large number of ordinal variables in the system, so clustering using Sugeno measures is more practical and reasonable.

Acknowledgments

This work was supported by the Visiting Scholar Grant Program of the China Scholarship Council for Han (No. 201806495014), and by the Fundamental Research Funds for the Central Universities.

References

1. Ezequiel E.J.P.F. and López-Rubio, Unsupervised Learning by Cluster Quality Optimization, Information Sciences (2018).

2. Bahrani, Minaei-Bidgoli, Parvin, Mirzarezaee, Keshavarz and Alinejad-Rokny, User and item profile expansion for dealing with cold start problem, Journal of Intelligent & Fuzzy Systems 38(4) (2020), 4471–4483.

3. Mikhed and Vogan, How data breaches affect consumer credit, Journal of Banking & Finance 88 (2018), 192–207.

4. Bonchi, Garcia-Soriano and Liberty, Correlation clustering: from theory to practice, ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2014).

5. […], Cui, Wang and Su, Efficient index-based KNN join processing for high-dimensional data, Information and Software Technology 49(4) (2007), 332–344.

6. […], Zhang, Huang and Xiong, High-dimensional kNN joins with incremental updates, Geoinformatica 14(1) (2009), 55.

7. Tan, Zhang and Wu, Mutual kNN based spectral clustering, Neural Computing and Applications (2018).

8. Olivares, Kermarrec and Chiluka, The out-of-core KNN awakens: the light side of computation force on large datasets, Computing 101(1) (2019), 19–38.

9. […], Zhang, Zhao, Yang and Pan, KNN-based maximum margin and minimum volume hyper-sphere machine for imbalanced data classification, International Journal of Machine Learning and Cybernetics 10(2) (2019), 357–368.

10. Ali, Jung L.T., Abdel-Aty, Abubakar M.Y., Elhoseny and Ali, Semantic-k-NN algorithm: An enhanced version of traditional k-NN algorithm, Expert Systems with Applications 151 (2020).

11. Aburomman A.A. and Ibne Reaz M.B., A novel SVM-kNN-PSO ensemble method for intrusion detection system, Applied Soft Computing 38 (2016), 360–372.

12. Shi, Han and Yan, Adaptive clustering algorithm based on kNN and density, Pattern Recognition Letters 104 (2018), 37–44.

13. Nordhaug Myhre, Øyvind Mikalsen, Løkse and Jenssen, Robust clustering using a kNN mode seeking ensemble, Pattern Recognition 76 (2018), 491–505.

14. Zhang, Cost-sensitive KNN classification, Neurocomputing (2019).

15. […], Chen and Song, Boosted K-nearest neighbor classifiers based on fuzzy granules, Knowledge-Based Systems 195 (2020).

16. Bhattacharya, Ghosh and Chowdhury A.S., An affinity-based new local distance function and similarity measure for kNN algorithm, Pattern Recognition Letters 33(3) (2012), 356–363.

17. Deng, Zhu, Cheng, Zong and Zhang, Efficient kNN classification algorithm for big data, Neurocomputing 195 (2016), 143–148.

18. Chen, Hu, Fan, Shen, Zhang, Liu, Du, Li, Chen and Li, Fast density peak clustering for large scale data based on kNN, Knowledge-Based Systems (2019), 104824.

19. Gou, Qiu, Yi, Xu, Mao and Zhan, A Local Mean Representation-based K-Nearest Neighbor Classifier, ACM Transactions on Intelligent Systems and Technology 10(3) (2019), 1–25.

20. Zhang, Li, Zong, Zhu and Cheng, Learning k for kNN Classification, ACM Transactions on Intelligent Systems and Technology (TIST) 8(3) (2017), 1–19.

21. […] and Hwang S.O., Automatic text summarization using string vector based K nearest neighbor, Journal of Intelligent & Fuzzy Systems 35(6) (2018), 6005–6016.

22. Sugeno, Theory of fuzzy integrals and its applications, Doctoral Thesis, Tokyo Institute of Technology (1974).

23. Klement E.P., Mesiar and Pap, A universal integral as common frame for Choquet and Sugeno integral, IEEE Transactions on Fuzzy Systems 18(1) (2010), 178–187.

24. Agahi, k-generalized Sugeno integral and its application, Information Sciences 305 (2015), 384–394.

25. Smrek, Sugeno integrals with respect to level dependent capacities, Fuzzy Sets and Systems 291 (2016), 33–39.

26. Halaš, Mesiar and Pócs, A new characterization of the discrete Sugeno integral, Information Fusion 29 (2016), 84–86.

27. Halaš, Mesiar and Pócs, Congruences and the discrete Sugeno integrals on bounded distributive lattices, Information Sciences (2016), 443–448.

28. Dubois, Prade, Rico and Teheux, Generalized qualitative Sugeno integrals, Information Sciences 415–416 (2017), 429–445.

29. Brabant and Couceiro, k-maxitive Sugeno integrals as aggregation models for ordinal preferences, Fuzzy Sets and Systems (2017).

30. Shubair, Ramadass and Altyeb, kENFIS: kNN-based evolving neuro-fuzzy inference system for computer worms detection, Journal of Intelligent & Fuzzy Systems 26(4) (2014), 1893–1908.

31. Wang M.L., Zhang Z.Q., Wang Q.D. and Shao H.Y., Adaptive Asymptotic Tracking of Nonlinear Systems Using Nonlinearly Parameterized First-Order Sugeno Fuzzy Approximator, International Journal of Fuzzy Systems 20(4) (2018), 1079–1087.

32. Boczek, Hovana and Hutník, General form of Chebyshev type inequality for generalized Sugeno integral, International Journal of Approximate Reasoning 115 (2019), 1–12.

33. Daraby, Rostampour, Khodadadi A.R., Rahimi and Mesiar, One version of the Prékopa-Leindler type inequality for the Sugeno integral, Fuzzy Sets and Systems (2019).

34. Beliakov, Gagolewski and James, Robust fitting for the Sugeno integral with respect to general fuzzy measures, Information Sciences (2019).

35. Román-Flores, Flores-Franulič, Aguirre-Cipe and Romero-Martínez, A Sugeno integral inequality of Carleman-Knopp type and some refinements, Fuzzy Sets and Systems (2019).

36. Halaš, Mesiar, Pócs and Torra, A note on some algebraic properties of discrete Sugeno integrals, Fuzzy Sets and Systems 355 (2019), 110–120.

37. Han, Han and Zhao, Orthogonal support vector machine for credit scoring, Engineering Applications of Artificial Intelligence 26(2) (2013), 848–862.