Application of big data adaptive semi-supervised clustering method based on deep learning

Abstract

To address the high computational complexity, poor dimensionality reduction, and degraded clustering quality that arise when clustering large volumes of data, an adaptive semi-supervised clustering method for big data based on deep learning is proposed. The autoencoder and the generative adversarial network used in deep clustering networks are analyzed, and a semi-supervised deep clustering algorithm with an adaptive strategy is proposed and optimized. The same encoder layer structure of the deep coding network is used for all data sets, and the experimental parameters are set for the algorithm analysis. The results show that this method is faster and more accurate than traditional clustering methods, which demonstrates its effectiveness.

Keywords

Big data, deep learning, semi-supervised learning, application

1. Introduction

The real world produces large amounts of data in many fields, and machine learning and data mining technologies emerged to extract valuable information from them. Among these technologies, cluster analysis is an important research topic [1, 2, 3]. The purpose of clustering is to divide a group of objects into classes according to their characteristics, so that objects in the same class are highly similar and objects in different classes are clearly different. Since the 1990s, many scholars have tried to improve classifier performance by combining supervised and unsupervised learning, making joint use of labeled and unlabeled samples during training. However, it was not until the beginning of the 21st century that semi-supervised learning (SSL) began to form a system of theory and algorithms independent of traditional supervised and unsupervised learning and became a new type of machine learning method [4]. In practical applications, obtaining fully labeled data is usually expensive and time-consuming, while obtaining unlabeled data is much easier. As shown in Fig. 1, SSL improves classifier learning performance by using a small amount of labeled data together with a large amount of unlabeled data, and it has been studied in depth in many fields.

Figure 1.

Example of semi supervised learning.

Clustering is a basic problem in machine learning, computer vision, data compression and other fields. Traditional clustering methods are mature, but the rapid growth of human activity has increased the amount and complexity of stored data. The connections between data and the characteristics of the data themselves have become more complex, and clustering has become correspondingly harder: it now faces high computational complexity, poor dimensionality reduction and reduced clustering quality. The significance of data lies not only in the data themselves, but also in the analysis built on them, which helps people explore latent information and produce valuable insight.

In recent years, the combination of deep learning and clustering has attracted extensive attention because of the powerful feature extraction and representation ability of deep learning. Clustering based on deep learning, also known as deep clustering, is gradually rising. Deep clustering uses the representation capability of deep learning to improve clustering results. The key requirement is that the representation learned by the neural network be low-dimensional and suitable for clustering while still reflecting the information and structural characteristics of the original data, so as to achieve better clustering results. Literature [5] analyzes and summarizes big data clustering and small data clustering on the basis of the development of traditional clustering. Literature [6] summarizes effective representative algorithms from the early stage of deep clustering together with recent developments, but it does not cover deep clustering with graph neural networks. With the continuous penetration of artificial intelligence technology, machine learning on graphs with graph neural networks has attracted growing attention from researchers and practitioners, and is increasingly used in tasks such as classification, clustering and link prediction.

Cluster analysis is one of the most basic tasks in data mining, machine learning, pattern recognition and related fields. Its goal is to divide a data set into several clusters so that the data in the same cluster are as similar as possible. Although many classical clustering algorithms have been published, such as k-means and the Gaussian mixture model, traditional clustering algorithms cannot deal effectively with high-dimensional data sets because of the curse of dimensionality [7]. To solve this problem, dimensionality reduction algorithms are often applied as a pre-processing step before clustering. The principle is to map the original high-dimensional data to a low-dimensional latent space in which the transformed data can be distinguished more easily. Classical dimensionality reduction algorithms include principal component analysis, independent component analysis, Laplacian eigenmaps, and locally linear embedding.

Semi-supervised clustering improves on traditional clustering by incorporating a small amount of supervision information [8]. It uses this prior information to guide the clustering process so that the clustering results meet the expectations of researchers as closely as possible, and even a small amount of supervision information can effectively improve clustering performance. Because semi-supervised clustering helps manage massive data and mine useful information from them, research on semi-supervised clustering algorithms has both theoretical and practical significance. How to use the supervision information more effectively to improve the performance of semi-supervised clustering is therefore the main subject of this paper. Against this background, this paper proposes a semi-supervised clustering algorithm based on pairwise constraints derived from label information. Experiments on benchmark data sets verify that the algorithm is effective, which is of some significance for the development of semi-supervised clustering.

2. Deep clustering network structure

2.1 Autoencoder

The autoencoder is a typical deep learning model with an encoder-decoder architecture. In an unsupervised way, it transforms the original input into an intermediate representation that captures the latent features of the input. This type of neural network learns a representation of the data from the input layer and attempts to reproduce the original data at the output layer. In this way, the autoencoder can learn the latent distribution of incomplete data and generate credible new estimates [9, 10, 11]. Autoencoders have been widely used in many fields, including dimensionality reduction, image recognition and text classification. Figure 2 shows the typical structure of the autoencoder.

Figure 2.

Autoencoder network structure.

The autoencoder (AE) is composed of an input layer, a hidden layer and an output layer. The input layer and hidden layer constitute the encoder, and the hidden layer and output layer constitute the decoder. The input layer receives the original input, the hidden layer encodes the input into a compact hidden representation, and the output layer reconstructs the original input. The encoder maps the input vector $x$ to the hidden representation $z$ through a nonlinear transformation:

$\displaystyle z=f_{\theta}(x)=s(xW^{T}+b)$ (1)

where $s$ is a nonlinear activation function and $\theta$ represents the weight matrix $W$ and the offset vector $b$. The decoder then maps the hidden representation $z$ to a vector $y$ with the same dimension as the input data $x$:

$\displaystyle y=g_{\theta^{\prime}}(z)=s(W^{\prime}z+b^{\prime})$ (2)
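The two mappings in Eqs. (1) and (2) can be illustrated with a minimal sketch in plain Python. It assumes a sigmoid for the activation $s$ and randomly initialized, untrained weights; the layer sizes and the helper names `affine`, `encode` and `decode` are illustrative choices, not part of the original method.

```python
import math
import random

def sigmoid(v):
    # Element-wise logistic function, used here as the activation s (an assumption).
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def affine(W, v, b):
    # Computes W v + b, with W given as a list of rows.
    return [sum(w * a for w, a in zip(row, v)) + bi for row, bi in zip(W, b)]

def encode(x, W, b):
    # Eq. (1): z = s(x W^T + b); for a single vector x this equals s(W x + b).
    return sigmoid(affine(W, x, b))

def decode(z, W_prime, b_prime):
    # Eq. (2): y = s(W' z + b').
    return sigmoid(affine(W_prime, z, b_prime))

rng = random.Random(0)
d, m = 6, 3  # input and hidden sizes (illustrative)
W = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(m)]
b = [0.0] * m
W_prime = [[rng.uniform(-0.5, 0.5) for _ in range(m)] for _ in range(d)]
b_prime = [0.0] * d

x = [rng.random() for _ in range(d)]
z = encode(x, W, b)               # compact hidden representation
y = decode(z, W_prime, b_prime)   # reconstruction with the same dimension as x
reconstruction_error = sum((a - c) ** 2 for a, c in zip(x, y))
```

Training would adjust $W$, $b$, $W^{\prime}$ and $b^{\prime}$ to reduce `reconstruction_error`; the sketch only shows the forward pass.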

There are many variants of the autoencoder, among which a classic one is the Denoising Autoencoder (DAE), shown in Fig. 3. The goal of this variant is to extract and encode robust features from noisy data. The noise may come from corruption of the data through some additive mechanism or from missing values [12, 13, 14]. DAE is very similar to the basic AE; the main difference is that the model input is randomly corrupted in the training stage, which means that $y$ becomes a deterministic function of the corrupted input $\tilde{x}$ rather than of $x$.

Figure 3.

Network structure of the denoising autoencoder.

2.2 Generative adversarial network

The Generative Adversarial Network (GAN) is a deep learning method that has emerged in recent years. GAN does not require supervision information during learning; its generator and discriminator compete with each other in a zero-sum game [15, 16, 17]. Both the generator and the discriminator gradually obtain better performance and better representations during training. For example, in an image generation problem, the generator produces an image from Gaussian noise, and the discriminator judges the quality of the generated image. This process continues until the output of the generator is close to the actual input samples.

Figure 4.

Generative adversarial network structure.

The generator takes random noise $z$ as input and outputs generated (fake) data $G(z)$. The discriminator takes real data $x$ or generated data $G(z)$ as input and outputs the probability that the input comes from the real data distribution.

Figure 5.

The network structure of the adaptive semi-supervised deep clustering algorithm.

The objective of GAN is stated as follows:

$\displaystyle\min_{G}\max_{D}V(D,G)=E_{x\sim p_{\textit{data}}(x)}[\log D(x)]+E_{z\sim p_{z}(z)}[\log(1-D(G(z)))]$ (3)
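As a small numerical illustration of Eq. (3), the sketch below estimates $V(D,G)$ from a handful of discriminator outputs. The function name `gan_value` and the sample values are hypothetical; no generator or discriminator network is trained here.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) in Eq. (3).

    d_real: discriminator outputs D(x) on real samples, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that separates real from generated data well (illustrative values).
confident = gan_value([0.9, 0.95, 0.85], [0.1, 0.05, 0.2])

# A discriminator that cannot tell them apart outputs 0.5 everywhere,
# which gives V = log(0.5) + log(0.5) = -2 log 2.
fooled = gan_value([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
```

The discriminator tries to raise this value while the generator tries to lower it, which is the zero-sum game described above.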

Based on the GAN framework, the information maximizing generative adversarial network (InfoGAN) can disentangle discrete and continuous latent factors and scale to complex data sets [18, 19]. The effectiveness of InfoGAN for clustering mainly comes from maximizing the mutual information between a fixed small subset of the noise variables and the observed data. Optimization algorithms based on generative adversarial networks can impose multiple types of priors on the basic framework, making the framework more flexible and diverse, but they suffer from shortcomings such as mode collapse and difficulty in convergence.

3. Semi-supervised deep clustering algorithm based on adaptive strategy

The network structure of the semi-supervised deep clustering algorithm based on an adaptive strategy is shown in Fig. 5. The algorithm maps the data from the original space to the feature space through the deep coding network and obtains feature representations suitable for the clustering task. By learning a low-dimensional representation of the original data, it alleviates both the performance degradation of traditional semi-supervised clustering methods on high-dimensional data and the curse of dimensionality in the original space. The algorithm also proposes a label adaptive strategy to correct label drift in cluster assignment. This strategy not only improves the utilization of label information, but also weakens the excessive influence of the deep coding network on the cluster centers. Finally, the algorithm designs a semi-supervised joint learning framework that integrates the label loss and the clustering loss to adjust the latent representation and the cluster centers, and thereby improves the performance of the clustering method.

Data in the original space often have very high dimension and much redundant information, so their important features are not prominent enough and it is difficult to measure the similarity between data samples effectively. To overcome the curse of dimensionality of the original data, the algorithm uses a stacked autoencoder network to construct a latent space for the high-dimensional data, and learns a low-dimensional representation of the original data by minimizing the reconstruction loss. The autoencoder is an unsupervised deep learning method. It extracts features automatically by encoding the input data and then decoding the dimension-reduced result, and it is often used to learn more essential representations of raw data. The deep coding network learns the latent features of the data in the low-dimensional space on the basis of the autoencoder network: it calculates the hidden representations of the data samples and uses these hidden representations to reconstruct the data samples, so as to minimize the loss between the raw data and the reconstructed data.

The adaptive semi-supervised deep clustering algorithm uses stacked autoencoders as a deep encoding network that learns the latent feature space of the initial data, and initializes it layer-wise with denoising autoencoders. Given a dataset $X=\{x_{i}\in R^{d_{1}}\}_{i=1}^{n}$ with $n$ sample points, where $d_{1}$ represents the dimension of the data in the original space, the deep encoding network learns the latent feature representation of the original data in the following steps:

$\displaystyle\bar{x}=\text{Dropout}(x)$
$\displaystyle h=g_{e}(W_{e}\bar{x}+b_{e})$

where Dropout( $\bullet$ ) is a random mapping function that sets some elements of its input to zero with a given probability, and $\bar{x}$ is the input $x$ after the Dropout function is applied. $W_{e}$ and $b_{e}$ represent the weight matrix and bias vector, which are the parameters of the encoding part. $h$ represents the output of the encoding network, and $g_{e}$ represents the encoding function.

After obtaining the latent representation of each data sample, the deep encoding network will decode the latent representation of the data by the following reconstruction function:

$\displaystyle\bar{h}=\textit{Dropout(h)}$ $\displaystyle t=g_{d}(W_{d}\bar{h}+b_{d})$

Among them, $\bar{h}$ is the result of the latent representation h after random mapping. $W_{d}$ and $b_{d}$ represent weight and bias vectors, which represent the parameters of the decoding part. $t$ represents the reconstructed data, and $g_{d}$ represents the decoding function.
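The encoding and reconstruction steps above can be sketched for a single layer as follows. The source does not specify $g_{e}$ and $g_{d}$, so this sketch assumes a ReLU encoder and a linear decoder; the layer sizes, the dropout probability and the function name `encode_decode` are likewise illustrative.

```python
import random

def dropout(v, p, rng):
    # Zero each element independently with probability p. No rescaling is applied,
    # matching the description of Dropout as a random mapping to zero.
    return [0.0 if rng.random() < p else a for a in v]

def affine(W, v, b):
    # Computes W v + b, with W given as a list of rows.
    return [sum(w * a for w, a in zip(row, v)) + bi for row, bi in zip(W, b)]

def relu(v):
    return [max(0.0, a) for a in v]

def encode_decode(x, W_e, b_e, W_d, b_d, p, rng):
    x_bar = dropout(x, p, rng)            # x_bar = Dropout(x)
    h = relu(affine(W_e, x_bar, b_e))     # h = g_e(W_e x_bar + b_e), g_e assumed ReLU
    h_bar = dropout(h, p, rng)            # h_bar = Dropout(h)
    t = affine(W_d, h_bar, b_d)           # t = g_d(W_d h_bar + b_d), g_d assumed linear
    loss = sum((a - c) ** 2 for a, c in zip(x, t))  # squared reconstruction loss
    return h, t, loss

rng = random.Random(0)
d, m = 8, 4  # input and latent sizes (illustrative)
W_e = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(m)]
b_e = [0.0] * m
W_d = [[rng.uniform(-0.5, 0.5) for _ in range(m)] for _ in range(d)]
b_d = [0.0] * d

x = [rng.random() for _ in range(d)]
h, t, loss = encode_decode(x, W_e, b_e, W_d, b_d, p=0.2, rng=rng)
```

Note that the loss compares the reconstruction $t$ with the clean input $x$, not with the corrupted $\bar{x}$, which is what makes the layer a denoising autoencoder.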

3.1 Deep clustering of variational autoencoders

Relatively few deep clustering studies are based on variational autoencoders; the most representative algorithms are VaDE and GMVAE. The VaDE [7] algorithm assumes that the prior distribution of the hidden variable $z$ is a Gaussian mixture model (GMM):

$\displaystyle p(c)=\text{Cat}(1/K),\quad p(z|c)=N(u_{c},\sigma_{c}^{2})$ (4)

where $c$ represents the cluster to which the sample belongs, Cat( $1/K$ ) represents a discrete uniform distribution over the $K$ clusters, and $u_{c}$ and $\sigma_{c}^{2}$ respectively represent the mean vector and covariance matrix of the Gaussian distribution corresponding to the $c$-th cluster. The loss function of VaDE is defined as:

$\displaystyle L=\sum\limits_{i=1}^{N}\left(-E_{q(z_{i},c_{i}|x_{i})}[\log p(x_{i}|z_{i})]+\textit{KL}(q(z_{i},c_{i}|x_{i})\,||\,p(z_{i},c_{i}))\right)$ (5)

In the above formula, the first term is the reconstruction loss (i.e. network loss), and the second term is the KL divergence between the variational posterior distribution $q(z_{i},c_{i}|x_{i})$ and the prior distribution $p(z_{i},c_{i})$ . Since $c$ represents the cluster assignment, the second term can also be regarded as the clustering loss, which constrains the latent representation $z$ to lie on the manifold of the Gaussian mixture distribution. Suppose $q(z_{i},c_{i}|x_{i})$ can be decomposed into $q(z_{i}|x_{i})q(c_{i}|x_{i})$ , and further assume that $q(z_{i}|x_{i})$ obeys a Gaussian distribution. In the coding stage, VaDE uses a neural network $g(x;\phi)$ to fit the parameters of the Gaussian distribution $q(z|x)$ :

$\displaystyle[\tilde{u},\log\tilde{\sigma}^{2}]=g(x;\phi),\quad q(z|x)=N(\tilde{u},\tilde{\sigma}^{2}I)$ (6)

After sampling $z$ from the Gaussian distribution $q(z|x)$ , another neural network $f(z;\theta)$ is used to fit the parameters of the distribution $p(x|z)$ :

$\displaystyle[u_{x},\log\sigma_{x}^{2}]=f(z;\theta),\quad p(x|z)=N(u_{x},\sigma_{x}^{2}I)$ (7)

Note that $p(x|z)$ is also a Gaussian distribution. VaDE calculates the cluster assignment of samples according to $q(c|x)$ , and the calculation method is as follows:

$\displaystyle q(c|x)=E_{q(z|x)}[p(c|z)]$ (8)

Among them, $p(c|z)$ can be calculated according to the Bayesian formula. The GMVAE [8] algorithm is similar to VaDE. It also assumes that the hidden variables come from the Gaussian mixture model. The data generation process is as follows:

$\displaystyle p(c)=\textit{Cat}(1/K),\quad p(w)=N(0,I)$
$\displaystyle p(z|c,w)=N(u_{c}(w;\beta),\sigma_{c}^{2}(w;\beta)I)$ (9)
$\displaystyle p(x|z)=N(u_{x}(z;\theta),\sigma_{x}^{2}(z;\theta)I)$

The difference from VaDE is that GMVAE treats the mean and variance of the Gaussian prior as random variables as well, which are approximated by a neural network with parameter $\beta$ . Because it introduces more variables, GMVAE is more complicated than VaDE, but its clustering performance is not significantly better.
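For the VaDE assignment rule in Eq. (8), the quantity $p(c|z)$ follows from the Bayesian formula applied to the mixture prior of Eq. (4). The sketch below assumes a uniform $p(c)$ and diagonal Gaussian components; the component parameters and the function names are illustrative, and a single sample of $z$ stands in for the expectation over $q(z|x)$.

```python
import math

def log_gaussian_diag(z, mean, var):
    # Log density of a Gaussian with diagonal covariance.
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (a - m) ** 2 / v)
               for a, m, v in zip(z, mean, var))

def cluster_posterior(z, means, variances):
    # p(c|z) is proportional to p(c) p(z|c), with p(c) = 1/K as in Eq. (4).
    K = len(means)
    log_joint = [math.log(1.0 / K) + log_gaussian_diag(z, m, v)
                 for m, v in zip(means, variances)]
    top = max(log_joint)  # subtract the maximum for numerical stability
    weights = [math.exp(l - top) for l in log_joint]
    total = sum(weights)
    return [w / total for w in weights]

# Two illustrative mixture components in a 2-D latent space.
means = [[0.0, 0.0], [4.0, 4.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

post_near_first = cluster_posterior([0.2, -0.1], means, variances)
post_midpoint = cluster_posterior([2.0, 2.0], means, variances)
assigned_cluster = max(range(len(means)), key=lambda c: post_near_first[c])
```

Averaging `cluster_posterior` over several samples of $z$ drawn from $q(z|x)$ would give a Monte Carlo estimate of $q(c|x)$ in Eq. (8).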

4. Pairwise constraint algorithm for deep clustering

The label information (positive labels and negative labels) reflects the membership relationship between a data object and a class label. The partition matrix $Y$ corresponding to the positive label information has been defined in Eq. (3.1). The negative label information reflects which class label a data object does not belong to. We use $Y^{-}\in R^{n\times k}$ to represent the partition matrix of negative label information, which is defined as follows:

$\displaystyle Y_{ij}^{-}=\left\{\begin{array}[]{cl}-1,&\text{if}\ x_{i}\text{'s label is not}\ y_{j}\\ 0,&\text{otherwise}\end{array}\right.$ (10)

Figure 6.

The process of transforming a matrix into a pair-wise relationship matrix.

Pairwise constraints include must-link and cannot-link constraints, which reflect the relationship between data objects. Let $M=\{(x_{i},x_{j}):y_{i}=y_{j},1\leqslant i,j\leqslant n\}$ denote the set of must-link constraints and $C=\{(x_{i},x_{j}):y_{i}\neq y_{j},1\leqslant i,j\leqslant n\}$ the set of cannot-link constraints, where $y_{i}$ and $y_{j}$ are the class labels corresponding to data objects $x_{i}$ and $x_{j}$ respectively. We use the $n\times n$ matrix $A$ with entries $A_{ij}$ to represent the pairwise constraints, as shown below:

$\displaystyle A_{ij}=\left\{\begin{array}[]{ll}1,&\text{if}\ (x_{i},x_{j})\in M,\\ -1,&\text{if}\ (x_{i},x_{j})\in C,\\ 0,&\text{otherwise}\\ \end{array}\right.$ (11)
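The constraint matrix of Eq. (11) can be built directly from partially labeled data. In this minimal sketch an unlabeled object is marked with `None`, which is a convention chosen for the example; the function name `pairwise_constraints` is also hypothetical.

```python
def pairwise_constraints(labels):
    """Build the matrix A of Eq. (11).

    labels[i] is the class label of x_i, or None when x_i is unlabeled.
    """
    n = len(labels)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if labels[i] is None or labels[j] is None:
                continue  # no constraint involving an unlabeled object
            # Same label: must-link (1); different labels: cannot-link (-1).
            A[i][j] = 1 if labels[i] == labels[j] else -1
    return A

labels = [0, 0, 1, None]
A = pairwise_constraints(labels)
# A == [[ 1,  1, -1, 0],
#       [ 1,  1, -1, 0],
#       [-1, -1,  1, 0],
#       [ 0,  0,  0, 0]]
```

The diagonal entries of labeled objects are 1 because the definition of $M$ allows $i=j$.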

The matrices $\textit{YY}^{T}$ and $\frac{1}{k-1}(Y^{-}Y^{-T})$ are the pairwise relationship representations of $Y$ and $Y^{-}$ , respectively, and $k$ represents the number of clusters in the data set. It is impossible to judge whether two data objects belong to the same category from their negative labels alone. Therefore, we use $A$ to reflect the probability that two data objects belong to the same category.

According to the definition of the pairwise relationship matrix, we redefine the cost function $T^{\prime}(F)$ of the label propagation algorithm as follows:

$\displaystyle T^{\prime}(F)=||\text{FF}^{T}-P||^{2}$ (12)

Here, $\text{FF}^{T}$ is the pairwise relationship representation of $F$ , and $T^{\prime}(F)$ represents the difference between the pairwise relationship given in advance and the pairwise relationship obtained by clustering. The new cost function solves the problem of non-alignment between the class labels given in advance and the cluster labels obtained by clustering. Figure 6 illustrates the advantage of the new cost function $T^{\prime}(F)$ : it shows that $\text{YY}^{T}$ is completely equivalent to $\text{FF}^{T}$ even when the label indices differ. Therefore, using the pairwise relationship matrix overcomes the non-alignment problem.
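The non-alignment argument can be checked on a toy example. The sketch below assumes one-hot partition matrices and, for illustration only, takes $P=YY^{T}$ built from positive labels; the data and the helper names are hypothetical.

```python
def one_hot(labels, k):
    # Partition matrix with one row per object and one column per class.
    return [[1 if y == c else 0 for c in range(k)] for y in labels]

def gram(Y):
    # Pairwise relationship matrix Y Y^T.
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in Y] for ri in Y]

def pairwise_cost(F, P):
    # Eq. (12): T'(F) = ||F F^T - P||^2 (squared Frobenius norm).
    G = gram(F)
    return sum((g - p) ** 2 for rg, rp in zip(G, P) for g, p in zip(rg, rp))

given = [0, 0, 1, 2]      # class labels given in advance
clustered = [2, 2, 0, 1]  # same partition, but the cluster indices are permuted

Y = one_hot(given, 3)
F = one_hot(clustered, 3)

# Comparing F with Y directly penalizes the permutation of indices ...
direct_mismatch = sum((f - y) ** 2 for rf, ry in zip(F, Y) for f, y in zip(rf, ry))
# ... while the pairwise cost does not, because F F^T equals Y Y^T.
cost = pairwise_cost(F, gram(Y))
```

Here `direct_mismatch` is 8 although the two partitions are identical, while `cost` is 0.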

4.1 Algorithm optimization

In this paper, stochastic gradient descent (SGD) and back propagation are used to optimize the joint objective function $L$ . Two sets of parameters are optimized or updated: the embedded feature $z_{i}$ of each data point and the cluster centers $\mu_{j}$ . The gradient of the loss function $L$ with respect to $z_{i}$ can be calculated as:

$\displaystyle\frac{\partial L}{\partial z_{i}}=2\sum\limits_{j=1}^{k}(1+||z_{i}-\mu_{j}||^{2})^{-1}\times(p_{ij}-q_{ij})(z_{i}-\mu_{j})-2\lambda a_{i}\sum\limits_{j=1}^{k}(1+||z_{i}-\mu_{j}||^{2})^{-1}\times\left(1-\frac{q_{ij}}{p_{ij}}\right)(z_{i}-\mu_{j})$ (13)

Table 1

Clustering results on the ACC metric

Method	MNIST	USPS	REUTERS-10K
K-means	0.5294	0.6564	0.5163
DEC	0.841	0.7405	0.7245
DCN	0.810	0.71	0.7503
INDC	0.8805	0.7504	0.7534
SMKL	0.782	0.6814	0.7320
SDEC	0.8601	0.7538	0.6957
S$^{3}$C	0.8261	0.846	0.6746
Semi-DEC	0.9547	0.8605	0.9024

Table 2

Clustering results of NMI metric

Method	MNIST	USPS	REUTERS-10K
K-means	0.4945	0.61	0.4923
DEC	0.8371	0.7526	0.4974
DCN	0.754	0.716	0.4103
INDC	0.8671	0.7845	0.4980
SMKL	0.6841	0.7103	0.4018
SDEC	0.8154	0.7658	0.423
S$^{3}$C	0.7549	0.7845	0.4334
Semi-DEC	0.9451	0.8651	0.7634

The gradient of the loss function L relative to the cluster center $\mu_{j}$ can be calculated as:

$\displaystyle\frac{\partial L}{\partial\mu_{j}}=-2\sum\limits_{i=1}^{n}(1+||z_{i}-\mu_{j}||^{2})^{-1}\times(p_{ij}-q_{ij})(z_{i}-\mu_{j})+2\lambda\sum\limits_{i=1}^{n}a_{i}y_{ij}(1+||z_{i}-\mu_{j}||^{2})^{-1}\times\left(1-\frac{q_{ij}}{p_{ij}}\right)(z_{i}-\mu_{j})$ (14)

In the back propagation process, the parameters $\{W_{e},b_{e}\}$ of the deep coding network are updated by passing the gradient $\partial L/\partial z_{i}$ downward. The cluster center $\mu_{j}$ is optimized by $\partial L/\partial\mu_{j}$ . When the difference between two consecutive clustering assignments is less than the threshold tol% or the number of training iterations reaches the set maximum, the clustering algorithm stops.
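The clustering part of these gradients (the first sums in Eqs. (13) and (14)) can be checked numerically. This section does not define $q_{ij}$ and $p_{ij}$, so the sketch assumes the definitions commonly used in DEC: a Student's t soft assignment for $q_{ij}$ and a sharpened target distribution for $p_{ij}$, with the clustering loss $\textit{KL}(P\,||\,Q)$ and $P$ held fixed. The label-loss terms weighted by $\lambda$ are omitted, and all function names are illustrative.

```python
import math
import random

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def soft_assign(Z, M):
    # q_ij proportional to (1 + ||z_i - mu_j||^2)^-1 (assumed, as in DEC).
    Q = []
    for z in Z:
        w = [1.0 / (1.0 + sqdist(z, mu)) for mu in M]
        s = sum(w)
        Q.append([v / s for v in w])
    return Q

def target_distribution(Q):
    # p_ij proportional to q_ij^2 / sum_i q_ij (assumed, as in DEC).
    f = [sum(row[j] for row in Q) for j in range(len(Q[0]))]
    P = []
    for row in Q:
        w = [q * q / fj for q, fj in zip(row, f)]
        s = sum(w)
        P.append([v / s for v in w])
    return P

def kl_loss(P, Q):
    return sum(p * math.log(p / q) for rp, rq in zip(P, Q) for p, q in zip(rp, rq))

def grad_z(Z, M, P, Q, i):
    # First sum of Eq. (13).
    g = [0.0] * len(Z[i])
    for j, mu in enumerate(M):
        c = 2.0 * (P[i][j] - Q[i][j]) / (1.0 + sqdist(Z[i], mu))
        for d in range(len(g)):
            g[d] += c * (Z[i][d] - mu[d])
    return g

def grad_mu(Z, M, P, Q, j):
    # First sum of Eq. (14), taken over all data points i.
    g = [0.0] * len(M[j])
    for i, z in enumerate(Z):
        c = -2.0 * (P[i][j] - Q[i][j]) / (1.0 + sqdist(z, M[j]))
        for d in range(len(g)):
            g[d] += c * (z[d] - M[j][d])
    return g

rng = random.Random(0)
Z = [[rng.gauss(0, 1) for _ in range(2)] for _ in range(6)]  # embedded points
M = [[rng.gauss(0, 1) for _ in range(2)] for _ in range(3)]  # cluster centers
Q = soft_assign(Z, M)
P = target_distribution(Q)

def numeric_grad_z(i, d, eps=1e-6):
    # Central finite difference of the KL loss with P held fixed.
    Zp = [row[:] for row in Z]
    Zm = [row[:] for row in Z]
    Zp[i][d] += eps
    Zm[i][d] -= eps
    return (kl_loss(P, soft_assign(Zp, M)) - kl_loss(P, soft_assign(Zm, M))) / (2 * eps)

analytic = grad_z(Z, M, P, Q, 0)
numeric = [numeric_grad_z(0, d) for d in range(2)]
```

Agreement between `analytic` and `numeric` indicates that, under these assumptions, the first sum of Eq. (13) is the gradient of the clustering loss with respect to $z_{i}$.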

Table 3

Clustering results of labeled data with different proportions

Data set	1%		2%		5%		10%
	ACC	NMI	ACC	NMI	ACC	NMI	ACC	NMI
MNIST	0.808	0.772	0.812	0.782	0.842	0.824	0.882	0.879
USPS	0.747	0.754	0.757	0.773	0.775	0.783	0.786	0.806
REUTERS-10K	0.750	0.503	0.756	0.13	0.768	0.553	0.793	0.83
	20%		30%		40%		50%
	ACC	NMI	ACC	NMI	ACC	NMI	ACC	NMI
MNIST	0.921	0.917	0.966	0.947	0.966	0.950	0.977	0.953
USPS	0.806	0.848	0.862	0.885	0.885	0.882	0.890	0.879
REUTERS-10K	0.864	0.69	0.919	0.765	0.956	0.830	0.958	0.833

Table 4

Classification accuracy of three data sets

Data set	MNIST	USPS	REUTERS-10K
Classification accuracy	0.972	0.931	0.949

Figure 7.

Clustering accuracy with different proportions of labeled data on MNIST, USPS and REUTERS-10K.

5. Experimental evaluation

5.1 Experimental parameter settings

The encoder layer structure of the deep coding network is set to d-500-500-2000-10 for all data sets, where d is the dimension of the input data. All layers are fully connected, and the internal layers (all except the input layer, embedded layer and output layer) use the ReLU nonlinear activation function. For each data set, a sample label list A is generated dynamically according to whether the samples have label information. The length of the list equals the size of the batch obtained each time. If a sample point has a real label, its corresponding element in list A is 1; if it has no real label, its corresponding element is 0. The learning rate of SGD is 0.01. The convergence threshold tol% is set to 0.1%. After experimental testing, the trade-off parameter $\lambda$ of the label loss is set to 0.2 (determined by grid search in {0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0, 2.0, 5.0}). For all algorithms, the number of cluster centers K is set to the number of real categories in the data set. During the experiment, each algorithm was run independently 10 times, and the average results were taken.
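The settings above can be collected in a short configuration sketch. The constants restate the values given in the text; the function names, the use of `None` for a missing label and the `max_iter` argument (whose value is not stated in the text) are assumptions made for illustration.

```python
LEARNING_RATE = 0.01   # SGD learning rate
TOL = 0.001            # convergence threshold tol% = 0.1%
LAMBDA = 0.2           # trade-off parameter of the label loss
LAMBDA_GRID = [0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0, 2.0, 5.0]

def encoder_dims(d):
    # Encoder layer structure d-500-500-2000-10.
    return [d, 500, 500, 2000, 10]

def label_indicator(batch_labels):
    # Sample label list A: 1 if the sample has a real label, 0 otherwise.
    # A missing label is represented by None (a convention of this sketch).
    return [0 if y is None else 1 for y in batch_labels]

def should_stop(prev_assign, curr_assign, iteration, tol, max_iter):
    # Stop when the fraction of points whose cluster assignment changed is
    # below tol, or when the maximum number of iterations is reached.
    changed = sum(p != c for p, c in zip(prev_assign, curr_assign)) / len(curr_assign)
    return changed < tol or iteration >= max_iter

dims_mnist = encoder_dims(784)           # 784 = 28 x 28 pixels for MNIST images
mask = label_indicator([3, None, 1])     # [1, 0, 1]
```

In a training loop, `should_stop` would be evaluated after each round of cluster assignment.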

Figure 8.

The trend of accuracy and model loss with the number of iterations.

Figure 9.

The impact of the trade-off parameter $\lambda$ on clustering performance.

5.2 Analysis of experimental results

This section presents experimental results on three representative data sets. Tables 1 and 2 report the results on the two evaluation metrics, ACC and NMI, respectively. The labeled data account for 30% of the total data. The best results are highlighted in bold in the tables. Compared with the traditional K-means and SMKL methods, the method in this paper learns more representative features through the deep coding network. Moreover, K-means is an unsupervised method that cannot use label information in the clustering process, which further reduces its performance. The proposed method (Semi-DEC) achieves the best results on all three data sets for both metrics.
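The text does not spell out how ACC and NMI are computed, so the following sketch uses common definitions as an assumption: ACC is the accuracy under the best one-to-one mapping between cluster indices and class labels (found here by brute force, which is only practical for a small number of clusters), and NMI is the mutual information normalized by the arithmetic mean of the two entropies. The toy labels are illustrative.

```python
import math
from collections import Counter
from itertools import permutations

def clustering_accuracy(y_true, y_pred):
    classes = sorted(set(y_true))
    clusters = sorted(set(y_pred))
    # Pad with dummy classes so that every cluster can map to a distinct class.
    classes = classes + [("dummy", i) for i in range(len(clusters) - len(classes))]
    counts = Counter(zip(y_pred, y_true))
    best = 0
    for perm in permutations(classes, len(clusters)):
        hits = sum(counts[(c, t)] for c, t in zip(clusters, perm))
        best = max(best, hits)
    return best / len(y_true)

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def nmi(y_true, y_pred):
    n = len(y_true)
    joint = Counter(zip(y_true, y_pred))
    ct, cp = Counter(y_true), Counter(y_pred)
    mi = sum((c / n) * math.log(c * n / (ct[t] * cp[p]))
             for (t, p), c in joint.items())
    denom = 0.5 * (entropy(y_true) + entropy(y_pred))
    return mi / denom if denom > 0 else 1.0

y_true = [0, 0, 1, 1, 2, 2]
perfect = [1, 1, 0, 0, 2, 2]   # same partition with permuted cluster indices
one_error = [1, 1, 0, 2, 2, 2]  # one object placed in the wrong cluster

acc_perfect = clustering_accuracy(y_true, perfect)
nmi_perfect = nmi(y_true, perfect)
acc_one_error = clustering_accuracy(y_true, one_error)
nmi_one_error = nmi(y_true, one_error)
```

For data sets with many clusters, the brute-force search would normally be replaced by the Hungarian algorithm, which solves the same assignment problem in polynomial time.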

Figure 10.

Runtime analysis.

Similarly, in order to evaluate the impact of prior knowledge on the performance of Semi-DEC, the proportion of labeled data in the training sample is increased from 1% to 50%.

Each experiment was conducted 10 times, and the average results are shown in Table 3. Table 4 shows the results of the classification task when using the same network structure as Semi-DEC. As shown in Table 3 and Fig. 7, two conclusions can be drawn. First, as the number of labeled samples increases, ACC and NMI on the three data sets generally increase. In particular, on the MNIST data set with 50% labeled data, ACC and NMI reach 97.7% and 95.3% respectively (Table 3). Second, when the proportion of labeled data reaches 50%, the clustering performance of Semi-DEC on the three data sets is roughly comparable to the classification accuracy. This further demonstrates the performance of Semi-DEC.

To further verify the effectiveness of this method, we carried out experiments on several aspects: the impact of different proportions of labeled data on performance (Fig. 7), the evolution of the loss function and accuracy during training, the impact of the trade-off parameter on clustering performance, and run-time analysis.

For the effect of different proportions of labeled data on performance, Fig. 7 shows the trend of clustering accuracy on the MNIST, USPS and REUTERS-10K data sets. The dotted line represents the classification accuracy obtained through multiple experiments when the classification model uses the same network structure as the Semi-DEC algorithm. Figure 7 shows that on the MNIST and REUTERS-10K data sets, as the proportion of labeled data gradually increases, the clustering performance of Semi-DEC comes very close to that of the classification model using all label information under the same network structure. Although there is a gap between the clustering performance of Semi-DEC and that of the classification model on the USPS data set, the difference is not large, which reflects the effectiveness of this method in improving clustering performance.

Figure 8 records how the loss function and accuracy change as the number of training iterations increases. It can be seen from the figure that after a certain number of iterations, both the loss value and the clustering accuracy on each data set gradually stabilize, which indicates that the method converges stably.

To understand more precisely how the label loss trade-off parameter $\lambda$ affects the performance of the algorithm, multiple experiments were conducted on the three data sets, with $\lambda$ sampled in the range [0.01, 5.0]. Figure 9 shows the experimental results. It can be clearly seen from Fig. 9 that the performance of the Semi-DEC method remains relatively stable over a relatively wide range of $\lambda$ , namely from 0.05 to 2.0. The main reason is that the semi-supervised loss occupies a dominant position in the overall loss function of the method. When the value of $\lambda$ is 0.2, the performance is also close to optimal.

Regarding the running time, Fig. 10 records the running time comparison between the Semi-DEC method in this paper and the DEC method. Since the method in this paper is a further improvement of the DEC method, only DEC is selected for the running time comparison. It can be seen that the method in this paper consumes more time than the DEC algorithm during training. This is because the label adaptive strategy is added to the Semi-DEC method, and the label loss needs to be calculated after each round of cluster assignment during training, so more time is required. However, this paper considers the limited increase in training time worthwhile, because the clustering performance of the method is significantly improved.

6. Conclusion

To handle massive amounts of data efficiently and to learn the feature representation space and the cluster assignment jointly with deep learning, an adaptive semi-supervised clustering algorithm for big data based on deep learning is proposed. Starting from the deep clustering network structure, the autoencoder network and the generative adversarial network are analyzed, the autoencoder network is deepened, an adaptive semi-supervised clustering algorithm is proposed, and the algorithm is then optimized. Experimental parameters are set for the different data sets, and the results on MNIST, USPS and REUTERS-10K are compared. The results show that this method performs better than traditional clustering methods and has significance and reference value for research on clustering.

References

1. Qin, Ding. Survey of Semi-supervised Clustering. Computer Science. 2019; 46(9): 15-21.

2. Peng, Shi. Interactive Image Segmentation Using Geodesic Appearance Overlap Graph Cut. Signal Processing: Image Communication. 2019; 78(9): 159-170.

3. Zhang, You, et al. Adaptive Semi-supervised Classifier Ensemble for High Dimensional Data Classification. IEEE Transactions on Cybernetics. 2019; 49(2): 366-379.

4. Tang, Liao. A Semi-supervised Clustering Method Based on AP Algorithm. Electronic Warfare Technology. 2017; 32(1): 8-12.

5. Chai, et al. Semi-supervised K-means Clustering Algorithm Based on Active Learning Priors. Journal of Computer Applications. 2018; 38(11): 3139-3143.

6. Zoidi, Tefas, Nikolaidi, et al. Positive and Negative Label Propagations. IEEE Transactions on Circuits and Systems for Video Technology. 2018; 28(2): 342-355.

7. Yin, Chen, et al. Semi-supervised Clustering with Metric Learning: An Adaptive Kernel Method. Pattern Recognition. 2010; 43(4): 1320-1333.

8. Zhang, Liu, et al. Attributed Graph Clustering via Adaptive Graph Convolution. arXiv preprint arXiv:1906.01210, 2019.

9. Liu. Research on the Memetic Algorithm Research on Multimodal Function Optimization. JCIT: Journal of Convergence Information Technology. 2012; 7(18): 464-472.

10. Ang, Sidiropoulos, et al. Towards K-means-friendly Spaces: Simultaneous Deep Learning and Clustering. International Conference on Machine Learning. PMLR. 2017; 3861-3870.

11. Liu. Application of Wireless Sensor Network Based Improved Immune Gene Algorithm in Airport Floating Personnel Positioning. Computer Communications. 2020; 160(160): 494-501.

12. Liu, Tao, et al. Partition Level Constrained Clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2018; 40(10): 2469-2483.

13. Gertrudes, Zimek, Sander, et al. A Unified View of Density-based Methods for Semi-supervised Clustering and Classification. Data Mining and Knowledge Discovery. 2019; 33(6): 1894-1952.

14. Liu. Fuzzy Self-adaptive Prediction Method for Data Transmission Congestion of Multimedia Network. Wireless Networks, Published: 13 August 2021.

15. Chen, Wang, et al. An Active Semi-supervised Clustering Algorithm Based on Seeds Set and Pairwise Constraints. Journal of Jilin University (Science Edition). 2017; 55(3): 664-672.

16. Hao. Cross-entropy Semi-supervised Clustering Based on Paired Constraints. Pattern Recognition and Artificial Intelligence. 2017; 30(7): 598-608.

17. Yang. Semi-supervised Spectral Clustering Based on Symbolic Network. Changsha: Hunan Normal University, 2019.

18. Cucuringu, Pizzoferrato, Gennip. An MBO Scheme for Clustering and Semi-supervised Clustering of Signed Networks. 2021; 68(4): 101-109.

19. Gallego, Calvo-Zaragoza, Valero-Mas, et al. Clustering-based K-nearest Neighbor Classification for Large-scale Data with Neural Codes Representation. Pattern Recognition. 2021; 74: 531-543.