Bert-CK: A study of user profile classification based on Bert and CK-means+ fusion

Abstract

In traditional user portrait construction methods, static word vectors can extract only shallow semantic representations, which cannot manage word polysemy. Moreover, the common clustering algorithm K-means has the problems of initial K values and unstable initial centroid selection. A Bert-CK model based on Bert and CK-means+ is proposed. First, Bert is used to extract semantic and syntactic text features at various levels, and word vectors and sentence vectors are obtained according to the context. Then, the CK-means+ algorithm is improved based on canopy and mean calculation. Next, the K value and initial centroid are determined. The sentence vectors are input to CK-means+ to obtain user classification and topic features. Finally, semantic features and topic features are fused and classified. CK-means+ is evaluated on the Sogou user portrait dataset. The experimental results verify that Bert-CK is better than the baseline model.

Keywords

User profile bert canopy K-means text classification

1 Introduction

User portrait technology is widely used in e-commerce [1], personalized recommendation [2, 3], demand forecasting [4] and other fields to realize the automatic matching of resources and differentiated needs of users. Therefore, the construction of accurate user portraits has aroused widespread concern in society. User profiling is the process of labelling users by acquiring their attributes through classification [5] or clustering algorithms [6]. Clustering algorithms gather similar users and extract textual topic features. The word vector model converts the user’s text data into vectors.

After analysing the mainstream methods for user portraits, we found that most scholars are still accustomed to using static word vector models such as Term Frequency and Inverse Document Frequency(TF-IDF), Word to Vector(Word2vec) [7] and Document to Vector (Doc2vec) when constructing user portraits. The models obtain only shallow representations and do not resolve the problem of multiple word meanings, resulting in limited user portrait classification accuracy. Clustering is an unsupervised algorithm that can partition most unlabelled data into various types based on the similarity between samples in user portraits. The K-means algorithm has become the most commonly used clustering algorithm for building user portraits because of its simple principle, easy implementation, and efficient operation [8]. However, the K-means algorithm suffers from the problems of unstable selection of cluster number K values and initial centroids. These problems lead to low user classification accuracy [9] and inaccurate topic label extraction [10, 11] when constructing user portraits. Deficiencies in both word vector techniques and clustering techniques have challenged the user portrait construction accuracy.

With the development of Natural Language Processing(NLP), pretrained models have emerged for major text processing tasks, classically represented by bidirectional encoder representations from transformers (Bert) [12]. Bert is first pretrained on a large text corpus and then applied to a variety of text processing tasks by fine-tuning. Semantic and syntactic text features are captured from different levels, and its pretraining task can address the problem of multiple word meanings. Therefore, combining the features of Bert’s model to build user portraits can help improve user portrait accuracy.

Therefore, to improve user portrait construction accuracy, a user portrait model based on Bert and CK-means+ is proposed for both the word vector level and the clustering level. The education, age and gender attributes of users are predicted from their search records on the Sogou user portrait dataset. The K-means is improved based on canopy and mean calculation, and the CK-means+ algorithm is proposed [13]. The canopy algorithm is first used to coarsely cluster the dataset and obtain the K value. Then, the initial centroids are obtained by canopy and mean calculation.

The main contributions of this article are as follows:

* The Bert algorithm is introduced to address the problem that static word vectors can extract only shallow representations and cannot resolve the problem of multiple meaning words.

* For the problem that the K value and initial centroid selection of the clustering algorithm K-means are unstable, the CK-means+ algorithm based on canopy and mean value calculation is proposed.

* To address the problem that the word vector lacks topic features, the CK-means+ clustering algorithm is introduced to add topic features and to cluster the users.

* We design and implement the construction of the Bert-CK user portrait model to achieve the prediction of basic user attributes by making full use of textual features and semantic features of data through data preprocessing, data enhancement and semantic vector fusion.

2 Related works

Clustering algorithms are widely used in the user portrait construction task [14] improved the K-means clustering algorithm to obtain the overall control category of power users, thereby evaluating the overall picture of user information and constructing a portrait of electricity users. [15] combined the strong complementarity of the K-means++ algorithm and the mini-batch K-means algorithm to improve the analysis and processing capabilities of the model. [16] used the Linear Discriminant Analysis(LDA) [17] algorithm to model the user’s interest points according to the number of Weibo users’ answers and messages. Scholars who use K-means to construct user portraits have improved certain efficiency but have not fundamentally improved the K-means algorithm. User text data processing is also very rough using only the TF-IDF algorithm for simple text processing. The user portrait results are not only low in accuracy but also very rough in granularity. Based on this, in this paper, we propose an improved CK-means+ algorithm and use the Bert neural network to extract multidimensional feature information from user data.

The word vector representation of a text is the basis of text classification tasks. The word embedding method first uses one-hot, TF-IDF and other models to discretely represent text. Later, static word embedding models based on shallow neural network structures such as word2vec and Global Vectors for Word Representation(GloVe) [18] trained continuous and dense word vector representations. Compared with that of the discrete representation methods, the dimensionality of word vectors is reduced, and the semantics of word vectors are enriched. The similarity between words can be captured. [19] and others have built user portraits based on the improved LDA and K-means algorithms to identify sensitive types of users in electricity payment. [20] and others obtained the data characteristics of user search records based on trigram, Doc2vec [21], Word2vec and Convolutional Neural Networks(CNN) models and analysed the user characteristics through eXtreme Gradient Boosting(XGBoost) [22] to construct user attribute portraits. The emergence of the static word embedding model has raised user portrait construction to a higher level, but its simple structure cannot resolve the problem that Chinese has different semantics in different contexts, nor can it learn the grammatical structure of Chinese. This makes user portrait construction under the Chinese dataset a bottleneck.

Subsequently, [12] proposed the Bert model in 2018, which uses a bidirectional multilayer transformer encoder structure for pretraining on a large text corpus. Fine-tuning the Bert model in downstream NLP tasks can realize the dynamic representation of word vectors in various contexts, thereby addressing the polysemy problem in Chinese words. [23] proposed a Bert-based user portrait algorithm to predict user attributes. [24, 25] constructed user portraits based on Bert and XGBoost. Several experimental results show that Bert has higher accuracy than static models such as word2vec and doc2vec. However, when studying the literature of other scholars, it was found that although scholars have used Bert to further improve the results, they have not considered user portrait construction overall and have not made targeted improvements. Scholars rely on Bert alone to extract the semantic features of the text, ignoring not only the overall theme meaning of the user text but also the clustering of similar users and the construction of group user portraits.

Based on the above research, we propose a Bert-CK user portrait model based on Bert and CK-means+. This method uses the Bert model to obtain the text semantic features and then uses the CK-means+ algorithm to cluster similar users to obtain topic features. Then, the topic features and semantic features are fused to obtain individual user portraits and group user portraits in parallel. Experimental results show that, compared with that of the baseline model, this model significantly improves the accuracy of user label prediction. This demonstrates the validity of the research in this paper and highlights the research and practical application value of this paper.

3 Bert-CK model

The Bert-CK user portrait model contains the following four layers, as shown in Fig. 1: (1) Word embedding layer: The text is preprocessed and input into the pretrained word vector model. Bert generates dynamic semantic vectors of the text to obtain word vectors and sentence vectors of the text. (2) Clustering layer: The sentence vectors are input to the CK-means+ algorithm, which clusters the sentence vectors, classifies users into different categories, and extracts the topic features of the text from them. (3) Semantic fusion layer: This layer fuses the semantic features and topic features, that is, splices the word vector and the topic vector. (4) Output layer: The spliced semantic vectors are fed into the fully concatenated layer and softmax classifier to predict the user’s education, age and gender labels.

Fig. 1

Bert-CK user profile model

3.1 Word embedding layer

Bert is a pretraining plus fine-tuning dual paradigm. The first pretraining task is the determination of masked words (MaskedLanguage Model, MLM), which learns contextually relevant bidirectional feature expressions. The second pretraining task is NextSentence Prediction (NSP), which allows the model to understand the relational features between sentences. The second step is the fine-tuning phase, where parameters are set for specific tasks. MLM randomly selects a certain percentage of lexical elements as masked lexical elements for prediction, and the model learns the masked lexical elements through the global context. Although MLM can encode bidirectional context to represent words, it cannot explicitly express the logical relationship between text pairs. NSP can be regarded as a sentence-level binary classification problem that mines the logical relationship between sentences by determining whether the latter sentence is a reasonable next sentence of the former sentence. Therefore, the Bert model achieves word vector representation and semantics by combining MLM and NSP feature extraction.

In Fig. 2, E1, E2, . . ., EN is the input sequence of the Bert model, Trm is the encoder model of the transformer, and T1, T2,. . .,TN. Trm represents the Transformer model and adopts an encoder decoder architecture, consisting of 6 encoder and 6 decoder layers stacked together. The encoder contains a self attention layer and a feedforward neural network. Decoder includes a self attention layer, an attention layer and a feedforward neural network.The input sequence of Bert is the sum of token embeddings, segment embeddings and positional embeddings, where the text sequence of token embeddings needs to start with the word element <cls> as the starting token. The input sequence model is shown in Fig. 3. Bert has a 12-layer encoder, which means that for each word, there are 12 corresponding 768-dimensional hidden state representations, and according to the Bert authors’ tests, the last four layers of the splicing vector perform best.

Fig. 2

Bert model.

Fig. 3

Input representation of BERT Model [12].

The length of a user’s monthly query record is very long, with an average length of 873.6. To cover more information, a data enhancement strategy is used to crop the length of the user query text data. Text below 1024 length is filled in, while lengths above 1024 are truncated. The maximum length that Bert can manage is 512, dividing the 1024 field into two 512-segment long fields and then inputting the fields into Bert. Then, the two subtexts share Bert, splicing the two 3-dimensional vectors of Bert output for the next processing step.

3.2 Clustering layer

3.2.1 Improvements to CK-means+

The canopy algorithm is an unsupervised preclustering algorithm that divides the data into multiple overlapping subsets and eliminates the spurious points and outliers of the dataset. The canopy algorithm is relatively insensitive to noise, which is more convenient and timesaving compared with the K-means algorithm and can obtain better results. K-means is sensitive to noise, and the introduction of canopy can effectively reduce the number of operations of the K-means algorithm, thus reducing the complexity of the K-means algorithm and improving its efficiency.

The Canopy algorithm clusters data stored in an array, manually setting the distances between T1 and T2, with T1>T2. Then, arbitrarily select a data point from the dataset as the center, calculate the distance d from other points to it, and classify according to the distance threshold to output K classes.

Step 1: The Canopy algorithm inputs the given set of data into matrix X, sets initial distance thresholds T1 and T2 (T1>T2), selects data point x_i as the starting midpoint C1, and stores the given data in the ArrayList set (candidate center points).

Step 2: Calculate the Euclidean distance d between other data x and C1 in the ArrayList, and compare d with the values of T1 and T2. When d<T1, divide this point into the Canopy set and; When d<T2, remove this point from the List set and no longer add it to other Canopy sets; Perform new operations on data with T2<d<T1, then update to calculate the distance between the point and the new center point, compare the distance and threshold, and classify it into categories.

Step 3: Repeat the for loop until the ArrayList set is empty.

In the canopy algorithm, the distance thresholds T1 and T2 are random, and the settings of T1 and T2 are usually decided empirically, which can lead to a very unstable accuracy and efficiency of the algorithm. If the thresholds are set too large, data from different categories can be grouped into one set; conversely, it causes data from the same category to be grouped into different categories. Therefore, our approach optimizes the canopy algorithm. The threshold T2 of the canopy algorithm has a certain influence on the result K; however, the threshold T2 is uncertain, so we propose a method to obtain the threshold T2 adaptively by finding the sum of the distances between points in the dataset.

In this paper, the initial centroids of K-means and the selection of the number of clusters K are improved by using canopy and mean value calculations. First, the threshold T2 of Canopy is optimized, and the total number of clusters K is obtained according to the canopy algorithm. Then, the average number of K clusters is obtained according to the mean calculation method. The data point closest to the average is taken as the initial centroid of the K-means algorithm. In this way, the value of K and the initial centroid of the K-means algorithm can be determined. Fig. 4 is the overall flow chart.

Fig. 4

Improved CK-means+ flow chart.

3.2.2 CK-means+ clustering sentence vectors

The mean function is set to average the semantic vector features of the penultimate layer of the Bert model as sentence vectors. The similarity between sentence vectors is calculated by CK-means+. The dimension of the sentence vector is [seq-length, hidden-size*2], so there is no need to reduce the dimension when clustering.

If the entire dataset D contains data from n users, each with a corresponding sentence vector d_i, then the dataset D=[d₁,d₂,...,d_n]. Calculate the cosine similarity of the sentence vector d_i as shown in Eq. (1). divide the entire dataset into k classes, and the text corresponding to each class is written to the file. Additionally, considering the effectiveness of the clustering algorithm, the effectiveness of clustering needs to be measured based on the contour coefficient. The clustering ends when the number of iterations is maximum.

Unlike the CK-means+ algorithm in Chapter 3, the similarity calculation method is changed from the Euclidean distance to the cosine value in this section in the face of high-dimensional sentence vectors, as shown in Eq. (1). The range of values [0,1] is taken, and the larger the value, the more similar they are. $S_{T} (d_{i} \cdot d_{i}) = \frac{\sum_{k = 1}^{n} w_{ik} \cdot w_{jk}}{\sqrt{\sum_{k = 1}^{n} (w_{ik})^{2}} \sqrt{\sum_{k = 1}^{n} (w_{jk})^{2}}}$ (1)

The class labels of each user, such as Cluster 0, Cluster 1, and Cluster 2, are obtained after the clustering is finished. We need labels to describe each cluster topic. This article uses TF-IDF calculation to select the 10 words with the highest word frequency in each cluster as labels for that cluster, as shown in formula 2.. Translate tags into word vectors through Bert to provide topic information for the next step of integrating semantic vectors. Summarize the search records of each category of users into corresponding documents, resulting in a total of K categories and K documents. The serial numbers of each document are 1, 2, 3... k; There are 1, 2, 3... x words in each document. TF represents word frequency, representing the number of occurrences of word x in this document. The more occurrences, the greater the importance of this word in this document.. IDF represents the frequency of anti document, representing the number of documents with word x appearing in K documents. The more documents contain the word x, the more common the word is and the less representative it is. The fewer documents containing the word x, the more unique the word, and the more representative it is of this document. The essence of TF-IDF is to weaken the common words that appear in most documents and screen out the truly high-frequency words that are most representative of the current document. $TF - IDF = TF (x) \times IDF (x)$ (2) $TF (x) = \sum_{k = 1}^{n} {tf}_{xk}$ (3) $IDF (x) = log (\frac{K}{{df}_{k}})$ (4) where tf_xk is the number of occurrences of word x in document k, K is the total number of documents, and df_x is the number of documents containing word i.

3.3 Semantic fusion layer

After obtaining the word vector output by Bert and the topic vector output by CK-means+, the topic vector and the semantic vector are spliced. There are two common splicing methods. The first is cat splicing, where vectors with the same shape are added directly without changing the dimensionality of the vectors. The second is stack splicing, which splices vectors with the same shape in an expanded dimension. Because stack splicing causes the vector dimension to become larger, which may cause dimensional disaster, this research utilizes cat splicing.

3.4 Output layer

Connecting the fully connected layer with the softmax classifier, ReLU is used as the activation function to mitigate the problem of model overfitting when using dropout. The value can be calculated by using the following mathematical formula. $o = f (Ws + b)$ (5) $g = softmax (o)$ (6)

where s is the feature representation of the text, o is the output of the fully connected layer, b is the bias term, W is the weight matrix, and g is the output vector of the last model. softmax calculates the probability formula for each attribute as follows. $P (y = k | x) = \frac{e^{{xw}_{k}}}{\sum_{i = 1}^{k} e^{{xw}_{i}}}$ (7) The update method of the Bert-CK model is gradient descent, the loss function is cross-entropy, and the cross-entropy formula is as follows. $Loss = - \frac{1}{N} \sum_{i = 0}^{N - 1} \sum_{k = 0}^{K - 1} y_{i, k} log p_{i, k}$ (8) where (i, k) is the sample point and k is the true label. It is assumed that there are K label values, p denotes the probability, and N denotes the number of samples.

4 Experiment and analysis

4.1 Dataset

The dataset is provided by Sogou and the Chinese Computer Society and is the dataset from the CCF "Sogou User Portrait Mining" competition [26]. The dataset contains 100,000 search results of users and their gender, age, and education labels. The contest evaluation criterion is the average of the accuracy of the three labels of gender, age and education of the user data.The dataset includes 100000 pieces of data, which are divided into 80000 training sets, 10000 testing sets, and 10000 validation sets based on experiments.The format of the data is<ID, education, age, gender, query term>.

Table 1
User profiling dataset

Field Description

ID Encrypted ID

Age 0:unknown age; 1:0-18 years old;

2: 19-23 years old; 3: 24-30 years old;

4: 31-40 years old;5: 41-50 years old;

6: 51-99 years old

Gender 0: unknown gender 1: male; 2: female

Education 0:unknown degree; 1:PhD 2:Master;

3:College student;4:High school student;

5:Junior high school student; 6:Elementary school;

Query List 2How to lose weight; gum pain why;

Rebirth of I am the Dragon Ao Tian;

how to improve concentration.

Field	Description
ID	Encrypted ID
Age	0:unknown age; 1:0-18 years old;
	2: 19-23 years old; 3: 24-30 years old;
	4: 31-40 years old;5: 41-50 years old;
	6: 51-99 years old
Gender	0: unknown gender 1: male; 2: female
Education	0:unknown degree; 1:PhD 2:Master;
	3:College student;4:High school student;
	5:Junior high school student; 6:Elementary school;
Query List	2How to lose weight; gum pain why;
	Rebirth of I am the Dragon Ao Tian;
	how to improve concentration.

4.2 Evaluation indicators

In this paper, the prediction of users’ education, age and gender labels are three tasks, which are trained and predicted separately. Among them, gender is dichotomous, and age and education are six categories. To measure the effectiveness of the model to manage the three tasks together, the accuracy and average accuracy are chosen as evaluation metrics in this paper. Accuracy is the ratio of the number of correctly classified samples to the total number of samples.If the accuracy rates of gender, education and age are expressed as Acc_gender,Acc_edu and Acc_age, respectively, then the average accuracy rate is calculated as shown in Equation 10. $Acc = \frac{TP + TN}{TP + TN + FP + FN}$ (9) $\bar{Acc} = \frac{{Acc}_{gender} + {Acc}_{edu} + {Acc}_{age}}{3}$ (10) Precision represents the ratio of the number of correctly retrieved samples to the total number of retrieved samples.Recall represents the ratio of the number of correctly retrieved samples to the number of samples that should be retrieved.The F1 value considers the precision rate and the recall rate at the same time so that the two can reach the highest level simultaneously to achieve a balance. For the multiclassification problem in this paper, the experiment uses Micro-F1 as an indicator. $P = \frac{TP}{TP + FP}$ (11) $R = \frac{TP}{TP + FN}$ (12) $\bar{F} 1 = \frac{2 * P * R}{P + R}$ (13) $\bar{Micro} - F 1 = \frac{\sum_{j = 1}^{L} 2 {TP}_{j}}{\sum_{j = 1}^{L} (2 {TP}_{j} + {FP}_{j} + {FN}_{j})}$ (14)

where L represents the number of category labels, TP represents the number of positive samples that are predicted to be positive, FP represents the number of positive samples that are predicted to be negative, and FN represents the number of negative samples that are predicted to be positive.

The silhouette coefficient is an important indicator of clustering effectiveness. Since this dataset has no clustering labels, it is impossible to determine the clustering accuracy with the true and predicted values, so the silhouette coefficient is chosen as an evaluation index. The contour relationship is determined by the degree of cohesion between the same clusters and the degree of separation between different clusters. The value interval of the contour coefficient is [-1, 1], and the larger the value, the closer the distance between the same class and the farther the distance between different classes, the better the clustering effect. Let sample point i. The calculation steps are (1) Cohesion degree a(i): the average dissimilarity of i vectors to other nodes within the same class. (2) Separation degree b(i): the minimum of the average dissimilarity of the i-vector to the nodes in other classes. (3) The contour coefficient of sample i is: $s (i) = \frac{b (i) - a (i)}{\max (a (i), b (i))}$ (15) (4) The average of s(i) for all sample points is then used as the contour coefficient of the whole experimental results, defined as S.

4.3 Experimental configuration

The PyTorch platform can automatically accumulate gradients and zero out the data after N batches have been calculated, i.e., the training batches are increased by a factor of N. Our experimental environment is as follows. The python version is 3.8; the deep learning framework is PyTorch 1.1.0; Video Cards is GeForce GTX 1080Tix2; and the CUDA version is 11.0.

The Bert model starts with a small learning rate of 5e-6, and the update method for the learning rate is exponential decay, with a decay rate of 0.98. The length of the input sentence is 512. The dimension of the word vector hidden layer is 768 dimensions. The batch size entered by the model is 128. The number of gradient accumulations is 4, which can increase the training batch without consuming more memory. The Python platform can automatically accumulate gradients and clear the data to zero after calculating N batches, which means the training batch is increased by N times. The number of epochs that the entire training set will be input during the training process is 30. To prevent overfitting, set the dropout rate to 0.5. The activation function is ReLU, the loss function is cross entropy, and the optimizer is Adam.

Table 2
Bert parameter setting

Parameters Parameter Value

learning _ rate 5e-6

sentence _ length 512

hidden _ size 768

batch _ size 128

gradient _ accumulation _ steps 4

epochs 30

dropout 0.5

Decline rate 0.98

activation function ReLU

Study rate update method Index decay

Loss function Cross-entropy function

Optimizer Adam

Parameters	Parameter Value
learning _ rate	5e-6
sentence _ length	512
hidden _ size	768
batch _ size	128
gradient _ accumulation _ steps	4
epochs	30
dropout	0.5
Decline rate	0.98
activation function	ReLU
Study rate update method	Index decay
Loss function	Cross-entropy function
Optimizer	Adam

4.4 Result

4.4.1 Word vector ablation experiment

To validate the effectiveness of the dynamic model, Bert was compared with three currently dominant word vector models: word2vec [7] (static), GloVe [27] (static), and ELMO [28] (dynamic). The word2vec word vector dimension was 300, the GloVe word vector dimension was 300, and the ELMO word vector dimension was 512. The first four groups are the results of the word vector model directly connected to the softmax classifier; the last four groups are the results of training after adding topic features by the CK-means+ algorithm. The experimental results are shown in Table 3.

Table 3
Accuracy of word vector model and clustering processing

Model Education Age Gender Average

Acc F1 Acc F1 Acc F1 Acc F1

DBOW+NN 54.36 - 55.70 - 81.32 - 63.79 -

DM+NN 51.63 - 53.48 - 78.20 - 61.60 -

MUPM 59.60 - 61.53 - 82.55 - 67.14 -

FastText 64.00 - 50.00 - 61.00 - 55.00 -

Word2Vec 58.81 58.83 49.82 49.82 72.70 70.32 60.44 59.66

GloVe 59.11 59.10 52.27 52.30 80.75 78.33 64.04 63.24

ELMO 61.49 61.49 57.29 57.29 82.63 80.78 67.14 66.52

Bert 63.78 63.78 58.92 58.90 82.68 80.93 68.46 67.87

Word2Vec-CK 59.67 59.69 50.68 50.73 73.56 71.91 61.30 60.78

GloVe-CK 61.22 61.23 53.34 53.34 81.86 80.60 64.98 65.06

ELMO-CK 63.38 63.37 59.65 59.68 83.30 80.43 68.24 67.83

Bert-CK 64.76 64.74 60.18 60.18 83.87 81.88 69.60 68.93

Model	Education	Age	Gender	Average
DBOW+NN	54.36	-	55.70	-	81.32	-	63.79	-
DM+NN	51.63	-	53.48	-	78.20	-	61.60	-
MUPM	59.60	-	61.53	-	82.55	-	67.14	-
FastText	64.00	-	50.00	-	61.00	-	55.00	-
Word2Vec	58.81	58.83	49.82	49.82	72.70	70.32	60.44	59.66
GloVe	59.11	59.10	52.27	52.30	80.75	78.33	64.04	63.24
ELMO	61.49	61.49	57.29	57.29	82.63	80.78	67.14	66.52
Bert	63.78	63.78	58.92	58.90	82.68	80.93	68.46	67.87
Word2Vec-CK	59.67	59.69	50.68	50.73	73.56	71.91	61.30	60.78
GloVe-CK	61.22	61.23	53.34	53.34	81.86	80.60	64.98	65.06
ELMO-CK	63.38	63.37	59.65	59.68	83.30	80.43	68.24	67.83
Bert-CK	64.76	64.74	60.18	60.18	83.87	81.88	69.60	68.93

From the experimental results in the first part of Table 3, we see that Bert-CK shows excellent performance compared with that of the baseline model. In addition, we also combine the proposed method with other baseline comparison models. In the third part of Table 3, we see that the performance of the baseline model and our proposed method have improved to varying degrees. The experimental results also verify the effectiveness of the proposed method.

Compared with word2vec, GloVe has an average accuracy increase of 3.6 and F1 improvement of 3.58; compared with word2vec-CK, GloVe-CK has an average accuracy increase of 3.68 and F1 improvement of 4.28. This is because training GloVe is based on the global corpus, while training word2vec is based on the local corpus, and the weight of GloVe can be mapped, while the weight of word2vec is fixed. Compared with word2vec, GloVe optimizes the weight function.

Compared with those of GloVe, ELMO has an average accuracy increase of 3.1 and F1 improvement of 3.28. Compared with those of GloVe-CK, ELMO-CK has an average accuracy increase of 3.26 and F1 improvement of 2.77. This is because ELMO can learn the semantics of words in various contexts, and ELMO adopts a two-layer LSTM structure. Compared with those from GloVe, it has multilayer extraction capabilities, so it can obtain higher indicators in text classification.

Compared with ELMO, Bert’s average accuracy rate increased by 1.32, and F1 increased by 1.35; compared with ELMO-CK, Bert-CK’s average accuracy rate increased by 1.36, and F1 increased by 1.1. This is because the ELMO model has only two layers of LSTM, and the internal parameters are not universal, so ELMO cannot retrieve the context representation of words at the same time, and the multilevel extraction ability is limited. Bert has a multilayer encoder and multilayer self-attention mechanism structure with strong parallel computing ability and can simultaneously utilize the deep semantic structure of context information.

Figures 6 show the changes in accuracy and loss values for Bert-CK and word2vec-CK on gender classification training. From Fig. 5, the accuracy of Bert-CK increases significantly, and the convergence of Bert-CK is faster, as seen in Fig. 6.

Fig. 5

Change in accuracy in gender classification.

Fig. 6

Change in loss value on gender classification.

4.4.2 Cluster ablation experiments

To demonstrate the effectiveness of CK-means+, ablation experiments were designed with the K-means [29] algorithm, CK-means algorithm and K-means++ [30] algorithm. CK-means is the algorithm with only canopy preprocessing, and the experimental results are shown in Table 4.

Table 4
Accuracy of the various clustering algorithms

Model Education Age Gender Average

Acc F1 Acc F1 Acc F1 Acc F1

Bert-K-means 63.64 63.64 59.05 59.08 82.48 80.12 68.47 67.61

Bert-CK-means 63.98 63.98 59.36 59.36 83.07 81.30 68.80 68.21

Bert-K-means++ 64.30 64.30 59.68 59.64 83.38 81.82 69.13 68.59

Bert-CK-means+ 64.76 64.74 60.18 60.18 83.87 81.88 69.60 68.93

Model	Education	Age	Gender	Average
Bert-K-means	63.64	63.64	59.05	59.08	82.48	80.12	68.47	67.61
Bert-CK-means	63.98	63.98	59.36	59.36	83.07	81.30	68.80	68.21
Bert-K-means++	64.30	64.30	59.68	59.64	83.38	81.82	69.13	68.59
Bert-CK-means+	64.76	64.74	60.18	60.18	83.87	81.88	69.60	68.93

From the experimental results in Table 4, among the four user portrait models, the Bert-CK-means+ model has the highest accuracy and F1 score, which demonstrates the superiority of the CK-means+ model in text classification tasks. Compared with that of K-means, CK-means+ achieves an accuracy rate of 1.13, and F1 increases by 1.32. Compared with that of the CK-means model, the accuracy rate increases by 0.8, and F1 increases by 0.72. Compared with that of the K-means++ model, the accuracy rate increases by 0.47, and F1 increases by 0.34. For the K-means algorithm, the most important thing is the selection of the K value and the initial centre point, so we analyse the reasons for the superior performance of CK-means+ from these two aspects.

From the perspective of the K value selection, the K value of K-means and K-means++ is random, while CK-means+ is preprocessed by the canopy algorithm, and the selection of the K value is more stable, so CK-means+ achieves the highest accuracy rate. As shown in Fig. 7, the K value of K-means is 28, the K value of K-means++ is 25, and the K value of CK-means and CK-means+ is 24. Both CK-means+ and CK-means are preprocessed by the canopy algorithm, so the K value is the same, but CK-means is not processed by the mean calculation method, and the selection of the initial centre point is random, so the CK-means accuracy is lower than that of CK-means+. K-means has the largest number of categories, and the choice of K value is the most unstable. This results in a large degree of dispersion between categories, which directly affects the clustering effect of the data; therefore, the K-means accuracy is the lowest.

Fig. 7

K value and contour coefficient of the clustering algorithm.

From the point of view of the selection of the initial centre point, the CK-means+ algorithm selects the initial centre point through the mean method. However, the initial centre points of the other three algorithms are random, so the accuracy of CK-means+ is the highest. The silhouette coefficients of the four algorithms are shown in Fig. 7. The silhouette coefficient of K-means is the smallest, and the clustering effect is the worst, so the classification effect of K-means is the worst. The silhouette coefficient of K-means++ is higher than that of CK-means, and the clustering effect is better, so the accuracy of K-means++ is higher than that of CK-means. CK-means+ has the largest silhouette coefficient value and the best clustering effect, so CK-means+ has the highest accuracy rate. Through the analysis of the above two aspects, the CK-means+ algorithm was shown to be effective for the adaptive improvement of the K value and the initial centre point.

4.4.3 Before and after clustering comparison analysis

From Table 3, all four word vector models have varying degrees of accuracy increase after the clustering process, and the degree of increase is related to the word vector models. Thus, to verify the influence of word vector models on the clustering effect and topic features, this research analyses the changes in the four groups of word vector models before and after the introduction of clustering, as well as the corresponding K values and profile coefficients.

In Fig. 8, blue represents the output result of word vectors being directly classified without clustering processing. The yellow color represents the output of the word vector after clustering and classification.From Fig. 8, after clustering, the average accuracy of word2vec increases by 0.86, GloVe increases by 0.94, ELMO increases by 1.1, and Bert increases by 1.14. Word2vec-CK improves the least, and Bert-CK improves the most. This is because the quality of word vectors affects the clustering judgement on the similarity of sentence vectors, which in turn affects the clustering results, and the quality of the word vector model further affects the quality of topic feature word vectors, thereby affecting the accuracy of the classification rate.

Fig. 8

Comparison of accuracy rates before and after clustering.

As shown in Fig. 9, word2vec-CK has the lowest contour coefficient value, and ELMO-CK and Bert-CK have the highest contour coefficients, which are similar. This indicates that CK-means+ is less effective in clustering word2vec word vectors, while it is more effective in clustering both ELMO and Bert word vectors. This is because word2vec and GloVe are both static models in which the expressions of words are fixed and do not change with the context, i.e., a word corresponds to a vector, while ELMO and Bert are dynamic models that can learn the meaning of each word according to the context, i.e., each meaning of each word corresponds to a vector, hence the similarity of ELMO and Bert sentence vectors. Therefore, the calculation of ELMO and Bert sentence vector similarity is more accurate.

Fig. 9

K values and contour coefficients for word vector clustering.

The above results show that the superiority of word vectors can affect the effect of clustering and the quality of topic features; the higher the quality of word vector expression is, the better the clustering effect and the higher the classification accuracy. In addition, the K value of all four groups of experiments is 24, which further proves the stability of CK-means+.

Through the above three sets of comparative experiments and analysis, we have shown that the Bert-CK model, by means of data preprocessing, introduction of Bert and CK-means+, can effectively improve the accuracy of user label classification. In the selection of the word vector model, Bert is superior to word2vec, GloVe and ELMO. With respect to the effect of introducing topics into the clustering algorithm, CK-means+ is superior to K-means, K-means++ and CK-means. For the different degrees of improvement of the word vector model before and after clustering processing, the quality of word vector expression is analysed on the clustering effect and topic. The influence of the quality of word vector representation on the clustering effect and topic features is analysed.

4.4.4 Application of user portrait in precision marketing

User portraits mainly serve the precision marketing business and precisely match diverse needs of users and enterprises. In the experimental results, we were able to obtain individual user portraits and group user portraits. This can help us analyse the current and potential needs of several types of users to achieve precise marketing. Table 5 shows some examples of clustered topic features. Figure 10 shows the number of people in different categories. Cluster 0 users are mostly women aged 18-35, and most of them are college students, accounting for 2121 people. Therefore, products such as beauty makeup, clothing, and food can be recommended for marketing. Women are the main consumption force, but the proportion of the number of people who shop on the web is relatively small, which shows that the method of shopping on the web is still not convenient and needs further improvement. Cluster 1 users are mostly men aged 25-30 who pay more attention to investment information such as funds and stocks, so more wealth management information in the financial industry can be recommended to them. Cluster 2 is mostly male college students aged 18-24 who are interested in electronic products, so tablets, Bluetooth headsets and other electronic products can be recommended to them. Therefore, based on the understanding of users, the network marketing strategies formulated are more directional and instructive, saving the time, cost and energy of enterprises and users and providing enterprises with more valuable marketing strategies.

Table 5
Example of theme tags

Theme Tags

Cluster 0 Merchants Goods Express Brand

Payment Online shopping Merchants Order

Payments official website

Cluster 1 U.S. China Capital Corporate

Market Capital Investment Capital

Fund Auto

Cluster 2 Mobile Phone Smart Hardware

Device Sports Health Usage

Efficacy Huawei

Cluster 3 Students Data Apps Mobile

Consulting Companies Training Dr.

University Movies

Theme Tags
Cluster 0	Merchants	Goods	Express	Brand
	Payment	Online shopping	Merchants	Order
	Payments	official website
Cluster 1	U.S.	China	Capital	Corporate
	Market	Capital	Investment	Capital
	Fund	Auto
Cluster 2	Mobile	Phone	Smart	Hardware
	Device	Sports	Health	Usage
	Efficacy	Huawei
Cluster 3	Students	Data	Apps	Mobile
	Consulting	Companies	Training	Dr.
	University	Movies

Fig. 10

Number of users of all types.

5 Conclusions

To address the problem that static word vectors cannot fully extract word representations and to resolve word polysemy in traditional user portrait model construction, we propose a Bert-CK user portrait model. The main idea is to introduce Bert to extract different levels of feature information of words and to manage the problem of word polysemy in the pretraining task. To improve the classification accuracy, CK-means+ is introduced to create topic features at the level of semantic features. It not only makes full use of word vectors and sentence vectors but also fully integrates semantic features and topic features, which effectively improves the accuracy of user profile classification. The shortcoming of this model is that it cannot better extract attribute features according to the downstream classification task, and the accuracy of the model may be improved by combining classical neural networks for text classification in the future.

Footnotes

Acknowledgements

This work is supported by the following funds: (1) National Natural Science Foundation of China (NSFC) project "Research on Key Technology of Air-Sky-Earth Multimodal Data Fusion for Digital Twin Agriculture" (62266043) (2) Major Project of High Resolution Earth Observation System of the National Defence Bureau of Science and Technology (95-Y50G37-9001-22/23) (3) Natural Science Foundation of the Autonomous Region: Research on Key Technology of Multisource Heterogeneous Data Fusion and Mining for Government Affairs (2021D01C083) (4) Autonomous Region Science and Technology Program Young Scientists Fund Project: Research on Operation Risk Assessment Method of Integrated Energy System Based on Deep Neural Network (2022D01C83)

References

Fan Wu , Ju Ren , Feng Lyu , Peng Yang , Yongmin Zhang , Deyu Zhang and Yaoxue Zhang , Boosting internet card cellular business via user portraits: A case of churn prediction, In IEEE INFOCOM 2022-IEEE Conference on Computer Communications, pages 640–649. IEEE, 2022.

Yuming Li , Johnny Chan , Gabrielle Peko and David Sundaram , A personalized harmful information detection system based on user portraits, 2022.

Huili Xue , Kongsheng Guo and Fen He , Research on personalized recommendation service of mobile library based on user portrait, In 2022 IEEE International Conference on Artificial Intelligence and Computer Applications (ICAICA), pages 689–692. IEEE, 2022.

Miao Liu and Liangliang Ben , Research on demand forecasting method of multi-user group based on big data, In Human Interface and the Management of Information: Applications in Complex Technological Environments: Thematic Area, HIMI 2022, Held as Part of the 24th HCI International Conference, HCII 2022, Virtual Event, June 26–July 1, 2022, Proceedings, Part II, pages 45–64. Springer, 2022.

Reina Alya Rahma , Radityo Adi Nugroho , Dwi Kartini , Mohammad Reza Faisal and Friska Abadi , Combination of texture feature extraction and forward selection for one-class support vector machine improvement in self-portrait classification, International Journal of Electrical and Computer Engineering 13(1) (2023), 425.

Chen

, Zhang

and Song

M.Q.

, areviewofuserprofilingresearch[j], Intelligence Science 037(004) (2019), 171–177.

Outperforming word2vec on analogy tasks with random projections.[j], CORR, page abs/1412.6616, 2014.

Zhao Lianyu , Sun Jigui and Liu

, , Research on clustering algorithm[j], Journal of Software 19(1) (2019), 48–61.

Habibie

, Kusuma

, Ma’Sum

M.A.

, et al., Design of intelligent k-means based on spark for big data clustering[c], International Workshop on Big Data and Information Security, IEEE, pages 89–96, 2017.

10.

Hua

, Wang

, Yin

, et al., Parallelizing k-means-based clustering on spark[c], International Conference on Advanced Cloud and Big Data, IEEE, pages 31–36, 2017.

11.

Tang

, Chen

, Li

, et al., A parallel random forest algorithm for big data in a spark cloud computing environment[j], IEEE Transactions on Parallel Distributed Systems 28(4) (2017), 919–933.

12.

Devlin

, Chang

M.W.

, Lee

, et al., Bert:pretraining ofdeep bidirectional transformers for language understandingj, Journal of Software, arX iv preprint arX iv:1810048052018.

13.

Nanfang

Z.E.

, Zhao

, Ma

T.H.

, Qian

Y.R.

, Shao

J.X.

and Xing

Y.N.

, Improved ck-means+ algorithm and parallel implementation[j], Journal of Software 43(05) (2022), 1240–1248.

14.

Huang Xiaohua Li Yefei Wang Guobin Miao and Jing Guangyao , Research on the construction method of electric power user portrait based on multi-source data fusion[j], Journal of Software 41(08) (2022), 93–96+125.

15.

Chen-Guang Wang , User portraits based on optimized k-means clustering algorithm [j], Science and Technology Innovation and Application, Journal of Software 12(18) (2022), 18–21.

16.

, Liu

, Niu

, et al., Microblog user interest modeling based on feature propagation[c], 2013 6th International Symposium on Computational Intelligence and Design(ISCID).IEEE Computer Society, 2013.

17.

Croft

W.B.

and Wei

, Lda-based document models for ad-hoc retrieval[c], SIGIR 2006 Proceedings of the 29th Annual International ACM SIGIR Conferenceon Research and Development in Information Retrieval, Seattle, Washington, USA.ACM, August (2006), 6–11.

18.

Chen Ming , Li Duojiao and He Chenwan , Text sentiment analysis based on glove model and united network[j], Journal of Physics, Conference Series 1748(3) (2021).

19.

Luo Shigang , Jiang Yuan , Wang Qiong , Ma Li , Wang Linxin and Zhou Shengcheng , Research on key technology of electricity bill payment user portrait based on improved word vector model[j], Power Information and Communication Technology 20(02) (2022), 42–48.

20.

Li Hengchao , Research on knowledge representation and model fusion in user portrait construction [d], Dalian University of Technology, 2017.

21.

Tomas Mikolov and Quoc Le

, Distributed representations of sentences and documents.[j], CoRR, page abs/1405.4053, 2014.

22.

Qin Rui , Ren Wenjing , Wen Guangrui , Zhang Zhifen and Huang Yiming , Xgboost-based on-line prediction of seam tensile strength for al-li alloy in laser welding: Experiment study and modelling[j], Journal of Manufacturing Processes 64 (2021).

23.

Qiu-Ping Zhang , Research on user profiling algorithm based on bert [d], Guangdong University of Technology, 2020.

24.

Haonan Jiang , Analysis and research on user profile construction algorithm based on user behavior data, 2020.

25.

Liu Chuang , Cai Guowei , Fang Yuan and Wang Yibo , Modified approach of manufacturer’s power curve based on improved bins and k-means++ clustering[j], Natural Science Edition 43(86-93) (2020).

26.

Big data precision marketing in sogou user portrait mining ccf competition dataset. https://www.datafountain.cn/competitions/239

27.

Chen Ming , Li Duojiao and He Chenwan , Text sentiment analysis based on glove model and united network[j], Journal of Physics: Conference Series 1748(3) (2021).

28.

Lino Juliana Arcanjo , Menezes David Guabiraba Abitbol de , Soares Jorge Barbosa , Furtado Vasco , Soares Junior Luiz , Farias Maria do Socorro Quintino , Lima Debora Lilian Nascimento , Pereira Eanes Delgado Barros , Holanda Marcelo Alcantara , Tomaz Betina Santos and Gomes Gabriela Carvalho , Elmo, a new helmet interface for cpap to treat covid-19-related acute hypoxemic respiratory failure outside the icu: a feasibility study.[j], Jornal brasileiro de pneumologia,publicacao oficial da Sociedade Brasileira de Pneumologia e Tisilogia 48(1) (2022).

29.

Nanfang Zhe Zhao Jingxia , Xing Yanni and Qian Yurong , A review of k-means initial centroid optimization research in spark environment[j], Computer Application Research 37(03) (2020), 641–647.

30.

Liu Chuang Cai Guowei , Fang Yuan and Wang Yibo , Modified approach of manufacturer’s power curve based on improved bins and k-means++ clustering[j], Sensors 22(21) (2022).