The short texts classification based on neural network topic model

Abstract

To address the low effectiveness of feature extraction for short texts, this paper proposes a short texts classification model based on an improved Wasserstein-Latent Dirichlet Allocation (W-LDA), a neural network topic model built on the Wasserstein Auto-Encoder (WAE) framework. W-LDA is improved in three ways. First, the Bag of Words (BOW) input to W-LDA is preprocessed by Term Frequency-Inverse Document Frequency (TF-IDF). Second, the prior distribution of potential topics in W-LDA is changed from the Dirichlet distribution to a Gaussian mixture distribution obtained by Variational Bayesian inference. Third, a sparsemax function layer is introduced after the hidden layer inferred by the encoder network to generate a sparse document-topic distribution with better topic relevance. The improved W-LDA is named the Sparse Wasserstein-Variational Bayesian Gaussian mixture model (SW-VBGMM). Finally, the document-topic distribution generated by SW-VBGMM is input to a Bidirectional Gating Recurrent Unit (BiGRU) for deep feature extraction and short texts classification. Experiments on three Chinese short texts datasets and one English dataset show that our model outperforms some common topic models and neural network models on four evaluation indexes of text classification (accuracy, precision, recall, F1 value).

Keywords

Short texts classification; neural network topic model; Variational Bayesian Gaussian mixture model (VBGMM); sparsemax; BiGRU (Bidirectional Gating Recurrent Unit)

1 Introduction

With the continuous development of intelligent technology, text has become increasingly important as the main carrier of information. Short texts have gradually become the focus of text processing, and they play a significant role in information search [1], social comments [2] and interest recommendation [3]. Short texts are texts of no more than 160 characters, such as news headlines, comments and document abstracts.

Short texts have the characteristics of sparse content, fast update speed, more abbreviations, and fewer words with practical meaning, so it is difficult to extract key features. Therefore, the key of short texts classification is to achieve effective feature extraction. At present, the methods of text feature extraction are mainly based on three types of models, including Vector Space Model (VSM), topic model and neural network. The VSM [4] extracts the features in the document and then simplifies it into vector operations in the vector space. Although the vector dimension of VSM has a clear meaning, the sparsity of the vector is very high when the amount of data is large. At the same time, VSM does not consider the relevance of text features. The topic model introduces the concept of “topic”, which has certain practical significance. Topic model assumes that the potential topics and words in the document should obey a certain probability distribution, so that the topic distribution representation of the document can be obtained after training. The most popular topic model is the Latent Dirichlet Allocation (LDA) proposed by Blei [5]. The text representation method based on neural network has developed rapidly since the emergence of the word2vec [6] algorithm. This method converts text into vectors at the word, sentence, or longer text level. At the same time, many neural network models have been proposed to extract the features of texts, such as Long Short-Term Memory (LSTM) [7] and Convolutional Neural Networks (CNN) [8].

Since short texts have fewer feature words, the generated topics are not clear and the relevance is poor when traditional topic models are applied to short texts. Therefore, many scholars have proposed different ideas to improve the effectiveness of traditional topic models when applied to short texts. Yan et al. [9] proposed the Biterm Topic Model (BTM), which improves the effectiveness of the LDA topic model when applied to short texts by extracting word pairs in the document, but it ignores the connection between multiple words. Lv et al. [10] expanded the content of the short texts by extracting appropriate words from the topic-word distribution matrix obtained by training the LDA topic model, and finally combined Support Vector Machine (SVM) for classification. Pang et al. [11] considered the relationship between Chinese microblog documents, and realized the effective application of the LDA topic model in the microblog short texts by aggregating multiple Chinese microblogs into one microblog document. Hu et al. [12] used external resources to expand short texts information to reduce the sparseness of short texts, and then selected representative topics from the topic distribution trained by online BTM as short texts features to complete classification. However, this method requires relatively high quality of external resources.

With the continuous development of deep learning technology, especially the successful application of the Variational Auto-Encoder (VAE) [13], the neural network topic model has developed rapidly. The main advantage of these neural network topic models is that inference is performed by a single forward pass of the neural network, without the Gibbs sampling or Variational Bayesian (VB) algorithms that traditional topic models use to iteratively infer the topic of each word. Miao et al. [14] first used VAE to construct the Neural Variational Document Model (NVDM). The NVDM assumes that the prior distribution of potential topics of the document is Gaussian, and the document-topic distribution is generated by an encoder network composed of a Multi-Layer Perceptron (MLP). Ding et al. [15] built on NVDM by using pre-trained word vectors to measure the semantic similarity between words and adding this measure to the NVDM optimization function, thereby improving the consistency of the topics generated by the model. Dieng et al. [16] proposed the embedded topic model (ETM) on the basis of VAE, which introduces word embeddings and establishes a topic vector for each topic. ETM assumes that the potential topics of the document follow the logistic-normal distribution, and the topic-word distribution is the product of topic vectors and word vectors, which improves the interpretability of the topics generated by the model.

However, the Kullback-Leibler (KL) divergence term in VAE pushes the posterior distribution of every sample to match the prior distribution, so the output of the VAE encoder retains little information about its input. This problem is called posterior collapse [17]. The Wasserstein distance avoids this problem of the KL divergence, which motivated the Wasserstein Auto-Encoder (WAE) [18]. Nan et al. [19] first proposed the neural network topic model Wasserstein-Latent Dirichlet Allocation (W-LDA) on the basis of WAE. W-LDA uses the Dirichlet distribution as the prior distribution of potential topics and uses Maximum Mean Discrepancy (MMD) to match the high-dimensional Dirichlet distribution. W-LDA avoids the problem of posterior collapse and obtains more consistent topics.

In practice, although the Dirichlet distribution is good at capturing sparse topics from documents, the relevance of the obtained topics is often low. In addition, W-LDA obtains the document-topic distribution through the softmax activation function after the hidden layer inferred by the encoder network, but the distribution generated by softmax is dense. Short texts contain few hidden topics, so a dense distribution often includes many unclear topic features. In response to these problems, this paper proposes a neural network topic model built on W-LDA and suited to feature extraction from short texts, the Sparse Wasserstein-Variational Bayesian Gaussian mixture model (SW-VBGMM), and combines SW-VBGMM with BiGRU to achieve effective short texts classification. The contributions of this paper are summarized as follows:

The Bag of Words (BOW) input to W-LDA is preprocessed by the Term Frequency-Inverse Document Frequency (TF-IDF) [20] to highlight the weight of keywords in the document, so that W-LDA can generate higher quality topic features.

On the basis of W-LDA, the prior distribution of potential topics in the model is changed from the Dirichlet distribution to the Variational Bayesian Gaussian Mixture Model (VBGMM) [21]. Because VBGMM is multimodal (it has multiple peaks), more relevant topic features are obtained.

The sparsemax function layer [22] is introduced after the hidden layer inferred by the W-LDA encoder network to replace the softmax function layer to generate a document-topic distribution suitable for sparse features of short texts.

The sparse document-topic feature distribution generated by SW-VBGMM is input to BiGRU (Bidirectional Gating Recurrent Unit) [23] for deep feature extraction, and finally the softmax layer is used for classification, so as to obtain better short texts classification effect.

The remainder of this paper is organized as follows: The second section introduces the technologies and models related to this research. The third section introduces the proposed SW-VBGMM in this paper, and the process of completing short text classification. The fourth section presents the experimental results and discusses the results. The fifth section concludes the paper.

2 Related work

2.1 Wasserstein auto-encoder (WAE)

WAE was proposed by Google in 2018 [18]. It uses Wasserstein distance instead of KL divergence to measure the difference between the true distribution of the sample and the fitted distribution generated by the decoder network. Since Wasserstein distance can measure the distance between any two distributions, WAE can learn more complex data distributions, and the quality of samples generated by WAE is significantly higher than that of VAE.

WAE assumes that each sample in the training data S is generated by a latent variable L in the latent space. WAE first samples L from a prior distribution P_L in the encoder network, and then generates the fitted distribution P_t of the training data S by the decoder network. In order to minimize the optimal transport distance between the fitted distribution P_t and the true distribution P_s of the training data S, WAE needs to minimize the following objective function:

$\inf_{Q(L | S)} E_{P_{s}} E_{Q(L | S)} [c(S, t(L))] + λ \cdot D_{L}(Q_{L}, P_{L})$ (1)

where the objective in Equation (1) consists of a reconstruction loss term and a regularization term. The reconstruction loss term uses the Wasserstein distance to measure the reconstruction error between the two distributions P_t and P_s, c is the cost function, and t is an arbitrary mapping function from L to S. The regularization term D_L (Q_L, P_L) is an arbitrary divergence between the aggregated posterior distribution $Q_{L} ≜ E_{P_{s}} Q(L | S)$ of the latent variable L and its prior distribution P_L, and λ is the regularization coefficient.

There are two different forms for the regularization term D_L (Q_L, P_L). The first is based on Generative Adversarial Networks (GAN) [24], which uses Jensen-Shannon (JS) divergence to measure the distance between distributions. The second is based on MMD [25], where D_L (Q_L, P_L) ≜ MMD_Z (Q_L, P_L), MMD uses a kernel function to map Q_L and P_L to a high-dimensional space and sample to obtain the expected values of the two distributions. The upper bound of the difference between all expected values of the two distributions is defined as the MMD distance. For the reproducing kernel function Z : L × L → R [18], the definition of MMD is as follows:

$MMD_{Z}(Q_{L}, P_{L}) = \left\| \int_{L} Z(L, \cdot) \, dP_{L}(L) - \int_{L} Z(L, \cdot) \, dQ_{L}(L) \right\|_{H_{Z}}$ (2)

where Z (L, ·) means that L is mapped to the high-dimensional space through the kernel function, and H_Z is the Reproducing Kernel Hilbert Space (RKHS) associated with the reproducing kernel function Z. MMD performs well in matching high-dimensional data and requires less computation than GAN.

2.2 Wasserstein-Latent Dirichlet Allocation (W-LDA)

Similar to other neural network topic models, W-LDA obtains the document-topic distribution from the input BOW through the inference of the encoder network. Decoder network samples the output of the encoder network to get the topic-word distribution, thereby completing the reconstruction of the input BOW.

The encoder network of the model is a Multi-Layer Perceptron (MLP), which maps the input BOW to the output layer with N neurons through the hidden layer, and then adds a softmax layer to obtain the document-topic distribution $φ \in S^{N - 1}$ . The purpose of the encoder network is to complete the inference: Q (φ | b) ≈ P (φ | b), where b represents BOW, and b_i represents the number of the i - th word in the document. The decoder network is a single-layer neural network, which samples the φ and adds a softmax layer to the generated weight matrix to obtain the topic-word distribution β. At the same time, the probability distribution $\tilde{b} \in S^{V - 1}$ of words in the document is obtained, where V represents the number of neurons in the output layer of the decoder network. The equation of $\tilde{b}$ is as follows:

$\tilde{b}_{i} = \frac{\exp g_{i}}{\sum_{j=1}^{V} \exp g_{j}}, \quad g = βφ + w$ (3)

where w represents the offset (bias) vector. Therefore, the reconstruction loss term of the entire model is the negative cross-entropy loss between the input b to the encoder network and the document-word distribution $\tilde{b}$ generated by the decoder network. The equation is as follows:

$c (b, \tilde{b}) = - \sum_{i = 1}^{V} b_{i} log {\tilde{b}}_{i}$ (4)
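Equations (3) and (4) can be traced on a toy example. The following is a minimal sketch in plain Python, not the W-LDA implementation: the function names, the 3-word vocabulary, the 2 topics and all numbers are hypothetical, and the small `eps` inside the logarithm is an added safeguard that is not part of Equation (4).

```python
import math

def softmax(g):
    """Equation (3): normalize the scores g into a probability vector."""
    m = max(g)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in g]
    total = sum(exps)
    return [e / total for e in exps]

def decode(beta, phi, w):
    """Compute g = beta * phi + w, then apply softmax. beta has shape V x N."""
    g = [sum(b_vn * p_n for b_vn, p_n in zip(row, phi)) + w_v
         for row, w_v in zip(beta, w)]
    return softmax(g)

def reconstruction_loss(b, b_tilde, eps=1e-12):
    """Equation (4): c(b, b~) = -sum_i b_i * log(b~_i)."""
    return -sum(b_i * math.log(bt_i + eps) for b_i, bt_i in zip(b, b_tilde))

# Hypothetical toy values: V = 3 words, N = 2 topics.
beta = [[2.0, 0.0], [0.0, 2.0], [0.5, 0.5]]  # topic-word weights
phi = [0.9, 0.1]                             # document-topic distribution
w = [0.0, 0.0, 0.0]                          # offset
b = [3, 0, 1]                                # BOW counts of the document

b_tilde = decode(beta, phi, w)
loss = reconstruction_loss(b, b_tilde)
```

Because the document is dominated by the first topic, the reconstructed distribution puts most of its mass on the first word, and the loss shrinks as $\tilde{b}$ concentrates on the words that actually occur in b.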

W-LDA uses the MMD-based regularization term D_L (Q_L, P_L). Here MMD needs to match the Dirichlet distribution, which is a high-dimensional continuous probability distribution whose support is the positive simplex. Therefore, MMD uses the information diffusion kernel function [26]. The equation is as follows:

$Z(L, L') = \exp\left(-\arccos^{2}\left(\sum_{n=1}^{N} \sqrt{L_{n} L'_{n}}\right)\right)$ (5)

where L is sampled from the conditional posterior distribution Q_L of the latent variable L, and L′ is sampled from the prior distribution P_L of L. Equation (5) uses the geodesic distance to measure the distance between L and L′. Geodesic distance is more sensitive to data near the simplex boundary, so it is suitable for sparse data [26].
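As an illustration of Equation (5), the sketch below evaluates the information diffusion kernel for two points on the simplex. It is a plain-Python sketch under stated assumptions: the function name and the test vectors are hypothetical, and the clamp to [0, 1] is an added guard against floating-point rounding rather than part of the equation.

```python
import math

def diffusion_kernel(l, l_prime):
    """Equation (5): exp(-arccos^2(sum_n sqrt(l_n * l'_n)))."""
    s = sum(math.sqrt(a * b) for a, b in zip(l, l_prime))
    s = min(1.0, max(0.0, s))  # guard against rounding just outside [0, 1]
    return math.exp(-math.acos(s) ** 2)

# Identical topic distributions: geodesic distance 0, kernel value 1.
same = diffusion_kernel([0.7, 0.2, 0.1], [0.7, 0.2, 0.1])

# Distributions on different simplex vertices: the largest possible distance.
disjoint = diffusion_kernel([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

The kernel equals 1 for identical distributions and falls to exp(-(π/2)²) for distributions with disjoint support.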

The MMD in Equation (2) can be estimated without bias from f samples of each distribution, as shown in Equation (6):

$\widetilde{MMD}_{Z}(Q_{L}, P_{L}) = \frac{1}{f(f-1)} \sum_{i \neq j} Z(L_{i}, L_{j}) + \frac{1}{f(f-1)} \sum_{i \neq j} Z(L'_{i}, L'_{j}) - \frac{2}{f^{2}} \sum_{i,j} Z(L_{i}, L'_{j})$ (6)

where L_i and L_j are sampled from the conditional posterior distribution Q_L of the latent variable L, and L_i′ and L_j′ are sampled from the prior distribution P_L of L. The sampled data are substituted into Equation (6) to obtain the MMD distance.

2.3 Bidirectional gating recurrent unit (BiGRU)

The Gated Recurrent Unit (GRU) has one fewer gate than LSTM [27]. It retains the long-term memory advantage of LSTM while greatly reducing the number of parameters, which helps prevent over-fitting. Let x_k be the text input to the GRU at time k, and let x_k-1 and c_k-1 be the input at the previous time and the accumulated historical information, respectively. Then the output at the current time is c_k, as shown in Equation (7):

$c_{k} = (1 - z_{k}) \times c_{k-1} + z_{k} \times h_{k}$ (7)

where z_k is the update gate and h_k is the candidate state to be activated. They are given by Equations (8) and (9):

$z_{k} = σ (W_{z} \times [c_{k - 1}, x_{k}])$ (8)

$h_{k} = \tanh(W_{h} \times [γ_{k} \times c_{k-1}, x_{k}])$ (9)

where σ is the sigmoid activation function, W represents the weight matrix, and γ_k represents the reset gate, as shown in Equation (10):

$γ_{k} = σ (W_{γ} \times [c_{k - 1}, x_{k}])$ (10)
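Equations (7) to (10) describe one recurrence step, which can be sketched directly. The code below is an illustrative plain-Python sketch, not the model used in the experiments: all names, the weight values and the toy sequence are hypothetical, biases are omitted as in the equations, and concatenating the two final states is only one assumed way of joining the two directions.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(W, v):
    """Multiply the matrix W (list of rows) by the vector v."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def gru_step(c_prev, x, W_z, W_r, W_h):
    """One GRU step following Equations (7)-(10)."""
    concat = c_prev + x                                    # [c_{k-1}, x_k]
    z = [sigmoid(v) for v in matvec(W_z, concat)]          # update gate, Eq. (8)
    r = [sigmoid(v) for v in matvec(W_r, concat)]          # reset gate, Eq. (10)
    gated = [r_i * c_i for r_i, c_i in zip(r, c_prev)] + x
    h = [math.tanh(v) for v in matvec(W_h, gated)]         # candidate state, Eq. (9)
    return [(1 - z_i) * c_i + z_i * h_i                    # new state, Eq. (7)
            for z_i, c_i, h_i in zip(z, c_prev, h)]

def bigru(xs, fwd, bwd, hidden):
    """Run one GRU left-to-right and one right-to-left, then concatenate."""
    c_f, c_b = [0.0] * hidden, [0.0] * hidden
    for x in xs:
        c_f = gru_step(c_f, x, *fwd)
    for x in reversed(xs):
        c_b = gru_step(c_b, x, *bwd)
    return c_f + c_b

# Hypothetical weights: hidden size 2, input size 1, so each matrix is 2 x 3.
W = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]]
xs = [[1.0], [0.5], [-1.0]]
features = bigru(xs, fwd=(W, W, W), bwd=(W, W, W), hidden=2)
```

Because each new state is a convex combination of the previous state and a tanh output, every component of the state stays strictly between -1 and 1.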

BiGRU is equivalent to two GRUs that process the sequence in opposite directions, with the outputs of the two GRUs concatenated at the end. It obtains clearer semantics by using both the preceding and the following text information. The structure of BiGRU is shown in Fig. 1.

Fig. 1

BiGRU neural network structure.

2.4 Sparsemax

Sparsemax is the sparse form of the softmax activation function [28], as shown in Equation (11):

$sparsemax(x) = \underset{p \in Δ^{d-1}}{\arg\min} \| p - x \|_{2}^{2}$ (11)

where $Δ^{d-1} ≜ \{ p \in ℝ^{d} \mid \sum_{j=1}^{d} p_{j} = 1, p ⩾ 0 \}$ is the probability simplex with d - 1 degrees of freedom. The input x of the sparsemax function is projected onto the simplex Δ^d-1.

The projection method is to find the point closest to the simplex through the Euclidean distance, so the projection point probably falls on the boundary of the simplex, resulting in the effect of output sparsity.
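The projection in Equation (11) has a closed-form solution based on sorting and thresholding. The sketch below follows that standard procedure in plain Python; the function name and the example scores are illustrative assumptions, not values from the paper.

```python
def sparsemax(x):
    """Euclidean projection of x onto the probability simplex (Equation (11))."""
    z = sorted(x, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, z_k in enumerate(z, start=1):
        cumsum += z_k
        if 1 + k * z_k > cumsum:      # the k-th largest score stays in the support
            tau = (cumsum - 1) / k    # threshold for the current support size
        else:
            break                     # all remaining scores are truncated to zero
    return [max(v - tau, 0.0) for v in x]

# Hypothetical encoder scores for three topics.
p = sparsemax([1.0, 0.8, -0.5])
```

Here the threshold is 0.4, so the result is approximately [0.6, 0.4, 0.0]: the weakest topic receives exactly zero probability, whereas softmax would assign it a small positive value.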

2.5 Variational Bayesian Gaussian mixture model (VBGMM)

The Gaussian Mixture Model (GMM) can be regarded as the probability distribution formed by the linear superposition of a finite number of Gaussian distributions. The equation is as follows:

$p(x) = \sum_{m=1}^{M} p(α_{m}) p(x | α_{m}) = \sum_{m=1}^{M} π_{m} N(x | μ_{m}, Σ_{m})$ (12)

where α_m is a latent variable indicating that x belongs to the m-th Gaussian distribution, N (x | μ_m, Σ_m) represents the probability density of the m-th Gaussian distribution, μ_m is the mean vector, Σ_m is the covariance matrix, and the mixing coefficient π_m satisfies $\sum_{m=1}^{M} π_{m} = 1$ and π_m ∈ [0, 1].
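A one-dimensional instance of Equation (12) shows how a mixture produces several peaks. This is a minimal plain-Python sketch; the function names and the two-component parameters are hypothetical and chosen only for illustration.

```python
import math

def gaussian_pdf(x, mu, var):
    """Density of a one-dimensional Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gmm_pdf(x, weights, means, variances):
    """Equation (12) in one dimension: sum_m pi_m * N(x | mu_m, var_m)."""
    return sum(pi * gaussian_pdf(x, mu, var)
               for pi, mu, var in zip(weights, means, variances))

# Hypothetical mixture with two well-separated components.
weights, means, variances = [0.5, 0.5], [-2.0, 2.0], [0.5, 0.5]
at_left_peak = gmm_pdf(-2.0, weights, means, variances)
at_valley = gmm_pdf(0.0, weights, means, variances)
at_right_peak = gmm_pdf(2.0, weights, means, variances)
```

The density is high near each component mean and low in between, which is the multi-peak behaviour that a single Gaussian cannot represent.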

VBGMM is the GMM obtained by Variational Bayesian inference. The inference process is as follows. Each sample x_n input to the GMM corresponds to a latent variable α_n (α_nm is the component of α_n associated with the m-th Gaussian distribution in the mixture). Let the N input samples be X = {x_1, ⋯, x_N} and the latent variables be A = {α_1, ⋯, α_N}. The joint probability distribution of the sample data X, the latent variables A and the model parameters is:

$p(X, A, π, μ, Λ) = p(X | A, μ, Λ) p(A | π) p(π) p(μ | Λ) p(Λ)$ (13)

where μ is the mean vector, Λ is the precision matrix, and π is the mixing coefficient. p (X | A, μ, Λ) is the Gaussian distribution of the sample data X given the latent variables and the model parameters, p (A | π) is the conditional probability distribution of the latent variable A given the mixing coefficient π, p (π) is the Dirichlet distribution over the mixing coefficient π, and p (μ | Λ) p (Λ) is the Gaussian-Wishart prior distribution. The variational distribution q (A, π, μ, Λ) is then introduced and factorized over the latent variable A and the model parameters, as shown in Equation (14):

$q (A, π, μ, Λ) = q (A) q (π) q (μ, Λ)$ (14)

The optimal estimate of each factor in the above equation is obtained by Variational Bayesian inference. The optimal estimate q^* (A) of the factor of q (A, π, μ, Λ) over A is as follows:

$q^{*}(A) = \prod_{n=1}^{N} \prod_{m=1}^{M} r_{nm}^{α_{nm}}$ (15)

where r_nm represents the response coefficient of sample n for the m-th Gaussian distribution, which is obtained from Equations (16) and (17):

$r_{nm} = p_{nm} / \sum_{j = 1}^{M} p_{nj}$ (16)

$\ln p_{nm} = E[\ln π_{m}] + \frac{1}{2} E[\ln |Λ_{m}|] - \frac{d}{2} \ln(2π) - \frac{1}{2} E_{μ_{m}, Λ_{m}}[(x_{n} - μ_{m})^{T} Λ_{m} (x_{n} - μ_{m})]$ (17)

where Λ_m represents the precision matrix of the m-th Gaussian distribution, μ_m represents the mean vector of the m-th Gaussian distribution, π_m represents the mixing coefficient of the m-th Gaussian distribution, and d is the dimension of the sample x_n.

The optimal estimates q^* (π) and q^* (μ_m, Λ_m) of the factors of q (A, π, μ, Λ) over π and over μ and Λ are shown in Equations (18) and (19), respectively:

$q^{*} (π) = T (Đ) \prod_{m = 1}^{M} π_{m}^{Đ_{m} - 1}$ (18)

$q^{*}(μ_{m}, Λ_{m}) = N(μ_{m} | U_{m}, (φ_{m} Λ_{m})^{-1}) \, Wishart(Λ_{m} | W_{m}, ω_{m})$ (19)

where T (Đ) is the normalization coefficient of the Dirichlet distribution and Đ = {Đ₁, Đ₂, …, Đ_M}; U_m and (φ_m Λ_m)^-1 are the mean and covariance of the Gaussian distribution over μ_m for the m-th component; and Wishart (Λ_m | W_m, ω_m) is the Wishart distribution with ω_m degrees of freedom and scale matrix W_m.

The parameters of the q (π) and q (μ, Λ) distributions are then updated repeatedly until the Evidence Lower Bound (ELBO) of the Variational Bayesian inference of GMM converges. The ELBO of the Variational Bayesian inference of GMM is shown in Equation (20):

$ELBO = E[\ln p(X | A, μ, Λ)] + E[\ln p(A | π)] + E[\ln p(π)] + E[\ln p(μ, Λ)] - E[\ln q(A)] - E[\ln q(π)] - E[\ln q(μ, Λ)]$ (20)

where p (μ, Λ) is the Gaussian-Wishart prior distribution.

3 Our method

3.1 Short texts preprocessing

The short texts are preprocessed as follows. First, the Chinese short texts datasets are segmented with JIEBA [29], and the stop words and low-frequency words are filtered out. Second, the low-frequency words and stop words in the English dataset are removed with the Natural Language Toolkit (NLTK). Next, the word frequencies of the processed datasets are counted and a BOW is built to vectorize the text. Finally, the word-frequency weights in the BOW are adjusted by TF-IDF to highlight the important feature words, and the result is used as the input of SW-VBGMM.
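The last two preprocessing steps can be sketched on already-tokenized documents. The code below is a minimal plain-Python sketch under stated assumptions: the text does not say which TF-IDF variant is used, so a common form (term frequency normalized by document length, idf = ln(D / df)) is assumed here, and the function names and toy documents are hypothetical. Segmentation and stop-word filtering with JIEBA or NLTK are taken as already done.

```python
import math
from collections import Counter

def bag_of_words(docs, vocab):
    """Count how often each vocabulary word occurs in each tokenized document."""
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([counts[w] for w in vocab])
    return rows

def tfidf(bow):
    """Reweight raw counts by tf * idf, with idf = ln(D / df) (assumed variant)."""
    n_docs = len(bow)
    n_words = len(bow[0])
    # Document frequency: number of documents that contain each word.
    df = [sum(1 for row in bow if row[j] > 0) for j in range(n_words)]
    weighted = []
    for row in bow:
        total = sum(row) or 1  # avoid division by zero for empty documents
        weighted.append([
            (row[j] / total) * math.log(n_docs / df[j]) if df[j] else 0.0
            for j in range(n_words)
        ])
    return weighted

# Hypothetical tokenized reviews after segmentation and stop-word removal.
docs = [["good", "hotel", "clean"], ["bad", "hotel"], ["good", "food", "hotel"]]
vocab = sorted({w for doc in docs for w in doc})
b = bag_of_words(docs, vocab)
b_T = tfidf(b)
```

In this toy corpus "hotel" occurs in every document and receives weight 0, while "clean", which occurs in only one document, receives the largest weight in the first review.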

3.2 Sparse Wasserstein-Variational Bayesian Gaussian mixture model (SW-VBGMM)

W-LDA takes the Dirichlet distribution as the prior distribution of potential topics in the model. Although the Dirichlet distribution captures well the rule that documents usually belong to sparse topic subsets [19], the weak relevance among the components of a Dirichlet random vector makes the obtained potential topics almost unrelated to each other, which is not consistent with many practical problems. The Gaussian mixture distribution can fit any distribution through multiple Gaussian distributions. Moreover, because the Gaussian mixture distribution is multimodal (it has multiple peaks), it can fit the data better and obtain more consistent topics [32]. Compared with GMM obtained by Expectation Maximization (EM) inference, VBGMM improves the generalization ability of the model and effectively prevents over-fitting [30]. Therefore, we use VBGMM [21] as the prior distribution of W-LDA's potential topics, and VBGMM continuously fits the potential topics during model training.

The encoder network of SW-VBGMM is still an MLP. The input of the encoder network is b_T, which is the BOW preprocessed by TF-IDF. VBGMM is used as the prior distribution of potential topics, where the number of Gaussian distributions in VBGMM equals the number of topics preset by the model. While the MLP maps b_T to the output layer with N neurons, VBGMM continuously fits the potential topics. A sparsemax layer is then added after the output layer of the MLP to obtain the sparse document-topic distribution $φ_{1} \in S^{N - 1}$ . The consistency and relevance of the topic features in φ₁ are better when applied to short texts. As in W-LDA, the purpose of the encoder network is to complete the inference Q (φ₁|b_T) ≈ P (φ₁|b_T).

Then the sparse document-topic distribution φ₁ is used as the input of the decoder network, where the decoder network is still composed of a single-layer neural network. The decoder network samples φ₁, and adds the sparsemax layer to the weight matrix generated by the decoder network to obtain the sparse topic-word distribution β₁. At the same time, the word probability distribution ${\tilde{b}}_{T} \in S^{V - 1}$ of the document is obtained in the output layer of the decoder network (with V neurons) according to Equation (3). The reconstruction loss of the SW-VBGMM is the negative cross-entropy loss between the input b_T to the encoder network and the document-word distribution ${\tilde{b}}_{T}$ generated by the decoder network.

In this paper, in order to match the VBGMM, we replace the information diffusion kernel function with the Radial Basis Function (RBF) kernel [31] suitable for Gaussian distribution. The mathematical form is as follows:

$Z(L, L') = \exp\left(-\frac{\| L - L' \|^{2}}{2σ^{2}}\right)$ (21)

where L is sampled from the conditional posterior distribution Q_L of the latent variable L, L′ is sampled from the prior distribution P_L of L, and σ represents the scale parameter.

As in W-LDA, f samples are substituted into Equation (6) to obtain an unbiased estimate of the MMD in Equation (2).
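The estimate in Equation (6) with the RBF kernel of Equation (21) can be sketched as follows. This is an illustrative plain-Python sketch, not the training code: the function names, f = 50, σ = 1 and the two-dimensional Gaussian toy samples standing in for draws from Q_L and P_L are all assumptions made for the example.

```python
import math
import random

def rbf_kernel(l, l_prime, sigma=1.0):
    """Equation (21): exp(-||l - l'||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(l, l_prime))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def mmd_estimate(q_samples, p_samples, kernel):
    """Equation (6) with f samples from Q_L and f samples from P_L."""
    f = len(q_samples)
    assert f == len(p_samples) and f > 1
    within_q = sum(kernel(q_samples[i], q_samples[j])
                   for i in range(f) for j in range(f) if i != j)
    within_p = sum(kernel(p_samples[i], p_samples[j])
                   for i in range(f) for j in range(f) if i != j)
    cross = sum(kernel(q, p) for q in q_samples for p in p_samples)
    return (within_q + within_p) / (f * (f - 1)) - 2 * cross / f ** 2

rng = random.Random(0)  # fixed seed so the toy example is reproducible

def draw(mean, f=50):
    """Hypothetical stand-in sampler: f points from a 2-D unit Gaussian."""
    return [[rng.gauss(mean, 1.0), rng.gauss(mean, 1.0)] for _ in range(f)]

q = draw(0.0)
near = mmd_estimate(q, draw(0.0), rbf_kernel)  # same distribution as q
far = mmd_estimate(q, draw(3.0), rbf_kernel)   # clearly shifted distribution
```

The estimate stays close to zero when both sample sets come from the same distribution and grows when the posterior samples drift away from the prior, which is what the regularization term penalizes during training.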

3.3 Short texts classification based on SW-VBGMM and BiGRU

The process of short texts classification by the short texts classification model proposed in this paper is shown in Fig. 2. Firstly, the preprocessed corpus is converted to BOW, and TF-IDF is used to preprocess it to generate b_T. Subsequently, b_T is input into SW-VBGMM encoder network to generate the document-topic distribution φ₁, which is used as the input of BiGRU; Then, BiGRU neural network deeply extracts the features of φ₁; Finally, the softmax function layer classifies the final extracted features.

Fig. 2

Short texts classification flow chart.

4 Experiments and result

4.1 Experimental datasets

The experimental datasets are the Tan Songbo hotel review dataset, the Takeout review dataset, the THUCNews news headline dataset and the AGNews dataset. The Tan Songbo hotel review dataset (3000 positive and 6000 negative) and the Takeout review dataset (4000 positive and 8000 negative) are both two-category Chinese datasets. The THUCNews news headline dataset includes ten categories: finance, real estate, stocks, education, technology, society, current affairs, sports, games, and entertainment. We extract 900 documents from each category in its training set to form a training set containing 9000 documents, and extract 300 documents from each category in its validation set and test set to form a validation set and a test set containing 3000 documents each. AGNews is an English short news dataset with four categories: world, sports, business and technology. It has 120,000 documents in the training set and 7,600 documents in the test set, and we extracted 24,000 documents from the training set as the validation set. The data distribution is shown in Table 1.

Table 1
Data distribution

Dataset	Total Data	Training Data	Validation Data	Test Data	Average length	Class Number
Tan Songbo hotel review	9000	5400	1800	1800	75.5	2
Takeout review	12000	7200	2400	2400	27.7	2
THUCNews news headline	15000	9000	3000	3000	22.2	10
AGNews	127600	96000	24000	7600	17.9	4

4.2 Experiment parameter settings

All experiments were run on a Windows machine with an Intel (R) Core (TM) i7-9750 CPU @ 2.60 GHz and 16 GB RAM. The experimental code was written in Python 3.6 on the PyTorch framework. The parameter settings are shown in Table 2. Except for the number of Gaussian distributions in SW-VBGMM and the number of hidden layers in BiGRU, the parameter settings are shared.

Table 2
Parameter settings

Parameter	Numerical value
Batch size	64
Number of training epochs	30
Dropout rate	0.5
Learning rate	0.01
Number of Gaussian distributions	Number of topics
Optimizer	Adam
Number of hidden layers in BiGRU	200
L2 Regularization coefficient	0.01

4.3 Experimental evaluation index

The quality of text classification is usually measured by precision (P), recall (R), F1 value (F1) and accuracy (ACC). The mathematical forms are as follows:

$P = \frac{A}{A + B}$ (22)

$R = \frac{A}{A + C}$ (23)

$F1 = \frac{2 \times P \times R}{P + R}$ (24)

$Acc = \frac{A + D}{A + B + C + D}$ (25)

where A is the number of texts that were correctly assigned to a certain category; B is the number of texts that were incorrectly assigned to the category; C is the number of texts that belong to the category but were not assigned to it; and D is the number of texts that do not belong to the category and were not assigned to it.
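Equations (22) to (25) can be checked with a small worked example. The sketch below is plain Python; the function name and the counts are hypothetical and are not taken from the experiments.

```python
def classification_metrics(a, b, c, d):
    """Equations (22)-(25) computed from the counts A, B, C and D."""
    precision = a / (a + b)
    recall = a / (a + c)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (a + d) / (a + b + c + d)
    return precision, recall, f1, accuracy

# Hypothetical counts for one category out of 200 test texts.
p, r, f1, acc = classification_metrics(a=90, b=10, c=30, d=70)
```

With these counts, P = 0.90, R = 0.75, F1 ≈ 0.818 and Acc = 0.80. F1 lies between precision and recall and is pulled toward the lower of the two.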

4.4 Experimental results and analysis

4.4.1 Determine the optimal number of topics

To determine the optimal number of topics, we vary the number of topics of SW-VBGMM from 2 to 10 on the Tan Songbo hotel review and Takeout review datasets, and use the F1 value to select the optimal number. Since AGNews is a four-category dataset, we vary the number of topics of SW-VBGMM from 4 to 10 on it. As shown in Fig. 3, the optimal number of topics for the Takeout review dataset is 3, with an F1 value of 96.62%; for the Tan Songbo hotel review dataset it is 5, with an F1 value of 95.78%; and for the AGNews dataset it is 7, with an F1 value of 91.28%.

Fig. 3

F1 value of the two-category Chinese dataset and AGNews dataset under different topics.

For the THUCNews news headline dataset, we vary the number of topics of SW-VBGMM from 10 to 50 in steps of 5. As shown in Fig. 4, the optimal number of topics for this dataset is 25, with an F1 value of 93.61%.

Fig. 4

F1 value of the THUCNews news headline dataset under different topics.

4.4.2 Comparison of different classifiers

In order to verify that BiGRU can effectively extract the features of the document-topic distribution φ₁ generated by the SW-VBGMM and perform classification, we input φ₁ into different classifiers for comparison experiments. These classifiers contain SVM, Decision Tree (DT), K-Nearest Neighbor (KNN), Logistic Regression (LR), CNN and LSTM.

For classifiers based on machine learning (SVM, DT, KNN, LR), we combine the training set and the validation set of each dataset into a single training set. For SVM, we use the Gaussian (RBF) kernel function; for DT, we use information gain as the feature selection method; for KNN, we set k to 5; for LR, we use L2 regularization. For the neural networks (CNN, LSTM), we divide the data into training, validation and test sets according to the data distribution in Table 1. The validation set is used to adjust the hyperparameters and prevent the neural network from overfitting. The convolution window sizes of CNN are set to (3, 4, 5), the number of filters is set to 100, and the dropout value is set to 0.6; the LSTM parameter settings are the same as those of BiGRU. The experimental results on the Takeout review dataset and the Tan Songbo hotel review dataset are shown in Table 3, and the experimental results on the THUCNews news headline dataset and the AGNews dataset are shown in Table 4.

Table 3
Comparison of different classifiers on Takeout review and Tan Songbo hotel review

	Takeout review				Tan Songbo hotel review
Classifier	Acc (%)	P (%)	R (%)	F1 (%)	Acc (%)	P (%)	R (%)	F1 (%)
φ₁+SVM	94.81	95.29	94.50	94.89	92.93	93.03	92.94	92.99
φ₁+DT	90.03	89.99	90.04	90.02	91.66	91.22	91.02	91.12
φ₁+KNN	94.05	94.54	93.95	94.24	91.55	91.61	90.34	90.97
φ₁+LR	93.60	93.93	93.40	93.66	75.26	75.24	67.89	71.38
φ₁+CNN	95.92	96.97	94.46	95.70	94.54	95.15	93.29	94.21
φ₁+LSTM	95.40	95.62	95.29	95.46	93.87	92.65	93.65	93.15
φ₁+BiGRU	96.85	97.66	95.60	96.62	95.72	95.85	95.71	95.78

Table 4

Comparison of different classifiers on THUCNews news headline and AGNews

	THUCNews news headline				AGNews
Classifier	Acc (%)	P (%)	R (%)	F1 (%)	Acc (%)	P (%)	R (%)	F1 (%)
φ₁+SVM	88.47	89.27	88.57	88.42	88.62	88.47	88.50	88.48
φ₁+DT	86.94	86.99	86.93	86.96	85.10	85.11	85.08	85.10
φ₁+KNN	84.99	85.25	84.90	85.07	88.54	88.75	88.43	88.59
φ₁+LR	57.13	56.77	57.03	56.90	84.88	84.97	84.76	84.86
φ₁+CNN	91.83	91.80	92.20	92.00	90.75	90.77	90.88	90.83
φ₁+LSTM	89.50	90.01	89.30	89.65	90.45	90.65	90.59	90.62
φ₁+BiGRU	93.27	93.37	93.86	93.61	91.13	91.37	91.20	91.28

It can be seen from the experimental results in Tables 3 and 4 that combining the sparse document-topic feature matrix φ₁ generated by the SW-VBGMM with BiGRU for classification achieves good results. On the multi-class news headline dataset in particular, the neural network classifiers are clearly better than the traditional classifiers. In terms of F1 value, BiGRU is about 0.92 percentage points higher than CNN and about 1.73 points higher than SVM on the Takeout review dataset; about 1.57 points higher than CNN and about 2.79 points higher than SVM on the Tan Songbo hotel review dataset; about 1.61 points higher than CNN and about 5.19 points higher than SVM on the THUCNews news headline dataset; and about 0.45 points higher than CNN and about 2.80 points higher than SVM on the AGNews dataset.
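As a quick consistency check on the figures quoted above, F1 is the harmonic mean of precision and recall, and the gaps between classifiers are differences in percentage points. The sketch below recomputes both from the Table 3 values for the Takeout review dataset. The helper name `f1_from_pr` is introduced here for illustration only, and the check assumes that the reported F1 is derived directly from the reported P and R.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values for φ₁+BiGRU on the Takeout review dataset (Table 3).
bigru_f1 = round(f1_from_pr(97.66, 95.60), 2)
print(bigru_f1)  # 96.62

# Gap to φ₁+CNN (F1 = 95.70 in Table 3), in percentage points.
print(round(bigru_f1 - 95.70, 2))  # 0.92
```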

These results show that BiGRU outperforms the other classifiers (SVM, DT, KNN, LR, CNN, LSTM) in short texts classification. This is because SVM and DT are very sensitive to missing features, so they are not well suited to sparse features; both KNN and LR have difficulty dealing with sample imbalance; compared with CNN, BiGRU can extract contextual feature information in both directions at the same time; and compared with LSTM, the GRU unit has fewer parameters, which helps to prevent overfitting. Therefore, BiGRU achieves the best short texts classification effect among all the compared classifiers.

4.4.3 Comparison of different models

In order to verify the advantages of the proposed model over other topic models, we compare the SW-VBGMM with LDA [5], W-LDA [19], NVDM [14] and ETM [16]. The hyperparameter settings of the LDA topic model are consistent with the work of Blei [5], and the number of Gibbs sampling iterations is 2000. The parameter settings of the neural network topic models (W-LDA, NVDM, ETM) are consistent with those of the model proposed in this paper, and the output of the encoder network is used as the feature for short texts classification; the topic vector dimension of ETM is set to 500, and the Dirichlet parameter of W-LDA is set to 0.1. The input of these comparative topic models is BOW, and the classifier is BiGRU. In addition, we directly use the BOW processed by TF-IDF as the input of CNN [8] and BiGRU [23] for comparative experiments. The experimental results on the Takeout review dataset and the Tan Songbo hotel review dataset are shown in Table 5, and the results on the THUCNews news headline dataset and the AGNews dataset are shown in Table 6.

Table 5

Comparison of different models on Takeout review and Tan Songbo hotel review

	Takeout review				Tan Songbo hotel review
Model	Acc (%)	P (%)	R (%)	F1 (%)	Acc (%)	P (%)	R (%)	F1 (%)
LDA+BiGRU [5]	73.26	73.26	73.74	73.50	72.14	69.13	69.28	69.20
NVDM+BiGRU [14]	83.50	79.22	82.00	80.59	81.23	80.85	79.18	80.01
ETM+BiGRU [16]	91.10	93.34	91.00	92.16	91.20	90.77	89.95	90.36
W-LDA+BiGRU [19]	93.24	94.98	91.43	92.93	91.13	92.11	91.03	91.57
CNN [8]	88.71	87.13	91.08	89.06	86.61	86.75	86.11	86.43
BiGRU [23]	89.30	89.90	89.31	89.60	88.69	88.87	88.67	88.77
Our Model	96.85	97.66	95.60	96.62	95.72	95.85	95.71	95.78

Table 6

Comparison of different models on THUCNews news headline and AGNews

	THUCNews news headline				AGNews
Model	Acc (%)	P (%)	R (%)	F1 (%)	Acc (%)	P (%)	R (%)	F1 (%)
LDA+BiGRU [5]	53.13	53.33	53.24	53.29	74.42	74.93	74.81	74.87
NVDM+BiGRU [14]	75.87	75.76	75.91	75.83	80.29	80.44	80.25	80.34
ETM+BiGRU [16]	84.87	84.92	87.61	86.24	85.50	85.71	85.42	85.56
W-LDA+BiGRU [19]	86.43	86.40	87.20	86.80	85.58	86.28	85.71	85.99
CNN [8]	87.69	87.79	87.67	87.73	86.29	86.44	86.65	86.54
BiGRU [23]	90.48	90.50	90.47	90.49	87.78	88.05	87.88	87.96
Our Model	93.27	93.37	93.86	93.61	91.13	91.37	91.20	91.28

It can be seen from the experimental results in Tables 5 and 6 that the model proposed in this paper has clear advantages over the traditional LDA topic model in short texts classification. It is also better than the neural network topic models (W-LDA, NVDM, ETM) and the neural networks (CNN, BiGRU) compared in this paper. With BiGRU as the classifier for all topic models, the F1 value of the SW-VBGMM is about 3.69 percentage points higher than that of W-LDA on the Takeout review dataset, about 4.21 points higher on the Tan Songbo hotel review dataset, about 6.81 points higher on the THUCNews news headline dataset, and about 5.29 points higher on the AGNews dataset. The proposed model achieves the best results among all compared models on the three Chinese datasets and the AGNews dataset, which demonstrates the effectiveness of combining SW-VBGMM and BiGRU for short texts classification.

Compared with the other topic models (LDA, W-LDA, NVDM, ETM), SW-VBGMM is more suitable for short texts feature extraction. This is because LDA and W-LDA use the Dirichlet distribution as the prior distribution of the model’s potential topic features, and the Dirichlet prior cannot generate topic features with strong relevance. Although NVDM and ETM use the Gaussian distribution and the logistic-normal distribution respectively as the prior distribution of the model’s potential topic features, SW-VBGMM uses a more complex Variational Bayesian Gaussian mixture prior distribution, which continuously fits the potential topic features with multiple Gaussian distributions during model training and thereby generates more consistent topic features. In addition, the input of SW-VBGMM is the BOW processed by TF-IDF, and the sparsemax activation function is introduced, which makes the SW-VBGMM more suitable for extracting the sparse features of short texts. Compared with using CNN or BiGRU alone, the combination of SW-VBGMM and BiGRU not only uses SW-VBGMM to extract sparse topic features with better relevance, but also uses BiGRU to further extract global features from the generated topic features, thereby achieving a better short texts classification effect.

4.4.4 Ablation experiments

SW-VBGMM is an improved version of W-LDA. In order to verify the effectiveness of each improvement to W-LDA proposed in this paper, we conducted ablation experiments on the four datasets. The experimental results on the Takeout review dataset and the Tan Songbo hotel review dataset are shown in Table 7, and the results on the THUCNews news headline dataset and the AGNews dataset are shown in Table 8.

Table 7

Comparison of different improved methods for W-LDA on Takeout review and Tan Songbo hotel review

	Takeout review				Tan Songbo hotel review
Method	Acc (%)	P (%)	R (%)	F1 (%)	Acc (%)	P (%)	R (%)	F1 (%)
W-LDA+BiGRU	93.24	94.98	91.43	92.93	91.13	92.11	91.03	91.57
TF-IDF+W-LDA+BiGRU	93.54	94.62	93.55	94.08	92.00	92.73	91.98	92.35
Sparsemax+W-LDA+BiGRU	93.70	94.75	93.73	94.24	92.36	92.77	92.26	92.51
VBGMM+W-LDA+BiGRU	95.27	95.36	95.25	95.31	94.08	94.08	94.09	94.08
TF-IDF+Sparsemax+W-LDA+BiGRU	93.98	94.84	93.94	94.38	92.91	92.81	92.93	92.87
TF-IDF+VBGMM+W-LDA+BiGRU	95.77	96.10	95.57	95.83	94.78	95.25	94.48	94.86
VBGMM+Sparsemax+W-LDA+BiGRU	96.36	96.73	95.36	96.04	94.75	94.74	94.76	94.75
SW-VBGMM+BiGRU	96.85	97.66	95.60	96.62	95.72	95.85	95.71	95.78

Table 8

Comparison of different improved methods for W-LDA on THUCNews news headline and AGNews

	THUCNews news headline				AGNews
Method	Acc (%)	P (%)	R (%)	F1 (%)	Acc (%)	P (%)	R (%)	F1 (%)
W-LDA+BiGRU	86.43	86.40	87.20	86.80	85.58	86.28	85.71	85.99
TF-IDF+W-LDA+BiGRU	86.68	87.50	86.67	87.08	86.53	86.80	86.77	86.79
Sparsemax+W-LDA+BiGRU	87.53	88.98	87.69	88.33	86.67	86.94	86.76	86.85
VBGMM+W-LDA+BiGRU	91.03	91.14	91.43	91.28	89.33	89.06	88.99	89.03
TF-IDF+Sparsemax+W-LDA+BiGRU	88.63	89.12	88.13	88.62	87.18	87.26	87.02	87.14
TF-IDF+VBGMM+W-LDA+BiGRU	91.93	92.36	91.82	92.10	89.78	90.05	89.91	89.98
VBGMM+Sparsemax+W-LDA+BiGRU	92.19	92.38	92.20	92.29	89.96	90.22	90.10	90.16
SW-VBGMM+BiGRU	93.27	93.37	93.86	93.61	91.13	91.37	91.20	91.28

It can be seen from Tables 7 and 8 that the three methods proposed in this paper to improve W-LDA are effective whether used alone or in combination with each other. In particular, using VBGMM as the prior distribution of potential topics in W-LDA plays the most significant role in improving the quality and relevance of the topic features generated by the model. With BiGRU kept as the topic feature classifier, the results of the ablation experiments are analyzed as follows. Using only VBGMM as the prior distribution of potential topics in W-LDA, the F1 value of the classification is about 2.38 percentage points higher than that of W-LDA on the Takeout review dataset, about 2.51 points higher on the Tan Songbo hotel review dataset, about 4.48 points higher on the THUCNews news headline dataset, and about 3.04 points higher on the AGNews dataset. These results show that when VBGMM replaces the Dirichlet distribution as the prior distribution of potential topics in W-LDA, the effect of short texts classification is significantly improved. This is because the weak relevance between the components of the random vector in the Dirichlet distribution weakens the relevance between the topic features generated by the model. VBGMM, in contrast, continuously fits the potential topic features with multiple Gaussian distributions during model training. In this way, the relevance of the topic features finally generated by the model is improved, and the more relevant topic features lead to better classification results.
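To make the idea of a prior built from multiple Gaussian distributions more concrete, the sketch below draws one latent vector from a Gaussian mixture with diagonal covariance. It only illustrates sampling from such a mixture and does not reproduce the variational Bayesian fitting used in SW-VBGMM; the function name, the mixture weights, the means and the standard deviations are all made-up assumptions.

```python
import random
from typing import List, Sequence


def sample_gmm(
    weights: Sequence[float],
    means: Sequence[Sequence[float]],
    stds: Sequence[Sequence[float]],
    rng: random.Random,
) -> List[float]:
    """Draw one vector from a diagonal-covariance Gaussian mixture."""
    # Pick a mixture component according to the mixing weights.
    component = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    # Sample each dimension independently from that component's Gaussian.
    return [rng.gauss(m, s) for m, s in zip(means[component], stds[component])]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Made-up two-component mixture over a 3-dimensional latent space.
sample = sample_gmm(
    weights=[0.6, 0.4],
    means=[[0.0, 0.0, 0.0], [2.0, 2.0, 2.0]],
    stds=[[1.0, 1.0, 1.0], [0.5, 0.5, 0.5]],
    rng=rng,
)
print(len(sample))  # 3
```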

With VBGMM as the prior distribution of potential topics in W-LDA, additionally replacing the softmax activation function on the output layer of the encoder network with the sparsemax activation function raises the F1 value of the classification to about 3.11 percentage points above that of W-LDA on the Takeout review dataset, about 3.18 points above on the Tan Songbo hotel review dataset, about 5.49 points above on the THUCNews news headline dataset, and about 4.17 points above on the AGNews dataset. These results show that the sparsemax activation function is more suitable for short texts features than the softmax activation function. This is because the features of short texts are sparse, whereas the feature distribution generated by the softmax activation function is dense. Therefore, when the softmax activation function is applied to short texts, many features in the generated dense feature distribution are ambiguous. The sparse feature distribution generated by the sparsemax activation function is more in line with the characteristics of short texts, and thus a better short texts classification effect is obtained.
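The difference between dense and sparse outputs can be seen in a small worked example. The sketch below implements softmax and sparsemax in plain Python, following the sorting-based definition of sparsemax given by Martins and Astudillo [28]; it is an illustration only, not the implementation used in the paper, and the input scores are made up.

```python
import math
from typing import List, Sequence


def softmax(z: Sequence[float]) -> List[float]:
    """Standard softmax; every output is strictly positive."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def sparsemax(z: Sequence[float]) -> List[float]:
    """Sparsemax: Euclidean projection of z onto the probability simplex."""
    z_sorted = sorted(z, reverse=True)
    cumulative = 0.0
    support_sum = 0.0
    k_support = 0
    for k, value in enumerate(z_sorted, start=1):
        cumulative += value
        # Position k is in the support if 1 + k * z_(k) > sum_{j<=k} z_(j).
        if 1 + k * value > cumulative:
            k_support = k
            support_sum = cumulative
    # The threshold tau makes the clipped outputs sum to one.
    tau = (support_sum - 1) / k_support
    return [max(v - tau, 0.0) for v in z]


scores = [1.2, 0.8, -0.5]  # made-up encoder outputs for three topics
print(softmax(scores))     # all three entries are nonzero
print(sparsemax(scores))   # the third entry is exactly 0.0
```

Both functions return a vector that sums to one, but only sparsemax can assign exactly zero weight to low-scoring topics.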

On top of the previous two improvements, using TF-IDF to preprocess the BOW input of the model raises the F1 value of the classification to about 3.69 percentage points above that of W-LDA on the Takeout review dataset, about 4.21 points above on the Tan Songbo hotel review dataset, about 6.81 points above on the THUCNews news headline dataset, and about 5.29 points above on the AGNews dataset. These results show that the classification effect is improved after preprocessing the BOW input of the model with TF-IDF. This is because TF-IDF reweights the features in the BOW and increases the weight of the key features, so that the model can generate more accurate topic features.
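The reweighting effect of TF-IDF on a bag of words can be sketched as follows. The example uses the common tf × log(N/df) form, which is an assumption here and may differ in detail from the variant defined in Section 3.1; the function name and the toy documents are made up for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_weight(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Reweight bag-of-words counts with tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc)
        weighted.append({
            term: (count / length) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weighted


# Made-up tokenized short texts for illustration.
docs = [["food", "good", "fast"], ["food", "cold"], ["food", "good"]]
for row in tfidf_weight(docs):
    print(row)
```

In this toy corpus the term "food" occurs in every document and therefore receives weight zero, while rarer terms such as "cold" and "fast" receive the largest weights.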

5 Conclusion

This paper proposes a neural network topic model, SW-VBGMM, which can effectively extract the topic features of short texts. SW-VBGMM uses the BOW preprocessed by TF-IDF as the input of the model and uses VBGMM as the prior distribution of potential topics, so the quality and relevance of the topic features generated by the model are better. In addition, SW-VBGMM adds the sparsemax activation function to the output layer of the encoder network to generate a document-topic feature distribution that is better suited to the sparse features of short texts. The document-topic feature distribution is then input to BiGRU for deep feature extraction. Finally, a softmax layer is used to achieve effective short texts classification. Experiments show that the short texts classification effect of the proposed model is better than that of the topic models (LDA, ETM, NVDM, W-LDA) and the neural network models (CNN, BiGRU) compared in this paper.

In the future, we will consider feeding word embeddings to SW-VBGMM instead of BOW to obtain more consistent topic features. We will also consider adding a topic-word sparsity constraint to the SW-VBGMM training process to improve its topic extraction ability.

Acknowledgments

This work was supported by the Postdoctoral Science Foundation of China (2016M592894XB), the National Natural Science Foundation of China (61866020, 11773012), the General Program of Yunnan Provincial Department of Science and Technology (2019FB082), and the General Program of basic research in Yunnan Province (202001AT070047).

References

1. Huang, Chen, et al., Short Text Understanding Combining Text Conceptualization and Transformer Embedding[J], IEEE Access PP(99) (2019), 1–1.

2. Zhang, Chen, Li, et al., Short Text Clustering Algorithms for Weibo Topic Detection[J], Advanced Materials Research 971–973 (2014), 1747–1751.

3. Chao, Qu and Tao, Research of Collaborative Filtering Recommendation Algorithm for Short Text[J], Journal of Computer & Communications 2(14) (2014), 59–66.

4. Amensisa A.D., Patil and Agrawal, A survey on text document categorization using enhanced sentence vector space model and bi-gram text representation model based on novel fusion techniques[J], (2018), 218–225.

5. Blei D.M., et al., Latent dirichlet allocation, Journal of Machine Learning Research 3(3) (2003), 993–1022.

6. Zhou, Wang, Sun, et al., A Method of Short Text Representation Based on the Feature Probability Embedded Vector[J], Sensors 19(17) (2019), 3728.

7. Zhang and Zhang, Topics extraction in incremental short texts based on LSTM[J], Social Network Analysis and Mining 10(1) (2020), 1–9.

8. Wang, He, Zhang, et al., A Short Text Classification Method Based on N-Gram and CNN[J], Chinese Journal of Electronics 29(2) (2020), 248–254.

9. Cheng, Yan, Lan, et al., BTM: Topic Modeling over Short Texts[J], IEEE Transactions on Knowledge & Data Engineering 26(12) (2014), 2928–2941.

10. Ji and Wu, Short text classification based on expanding feature of LDA[J], Computer Engineering and Applications 51(04) (2015), 123–127.

11. Pang, Wan, Li, et al., MR-LDA: An Efficient Topic Model for Classification of Short Text in Big Social Data[J], International Journal of Grid and High Performance Computing 8(4) (2016), 100–113.

12. Wang and Li, Online Biterm Topic Model based Short Text Stream Classification using Short Text Expansion and Concept Drifting Detection[J], Pattern Recognition Letters 116(DEC.1) (2018), 187–194.

13. Kingma D.P. and Welling, Auto-encoding variational bayes, arXiv preprint arXiv:1312.6114, (2014).

14. Miao, Lei and Blunsom, Neural variational inference for text processing, Computer Science (2016), 1791–1799.

15. Ding, Nallapati and Xiang, Coherence-Aware Neural Topic Modeling// Proceedings of the Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium, (2018), 830–836.

16. Dieng A.B., Ruiz F.J.R. and Blei D.M., The Dynamic Embedded Topic Model, arXiv preprint arXiv:1907.05545, (2019).

17. Spokoyny, Neubig, et al., Lagging inference networks and posterior collapse in variational autoencoders, arXiv preprint arXiv:1901.05534, (2019).

18. Tolstikhin, Bousquet, Gelly, et al., Wasserstein Auto-Encoders[J], arXiv preprint arXiv:1711.01558, (2017).

19. Nan, Ding, et al., Topic modeling with wasserstein autoencoders, In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, (2019), 6345–6381.

20. Choi W.S. and Kim S.B., N-gram feature selection for text classification based on symmetrical conditional probability and tf-idf, Journal of Korean Institute of Industrial Engineers 41(4) (2015), 381–388.

21. Blei D.M., Kucukelbir and McAuliffe J.D., Variational inference: A review for statisticians[J], Journal of the American Statistical Association 112(518) (2017), 859–877.

22. Lin, Hu and Guo, Sparsemax and Relaxed Wasserstein for Topic Sparsity// Proceedings of the ACM International Conference on Web Search and Data Mining. Melbourne, Australia, (2019), 141–149.

23. Han, Liu and Jing, Aspect-level Drug Reviews Sentiment Analysis based on Double BiGRU and Knowledge Transfer[J], IEEE Access PP(99) (2020), 1–1.

24. Zhang, Li and Zhou, Text to image synthesis using multi-generator text conditioned generative adversarial networks, Multimedia Tools and Applications 80(3) (2021), 1–15.

25. Gangeh M.J., et al., Computer Aided Theragnosis Using Quantitative Ultrasound Spectroscopy and Maximum Mean Discrepancy in Locally Advanced Breast Cancer, IEEE Transactions on Medical Imaging 35(3) (2016), 778–790.

26. Lafferty and Lebanon, Information Diffusion Kernels, Advances in Neural Information Processing Systems 15 (2002), 375–382.

27. et al., Attention-based LSTM, GRU and CNN for short text classification, Journal of Intelligent and Fuzzy Systems 39(1) (2020), 1–8.

28. Martins A.F. and Astudillo R.F., From softmax to sparsemax: a sparse model of attention and multi-label classification// Proceedings of the International Conference on Machine Learning. New York, USA, (2016), 1614–1623.

29. Zhang, Li, He, et al., Improved feature size customized fast correlation-based filter for Naive Bayes text classification[J], Journal of Intelligent and Fuzzy Systems 38(11) (2020), 1–10.

30. Wang, et al., Localizing Multiple Objects Using Radio Tomographic Imaging Technology, IEEE Transactions on Vehicular Technology 65(5) (2016), 3641–3656.

31. Roy, Govil and Miranda, An Algorithm to Generate Radial Basis Function (RBF)-Like Nets for Classification Problems, Neural Networks 8(2) (1995), 179–201.

32. Prabhudesai, Mainsah, Collins and Throckmorton C.S., Augmented Latent Dirichlet Allocation (LDA) Topic Model with Gaussian Mixture Topics// ICASSP 2018-2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, (2018), 2451–2455.

Table 2

Parameter settings

Parameter	Numerical value
Batch size	64
Number of training epochs	30
Dropout rate	0.5
Learning rate	0.01
Number of Gaussian distributions	Number of topics
Optimizer	Adam
Number of hidden layers in BiGRU	200
L2 Regularization coefficient	0.01