A survey of topic models: From a whole-cycle perspective

Abstract

With the rapid development of information science and social networks, the Internet has accumulated a large volume of data containing valuable information and topics. The topic model has become one of the primary methods for semantic modeling and classification, and it has been widely studied in academia and industry. However, most topic models focus only on long texts and often suffer from semantic sparsity when applied to short texts. Sparse, short text content and irregular data pose major challenges to the application of topic models in semantic modeling and topic discovery. To overcome these challenges, researchers have explored improved topic models and achieved excellent results. However, most current topic models are applicable only to a specific task, and the majority of current reviews ignore the whole-cycle perspective and framework, which makes it difficult for novices to learn topic models. To address these challenges, we investigate more than a hundred papers on topic models and summarize the research progress on the entire topic model process, including theory, methods, datasets, and evaluation indicators. In addition, we analyze the statistical results of topic models through experiments and introduce their applications in different fields. The paper provides a whole-cycle learning path for novices. It encourages researchers to devote more attention to the topic model algorithm and the theory itself, without spending extra effort on understanding the relevant datasets, evaluation methods, and latest progress.

Keywords

Topic model; text mining; semantic understanding; whole-cycle; topic detection

1 Introduction

In an era of rapid development in network technology and information science, massive amounts of text data accumulate on the Internet. Since most of these data are unstructured, extracting the important and expected information from them can be challenging. The topic model is a typical text mining and modeling technique that is widely applied in topic mining, text classification, emotion analysis, and other closely related fields [1–5]. After probabilistic Latent Semantic Analysis (pLSA) [6] and Latent Dirichlet Allocation (LDA) [7] made significant progress in text modeling, topic model research entered a stage of rapid development. Researchers successively proposed the Dirichlet Multinomial Mixture (DMM) [8, 9], the Biterm Topic Model (BTM) [10], and the Correlated Topic Model (CTM) [11]. These models offer higher text processing efficiency and finer field subdivision than LDA.

Based on previous research experience, we reviewed more than a hundred papers on topic models to analyze and summarize the common topic modeling methods [12–14]. The results can provide novices with preliminary guidance and an overview of the research progress of topic models. Many researchers [15–23] have also reviewed topic models. In particular, Fan et al. [24] reviewed short text topic models in terms of improvement, classification, and application to provide a comprehensive reference for the study of short text topic models. However, this review focused only on short texts and did not analyze the general application of the topic model. Robertsus et al. [25] reviewed topic models aimed at social media, from methods to evaluation. They classified the topic models by the features (content, social interactions, and temporal aspects) used by the models and analyzed the datasets and indicators commonly used in model evaluation. However, this article lacks an introduction to the development and evolution of the topic model and covers only topic models that deal with social network text, so it is difficult for it to provide good guidance for novices. Tan et al. [26] analyzed the latest research progress in emotion analysis in terms of data processing techniques, classification methods, and datasets, and discussed prospects for developing emotion analysis methods. However, this article did not conduct a performance evaluation of emotion analysis methods, which makes it difficult to demonstrate the performance advantages and disadvantages of the various methods clearly. In addition, the introduced methods focus only on emotion analysis and do not generalize well. The above articles mostly reviewed the topic model from a specific perspective. They lack a whole-cycle perspective and have several inadequacies regarding model types and evaluation indicators.
We systematically studied the topic model method from the research progress, datasets, evaluation indicators, performance experiment, and application fields to provide a whole-cycle perspective for the topic model research. Through the above research work, we summarized the topic model into the following three categories: traditional topic models, word vector-based topic models and neural network-based topic models.

In traditional topic models, Meng et al. [27] utilized a hybrid doctor recommendation model based on an online medical platform to recommend the most suitable doctor according to patients’ needs. Shi et al. [28] adopted a dynamic topic modeling method based on self-aggregation (SADTM) to capture topic distributions and aggregate short texts, which achieved topic feature mining for short text. Toubia et al. [29] proposed a topic model for research on innovative documents, which assists in writing these documents. Traditional topic models rely on the information provided by the corpus for topic inference. Therefore, it is generally difficult to achieve good results when using these models for topic inference on short texts from social networks. Researchers have introduced word vector technology to overcome this weakness. Das et al. [30] used multivariate Gaussian distributions to replace the parameterized topic representation of LDA in Gaussian LDA (GLDA), which improved the handling of out-of-vocabulary words in held-out documents. Li et al. [31] adopted the Generalized Polya Urn (GPU) model to enhance the semantic relevance of words under the same topic during DMM sampling, directly extending DMM to the Generalized Polya Urn Dirichlet Multinomial Mixture (GPU-DMM). Yao et al. [32] proposed Knowledge Graph Embedding LDA (KGE-LDA) by embedding knowledge graphs in the topic modeling process, which significantly improved semantic coherence. The word vector-based topic models largely retain the assumptions and sampling methods of the benchmark model and cluster similar documents by using word vectors. This type of model alleviates the problem of sparse topic words in short texts and improves the document aggregation and classification accuracy of topic words.
With the development of neural networks, some scholars found that using neural networks during topic modeling could better exploit the semantic similarity between words. Based on neural networks, Cao et al. [33] proposed the Neural Topic Model (NTM), which overcame the weaknesses of a single topic distribution and sensitivity to initialization. The NTM combined words and documents efficiently under a unified framework. Wang et al. [34] utilized the Topic Attention Model (TAM) to address the issue that topics are represented only by words in the standard topic model, which reduced the perplexity of document modeling. Yin et al. [35] utilized the Spatial-Temporal LDA (ST-LDA) to enhance the ability to infer personal interests in different regions, thereby recommending higher-quality points of interest to users. Miao et al. [36] proposed the Neural Variational Document Model (NVDM), which performed semantic analysis on discrete texts. The neural network-based topic models no longer use the distribution assumptions and sampling methods of the probabilistic topic model. Instead, these models use neural network nodes or weight matrices in place of distribution assumptions and train the model parameters with algorithms such as backpropagation [37, 38], stochastic gradient descent [39–41], and AdaGrad. In addition, the neural network structure can be combined with word vectors for topic modeling, which improves the utilization of semantic similarity between words.

The main contributions of this paper are as follows:

We comprehensively discussed the topic model method, including its development process, evaluation indicators, performance experiment, application status, and other elements, to provide a whole-cycle perspective for studying the topic model method. The paper can help novices understand the topic model method quickly and accurately.

We assessed the research developments of the topic model in recent years, discussed its current problems and potential solutions, and provided reference information for related studies. This work indicates directions of future research for scholars.

After reviewing various topic models from the previous five years, we compared and summarized the application fields and advantages of frequently used topic models. This information can serve as a reference for researchers when selecting a topic model for their research interests.

We collected and compared the widely used datasets for indicator evaluation to further study the topic models’ primary evaluation indicators. Meanwhile, we proposed the critical evaluation basis for the datasets and indicator selection criteria in the performance evaluation of the topic model.

This paper is organized as follows: The relevant work and research progress on typical topic models are summarized in Section 2. We compare and analyze the evaluation indicators for the topic model and the corresponding selection criteria in Section 3. In Section 4, we analyze the statistical results of topic models through experiments and introduce their applications in different fields. The challenges and development trends are analyzed in Section 5. Finally, we conclude the paper with the research progress, evaluation indicators, and application fields of the topic model. In addition, we discuss the prospects of applying graph neural networks (GNNs) to topic modeling.

2 Related work and research progress on typical topic models

Typical topic models include traditional topic models and those integrated with deep learning [42–44]. Traditional topic models include LDA, DMM, CTM, BTM, and others, which provide essential methods for natural language processing and information retrieval. However, these models perform poorly in text mining, document classification, and processing of social short texts such as Twitter, microblogs, and Facebook. To overcome the above issues, scholars have proposed improved topic models [45–48], such as the supervised Latent Dirichlet Allocation (sLDA) [49] and Latent Feature-Biterm Topic Model (LF-BTM) [50]. In particular, to solve the short text sparseness, topic models based on deep learning have been proposed, such as the Correlated Gaussian Topic Model (CGTM) [51], Affinity Regularized NMF for LTM (NMF-LTM) [52], and Neural Sparse Topic Coding (NSTC) [53] model. We summarize the topic model parameters and their explanation in Table 1.

Table 1
Topic model parameters and their explanations

Variable	Interpretation
D, K, B	Document set, number of topics, biterm set in the document set
V, N, M	The word set in the vocabulary set, number of words in the document set, number of words in the vocabulary set
α, β, θ, φ, z	Dirichlet prior parameters of the document-topic distribution, Dirichlet prior parameters of the topic-word distribution, topic distribution vector, topic-word distribution matrix, latent topic
$N_{d}^{w}$, $n_{k}^{w}$, $n_{k}$, $z_{d,w}$	The number of times the word w appears in document d, the number of times the word w is assigned to topic k, the number of times a document (or biterm) is assigned to topic k, and the topic assignment of the w-th word in the d-th document
J, A	Number of informal topics, number of short texts written by authors
$φ_{k}$, $ψ_{j}$, $ξ_{a}$	The word probability vector k (1 ≤ k ≤ K) of the formal topic, the word probability vector j (1 ≤ j ≤ J) of the informal topic, and the author probability vector a (1 ≤ a ≤ A) of the informal topic
$θ_{d}$, $N_{d}$, $C_{d}$	The formal topic probability vector of document d, the number of words in document d, and the number of short texts under document d
$w_{dn}$, $z_{dn}$	The n-th observed word in document d, the formal topic represented by the n-th word in document d
$a_{dc}$, $x_{dc}$, $y_{dc}$, $P_{dc}$	The author of the c-th short text under the d-th normal document, the formal topic represented by the c-th short text under the d-th normal document, the informal topic represented by the c-th short text under the d-th normal document, and the percentage of formal topics in the c-th short text under the d-th normal document
$M_{dc}$, $w_{dcm}$, $b_{dcm}$	The number of words in the c-th short text under the d-th normal document, the m-th observed word in the c-th short text under the d-th normal document, and the topic of the m-th word in the c-th short text under the d-th normal document
$N_{sk}$, $N_{dk}$, $N_{s}$	The number of word tags assigned to topic k by short text s, the number of word tags assigned to topic k by original document d, and the number of word tags assigned to short text s
$w_{dn}$, $N_{k w_{dn}}$, $N_{k}$	The n-th word in the original document d, the number of type words assigned to topic k, and the number of word tags assigned to topic k
$μ_{k}$, $μ_{c}$, $Σ_{k}$, $Σ_{c}$, $η_{d}$	The mean of the k-th Gaussian topic, the mean of η, the covariance of the k-th Gaussian topic, the covariance of η, and the K-dimensional vector whose entries are the topic weights of document d
$μ_{0}$, $Σ_{0}$, $υ_{0}$, μ, Σ, υ	Hyperparameters of the Gaussian topic and the logistic normal prior
$ϑ_{d}$, $w_{d}$, $s_{d,n}$	The latent representation of document d in the topic space, the item vector of document d, the latent representation of word n in the topic space
$ϑ_{u}$, $θ_{u,r}$, $φ_{z}$, $ψ_{z}$	The multinomial distribution of user u’s spatial pattern, the multinomial distribution of user u’s interest in region r, the multinomial distribution of specific topic z, and the multinomial distribution of topic z over time
$φ_{s,r}$, $μ_{r}$, $Σ_{r}$, K, R	The preference distribution of population s for region r, the location mean of region r, the location covariance of region r, the number of topics, and the number of regions
γ, α, β, η, τ, χ, δ	Dirichlet priors of the multinomial distributions
$r_{u_{i}}$, $D^{item}$, $D^{Tag}$	Item list, user-project document set, user-project pseudo-document set
U, E, S, K, V	User set, platform set, post set, topic set, vocabulary set
π, $θ_{u}$, $P_{u,s}$, $Z_{u,s}$	The bias of the background topic distribution, the topic distribution of user u, the platform of the s-th post of user u, the topic of the s-th post of user u
$y_{u,s,n}$, $w_{u,s,n}$, $φ_{k}$, $φ_{B}$	Controller of the n-th word of the s-th post of user u, the n-th word of the s-th post of user u, word distribution of topic k, word distribution of the background topic
$w_{d}$, $h_{d}$, θ, φ, y	Word sequence of document d, hashtag sequence of document d, topic distribution matrix of the tag, word distribution matrix of the topic, topic distribution of the hashtag
D, $N_{p}$, K, p, $φ_{c}$, $θ_{k}$	Data set, number of word pairs, number of topics, set of word pairs, normal word distribution, burst topic distribution
$γ_{0}$, $γ_{1}$	Hyperparameters

2.1 Method of the traditional topic model

2.1.1 Introduction and comparative analysis of four common benchmark models

In 2003, Blei et al. [7] proposed LDA, the first complete probabilistic hierarchical topic model, building on research into the pLSA model. It consists of three layers: document, topic, and word. LDA connects all the document parameters through a probabilistic generative model. Its basic inference procedure is as follows: given the document set D and the prior parameters α and β, the topic assignment sequence $z_{d,w}$ of the words in each document is inferred; from $z_{d,w}$, the document-topic distribution matrix θ and the topic-word distribution matrix φ are then obtained. In 2006, Blei et al. [11] constructed an improved model based on LDA, the CTM, by introducing a logistic normal distribution and a covariance matrix, which compensated for the shortcoming that LDA cannot reflect the correlation between the extracted topics. Nigam et al. [8, 9] proposed the Dirichlet Multinomial Mixture (DMM) model, which is usually used to process short text. Yan et al. [10] proposed the BTM to overcome feature sparsity in short texts. This method mines all word pairs (biterms) from the text set and infers topics directly on the biterm set, which compensates for short text sparsity.
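The LDA inference procedure described above can be illustrated with a minimal sketch of collapsed Gibbs sampling. It assumes symmetric priors (every $α_k$ equals `alpha`, every $β_w$ equals `beta`) and documents given as lists of integer word ids. The function name `lda_gibbs`, its parameters, and the toy corpus are hypothetical choices for illustration, not the implementation used in the surveyed papers.

```python
import random

def lda_gibbs(docs, K, V, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch, symmetric priors)."""
    rng = random.Random(seed)
    n_dk = [[0] * K for _ in docs]      # tokens of document d assigned to topic k
    n_kw = [[0] * V for _ in range(K)]  # times word w is assigned to topic k
    n_k = [0] * K                       # total tokens assigned to topic k
    z = []                              # topic assignment z_{d,w} of every token

    # Random initialisation of the topic assignments.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(K)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from all counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Unnormalised conditional probability of each topic.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta) / (n_k[t] + V * beta)
                    for t in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                # Add the token back under the newly sampled topic.
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    # Point estimates of the document-topic and topic-word matrices.
    theta = [[(n_dk[d][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
             for d, doc in enumerate(docs)]
    phi = [[(n_kw[k][w] + beta) / (n_k[k] + V * beta) for w in range(V)]
           for k in range(K)]
    return z, theta, phi

# Toy corpus: word ids 0-2 and 3-5 tend to co-occur in separate documents.
docs = [[0, 1, 2, 0], [3, 4, 5, 3], [0, 2, 1], [4, 5, 3]]
z, theta, phi = lda_gibbs(docs, K=2, V=6)
```

Each row of `theta` and `phi` is a probability vector, so it sums to one. The sampler is kept in pure Python for readability; a practical implementation would vectorise the count updates.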

The structures of the four models are shown in Fig. 1, and the probability formulas and application scenarios are shown in Table 2. According to Table 2, DMM and LDA have approximately the same time complexity, and both are lower than those of the BTM and CTM. The CTM has the highest time complexity because it requires additional time to estimate hyperparameters from hyper distributions. The time complexity of the BTM is the second highest, and it depends primarily on the size of the biterm set mined from the document set. When the BTM is applied to short text classification, the mined biterm set is small because short texts contain few words, so the time complexity of the BTM is close to that of LDA.
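To make the dependence on the biterm scale concrete, the following sketch mines biterms from tokenized short texts. It assumes that each short text is a single context window and that a biterm is any unordered pair of tokens from that window, so a text with l tokens yields l(l − 1)/2 biterms. The helper names `extract_biterms` and `biterm_set_size` are hypothetical.

```python
from itertools import combinations

def extract_biterms(doc):
    """Return every unordered token pair that co-occurs in one short text."""
    return [tuple(sorted(pair)) for pair in combinations(doc, 2)]

def biterm_set_size(docs):
    """Size |B| of the biterm set mined from a whole collection."""
    return sum(len(extract_biterms(doc)) for doc in docs)

short_texts = [["visit", "apple", "store"], ["apple", "pie"]]
biterms = extract_biterms(short_texts[0])
total = biterm_set_size(short_texts)  # 3 biterms + 1 biterm
```

Because the number of biterms grows quadratically with the text length, |B| stays small for texts of only a few words, which is consistent with the observation that the BTM cost approaches that of LDA on short texts.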

Fig. 1

Structure of the topic model.

Table 2

Topic model formulas and their application scenarios

Model	Probability formula	Applicable scenario	Space complexity	Time complexity
LDA	$p(z, w \mid α, β) = (n_{d,k}^{-(d,w)} + α_{k}) \frac{n_{k,w}^{-(d,w)} + β_{w}}{\sum_{v=1}^{V} (n_{k,v}^{-(d,v)} + β_{v})}$	Conventional text	|D|K + VK + |D|l	O(K|D|l)
DMM	$p(z \mid z_{\neg d}, d, α, β) = \frac{n_{k,\neg d} + α}{|D| - 1 + Kα} \times \frac{\prod_{w \in d} \prod_{j=1}^{N_{d}^{w}} (n_{k,\neg d}^{w} + β + j - 1)}{\prod_{j=1}^{N_{d}^{w}} (n_{k,\neg d}^{w} + Vβ + j - 1)}$	Short text	|D| + VK + |D|l	O(K|D|l)
BTM	$p(z_{b} \mid z_{-b}, B, α, β) = (n_{k} + α) \frac{(n_{w_{i} \mid k} + β)(n_{w_{j} \mid k} + β)}{(\sum_{w} n_{w \mid z} + Vβ)^{2}}$	Short text	K + VK + |B|	O(K|B|)
CTM	$p(z, w \mid α, β) = \frac{e^{η_{d}^{k}}}{\sum_{j=1}^{K} e^{η_{d}^{j}}} \frac{n_{k,w}^{-(d,w)} + β_{w}}{\sum_{v=1}^{V} (n_{k,v}^{-(d,v)} + β_{v})}$	Conventional text	|D|K + VK + |D|l	$O(K|D|l + K^{2} + SK)$
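As a worked illustration of the BTM row in Table 2, the sketch below evaluates the conditional topic distribution of a single biterm from count statistics and normalizes it. It assumes that the counts already exclude the biterm being resampled. The function name `btm_topic_posterior` and the toy counts are hypothetical.

```python
def btm_topic_posterior(n_k, n_wk, w_i, w_j, alpha, beta):
    """Normalised topic distribution for one biterm (w_i, w_j).

    n_k[k]     -- number of biterms currently assigned to topic k
    n_wk[k][w] -- number of times word w is assigned to topic k
    """
    V = len(n_wk[0])  # vocabulary size
    weights = []
    for k in range(len(n_k)):
        denom = (sum(n_wk[k]) + V * beta) ** 2
        weights.append(
            (n_k[k] + alpha) * (n_wk[k][w_i] + beta) * (n_wk[k][w_j] + beta) / denom
        )
    norm = sum(weights)
    return [x / norm for x in weights]

# Toy counts: words 0 and 1 have so far been seen almost only under topic 0.
posterior = btm_topic_posterior(
    n_k=[3, 1], n_wk=[[4, 2, 0], [0, 0, 2]], w_i=0, w_j=1, alpha=1.0, beta=0.01
)
```

With these counts nearly all probability mass falls on topic 0, since both words of the biterm were previously assigned to that topic.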

2.1.2 Extension of the topic model

Topic models have contributed to research on information retrieval and natural language processing. With the continuous accumulation of data on the Internet, the basic topic models have difficulty meeting scholars’ research needs in the relevant fields. Therefore, scholars have explored ways of improving and optimizing the basic topic models so that they better match the actual needs of their professional research [54–57]. We summarize the advantages, disadvantages, parameter learning methods, and application fields of some extended models in Table 3.

Table 3
Basic model extension

Model	Parameter learning	Advantages	Disadvantages	Generalizability and application scenarios
Online LDA [58]	Gibbs sampling	It can capture the dynamic changes of topics with time.	It performs well only in small document set processing.	Well generalizability, mainly applicable to online documents and social media
SLDA	Variational inference	It can adapt to different types of variable responses.	It can only be applied to tagged documents.	Well generalizability, can be used for text classification and sensitive analysis
MG-LDA [59]	EM algorithm	It deals with online review documents better.	The influence of emotional factors on online reviews is not considered.	Well generalization, applicable to online user review
GLDA	Collapsing Gibbs sampling	It can flexibly capture the distribution of topics when ensuring the coherence preference of topics.	The time complexity of the dataset is high, and additional algorithms are needed to speed up the processing process.	Well generalizability, applicable to text classification and topic detection
cDTM [60]	Variation KL	Fast modeling can be performed through sparse variational inference.	The increase in the topic quantities will reduce the model processing effect.	Well generalizability, applicable to temporal text
GDTM [61]	KG algorithm	The sparsity and dynamics of short text can be considered synchronously.	It is necessary to combine the random indicator method of incremental dimension reduction with the linguistic representation technology.	Well generalizability, mainly applicable to temporal text
LF-DMM [62]	Gibbs sampling	It combines feature vectors to improve word-topic mapping in small corpus learning.	The model corpus processing speed is slow.	Well generalizability, mainly applicable to text classification and text clustering
TE-GSDMM [63]	Gibbs sampling	Service clustering performance is further improved.	Web services with fewer categories cannot be clustered.	General generalizability, only used for service cluster
GPU-DMM	Gibbs sampling	It solves the problem of sparseness and looseness of the service representation vector.	The description of the Web service needs to be preprocessed.	Well generalizability, can be used for topic detection and text classification
AL-STM [64]	Gibbs sampling	It makes up for the defects of manual and automatic marking topics in software engineering.	It can only be applied to topic mining in software engineering.	General generalizability, only used for topic detection
DESTM [65]	Gibbs sampling	Document embedding is used to aggregate short texts into long documents to reduce the impact of short text sparsity.	Document embedding is used to cluster short texts into long documents to reduce the impact of short text sparsity.	General generalizability, only used for short text
CSTM [66]	Folding Gibbs sampling	It can realize both cross-class shared topics and specific class topic processing.	Additional LDA training is required to obtain the optimal topic quantities.	Well generalizability, mainly applicable to text classification and text summary
CCTM [67]	Gibbs sampling	It can capture the main topic timeline and reveal the correlation between related subtopics.	The relationship between words needs to be quantified in advance.	General generalizability, only used for topic evolution mining
SBTM [68]	Gibbs sampling	It can characterize the topic content qualitatively.	The data source comes from the course forum posts expressed in Chinese, which is limited to Chinese.	General generalizability, only applicable to sentiment analysis
Promotion-BTM [69]	Gibbs sampling	The biterm is divided into topic words and general words, and only the semantic similarity of topic words is promoted.	It depends on pre-trained word embeddings, which can be used directly from word2vec and phone.	General generalizability, only applicable to topic detection

(1) AOTM

Research on mining user preferences has received extensive attention with the rapid development of short video platforms (Bilibili, Kuaishou, and YouTube) and social media (microblogs, Zhihu, and Twitter). Although the topic model is an effective tool for understanding text content, it cannot be used directly for user preference research because of two shortcomings. First, users’ comments are usually brief and suffer from severe data sparsity. Second, users’ feedback is usually mixed with opinions expressed in the original comment. Therefore, Yang et al. [70] proposed the Author co-occurrence Topic Model (AOTM), which handles both normal texts and users’ comments, the latter being short texts (Fig. 2). By considering authorship, this model allows each author of a short text to belong to a group, and it represents only a probability distribution over the topics of the short text.

Fig. 2

The structure of AOTM.

We describe the generation process of AOTM short text and regular text in Fig. 3.

Fig. 3

AOTM generation process.

According to the generation process of AOTM in Fig. 3, we can derive the complete posterior distribution in Equation (1). $f(z, b, x, y \mid w, A, α, β, γ, ω, ɛ) = \prod_{d=1}^{D} \frac{\prod_{k=1}^{K} Γ(l_{dk}^{(1)} + l_{dk}^{(1)} + α)}{Γ(\sum_{k=1}^{K} (l_{dk}^{(1)} + l_{dk}^{(1)} + α))} \times \prod_{a=1}^{A} \frac{\prod_{j=1}^{K} Γ(h_{aj} + ɛ)}{Γ(\sum_{j=1}^{J} (h_{aj} + ɛ))} \times \prod_{d=1}^{D} \prod_{c=1}^{C} \frac{Γ(s_{dc}^{(1)} + s_{dc}^{(2)} + γ)}{Γ(s_{dc}^{(1)} + γ) Γ(s_{dc}^{(2)} + γ)}$ (1)

(2) LTM

Li et al. [71] adopted the Latent Topic Model (LTM), a generalized topic model for short text mining, to solve the overfitting and time consumption problems of SATM [72] on large-scale corpora. The LTM does not include an extra short text generation process. Li et al. assumed that the members of each extended text are unknown and regarded each short text as part of a regular long text that is produced by the standard topic model. The LTM is shown in Fig. 4.

Fig. 4

The structure of LTM.

In the LTM short text pair set, K denotes the number of topics, D the number of original documents, and S the number of short text pairs. The LTM generation process is shown in Fig. 5.

Fig. 5

LTM generation process.

The posterior conditional probability of $\hat{z}$ over the original documents D is given in Equation (2). $p(\hat{z}_{s} = d \mid \hat{z}^{-s}, z, W, α) = \frac{\prod_{k=1}^{K} \prod_{n=1}^{N_{sk}} (N_{dk}^{-s} + n - 1 + α)}{\prod_{n=1}^{N_{s}} (N_{d}^{-s} + n - 1 + K α)}$ (2)

The posterior conditional probability of z over the K topics is given in Equation (3). $p(z_{dn} = k \mid \hat{z}, z^{-dn}, W, β, α) = \frac{N_{k w_{dn}}^{-dn} + β}{N_{k}^{-dn} + V β} (N_{dk}^{-dn} + α)$ (3)
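As a rough illustration of how Equation (3) is used inside a collapsed Gibbs sampler, the following Python sketch evaluates the conditional for a single token from count arrays. The function name `topic_conditional`, the argument names, and the toy counts are illustrative assumptions and are not taken from Li et al. [71]. The counts are assumed to already exclude the token being resampled, and the result is normalized over the K topics.

```python
def topic_conditional(n_kw, n_k, n_dk, beta, alpha, vocab_size):
    """Normalized p(z_dn = k | ...) for one token, following Equation (3).

    n_kw[k]: count of the current word under topic k (current token excluded)
    n_k[k]:  total number of tokens assigned to topic k (current token excluded)
    n_dk[k]: tokens of the document assigned to topic k (current token excluded)
    """
    weights = [
        (n_kw[k] + beta) / (n_k[k] + vocab_size * beta) * (n_dk[k] + alpha)
        for k in range(len(n_k))
    ]
    total = sum(weights)
    return [w / total for w in weights]


# Toy counts for K = 2 topics and a vocabulary of V = 5 words (hypothetical values).
probs = topic_conditional(
    n_kw=[3, 0], n_k=[10, 10], n_dk=[2, 1], beta=0.1, alpha=0.5, vocab_size=5
)
```

In a full sampler, a new topic for the token would be drawn from `probs`, and the three count arrays would then be updated with the new assignment.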

2.2 Topic model based on word embeddings

Traditional topic models primarily rely on word co-occurrence patterns to generate document topics and generally do not consider semantic structure. Short texts suffer from data sparseness because of insufficient context information, which makes processing them a limitation of the traditional topic model [73–76]. Therefore, researchers enhance the generalization capacity of topic models with word embedding technology [77, 78]. When word embedding topic models are used for short text processing, the generated topic words are more semantically consistent. Researchers have proposed various topic models for different objects. In sentiment analysis, each word in a text often carries both emotional and topic information: some words tend to represent subjective feelings, while others tend to express objective truths. Guo et al. [79] proposed a Bias-Sentiment-Topic (BST) model for microblog sentiment analysis based on word embedding, which jointly models the relationship between bias, emotion, and topic. In research on community Q&A, it is challenging to build a joint learning framework for semantic embedding because the data are multiview and sparse. Based on the Bayesian model, Sang et al. [80] adopted Multimodal Multiview Semantic Embedding (MMSE) to break through this research bottleneck. To mine topic evolution, Gao et al. quantified the relationship between words with the Encoder-only Transformer Language Model (ETLM) and proposed a conditional random field regularized correlated topic model (CCTM) based on ETLM. The CCTM can simultaneously focus on the topic evolution of normal documents and the evolution law of topics over time. Shi et al. [81] proposed the Cbow Topic Model (CTM), which can solve the high-dimensional data problem in large-scale event texts.

This paper organizes the document-topic distribution, the topic-word distribution, and the role of distribution representation learning of each model in Table 4, which clearly shows the similarities and differences between the topic models based on word embedding.

Table 4
Comparison of auxiliary probabilistic topic models based on word vectors

Baseline model	Model	Document-Topic distribution	Topic-Vocabulary distribution	The role of distribution representation learning
LDA	GLDA	$Dir(α) \to Multi(θ_{d})$	$(IW^{-1}(ψ_{0}, v_{0}), N(μ, \frac{1}{τ} Σ_{k})) \to N(μ_{z}, Σ_{z})$	Word vector concatenation constitutes a document vector sampled from the word vector space.
	WEI-FTM [82]	$Dir(α) \to Multi(θ_{d})$	$N(0, σ_{0}^{2} I) \to Dir(β b_{k}) \to Multi(φ_{z})$	The similarity between the topic vector and word vector determines the topic-word distribution.
	DGPU-LDA [83]	$Dir(α) \to Multi(θ_{d})$	$Dir(β) \to Multi(φ_{z_{d}})$	Build a document vector, and enhance similar words and similar documents containing the words to be sampled into the same topic.
	KGE-LDA	$Dir(α) \to Multi(θ_{d})$	$Dir(β) \to Multi(φ_{z_{d}})$	Build event vectors using entity vectors in a knowledge graph and generate topic-word distributions for both the document vocabulary and entities.
	LFTM	$Dir(α) \to Multi(θ_{d})$	$Dir(β) \to (1 - s_{d}) Multi(φ_{z}) + s_{d} CatE(u_{z_{d}} v^{T})$	The similarity between the topic vector and the word vector determines the topic-word distribution.
DMM	GPU-DMM	$Dir(α) \to Multi(θ)$	$Dir(β) \to Multi(φ_{z_{d}})$	During topic sampling, enhance words similar to the words to be sampled on the same topic.
BTM	Even-BTM-GPU [84]	$Dir(α) \to Multi(θ)$	$Dir(β) \to Multi(φ_{z_{d}})$	Build event vectors using lexical vectors and enhance similar events into the same topic.
	GPU-PDMM [85]	$Dir(α) \to Multi(θ)$	$Dir(β) \to Multi(φ_{z_{d}})$	During topic sampling, enhance words similar to the words to be sampled on the same topic.
CTM	CGTM	$(IW^{-1}(ψ, ν), N(μ, \frac{1}{k} Σ_{k})) \to N(μ_{c}, Σ_{c}) \to Multi(θ_{d})$	$(IW^{-1}(ψ, ν), N(μ, \frac{1}{k} Σ_{k})) \to N(μ_{z}, Σ_{z})$	Sampling from the word vector space

2.2.1 VAETM

Traditional topic models are often applied to the semantic mining of long text. Because short texts lack word co-occurrence patterns and have sparse topic features, it is difficult for traditional topic models to mine high-quality topics from them. To solve these issues, Zhao et al. [86] utilized word vector representations and entity vectors to construct a Variational Auto-Encoder Topic Model (VAETM). The model generation process is shown in Fig. 6. The Variational Auto-Encoder (VAE) is an encoding-decoding network proposed by Kingma et al. [87]. In a VAE, the encoder compresses the input data d into a latent feature z, and the decoder reconstructs the signal $\hat{d}$ according to the data distribution in the latent space.

Fig. 6

Generation model in VAETM.

In Fig. 6, D denotes the number of documents, W the words in document i, N the number of words in document i, $w^{we}$ the word vector in the document, $w^{ke}$ the entity vector, y the label of the document, θ the document-topic multinomial distribution, α the Dirichlet prior hyperparameter, η the combination of the latent representation with the topic and background, and d the logarithm of the overall word frequency. The generation processes of VAETM are shown in Fig. 7.

Fig. 7

The generation processes of VAETM.

The objective function of the VAETM is expanded to Equation (4).

$L(w_{i}) = - \sum_{j=1}^{N_{i}} \log p(w_{ij} \mid h_{i}^{(s)}) + \log p(y_{i} \mid h_{i}^{(s)}) + c \cdot D_{KL}[q_{φ}(h_{i} \mid w_{i}) \,\|\, p(h_{i} \mid α)]$ (4)

2.2.2 CGTM

Traditional topic models such as CTM obtain the correlation structure between latent topics by using a logistic-normal distribution instead of the Dirichlet prior. Word embeddings have been shown to capture semantic regularities, so the semantic relevance between words can be calculated directly in the word embedding space, for example with cosine similarity. Xun et al. proposed a Correlated Gaussian Topic Model (CGTM) based on word embedding (Fig. 8). The CGTM models documents directly in a continuous word embedding space by using additional word-level information. It replaces the words in a document with their embeddings and establishes a multivariate Gaussian distribution model. Then, the model learns the correlation between the continuous Gaussian topics. The generation processes of CGTM are shown in Fig. 9.
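The cosine-based relevance mentioned above can be sketched in a few lines of Python. The three-dimensional vectors below are made-up toy embeddings chosen only for illustration; real word embeddings would come from a pre-trained model and have far more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings; the values are not taken from any trained model.
embeddings = {
    "football": [0.9, 0.1, 0.0],
    "soccer": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

related = cosine_similarity(embeddings["football"], embeddings["soccer"])
unrelated = cosine_similarity(embeddings["football"], embeddings["banana"])
```

With these toy vectors, `related` is close to 1 and `unrelated` is close to 0, which is the kind of signal an embedding-based topic model can exploit when word co-occurrence is sparse.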

Fig. 8

The structure of CGTM.

Fig. 9

The generation processes of CGTM.

τ and $τ_{c}$ are constant factors. Given the documents D and the corresponding word embeddings w, the joint distribution of the topic assignment z and the logistic-normal parameter η is shown in Equation (5). $p(z, \{η_{d}\}_{d=1}^{D} \mid w) \propto p(w \mid z) \prod_{d=1}^{D} \left( \prod_{n=1}^{N_{d}} \frac{\exp(η_{d}^{z_{dn}})}{\sum_{i=1}^{K} \exp(η_{d}^{i})} \right) N(η_{d} \mid μ_{c}, Σ_{c})$ (5)
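The fraction inside Equation (5) is a softmax that turns the logistic-normal parameter of a document into topic proportions. The Python sketch below illustrates this step under a simplifying assumption: the covariance is taken to be diagonal, whereas Equation (5) allows a full covariance. The helper names and the numeric values are hypothetical.

```python
import math
import random


def softmax(eta):
    """Map a real-valued vector to proportions that sum to 1."""
    shift = max(eta)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in eta]
    total = sum(exps)
    return [e / total for e in exps]


def sample_topic_proportions(mu, sigma, rng):
    """Draw eta ~ N(mu, diag(sigma^2)) and return softmax(eta).

    Assumption: a diagonal covariance stands in for the full covariance.
    """
    eta = [rng.gauss(m, s) for m, s in zip(mu, sigma)]
    return softmax(eta)


rng = random.Random(0)
theta = sample_topic_proportions(mu=[0.0, 1.0, -1.0], sigma=[0.5, 0.5, 0.5], rng=rng)
```

Because the Gaussian can carry correlations between components of η, topics modeled this way can co-occur in a correlated manner, which a Dirichlet prior cannot express.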

2.3 Topic model based on a neural network

Neural network-based topic models use neural networks to characterize the text generation process that contains latent topic information. These models take the bag-of-words representation of a document as input and add corresponding word vectors and other network layers to generate the document. Cao et al. combined feedforward neural networks to propose the NTM at the AAAI conference in 2015, which began the research on neural network-based topic models. Compared with the traditional topic model, the NTM has a simple structure and does not require prior assumptions. Meanwhile, it can obtain high-quality topic representations and accurate classification. As topic models began to be constructed at the neural network level, scholars proposed BERTopic [88], the Context Reinforced Neural Topic Model (CRNTM) [89], Neural SparseMax Topic Models (NSMTM) [90], and other topic models [91–94]. This paper compares the network structure, characteristics, and application fields of each model. The similarities and differences between the topic models based on conventional neural network structures are shown in Table 5.

Table 5
The comparison of topic models based on mainstream neural network structures

Model	Network structure	Input layer	Model characteristic	Generalizability and application scenarios
NTM	Feedforward neural network	n-gram vectors for documents	Begins to build a topic model from the perspective of neural networks, and the distributions of words and documents have a reasonable probabilistic interpretation.	Well generalizability, can be used for topic extraction and text classification
NVDM	Variational autoencoder	Word vector	Follows the VAE network structure to generate documents from latent features of the input document word vector space.	General generalizability, only used for topic extraction
NSMTM	Variational autoencoder	Word vector	Building on the VAE-based topic model, sparse constraints are imposed to generate topic and word distributions with sparse representations.	General generalizability, only applicable to text classification
TAM	Recurrent neural network	Bag of words	In the attention mechanism, a novel method is designed to utilize the topic proportion of specific documents and the global topic vector learned from the neural topic model. A backpropagation inference method is also developed to allow joint model optimization.	General generalizability, only applicable to text classification
SR-NSTM	Variational autoencoder	Word vector	Adds maximum margin posterior regularization to solve supervised tasks.	General generalizability, only used for text classification
SCHOLAR [95]	Variational autoencoder	Word vector	Various metadata can be used as label information to solve multilabel classification problems or to help infer and predict topics related to that label.	Well generalizability, applicable to information retrieval and text classification
TopicRNN [96]	Recurrent neural network	Document word sequence	Generates words based on the topic and context words and determines whether each generated word is a stop word, capturing syntactic and semantic relationships.	Well generalizability, can be used for word prediction and sentiment analysis
CGTM	Graph convolutional network	Word vector	Not only can external knowledge graphs be used, but external knowledge and old knowledge can be balanced to perform well on new data.	Well generalizability, mainly applicable to topic extraction and document clustering

2.3.1 TAM

In recent years, topic models combined with neural variational inference have achieved good results. Compared with traditional topic models, neural network-based topic models often approximate marginal distributions through deep neural networks to obtain strong generalization capabilities. Because neural network-based methods are unsupervised, it is challenging to directly utilize the topic proportion of a specific document in downstream prediction tasks and achieve optimal performance. Therefore, Wang et al. proposed the TAM (Fig. 10), a supervised neural topic model built on recurrent neural networks (RNN). The model designs a new method within the attention mechanism that utilizes the topic proportion of a specific document and the global topic vector learned by the neural topic model. The TAM takes two views of a document as input: the bag-of-words representation for the Gaussian Softmax Construction (GSM) and the word label sequence for the RNN. The GSM fits the generation process of the document to estimate the document-specific topic distribution t. The word label sequence $x_{t}$ is encoded into the hidden state $h_{t}$ by the GRU-based sequence encoder. Then, Wang et al. utilized the attention mechanism as a bridge to connect the two parts so that the two models complement each other.

Fig. 10

The structure of TAM.

The left part of Fig. 10 is an unsupervised neural network-based topic model through variational autoencoder learning. The right part is a supervised RNN model that encodes the input words by Bi-GRU. The two parts are jointly learned by backpropagation inference, and the joint distribution is shown in Equation (6).

$p(l, d \mid μ_{0}, σ_{0}, β) = \int_{t} p(t \mid μ_{0}, σ_{0}) \, p(d \mid t, β) \, p(l \mid t) \, dt$ (6)

2.3.2 SR-NSTM

Although the NTM performs well in extracting interpretable latent topics and text representations, it has two significant limitations: 1) The feedforward neural network has a shallow structure, so it is often difficult to consider the contextual information of the entire text, which results in insufficient feature representation ability. 2) The sparsity of feature representation in the topic semantic space is neglected. To overcome these problems, researchers proposed the NSTC, NSMTM, and Semantic Reinforcement Neural variational Sparse Topic Model (SR-NSTM) [97]. The SR-NSTM (Fig. 11) utilizes probability distributions parameterized by a neural network to construct the text generation process. Meanwhile, it incorporates a bidirectional long short-term memory (LSTM) network to embed contextual information at the document level. Because the LSTM can effectively capture long-term dependencies between words, it is often used in document-level semantic encoding. The generation processes of each document in the SR-NSTM model are shown in Fig. 12.

Fig. 11

The structure of SR-NSTM.

Fig. 12

The generation processes of SR-NSTM.

Equation (7) gives the lower bound used to maximize the probability of the word counts w in a document during the generation process. $\log \int p_{φ}(w) \geq - D_{KL}[q_{θ}(s \mid w) \,\|\, p_{φ}(s \mid θ)] + E_{q_{θ}(s \mid w)}(\log \int p_{φ}(w \mid s, β))$ (7)

3 Evaluation indicator of the topic model

Recently, scholars have paid attention to methods that can efficiently evaluate the effectiveness of topic models. Currently, the performance evaluation methods of the topic model mainly include perplexity, topical consistency, categorical clustering, topic significance, topic diversity, and topic distance, among others. The commonly used datasets for evaluating the performance of topic models are shown in Table 6.

Table 6
Commonly used datasets for evaluating the performance of topic models

Dataset name	Number of documents	Applicable models
20Newsgroups	20000	CGTM, PAM, sLDA, Disc LDA, BTM
NIPS	1740	PAM, Online-LDA, IITM, IITMM
Amazon	10000+	SJASM, ASUM, JST
TripAdvisor	10000	SJASM, MAS
Google News	11109	DMM, LF-DMM
TweetSet	2472	DMM, SATM, GPU-DMM
Twitter	11109	LFTM, WEI-FTM
Wiki10+	17000+	NTM, SLRTM
Sina microblog	200000	GSDMM, PTM, LTM, SADTM
Tweets	1500000	LDA, BTM, LTM, SATM
IMDB	25000	TopicRNN, TDLM, TCNLM
Web-Snippet	12000+	WEI-FTM, GPU-DMM, GPU-PDMM, NSTC
Reuters-21578	11367	WEI-FTM, CGTM

3.1 Perplexity

The perplexity indicator is usually applied to evaluate the generalization ability of the topic model, that is, its adaptability to unseen data. The lower the perplexity, the stronger the generalization ability of the topic model, the higher the modeling accuracy, and the better the model performance. The perplexity formula of the language model is shown in Equation (8). $perplexity = - \frac{1}{N} \sum_{d=1}^{N} \frac{1}{N_{d}} \log(P(d))$ (8) where N is the number of documents, $N_{d}$ is the number of words in document d, and P(d) is the probability of the words appearing in document d.
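A small Python sketch can make the computation concrete. The first function evaluates the score exactly as Equation (8) is written, from per-document log-likelihoods supplied by some topic model. The second function is not part of Equation (8): it shows the exponentiated, corpus-level form of perplexity that is widely used in the literature, included here only for comparison. The function names and toy inputs are illustrative assumptions.

```python
import math


def mean_negative_log_likelihood(log_probs, doc_lengths):
    """Score of Equation (8): -(1/N) * sum over documents of log P(d) / N_d."""
    n_docs = len(log_probs)
    return -sum(lp / n_d for lp, n_d in zip(log_probs, doc_lengths)) / n_docs


def corpus_perplexity(log_probs, doc_lengths):
    """Widely used variant (an addition for comparison): exp(-sum log P(d) / sum N_d)."""
    return math.exp(-sum(log_probs) / sum(doc_lengths))


# Toy case: a model that assigns every word probability 1/4,
# applied to two documents with 2 and 3 words respectively.
doc_lengths = [2, 3]
log_probs = [n_d * math.log(0.25) for n_d in doc_lengths]

score = mean_negative_log_likelihood(log_probs, doc_lengths)
ppl = corpus_perplexity(log_probs, doc_lengths)
```

For this uniform toy model, `score` equals log 4 and `ppl` equals 4, the size of the effective vocabulary, which matches the intuition that lower values indicate a model that is less "surprised" by unseen text.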

3.2 Topical consistency

Topic consistency is one of the important indicators for topic model performance evaluation. It evaluates the model’s performance through the consistency between the topic words. Furthermore, the ability to generate simple topic words is also one of the evaluation indicators of this method.

3.2.1 Topic cohesion

The cohesion score is essential for quantitatively evaluating the semantic consistency of a topic [98]. Keywords that are semantically consistent often appear in the same document, so topic cohesion can be utilized to automatically evaluate the consistency of each discovered topic. In particular, for a specific topic z, the T words most relevant to the topic are denoted by $V^{z} = \{v_{1}^{z}, v_{2}^{z}, \dots, v_{T}^{z}\}$, and the topic cohesion formula is shown in Equation (9). $C(z; V^{z}) = \sum_{t=2}^{T} \sum_{l=1}^{t-1} \log \frac{D(v_{t}^{z}, v_{l}^{z}) + 1}{D(v_{l}^{z})}$ (9) where $D(v_{l}^{z})$ denotes the frequency of the word $v_{l}^{z}$ in the documents, and $D(v_{t}^{z}, v_{l}^{z})$ is the number of co-occurrences of the words $v_{t}^{z}$ and $v_{l}^{z}$ in the documents. Topic consistency is positively correlated with the effect of the topic model, and the robustness of the model is also positively correlated with topic consistency. However, it should be noted that the indicator is mainly suited to measuring the high-frequency keywords in the documents, and its measurement effect on low-frequency keywords is relatively poor.
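The following Python sketch shows one way to compute the cohesion score of Equation (9). It rests on two assumptions that should be checked against [98]: the double sum pairs every top word with each higher-ranked top word, and D counts documents (the number of documents containing a word, or containing both words). The function name and the toy corpus are hypothetical, and every top word is assumed to occur in at least one document so that the denominator is non-zero.

```python
import math


def topic_cohesion(top_words, documents):
    """Cohesion of one topic from document-level (co-)occurrence counts."""
    doc_sets = [set(doc) for doc in documents]

    def doc_count(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    for t in range(1, len(top_words)):
        for l in range(t):
            joint = doc_count(top_words[t], top_words[l])
            score += math.log((joint + 1) / doc_count(top_words[l]))
    return score


# Toy corpus of tokenized documents (hypothetical).
docs = [["cat", "dog"], ["cat", "dog"], ["cat", "fish"]]
coherent = topic_cohesion(["cat", "dog"], docs)
less_coherent = topic_cohesion(["cat", "fish"], docs)
```

In the toy corpus, "cat" and "dog" share two documents while "cat" and "fish" share only one, so the first topic receives the higher (less negative) score.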

3.2.2 Pointwise mutual information

Let K be the number of topics and T the number of words most related to topic k. Pointwise mutual information [99] (PMI) and normalized pointwise mutual information [100] (NPMI) are expressed as Equations (10) and (11). $PMI = \frac{1}{K} \sum_{k} \frac{2}{T(T-1)} \sum_{1 \leq i < j \leq T} \log \frac{p(w_{i}, w_{j})}{p(w_{i}) p(w_{j})}$ (10) $NPMI = \frac{1}{K} \sum_{k} \frac{2}{T(T-1)} \sum_{1 \leq i < j \leq T} \frac{\log \frac{p(w_{i}, w_{j})}{p(w_{i}) p(w_{j})}}{- \log p(w_{i}, w_{j})}$ (11) where $p(w_{i})$ denotes the probability of the occurrence of word i, and $p(w_{i}, w_{j})$ represents the joint occurrence probability of words i and j. Large-scale corpora are mostly used in the experimental evaluation process, so the documents are long, and the semantic information contained in different positions of a document differs. Therefore, a sliding window is often used to calculate the joint probability. The values of PMI and NPMI are positively correlated with the semantic consistency of the document.
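A minimal Python sketch of the sliding-window estimate is given below for a single topic; averaging the returned values over all K topics gives the PMI and NPMI scores. The function names, the window size, and the small smoothing constant `eps` (used to avoid taking the logarithm of zero for pairs that never co-occur) are assumptions made for illustration. The sketch also assumes that each top word occurs in at least one window and that no pair occurs in every window.

```python
import math
from itertools import combinations


def sliding_windows(documents, window_size):
    """Split each tokenized document into overlapping windows (as word sets)."""
    windows = []
    for doc in documents:
        if not doc:
            continue
        last_start = max(len(doc) - window_size, 0)
        for start in range(last_start + 1):
            windows.append(set(doc[start:start + window_size]))
    return windows


def topic_pmi(top_words, windows, eps=1e-12):
    """Return (PMI, NPMI) of one topic, averaged over its word pairs."""
    total = len(windows)

    def prob(*words):
        hits = sum(1 for win in windows if all(w in win for w in words))
        return hits / total

    pairs = list(combinations(top_words, 2))
    pmi_sum = 0.0
    npmi_sum = 0.0
    for w_i, w_j in pairs:
        p_ij = prob(w_i, w_j) + eps  # eps: assumed smoothing term
        pmi = math.log(p_ij / (prob(w_i) * prob(w_j)))
        pmi_sum += pmi
        npmi_sum += pmi / -math.log(p_ij)
    return pmi_sum / len(pairs), npmi_sum / len(pairs)


# Toy corpus (hypothetical) and a window of two tokens.
docs = [["a", "b", "c", "d"], ["a", "b", "x", "y"], ["c", "d", "x", "y"]]
windows = sliding_windows(docs, window_size=2)
pmi_ab, npmi_ab = topic_pmi(["a", "b"], windows)
pmi_ay, npmi_ay = topic_pmi(["a", "y"], windows)
```

Words that share windows ("a" and "b") obtain positive scores, whereas words that never share a window ("a" and "y") obtain strongly negative ones; NPMI additionally rescales the score toward the interval from -1 to 1, which makes topics with different word frequencies easier to compare.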

3.3 Categorical and clustering

Clustering purity and entropy are often used to evaluate the quality of topic detection. Many researchers have verified that the topic detection rate is positively correlated with the clustering purity and negatively correlated with the clustering entropy [101]. The calculation formula of cluster purity is shown in Equation (12). $Purity(W, C) = \frac{1}{N} \sum_{i=1}^{k} \max_{j} | w_{i} \cap c_{j} |$ (12) where N is the number of members involved in the whole cluster partition, k is the number of clusters, $w_{i}$ is a cluster in set W, and $c_{j}$ is a cluster in set C of the entire cluster partition.
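Equation (12) can be sketched in Python as follows, under the common reading that W holds the clusters produced by the model and C holds the reference classes. The function name and the toy labels are hypothetical.

```python
from collections import Counter


def purity(cluster_labels, class_labels):
    """Fraction of members that fall in the majority class of their cluster."""
    per_cluster = {}
    for cluster, cls in zip(cluster_labels, class_labels):
        per_cluster.setdefault(cluster, Counter())[cls] += 1
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(cluster_labels)


# Six members, two predicted clusters, two reference classes (toy data).
clusters = [0, 0, 0, 1, 1, 1]
classes = ["a", "a", "b", "b", "b", "b"]
score = purity(clusters, classes)
```

Here cluster 0 contributes its two "a" members and cluster 1 its three "b" members, so the purity is 5/6; a perfect clustering would reach 1.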

The overall cluster entropy is defined in Equation (13). $entropy = \sum_{i=1}^{K} \frac{m_{i}}{m} e_{i}$ (13) where $e_{i}$ is the cluster entropy for cluster i, as shown in Equation (14). $e_{i} = - \sum_{j=1}^{L} P_{ij} \log_{2} P_{ij}$ (14) where $P_{ij}$ is the probability that a member in cluster i belongs to class j, defined in Equation (15). $P_{ij} = \frac{m_{ij}}{m_{i}}$ (15) where $m_{i}$ denotes the number of members in cluster i, $m_{ij}$ the number of members in cluster i belonging to class j, L the number of classes, K the number of clusters, and m the number of members participating in the whole cluster partition.
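Equations (13) to (15) can be combined into one short Python function. The sketch uses the usual negative sign in the per-cluster entropy so that the score is non-negative; the function name and the toy labels are hypothetical.

```python
import math
from collections import Counter


def cluster_entropy(cluster_labels, class_labels):
    """Size-weighted average of the class entropy inside each cluster."""
    per_cluster = {}
    for cluster, cls in zip(cluster_labels, class_labels):
        per_cluster.setdefault(cluster, Counter())[cls] += 1

    m = len(cluster_labels)
    total = 0.0
    for counts in per_cluster.values():
        m_i = sum(counts.values())
        # P_ij = m_ij / m_i; only classes present in the cluster are summed.
        e_i = -sum((m_ij / m_i) * math.log2(m_ij / m_i) for m_ij in counts.values())
        total += (m_i / m) * e_i
    return total


# Same toy partition as a mixed cluster plus a pure cluster.
clusters = [0, 0, 0, 1, 1, 1]
classes = ["a", "a", "b", "b", "b", "b"]
mixed = cluster_entropy(clusters, classes)
pure = cluster_entropy([0, 0, 1, 1], ["a", "a", "b", "b"])
```

A partition whose clusters each contain a single class has entropy 0, and the value grows as clusters mix more classes, which is why lower entropy indicates better topic detection.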

3.4 Other evaluation methods

3.4.1 Artificial evaluation method

Evaluation indicators such as perplexity are less effective for document sources with complex and unlabeled text corpora. In this case, a manual evaluation is often more reliable, although it is subjective in judging whether a topic is meaningful. Researchers usually apply this method by directly analyzing the semantics of the 5 to 10 most relevant topic words generated by the model to obtain a subjective evaluation of the topic model.

3.4.2 Semantic retrieval ability

The semantic retrieval ability of the topic model is usually evaluated to further assess its semantic expression and modeling ability. This method uses the modeling results in semantic retrieval: query sentences are utilized to retrieve social network information, and the retrieved information is then sorted according to its similarity to the query sentences. Finally, the retrieval ability of the topic model is evaluated according to the retrieval results returned by the model. Some scholars utilize the topic-word distribution and topic-time distribution generated by an algorithm to calculate the probability of generating the query sentences for each piece of information, and adopt this probability to evaluate the semantic retrieval ability of the algorithm. The higher the probability, the stronger the semantic retrieval ability.

This section introduces several key indicators of current topic model evaluation from different aspects and provides a basis for model performance evaluation. However, because each indicator has different application conditions, the correlation between indicators affects the evaluation results. Therefore, in practical applications, researchers still need to select the most suitable evaluation method according to the actual situation. If necessary, a multi-indicator evaluation method should be integrated based on the correlation between the indicators to evaluate the topic model’s performance most effectively.

4 Statistical analysis results and applications of topic models

In this section, we mainly analyzed the statistical results of the topic model through experiments, and introduced its applications in different fields.

4.1 Statistics data analysis results

To analyze the statistical results of the topic model, we constructed multiple sets of experiments by using the Sina Microblog dataset. PMI is adopted as an evaluation indicator in the experiment.

4.1.1 Dataset

This performance evaluation uses the Sina microblog as the experimental dataset. The Sina microblog dataset contains a large amount of short text data, with a total of 200 000 pieces of data. It is composed of data crawled from Sina microblog using users as random seeds. Each obtained item contains only textual information, retaining the original microblog content and excluding the corresponding microblog reporters. Preprocessing includes word segmentation, stop word removal, deletion of words that appear fewer than seven times, and deletion of blog posts with no more than ten words.

4.1.2 Evaluating indicator

PMI is one of the critical indicators for evaluating topic coherence and reflects model performance well. Table 6 shows the datasets generally used for performance evaluation of topic models. The calculation of PMI is given in Equation (10).
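As an illustration of how a PMI-based coherence score can be computed, the sketch below uses the commonly used definition, PMI(w_i, w_j) = log(P(w_i, w_j) / (P(w_i) P(w_j))), averaged over all pairs of a topic's top words, with probabilities estimated from document-level co-occurrence. It is an assumption that this matches Equation (10); the smoothing constant and the function name are hypothetical.

```python
import math
from itertools import combinations

def topic_pmi(top_words, documents, eps=1e-12):
    """Average pairwise PMI of a topic's top words over a reference corpus.

    documents : list of token lists used to estimate probabilities
    eps       : smoothing constant that avoids log(0) (assumption)
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)
    if n_docs == 0:
        return 0.0

    def prob(*words):
        # Fraction of documents that contain all the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words)) / n_docs

    scores = []
    for w_i, w_j in combinations(top_words, 2):
        p_i, p_j, p_ij = prob(w_i), prob(w_j), prob(w_i, w_j)
        if p_i == 0 or p_j == 0:
            continue  # skip words that never occur in the corpus
        scores.append(math.log((p_ij + eps) / (p_i * p_j)))
    return sum(scores) / len(scores) if scores else 0.0

# Toy reference corpus, invented for illustration.
docs = [["cat", "dog"], ["cat", "dog"], ["car", "road"], ["car", "road"]]
coherent = topic_pmi(["cat", "dog"], docs)
incoherent = topic_pmi(["cat", "car"], docs)
```

Words that tend to appear in the same documents give a positive score, whereas words that never co-occur are penalized heavily, which is why a higher PMI is read as a more coherent topic.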

4.1.3 Experiment setting

We set the number of topics K to 50 and 100 and calculate the PMI of representative models selected from the three types of topic models. LDA, LTM, and GSDMM represent the traditional topic models; UMHE, CTM, and HTMH represent the word vector-based topic models; and DSSM, GRNN, and UATM represent the neural network-based topic models.

4.1.4 The result and analysis of the experiment

The PMI results for K = 50 and K = 100 are shown in Table 7.

Table 7
Comparison of topic coherence

Dataset	Model type	Model	PMI (K = 50)	PMI (K = 100)
Sina microblog	Traditional	LDA	0.80	0.75
Sina microblog	Traditional	LTM	1.25	1.35
Sina microblog	Traditional	GSDMM	1.43	1.41
Sina microblog	Word vector-based	UMHE	1.45	1.42
Sina microblog	Word vector-based	CTM	1.69	1.73
Sina microblog	Word vector-based	HTMH	1.74	1.71
Sina microblog	Neural network-based	DSSM	2.01	2.23
Sina microblog	Neural network-based	GRNN	2.12	2.51
Sina microblog	Neural network-based	UATM	2.61	2.69

Table 7 shows that the neural network-based topic models outperform the other two types. The main reason is that they use deep learning to reconstruct the text generation process and apply a sparsity constraint on topic vocabulary to improve the quality of topic words during modeling. The word vector-based topic models score lower than the neural network-based models on the Sina microblog dataset; they typically use word vectors to improve the model’s generalization ability and thereby extract high-quality topics from short text. The traditional topic models perform worst of the three categories, primarily because they have difficulty learning contextual semantic relationships in text, so their semantic modeling ability is poor.
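The ranking of the three model types can be checked directly against the PMI values reported in Table 7. The short sketch below simply averages those values per model type; the averaging itself is an illustrative summary and not an analysis performed in the paper.

```python
from statistics import mean

# PMI values copied from Table 7 as (K = 50, K = 100).
table7 = {
    "Traditional": {"LDA": (0.80, 0.75), "LTM": (1.25, 1.35),
                    "GSDMM": (1.43, 1.41)},
    "Word vector-based": {"UMHE": (1.45, 1.42), "CTM": (1.69, 1.73),
                          "HTMH": (1.74, 1.71)},
    "Neural network-based": {"DSSM": (2.01, 2.23), "GRNN": (2.12, 2.51),
                             "UATM": (2.61, 2.69)},
}

def mean_pmi_by_type(table):
    """Return {model type: (mean PMI at K = 50, mean PMI at K = 100)}."""
    return {
        model_type: tuple(round(mean(v[i] for v in models.values()), 2)
                          for i in range(2))
        for model_type, models in table.items()
    }

averages = mean_pmi_by_type(table7)
```

The averages come out at roughly 1.16 and 1.17 for the traditional models, 1.63 and 1.62 for the word vector-based models, and 2.25 and 2.48 for the neural network-based models, which matches the ordering discussed above.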

4.2 Applications of topic models

With the continuous development of topic models in recent decades, their application fields have also expanded [102–106] to include the pharmaceutical industry, tourism, finance, and other fields. This section introduces applications of topic models in retrieval and recommendation, cross-media modeling, and topic summarization.

4.2.1 Retrieval and recommendation field

With the development of the information era, search engines give users great convenience by letting them obtain information quickly. The accuracy with which search results match the query has therefore become an essential indicator of search engine performance, and developers pay close attention to it. However, most user queries are short and often carry several retrieval intentions, so personalized retrieval is better able to meet users’ actual needs. In addition to the semantic similarity between documents and query sentences, personalized retrieval takes into account users’ interests, browsing history, and similar signals to obtain optimal recommendations. Personalized retrieval is thus essential for customized recommendation in response to queries.

Research on recommendation algorithms has recently been increasing and deepening. Shi et al. [107] used a User-based Aggregation Topic Model (UATM) to study user preferences and intention distributions, analyzing collected Sina microblog user data and content delivery. Yin et al. proposed the Spatial-Temporal LDA (ST-LDA), a latent-class probabilistic generative model. ST-LDA first divides the geographical space into multiple regions. It then infers a single user’s interest distribution over a series of topics from the content (labels, categories) of the Points of Interest (POIs) accessed in each region. The structure of ST-LDA is shown in Fig. 13. It is a generative model of the user’s location, time, role, and text in check-in records. ST-LDA can mine latent topics and regions and thereby learn different users’ interests and location preferences in a unified way.

Fig. 13

The structure of the ST-LDA.

Shao et al. [108] combined multimodal data with a comprehensive Sentiment-aware Multimodal Topic Model (SMTM) to achieve personalized travel recommendation. The SMTM (Fig. 14) mines topics from multiple aspects, such as tourists’ subjectivity and attractiveness. In addition, tourists’ emotional attitudes toward classic topics are included among the evaluation factors, and travel recommendations are made accordingly.

Fig. 14

SMTM travel recommendation framework.

Online shopping is an indispensable activity in daily life, so helping customers quickly discover the commodities they like is particularly important. Collaborative filtering generates recommendations from the interactions between users and products. Na et al. [109] argued that the number of user tags reflects user preference to a certain extent and therefore proposed a collaborative filtering algorithm based on User-Item-Tag Latent Dirichlet Allocation (UITLDA). Because earlier comments have a weaker impact than a user’s current comments, a delay function was introduced into the prediction calculation of UITLDA so that commodities can be recommended accurately according to users’ preferences. The user, item, and tag topic modeling process of UITLDA is shown in Fig. 15. Figure 16 shows that UITLDA infers the topic distributions of users, their items, and tags, and then combines two constraints into a new distribution.

Fig. 15

UITLDA user, item, and tag topic modeling process.

Fig. 16

The modeling process of UITLDA.
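The idea that earlier comments should weigh less than recent ones can be illustrated with a simple time-decay weighting. The exact delay function used in UITLDA is not given in this section, so the exponential half-life form below is an assumption chosen for illustration, and all names and numbers are hypothetical.

```python
def decay_weight(age_days, half_life_days=30.0):
    """Weight of an interaction that is age_days old (assumed exponential decay)."""
    return 0.5 ** (age_days / half_life_days)

def decayed_preference(interactions, now_day, half_life_days=30.0):
    """Decay-weighted mean score of one user's interactions with one item.

    interactions : list of (day, score) pairs (assumed input format)
    now_day      : the day on which the prediction is made
    """
    weights = [decay_weight(now_day - day, half_life_days)
               for day, _ in interactions]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * score
               for w, (_, score) in zip(weights, interactions)) / total

# One old low score and one recent high score, invented for illustration.
history = [(0, 1.0), (30, 5.0)]
preference = decayed_preference(history, now_day=30)
```

With a half-life of 30 days, the older score receives half the weight of the recent one, so the weighted preference lies above the unweighted mean of 3.0.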

The application of topic models in the recommendation field is not limited to tourism and commodity recommendation. In remote sensing image recommendation, there is a Space-Time Periodic Task (STPT) model based on simulated user log records [110]. Chen et al. [111] proposed the Spatial-Temporal Embedding Topic (STET) model to compensate for the defects of the STPT model. In forum recommendation, the cognitive load of online participants has increased dramatically with the growing number of posts. Peng et al. proposed a Sentiment and Behavior Topic Model (SBTM) to retrieve the text content relevant to participants’ concerns rapidly and precisely.

In topic recommendation, Kowald et al. [112] developed a cognitive-inspired hashtag recommendation method (BLL_I,S) that incorporates the influence of time on personal and social hashtags into the prediction model. Li et al. [113] proposed a Multiview Scholar Clustering Topic (MSCT) model that combines scholars’ interests with double-view (internal and external) information in a clustering method to cluster scholars accurately, providing a reference for scholar recommendation. Furthermore, Zeng et al. [114] used a knowledge graph based on user perception to capture relevant news information and proposed a topic model for recommending accurate news to users. Meanwhile, Ji et al. [115] used a Social Period Aware Topic Model (SPATM) to distinguish user interests from social preferences and achieved personalized venue recommendation. Topic models have also been applied effectively to recommendation in the medical field. Yang et al. [116] proposed a doctor recommendation model, based on a system decision model, that considers patient preferences and opinions.

4.2.2 Cross-media modeling field

Modern information retrieval technology mainly includes single-media and cross-media information retrieval. Single-media information retrieval requires the query and the retrieval set to belong to the same modality, such as retrieving text by text or images by images. Cross-media information retrieval integrates different modalities: it uses the associations between modalities so that data in one modality can retrieve data in another. As an important future direction for information retrieval research, cross-media retrieval has become a popular research topic worldwide and has received extensive attention from scholars [117–119].

Mining and analyzing cross-media data to bridge the semantic gap is a vital research field. Adversarial learning methods [120, 121] are essential in image, text, and speech generation. They learn a new data distribution by setting up an adversarial game between the generation and discrimination processes [122]. Liu et al. [123] proposed a Semantic Similarity-based Adversarial Cross-media Retrieval (SSACR) method in which two neural network models are trained adversarially. SSACR mainly comprises input, mapping, distribution, similarity, and discriminative networks. Through these networks, the method performs feature extraction, mapping, judgement, and other operations on each image-text-semantic triple while minimizing the errors. The overall process of SSACR is shown in Fig. 17.

Fig. 17

SSACR overall processes.

In the modern information society, with the increasing diversification of social platforms and the rapid development of social networking services (SNS) [124–126], a new trend is to build a shared content model across multiple social networking websites to describe the same user. Liu et al. [127] adopted CrossSiteLDA (c-LDA), a reliable cross-site user-generated content model, in their experiments. The results indicated that the model outperformed existing models on performance indicators such as perplexity and semantic coherence.

4.2.3 Topic summary areas

In recent years, academia and industry have paid extensive attention to research on Twitter topic derivation [128, 129]. Moreover, information interconnection and social platforms (YouTube, microblogs, and TikTok) play essential roles in the real-time spread of news events. However, texts on Twitter and TikTok are short and informal, and their vector representations are extremely sparse. Obtaining accurate and comprehensive summary statements has therefore become a popular research topic. Wang et al. proposed a Hashtag Graph-based Topic Model (HGTM) [130]. The HGTM (Fig. 18) regards Twitter text as semi-structured and uses the hashtag relationships in the hashtag graph to mine the semantic relationships between words. The model uses Dirichlet prior hyper-parameters to calculate the topic distribution matrix of each hashtag and the word distribution matrix of each topic. The experimental results showed that HGTM handles the sparsity and noise of Twitter text well and, compared with the baseline models, mines more specific and coherent topics.

Fig. 18

The structure of HGTM.

Social networks such as Twitter carry not only common topics but also emergency topics (disaster news, rescue information, and leadership elections). Learning about emergency topics quickly and accurately helps guide and control public opinion and online rumors. Scholars have therefore proposed SATM, the Bursty Biterm Topic Model (BBTM) [131], the Sparse RNN-Topic Model (SRTM) [132], and other topic models. In practice, however, it is often difficult to locate emergency topics accurately with any one of these models alone. Shi et al. [133] therefore developed an RNN-based modeling method for social network emergency topic discovery (RTM-SBTD) to overcome this issue. The method consists of four parts: data preprocessing, RNN-based prior knowledge learning, sparse topic model construction based on the ‘spike and slab’ prior, and social network emergency topic discovery.

The RTM-SBTD method was compared with online LDA, BBTM, SARTM, and other methods in terms of topic discovery accuracy, novelty, consistency, and quality. The results show that RTM-SBTD outperforms the other methods.

5 Challenge and trend analysis

Although current topic models have made significant progress over the earliest topic models, the following challenges remain in their use:

Short text precise modeling. Existing topic models perform well on typical corpora. Nevertheless, they perform poorly on short texts produced by social media platforms (Twitter, microblogs, and so on), which tend to be colloquial, highly noisy, and nonstandard. In the future, external data such as WordNet and Wikipedia can enhance the semantic analysis and processing of short texts, while internal data can reduce short text noise and improve the effectiveness of short text topic models.

Fine-grained context semantic awareness. BERT has demonstrated significant benefits in text summarization and keyword extraction, making it a strong candidate to replace word2vec as the source of word vectors for topic models. As topic models continue to be explored and optimized, BERT is expected to be applied to text mining in innovative ways.

Text generation applications for specific topics. A generative adversarial network (GAN) is a jointly trained model composed of a generative model and a discriminative model. GANs have been used in sentence-level text generation, such as single-text summarization and human-machine conversation. However, scholars should focus on improving the application of this technology to text generation for specific topics.

Model semantic understanding. Research on knowledge graphs has produced abundant results. Scholars should integrate the high-quality prior knowledge and rich document semantic information from knowledge graph research into the topic modeling process to enhance the model’s semantic understanding ability.

6 Conclusion

To promote the development of topic models, this paper systematically studied the related work and research progress, evaluation indicators, datasets, performance experiments, and application fields of topic models. First, we reviewed the research progress and characteristics of various topic models. We then discussed four widely used basic topic models and briefly introduced topic models based on word vectors and neural networks. After that, we presented common evaluation indicators to provide a reference for comparing and selecting performance indicators, and we compared and summarized the standard datasets used in topic model performance evaluation. We also conducted performance evaluation experiments on three types of models using PMI. We then introduced the application status of topic models in retrieval and recommendation, cross-media modeling, and topic summarization. Finally, we summarized the open problems in applying topic models in various fields.

Graph neural network (GNN) technology has shown solid semantic relevance and modeling capabilities. In the future, we will combine GNNs with topic models to improve topic modeling capabilities.

Footnotes

Author contributions

G.C. constructed the framework of this paper and proposed its writing idea. L.S., Z.W. and Q.Y. completed the analysis and summary of the research progress on topic models. J.L. and T.L. provided opinions on the structure of the paper during writing. G.C., Q.Y. and L.S. conducted the overall writing of the paper. All authors read the full text and agreed to the published version of the manuscript.

Data availability

This article contains no data or material other than the articles used for the review and referenced.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant No. 42377200), the State Key Laboratory of Geohazard Prevention and Geoenvironment Protection (No. SKLGP2021K005), the Natural Science Foundation of Hebei Province of China (No. D2022508002), Suzhou Science and Technology Plan Project (SYG202034), Guangxi Key Laboratory of Trusted Software (No. KX202315), and the Innovation Capability Enhancement Plan Project of Hebei Province, China (21567693H).

References

1. Murshed B.A.H., Mallappa, Abawajy et al., Short text topic modelling methods in the context of big data: taxonomy, survey, and analysis, Artificial Intelligence Review 56 (2023), 5133–5260.

2. Albalawi, Yeap T.H. and Benyoucef, Using Topic Modeling Methods for Short-Text Data: A Comparative Analysis, Frontiers in Artificial Intelligence 3 (2020), 42.

3. Klaifer and Lilian, Topic detection and sentiment analysis in Twitter content related to COVID-19 from Brazil and the USA, Applied Soft Computing 101 (2021), 107057.

4. and Yang, Research on sentiment classification of micro-blog short text based on topic clustering, Journal of Physics: Conference Series 1827(1) (2021), 012160.

5. Hananto V.R., Serdült and Kryssanov, A Text Segmentation Method for Automated Annotation of Online Customer Reviews, Based on Topic Modeling, Applied Sciences 12(7) (2022), 3412.

6. Hofmann, Unsupervised learning by probabilistic latent semantic analysis, Machine Learning 42(1-2) (2001), 177–196.

7. Blei D.M., Ng and Jordan M.I., Latent dirichlet allocation, The Journal of Machine Learning Research 3 (2003), 993–1022.

8. Yin and Wang, A dirichlet multinomial mixture model-based method for short text clustering, Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, USA: ACM, 2014:233–242.

9. Nigam, Mccallum A.K. and Thrun, Text Classification from Labeled and Unlabeled Documents using EM, Machine Learning 39(2/3) (2000), 103–134.

10. Yan, Guo, Lan and Cheng, A biterm topic model for short texts, Proceedings of the International Conference on Web, Rio de Janeiro, Brazil, 2013:1445–1456.

11. Blei D.M. and Lafferty J.D., Correlated Topic Models, Proceedings of the International Conference on Neural Information Processing Systems, Vancouver, Canada, 2005:147–154.

12. Abri and Abri, Providing a Personalization Model Based on Fuzzy Topic Modeling, Arabian Journal for Science and Engineering 46 (2020), 3079–3086.

13. Bianchi, Terragni, Hovy et al., Cross-lingual Contextualized Topic Models with Zero-shot Learning, Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, 2020:1676–1683.

14. Terragni, Fersini, Galuzzi B.G., Tropeano P.F. and Candelieri, OCTIS: Comparing and Optimizing Topic models is Simple!, Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, Online, 2021:263–270.

15. Clarke, Clark, Birkin et al., Understanding barriers to novel data linkages: topic modeling of the results of the LifeInfo survey, Journal of Medical Internet Research 23(5) (2021), e24236.

16. Qiang, Qian, Li et al., Short text topic modeling techniques, applications, and performance: a survey, IEEE Transactions on Knowledge and Data Engineering 34(3) (2020), 1427–1445.

17. Wright, Burton, McKinlay et al., Public opinion about the UK government during COVID-19 and implications for public health: A topic modeling analysis of open-ended survey response data, PLOS ONE 17(4) (2022), e0264134.

18. Han, Han and Li, Review: Topic Model Application for Social Network Public Opinion Analysis, Proceedings of the 2nd International Conference on Information Technologies and Electrical Engineering, Zhuzhou, Hunan, China: ACM, 2019:1–9.

19. Vayansky and Kumar S.A., A review of topic modeling methods, Information Systems 94 (2020), 101582.

20. Zhou, Yu and Hu, Topic evolution based on the probabilistic topic model: a review, Frontiers of Computer Science 11 (2017), 786–802.

21. Abdelrazek, Eid, Gawish E.K., Medhat and Hassan, Topic modeling algorithms and applications: A survey, Information Systems 112 (2022), 102131.

22. Fan, Li, Ma, Lee, Yu and Hemphill, A Bibliometric Review of Large Language Models Research from 2017 to 2023, ArXiv abs/2304.02020 (2023).

23. Dong, Tang, Li and Zhao W.X., A Survey on Long Text Modeling with Transformers, ArXiv abs/2302.14502 (2023).

24. Fan, Shi and Yuan, Topic modeling methods for short texts: A survey, Journal of Intelligent & Fuzzy Systems 45(2) (2023), 1971–1990.

25. Nugroho, Paris, Nepal et al., A survey of recent methods on deriving topics from Twitter: algorithm to evaluation, Knowledge and Information Systems 62(7) (2020), 2485–2519.

26. Tan K.L., Lee C.P. and Lim K.M., A Survey of Sentiment Analysis: Approaches, Datasets, and Future Research, Applied Sciences 13(7) (2023), 4550.

27. Meng and Xiong, A Doctor Recommendation Based on Graph Computing and LDA Topic Model, International Journal of Computational Intelligence Systems 14(1) (2021), 808.

28. Shi, Du J.P., Liang and Kou F.F., Dynamic topic modeling via self-aggregation for short text streams, Peer-to-Peer Networking and Applications 12(5) (2019), 1403–1417.

29. Toubia, A Poisson Factorization Topic Model for the Study of Creative Documents (and Their Summaries), Journal of Marketing Research 58(6) (2021), 1142–1158.

30. Das, Zaheer and Dyer, Gaussian LDA for Topic Models with Word Embeddings, Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China: Association for Computational Linguistics, 2015:795–804.

31. , Wang, Zhang, Sun and Ma, Topic Modeling for Short Texts with Auxiliary Word Embeddings, Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, Pisa, Italy: ACM, 2016:165–174.

32. Yao, Zhang, Wei, Zhe and Chen, Incorporating knowledge graph embeddings into topic modeling, Proceedings of the National Conference on Artificial Intelligence, San Francisco, USA, 2017:3119–3126.

33. Cao, Li, Liu, Li and Ji, A Novel Neural Topic Model and Its Supervised Extension, Twenty-ninth AAAI Conference on Artificial Intelligence, Austin, USA, 2015:2210–2216.

34. Wang and Yang, Neural Topic Model with Attention for Supervised Learning, International Conference on Artificial Intelligence and Statistics 9 (2020), 1147–1156.

35. Yin, Zhou, Cui, Wang, Zheng and Nguyen Q.V.H., Adapting to User Interest Drift for POI Recommendation, IEEE Transactions on Knowledge and Data Engineering 28(10) (2016), 2566–2581.

36. Miao, Lei and Blunsom, Neural Variational Inference for Text Processing, Computer Science (2016), 1791–1799.

37. Dampfhoffer, Mesquida, Valentian et al., Backpropagation-Based Learning Techniques for Deep Spiking Neural Networks: A Survey, IEEE Transactions on Neural Networks and Learning Systems (2023), 1–16.

38. Hecht-Nielsen, Theory of the backpropagation neural network, Neural networks for perception, Academic Press, 1992:65–93.

39. Ketkar and Ketkar, Stochastic gradient descent, Deep learning with Python: A hands-on introduction (2017), 113–132.

40. Yang, Zhang and Fan, sdtm: A supervised bayesian deep topic model for text analytics, Information Systems Research 34(1) (2023), 137–156.

41. Joshi, Fidalgo, Alegre et al., DeepSumm: Exploiting topic models and sequence to sequence networks for extractive text summarization, Expert Systems with Applications 211 (2023), 118442.

42. , Zhang and Pan, Bi-Directional Recurrent Attentional Topic Model, ACM Transactions on Knowledge Discovery from Data (TKDD) 14 (2020), 1–30.

43. Mishra R.K., Urolagin, Jothi J.A.A., Neogi A.S. and Nawaz, Deep Learning-based Sentiment Analysis and Topic Modeling on Tourism During Covid-19 Pandemic, Frontiers in Computer Science 3 (2021), 775368.

44. Hua, Lu C.T., Jaegul and Chandan K.R., Probabilistic Topic Modeling for Comparative Analysis of Document Collections, ACM Transactions on Knowledge Discovery from Data (TKDD) 14(2) (2020), 1–27.

45. Gupta and Zhang, Vector-quantization-based topic modeling, ACM Transactions on Intelligent Systems and Technology (TIST) 12(3) (2021), 1–30.

46. Tutubalina E.V. and Nikolenko S.I., Topic Models with Sentiment Priors Based on Distributed Representations, Journal of Mathematical Sciences 273 (2023), 639–652.

47. Wang J.Y. and Zhang X.L., Deep NMF topic modeling, Neurocomputing 515 (2023), 157–173.

48. Yang, Wen, Chen N.S. et al., A novel contextual topic model for multi-document summarization, Expert Systems with Applications 42(3) (2015), 1340–1352.

49. Blei D.M. and Mcauliffe J.D., Supervised topic models, Proceedings of the 20th International Conference on Neural Information Processing Systems (NIPS’07), Curran Associates Inc., Red Hook, NY, USA, 2007:121–128.

50. Liu and Huang, Biterm topic model with word vector features, Application Research of Computers 34(7) (2017), 2055–2058.

51. Xun, Li, Zhao W.X., Gao and Zhang, A Correlated Topic Model Using Word Embeddings, IJCAI 17 (2017), 4207–4213.

52. Chen, Wu, Lin, Liu, Zhang and Ye, Affinity Regularized Non-Negative Matrix Factorization for Lifelong Topic Modeling, IEEE Transactions on Knowledge and Data Engineering 32(7) (2020), 1249–1262.

53. Peng, Xie, Zhang, Wang and Zhang, Neural Sparse Topical Coding, Proceedings of the Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 2018:2332–2340.

54. Wang, Guo, Shen et al., Robust supervised topic models under label noise, Machine Learning 110 (2021), 907–931.

55. Ozyurt and Akcayol M.A., A new topic modeling based method for aspect extraction in aspect based sentiment analysis: SS-LDA, Expert Systems with Applications 168 (2020), 114231.

56. Han, Tian, Huang, Li and Jia, Topic representation model based on microblogging behavior analysis, World Wide Web 23(6) (2020), 1–15.

57. Yao and Wang, Tracking urban geo-topics based on dynamic topic model, Computers, Environment and Urban Systems 79 (2020), 101419.

58. AlSumait, Barbará and Domeniconi, Online LDA: Adaptive Topic Models for Mining Text Streams with Applications to Topic Detection and Tracking, 2008 Eighth IEEE International Conference on Data Mining, Pisa, Italy: IEEE, 2008:3–12.

59. Titov and McDonald, Modeling online reviews with multi-grain topic models, Proceedings of the 17th International Conference on World Wide Web, Beijing, China, 2008:111–120.

60. Wang, Blei and Heckerman, Continuous Time Dynamic Topic Models, Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence, Helsinki, Finland, 2008:579–586.

61. Ghoorchian and Sahlgren, GDTM: Graph-based Dynamic Topic Models, Progress in Artificial Intelligence 9(3) (2020), 195–207.

62. Nguyen D.Q., Billingsley, Du and Johnson, Improving Topic Models with Latent Feature Word Representations, Transactions of the Association for Computational Linguistics 3 (2015), 299–313.

63. , Shen, Wang, Du and Du, A Web service clustering method based on topic enhanced Gibbs sampling algorithm for the Dirichlet Multinomial Mixture model and service collaboration graph, Information Sciences 586 (2022), 239–260.

64. Bouziane, Abdi M.K. and Sadou, Automatically Labelled Software Topic Model, International Journal of Open Source Software and Processes 11(1) (2020), 57–78.

65. Niu, Zhang and Li, A Nested Chinese Restaurant Topic Model for Short Texts with Document Embeddings, Applied Sciences 11(18) (2021), 8708.

66. Wang, Zhang J.L., Li, Deng and Liu J.S., Bayesian Text Classification and Summarization via A Class-Specified Topic Model, Journal of Machine Learning Research 22(89) (2021), 1–89.

67. Gao, Generation of topic evolution graphs from short text streams, Neurocomputing 383 (2020), 282–294.

68. Peng, Xu and Gan, SBTM: A joint sentiment and behaviour topic model for online course discussion forums, Journal of Information Science 47(4) (2021), 517–532.

69. Kai, Zhang and Xu, Topic Model over Short Texts Incorporating Word Embedding, Proceedings of the 2018 2nd International Conference on Advances in Energy, Environment and Chemical Science (AEECS 2018), Zhuhai, China: Atlantis Press, 2018.

70. Yang and Wang, Author topic model for co-occurring normal documents and short texts to explore individual user preferences, Information Sciences 570 (2021), 185–199.

71. , Li, Chi and Ouyang, Short text topic modeling by exploring original documents, Knowledge and Information Systems 56(2) (2018), 443–462.

72. Quan, Kit, Yong and Pan, Short and Sparse Text Topic Modeling via Self-Aggregation, International Conference on Artificial Intelligence, 2015:2270–2276.

73. Murakami and Chakraborty, Investigating the Efficient Use of Word Embedding with Neural-Topic Models for Interpretable Topics from Short Texts, Sensors 22(3) (2022), 852.

74. Huang, Peng, Li et al., Improving biterm topic model with word embeddings, World Wide Web 23(6) (2020), 3099–3124.

75. Adji B.D., Francisco J.R.R. and David M.B., Topic Modeling in Embedding Spaces, Transactions of the Association for Computational Linguistics 8 (2020), 439–453.

76. , Jiang and Wu, Topic Modeling for Short Texts via Word Embedding and Document Correlation, IEEE Access 8 (2020), 30692–30705.

77. Wang S.Y., Zhou J.J. and Lin F.Q., Topic Mining Based on the Heat of Micro-blog, Journal of Physics: Conference Series, IOP Publishing 1060(1) (2018), 012010.

78. Peng, Zhu, Wang, Li, Zhang and Tian, Mining Event-Oriented Topics in Microblog Stream with Unsupervised Multi-View Hierarchical Embedding, ACM Transactions on Knowledge Discovery from Data 20(3) (2018), 1–26.

79. Guo and Chen, Bias-Sentiment-Topic model for microblog sentiment analysis, Concurrency and Computation: Practice and Experience 30(13) (2018), e4417.

80. Sang, Xu, Qian S.S. and Wu, Multimodal multiview Bayesian semantic embedding for community question answering, Neurocomputing 334 (2019), 44–58.

81. Shi, Cheng, Xie and Xie, A word embedding topic model for topic detection and summary in social networks, Measurement and Control 52(9-10) (2019), 1289–1298.

82. Zhao, Du and Buntine W.L., A Word Embeddings Informed Focused Topic Model, Proceedings of the Asian Conference on Machine Learning, Seoul, Korea, 2017:423–438.

83. Peng, Yang and Zhu, Semantic Enhanced Topic Modeling by Bi-directional LSTM, Journal of Chinese Information Processing 32(4) (2018), 40–49.

84. Sun, Guo and Ji D.H., Topic Representation Integrated with Event Knowledge, Chinese Journal of Computers 40(4) (2017), 791–804.

85. , Duan, Wang, Zhang, Sun and Ma, Enhancing Topic Modeling for Short Texts with Auxiliary Word Embeddings, ACM Transactions on Information Systems 36(2) (2017), 1–30.

86. Zhao, Wang, Zhao, Liu, Lu and Zhuang, A neural topic model with word vectors and entity vectors for short texts, Information Processing & Management 58(2) (2021), 102455.

87. Kingma D.P. and Welling, Auto-Encoding Variational Bayes, Proceedings of the International Conference on Learning Representations, Banff, Canada, 2014:1–14.

88. Grootendorst M.R., BERTopic: Neural topic modeling with a class-based TF-IDF procedure, ArXiv abs/2203.05794 (2022).

89. Feng, Zhang, Ding, Rao and Xie, Context Reinforced Neural Topic Modeling over Short Texts, Information Sciences 607 (2020), 79–91.

90. Lin, Hu and Guo, Sparsemax and Relaxed Wasserstein for Topic Sparsity, Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, 2019:141–149.

91.

Chen

, Ding

, Rao

et al., Hierarchical neural topic modeling with manifold regularization, J World Wide Web 24 (2021), 2139–2160.

92.

Mazzei

and Ramjattan

, Machine Learning for Industry 4.0 A Systematic Review Using Deep Learning-Based Topic Modelling, J Sensors 22(22) (2022), 8641.

93.

Huang

P.S.

, He

, Gao

et al., Learning deep structured semantic models for web search using clickthrough data, C. //Proceedings of the 22nd ACM international conference on Information & Knowledge Management 2013:2333–2338.

94.

, Gan

and Yin

, Feedback recurrent neural network-based embedded vector and its application in topic model, J EURASIP Journal on Embedded Systems, 2017(1):1–6.

95.

Card

, Tan

and Smith

N.A.

, Neural Models for Documents with Metadata, C., //Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne VIC Australia, 2018:2031–2040.

96.

Dieng

A.B.

, Wang

, Gao

and Paisley

, TopicRNN: A Recurrent Neural Network with Long-Range Semantic Dependency, M., //Proceedings of the International Conference on Learning Representations, Toulon, France, 2017:1–13.

97.

Xie

, Tiwari

, Gupta

, Huang

and Peng

, Neural variational sparse topic model for sparse explainable text representation, J Information Processing & Management 58(5) (2021), 102614.

98.

Korshunova

, Xiong

, Fedoryszak

and Theis

, Discriminative Topic Modeling with Logistic LDA, J Advances in Neural Information Processing Systems (2019), 32.

99.

Newman

, Lau

J.H.

, Grieser

and Baldwin

, Automatic evaluation of topic coherence, C. Human language technologies: The 2010 annual conference of the North American chapter of the association for computational linguistics, N. Eight Street, Stroudsburg, PA, 18360United States, 2010:100–108.

100.

Bouma

, Normalized (Pointwise) Mutual Information in Collocation Extraction, J Proceedings of GSCL 30 (2009), 31–40.

101.

Jiang

Y.Y.

, Li

and Wang

, An improved labeled latent Dirichlet allocation model for multilabel classification, J. Journal of Nanjing University (Natural Science) 49(4) (2013), 425–432.

102.

Aziz

, Dowling

, Hammami

et al., Machine learning in finance: A topic modeling method, J European Financial Management 28(3) (2022), 744–770.

103.

Churchill

and Singh

, The evolution of topic modeling, J ACM Computing Surveys 54(10) (2022), 1–35.

104.

Egger

and Yu

, A topic modeling comparison between lda, nmf, top2vec, and bertopic to demystify twitter posts, J Frontiers in Sociology 7 (2022), 886498.

105.

Mutanga

M.B.

and Abayomi

, Tweeting on COVID-19 pandemic in South Africa: LDA-based topic modelling method, J African Journal of Science, Technology, Innovation and Development 14(1) (2022), 163–172.

106.

Tang

, Chen

, Li

et al., Research on the Evolution of Journal Topic Mining Based on the BERT-LDA Model, C, SHS Web of Conferences, EDP Sciences 152 (2023), 03012.

107.

Shi

, Song

, Cheng

and Liu

, A user-based aggregation topic model for understanding user’s preference and intention in social network, J Neurocomputing 413 (2020), 1–13.

108.

Shao

, Tang

and Bao

B.K.

, Personalized Travel Recommendation Based on Sentiment-Aware Multimodal Topic Model, J IEEE Access 7 (2019), 113043–113052.

109.

, Ying

, Jun

T.X.

, Xia

L.M.

and Wang

, Improved user-based collaborative filtering algorithm with topic model and time tag, J International Journal of Computational Science and Engineering 22(2-3) (2020), 181–189.

110.

Zhang

, Chen

and Liu

, A Space-Time Periodic Task Model for Recommendation of Remote Sensing Images, J ISPRS International Journal of Geo-Information 7(2) (2018), 40.

111.

Chen

, Liu

, Li

and Jia

, Remote sensing image recommendation based on spatial– temporal embedding topic model, J Computers & Geosciences 157 (2021), 104935.

112.

Kowald

, Pujari

S.C.

and Lex

, Temporal Effects on Hashtag Reuse in Twitter: A Cognitive-Inspired Hashtag Recommendation Method, C., //Proceedings of the 26th International Conference on Web, Perth Australia: International Web Conferences Steering Committee, Perth Australia, 2017:1401–1410.

113.

, Li

, Shao

and Liu

, Multiview Scholar Clustering With Dynamic Interest Tracking, J IEEE Transactions on Knowledge and Data Engineering 2023(1), 1–14.

114.

Zeng

, Du

, Xue

and Li

, Scientific and Technological News Recommendation Based on Knowledge Graph with User Perception, C. //2022 IEEE 8th International Conference on Cloud Computing and Intelligent Systems (CCIS), 2020:491–495.

115.

, Meng

and Zhang

, SPATM: A Social Period-Aware Topic Model for Personalized Venue Recommendation, J IEEE Transactions on Knowledge and Data Engineering 34(8) (2022), 3997–4010.

116.

Yang

, Hu

, Liu

and Chen

, Doctor Recommendation Based on an Intuitionistic Normal Cloud Model Considering Patient Preferences, J Cognitive Computation 12 (2020), 460–478.

117.

Yang

, Yang

, Raymond

O.I.

, Zhu

, Huang

, Liao

and Long

, NSDH: A Nonlinear Supervised Discrete Hashing framework for large-scale cross-modal retrieval, J Knowledge-Based Systems 217 (2021), 106818.

118.

, Liu

, Zheng

et al., Bi-Labeled LDA: Inferring Interest Tags for Non-famous Users in Social Network, J Data Science and Engineering 5 (2020), 27–47.

119.

Jeong

, Yoon

and Lee

J.M.

, Social media mining for product planning: A product opportunity mining method based on topic modeling and sentiment analysis, J International Journal of Information Management 48 (2019), 280–290.

120.

Goodfellow

, Pouget-Abadie

, Mirza

, Xu

, Warde-Farley

, Ozair

, Courville

and Bengio

, Generative adversarial networks, J Communications of the ACM 63(11) (2020), 139–144.

121.

Radford

, Metz

and Chintala

, Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, J Computer ence, (2015), 1–16.

122.

Shi

, Luo

, Zhu

, Kou

F.F.

, Cheng

and Liu

, A survey on cross-media search based on user intention understanding in social networks, J Information Fusion 91 (2023), 566–581.

123.

Liu

, Du

J.P.

and Zhou

, A cross media search method for social networks based on adversarial learning and semantic similarity, J Science China: Information Science 51(5) (2021), 779–794.

124.

Renteria-Vazquez

, Brown

W.S.

, Kang

et al., Social inferences in agenesis of the corpus callosum and autism: Semantic analysis and topic modeling, J Journal of Autism and Developmental Disorders (2021), 1–15.

125.

Curiskis

S.A.

, Drake

, Osborn

T.R.

et al., An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit, J Information Processing & Management 57(2) (2020), 102034.

126.

Blair

S.J.

, Bi

and Mulvenna

M.D.

, Aggregated topic models for increasing social media topic coherence, J Applied Intelligence 50 (2020), 138–156.

127.

Liu

, Zhang

, Lu

and Gu

, A reliable cross-site user generated content modeling method based on topic model, J Knowledge-Based Systems 209 (2020), 106435.

128.

Chakraborty

and Chandra

, Analyzing Interaction Dynamics in Social Networks through Social Yield, J Data & Knowledge Engineering 119 (2019), 139–149.

129.

Wang

, Meng

, Li

and Yang

, Multimodal Mention Topic Model for mentionee recommendation, J Neurocomputing 325 (2019), 190–199.

130.

Wang

, Liu

, Huang

and Feng

, Using Hashtag Graph-Based Topic Model to Connect Semantically-Related Words Without Co-Occurrence in Microblogs, J IEEE Transactions on Knowledge and Data Engineering 28(7) (2016), 1919–1933.

131.

Yan

, Guo

, Lan

, Xu

and Cheng

, A Probabilistic Model for Bursty Topic Discovery in Microblogs, C., //Twenty-ninth AAAI conference on artificial intelligence 2015:353–359.

132.

Shi

, Du

J.P.

, Liang

M.Y.

and Kou

F.F.

, SRTM: A Sparse RNN-Topic Model for Discovering Bursty Topics in Big Data of Social Networks, J Journal of Information Science & Engineering 35(4) (2019), 749–767.

133.

Shi

, Du

J.P.

and Liang

M.Y.

, Social network bursty topic discovery based on RNN and topic model, J Journal of Communications 39(4) (2018), 189–198.


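Table 7 Comparison of topic coherence

| Dataset | Model type | Model | K = 50 | K = 100 |
|---|---|---|---|---|
| Sina microblog | Traditional | LDA | 0.80 | 0.75 |
| Sina microblog | Traditional | LTM | 1.25 | 1.35 |
| Sina microblog | Traditional | GSDMM | 1.43 | 1.41 |
| Sina microblog | Word vector-based | UMHE | 1.45 | 1.42 |
| Sina microblog | Word vector-based | CTM | 1.69 | 1.73 |
| Sina microblog | Word vector-based | HTMH | 1.74 | 1.71 |
| Sina microblog | Neural network-based | DSSM | 2.01 | 2.23 |
| Sina microblog | Neural network-based | GRNN | 2.12 | 2.51 |
| Sina microblog | Neural network-based | UATM | 2.61 | 2.69 |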