A more time-efficient gibbs sampling algorithm based on SparseLDA for latent dirichlet allocation

Abstract

As an efficient sampling algorithm for latent dirichlet allocation, SparseLDA uses a cache strategy to improve the time and space efficiency of the standard gibbs sampling algorithm (StdGibbs) by recycling previous computation. However, SparseLDA cannot further improve the time-efficiency of StdGibbs, since the amount of recycled computation is limited: the word types of two adjacent tokens are usually different, so the previous computation cannot easily be recycled further. To solve this problem, we propose a new algorithm based on SparseLDA, named Efficient SparseLDA (ESparseLDA). The main idea of ESparseLDA is to first rearrange the tokens within each text according to their word types, so that tokens of the same word type are aggregated together, and then to recycle more computation while making no approximation and preserving exactness. We give detailed theoretical explanations and comparative experimental analyses of the correctness, exactness and time-efficiency of ESparseLDA. In detail, statistical significance tests on perplexities show that ESparseLDA is correct and exact. In addition, the running time results show that the time-efficiency of ESparseLDA is higher than that of SparseLDA by 5.06% to 31.85%, depending on the dataset used in the experiments.

Keywords

Latent dirichlet allocation, topic model, gibbs sampling, topic inference

1. Introduction

Latent dirichlet allocation (LDA) has been studied extensively in recent years. As a type of graphical model it is now well known as a “topic model” [1, 26, 32, 37]. LDA was initially proposed as an unsupervised method for text data that have no label information [3, 4]. Since then researchers from different fields have conducted extensive and in-depth research on LDA. In terms of applications, LDA has been widely used in text mining [8, 11, 19, 23, 25], dimensionality reduction [7, 18], speech recognition [15, 16, 17], sentiment analysis [14, 21, 22, 33, 34], social network analysis [5, 6], image recognition [9, 28], scene recognition [27, 36] and other tasks [2]. Moreover, from a theoretical point of view, both the generative process [30] and the inference process [31] of LDA have been studied. In these studies researchers have extended and improved LDA in different ways according to the diverse characteristics of the problems from various fields.

Concretely, because the standard gibbs sampling algorithm [12] (shortened as “StdGibbs”) usually takes a long time (on the order of hours on a medium size dataset) [38] to infer the topics of the tokens, the topic sampling process of StdGibbs for LDA has been studied intensively. To date there are two main families of inference methods for LDA: variational inference methods [4] and sampling inference methods [12]. Compared with variational inference methods, sampling inference methods have the advantage that they usually target the true posterior probability distribution in the topic inference process [24]. However, sampling inference methods usually take a long time to get close to the true posterior probability. As a result, the application range of LDA is limited to a certain extent.

In order to solve this problem, some researchers have optimized the topic sampling process of StdGibbs for LDA to improve its time-efficiency and broaden the applicability of LDA. Accordingly, several improved algorithms have been proposed in succession. Typical examples are FastLDA [29], SparseLDA [38], ECGS [35], AliasLDA [20] and LightLDA [39] (see Section 2 Related works). Among these methods, ECGS, AliasLDA and LightLDA are approximate algorithms which make some approximations in the optimization of StdGibbs, while FastLDA and SparseLDA are precise algorithms which do not make any approximation in the optimization of StdGibbs.

Among those precise optimization methods, SparseLDA is the most efficient. However, it still has a “recycle computation” problem (see Section 4.1 Problem statement). This problem is caused by the fact that the word types of two adjacent tokens are usually different. Consequently, the word type changes frequently as the sampler moves from one token to the next along the token sequence. Because of this, SparseLDA cannot recycle more computation by caching the results which have already been calculated in the topic sampling process. Therefore the time complexity of SparseLDA cannot be further reduced (see Section 3.2 SparseLDA).

Aiming at this issue, this paper proposes a more efficient algorithm named “Efficient SparseLDA” (ESparseLDA) on the basis of SparseLDA. The main idea to solve the “recycle computation” problem in ESparseLDA includes two points. 1) The first point is to make a token rearrangement operation on the token sequence so that the tokens of the same word type within texts are aggregated together. 2) The second point is to recycle more computation by caching the results which have already been calculated so that the total amount of computation is reduced.

In particular, as the main contribution of this paper ESparseLDA has three distinguishing features as follows.

1) Simplicity: The idea of ESparseLDA is simple and easy to understand and implement: it first performs a token rearrangement operation on the token sequence and then uses a cache strategy to optimize the calculation in the topic sampling process (see Subsection 4.1 Intuitive ideas).
2) Exactness: The sampling results of ESparseLDA are consistent with those of SparseLDA and StdGibbs. In other words, ESparseLDA is an optimization and improvement of SparseLDA that preserves the exactness of the sampling results (see Subsection 4.5 Verification of ESparseLDA, Subsection 5.3.1 Correctness verification and Subsection 5.3.2 Exactness verification).
3) Time-efficiency: ESparseLDA is more efficient than SparseLDA. The time complexity of SparseLDA is approximately O(DNK ${}_{w}$ ), while the time complexity of ESparseLDA is approximately O(RDNK ${}_{w}$ ) (0 $<R<$ 1) (see Subsection 3.2 SparseLDA and Subsection 4.4 Time complexity of ESparseLDA). Experimental results on different datasets show that the time-efficiency of ESparseLDA is higher than that of SparseLDA by 5.06% to 31.85%.

For a brief presentation we summarize these three characteristics of the proposed algorithm ESparseLDA in Table 1.

Table 1
Three characteristics of the proposed algorithm ESparseLDA

Characteristics	Summarization
Simplicity	The idea of ESparseLDA is simple and easy to understand and implement.
Exactness	The sampling results of ESparseLDA are as exact as SparseLDA and StdGibbs.
Time-efficiency	ESparseLDA is more time-efficient compared with SparseLDA.

The rest of this paper is organized as follows. Section 2 discusses the related works. Section 3 describes StdGibbs and SparseLDA which are the important bases of this paper. Section 4 presents our proposed algorithm ESparseLDA in detail, including the framework description, time complexity analysis and theoretical illustration of its features. Section 5 conducts experimental validation to verify the features of ESparseLDA. Section 6 concludes the paper and points to future work.
2. Related works

Aiming at the problem that the topic sampling process of StdGibbs for LDA usually takes a long time to infer the topics of the tokens, researchers have proposed a series of optimized methods based on StdGibbs including FastLDA [29], SparseLDA [38], ECGS [35], AliasLDA [20] and LightLDA [39]. These modified methods use different strategies to improve and optimize StdGibbs from different perspectives. In this section we discuss the relationship between them, since these studies are highly related to the method proposed in this paper.

StdGibbs StdGibbs [12] is the initial sampling method proposed to infer the topics of tokens for LDA. The time complexity of StdGibbs for LDA is O(DNK). $D$ represents the number of the texts used for training LDA. $N$ represents the average length of the texts. $K$ represents the number of topics. Table 2 summarizes the description of the important notations in this paper. The time complexity of StdGibbs O(DNK) can be seen as composed of two main factors. The first factor “DN” is the total number of the tokens to be sampled in the texts used for training LDA and the second factor “ $K$ ” is the time complexity of the topic sampling process for one token.

Table 2
Notation description

Notation	Description
$\alpha$	The hyperparameter on the document-topic mixture
$\beta$	The hyperparameter on the topic-word mixture
$V$	The number of word types in vocabulary, a const scalar
$K$	The number of topics, $\|T\|$ , a const scalar
$D$	The number of documents in dataset D, a const scalar
$N$	The average number of tokens in one text, a const scalar
$N_{\textit{iter}}$	The number of iterations
$z_{w}$	The topic of token $w$ , a variable
$\bm{z}_{D}$	The topic sequence of dataset D, a vector type variable
$d$	A document in a dataset, an index
$t$	A topic in the topic set, an index
$w$	A word type in the word type set, an index
N ${}_{T}$	A 1* $K$ count matrix, the count of each topic within the dataset
$n_{t}$	An element in N ${}_{T}$ , the count of topic $t$ within the dataset
N ${}^{T}_{D}$	A $D$ * $K$ document-topic count matrix, the count of each topic within each text
$n^{t}_{d}$	An element in N ${}^{T}_{D}$ , the count of topic $t$ within text $d$
$K_{d}$	The average number of the non-zero elements per text in N ${}^{T}_{D}$
N ${}^{T}_{W}$	A $K$ * $V$ topic-word count matrix, the count of each word type within each topic
$n^{t}_{w}$	An element in N ${}^{T}_{W}$ , the count of word type $w$ within topic $t$
$K_{w}$	The average number of the non-zero elements per word type in N ${}^{T}_{W}$
$R$	0 $<R<$ 1, the macro average type token ratio of the texts in the dataset, a const scalar
$\textbf{sseg}(t)$	The probability segment of topic $t$ in S item, a variable
$\textbf{rseg}(t)$	The probability segment of topic $t$ in R item, a variable
$\textbf{qseg}(t)$	The probability segment of topic $t$ in Q item, a variable
t_old_topic	The old topic of the current token before the sampling of the current token, an index
t_new_topic	The new topic of the current token after the sampling of the current token, an index
$C$	The amount of computation needed by the calculation of Q item for one token under cache strategy, a const scalar

The main problem of StdGibbs is that in general it takes a long time for inferring the topics. For instance, on a medium sized dataset (1.0 $\times$ 10 ${}^{6}<\textit{DN}<$ 1.0 $\times$ 10 ${}^{7}$ ) and a medium topic number setting (100 $<K<$ 1000) one iteration of StdGibbs usually takes hours for inferring the topics. In order to improve the time-efficiency of StdGibbs, researchers have proposed a series of modified algorithms based on StdGibbs, including FastLDA, SparseLDA, ECGS, AliasLDA and LightLDA. These methods are 10 to 20 times faster than StdGibbs, and reduce the sampling time of StdGibbs from hours to minutes. According to the two main components (“DN” and “ $K$ ”) in the time complexity O(DNK) of StdGibbs, these modified algorithms can be divided into two categories. 1) The first category contains FastLDA, SparseLDA and AliasLDA. These methods optimize the topic sampling process for one token. For example, SparseLDA reduces the time complexity of the topic sampling process for one token from O( $K$ ) to O( $K_{w}$ ), and AliasLDA reduces that from O( $K$ ) to O( $K_{d}$ ). The symbols “ $K_{w}$ ” and “ $K_{d}$ ” are explained in Table 2. 2) The second category only contains ECGS. This method optimizes the number of tokens to be sampled. For instance, ECGS reduces the number of tokens to be sampled from “DN” to “RDN” (0 $<R<$ 1). $R$ is the macro average ratio of the number of word types to the number of word tokens. (For more details of $R$ see Section 5.1.1 Datasets). In the following, these methods are discussed in detail.
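As a small illustration of the ratio $R$ mentioned above, the following sketch computes a macro average type token ratio for a toy dataset. It assumes that "macro average" means averaging the per-text ratio of distinct word types to tokens; the function name `macro_type_token_ratio` and the toy texts are hypothetical and not taken from the paper.

```python
def macro_type_token_ratio(texts):
    """Average, over non-empty texts, of (#distinct word types) / (#tokens)."""
    ratios = [len(set(tokens)) / len(tokens) for tokens in texts if tokens]
    return sum(ratios) / len(ratios)


# Two toy texts: 3 types over 4 tokens, and 1 type over 2 tokens.
toy_texts = [["topic", "model", "topic", "word"], ["gibbs", "gibbs"]]
R = macro_type_token_ratio(toy_texts)  # (3/4 + 1/2) / 2
```

Texts with many repeated word types give a smaller $R$, which is the case in which reducing the work from "DN" to "RDN" helps most.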

FastLDA In this study [29] the authors observed that, after several iterations of StdGibbs, the probability segments of one token usually differ greatly in value, and only a few topic probability segments account for most of the total topic probability in the topic sampling process of one token. Based on this fact, the idea of FastLDA is to optimize the topic sampling process of one token by calculating and checking only a few probability segments instead of all $K$ segments. Consequently, in FastLDA the amount of calculation for the topic sampling process of one token is reduced from $K$ to much less than $K$ on average. Experiments in this study showed that FastLDA is 3 to 8 times as time-efficient as StdGibbs. Furthermore, the time-efficiency of FastLDA increases as the value of $K$ increases.

SparseLDA In this work [38], the authors first divided the total topic probability distribution (see Eq. (2)) used in StdGibbs into three buckets. They then found that two of these three buckets hardly change during the whole topic sampling process in StdGibbs. Based on this point, the main idea of SparseLDA is to obtain these three buckets with less computation by caching and recycling most of the results which have already been calculated. Therefore, SparseLDA reduces the time complexity of the topic sampling process for one token from O( $K$ ) to O( $K_{w}$ ). Experimental results in this research have shown that SparseLDA can be approximately 20 times faster than StdGibbs and 2 times faster than FastLDA. Moreover, SparseLDA uses significantly less memory than StdGibbs and FastLDA. Compared with FastLDA, which needs to iteratively refine an approximation to the total topic probability, the idea of SparseLDA is simpler. As an important related work, SparseLDA is further explained and analyzed in Section 3.2.

ECGS Compared with FastLDA and SparseLDA, the idea of ECGS is different. In this research [35] the authors tried to reduce the number of tokens to be sampled in each iteration of StdGibbs. For this purpose they proposed two strategies, the shortcut strategy and the dynamic strategy. Under the shortcut strategy, for the tokens of the same word type in one document ECGS samples only one topic for all of them instead of sampling a topic for each of them. Under the dynamic strategy, for the tokens of the same word type in one document ECGS dynamically decides the number of topics sampled for them instead of sampling only one topic. In particular, using the shortcut strategy ECGS can reduce the number of tokens to be sampled from “DN” to “RDN” (0 $<R<$ 1). Experiments in this work have indicated that ECGS is also approximately 2 times faster than FastLDA and is comparable to SparseLDA.

AliasLDA The idea of AliasLDA [20] is to combine the gibbs sampling algorithm with the metropolis-hastings algorithm to approximate the total topic probability distribution by utilizing the sparsity of the topics within one text, rather than the sparsity of the topics assigned to one word type which is utilized by SparseLDA. Concretely, AliasLDA adopts the same gibbs sampling framework for each iteration as StdGibbs, but for each token it uses the metropolis-hastings algorithm to reduce the sampling complexity by sampling from a proposal distribution which is much easier to sample from than the original distribution. Accordingly, AliasLDA reduces the time complexity of the topic sampling process for one token from O( $K$ ) to O( $K_{d}$ ). Experiments have shown that on a relatively small dataset ( $\textit{DN}<$ 1.0 $\times$ 10 ${}^{7}$ ) SparseLDA performs better than AliasLDA (by about 30–40%), but on a relatively large dataset ( $\textit{DN}>$ 1.0 $\times$ 10 ${}^{7}$ ) AliasLDA performs better than SparseLDA (by about 30–40%). Theoretically, SparseLDA is suitable for datasets in which $K_{w}$ is smaller than $K_{d}$ , while AliasLDA is suitable for datasets in which $K_{d}$ is smaller than $K_{w}$ .

LightLDA The idea of LightLDA [39] is also to combine gibbs sampling with the metropolis-hastings algorithm to approximate the posterior probability distribution. It differs from AliasLDA in that for each token it uses two proposal distributions and alternates between them to sample a topic. These two proposal distributions are called “doc proposal” and “word proposal”. For “doc proposal”, the time complexity of sampling a topic is O(1). For “word proposal”, the time complexity of sampling a topic is amortized O(1). In total, LightLDA reduces the time complexity of the topic sampling process for one token from O( $K$ ) to amortized O(1). Experiments have shown that LightLDA is around 3 $\sim$ 5 times as fast as AliasLDA.

Finally, these optimizations can be divided into two categories according to the exactness of the sampling results. 1) The first category includes ECGS, AliasLDA and LightLDA. These three methods make some approximations in the optimization of StdGibbs in order to gain more time-efficiency. 2) The second category includes FastLDA and SparseLDA. These two methods are exact and make no approximation. In other words, they sample correctly and exactly from the same true total topic distribution as StdGibbs. As a precise optimization of StdGibbs which is more efficient than or comparable to the other refined methods, SparseLDA still has a “recycle computation” problem which bounds its time complexity. That is, although it is the most efficient algorithm among the precise optimization methods, SparseLDA cannot further improve the time-efficiency of StdGibbs (see Section 4.1 Problem statement), since the amount of recycled computation is limited. Concretely, this is because the word types of two adjacent tokens are usually different and the previous computation cannot easily be recycled further (see Section 3.2 SparseLDA). Hence, in this paper we analyze the “recycle computation” problem of SparseLDA and propose a new algorithm, ESparseLDA, under the condition of ensuring exactness. ESparseLDA is more time-efficient than SparseLDA and, unlike ECGS, AliasLDA and LightLDA, makes no approximation (see Section 4 Proposed method).

3. Preliminaries

In this section we first give a brief introduction to StdGibbs, including its algorithm framework and time complexity. Based on this, we then give a detailed analysis of SparseLDA, because it is an important related work of this paper and the original paper [38] gives few specific details about it. Concretely, the detailed analysis of SparseLDA covers its core ideas and key points in addition to its algorithm framework and time complexity. These contents are the key foundation of the method proposed in this paper.

3.1 StdGibbs

LDA contains two main processes, the generative process and the inference process. The generative process of LDA assumes that each text is a mixture over latent topics, and each topic is a distribution over word types. After the generation process, in the inference process of LDA the main problem is to infer the topic of each token. Using StdGibbs the probability that one token $w$ belongs to topic $t$ is calculated as below:

$\displaystyle p(z_{w}=t)\propto(\alpha+n_{d}^{t})\frac{\beta+n_{w}^{t}}{\beta V+n_{t}}.$ (1)

The notations are summarized in Table 2. In this paper, this probability is also called “topic probability segment” shortened as “segment”. For each token, there are $K$ segments, since the topic number is set to $K$ in LDA.

Framework of StdGibbs The pseudo codes of StdGibbs and its sub algorithms, including StdGibbsText, StdGibbsToken and StdGibbsSearch are shown in Algorithms 1–4 in the following. In short, StdGibbs, StdGibbsText and StdGibbsToken respectively correspond to the sampling procedure of one dataset, one text and one token.

Algorithm 1 StdGibbs The whole procedure of StdGibbs on the entire dataset is shown in Algorithm 1. StdGibbs mainly consists of two parts. Lines 1 and 2 form the first part, which randomly assigns topics to all tokens in the entire dataset (line 1) and prepares the count matrices N ${}^{T}_{D}$ , N ${}^{T}_{W}$ and N ${}_{T}$ (line 2). Lines 3–7 form the second part, which is the sampling procedure consisting of a loop body. For each text within the dataset, StdGibbs calls the sub algorithm StdGibbsText (line 5). In general, StdGibbs corresponds to the sampling procedure of all tokens in the dataset, while StdGibbsText corresponds to the sampling procedure of all tokens in one text as shown in Algorithm 2.

Algorithm 1 StdGibbs (Dataset D)
1: random assignment of z ${}_{D}$ // initialization
2: count of N ${}_{T}$ , N ${}^{T}_{D}$ , N ${}^{T}_{W}$
3: for iter $=$ 1 to $N_{\textit{iter}}$ do
4: for each text $d$ in D do
5: StdGibbsText ( $d$ )
6: end for
7: end for

Algorithm 2 StdGibbsText StdGibbsText consists of a loop body. Each execution of the loop body corresponds to the sampling procedure of one token. For each token within the text, StdGibbsText calls the sub algorithm StdGibbsToken.

Algorithm 2 StdGibbsText ( $d$ )
1: for each token $w$ in text $d$ do
2: StdGibbsToken ( $d$ , $w$ )
3: end for

Algorithm 3 StdGibbsToken ( $d$ , $w$ )
1: t_old_topic $=$ $z_{w}$ , $t=$ t_old_topic
2: $n^{t}_{d}$ --, $n^{t}_{w}$ --, $n_{t}$ --
3: sum $=$ 0
4: for $i=$ 1 to $K$ do
5: $\textbf{seg}(i)=p(z_{w}=i)$ (see Eq. (1)) // compute segments
6: sum $=$ sum $+$ $\textbf{seg}(i)$
7: end for
8: sample a random variable, $u\sim$ Uniform (0, sum)
9: t_new_topic $=$ StdGibbsSearch ( $u$ , seg) // search the segment
10: $z_{w}=$ t_new_topic, $t=$ t_new_topic
11: $n^{t}_{d}$ ++, $n^{t}_{w}$ ++, $n_{t}$ ++

Algorithm 3 StdGibbsToken StdGibbsToken is the core part of StdGibbs, corresponding to the sampling procedure of one token. The whole procedure of StdGibbsToken for one token is shown in Algorithm 3. It mainly consists of two parts. Lines 1–7 form the first part, named the “calculation process” in this paper, which calculates the $K$ segments for the current token. Lines 8–11 form the second part, called the “search process” in this paper, which searches for the segment in which the sampling point falls for the current token. Specifically, lines 1 and 2 prepare for the formal calculation process, including the removal of the old topic (line 2) and the update of the related elements ( $n^{t}_{d}$ , $n^{t}_{w}$ , $n_{t}$ ) in the count matrices (N ${}^{T}_{D}$ , N ${}^{T}_{W}$ , N ${}_{T}$ ) caused by this removal. Lines 3–7 are the formal calculation process for the segments of the current token, consisting of a loop body. Each execution of this loop body calculates one segment for the current token, so the time complexity of this calculation process is O( $K$ ). Lines 8 and 9 are the search process, completed by calling the sub algorithm StdGibbsSearch. The time complexity of this search process is also O( $K$ ). The index of the segment found in this search process is the new topic of the current token. The whole process of StdGibbsSearch is shown in Algorithm 4. Lines 10 and 11 deal with the aftermath of the search process, including the assignment of the new topic (line 10) and the update of the related elements in the count matrices caused by this assignment (line 11). The total time complexity of one iteration of StdGibbs for the entire dataset is O(DNK), since each of these two processes costs O( $K$ ) per token and hence O(DNK) per iteration.

Algorithm 4 StdGibbsSearch ( $u$ , seg)
1: judge $=$ 0
2: for $i=$ 1 to $K$ do
3: judge $=$ judge $+$ $\textbf{seg}(i)$
4: if $u<$ judge then
5: return $i$ // return the segment
6: end if
7: end for
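Algorithms 3 and 4 can be sketched together in a few lines of Python. This is only an illustrative sketch under assumed data structures, not the authors' implementation: the count matrices are nested lists `n_dt[d][t]`, `n_tw[t][w]` and `n_t[t]` with 0-based indices, and the function name `std_gibbs_token` and the toy counts are hypothetical.

```python
import random


def std_gibbs_token(old_t, d, w, n_dt, n_tw, n_t, alpha, beta, V, rng):
    """Resample the topic of one token of word type w in text d (sketch)."""
    K = len(n_t)
    # Remove the old topic from the counts (Algorithm 3, line 2).
    n_dt[d][old_t] -= 1
    n_tw[old_t][w] -= 1
    n_t[old_t] -= 1
    # Calculation process: K unnormalised segments from Eq. (1).
    seg = [(alpha + n_dt[d][t]) * (beta + n_tw[t][w]) / (beta * V + n_t[t])
           for t in range(K)]
    u = rng.uniform(0.0, sum(seg))
    # Search process: linear scan as in Algorithm 4.
    new_t = K - 1  # fallback guards against floating-point round-off
    judge = 0.0
    for t in range(K):
        judge += seg[t]
        if u < judge:
            new_t = t
            break
    # Add the new topic back into the counts (Algorithm 3, line 11).
    n_dt[d][new_t] += 1
    n_tw[new_t][w] += 1
    n_t[new_t] += 1
    return new_t


# Toy setting (assumed): D = 1 text, K = 2 topics, V = 3 word types.
n_dt = [[2, 1]]
n_tw = [[1, 1, 0], [0, 0, 1]]
n_t = [2, 1]
new_t = std_gibbs_token(0, 0, 0, n_dt, n_tw, n_t, 0.1, 0.01, 3, random.Random(0))
```

Both the list comprehension and the scan touch all $K$ topics, which is where the O( $K$ ) cost per token comes from.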

3.2 SparseLDA

According to the granularity of the relevant data, SparseLDA divides Eq. (1) used in StdGibbs into three buckets as shown in Eq. (2), and each of these three buckets is a sum of $K$ segments as shown in Eq. (3).

$\displaystyle P(z=t|w)\propto\frac{\alpha\beta}{\beta V+n_{t}}+\frac{n_{d}^{t}\beta}{\beta V+n_{t}}+\frac{(\alpha+n_{d}^{t})n_{w}^{t}}{\beta V+n_{t}}$ (2)

$\displaystyle P_{\textit{total}}=\sum\limits_{t=1}^{K}P(z=t|w)=S+R+Q,$

$\displaystyle S=\sum\limits_{t=1}^{K}\textbf{sseg}(t)=\sum\limits_{t=1}^{K}\frac{\alpha\beta}{\beta V+n_{t}},$

$\displaystyle R=\sum\limits_{t=1}^{K}\textbf{rseg}(t)=\sum\limits_{t=1}^{K}\frac{n_{d}^{t}\beta}{\beta V+n_{t}}=\sum\limits_{\{t|n_{d}^{t}>0\}}\frac{n_{d}^{t}\beta}{\beta V+n_{t}},$

$\displaystyle Q=\sum\limits_{t=1}^{K}\textbf{qseg}(t)=\sum\limits_{t=1}^{K}\frac{n_{w}^{t}(\alpha+n_{d}^{t})}{\beta V+n_{t}}=\sum\limits_{\{t|n_{w}^{t}>0\}}\frac{n_{w}^{t}(\alpha+n_{d}^{t})}{\beta V+n_{t}}.$ (3)

In the original paper, these three buckets are called the “smoothing only” bucket, the “document topic” bucket and the “topic word” bucket respectively, but in this paper they are called “S item”, “R item” and “Q item”. To improve the time-efficiency of StdGibbs, SparseLDA uses a “cache strategy” to compute S item and R item and a “sparse strategy” to compute Q item, according to how differently these three items change in the topic sampling process.
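The split of Eq. (1) into the three terms of Eq. (2) follows from expanding $(\alpha+n_{d}^{t})(\beta+n_{w}^{t})=\alpha\beta+n_{d}^{t}\beta+(\alpha+n_{d}^{t})n_{w}^{t}$ . The sketch below checks this numerically for a single topic; the function names and the count values are hypothetical and chosen only for illustration.

```python
def eq1_segment(alpha, beta, V, n_dt, n_wt, n_t):
    """Unnormalised topic probability segment from Eq. (1)."""
    return (alpha + n_dt) * (beta + n_wt) / (beta * V + n_t)


def bucket_segments(alpha, beta, V, n_dt, n_wt, n_t):
    """The sseg, rseg and qseg terms of Eq. (2) for a single topic."""
    denom = beta * V + n_t
    return alpha * beta / denom, n_dt * beta / denom, (alpha + n_dt) * n_wt / denom


# Hypothetical counts for one topic: the three terms add up to the Eq. (1) segment.
args = dict(alpha=0.1, beta=0.01, V=1000, n_dt=3, n_wt=5, n_t=200)
sseg, rseg, qseg = bucket_segments(**args)
gap = abs((sseg + rseg + qseg) - eq1_segment(**args))
```

Note that `rseg` vanishes when the text has no token of the topic and `qseg` vanishes when the word type has none, which is what allows the sums for R and Q in Eq. (3) to run over non-zero counts only.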

Framework of SparseLDA The pseudo codes of SparseLDA and its sub algorithms, including SparseLDAText, SparseLDAToken and SparseLDASearch are shown in Algorithms 5–8 in the following. In short, SparseLDA, SparseLDAText and SparseLDAToken respectively correspond to the sampling procedure of one dataset, one text and one token.

Algorithm 5 SparseLDA The whole procedure of SparseLDA on the entire dataset is shown in Algorithm 5. Similar to StdGibbs, SparseLDA mainly consists of two parts. Lines 1 and 2 form the first part, which randomly assigns topics to all tokens in the entire dataset (line 1) and prepares the count matrices N ${}^{T}_{D}$ , N ${}^{T}_{W}$ and N ${}_{T}$ (line 2). Lines 3–13 form the second part, which is the sampling procedure consisting of two loop bodies. Specifically, lines 4–9 contain the first loop body, corresponding to the preparation work for the calculation process of S item for all tokens in the dataset. Lines 10–12 are the second loop body, corresponding to the formal sampling procedure. Each execution of the second loop body corresponds to the sampling procedure for one text.

Algorithm 5 SparseLDA (Dataset D)
1: random assignment of z ${}_{D}$
2: count of N ${}_{T}$ , N ${}^{T}_{D}$ , N ${}^{T}_{W}$
3: for iter $=$ 1 to $N_{\textit{iter}}$ do
4: variable declaration of sseg, ssum, rseg, rsum, qseg, qsum
5: ssum $=$ 0
6: for $i$ $=$ 1 to $K$ do
7: calculate $\textbf{sseg}(i)$ (see Eq. (3)) // prepare S item
8: ssum $=$ ssum $+$ $\textbf{sseg}(i)$
9: end for
10: for each text $d$ in D do
11: SparseLDAText ( $d$ , ssum, sseg)
12: end for
13: end for

Algorithm 6 SparseLDAText Similar to the relationship between StdGibbs and StdGibbsText, for each text within the dataset SparseLDA calls the sub algorithm SparseLDAText (Algorithm 5, line 11). The whole procedure of SparseLDAText is shown in Algorithm 6 and consists of two loop bodies. Lines 1–5 contain the first loop body, corresponding to the preparation work for the calculation process of R item for all tokens in one text. Lines 6–8 are the second loop body. Each execution of the second loop body corresponds to the sampling procedure of one token. For each token within the text, SparseLDAText calls the sub algorithm SparseLDAToken (line 7).

Algorithm 6 SparseLDAText ( $d$ , ssum, sseg)
1: rsum $=$ 0
2: for each $t$ in $\{t\mid n^{t}_{d}\neq 0\}$ do
3: calculate $\textbf{rseg}(t)$ (see Eq. (3)) // prepare R item
4: rsum $=$ rsum $+$ $\textbf{rseg}(t)$
5: end for
6: for each token $w$ in document $d$ do
7: SparseLDAToken ( $d$ , $w$ , ssum, sseg, rsum, rseg)
8: end for

Algorithm 7 SparseLDAToken SparseLDAToken is the core algorithm of SparseLDA, corresponding to the sampling procedure for one token as shown in Algorithm 7. It also mainly consists of two parts. Lines 1–8 form the first part, which calculates S item, R item and Q item for the current token. Lines 9–14 form the second part, which searches for the segment in which the sampling point falls. Specifically, lines 1 and 2 prepare for the formal calculation process of S item, R item and Q item for the current token. Line 3 calculates S item and R item for the current token using the cache strategy. Lines 4–8 contain a loop body which calculates Q item for the current token using the sparse strategy. Lines 9–11 are the search process, completed by calling the sub algorithm SparseLDASearch. The process of SparseLDASearch is shown in Algorithm 8. Lines 12–14 deal with the aftermath of the search process for the current token.

Algorithm 7 SparseLDAToken ( $d$ , $w$ , ssum, sseg, rsum, rseg)
1: t_old_topic $=z_{w}$ , $t=$ t_old_topic
2: $n^{t}_{d}$ --, $n^{t}_{w}$ --, $n_{t}$ --
3: update $\textbf{sseg}(t)$ , ssum, $\textbf{rseg}(t)$ , rsum for excluding $z_{w}$ // compute S item and R item
4: qsum $=$ 0
5: for each $t$ in $\{t\mid n^{t}_{w}\neq 0\}$ do
6: calculate $\textbf{qseg}(t)$ (see Eq. (3)) // compute Q item
7: qsum $=$ qsum $+$ $\textbf{qseg}(t)$
8: end for
9: sum $=$ ssum $+$ rsum $+$ qsum
10: sample a random variable, $u\sim$ Uniform (0, sum)
11: t_new_topic $=$ SparseLDASearch ( $u$ , ssum, rsum, qsum, sseg, rseg, qseg)
12: $z_{w}=$ t_new_topic, $t=$ t_new_topic
13: $n^{t}_{d}$ ++, $n^{t}_{w}$ ++, $n_{t}$ ++
14: update $\textbf{sseg}(t)$ , ssum, $\textbf{rseg}(t)$ , rsum for including $z_{w}$ // update S item and R item

Algorithm 8 SparseLDASearch ( $u$ , ssum, rsum, qsum, sseg, rseg, qseg)
1: if $u<$ qsum then // search Q item
2: return StdGibbsSearch ( $u$ , qseg)
3: else if $u<$ qsum $+$ rsum then // search R item
4: return StdGibbsSearch ( $u-$ qsum, rseg)
5: else // search S item
6: return StdGibbsSearch ( $u-$ qsum $-$ rsum, sseg)
7: end if
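The bucket search of Algorithm 8 can be sketched as follows. The representation is an assumption made for illustration: `sseg` is a dense list over all $K$ topics, while `rseg` and `qseg` are dictionaries holding only the non-zero segments; the helper names are hypothetical.

```python
def search_segments(u, items):
    """Linear scan over (topic, mass) pairs, mirroring StdGibbsSearch."""
    judge = 0.0
    last = None
    for topic, mass in items:
        judge += mass
        last = topic
        if u < judge:
            return topic
    return last  # fallback guards against floating-point round-off


def sparse_lda_search(u, ssum, rsum, qsum, sseg, rseg, qseg):
    """Pick the bucket that contains u, then scan only that bucket."""
    if u < qsum:  # search Q item
        return search_segments(u, qseg.items())
    if u < qsum + rsum:  # search R item
        return search_segments(u - qsum, rseg.items())
    return search_segments(u - qsum - rsum, enumerate(sseg))  # search S item


# Toy segments (assumed values): K = 3 topics.
sseg = [0.1, 0.1, 0.1]
rseg = {1: 0.5}
qseg = {2: 2.0, 0: 1.0}
ssum, rsum, qsum = sum(sseg), sum(rseg.values()), sum(qseg.values())
picked = [sparse_lda_search(u, ssum, rsum, qsum, sseg, rseg, qseg)
          for u in (1.0, 2.5, 3.2, 3.65)]
```

Because most of the probability mass usually lies in Q item, the first branch is taken for most tokens and the scan visits only the topics with non-zero $n^{t}_{w}$ .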

Time complexity of SparseLDA Like StdGibbs, SparseLDA mainly comprises two processes, the calculation process and the search process. The time complexities of these two processes are analyzed below.

Time complexity of the calculation process The calculation process calculates S item, R item and Q item. Below we analyze the time complexities of the calculations of these three items separately as shown in Tables 3–5.

Table 3

Time complexity for calculating S item in SparseLDA

S item	The first token in the dataset	Not the first token in the dataset
Single time complexity	O( $K$ )	O(1)
Number of tokens	1	DN-1
Subtotal time complexity	O( $K$ )	O(DN)
Total time complexity	O(DN)

Table 4

Time complexity for calculating R item in SparseLDA

R item	The first tokens in all the texts	Not the first tokens in all the texts
Single time complexity	O( $K_{d}$ )	O(1)
Number of tokens	$D$	DN- $D$
Subtotal time complexity	O(DK ${}_{d}$ )	O(DN)
Total time complexity	O(DN)

Table 5

Time complexity for calculating Q item in SparseLDA

Q item	All tokens
Single time complexity	O( $K_{w}$ )
Number of tokens	DN
Total time complexity	O(DNK ${}_{w}$ )

1) Time complexity of calculating S item For the first token in dataset, the subtotal time complexity for calculating its S item (Algorithm 5, line 5–9) is O( $K$ ). For the other DN-1 tokens in dataset, the subtotal time complexity for calculating their S items (Algorithm 7, line 3) is O(DN). Thus, for all the tokens in the dataset, the time complexity for calculating their S items is O(DN) as shown in Table 3, since DN is far greater than $K$ .

2) Time complexity of calculating R item For all the first tokens in all the texts in a dataset, the subtotal time complexity for calculating their R items (Algorithm 6, line 1–5) is O(DK ${}_{d}$ ). For the other DN- $D$ tokens in the dataset, the subtotal time complexity for calculating their R items (Algorithm 7, line 3) is O(DN). Thus, for all the tokens in the dataset the time complexity of the calculation process for R item is also O(DN) as shown in Table 4, since $N$ is greater than $K_{d}$ .

3) Time complexity of calculating Q item For all the tokens in the dataset, the time complexity for calculating their Q items (Algorithm 7, line 4–8) is O(DNK ${}_{w}$ ) as shown in Table 5, since for one token the time complexity for calculating its Q item is O( $K_{w}$ ).

In summary, for all tokens of the entire dataset the time complexity of the calculation process is O(DNK ${}_{w}$ ) which is dominated by the calculation process of Q item, since the respective time complexities for calculating S item, R item and Q item are O(DN), O(DN) and O(DNK ${}_{w}$ ).

Time complexity of the search process The time complexity of the search process is shown in Table 6. According to the item into which the sampling point falls, the time complexity of the search process for one token can be divided into three cases. The corresponding complexities of these three cases are O( $K$ ), O( $K_{d}$ ) and O( $K_{w}$ ) respectively, since the S item, R item and Q item have $K$ , $K_{d}$ and $K_{w}$ nonzero segments respectively. In addition, the experimental results show that the value of S item and R item is relatively small compared with that of Q item (about 1:19). As a result, about 95% of the sampling points fall into the Q item. Thus the time complexity of the search process for one token can be approximated as O( $K_{w}$ ) and the total time complexity of the search process for all tokens in the entire dataset is O(DNK ${}_{w}$ ).

Table 6

Time complexity of the search process in SparseLDA for one token

The item into which the sample point falls	S	R	Q
Time complexity	O( $K$ )	O( $K_{d}$ )	O( $K_{w}$ )

One further point about the segments of the Q item should be noted, because it also influences the time complexity of the search process. When the sampling point falls into the Q item, the time complexity of the search process for one token can even be approximated as O(1), since in SparseLDA the segments of Q item are ordered from largest to smallest and the proportion of the largest segment is usually more than 90%. In summary, for one token the time complexity of the search process is approximately O(1) and for all tokens in the entire dataset the total time complexity of the search process is approximately O(DN).
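The search described above can be sketched as follows. This is only an illustrative stand-in for SparseLDASearch (Algorithm 8), not its actual pseudo code: the function name `sparse_lda_search`, the bucket order S, R, Q and the data layout (a list for S, a dict for R, a list of pairs sorted by decreasing mass for Q) are assumptions made for this example. The returned scan count shows why a sampling point that lands in the largest Q segment is found after a single comparison.

```python
def sparse_lda_search(u, sseg, rseg, qseg_sorted):
    """Locate the segment that contains the sampling point u.

    sseg:        list of K masses of the S item (one per topic).
    rseg:        dict {topic: mass} for the nonzero segments of the R item.
    qseg_sorted: list of (topic, mass) pairs of the Q item, largest mass first.
    Returns (topic, item_name, number_of_segments_scanned).
    """
    buckets = (
        ("S", list(enumerate(sseg))),
        ("R", list(rseg.items())),
        ("Q", list(qseg_sorted)),
    )
    for name, segments in buckets:
        total = sum(mass for _, mass in segments)
        if u < total:
            scanned = 0
            for topic, mass in segments:
                scanned += 1
                if u < mass:
                    return topic, name, scanned
                u -= mass
            # Only reachable through floating-point rounding.
            return segments[-1][0], name, scanned
        u -= total
    raise ValueError("u must be smaller than the total mass")
```

A real implementation would keep the three sums cached instead of recomputing them on every call; they are recomputed here only to keep the sketch short.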

Time complexity of SparseLDA The total time complexity of SparseLDA is approximately O(DNK ${}_{w}$ ), dominated by the calculation process, since the time complexities of the calculation process and the search process are approximately O(DNK ${}_{w}$ ) and O(DN) respectively.

4. Proposed method

In this section, we first describe the “recycle computation” problem of SparseLDA and introduce our intuitive ideas to solve this problem. Then, we propose our method ESparseLDA modified on SparseLDA and present the details of ESparseLDA, mainly including its algorithm framework and time complexity. Finally, we explain the features (correctness, exactness and time-efficiency) of ESparseLDA from the theoretical point of view.

4.1 Problem statement

The “recycle computation” problem of SparseLDA As mentioned before (Section 2), SparseLDA is an efficient and exact gibbs sampling method for LDA. However, there is a so-called “recycle computation” problem in calculating the segments of Q item in SparseLDA, and this problem dominates the total time complexity of SparseLDA. The problem arises because Q item behaves differently from S item and R item. Unlike the segments of S item and R item, the segments of Q item cannot be obtained through the cache strategy from previously computed results: after a token switch the segments of S item and R item are almost unchanged, whereas almost all the segments of Q item have changed. Normally, within the same text, when the topic sampling process switches from one token to another token with a different word type, all the segments in Q item (qseg(1), qseg(2), …, qseg( $K$ )) will have changed, but only two segments in S item (sseg(t_old_topic) and sseg(t_new_topic)) and only two segments in R item (rseg(t_old_topic) and rseg(t_new_topic)) will have changed. This is the “recycle computation” problem of SparseLDA; as the original paper [38] puts it, “The topic word constant q changes with the value of w, so we cannot as easily recycle earlier computation”. It is precisely because of this problem that the time complexity of SparseLDA cannot be reduced further. This paper therefore tries to solve this problem and further reduce the time complexity of SparseLDA.
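The difference between the three items can be made concrete with a small numeric sketch. The Q segment below follows the formula used in this paper, $n_{w}^{t}(\alpha+n_{d}^{t})/(\beta V+n_{t})$; the S and R segments are assumed to take the usual SparseLDA forms $\alpha\beta/(\beta V+n_{t})$ and $n_{d}^{t}\beta/(\beta V+n_{t})$, since Eq. (3.2) is not restated here. All counts and hyperparameter values are made up for illustration, and the function names are hypothetical.

```python
def segments(alpha, beta, V, n_t, n_dt, n_wt):
    """Return (sseg, rseg, qseg), one entry per topic, from the current counts."""
    topics = range(len(n_t))
    denom = [beta * V + n_t[t] for t in topics]
    sseg = [alpha * beta / denom[t] for t in topics]
    rseg = [n_dt[t] * beta / denom[t] for t in topics]
    qseg = [n_wt[t] * (alpha + n_dt[t]) / denom[t] for t in topics]
    return sseg, rseg, qseg


def changed_topics(before, after):
    """Indices of the segments whose value differs between two snapshots."""
    return [t for t, (a, b) in enumerate(zip(before, after)) if a != b]


def demo():
    alpha, beta, V = 0.1, 0.01, 1000            # illustrative values only
    n_t, n_dt = [50, 40, 30, 20], [3, 0, 2, 1]  # topic totals, counts in text d
    n_a, n_b = [5, 0, 1, 2], [0, 7, 3, 1]       # per-topic counts of two word types
    s0, r0, q0 = segments(alpha, beta, V, n_t, n_dt, n_a)
    # A token of word type a in text d moves from topic 0 to topic 2.
    for counts in (n_t, n_dt, n_a):
        counts[0] -= 1
        counts[2] += 1
    s1, r1, q_same = segments(alpha, beta, V, n_t, n_dt, n_a)
    _, _, q_other = segments(alpha, beta, V, n_t, n_dt, n_b)
    return {
        "S": changed_topics(s0, s1),
        "R": changed_topics(r0, r1),
        "Q, next token has the same word type": changed_topics(q0, q_same),
        "Q, next token has another word type": changed_topics(q0, q_other),
    }
```

In this toy setting only the old and the new topic change in S and R, and the same holds for Q when the next token has the same word type, whereas every Q segment differs once the word type changes.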

4.2 Intuitive ideas

In order to improve the time-efficiency of SparseLDA, it is necessary to solve the “recycle computation” problem in calculating the segments of Q item in SparseLDA. To solve this problem the general idea of this paper is to firstly rearrange the token sequence within each document according to word types and secondly compute Q item by caching and recycling the previous results. Obviously, the general idea includes two parts. The first part aims to create conditions by rearranging the token sequence for the second part to utilize cache strategy. Following are the description and illustration of the contents and details of these two parts.

Intuitive idea of token rearrangement First of all, there is an obstacle to using the cache strategy for calculating Q item: the word type changes frequently as the topic sampling process switches from one token to another. As mentioned before (in Subsection 4.1 Problem statement), the word types of adjacent tokens usually differ from each other in natural text, so all the $K$ segments of Q item change after the token switch, which prevents the adoption of cache strategy for computing Q item. Thus, SparseLDA cannot compute Q item through cache strategy and needs to recalculate all the nonzero segments of Q item using sparse strategy for each token in the dataset.

In order to solve this associated problem, this paper considers such a reality that usually in one text there will be more than one token belonging to the same word type. Based on this reality, this paper has produced an intuitive idea, the idea of token rearrangement, to solve this associated problem. The main motive of this intuitive idea is that for the same text if the tokens of the same word type are grouped together, then the word type will not change so frequently along with the token switch in the topic sampling process, and Q item can be calculated through cache strategy by recycling the previous computational results.

Below the operation of this idea is described through an example. Suppose there are two sentences from an article [4] as follows:

The William Randolph Hearst Foundation will give $1.25 million to Lincoln Center. The Hearst Foundation, a leading supporter of the Lincoln Center, will make its usual annual donation, too.

Accordingly, assume that the following token sequence is obtained from these two sentences after stop word removal and other preprocessing.

foundation give million center foundation leading supporter center make annual

Based on this token sequence, another sequence is obtained as below after the token rearrangement operation.

foundation foundation give million center center leading supporter make annual

Comparing these two token sequences, it can be seen that the tokens with the same word type in the latter token sequence are placed together. For the two tokens of the word type “foundation”, they are not adjacent before, but after the token rearrangement operation they are adjacent. Similarly, the situation is the same for the two tokens of the word type “center”.
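The rearrangement in this example can be reproduced with a few lines of code. The sketch below is one possible realization, not necessarily the one used in the experiments: it assumes that word types are emitted in the order of their first occurrence, which is the ordering that reproduces the sequence shown above. The function name `rearrange_tokens` is hypothetical.

```python
def rearrange_tokens(tokens):
    """Group tokens of the same word type together.

    Word types keep the order of their first occurrence in the text
    (an assumption; any grouping order would serve the cache strategy).
    """
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    rearranged = []
    # Dictionaries preserve insertion order (Python 3.7+).
    for word_type, n in counts.items():
        rearranged.extend([word_type] * n)
    return rearranged


original = ("foundation give million center foundation "
            "leading supporter center make annual").split()
rearranged = rearrange_tokens(original)
```

Since the operation only needs one pass to count and one pass to emit, it is linear in the length of the text.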

Intuitive idea of cache strategy After explaining the first intuitive idea (the idea of token rearrangement) here we illustrate the second intuitive idea, the idea of cache strategy, of this paper moving towards the solution of the “recycle computation” problem of SparseLDA. The main motive of this intuitive idea is that for SparseLDA if Q item can be computed through cache strategy instead of sparse strategy, then the total amount of computation in SparseLDA can be further reduced, and the time-efficiency of SparseLDA can be further improved.

Briefly, this is because cache strategy needs less computation and reduces the time complexity to a greater extent than sparse strategy. In detail, cache strategy utilizes the previous intermediate results to accomplish most of the later computation, while sparse strategy utilizes the sparsity of count matrices and needs to compute all the nonzero elements. Concretely, in the same text when the topic sampling process switches from one token to another token with the same word type, only two segments in Q item (qseg(t_old_topic) and qseg(t_new_topic)) will have changed and need to be recalculated under cache strategy (O(1)), whereas under sparse strategy all the nonzero segments in Q item need to be recalculated (O( $K_{w}$ )). Theoretically, in SparseLDA for each token the cache strategy used to compute S item and R item reduces the time complexity from O( $K$ ) down to O(1), but the sparse strategy used to compute Q item only reduces the time complexity from O( $K$ ) down to O( $K_{w}$ ).

Below the operation of this idea is explained by carrying on the previous example. Take the previous token sequence obtained after the token rearrangement operation as an example, the corresponding strategy, sparse strategy or cache strategy, used to compute Q item for each token is shown as follows.

foundation (Sparse) foundation (Cache) give (Sparse) million (Sparse) center (Sparse) center (Cache) leading (Sparse) supporter (Sparse) make (Sparse) annual (Sparse)

It can be seen that for the first token of the word type “center” its Q item is computed by the sparse strategy, while for the second token of the word type “center” its Q item is computed by the cache strategy. The situation is the same for the two tokens of the word type “foundation”.

In summary, before the token rearrangement operation all the Q items are calculated using sparse strategy only, but after the token rearrangement operation, some changes have happened; that is, some Q items are still calculated using sparse strategy, but the other Q items are calculated using cache strategy. Specifically, it is divided into two kinds of situations: 1) If the current token has the same word type as the previous token, then the Q item of the current token is calculated using cache strategy. 2) Otherwise, the Q item of the current token is calculated using sparse strategy as before in SparseLDA.
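The two situations above amount to a one-line test per token. The following sketch labels each token of a text with the strategy that would be used for its Q item; the function name `q_strategies` and the label strings are chosen for this example only.

```python
def q_strategies(tokens):
    """Label each token with the strategy used to compute its Q item."""
    labels = []
    previous = None  # the first token of a text never has a usable cache
    for token in tokens:
        labels.append("Cache" if token == previous else "Sparse")
        previous = token
    return labels


before = ("foundation give million center foundation "
          "leading supporter center make annual").split()
after = ("foundation foundation give million center center "
         "leading supporter make annual").split()
labels_before = q_strategies(before)
labels_after = q_strategies(after)
```

For the example text, no token uses the cache strategy before the rearrangement, and two of the ten tokens use it afterwards.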

In practice, many more Q items will be computed by cache strategy. For example, in the datasets used in this paper, on average about 25–60% of the tokens within one text repeat a word type that has already occurred in that text. Thus their Q items are all computed by cache strategy. The repetition ratio will be even higher for documents which tend to repeat similar content again and again, such as fiction, patents and dissertations.

4.3 ESparseLDA

Based on the two intuitive ideas introduced in the previous section, we propose a new algorithm named “Efficient SparseLDA” (ESparseLDA) in this section. ESparseLDA accomplishes the same task as SparseLDA but is more time-efficient than SparseLDA, since the “recycle computation” problem of SparseLDA has been solved in ESparseLDA. The main idea of ESparseLDA for solving the “recycle computation” problem of SparseLDA is as follows: 1) First it rearranges the tokens in the same text according to the word type. This step aggregates the tokens of the same word type within each text. 2) Second it calculates some of the Q items using the cache strategy by recycling the computational results which have already been calculated. This step reduces the amount of computation required by Q item.

Framework of ESparseLDA Based on the main idea described above, we completed the detailed design of the topic sampling process of ESparseLDA. The pseudo codes of ESparseLDA and its sub algorithms, including ESparseLDAText and ESparseLDAToken are shown in Algorithms 9–11 respectively in the following. ESparseLDA, ESparseLDAText and ESparseLDAToken respectively correspond to the sampling procedure of one dataset, one text and one token. With these pseudo codes, we give a detailed description and theoretical analysis of ESparseLDA.

Algorithm 9 ESparseLDA The whole procedure of ESparseLDA on the entire dataset is shown in Algorithm 9. Similar to SparseLDA, ESparseLDA mainly consists of two parts. Lines 1 and 2 form the first part, which randomly assigns topics to all tokens in the entire dataset (line 1) and prepares the count matrices N ${}^{T}_{D}$ , N ${}^{T}_{W}$ and N ${}_{T}$ (line 2). Lines 3–13 form the second part, the sampling procedure, which consists of two loop bodies. Specifically, lines 4–9 are the first loop body, corresponding to the preparation work for the calculation process of S item for all tokens in the dataset. Lines 10–12 are the second loop body, corresponding to the formal sampling procedure. Each execution of the second loop body corresponds to the sampling procedure for one text. From the pseudo code it can be seen that ESparseLDA is roughly the same as SparseLDA. The only difference between them is that for the sampling procedure of all the tokens in one text ESparseLDA (Algorithm 9, line 11) calls the sub algorithm ESparseLDAText while SparseLDA (Algorithm 5, line 11) calls the sub algorithm SparseLDAText.

Algorithm 9 ESparseLDA (Dataset D)
1: random assignment of z ${}_{D}$
2: count of N ${}_{T}$ , N ${}^{T}_{D}$ , N ${}^{T}_{W}$
3: for iter $=$ 1 to $N_{\textit{iter}}$ do
4: variable declaration of sseg, ssum, rseg, rsum, qseg, qsum
5: ssum $=$ 0
6: for $i=$ 1 to $K$ do
7: calculate $\textbf{sseg}(i)$ (see Eq. (3.2)) // prepare S item
8: ssum $=$ ssum $+$ $\textbf{sseg}(i)$
9: end for
10: for each text $d$ in D do
11: ESparseLDAText ( $d$ , ssum, sseg)
12: end for
13: end for

Algorithm 10 ESparseLDAText The whole procedure of ESparseLDAText for all tokens of one text is shown in Algorithm 10 consisting of two loop bodies: Line 1–5 is the first loop body corresponding to the preparation work for the calculation process of R item for all tokens in one text. Line 6–12 is the second loop body. Each execution of the second loop body corresponds to the sampling procedure of one token. For each token within the text, ESparseLDAText calls the sub algorithm SparseLDAToken (line 8) or ESparseLDAToken (line 10) according to two possible relationships between the word type of the current token and the word type of the previous token. Specifically 1) (token switch between different word types) if the word type of the current token is different from that of the previous token, SparseLDAToken (Algorithm 7) is called. 2) (token switch between the same word type) Otherwise, if the word type of the current token is the same as that of the previous token ESparseLDAToken (Algorithm 11) is called. Compared with SparseLDAText, ESparseLDAText is different to some extent. The main difference between ESparseLDAText and SparseLDAText is that for the sampling procedure of one token SparseLDAText (Algorithm 6, line 7) only calls the sub algorithm SparseLDAToken while ESparseLDAText calls the sub algorithm SparseLDAToken or ESparseLDAToken in different situations. SparseLDAToken is described before in Subsection 3.2 and ESparseLDAToken is described below.

Algorithm 10 ESparseLDAText ( $d$ , ssum, sseg)
1: rsum $=$ 0
2: for each $t$ in $\{t \mid n^{t}_{d}\neq 0\}$ do
3: calculate $\textbf{rseg}(t)$ (see Eq. (3.2)) // prepare R item
4: rsum $=$ rsum $+$ $\textbf{rseg}(t)$
5: end for
6: for each token $w$ in document $d$ do
7: if $w$ is the first token of $d$ or the word type of $w$ is different from that of the previous token
8: SparseLDAToken ( $d$ , $w$ , ssum, sseg, rsum, rseg)
9: else
10: ESparseLDAToken ( $d$ , $w$ , ssum, sseg, rsum, rseg, qsum, qseg)
11: end if
12: end for

Algorithm 11 ESparseLDAToken The whole procedure of ESparseLDAToken for one token is shown in Algorithm 11. As the core algorithm of ESparseLDA, ESparseLDAToken mainly consists of two parts. Lines 1–3 form the first part, which calculates S item, R item and Q item for the current token. Lines 4–9 form the second part, which searches for the segment into which the sampling point falls. Specifically, lines 1 and 2 prepare for calculating S item, R item and Q item of the current token, including the acquisition of the old topic (line 1) and the update of the related elements in the count matrices (line 2) for excluding the old topic of the current token. Line 3 calculates S item, R item and Q item through cache strategy for the current token. Lines 4–6 are the search process, completed by calling the sub algorithm SparseLDASearch (Algorithm 8 in Subsection 3.2). Lines 7–9 deal with the aftermath of the search process for the current token, including the assignment of the new topic (line 7), the update of the related elements in the count matrices (line 8) and the update of the segments of S item, R item and Q item for including the new topic of the current token (line 9). Compared with SparseLDAToken (Algorithm 7), ESparseLDAToken is different mainly in that it calculates the segments of Q item through cache strategy (Algorithm 11, line 3 and 9) instead of sparse strategy (Algorithm 7, line 4–8).

Algorithm 11 ESparseLDAToken ( $d$ , $w$ , ssum, sseg, rsum, rseg, qsum, qseg)
1: t_old_topic $=z_{w}$ , $t=$ t_old_topic
2: $n^{t}_{d}$ --, $n^{t}_{w}$ --, $n_{t}$ --
3: update $\textbf{sseg}(t)$ , ssum, $\textbf{rseg}(t)$ , rsum, $\textbf{qseg}(t)$ , qsum for excluding $z_{w}$ // compute S, R and Q item
4: sum $=$ ssum $+$ rsum $+$ qsum
5: sample a random variable, $u\sim$ Uniform (0, sum)
6: t_new_topic $=$ SparseLDASearch ( $u$ , ssum, rsum, qsum, sseg, rseg, qseg)
7: $z_{w}=$ t_new_topic, $t=$ t_new_topic
8: $n^{t}_{d}$ ++, $n^{t}_{w}$ ++, $n_{t}$ ++
9: update $\textbf{sseg}(t)$ , ssum, $\textbf{rseg}(t)$ , rsum, $\textbf{qseg}(t)$ , qsum for including $z_{w}$ // update S, R and Q item
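A minimal runnable sketch of Algorithm 11 is given below. It is not the implementation used in the experiments: the class `Buckets`, its methods and the plain linear `search` (a stand-in for SparseLDASearch) are hypothetical names, dense lists are used instead of sparse structures, and the S and R segments are assumed to follow the usual SparseLDA forms $\alpha\beta/(\beta V+n_{t})$ and $n_{d}^{t}\beta/(\beta V+n_{t})$. The point of the sketch is that excluding the old topic and including the new topic each touch only one segment per item.

```python
import random


class Buckets:
    """Cached S, R and Q segments for one text d and one word type w."""

    def __init__(self, alpha, beta, V, n_t, n_dt, n_wt):
        # n_t[t]: tokens in topic t; n_dt[t]: tokens of d in topic t;
        # n_wt[t]: tokens of word type w in topic t.
        self.alpha, self.beta, self.V = alpha, beta, V
        self.n_t, self.n_dt, self.n_wt = n_t, n_dt, n_wt
        K = len(n_t)
        self.sseg, self.rseg, self.qseg = [0.0] * K, [0.0] * K, [0.0] * K
        self.ssum = self.rsum = self.qsum = 0.0
        for t in range(K):  # full pass, needed once per word type
            self.refresh(t)

    def refresh(self, t):
        """Recompute the three segments of topic t and patch the sums in O(1)."""
        denom = self.beta * self.V + self.n_t[t]
        s = self.alpha * self.beta / denom
        r = self.n_dt[t] * self.beta / denom
        q = self.n_wt[t] * (self.alpha + self.n_dt[t]) / denom
        self.ssum += s - self.sseg[t]
        self.rsum += r - self.rseg[t]
        self.qsum += q - self.qseg[t]
        self.sseg[t], self.rseg[t], self.qseg[t] = s, r, q

    def shift(self, t, delta):
        """Add delta (+1 or -1) to the three counts of topic t, then refresh."""
        self.n_t[t] += delta
        self.n_dt[t] += delta
        self.n_wt[t] += delta
        self.refresh(t)


def search(u, b):
    """Plain linear scan over S, R and Q; a stand-in for SparseLDASearch."""
    for seg in (b.sseg, b.rseg, b.qseg):
        for t, mass in enumerate(seg):
            if u < mass:
                return t
            u -= mass
    return len(b.sseg) - 1  # only reachable through floating-point rounding


def esparse_lda_token(z, i, b, rng=random):
    """Resample the topic of token i, reusing the cached segments in b."""
    b.shift(z[i], -1)                               # lines 1-3: exclude old topic
    u = rng.uniform(0.0, b.ssum + b.rsum + b.qsum)  # lines 4-5
    z[i] = search(u, b)                             # lines 6-7
    b.shift(z[i], +1)                               # lines 8-9: include new topic
    return z[i]
```

After any number of calls on tokens of the same word type, the cached segments agree with segments recomputed from scratch from the current counts, which is the sense in which the cache strategy makes no approximation.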

Calculations for Q item in ESparseLDA Compared with SparseLDA it can be seen that ESparseLDA has some similarities and differences on the calculation processes of S item, R item and Q item. In particular, for S item and R item ESparseLDA calculates them in the same way as SparseLDA, but for Q item ESparseLDA calculates it in a different manner. Before calculating Q item ESparseLDA first needs to determine whether the word types of the current token and the previous token are the same.

1) Token switch between different word types If they are different, it is a switch between two tokens of two different word types. In this case, ESparseLDA cannot use the cache strategy to compute Q item, since all the $K$ segments in Q item have changed, so sparse strategy is still adopted to compute Q item. In this situation, the calculations for Q item is the same as SparseLDA and the time complexity is still O( $K_{w}$ ) as in SparseLDA. It should be noted that for the first token in each text, it belongs to this situation, because the Q item cached before cannot be used again when the token switch happened between two adjacent texts, no matter whether the word types of these two tokens are the same or not.

2) Token switch between the same word type If they are the same, it is a switch between two tokens of the same word type. In this situation, ESparseLDA uses cache strategy to compute Q item. Specifically, ESparseLDA gets most segments of Q item by recycling the previous results and gets the remaining segments by recalculation, since in Q item only two segments ( $Q_{\textit{t\_old\_topic}}$ and $Q_{\textit{t\_new\_topic}}$ ) have changed. The subscript t_old_topic represents the topic of the current token before it is sampled, while t_new_topic represents the topic of the current token after it is sampled. In other words, in this situation the calculations for the segments of Q item can be divided into two steps. i) For the two segments ( $Q_{\textit{t\_old\_topic}}$ and $Q_{\textit{t\_new\_topic}}$ ) which have changed, ESparseLDA recalculates them (Algorithm 11, line 3 and 9) after updating the related elements in the count matrices (Algorithm 11, line 2 and 8). ii) For the other $K$ -2 segments in Q item which have not changed, ESparseLDA gets them (Algorithm 11, line 3) by caching and recycling the results which have already been calculated. From the perspective of pseudo code, the calculations for Q item contain two main parts. 1) One part of the calculations (Algorithm 11, line 2 and 3) before the topic sampling of the current token is as follows:

$\displaystyle\textit{qsum}=\textit{qsum}-\textbf{qseg}(\textit{t\_old\_topic}),$
$\displaystyle--n_{\textit{t\_old\_topic}},\;--n^{\textit{t\_old\_topic}}_{d},\;--n^{\textit{t\_old\_topic}}_{w},$
$\displaystyle\textit{recalculate }\textbf{qseg}(\textit{t\_old\_topic}),$
$\displaystyle\textbf{qseg}(\textit{t\_old\_topic})=\frac{n_{w}^{\textit{t\_old\_topic}}(\alpha+n_{d}^{\textit{t\_old\_topic}})}{\beta V+n_{\textit{t\_old\_topic}}},$
$\displaystyle\textit{qsum}=\textit{qsum}+\textbf{qseg}(\textit{t\_old\_topic}),$

2) The other part of the calculations (Algorithm 11, line 8 and 9) after the topic sampling of the current token is as follows:

$\displaystyle\textit{qsum}=\textit{qsum}-\textbf{qseg}(\textit{t\_new\_topic}),$
$\displaystyle++n_{\textit{t\_new\_topic}},\;++n^{\textit{t\_new\_topic}}_{d},\;++n^{\textit{t\_new\_topic}}_{w},$
$\displaystyle\textit{recalculate }\textbf{qseg}(\textit{t\_new\_topic}),$
$\displaystyle\textbf{qseg}(\textit{t\_new\_topic})=\frac{n_{w}^{\textit{t\_new\_topic}}(\alpha+n_{d}^{\textit{t\_new\_topic}})}{\beta V+n_{\textit{t\_new\_topic}}},$
$\displaystyle\textit{qsum}=\textit{qsum}+\textbf{qseg}(\textit{t\_new\_topic}),$

In this way, the time complexity of the calculations for Q item in ESparseLDA is reduced from O( $K_{w}$ ) in SparseLDA to O(1). Overall, it can be seen from the above details that ESparseLDA deals with Q item in a way different from SparseLDA. SparseLDA uses only sparse strategy to compute Q item, but ESparseLDA uses both sparse strategy and cache strategy to compute Q item.

4.4 Time complexity of ESparseLDA

Like SparseLDA, ESparseLDA mainly comprises two processes, the calculation process and the search process. In the following, we separately analyze the time complexities of these two processes.

Time complexity of the calculation process Similar to SparseLDA, the calculation process of ESparseLDA consists of three parts, corresponding to the calculations of S item, R item and Q item respectively. As can be seen from the previous analysis, ESparseLDA only optimizes the calculation process of Q item and does not change the calculation processes of S item and R item. Thus, compared with SparseLDA, the time complexities of the calculation processes of S item and R item are unchanged and are still O(DN) as shown in Tables 3 and 4, while the time complexity of the calculation process of Q item changes.

The calculation process of Q item can be divided into two cases according to the strategy used for computing (cache strategy or sparse strategy). In short, the single time complexity for computing one Q item is O(1) (Algorithm 11, line 3) in the cache case and O( $K_{w}$ ) (Algorithm 7, line 4–8) in the sparse case. Since there are RDN tokens using sparse strategy and $(1-R)\textit{DN}$ tokens using cache strategy, the time complexity of the calculation process of all Q items is approximately O(RDNK ${}_{w}$ ) as shown in Table 7. Here $R$ represents the macro average type token ratio of all the texts in the dataset, and its value is between 0 and 1 (for more description of $R$ see Section 5.1.1 Datasets).

Table 7
Time complexity for calculating Q item in ESparseLDA

Q item	The first token in the word type	Not the first token in the word type
Single time complexity	O( $K_{w}$ )	O(1)
Number of tokens	RDN	$(1-R)\textit{DN}$
Subtotal time complexity	O(RDNK ${}_{w}$ )	O( $(1-R)\textit{DN}$ )
Total time complexity	O(RDNK ${}_{w}$ )

In summary, the time complexities of the three calculation processes for all S item, R item and Q item are respectively O(DN), O(DN) and O(RDNK ${}_{w}$ ) in ESparseLDA. Thus, the total time complexity of the calculation process for all tokens of the entire dataset is O(RDNK ${}_{w}$ ) for ESparseLDA which is also dominated by the calculation process of Q item.

Time complexity of the search process Since ESparseLDA only optimizes the calculation process and does not change the search process, the time complexity of the search process in ESparseLDA is the same as SparseLDA which can be approximated as O(DN) (see Section 3.2.4 Time complexity of SparseLDA).

Time complexity of ESparseLDA The total time complexity of ESparseLDA is approximately O(RDNK ${}_{w}$ ), dominated by the calculation process, since the time complexity of its calculation process is approximately O(RDNK ${}_{w}$ ) and that of its search process is approximately O(DN). Thus, in theory the time complexity of ESparseLDA is less than that of SparseLDA, since the time complexity of SparseLDA is O(DNK ${}_{w}$ ) and the value of $R$ is between 0 and 1. Intuitively, it can be inferred that the amount of calculation in ESparseLDA is less than that in SparseLDA, since ESparseLDA only optimizes the calculation process and does not change the search process.
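The step from the two subtotals in Table 7 to the single bound O(RDNK ${}_{w}$ ) can be spelled out as follows. This is a sketch of the reasoning rather than a formal proof; it assumes that one cached update of Q item costs a constant amount of work, and $T_{Q}$ is a symbol introduced here for the total cost of all Q items.

```latex
\begin{aligned}
T_{Q} &= R\,DN \cdot O(K_{w}) + (1-R)\,DN \cdot O(1) \\
      &= O\bigl(DN\,[\,R\,K_{w} + (1-R)\,]\bigr) \\
      &= O(R\,DN\,K_{w}) \qquad \text{whenever } R\,K_{w} \ge 1-R .
\end{aligned}
```

The condition $R\,K_{w}\ge 1-R$ is equivalent to $K_{w}\ge(1-R)/R$. For the values of $R$ reported in Table 8 (0.386 to 0.755), $(1-R)/R$ lies between about 0.32 and 1.59, so the condition already holds when $K_{w}\ge 2$.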

In addition, it needs to be explained that the token rearrangement operation is just a preprocessing step. This step is completed in conjunction with other preprocessing steps. It does not affect the time complexity of ESparseLDA.

4.5 Verification of ESparseLDA

ESparseLDA has three important points that need to be explained and verified: its 1) correctness, 2) exactness and 3) time-efficiency. In detail, 1) correctness refers to the rationality of the token rearrangement operation which is performed before the topic sampling process, 2) exactness refers to the consistency between the training results of ESparseLDA and SparseLDA, and 3) time-efficiency refers to the time-efficiency of ESparseLDA compared with SparseLDA. In this subsection, these three points are explained from the theoretical point of view, and in the next section (Section 5 Experiments) they are verified from the experimental point of view.

Correctness of the token rearrangement operation In detail, the correctness of the token rearrangement operation means that the token rearrangement operation which is performed before the topic sampling process of StdGibbs has no influence on the quality of LDA obtained. This point is explained from two viewpoints in the following. The first viewpoint is the sufficient statistics of statistical models and the second viewpoint is the property of gibbs sampling method.

From the viewpoint of sufficient statistics, the token rearrangement operation has no impact on the quality of the LDA, since it does not affect the main sufficient statistics of LDA. In detail, LDA mainly considers the statistical information about the number of each topic within each text and the number of each word type within each topic. It does not concern about the statistical information about the order of tokens and topics. The main sufficient statistics of LDA are the two counting matrices, N ${}^{T}_{W}$ and N ${}^{T}_{D}$ . N ${}^{T}_{D}$ includes the statistical information about the number of each topic within each text. N ${}^{T}_{W}$ includes the statistical information about the number of each word type within each topic. Since the token rearrangement operation does not affect these two counting matrices, it has no impact on the quality of the LDA.
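The sufficient-statistics argument can be checked on a toy example: if each token carries its topic assignment along when the text is rearranged, the count matrices are identical before and after. The sketch below uses hypothetical helper names and an arbitrary topic assignment; word types are grouped in order of first occurrence, which is an assumption of this example.

```python
from collections import Counter


def count_matrices(docs, assignments):
    """Build the counts n_dt (text, topic), n_wt (word type, topic) and n_t."""
    n_dt, n_wt, n_t = Counter(), Counter(), Counter()
    for d, (tokens, topics) in enumerate(zip(docs, assignments)):
        for w, t in zip(tokens, topics):
            n_dt[d, t] += 1
            n_wt[w, t] += 1
            n_t[t] += 1
    return n_dt, n_wt, n_t


def rearrange_with_topics(tokens, topics):
    """Group tokens by word type, moving each token's topic along with it."""
    first_seen = {}
    for i, w in enumerate(tokens):
        first_seen.setdefault(w, i)
    # sorted() is stable, so tokens of one word type keep their relative order.
    order = sorted(range(len(tokens)), key=lambda i: first_seen[tokens[i]])
    return [tokens[i] for i in order], [topics[i] for i in order]


tokens = ("foundation give million center foundation "
          "leading supporter center make annual").split()
topics = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]  # arbitrary assignment, for illustration
new_tokens, new_topics = rearrange_with_topics(tokens, topics)
same_counts = (count_matrices([tokens], [topics])
               == count_matrices([new_tokens], [new_topics]))
```

Here `same_counts` is true because the counts depend only on which (text, word type, topic) triples occur and how often, not on the positions of the tokens.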

Considering the property of gibbs sampling, the token rearrangement operation is equivalent to changing the sampling order of the gibbs sampling algorithm. Since the gibbs sampling algorithm has the property that random permutations of the sampling order do not affect the sampling results [10], from this viewpoint the token rearrangement operation also has no influence on the quality of LDA. Moreover, it is worth noting that the gibbs sampling algorithm contains a number of random factors because it is a stochastic algorithm. Although these random factors affect the specific sampling results, they do not affect the quality of the overall results of the gibbs sampling algorithm.

Exactness of ESparseLDA Explaining the consistency between the training results of ESparseLDA and SparseLDA amounts to showing that ESparseLDA optimizes SparseLDA while preserving exactness. Since ESparseLDA only optimizes the topic sampling process by reducing the repeated calculation through the cache strategy, ESparseLDA and SparseLDA should have the same sampling results for each token on the assumption that the random factors (e.g. random initialization, random sample) have no influence on the sampling results. Therefore, ESparseLDA is a kind of precise optimization of SparseLDA. Furthermore, ESparseLDA can also be regarded as a kind of precise optimization of StdGibbs, since SparseLDA is also a kind of precise optimization algorithm of StdGibbs.

Exactness of the proposed method From the exactness of ESparseLDA it can be known that using ESparseLDA instead of SparseLDA for training has no change on the ability of LDA. In addition, from the correctness of the token rearrangement operation it can be seen that this operation also has no influence on the quality of LDA trained by SparseLDA. Thus, from the combination of these two points it can be inferred that the token rearrangement operation will still have no influence on the quality of LDA trained by ESparseLDA. So the method proposed in this paper which first makes token rearrangement operation and then uses ESparseLDA for training will not change the ability of the LDA.

Time-efficiency of ESparseLDA The time-efficiency of ESparseLDA is illustrated from two viewpoints, intuitive analysis and theoretical analysis. Intuitively, it can be deduced that ESparseLDA is more efficient than SparseLDA, since ESparseLDA only optimizes the calculation process of Q item in SparseLDA and no other part of SparseLDA is changed. Theoretically, according to the time complexity analysis, the time complexity of SparseLDA is O(DNK ${}_{w}$ ) and the time complexity of ESparseLDA is approximately O(RDNK ${}_{w}$ ), 0 $<R<$ 1. This also shows that ESparseLDA is more efficient than SparseLDA, since the time complexity of ESparseLDA is less than that of SparseLDA. Compared with SparseLDA the total time complexity reduced by ESparseLDA is approximately O( $(1-R)\textit{DNK}_{w}$ ), since there are $(1-R)\textit{DN}$ tokens using cache strategy and the single time complexity reduced for each token by ESparseLDA is approximately O( $K_{w}$ ). Thus the percentage of the time-efficiency improved by ESparseLDA is $(1-R)\times$ 100%.

However, in reality the amount of computation saved by ESparseLDA compared with SparseLDA is $(1-R)\textit{DN}(K_{w}-C)$ and therefore the percentage of the time-efficiency improved by ESparseLDA is $(1-R)(K_{w}-C)/K_{w}\times$ 100%. $C$ is a constant which represents the amount of computation needed by the calculation of Q item for one token under cache strategy in ESparseLDA. It should be noted that when the number of texts approaches infinity or is extremely large, $K_{w}$ will approach $K$ [20]. In this case the time-efficiency improved by ESparseLDA will approach $(1-R)\times$ 100%, since usually $K$ ( $K>$ 100) is much larger than $C$ ( $C<$ 10) which makes $(K-C)/K$ approach 1.0.
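The formula above is easy to evaluate for concrete values. The sketch below is only an arithmetic aid: the function name `q_improvement` is hypothetical, and the values of $K_{w}$ and $C$ in the usage lines are illustrative choices, not measurements from this paper. The result concerns the idealized amount of Q item computation and should not be read as a prediction of measured running time.

```python
def q_improvement(R, K_w, C):
    """Fraction of Q item computation saved: (1 - R) * (K_w - C) / K_w."""
    if not 0.0 < R < 1.0:
        raise ValueError("R must lie strictly between 0 and 1")
    if K_w <= 0 or C < 0:
        raise ValueError("K_w must be positive and C non-negative")
    return (1.0 - R) * (K_w - C) / K_w


# R values taken from Table 8; K_w = 100 and C = 10 are illustrative only.
illustrative = {name: q_improvement(R, K_w=100, C=10)
                for name, R in [("KOS", 0.755), ("NIPS", 0.386), ("Enron", 0.579)]}
```

As $K_{w}$ grows with $C$ fixed, the value returned by `q_improvement` approaches the limit $1-R$ from below.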

5. Experiments

In this section, we verify the correctness, exactness and time-efficiency of ESparseLDA from the experimental point of view. The experimental design, results and analysis are the core content of this section. In addition, this section also describes the datasets, evaluation criteria, parameter setting and implementation details of the experiments.

Table 8
Statistical information of the original datasets

Datasets	$D$	$V$	$W$	$R$	$W/D$	$W/V$
KOS	3,430	6,906	467,714	0.755	136	68
NIPS	1,500	12,375	1,932,365	0.386	1288	156
Enron	39,861	28,099	6,412,172	0.579	161	228

5.1 Datasets and evaluation criteria

5.1.1 Datasets

Original datasets In this paper, three text collections from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/Bag+of+Words), namely KOS blog entries, NIPS full papers and Enron Emails, are used as the experimental datasets. In the following, “Kos”, “Nips” and “Enron” are used as abbreviations for these three datasets respectively. The concrete information of these three datasets is shown in Table 8, where:

$D$ is the total number of texts in one dataset.

$V$ is the total number of word types in one dataset (the size of the dictionary or vocabulary). It should be noted that the three datasets used in this paper have already been preprocessed: after tokenization and removal of stopwords, only the word types that occur more than ten times are kept.

$W$ is the total number of tokens in one dataset.

$W/D$ is the average length of all texts in one dataset.

$W/V$ is the average number of occurrences of a word type in one dataset, which is related to $K_{w}$. Usually the value of $K_{w}$ increases as the value of $W/V$ increases. In the extreme case, when $W/V$ is very large or approaches infinity, the value of $K_{w}$ will approach $K$ [20].

$R$ is the macro-average type token ratio of all texts in one dataset, defined as below (see Eq. (4)).

$\displaystyle R=\textit{Macro-TTR}=\frac{\sum\limits_{d\in D}\textit{TTR}_{d}}{|D|}$ (4)

Here “TTR” is the abbreviation of “Type Token Ratio” (see http://www.sltinfo.com/type-token-ratio/ and https://en.wikipedia.org/wiki/Lexical_density), defined as the ratio of the number of word types to the number of tokens (see Eq. (5)).

$\displaystyle\textit{TTR}=\frac{\text{number of word types}}{\text{number of tokens}}$ (5)

The variety of the word types increases as the value of TTR increases. In addition, $\textit{TTR}_{d}$ represents the TTR of text $d$. Thus $R$ represents the macro average of the TTR of all texts in one dataset. It is worth noting that $R$ is a macro average that takes the text as the unit, so every text contributes equally regardless of its length. This “macro average” method is also adopted in Macro- $F_{1}$ (https://en.wikipedia.org/wiki/F1_score), which is commonly used in classifier evaluation.
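The definitions in Eqs. (4) and (5) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `type_token_ratio` and `macro_ttr` and the toy corpus are assumptions and are not taken from the implementation used in the experiments.

```python
def type_token_ratio(tokens):
    """TTR of one text: number of distinct word types / number of tokens (Eq. (5))."""
    if not tokens:
        raise ValueError("TTR is undefined for an empty text")
    return len(set(tokens)) / len(tokens)


def macro_ttr(documents):
    """R of a dataset: unweighted mean of the per-text TTR values (Eq. (4))."""
    if not documents:
        raise ValueError("macro-TTR is undefined for an empty dataset")
    return sum(type_token_ratio(doc) for doc in documents) / len(documents)


# Toy corpus (hypothetical, not taken from Kos, Nips or Enron).
corpus = [
    ["topic", "model", "topic", "model"],   # TTR = 2/4 = 0.5
    ["gibbs", "sampling", "for", "lda"],    # TTR = 4/4 = 1.0
]
print(macro_ttr(corpus))  # 0.75
```

Because the average is taken over texts rather than over tokens, a short text and a long text carry the same weight in $R$.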

Extended datasets In reality, the percentage of the time-efficiency improved by ESparseLDA is $(1-R)(K_{w}-C)/K_{w}\times$ 100%. Thus, in order to observe the effects of $R$ and $W/V$ on the time-efficiency improvement of ESparseLDA over SparseLDA, we extended the dataset Nips and obtained the extended datasets Nips-1 and Nips-2. Both Nips-1 and Nips-2 can be seen as built from two copies of the Nips dataset.

1) Outer-text extension. Dataset Nips-1 was obtained by directly putting two copies of Nips together. Suppose that dataset Nips contains only one text, “hello world”; then Nips-1 contains two texts, each of which is “hello world”. In this way, the content of each text is unchanged, but the number of texts is doubled. As Table 9 shows, the $R$ value of Nips-1 is the same as that of Nips and its $W/V$ value is twice that of Nips.

2) Inner-text extension. Dataset Nips-2 was obtained by putting two copies of Nips together and then merging each pair of identical texts into one text. Suppose again that dataset Nips contains only one text, “hello world”; then Nips-2 contains only one text, whose content is “hello world hello world”. In this extension, the content of each text is doubled, but the number of texts remains unchanged. As Table 9 shows, the $R$ value of Nips-2 is half that of Nips and its $W/V$ value is twice that of Nips.

Thus by comparing the experimental results on Nips and Nips-1 we can observe the effect of $W/V$ on the time-efficiency improvement of ESparseLDA, and by comparing the experimental results on Nips-1 and Nips-2 we can observe the effect of $R$.
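The effect of the two extensions on $R$ and $W/V$ can be checked on the “hello world” example from the text. The sketch below is illustrative only; the helper names (`outer_text_extension`, `inner_text_extension`, `macro_ttr`, `tokens_per_type`) are assumptions introduced here.

```python
def macro_ttr(documents):
    """R: mean over texts of (distinct word types / tokens)."""
    return sum(len(set(doc)) / len(doc) for doc in documents) / len(documents)


def tokens_per_type(documents):
    """W/V: total number of tokens divided by the vocabulary size."""
    total_tokens = sum(len(doc) for doc in documents)
    vocabulary = {token for doc in documents for token in doc}
    return total_tokens / len(vocabulary)


def outer_text_extension(documents):
    """Nips-1 style: every text appears twice, contents unchanged."""
    return [list(doc) for doc in documents] + [list(doc) for doc in documents]


def inner_text_extension(documents):
    """Nips-2 style: every text is concatenated with a copy of itself."""
    return [list(doc) + list(doc) for doc in documents]


base = [["hello", "world"]]
outer = outer_text_extension(base)
inner = inner_text_extension(base)

for name, data in (("base", base), ("outer", outer), ("inner", inner)):
    print(name, "D =", len(data), "R =", macro_ttr(data), "W/V =", tokens_per_type(data))
```

On this toy input the outer-text extension doubles $D$ and $W/V$ while leaving $R$ unchanged, and the inner-text extension keeps $D$, doubles $W/V$ and halves $R$, matching the pattern in Table 9.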

Table 9

Statistical information of the extended datasets

Dataset	$D$	$V$	$W$	$R$	$W/D$	$W/V$
NIPS	1,500	12,375	1,932,365	0.386	1288	156
NIPS-1	1,500*2	12,375	1,932,365*2	0.386	1288	156*2
NIPS-2	1,500	12,375	1,932,365*2	0.386/2	1288*2	156*2

5.1.2 Evaluation criteria

In this paper, we need to verify the correctness, exactness and time-efficiency of ESparseLDA. Validating the correctness and exactness requires evaluating the ability of the trained LDA model, and validating the time-efficiency requires evaluating the time used for training LDA. For this purpose, the typical evaluation criterion, perplexity [4, 20] (https://en.wikipedia.org/wiki/Perplexity), was chosen to evaluate the ability of LDA, and the running time (in seconds) was chosen to evaluate the time used for training LDA. The “running time” only includes the time of each iteration; it includes neither the random initialization at the beginning nor the saving of results during each iteration.
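The running-time measurement described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the timing code of the actual C++ implementation: `run_iteration` and `save_results` are hypothetical callables standing for one full sampling sweep and for the saving of intermediate results.

```python
import time


def time_iterations(run_iteration, num_iterations, save_results=None):
    """Return the wall-clock seconds spent inside each sampling iteration.

    run_iteration -- hypothetical callable that performs one full Gibbs sweep
    save_results  -- optional hypothetical callable run after each sweep;
                     its cost is deliberately excluded from the measurement
    """
    seconds = []
    for iteration in range(num_iterations):
        start = time.perf_counter()
        run_iteration()
        seconds.append(time.perf_counter() - start)
        if save_results is not None:
            save_results(iteration)  # outside the timed region
    return seconds


# Usage with a dummy sweep; random initialization would happen before this call.
per_iteration = time_iterations(lambda: sum(range(10000)), num_iterations=3)
print(len(per_iteration), "iterations timed")
```

Keeping initialization and result saving outside the timed region mirrors the definition of “running time” given in the text.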

Perplexity is a criterion commonly used in natural language processing to measure the generalization ability of language models [4]. Algebraically it is defined as the reciprocal of the geometric mean per-token likelihood of a held-out test dataset given the results of the model. The results of LDA are the count matrices, namely the document-topic count matrix $\textbf{N}_{D}^{T}$ and the topic-word count matrix $\textbf{N}_{W}^{T}$. Formally, for a held-out test dataset of $M$ documents it is defined as below

$\displaystyle\textit{perplexity}(\textbf{D}_{\text{held-out}})=\left[\prod\limits_{d=1}^{M}p(d|\textbf{N}_{D}^{T},\textbf{N}_{W}^{T})\right]^{-\frac{1}{\sum\limits_{d=1}^{M}N_{d}}}=\exp\left[-\frac{\sum\limits_{d=1}^{M}\log p(d|\textbf{N}_{D}^{T},\textbf{N}_{W}^{T})}{\sum\limits_{d=1}^{M}N_{d}}\right]$ (6)

where

$\displaystyle p(d|\textbf{N}_{D}^{T},\textbf{N}_{W}^{T})=\prod\limits_{i=1}^{N_{d}}p(w_{i}|\textbf{N}_{D}^{T},\textbf{N}_{W}^{T})=\prod\limits_{i=1}^{N_{d}}\sum\limits_{t=1}^{K}\left[\frac{n_{d}^{t}+\alpha}{\sum\limits_{t'=1}^{K}(n_{d}^{t'}+\alpha)}\cdot\frac{n_{w_{i}}^{t}+\beta}{\sum\limits_{w=1}^{V}(n_{w}^{t}+\beta)}\right].$ (7)

Perplexity decreases monotonically as the likelihood of the held-out data increases, so a lower perplexity value indicates better generalization ability. In our experiments, the test set ratio is set to 10%, which means that 90% of the dataset is used for training LDA and 10% is used to compute the perplexity. It should be noted that SparseLDA and ESparseLDA are compared on the same dataset partition in order to eliminate the effect of the partition on perplexity values and running times, and that the experimental results in this paper are averaged over five runs for each algorithm on each dataset. The error bars, which would show the standard deviation of the results over the five runs, are not plotted because they are too small relative to the values of the perplexities and running times.
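The perplexity defined in Eqs. (6) and (7) can be computed directly from the two count matrices. The sketch below is a plain-Python illustration under stated assumptions: documents are given as lists of integer word ids, and the document-topic counts of the held-out documents are assumed to be available already; how they are obtained is not covered here. The function and argument names are hypothetical.

```python
import math


def perplexity(docs, doc_topic, topic_word, alpha, beta):
    """Perplexity of held-out documents following Eqs. (6) and (7).

    docs        -- list of documents, each a list of word ids in [0, V)
    doc_topic   -- doc_topic[d][t]: tokens of document d assigned to topic t
    topic_word  -- topic_word[t][w]: occurrences of word w assigned to topic t
    """
    num_topics = len(topic_word)
    vocab_size = len(topic_word[0])
    # Denominator of the topic-word factor, one value per topic.
    topic_totals = [sum(row) + vocab_size * beta for row in topic_word]

    log_likelihood = 0.0
    num_tokens = 0
    for d, doc in enumerate(docs):
        # Denominator of the document-topic factor for this document.
        doc_total = sum(doc_topic[d]) + num_topics * alpha
        for w in doc:
            p_w = sum(
                (doc_topic[d][t] + alpha) / doc_total
                * (topic_word[t][w] + beta) / topic_totals[t]
                for t in range(num_topics)
            )
            log_likelihood += math.log(p_w)
        num_tokens += len(doc)
    return math.exp(-log_likelihood / num_tokens)


# Sanity check with all counts zero: every word then has probability 1/V,
# so the perplexity should be close to the vocabulary size V = 4.
empty_doc_topic = [[0, 0]]
empty_topic_word = [[0, 0, 0, 0], [0, 0, 0, 0]]
print(perplexity([[0, 1, 2]], empty_doc_topic, empty_topic_word, alpha=0.1, beta=0.01))
```

The all-zero case is a convenient check because the smoothed model is then uniform over the vocabulary; any model that concentrates probability on the observed words gives a lower value.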

5.2 Parameter setting and implementation details

In the experiments the number of topics $K$ is set to 100. We chose a relatively small value for $K$ because the three datasets used in this paper are not large. $\alpha$ is set to 0.1 and $\beta$ is set to 0.01, which is the usual setting in LDA. According to the experimental results, the numbers of iterations of the sampling algorithms are set to 1000, 1800 and 1600 for Kos, Nips and Enron respectively. The numbers differ because the sampling algorithms need different numbers of iterations to converge on these three datasets.

In this paper, SparseLDA is used as the baseline algorithm for analyzing the performance of ESparseLDA. The implementation of SparseLDA is taken from open source code on GitHub (https://github.com/xunzheng/topic) written in C++11. This version is selected based on two considerations. 1) In this version the details of the algorithm and data structures of SparseLDA are implemented completely. 2) This version is a minimal implementation of SparseLDA: it contains only the implementation of SparseLDA, implements no other sampling algorithm, and contains no complex data preprocessing, result display or other functions beyond SparseLDA. Therefore, this paper chooses this version as the implementation of SparseLDA in order to facilitate the changes and optimization needed to implement ESparseLDA.

All the experiments in this paper were conducted on a laptop with 4 GB of memory and an Intel i5-4210M processor with a 2.6 GHz clock rate. The program was built and run with the Microsoft Visual Studio 2013 development platform (VS2013) under the 64-bit Windows 7 operating system. Since VS2013 does not fully support C++11, we made the necessary modifications to the open source code; these modifications do not affect the core algorithms and data structures of SparseLDA and ESparseLDA.

5.3 Experimental design, results and analysis

According to the contents to be verified, the whole experiment is divided into three parts. 1) “Correctness verification” verifies the correctness of the token rearrangement operation. 2) “Exactness verification” verifies the consistency between the training results of ESparseLDA and SparseLDA. 3) “Time-efficiency verification” verifies the time-efficiency of ESparseLDA compared with SparseLDA. Since the token rearrangement operation is the prerequisite of ESparseLDA, its correctness is verified first, followed by the exactness and time-efficiency of ESparseLDA.

5.3.1 Correctness verification

The correctness of the token rearrangement operation has been explained from the perspective of theoretical analysis in the previous section (see Section 4.5 Verification of ESparseLDA). From the experimental point of view, verifying the correctness of the token rearrangement operation means verifying that changing the sampling order of the token sequence within texts before sampling does not affect the quality of the LDA model trained by the same sampling algorithm.

Experimental design of correctness verification On each dataset the experimental procedure of correctness verification was designed as follows: 1) Firstly, SparseLDA was used to train an LDA model on the natural token sequence of the dataset. 2) Then the token rearrangement operation was applied to the natural token sequence and SparseLDA was used again to train another LDA model on the rearranged token sequence. 3) Finally the abilities of these two LDA models were compared. If the abilities of the two LDA models are consistent, the correctness of the token rearrangement operation is verified; otherwise, it is not verified. In this experiment, SparseLDA was selected as the inference algorithm and perplexity was chosen as the evaluation criterion for the ability of the LDA model.
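The token rearrangement operation used in step 2) can be sketched as follows. Sorting by word id is an assumption of this sketch; any ordering that places tokens of the same word type next to each other would serve the same purpose. The function names and the toy text are hypothetical.

```python
def rearrange_tokens(doc):
    """Group tokens of the same word type together within one text.

    Sorting by word id is one simple way to achieve this (an assumption
    of this sketch, not necessarily the ordering used in the experiments).
    """
    return sorted(doc)


def reusable_tokens(doc):
    """Count tokens whose predecessor in the sequence has the same word type."""
    return sum(1 for prev, cur in zip(doc, doc[1:]) if prev == cur)


natural = [3, 7, 3, 1, 7, 3]         # hypothetical word ids of one text
ordered = rearrange_tokens(natural)  # [1, 3, 3, 3, 7, 7]

print(reusable_tokens(natural))  # 0
print(reusable_tokens(ordered))  # 3
```

After rearrangement, the number of tokens that follow a token of the same word type equals the number of tokens minus the number of word types, that is, $(1-\textit{TTR}_{d})N_{d}$ for the text. This is the quantity behind the $(1-R)\textit{DN}$ cached tokens in the complexity analysis.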

Experimental results of correctness verification According to the experimental design, the experimental results of correctness verification obtained on dataset Kos, Nips and Enron are shown in Fig. 1. In each subfigure of Fig. 1, the horizontal axis represents the number of iterations, the vertical axis represents the perplexity of LDA model, and the curves show the variation of the perplexity of LDA model along with the change of the number of iterations. “Unordered” in the legend corresponds to the LDA model trained by SparseLDA on the natural token sequence, while “Ordered” in the legend corresponds to the LDA model trained also by SparseLDA but on the rearranged token sequence.

From Fig. 1 it can be seen that on each dataset the perplexities of these two LDA models (“Unordered”, “Ordered”) converge to roughly the same value. Moreover, the convergence processes of the two LDA models are also generally consistent. In addition to the visual inspection, we perform a statistical significance test on each pair of perplexities of the two LDA models. Table 10 shows the means and standard deviations of the perplexities of the two LDA models after convergence over five runs for the three datasets. Intuitively, the means of the perplexities of the two models are very close: they differ by less than 0.4 on every dataset. More rigorously, the significance test on each pair of perplexities checks whether the difference between them is significant. All three p-values (0.3026, 0.8844 and 0.9804) are well above the conventional 0.05 level, so the differences between the two LDA models are not statistically significant, and merely changing the sampling order of the token sequence within texts does not affect the ability of the LDA model. Thus the correctness of the token rearrangement operation is verified.

Table 10
Means and standard deviations of the perplexities of the two LDA models (“Unordered”, “Ordered”) after convergence over five runs for the three datasets

Dataset	Kos	Nips	Enron
Unordered	1269.996 $\pm$ 1.010744	1414.612 $\pm$ 0.454236	1730.619 $\pm$ 0.607049
Ordered	1269.605 $\pm$ 0.211064	1414.571 $\pm$ 0.242529	1730.627 $\pm$ 0.329561
P-value	0.3026	0.8844	0.9804

Figure 1.

Perplexity as a function of the number of iterations. Unordered vs Ordered: (a) dataset Kos, (b) dataset Nips, (c) dataset Enron.

5.3.2 Exactness verification

Theoretically the exactness of ESparseLDA refers to the consistency between the training results of ESparseLDA and SparseLDA. Experimentally, verifying the exactness of ESparseLDA means verifying that the abilities of two LDA models trained respectively by ESparseLDA and SparseLDA on the same token sequence are consistent.

Experimental design of exactness verification On each dataset the experimental procedure of exactness verification was designed as follows: 1) Firstly, SparseLDA was used to train an LDA model on a token sequence of the dataset. 2) Then ESparseLDA was used to train another LDA model on the same token sequence used in step 1). 3) Finally the abilities of these two LDA models were compared. If the abilities of the two LDA models are consistent, the exactness of ESparseLDA is verified; otherwise, it is not verified. In this experiment, SparseLDA and ESparseLDA were used respectively as the sampling algorithm and perplexity was again chosen as the evaluation criterion for the ability of the LDA model.

Experimental results of exactness verification According to the experimental design, the experimental results of exactness verification obtained on the three datasets are shown in Fig. 2. In each subfigure of Fig. 2, the horizontal axis, the vertical axis and the curves have the same meaning as those in Fig. 1. “SparseLDA” in the legend corresponds to the LDA model trained by SparseLDA, while “ESparseLDA” in the legend corresponds to the LDA model trained by ESparseLDA on the same token sequence.

From Fig. 2 it can be seen that the perplexities of these two LDA models (“SparseLDA”, “ESparseLDA”) on each dataset converge to roughly the same value. Moreover, the convergence processes of the two LDA models are also generally consistent. In addition to the visual inspection, we again perform a statistical significance test on each pair of perplexities of the two LDA models. Table 11 shows the means and standard deviations of the perplexities of the two LDA models after convergence over five runs for the three datasets. All three p-values (0.8326, 0.1346 and 0.3175) are above the conventional 0.05 level, so the differences between the two LDA models are not statistically significant, and using ESparseLDA instead of SparseLDA for training does not change the ability of the LDA model. Thus the exactness of ESparseLDA is verified.

Table 11
Means and standard deviations of the perplexities of the two LDA models (“SparseLDA”, “ESparseLDA”) after convergence over five runs for the three datasets

Dataset	Kos	Nips	Enron
SparseLDA	1269.605 $\pm$ 0.211064	1414.571 $\pm$ 0.242529	1730.627 $\pm$ 0.329561
ESparseLDA	1269.681 $\pm$ 0.919829	1414.982 $\pm$ 0.376996	1730.943 $\pm$ 0.520961
P-value	0.8326	0.1346	0.3175

Figure 2.

Perplexity as a function of the number of iterations. SparseLDA vs ESparseLDA: (a) dataset Kos, (b) dataset Nips, (c) dataset Enron.

Meanwhile, combining the correctness verification and exactness verification it can be seen that the method proposed in this paper which first rearranges the token sequence and then uses ESparseLDA for training will not change the ability of LDA.

5.3.3 Time-efficiency verification

Verifying the time-efficiency of ESparseLDA means verifying that ESparseLDA needs less running time than SparseLDA to train an LDA model on the rearranged token sequence. It is important to note that this first requires verifying that the token rearrangement operation does not change the running time of SparseLDA, since ESparseLDA requires this operation in its preprocessing.

Experimental design of time-efficiency verification On each dataset the experimental procedure of time-efficiency verification was designed as follows: 1) SparseLDA was used to train an LDA model (“LDA-1”) on the natural token sequence of the dataset. 2) The token rearrangement operation was applied to the natural token sequence and SparseLDA was used to train another LDA model (“LDA-2”) on the rearranged token sequence. 3) ESparseLDA was used to train another LDA model (“LDA-3”) on the same rearranged token sequence used in step 2). 4) LDA-1 and LDA-2 were compared. If the running times of these two LDA models are the same, the token rearrangement operation does not affect the time-efficiency of SparseLDA; otherwise, it does. 5) LDA-1 and LDA-3 were compared. If the running time of LDA-3 is less than that of LDA-1, the time-efficiency of ESparseLDA is verified; otherwise, it is not verified.

Experimental results and analysis In order to analyze the time-efficiency of ESparseLDA in detail, the experimental results contain three parts: 1) Seconds of one iteration as a function of the number of iterations, 2) Seconds of one iteration as a function of $K_{w}$, 3) Time-efficiency improvement as a function of $R$ and $W/V$. According to the experimental design, the results of these three parts are analyzed one by one in the following.

1) Seconds of one iteration as a function of the number of iterations Figure 3 shows the experimental results of the first part of the time-efficiency verification. In each subfigure of Fig. 3, the horizontal axis represents the number of iterations, the vertical axis represents the running time of one iteration, and the curves show how the running time of one iteration varies with the number of iterations.

a) Comparison between LDA-1 and LDA-2 From the first three subfigures ((a)–(c)) in Fig. 3 it can be seen that the curves of LDA-1 and LDA-2 coincide. On each of the three datasets the running time of one iteration in LDA-1 varies in the same way as in LDA-2. This verifies that the token rearrangement operation does not change the running time of SparseLDA.

b) Comparison between LDA-1 and LDA-3 From the last three subfigures ((d)–(f)) in Fig. 3 it can be seen that, unlike in the first three subfigures, the curves of LDA-1 and LDA-3 do not coincide. On each of the three datasets, the running time of one iteration of ESparseLDA is less than that of SparseLDA at the same iteration number. This shows that ESparseLDA is more time-efficient than SparseLDA, which verifies the time-efficiency of ESparseLDA. Moreover, the size of the reduction differs across the three datasets. Concretely, after convergence ESparseLDA reduces the running time of one iteration from 1.083 s, 3.645 s and 16.372 s to 1.033 s, 2.961 s and 13.462 s on Kos, Nips and Enron respectively, as shown in Table 12. The reduction is largest on Nips and smallest on Kos.

Table 12
Comparison of running time for one iteration after convergence between SparseLDA (LDA-1) and ESparseLDA (LDA-3) on dataset Kos, Nips and Enron

Dataset	Kos	Nips	Enron
SparseLDA	1.083 s	3.645 s	16.372 s
ESparseLDA	1.033 s	2.961 s	13.462 s

Figure 3.

Seconds of one iteration as a function of the number of iterations. LDA-1 vs LDA-2: (a) dataset Kos, (b) dataset Nips, (c) dataset Enron. LDA-1 vs LDA-3: (d) dataset Kos, (e) dataset Nips, (f) dataset Enron.

2) Seconds of one iteration as a function of $K_{w}$ Figure 4 shows the experimental results of the second part of the time-efficiency verification. Since the $K_{w}$ of ESparseLDA may not be the same as that of SparseLDA at the same iteration number, comparing the running times of ESparseLDA and SparseLDA only at the same iteration number cannot fully verify the time-efficiency of ESparseLDA. Therefore, this part compares the running time of ESparseLDA with that of SparseLDA at the same $K_{w}$. Unlike in Fig. 3, in each subfigure of Fig. 4 the horizontal axis represents the value of $K_{w}$, the vertical axis represents the running time of one iteration, and the curves show how the running time of one iteration varies with $K_{w}$. Each point on a curve corresponds to one iteration of the sampling algorithm. It should be noted that $K_{w}$ gradually decreases as the sampling algorithm iterates, so the leftmost end of the horizontal axis corresponds to the start of the sampling algorithm, which has the largest value of $K_{w}$, while the rightmost end corresponds to the end of the sampling algorithm, which has the smallest value of $K_{w}$.

Figure 4.

Seconds of one iteration as a function of $K_{w}$ . LDA-1 vs LDA-2: (a) dataset Kos, (b) dataset Nips, (c) dataset Enron. LDA-1 vs LDA-3: (d) dataset Kos, (e) dataset Nips, (f) dataset Enron.

a) Comparison between LDA-1 and LDA-2 From the first three subfigures ((a)–(c)) in Fig. 4 it can be seen that the curves of LDA-1 and LDA-2 coincide. This is consistent with the first three subfigures of Fig. 3. On each of the three datasets the running time of one iteration in LDA-1 varies with $K_{w}$ in the same way as in LDA-2. This further verifies that the token rearrangement operation does not change the running time of SparseLDA.

b) Comparison between LDA-1 and LDA-3 From the last three subfigures ((d)–(f)) in Fig. 4 it can be seen that the curves of LDA-1 and LDA-3 do not coincide. On each of the three datasets, the running time of one iteration of ESparseLDA is less than that of SparseLDA at the same $K_{w}$. This is consistent with the last three subfigures of Fig. 3 and further shows that ESparseLDA is more time-efficient than SparseLDA. Thus the time-efficiency of ESparseLDA is verified at equal $K_{w}$ as well.

Furthermore, at the same $K_{w}$ the size of the reduction in the running time of one iteration differs across the three datasets. Concretely, at the initial stage ESparseLDA reduces the running time of one iteration from 4.547 s, 17.889 s and 71.108 s to 4.294 s, 14.963 s and 61.602 s on Kos, Nips and Enron respectively, as shown in Table 13. The reduction is largest on Nips and smallest on Kos, which is consistent with the situation after convergence shown in the last three subfigures of Fig. 3. In addition, the initial values of $K_{w}$ for Kos, Nips and Enron are 44.53, 68.34 and 74.79 respectively, and the final values of $K_{w}$ are 7.63, 12.19 and 17.83 respectively. Since the values of $W/V$ for these three datasets are 68, 156 and 228, these experimental results are consistent with the previous analysis that in general the value of $K_{w}$ increases as the value of $W/V$ increases.

Table 13

Comparison of running time for one iteration at initial stage between SparseLDA (LDA-1) and ESparseLDA (LDA-3) on dataset Kos, Nips and Enron

Dataset	Kos	Nips	Enron
SparseLDA	4.547 s	17.889 s	71.108 s
ESparseLDA	4.294 s	14.963 s	61.602 s
$K_{w}$ (initial value)	44.53	68.34	74.79
$K_{w}$ (final value)	7.63	12.19	17.83
$W/V$	68	156	228
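The relative per-iteration reductions implied by Tables 12 and 13 can be recomputed with the short Python sketch below. The times are copied from the two tables; the function name `relative_reduction` is an assumption of this sketch. Note that these are reductions for a single iteration at one stage of sampling, not reductions of the total running time over all iterations.

```python
def relative_reduction(t_sparse, t_esparse):
    """Fraction of the SparseLDA per-iteration time saved by ESparseLDA."""
    return (t_sparse - t_esparse) / t_sparse


datasets = ("Kos", "Nips", "Enron")
# Seconds per iteration at the initial stage (Table 13).
initial = {"SparseLDA": (4.547, 17.889, 71.108), "ESparseLDA": (4.294, 14.963, 61.602)}
# Seconds per iteration after convergence (Table 12).
converged = {"SparseLDA": (1.083, 3.645, 16.372), "ESparseLDA": (1.033, 2.961, 13.462)}

for stage, times in (("initial", initial), ("converged", converged)):
    for name, t_s, t_e in zip(datasets, times["SparseLDA"], times["ESparseLDA"]):
        print(f"{stage:>9}  {name:<5}  {relative_reduction(t_s, t_e):.1%}")
```

With the tabulated times, the reductions are about 5.6%, 16.4% and 13.4% at the initial stage and about 4.6%, 18.8% and 17.8% after convergence for Kos, Nips and Enron respectively, so Nips shows the largest and Kos the smallest reduction at both stages.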

3) Time-efficiency improvement as a function of $R$ and $W/V$ In reality the time-efficiency improved by ESparseLDA is $(1-R)(K_{w}-C)/K_{w}\times$ 100%. It is related not only to $R$ but also to $W/V$. With respect to $R$, the time-efficiency improvement increases as the value of $R$ decreases. With respect to $K_{w}$, the improvement increases as $K_{w}$ increases, and $K_{w}$ in turn tends to increase with $W/V$. Thus, this part further analyzes the impact of $R$ and $W/V$ on the time-efficiency improvement of ESparseLDA over SparseLDA.

Figure 5.

Time-efficiency improvement as a function of $R$ and $W/V$ . (a) dataset Kos, Nips and Enron, (b) dataset Nips, Nips-1 and Nips-2.

Figure 5 shows the experimental results of the third part of the time-efficiency verification. In each subfigure of Fig. 5, the horizontal axis represents the value of $R$ of the datasets, the vertical axis represents the value of $W/V$ of the datasets, each circle corresponds to one dataset, and the size of a circle represents the time-efficiency improvement of ESparseLDA over SparseLDA. In detail, the time-efficiency improvement is defined as below.

$\displaystyle E=\frac{T_{\textit{SparseLDA}}-T_{\textit{ESparseLDA}}}{T_{\textit{SparseLDA}}}$ (8)

$E$ represents the time-efficiency improvement of ESparseLDA over SparseLDA. $T_{\textit{SparseLDA}}$ represents the total running time of all iterations of SparseLDA. $T_{\textit{ESparseLDA}}$ represents the total running time of all iterations of ESparseLDA. Table 14 shows the time-efficiency improvement of ESparseLDA over SparseLDA on datasets Kos, Nips, Enron, Nips-1 and Nips-2.

Table 14

Time-efficiency improved by ESparseLDA compared with SparseLDA on datasets Kos, Nips, Enron, Nips-1 and Nips-2

	Kos	Nips	Enron	Nips-1	Nips-2
Speed up ratio ( $E$ )	5.06%	19.16%	17.21%	26.70%	31.85%
$W/V$	68	156	228	156*2	156*2
$R$	0.755	0.386	0.579	0.386	0.386/2

a) Speed up ratio on Kos, Nips and Enron From Fig. 5a and Table 14 it can be seen that the time-efficiency improvements on datasets Kos, Nips and Enron are 5.06%, 19.16% and 17.21% respectively. This result is consistent with our analysis. The improvement on Kos is the smallest, since Kos has the largest $R$ value and the smallest $W/V$ value among these three datasets. Nips improves slightly more than Enron even though its $W/V$ value is smaller, which is consistent with its much smaller $R$ value (0.386 versus 0.579).

b) Speed up ratio on Nips, Nips-1 and Nips-2 From Fig. 5b and Table 14 it can be seen that the time-efficiency improvements on datasets Nips, Nips-1 and Nips-2 are 19.16%, 26.70% and 31.85% respectively. This result supports our previous analysis. Firstly, comparing the improvements on Nips and Nips-1 (the same $R$ value, different $W/V$ values) shows that the time-efficiency improvement increases from 19.16% to 26.70% as the value of $W/V$ increases from 156 to 312. Secondly, comparing the improvements on Nips-1 and Nips-2 (the same $W/V$ value, different $R$ values) shows that the time-efficiency improvement increases from 26.70% to 31.85% as the value of $R$ decreases from 0.386 to 0.193.

Theoretically it can be inferred that when $W/V$ increases to an extremely large value, the time-efficiency improvement $(1-R)(K_{w}-C)/K_{w}\times$ 100% will approach $(1-R)\times$ 100%. That means that for dataset Nips, when $W/V$ increases to an extremely large value, the time-efficiency improvement will approach about 60% (since $1-R=1-0.386=0.614$). In reality this situation does occur, since the vocabulary is always limited to a certain size while the total number of tokens can grow extremely large.
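To put the limit $(1-R)\times$ 100% next to the measured values, the sketch below lists, for each dataset in Table 14, the observed improvement $E$ and the corresponding limit $1-R$. The numbers are copied from Table 14; the function name `asymptotic_improvement` is an assumption of this sketch.

```python
# (R, observed E) per dataset, copied from Table 14; R of Nips-2 is 0.386 / 2.
table_14 = {
    "Kos":    (0.755,     0.0506),
    "Nips":   (0.386,     0.1916),
    "Enron":  (0.579,     0.1721),
    "Nips-1": (0.386,     0.2670),
    "Nips-2": (0.386 / 2, 0.3185),
}


def asymptotic_improvement(r):
    """Limit of (1 - R)(K_w - C) / K_w as K_w grows large: simply 1 - R."""
    return 1.0 - r


for name, (r, observed) in table_14.items():
    limit = asymptotic_improvement(r)
    print(f"{name:<7} observed E = {observed:.2%}   limit 1 - R = {limit:.1%}")
```

On every dataset the observed improvement stays below the limit $1-R$, as expected when $K_{w}$ is finite.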

In addition, when $W/V$ is very large, the time-efficiency improvement of ESparseLDA is comparable to that of AliasLDA. Theoretically ESparseLDA performs better on datasets with a small value of $R$ and a large value of $W/V$. In practice, whereas AliasLDA is more suitable for datasets with a large number of short documents, such as messages, tweets and online comments, ESparseLDA is more suitable for datasets with long documents that not only have a large number of words but also contain many repeated words, such as fiction, patents and dissertations. Moreover, if short documents (the Facebook messages of one person, the tweets of one user, the questions for one answer or the online reviews of one book) can be aggregated into a long document [13], a dataset with a large number of short documents can also be suitable for ESparseLDA.

In summary, in this subsection we verify the time-efficiency of ESparseLDA by conducting comparative experiments, and the experimental results show that ESparseLDA is more time-efficient than SparseLDA. Concretely, the time-efficiency improvement of ESparseLDA over SparseLDA varies from 5.06% to 31.85% on the different datasets used in the experiments. Moreover, the experimental results also support our previous theoretical analysis of the impact of $R$ and $W/V$ on the time-efficiency improvement of ESparseLDA over SparseLDA.

6. Conclusion

In this paper, we proposed a more efficient Gibbs sampling algorithm, ESparseLDA, based on SparseLDA. ESparseLDA is also used in LDA for inferring the topics and accomplishes the same task as SparseLDA. It improves the time-efficiency of SparseLDA by recycling more computation while preserving the exactness of SparseLDA. In other words, ESparseLDA makes no approximation in the optimization of SparseLDA. The core idea of ESparseLDA is to combine token rearrangement with the cache strategy so that more computation can be recycled. In this paper we verified the correctness, exactness and time-efficiency of ESparseLDA through both theoretical analysis and experimental validation. Theoretically the percentage of the time-efficiency improved by ESparseLDA is $(1-R)(K_{w}-C)/K_{w}\times$ 100%, and it will approach $(1-R)\times$ 100% as $W/V$ increases to an extremely large value. This means that ESparseLDA performs much better than SparseLDA on datasets with either a relatively small value of $R$ or a relatively large value of $W/V$. Experimental results have shown that ESparseLDA is more efficient than SparseLDA by amounts varying from 5.06% to 31.85%. In practice, ESparseLDA is more suitable for datasets with long documents that not only have a large number of words but also contain many repeated words.

Future work will try to optimize the sampling inference algorithms of other LDA-based topic models using the core idea of ESparseLDA, since the core idea is general and is not limited to LDA. Theoretically, this idea can be applied to AliasLDA, where it could reduce the time complexity from O( $K_{d}$ ) to O( $RK_{d}$ ).

Footnotes

Acknowledgments

This work was supported by the National Natural Science Foundation of China (NSFC) under Grant Nos. 61602204, 61472157, 61170092, 61133011 and 61103091.

References

1. Blei D.M., Probabilistic topic models, Communications of the ACM 55(4) (2012), 77–84.

2. Blei D.M., Carin and Dunson, Probabilistic topic models, IEEE Signal Processing Magazine 27(6) (2010), 55–65.

3. Blei D.M., Ng A.Y. and Jordan M.I., Latent dirichlet allocation, in: Advances in Neural Information Processing Systems, MIT Press, 2001, pp. 601–608.

4. Blei D.M., Ng A.Y. and Jordan M.I., Latent dirichlet allocation, Journal of Machine Learning Research 3 (2003), 993–1022.

5. Cha, Hsieh C.-C. and Cho, Incorporating popularity in topic models for social network analysis, in: Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2013, pp. 223–232.

6. Cha and Cho, Social-network analysis using topic models, in: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2012, pp. 565–574.

7. Crain S.P., Zhou, Yang S.-H. and Zha, Dimensionality reduction and topic modeling: From latent semantic indexing to latent dirichlet allocation and beyond, in: Mining Text Data, Springer, Boston, MA, 2012, pp. 129–161.

8. Das, Zaheer and Dyer, Gaussian LDA for topic models with word embeddings, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, ACL, 2015, pp. 795–804.

9. Deepak, Hariharan and Sinha, Cluster based human action recognition using latent dirichlet allocation, in: 2013 International Conference on Circuits, Controls and Communications (Ccube), IEEE, 2013, pp. 1–4.

10. Gilks W.R., Richardson and Spiegelhalter, Markov Chain Monte Carlo in Practice, CRC Press, 1995.

11. Godin, Slavkovikj, De Neve, Schrauwen and Van de Walle, Using topic models for Twitter hashtag recommendation, in: Proceedings of the 22nd International Conference on World Wide Web, ACM, 2013, pp. 593–596.

12. Griffiths T.L. and Steyvers, Finding scientific topics, in: Proceedings of the National Academy of Sciences, NAS, 2004, pp. 5228–5235.

13. Hong and Davison B.D., Empirical study of topic modeling in Twitter, in: Proceedings of the First Workshop on Social Media Analytics, ACM, 2010, pp. 80–88.

14. Jo and Oh A.H., Aspect and sentiment unification model for online review analysis, in: Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, ACM, 2011, pp. 815–824.

15. Kim, Georgiou and Narayanan, Supervised acoustic topic model with a consequent classifier for unstructured audio classification, in: Proceedings of the 10th International Workshop on Content-Based Multimedia Indexing (CBMI), IEEE, 2012, pp. 1–6.

16. Kim, Georgiou P.G., Narayanan and Sundaram, Supervised acoustic topic model for unstructured audio information retrieval, in: Proceedings of Asia Pacific Signal and Information Processing Association (APSIPA) Annual Summit and Conference, IEEE, 2010, p. 3.

17. Kim, Narayanan and Sundaram, Acoustic topic model for audio information retrieval, in: 2009 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, IEEE, 2009, pp. 37–40.

18. Lacoste-Julien, Sha and Jordan M.I., DiscLDA: Discriminative learning for dimensionality reduction and classification, in: Advances in Neural Information Processing Systems, MIT Press, 2009, pp. 897–904.

19. Lau J.H., Cook, McCarthy, Gella and Baldwin, Learning word sense distributions, detecting unattested senses and identifying novel senses using topic models, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL, 2014, pp. 259–270.

20. Li A.Q., Ahmed, Ravi and Smola A.J., Reducing the sampling complexity of topic models, in: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2014, pp. 891–900.

21. Lin and He, Joint sentiment/topic model for sentiment analysis, in: Proceedings of the 18th ACM Conference on Information and Knowledge Management, ACM, 2009, pp. 375–384.

22. Lin, Everson and Ruger, Weakly supervised joint sentiment-topic detection from text, IEEE Transactions on Knowledge and Data Engineering 24(6) (2012), 1134–1145.

23. Mehrotra, Sanner, Buntine and Xie, Improving LDA topic models for microblogs via tweet pooling and automatic labeling, in: Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2013, pp. 889–892.

24. Mimno, Hoffman and Blei D.M., Sparse stochastic inference for latent dirichlet allocation, in: Proceedings of the 29th International Conference on Machine Learning, ACM, 2012, pp. 1515–1522.

25. Moens M.-F. and Vulić, Multilingual probabilistic topic modeling and its applications in web mining and search, in: Proceedings of the 7th ACM International Conference on Web Search and Data Mining, ACM, 2014, pp. 681–682.

26. Nguyen V.-A., Boyd-Graber, Resnik and Miler, Tea party in the house: A hierarchical ideal point topic model and its application to republican legislators in the 112th congress, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, ACL, 2015, pp. 1438–1448.

27. Niu, Hua, Gao and Tian, Context aware topic model for scene recognition, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2012, pp. 2743–2750.

28. Niu, Hua, Gao and Tian, Semi-supervised relational topic model for weakly annotated image recognition in social media, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2014, pp. 4233–4240.

29. Porteous, Newman, Ihler, Asuncion, Smyth and Welling, Fast collapsed gibbs sampling for latent dirichlet allocation, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2008, pp. 569–577.

30. Rabinovich and Blei, The inverse regression topic model, in: Proceedings of the 31st International Conference on Machine Learning, ACM, 2014, pp. 199–207.

31. Ranganath, Wang, David and Xing, An adaptive learning rate for stochastic variational inference, in: Proceedings of the 30th International Conference on Machine Learning, ACM, 2013, pp. 298–306.

32. Steyvers and Griffiths, Probabilistic topic models, in: Handbook of Latent Semantic Analysis, 2007, pp. 424–440.

33. Titov and McDonald, Modeling online reviews with multi-grain topic models, in: Proceedings of the 17th International Conference on World Wide Web, ACM, 2008, pp. 111–120.

34. Titov and McDonald R.T., A joint model of text and aspect ratings for sentiment summarization, in: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, ACL, 2008, pp. 308–316.

35. Xiao and Stibor, Efficient collapsed gibbs sampling for latent dirichlet allocation, in: Proceedings of 2nd Asian Conference on Machine Learning, JMLR, 2010, pp. 63–78.

36. Yang, Zhang, Wang and Li, Scene and place recognition using a hierarchical latent topic model, Neurocomputing 148 (2015), 578–586.

37. Yang, Boyd-Graber and Resnik, A discriminative topic model using document network structure, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL, 2016, pp. 686–696.

38. Yao, Mimno and McCallum, Efficient methods for topic model inference on streaming document collections, in: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2009, pp. 937–946.

39. Yuan, Gao, Dai, Wei, Zheng et al., LightLDA: Big topic models on modest computer clusters, in: Proceedings of the 24th International Conference on World Wide Web, ACM, 2015, pp. 1351–1361.

¹ http://archive.ics.uci.edu/ml/datasets/Bag+of+Words.

⁵ https://en.wikipedia.org/wiki/Perplexity.

⁶ https://github.com/xunzheng/topic.
