iLDA: An interactive latent Dirichlet allocation model to improve topic quality

Abstract

User-generated content has become an increasingly important data source for analysing user interests in both industry and academic research. Since the proposal of the basic latent Dirichlet allocation (LDA) model, many LDA variants have been developed to learn knowledge from unstructured user-generated content. A persistent limitation of LDA and its variants is that they may generate low-quality topics whose meanings are confusing. To handle this problem, this article proposes an interactive strategy to generate high-quality topics with clear meanings by integrating subjective knowledge derived from human experts and objective knowledge learned by LDA. The proposed interactive latent Dirichlet allocation (iLDA) model develops deterministic and stochastic approaches to obtain a subjective topic-word distribution from human experts, combines the subjective and objective topic-word distributions by a linear weighted-sum method, and provides the inference process to draw topics and words from a comprehensive topic-word distribution. The proposed model is a significant effort to integrate human knowledge with LDA-based models through an interactive strategy. Experiments on two real-world corpora show that the proposed iLDA model can draw high-quality topics with the assistance of subjective knowledge from human experts. It is robust under various conditions and offers fundamental support for applications of LDA-based topic modelling.

Keywords

Expert knowledge; interactive strategy; latent Dirichlet allocation; subjective topic-word distribution

1. Introduction

User-generated content is reshaping online user behaviour. By posting content on social media platforms, users communicate with others and state their views about political, social and economic events. In the fourth quarter of 2016, 319 million and 1.8 billion monthly active users generated unstructured content on Twitter and Facebook, respectively. Since abundant content implies rich information about user interests, user-generated content offers a great opportunity for sociologists, managers and marketing people to understand user behaviour deeply. Enterprises such as Dell.com and Amazon.com have introduced various social media services to help organisations across a variety of industries develop user insights, engage with customers and understand the markets. Analysing user-generated content has become a must-have ability for organisations in different fields.

Extracting insights from user-generated content is a challenging job because user-generated content is usually presented in the style of unstructured text. To extract hidden semantic information from the unstructured text, topic modelling technique has been developed to summarise user views into topics. Topic modelling is a kind of method that discovers hidden semantic structure from text corpus. It assumes that a document is a mixture distribution of topics, where the topic is a multinomial distribution about vocabulary. Latent Dirichlet allocation (LDA) model [1] is one of the most popular strategies for its explicit representation of a document and flexible exchangeability assumption. LDA is a complete generative probabilistic topic model for collections of discrete data. It is a powerful tool for discovering topics from documents without any keywords or labels [2–4]. Although LDA and its variants are unsupervised and can automatically discover topics, a significant limitation is that topics discovered by these methods do not always make sense to people and some words are not useful for the recognition of topic meanings. For example, the word ‘fruit’ co-occurs with the words about basketball games. Therefore, some topics discovered by these methods are occasionally confusing and not coherent with human judgements [5,6].

The intuition behind LDA is that a document exhibits multiple topics and a topic is composed of words with different probabilities. The problem of topic coherence exists in two granularities. In the topic granularity, the LDA models may generate some meaningless topics because the words with highest probabilities in these topics are very puzzling. In the word granularity, although we can understand the meaning of a topic by the words with highest probabilities, confusing words like ‘fruit’ in topics about basketball games still exist.

Figure 1 illustrates two topics discovered from the Reuters corpus [7] with the LDA model. As shown in Figure 1, the meaning of topic (a) is not sensible because the topic mixes words about disease, life, economics and so on. Although people may classify topic (b) as bio-medical news, words like ‘right’ and ‘float’ are still puzzling.

Figure 1.

(a) Meaningless topic and (b) meaningful topic with puzzling words.

In the environment of online social media, topic coherence is a persistent problem for LDA and its variants because free speech is the basic feature of online social media and user expressions are often irregular. To mine coherent topics, Newman et al. [8] introduced a novel method to evaluate topic coherence whereby words in topics are rated for coherence or interpretability. Mimno et al. [6] analysed why incoherent topics are generated, designed an automated evaluation metric for incoherent topics and proposed a statistical topic model to find low-quality topics. Although these methods can help us filter low-quality topics, they cannot tell us what to do once flawed topics have been generated.

Human knowledge is widely documented as a useful factor to improve the performance of theoretical models. This article integrates knowledge from human experts into topic models and proposes new interactive strategies to mine high-quality topics. Framework of the proposed interactive latent Dirichlet allocation (iLDA) model is shown in Figure 2. Figure 2 shows that, after topics are discovered and the objective topic-word distribution is generated by LDA, our model allows human experts to integrate their knowledge to generate a subjective topic-word distribution. The objective and subjective topic-word distributions are then merged to generate a comprehensive topic-word distribution, which is used to explore the next generation of topics. The interactive process is repeated until coherent and high-quality topics are obtained.

Figure 2.

Framework of iLDA.

To improve topic quality with the iLDA model, the following three issues should be addressed:

In LDA-based generative models, many topics are usually generated together with a large number of probable words. It is impossible for human experts to provide probabilities for each word in each topic completely by themselves. This article proposes a strategy which allows human experts to offer their subjective knowledge by adjusting a part of the probabilities generated by LDA. Therefore, the first issue of the iLDA model is which of the many topics and words should be selected and presented to human experts for adjustment.

When human experts decrease the probabilities of the puzzling words in a topic, we obtain some surplus probabilities which are the differences between the original probabilities and the adjusted probabilities by human experts. Because a fundamental assumption of LDA is that the probability sum of all words in a topic equals 1, we should allocate the surplus probabilities to the rest of words in the topic. Therefore, the second question of the iLDA model is how to allocate the surplus probabilities generated by expert’s adjustment.

The basic idea of topic models is to draw words for topics automatically based on the joint distribution and the conditional distribution. The iLDA model aims to combine subjective knowledge from human experts with the objective knowledge mined by LDA to improve topic quality. Therefore, the third question of the iLDA model is how to draw words for topics automatically based on subjective and objective topic-word distributions.

The model in the existing literature that is most similar to iLDA is the interactive topic model proposed by Hu et al. [5]. In the interactive topic model, a mechanism was proposed to encode users’ feedback to topic models. The interactive process is simple because users just need to merge or split topics. Because their purpose is to make topic models easy to use even for political scientists who have little data mining knowledge, users are not required to adjust the probabilities of words in topics. The contributions of our proposed iLDA are twofold:

To the best of our knowledge, this is the first effort to integrate human knowledge with topic modelling by allowing human experts to adjust topic-word probabilities. We propose a new interactive framework to integrate subjective and objective knowledge. The framework theoretically contributes to the topic modelling field and has more opportunity to generate fine-grained and explainable topics.

This article proposes a novel indicator to determine which topics should be adjusted by human experts, designs two strategies to allocate the surplus probabilities and proposes a new method to draw words for topics by deriving new joint and conditional distributions. These methods ensure that the proposed iLDA model can integrate the knowledge of human experts with the implied information mined by the general LDA model to obtain coherent and meaningful topics.

The rest of the article is organised as follows. Section 2 reviews the related literature on topic models. Section 3 proposes the iLDA model and its inference algorithm. A computational study is conducted in section 4 to test the proposed iLDA model. Conclusions and future research directions are given in section 5.

2. Literature review

In the big data era, knowledge discovery from massive data is always a challenging task and needs innovation in methodology and technology [9,10]. The topic model is a generative statistical model which provides a simple way to analyse unstructured text data. In topic modelling, a document is often considered as a bag-of-words, which is a vector of words without order. A topic model seeks to map high-dimensional documents to a lower-dimensional latent semantic space [11,12]. In this section, we review related work starting from LDA, which has served as a springboard for many other topic models.

2.1. LDA and topic modelling

The general LDA model, proposed by Blei et al. [1], is a useful method to explore topics in documents in the fields such as business and public policy [13,14]. There are two generation processes in the LDA model: one is the generation of document-topic distribution and the other one is the generation of topic-word distribution. Suppose $α$ and $β$ are hyperparameters of Dirichlet distribution. Parameters $θ$ and $ϕ$ represent topic distribution and word distribution, respectively. If $d_{i}$ and $z_{i}$ denote the document and topic of the ith word, the generation process of LDA is as follows

θ ~ Dirichlet (α)

(1)

z_{i} | θ^{(d_{i})} ~ Multinomial (θ^{(d_{i})})

(2)

ϕ ~ Dirichlet (β)

(3)

w_{i} | z_{i}, ϕ ~ Multinomial (ϕ_{z_{i}})

(4)
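The generative process in formulas (1)–(4) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than a reference implementation: the function names, the symmetric scalar hyperparameters `alpha` and `beta`, and the fixed document length `doc_len` are all hypothetical choices made for the example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalising Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]

def sample_categorical(probs, rng):
    """Draw one index from a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_corpus(n_docs, doc_len, n_topics, vocab_size, alpha, beta, seed=0):
    """Generate (topic, word) pairs for every document (hypothetical helper)."""
    rng = random.Random(seed)
    # Formula (3): one topic-word distribution per topic.
    phi = [sample_dirichlet([beta] * vocab_size, rng) for _ in range(n_topics)]
    corpus = []
    for _ in range(n_docs):
        # Formula (1): document-topic distribution of this document.
        theta = sample_dirichlet([alpha] * n_topics, rng)
        doc = []
        for _ in range(doc_len):
            z = sample_categorical(theta, rng)   # formula (2): topic of the word
            w = sample_categorical(phi[z], rng)  # formula (4): word given topic
            doc.append((z, w))
        corpus.append(doc)
    return phi, corpus

phi, corpus = generate_corpus(n_docs=3, doc_len=5, n_topics=2,
                              vocab_size=10, alpha=0.5, beta=0.1, seed=1)
```

In practice only the words are observed; the topic assignments returned here are the latent variables that inference has to recover.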

Since LDA is an unsupervised model which does not require information about topics within documents and documents are also not labelled with topics or keywords, it is widely used to solve various problems. For example, Hu et al. [15] employed LDA-based method to mine customer opinions and identify top-k most informative sentences from online hotel reviews. Hyung et al. [13] developed music descriptors using LDA model to extract keywords from a large collection of user-generated documents. Ling et al. [16] developed an interpretable LDA model to mine customer preferences from ratings and online reviews so as to obtain accurate recommendation for customers. Büschken and Allenby [17] designed a sentence-based LDA model to improve the inference and prediction of consumer preference. Jacobs et al. [18] took customer heterogeneity into account and developed a novel LDA model which considers a product as a word and the products a customer purchased as a document the customer created. LDA and topic modelling are also widely used to analyse other unstructured data (e.g. picture and image) and obtained satisfactory results [19–25].

Although the general LDA model provides a powerful tool for discovering and exploiting the hidden thematic structure [26], the most probable words for each topic are not always sensible to users. Lu et al. studied the method to improve topic quality [27]. They designed various vocabulary reduction strategies and tested the influence of each strategy on topic modelling.

2.2. Evaluation of topic models

A natural evaluation metric for statistical topic models is the probability of held-out documents given a trained model [28]. Since topic models are generally used to understand documents, it is typically evaluated by either measuring performance on some secondary tasks, such as document classification, or by estimating the log-likelihood probability of unseen held-out documents. Thus, the evaluation probability for held-out document needs to estimate normalisation constants of posterior distribution over topics [28]. A common situation for the probabilities of held-out documents is that it is often inconsistent with human judgement. To measure semantic meaning of topic models, Chang et al. [29] designed two human evaluation tasks to explicitly evaluate the topics inferred by topic models: one is word intrusion and the other one is topic intrusion. Based on the observation that the size of topics usually has a strong relationship with the probability of topics judged by domain experts, Mimno et al. [6] proposed a coherence indicator to identify flawed topics. Because the indicator does not rely on human annotators and is a good predictor of human judgements, it has been widely used to measure topic quality in topic models [30–32]. This article will design a new method based on the coherence indicator to evaluate the quality of topics generated by iLDA.

2.3. Interpretability of topic models

Topics inferred by LDA are not always sensible to users. Many enhanced topic models have been proposed to improve the coherence and interpretability of LDA by incorporating external knowledge. Newman et al. [31] proposed the QUAD-REG and CONV-REG strategies for regularising topic models to generate more coherent and interpretable topics. They characterised external data by a ‘covariance’ matrix and treated the matrix as a prior on word dependencies. Meanwhile, they used coefficients to weight the external information. Another principled approach to incorporating domain knowledge into LDA is inspired by constrained clustering methods [33]. Andrzejewski et al. [34] proposed a method which uses Must-Links and Cannot-Links in topic modelling. The authors express all domain knowledge as Must-Links and Cannot-Links, which represent pairs of words that should or should not both have large probability within the same topic. To incorporate these two kinds of links into the LDA model, they used the Dirichlet Tree distribution [35,36] as the conjugate prior of the topic-word distribution. They took the transitive closure of the Must-Links and transformed the Cannot-Links into an alternative form that is amenable to the Dirichlet Tree. Inspired by this work, Chen et al. [37] proposed the MDK-LDA method to simulate human judgement. The authors argued that humans gain new knowledge gradually based on old knowledge. Therefore, they added a new latent variable to LDA which results from multi-domain knowledge. To improve the interpretability of topics, researchers have also presented models such as the biterm topic model (BTM) [30] to utilise co-occurrence patterns.

Domain knowledge and other co-occurrence patterns are actually the external data derived from other works and may not be proper for specific corpus. To discover topics with high quality, researchers also seek help from human feedback. Since topics are finally used to help people to understand documents, it may be useful to incorporate human judgement directly. Hu et al. [5] proposed a new mechanism which considers user voices by encoding their feedback to topic models as correlations between words into a topic model. The authors focused on the correlations among words within a topic and used tree-priors to encode these correlations straightforwardly. The mechanism allows users to evaluate whether words within a topic correlated with each other or not. The authors finally build a system that can learn and adapt from users’ input. Topic interpretability is a fundamental problem even if in a well-designed model. Although many enhanced topic models have been proposed to solve the problem, controlling words correlation precisely within a topic based on user feedback is still an unsolved issue.

The current literature shows that, although the LDA model proposes a useful framework for topic analysis, many issues still exist to obtain high-quality topics. The significant limitations of the LDA model are that topics discovered are inexplicable and some words are not useful to recognise topic meanings. Also, new methods are required to measure topic quality and provide guidelines for quality improvement of topics. To deal with these problems, this article allows human experts to integrate their knowledge to generate a new topic-word distribution after topics are discovered by LDA. We design a new formula to measure topic quality and experts can adjust the topic-word distribution of the low-quality topics to generate subjective distributions. The topic-word distributions generated by LDA and human experts are then merged to explore topics from documents. Figure 3(a) and (b) illustrates the difference between the LDA model and the proposed iLDA model. We will provide the details of iLDA in next section.

Figure 3.

Comparison between LDA and iLDA: (a) latent Dirichlet allocation and (b) interactive latent Dirichlet allocation.

3. iLDA model

3.1. Model framework

Suppose we have a collection of $D$ documents denoted by $D = \{1, \dots, d, \dots, D\}$ where $d$ is the $d$th document, which consists of $V$ vocabulary denoted by $V = \{1, 2, \dots, V\}$ . Each document is a sequence of words denoted by $w_{d} = \{w_{1}, w_{2}, \dots, w_{d_{m}}\}$ and let $n_{d}$ denote the $n$th word in document $d$. As illustrated in Figure 3(a), the LDA model assumes that documents are mixtures of topics where a topic is a distribution over a fixed vocabulary. Nodes in Figure 3 represent random variables or distributions where the shaded node w is the observed word in documents. Parameters $α$ and $β$ are the fixed hyperparameters of $θ$ and $ϕ_{l}$ which represent the multinomial document-topic distribution and topic-word distribution, respectively. Variable z is the topic sampled from $θ$ . Edges encode the conditional densities underlying the generative process. As discussed in the literature review, although LDA can automatically discover the hidden structures of the documents, some words in topics do not always make sense.

To make topics more understandable, the proposed iLDA model focuses on the integration of knowledge from human experts. The graphical model of iLDA is represented in Figure 3(b). Based on the general LDA model, the iLDA model introduces a new variable $ϕ_{u}$ to denote the subjective topic-word distribution generated by human experts. The iLDA model integrates the objective distribution $ϕ_{l}$ and the subjective distribution $ϕ_{u}$ to generate the comprehensive topic-word distribution $ϕ$ which is used to guide the generation of word w. Table 1 presents the variables used in the iLDA model.

Table 1.

Notations of iLDA model.

Notation	Description
$α$	Hyperparameters of multinomial distribution $θ$
$β$	Hyperparameters of multinomial distribution $ϕ$
$θ$	Multinomial distribution over document topics
$ϕ_{l}$	Multinomial distribution over topic words generated by general LDA
$ϕ_{u}$	Multinomial distribution over topic words generated by expert knowledge
$ϕ$	Multinomial distribution over topic words in interactive LDA
k	The index of topic
$λ_{1}$	The trust of LDA
$λ_{2}$	The trust of human beings
z	The topic assigned to the word
w	The words of documents
V	Vocabulary occurs in all documents
d	The index of document

LDA: latent Dirichlet allocation.

Suppose $λ_{1}$ and $λ_{2}$ are the weights of $ϕ_{l}$ and $ϕ_{u}$ to generate $ϕ$ , the iLDA model employs a linear weighted-sum strategy to generate $ϕ$

ϕ = λ_{1} ϕ_{l} + λ_{2} ϕ_{u}

(5)

where $λ_{1} + λ_{2} = 1$ . Parameters $λ_{1}$ and $λ_{2}$ measure the reliability degree of the LDA model and expert knowledge. When employing the iLDA model to extract topics from documents, a common wisdom is to set a bigger $λ_{2}$ if the human expert is an authority in the field and a smaller one if expert knowledge in the field is insufficient. By integrating expert knowledge, iLDA model generates words in documents by the steps in Figure 4.

Figure 4.

The generation process of iLDA model.
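The weighted-sum combination in formula (5) can be illustrated with a short Python sketch. The function name `combine_topic_word` and the toy probabilities are assumptions made for the example; they are not taken from the experiments.

```python
def combine_topic_word(phi_l, phi_u, lambda_1, lambda_2):
    """Formula (5): phi = lambda_1 * phi_l + lambda_2 * phi_u, topic by topic."""
    if abs(lambda_1 + lambda_2 - 1.0) > 1e-9:
        raise ValueError("lambda_1 and lambda_2 must sum to 1")
    return [
        [lambda_1 * p_l + lambda_2 * p_u for p_l, p_u in zip(row_l, row_u)]
        for row_l, row_u in zip(phi_l, phi_u)
    ]

# One topic over a three-word vocabulary (toy values).
phi_l = [[0.5, 0.3, 0.2]]  # objective distribution learned by LDA
phi_u = [[0.6, 0.4, 0.0]]  # subjective distribution after expert adjustment
phi = combine_topic_word(phi_l, phi_u, lambda_1=0.7, lambda_2=0.3)
```

Because both inputs are probability distributions and the weights sum to 1, each combined row is again a probability distribution.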

3.2. Inference

To draw words according to the distribution of $θ$ , $ϕ_{l}$ and $ϕ_{u}$ , the key inferential problem that we need to solve is to compute the joint distribution

p (w, z, θ, ϕ_{l} | α, β, ϕ_{u}) = p (w | z, ϕ_{l}, ϕ_{u}) p (z | θ) p (θ | α) p (ϕ_{l} | β)

(6)

Suppose we sample topic $z$ given distribution $θ$ , we can obtain the probability of all topics in the corpus

p (z | θ) = Π_{m = 1}^{D} Π_{k = 1}^{K} θ_{mk}^{n_{mk}}

(7)

Here, $θ_{mk}$ represents the probability of topic k in document m and $n_{mk}$ is the number of times topic k occurs in document m. Given the topic, the topic-word distribution and the constraint distribution, the probability of word generation is

p (w | z, ϕ_{l}, ϕ_{u}) = Π_{k = 1}^{K} \underset{{i : z_{i} = k}}{Π} p (w_{i} = t | z_{i} = k)

(8)

p (w | z, ϕ_{l}, ϕ_{u}) = Π_{k = 1}^{K} Π_{t = 1}^{V} [λ_{1} ϕ_{lkt} + λ_{2} ϕ_{ukt}]^{n_{kt}}

(9)

where $\{i : z_{i} = k\}$ represents all words that are assigned to topic k. $ϕ_{lkt}$ and $ϕ_{ukt}$ are the probabilities of word t in topic k generated by LDA and human experts, respectively. $n_{kt}$ is the number of times word t is assigned to topic k.

After we obtain the joint distribution, we can use Markov Chain Monte Carlo (MCMC) algorithm to infer the unknown parameters. Since topic $z$ is the only hidden variable, we can sample it from $p (z | w)$ . Suppose we observe word $w_{i} = t$ , the topic of the ith word is written as $z_{i}$ and $- i$ represents all of the words except i, we have

p (z_{i} = k | z_{- i}, w) = \frac{p (z, w)}{p (z_{- i}, w_{i} = t, w_{- i})}

(10)

= \frac{p (z_{i} = k, w_{i} = t | z_{- i}, w_{- i})}{p (w_{i} = t | z_{- i}, w_{- i})}

(11)

\propto p (z_{i} = k, w_{i} = t | z_{- i}, w_{- i})

(12)

\begin{matrix} = \int p (z_{i} = k | θ_{m}) Dir (θ_{m} | n_{m, - i} + α) d θ_{m} \\ \cdot \int p (w_{i} = t | ϕ_{k}) Dir (ϕ_{lk} | n_{k, - i} + β) d ϕ_{lk} \end{matrix}

(13)

\begin{matrix} = \int θ_{mk} Dir (θ_{m} | n_{m, - i} + α) d θ_{m} \\ \cdot \int (λ_{1} ϕ_{lkt} + λ_{2} ϕ_{ukt}) Dir (ϕ_{lk} | n_{k, - i} + β) d ϕ_{lk} \end{matrix}

(14)

= E (θ_{mk}) \cdot [λ_{1} E (ϕ_{lkt}) + λ_{2} ϕ_{ukt}]

(15)

Therefore, we have the conditional probability

p (z_{i} = k | z_{- i}, w) \propto \frac{n_{m, - i}^{(k)} + α_{k}}{\sum_{k^{'} = 1}^{K} (n_{m, - i}^{(k^{'})} + α_{k^{'}})} \cdot [λ_{1} \cdot \frac{n_{k, - i}^{(t)} + β_{t}}{\sum_{t^{'} = 1}^{V} (n_{k, - i}^{(t^{'})} + β_{t^{'}})} + λ_{2} ϕ_{u, k, t}]

(16)

With the conditional probability $p (z_{i} = k | z_{- i}, w)$ , we can sample a topic assignment for each word with the MCMC algorithm.
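One sampling step based on formula (16) can be sketched as follows. This is an illustrative sketch only: the function names, the count-array layout and the use of symmetric scalar hyperparameters `alpha` and `beta` are assumptions, and the caller is expected to have removed the current word's own assignment from all counts beforehand.

```python
import random

def topic_conditional(doc_topic_counts, topic_word_counts, topic_totals,
                      phi_u, word, alpha, beta, lambda_1, lambda_2):
    """Unnormalised p(z_i = k | z_-i, w) for one word, following formula (16).

    doc_topic_counts  : words of the current document assigned to each topic
    topic_word_counts : topic_word_counts[k][t] = times word t is assigned to topic k
    topic_totals      : total number of words assigned to each topic
    phi_u             : subjective topic-word distribution, phi_u[k][t]
    All counts exclude the word being resampled.
    """
    n_topics = len(topic_totals)
    vocab_size = len(topic_word_counts[0])
    doc_total = sum(doc_topic_counts)
    weights = []
    for k in range(n_topics):
        # Expected document-topic probability E(theta_mk).
        theta_mk = (doc_topic_counts[k] + alpha) / (doc_total + n_topics * alpha)
        # Expected objective topic-word probability E(phi_lkt).
        phi_lkt = (topic_word_counts[k][word] + beta) / (topic_totals[k] + vocab_size * beta)
        weights.append(theta_mk * (lambda_1 * phi_lkt + lambda_2 * phi_u[k][word]))
    return weights

def resample_topic(weights, rng):
    """Draw a new topic index in proportion to the unnormalised weights."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

weights = topic_conditional(
    doc_topic_counts=[2, 0],
    topic_word_counts=[[1, 1], [0, 0]],
    topic_totals=[2, 0],
    phi_u=[[0.9, 0.1], [0.2, 0.8]],
    word=0, alpha=1.0, beta=1.0, lambda_1=0.5, lambda_2=0.5,
)
new_topic = resample_topic(weights, random.Random(0))
```

A full sampler would repeat this step for every word in every document, decrementing the counts before the draw and incrementing them for the newly sampled topic afterwards.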

In the estimation step of LDA, we need to sample the topic index of each word in the documents. In an iteration, we assign a topic to each word and update the topic-word distribution and the document-topic distribution. Thus, the complexity of LDA for each iteration is O(D*N*K) where D is the number of documents, K is the topic size and N is the vocabulary size. For iLDA, we compute the linear combination of the objective distribution and the subjective distribution before sampling the topic index. Thus, the complexity of iLDA is O(D*N* (K + 1)).

3.3. Derivation strategy of human knowledge

As shown in formula (16), the subjective topic-word distribution is required to derive conditional probabilities and draw topics from documents. Because unstructured texts contain a very large number of topics and words, it is impossible for human experts to provide subjective probabilities for each word in each topic. This article employs a compromise strategy to derive expert knowledge. We run the general LDA model for a preset number of iterations, show human experts the results and invite them to adjust the probabilities based on their knowledge. The topic-word distribution generated by experts is considered to be the subjective topic-word distribution. In this process, the key issues in obtaining knowledge from human experts are twofold. The first issue is which topics and words should be presented to experts for adjustment. The second issue is how to calculate the subjective topic-word distribution $ϕ_{u}$ after human experts provide their adjustment.

3.3.1. Selection of topics and words to be adjusted

To determine which topics should be adjusted, we propose an indicator inspired by topic coherence, which is a useful indicator of topic quality and word consistency in topic models [6]. We employ Figure 5 to demonstrate our motivation. Figure 5 illustrates topics of good, intermediate and bad quality, respectively. The horizontal axis is the coherence value of the topic. We can see that the higher the coherence value, the more likely the topic is a good one. If we add a vertical line in Figure 5, we can see that almost all topics are good when the line is near the highest coherence value. Therefore, to decide whether a topic should be adjusted, we calculate the difference between its coherence value and the mean coherence value of the other topics to obtain the adjustment indicator

AV_{k} = C_{k} - \frac{\sum_{j \neq k} C_{j}}{K - 1}

(17)

Here, $C_{k}$ is the coherence value of topic k and the sum runs over the coherence values $C_{j}$ of all topics except topic k. The coherence is defined as

C_{k} = \sum_{m = 2}^{M} \sum_{l = 1}^{m - 1} \log \frac{D (v_{m}^{(k)}, v_{l}^{(k)})}{D (v_{l}^{(k)})}

(18)

where $v_{m}^{(k)}$ is the $m$th of the M most probable words in topic k, $D (v_{m}^{(k)}, v_{l}^{(k)})$ is the co-document frequency of words $v_{m}^{(k)}$ and $v_{l}^{(k)}$, that is, the number of documents containing both words, $D (v_{l}^{(k)})$ is the document frequency of word $v_{l}^{(k)}$, and K is the number of topics. The intuition behind the proposed indicator is simple yet effective. A topic is considered good if its most probable words co-occur in documents more often than those of the other topics [6]. Otherwise, the topic should be presented to human experts for adjustment. Since coherence values are negative, the adjustment indicators of good topics are generally positive.
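Formulas (17) and (18) can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the experiments: the function names and the `smoothing` argument are assumptions. Formula (18) as written corresponds to `smoothing=0`; the default of 1 follows the coherence metric of Mimno et al. [6] and avoids taking the logarithm of zero when two words never co-occur.

```python
import math

def topic_coherence(top_words, documents, smoothing=1.0):
    """Formula (18) for one topic.

    top_words : the M most probable words, ordered from most to least probable
    documents : iterable of tokenised documents (lists of words)
    Every top word is assumed to occur in at least one document.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            co_freq = doc_freq(top_words[m], top_words[l])
            score += math.log((co_freq + smoothing) / doc_freq(top_words[l]))
    return score

def adjustment_indicators(coherences):
    """Formula (17): AV_k = C_k minus the mean coherence of the other topics."""
    total = sum(coherences)
    n_topics = len(coherences)
    return [c - (total - c) / (n_topics - 1) for c in coherences]

docs = [["a", "b"], ["a", "b"], ["a", "c"]]
c_ab = topic_coherence(["a", "b"], docs)
c_ac = topic_coherence(["a", "c"], docs)
av = adjustment_indicators([-1.0, -3.0, -5.0])
```

Topics whose indicator is negative have below-average coherence and would be the ones shown to human experts.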

Figure 5.

Topic coherence.

After the topics with low qualities are recognised and presented to human experts, they are asked to evaluate which words in the topics are not correlated with the topic meaning and then make adjustment decisions for the probabilities of the words. Because the probability sum of all words within a topic equals 1, we propose two strategies to allocate the surplus probabilities generated by expert adjustment next.

3.3.2. Calculation of subjective topic-word distribution

When the topics and words are presented to human experts, the iLDA model allows human experts to decrease the probabilities for the flawed words according to their knowledge. Suppose topic k with topic-word distribution $ϕ_{l}^{(k)} = \{p_{lv}^{(k)} | v = 1, 2, \dots, V\}$ is presented to human experts for adjustment where $p_{lv}^{(k)}$ is the probability of word v in topic k. From the T most probable words $W_{T}$ in topic k which correspond to probabilities $ϕ_{lT}^{(k)} = \{p_{lt}^{(k)} | t = 1, \dots, T\}$ , human experts select $T^{'}$ words $W_{T^{'}}$ with probabilities $P_{lT^{'}}^{(k)} = \{p_{lt^{'}}^{(k)} | t^{'} = 1, 2, \dots, T^{'}, T^{'} \leq T\}$ and decrease their probabilities by $P_{uT^{'}}^{(k)} = \{p_{ut^{'}}^{(k)} \in [0, 1) | t^{'} = 1, 2, \dots, T^{'}\}$ . Here, $p_{ut^{'}}^{(k)}$ is the degenerative probability which measures the distrust degree human experts put on word t′ in topic k according to their knowledge. With the distrust degree from human experts, we propose two strategies, that is, the deterministic strategy and the stochastic strategy, to get the subjective topic-word distribution.

Deterministic strategy

The deterministic strategy adjusts the objective topic-word distribution according to the degenerative probabilities completely. The topic-word probabilities for the words adjusted by human experts are calculated as

ϕ_{{uT}^{'}}^{(k)} = {p_{{lt}^{'}}^{(k)} \times p_{{ut}^{'}}^{(k)} | t \in w_{T}, t^{'} \in w_{T^{'}}}

(19)

To make sure the sum of the probabilities of all words in a topic is 1, the surplus probabilities from the adjusted words are distributed to the other most probable words, that is, $W_{T} - W_{T^{'}}$, rather than to all of the remaining words, that is, $W_{V} - W_{T^{'}}$, in the topic. The weighted redistribution of the surplus probabilities to the words in $W_{T} - W_{T^{'}}$ is

ϕ_{u, - T^{'}}^{(k)} = {p_{- t^{'}} \times (1 + p_{r} \frac{p_{- t^{'}}}{\sum_{- T^{'}} p_{- t^{'}}}) | - t^{'} \in T and - t^{'} \notin T^{'}}

(20)

where $p_{r} = \sum_{t^{'} = 1}^{T^{'}} p_{t^{'}} \times (1 - p_{u t^{'}}^{(k)})$ measures the surplus probability, that is, the total probability removed from the adjusted words. Probability $p_{r}$ is prorated to the probable words in $W_{T} - W_{T^{'}}$ according to $p_{- t^{'}} / \sum_{- T^{'}} p_{- t^{'}}$ .

Suppose the topic-word distribution of the words in $W_{V} - W_{T}$ is $ϕ_{V - T}^{(k)}$ , the subjective topic-word distribution obtained by the deterministic strategy is $ϕ_{u}^{(k)} = \{ϕ_{uT^{'}}^{(k)}, ϕ_{u, - T^{'}}^{(k)}, ϕ_{V - T}^{(k)}\}$ .
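The deterministic strategy can be sketched in Python as below. This is an illustration under stated assumptions: the function name and argument layout are hypothetical, and the redistribution step follows the prose description, adding the surplus $p_{r}$ to each unadjusted top word in proportion to its share of the unadjusted top-word mass so that the topic still sums to 1. That reading of the redistribution rule is an assumption of this sketch.

```python
def deterministic_subjective(phi_lk, top_words, factors):
    """Subjective topic-word distribution of one topic (deterministic strategy).

    phi_lk    : objective probabilities of topic k, indexed by word id
    top_words : ids of the T most probable words shown to the expert (W_T)
    factors   : {word id: degenerative probability in [0, 1)} for words in W_T'
    """
    phi_uk = list(phi_lk)  # words outside W_T keep their objective probability
    # Formula (19): scale each adjusted word by its degenerative probability.
    for w, factor in factors.items():
        phi_uk[w] = phi_lk[w] * factor
    # Surplus p_r: total probability removed from the adjusted words.
    surplus = sum(phi_lk[w] * (1.0 - factor) for w, factor in factors.items())
    # Prorate the surplus over the unadjusted top words (W_T minus W_T').
    rest = [w for w in top_words if w not in factors]
    rest_mass = sum(phi_lk[w] for w in rest)
    if rest_mass <= 0.0:
        raise ValueError("at least one top word with positive probability must stay unadjusted")
    for w in rest:
        phi_uk[w] = phi_lk[w] + surplus * phi_lk[w] / rest_mass
    return phi_uk

phi_lk = [0.4, 0.3, 0.2, 0.1]
phi_uk = deterministic_subjective(phi_lk, top_words=[0, 1, 2], factors={2: 0.5})
```

In this toy example word 2 is halved, the removed mass of 0.1 is shared between words 0 and 1, and word 3, which lies outside the top words, is left unchanged.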

Stochastic strategy

The deterministic strategy assumes that human experts’ knowledge is always accurate and reliable. Therefore, it adjusts the objective topic-word distribution based on the degenerative probabilities given by experts to obtain the subjective topic-word distribution. The stochastic strategy, however, takes expert confidence into account to generate the subjective topic-word distribution. In the stochastic strategy, the degenerative probabilities are considered to reflect the belief human experts have in adjusting the objective distribution. A higher degenerative probability means human experts largely accept the objective probabilities while a smaller one means the objective probabilities are likely to be adjusted according to human experts’ knowledge. We introduce a stochastic variable u to determine whether the probability of a word should be adjusted. With the objective topic-word distribution for topic k, $ϕ_{l}^{(k)} = \{p_{v}^{(k)} | v = 1, 2, \dots, V\}$ , the probabilities for words in $W_{T}$, $ϕ_{lT}^{(k)} = \{p_{t}^{(k)} | t = 1, \dots, T\}$ and words in $W_{T^{'}}$, $P_{lT^{'}}^{(k)} = \{p_{t^{'}}^{(k)} | t^{'} = 1, 2, \dots, T^{'}, T^{'} \leq T\}$ , and the degenerative probabilities given by human experts $P_{uT^{'}}^{(k)} = \{p_{u t^{'}}^{(k)} \in [0, 1) | t^{'} = 1, 2, \dots, T^{'}\}$ , the stochastic strategy calculates the topic-word probabilities for the words adjusted by human experts as

ϕ_{{uT}^{'}}^{(k)} = \{p_{t^{'}}^{(k)} \times [p_{u t^{'}}^{(k)}]^{I (u > p_{u t^{'}}^{(k)})} | t^{'} \in W_{T^{'}}\}

(21)

where $I (u > p_{u t^{'}}^{(k)})$ is an indicator function which equals 1 if $u > p_{u t^{'}}^{(k)}$ and 0 otherwise. Then, we can employ the weighted redistribution method (20) to calculate the topic-word probabilities for words in $W_{V} - W_{T^{'}}$ and obtain the subjective topic-word distribution.
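The stochastic strategy can be sketched in the same way. Several points here are assumptions rather than statements from the text: the function name is hypothetical, the variable u is drawn independently for each word from a uniform distribution on [0, 1), a word is scaled by its degenerative probability only when u exceeds that degenerative probability, and the surplus is prorated over the unadjusted top words by their share of probability mass.

```python
import random

def stochastic_subjective(phi_lk, top_words, factors, rng):
    """Subjective topic-word distribution of one topic (stochastic strategy).

    phi_lk    : objective probabilities of topic k, indexed by word id
    top_words : ids of the T most probable words shown to the expert (W_T)
    factors   : {word id: degenerative probability in [0, 1)} for words in W_T'
    rng       : random.Random instance supplying the stochastic variable u
    """
    phi_uk = list(phi_lk)
    surplus = 0.0
    for w, factor in factors.items():
        u = rng.random()  # assumption: u ~ Uniform[0, 1), redrawn per word
        if u > factor:
            # The expert's adjustment is applied to this word.
            phi_uk[w] = phi_lk[w] * factor
            surplus += phi_lk[w] - phi_uk[w]
    if surplus > 0.0:
        # Prorate the removed mass over the unadjusted top words.
        rest = [w for w in top_words if w not in factors]
        rest_mass = sum(phi_lk[w] for w in rest)
        if rest_mass <= 0.0:
            raise ValueError("at least one top word with positive probability must stay unadjusted")
        for w in rest:
            phi_uk[w] = phi_lk[w] + surplus * phi_lk[w] / rest_mass
    return phi_uk

phi_lk = [0.4, 0.3, 0.2, 0.1]
adjusted = stochastic_subjective(phi_lk, [0, 1, 2], {2: 0.0}, random.Random(0))
kept = stochastic_subjective(phi_lk, [0, 1, 2], {2: 0.9}, random.Random(0))
```

With a degenerative probability close to 1 the adjustment is rarely triggered, which matches the statement that a higher value means the expert largely accepts the objective probability.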

4. Experiment

4.1. Data

In this section, two real-world text corpora are used to test the effect of the proposed iLDA model. The characteristics of the corpora are summarised in Table 2 and we give brief descriptions of the two corpora as follows:

Table 2.

Dataset statistics.

Corpus	Number of documents	Vocabulary size	Average number of words per document
Reuters	11,413	24,741	95
Weibo	51,315	15,397	17.89

Reuters [7]. The Reuters corpus was released by Reuters Ltd. in 2000 and has been widely used in natural language processing, machine learning and information retrieval. The original corpus, known as ‘RCV1’, has been used and relabelled by many researchers. The dataset used in this article was labelled by Moschitti and Basili [38].

Weibo. This dataset was collected from Weibo.com, a popular social media website in China on which users post, repost and comment on messages. We collected Weibo messages with a web crawler for our experiment.

Reuters and Weibo differ in how their content is generated. The Reuters corpus consists of formal news written in a regular style. The Weibo corpus, however, is more open and irregular because Weibo.com is an open forum where users with different backgrounds post large amounts of content without any specific form.

4.2. Experiment on model performance

4.2.1. Topic quality

To study the quality of topics discovered by the proposed iLDA, we compare it with the following baseline methods:

Mixture of unigrams (MU). The MU model [39] assumes that each document is generated by only one topic and that words are drawn from the topic independently. In this article, we use jLDADMM (https://github.com/datquocnguyen/jLDADMM) to implement the MU algorithm.

Probabilistic latent semantic analysis (pLSA). pLSA [40] is also known as a probabilistic latent semantic indexing (pLSI) strategy. It models each word in a document as a sample from a mixture model, where the mixture components are multinomial random variables that can be viewed as representations of ‘topics’. We use mltool4j (https://code.google.com/archive/p/mltool4j) to implement pLSA.

LDA. As one of the most classical topic models, the general LDA model can induce topic-word distributions from a large number of unlabelled documents. Because iLDA is built on LDA by integrating expert knowledge, we employ LDA as a baseline to test the influence of expert knowledge on topic modelling. In this article, we use JGibbLDA (http://jgibblda.sourceforge.net) [41,42], which uses Gibbs sampling for parameter estimation and inference.

BTM. BTM [30] learns topics by directly modelling the generation of word co-occurrence patterns in the whole corpus. It handles short texts by modelling biterms, where a biterm is an unordered pair of words co-occurring in a short context. We use the code released by the authors to implement BTM (https://github.com/xiaohuiyan/BTM).

We run five experiments for each model on each corpus by fixing the number of topics as 5, 10, 15, 20 and 25, respectively. The average coherence values of the five models on the two corpora are shown in Tables 3 and 4.
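The coherence values are computed with formula (18), which is defined earlier in the article. For readers who want a concrete picture, the sketch below assumes the document co-occurrence coherence of Mimno et al. [6], $C(t) = \sum_{m=2}^{M} \sum_{l=1}^{m-1} \log \frac{D(v_m, v_l) + 1}{D(v_l)}$, where $D(\cdot)$ counts the documents containing the given word or word pair. Whether formula (18) matches it term by term is an assumption, and the function name and toy documents are hypothetical.

```python
import math


def coherence(top_words, documents):
    """Co-occurrence coherence of one topic (after Mimno et al.).

    top_words -- the topic's most probable words, most probable first
    documents -- iterable of token lists
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            denom = doc_freq(top_words[l])
            if denom == 0:
                continue  # word never occurs; skip instead of dividing by zero
            joint = doc_freq(top_words[m], top_words[l])
            score += math.log((joint + 1) / denom)
    return score


docs = [["oil", "price", "opec"], ["oil", "price"], ["bank", "rate"], ["oil", "bank"]]
good = coherence(["oil", "price"], docs)
bad = coherence(["oil", "rate"], docs)
```

Values closer to zero indicate top words that co-occur more often, which is why the less negative entries in the coherence tables correspond to better topics.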

Table 3.

Average coherence values for different methods on Weibo.

	Weibo
	MU	pLSA	LDA	BTM	iLDA
Topic 5	−1918.11	−2002.50	−1928.67	−1921.08	−1908.71
Topic 10	−1967.66	−1946.78	−1856.43	−1852.72	−1844.63
Topic 15	−1969.90	−1918.67	−1817.71	−1820.00	−1796.12
Topic 20	−1941.06	−1896.00	−1774.42	−1803.38	−1747.20
Topic 25	−1949.67	−1896.93	−1761.44	−1909.07	−1742.87

MU: mixture of unigrams; LDA: latent Dirichlet allocation; pLSA: probabilistic latent semantic analysis; BTM: biterm topic model; iLDA: interactive latent Dirichlet allocation.

Table 4.

Average coherence values for different methods on Reuters.

	Reuters
	MU	pLSA	LDA	BTM	iLDA
Topic 5	−1665.25	−1649.76	−1531.07	−1635.78	−1540.89
Topic 10	−1605.84	−1610.66	−1350.24	−1598.48	−1447.96
Topic 15	−1601.39	−1587.35	−1350.24	−1586.57	−1333.43
Topic 20	−1609.14	−1576.64	−1255.33	−1569.94	−1250.68
Topic 25	−1600.55	−1560.09	−1260.70	−1561.06	−1250.40

MU: mixture of unigrams; LDA: latent Dirichlet allocation; pLSA: probabilistic latent semantic analysis; BTM: biterm topic model; iLDA: interactive latent Dirichlet allocation.

As shown in Tables 3 and 4, the proposed iLDA outperforms the baselines in all the experiments except for the experiment on Reuters with five topics. Among the five models, MU and pLSA perform worst and LDA dominates them. LDA obtains much better coherence on Reuters than on Weibo, which indicates that data sparsity has a significant negative influence on LDA’s performance. BTM performs only slightly better than MU and pLSA on Reuters, while it achieves reasonable results on Weibo, where it even outperforms LDA in some cases (e.g. the experiments with 5 and 10 topics). The results suggest that BTM can alleviate the influence of data sparsity.

To study whether the performance of iLDA is statistically robust, we conduct pairwise t-tests to compare the topic coherence values obtained by iLDA and the baseline methods. We compare the coherence values of 75 (= 5 + 10 + 15 + 20 + 25) topics for each method and present the p values in Figure 6. Figure 6 shows that all the coherence improvements from the baseline methods to iLDA are statistically significant at the 0.05 level; that is, the proposed iLDA model performs significantly better than the baseline methods.

Figure 6.

Results of t-test between iLDA and baselines.

4.2.2. Detailed comparisons between iLDA and LDA

Because the proposed iLDA model builds on LDA by integrating subjective knowledge, this section compares the quality of the topics derived by iLDA and LDA. Based on the topic coherence formula (18), this article uses the defeat ratio from iLDA to LDA, defined in formula (22), to compare their performances

$$\text{defeat ratio} = \frac{N_{iLDA}}{N_{iLDA} + N_{LDA}} \times 100\%$$

(22)

where $N_{iLDA}$ denotes the number of topics for which iLDA achieves higher coherence than LDA, and $N_{LDA}$ denotes the number of topics for which LDA achieves higher coherence than iLDA.

To calculate the defeat ratio, we exclude the topics that are not modified by experts, so that the defeat ratio is larger than 50% whenever iLDA performs better. A 100% defeat ratio from iLDA to LDA means that all of the topics drawn by iLDA are better than those drawn by LDA. Since LDA is not sensitive to prior information, we set $α = 1 / K$ and $β = 50 / K$ for both models. We will relax this constraint and test the performance of our model under various reliability degrees for the LDA model and expert knowledge.
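The defeat ratio in formula (22) can be sketched in a few lines of Python. The sketch assumes that topic $k$ of iLDA is compared with topic $k$ of LDA (a pairing by index that the text does not spell out) and that ties count for neither model. The function name and the coherence values are invented purely for illustration.

```python
def defeat_ratio(coh_ilda, coh_lda, modified):
    """Defeat ratio from iLDA to LDA, in percent.

    coh_ilda, coh_lda -- coherence of topic k under each model (same order)
    modified          -- booleans, True if an expert adjusted topic k
    """
    n_ilda = n_lda = 0
    for a, b, touched in zip(coh_ilda, coh_lda, modified):
        if not touched:
            continue  # unmodified topics are left out of the ratio
        if a > b:
            n_ilda += 1
        elif b > a:
            n_lda += 1
        # ties are ignored (assumption)
    if n_ilda + n_lda == 0:
        raise ValueError("no modified topic differs between the models")
    return 100.0 * n_ilda / (n_ilda + n_lda)


# Illustrative coherence values only; they are not taken from the experiments.
ratio = defeat_ratio(
    [-1700.0, -1810.5, -1650.2, -1900.0, -1755.0],
    [-1720.0, -1800.0, -1690.9, -1900.0, -1790.3],
    [True, True, True, False, True],
)
```

Here four topics were modified, iLDA wins three of them and LDA one, so the defeat ratio is 75%.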

We conduct 5×4×4 experiments in which the factor levels are as follows: number of topics = [5, 10, 15, 20, 25], number of probable words in a topic = [20, 30, 40, 50] and reliability degree of human experts = [0.2, 0.4, 0.6, 0.8]. Topic-K in Tables 5–7 represents the experiments in which the number of topics is set to K. The average defeat ratios of all experiments are shown in Table 5, which shows that the proposed iLDA model improves topic quality compared with the LDA model. With five topics, the defeat ratio is 75.6% for the Reuters corpus and 89.1% for the Weibo corpus, much higher than the 50% that would be obtained if iLDA and LDA performed equivalently. We also provide the detailed results when the number of probable words is fixed at 20 (Table 6) and when the reliability degree of human experts is fixed at 0.8 and 0.2 for the Reuters and Weibo corpora, respectively (Table 7). Both tables show that integrating expert knowledge yields a positive improvement. Tables 6 and 7 also indicate that factors such as the number of topics and the reliability degree affect the performance of iLDA. We examine the influence of these factors in the sensitivity analysis.
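The 5×4×4 design can be enumerated as follows. This is only a sketch of the factor grid: the dictionary keys are hypothetical, and the assumption that each per-topic-number average is taken over the 16 remaining factor combinations is one reading of the text.

```python
from itertools import product

topic_numbers = [5, 10, 15, 20, 25]
probable_words = [20, 30, 40, 50]
reliability_degrees = [0.2, 0.4, 0.6, 0.8]

# One dictionary per experimental configuration (key names are hypothetical).
grid = [
    {"topics": k, "probable_words": n, "reliability": r}
    for k, n, r in product(topic_numbers, probable_words, reliability_degrees)
]

# Configurations that share one number of topics, e.g. the Topic 5 column.
topic5_runs = [cfg for cfg in grid if cfg["topics"] == 5]
```

The grid contains 80 configurations, 16 for each number of topics.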

Table 5.

The average defeat ratios.

	Topic 5	Topic 10	Topic 15	Topic 20	Topic 25
Reuters	0.756	0.619	0.630	0.503	0.574
Weibo	0.891	0.625	0.545	0.525	0.537

Table 6.

The defeat ratios when number of probable words is 20.

	Topic 5	Topic 10	Topic 15	Topic 20	Topic 25
Reuters	0.75	0.6	0.714	0.7	0.708
Weibo	0.75	0.8	0.667	0.65	0.6

Table 7.

The defeat ratios when fixing reliability degree.

	Reliability degree	Topic 5	Topic 10	Topic 15	Topic 20	Topic 25
Reuters	0.8	0.75	0.6	0.6	0.7	0.708
Weibo	0.2	0.75	0.8	0.533	0.65	0.56

4.2.3. Document clustering

Besides topic-word distributions, topic models also generate document-topic distributions for the corpus, so each document can be represented by a vector of topics. Therefore, a decent topic model should be able to draw topics that approximate document content. This section employs document clustering as an indirect strategy to test the performance of the proposed iLDA model and the baselines. The motivation behind the experiment is that, if a model can draw high-quality topics, document clustering based on those topics should obtain better results because the topics are an accurate approximation of document content.

The k-means method is employed to cluster documents. With the k-means method, each document is considered as a data point represented by the corresponding document-topic distribution. The method randomly chooses k documents from the corpus and uses them as the initial means. The other documents are then assigned to the nearest cluster. After this initial clustering step, the centroid of each cluster is recalculated and the assignment and update steps are iterated. Because the purpose of k-means is to maximise the between-group dispersion and minimise the within-group dispersion of the samples, we use between_SS/total_SS to measure clustering effectiveness. Here, between_SS is the size-weighted sum of squared distances between the cluster centres and the overall mean, and total_SS is the total sum of squared distances between the documents and the overall mean. The results are shown in Table 8.
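The between_SS/total_SS measure can be computed from any hard clustering, as the following sketch shows. It assumes the standard sum-of-squares decomposition (between_SS as the size-weighted squared distance of each cluster centre from the grand mean, total_SS as the squared distance of every point from the grand mean). The function name and the toy document-topic vectors are hypothetical, and the k-means iterations themselves are omitted.

```python
def between_total_ratio(points, labels):
    """between_SS / total_SS for a given hard clustering.

    points -- list of equal-length numeric vectors (document-topic rows)
    labels -- cluster id of each point
    """
    dim = len(points[0])
    grand = [sum(p[j] for p in points) / len(points) for j in range(dim)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Total dispersion of all points around the grand mean.
    total_ss = sum(sq_dist(p, grand) for p in points)

    clusters = {}
    for p, c in zip(points, labels):
        clusters.setdefault(c, []).append(p)

    # Dispersion of the cluster centres, weighted by cluster size.
    between_ss = 0.0
    for members in clusters.values():
        centre = [sum(p[j] for p in members) / len(members) for j in range(dim)]
        between_ss += len(members) * sq_dist(centre, grand)
    return between_ss / total_ss


theta = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
ratio = between_total_ratio(theta, [0, 0, 1, 1])
```

A ratio close to 1 means that almost all of the dispersion lies between clusters rather than within them, which is the sense in which higher percentages indicate better clustering.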

Table 8.

The between_SS/total_SS values for document clustering.

Corpus	Topic size	LDA	iLDA	MU	pLSA	BTM
Weibo	Topic 5	78.00%	87.00%	30.50%	82.80%	70.60%
	Topic 10	88.10%	88.70%	42.00%	73.50%	79.60%
	Topic 15	82.00%	88.30%	75.50%	72.50%	71.60%
	Topic 20	88.80%	89.70%	89.00%	75.80%	78.20%
	Topic 25	79.40%	87.00%	94.40%	74.00%	79.30%
Reuters	Topic 5	97.60%	97.70%	19.30%	88.70%	80.70%
	Topic 10	96.80%	97.00%	70.00%	87.40%	83.70%
	Topic 15	97.20%	97.50%	76.70%	75.90%	84.40%
	Topic 20	95.10%	95.20%	88.20%	81.30%	82.60%
	Topic 25	95.80%	95.90%	90.90%	74.00%	84.10%

MU: mixture of unigrams; LDA: latent Dirichlet allocation; BTM: biterm topic model; pLSA: probabilistic latent semantic analysis; iLDA: interactive latent Dirichlet allocation.

Table 8 indicates that the proposed iLDA model obtains better clustering results than LDA in every setting, and better results than the other baselines in every setting except Weibo with 25 topics, where MU scores higher. The results suggest, from an indirect perspective, that iLDA can discover high-quality topics from documents. By integrating subjective and objective knowledge, iLDA can draw topics that summarise document content accurately and thus lead to better document clustering.

4.3. Sensitivity analysis

4.3.1. Influence of adjustment indicator

In section 3.3, we introduce the adjustment indicator to determine which topics should be adjusted by human experts. This section designs two experiments to test the influence of the indicator on the performance of iLDA. We classify the discovered topics into two categories: the first category consists of the high-quality topics with the top 20% of indicator values and the second category includes the remaining 80% of topics. In the first experiment, we show human experts the topics in the first category and invite them to change the probabilities of the words, while in the second experiment, human experts adjust the probabilities of the words in the topics of the second category. Table 9 provides the coherence improvement from LDA to iLDA under these two adjustment strategies. Table 9 shows that adjusting the topics with low indicator values increases coherence (experiment 2), whereas the coherence value may decrease if we change the probabilities of the topics with the top 20% of indicator values (experiment 1). For the topics with high indicator values, their qualities are mostly reasonable based on the generative mechanism of LDA. The adjustment by human experts brings conflicting information and thus decreases the coherence values. For the topics with low indicator values, however, human experts can identify the puzzling words and improve topic quality by decreasing their probabilities. Table 9 shows that, while reducing workload and saving time for human experts, the proposed adjustment indicator is also useful for improving topic quality.
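The split into the two categories can be sketched as below. The indicator itself is defined in section 3.3 and is not recomputed here; the values are invented, the function name is hypothetical, and rounding the 20% share to the nearest whole topic (with at least one topic) is an assumption.

```python
def split_by_indicator(indicator, top_share=0.2):
    """Split topics into the top `top_share` by indicator value and the rest.

    indicator -- dict topic id -> adjustment indicator value
    """
    ranked = sorted(indicator, key=indicator.get, reverse=True)
    n_top = max(1, round(top_share * len(ranked)))  # rounding rule is an assumption
    return ranked[:n_top], ranked[n_top:]


# Illustrative indicator values only.
high, low = split_by_indicator(
    {"t0": 0.91, "t1": 0.35, "t2": 0.62, "t3": 0.18, "t4": 0.47}
)
```

In the first experiment the experts would be shown the topics in `high`, and in the second the topics in `low`.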

Table 9.

The validity of adjustment indicator.

	Topic 5	Topic 10	Topic 15	Topic 20	Topic 25
Experiment 1	0.179021	3.274715	−0.28223	−8.77758	−7.82111
Experiment 2	27.23644	12.76005	19.71918	6.28	22.47152

4.3.2. Influence of degenerative probabilities

In the iLDA model, human experts are invited to provide degenerative probabilities for the unrelated words according to their knowledge. Although this requires more effort from human experts, our experiments show that the effort is worthwhile for drawing high-quality topics. Figure 7 illustrates the comparison between iLDA_f and LDA. With the iLDA_f model, human experts are not required to provide degenerative probabilities for words; they are only asked to indicate whether a word is related to a topic. Once human experts mark a word in a topic as unrelated, we apply a fixed degenerative probability, varied from 0.0 to 0.8 across experiments, to decrease its probability in the topic. The results in Figure 7 show that, even with fixed degenerative probabilities, the iLDA_f model still outperforms the LDA model. The defeat ratio is 54.92% and 54.76% on Reuters and Weibo, respectively, which demonstrates the value of expert knowledge for discovering high-quality topics.

Figure 7.

Results with different fixed degenerative probabilities: (a) Reuters and (b) Weibo.

Figure 7 also shows that no single fixed probability is best for all datasets and topic numbers. When the fixed probability is set to a specific value, iLDA_f is able to outperform LDA but cannot achieve the best performance for every number of topics. For example, on the Reuters corpus, iLDA_f is better than LDA for all topic numbers when the fixed probability is set to 0.0. However, the highest defeat ratio for Topic 10 and Topic 15 is obtained when the fixed probability is 0.6. On the Weibo corpus, iLDA_f outperforms LDA for all topic numbers when the fixed probability is 0.6, but Topic 15 and Topic 20 achieve the best results when the fixed probability is 0.2 and 0.8, respectively. Thus, in practice, the trade-off between best performance and robust performance should be considered when the fixed-probability strategy is employed.

We now evaluate the value of the flexible degenerative probabilities employed in the iLDA model, again using the defeat ratio, this time from iLDA to iLDA_f. We set the reliability of human judgement in the iLDA model to 0.4 on Reuters and 0.8 on Weibo and compute the average defeat ratios from iLDA to iLDA_f. As shown in Figure 8, compared with iLDA, which uses flexible degenerative probabilities, the iLDA_f model yields lower topic quality on both corpora. In most cases, iLDA generates better topics than iLDA_f. The defeat ratio from iLDA to iLDA_f is 64.2% and 63.35% on Reuters and Weibo, respectively. Figure 8 shows that all of the defeat ratios from iLDA to iLDA_f are no less than 0.5 on the Reuters corpus and that iLDA_f outperforms iLDA on Weibo in only two cases.

Figure 8.

Defeat ratio from iLDA to iLDA_f: (a) Reuters and (b) Weibo.

4.3.3. Influences of reliability degree

In the proposed iLDA model, human experts play a significant role in model performance. Because the expertise of human experts varies across fields and professional levels usually differ among experts, the reliability degree of human experts should be considered when conducting the interactive process.

To examine whether the iLDA model remains valid under different reliability degrees, we vary the reliability degree of human experts from 0.2 to 1. The model reflects human judgement entirely when the reliability degree equals 1. The comparison between iLDA_r and LDA in Figure 9 shows that the reliability degree has an obvious influence on model performance. Taking Topic 5 on Weibo as an example, the defeat ratio is 100% when the reliability degree is 0.8 and 75% when the reliability degree is 0.6. The average defeat ratios from iLDA_r to LDA over the two corpora are 72.50%, 65.00%, 61.26%, 56.88% and 59.42%, respectively, as the number of topics varies from 5 to 25. Figure 9 indicates that an optimal reliability degree exists for a given number of topics on each corpus. The optimal reliability degree is 0.8 on the Reuters corpus except for Topic 15.
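To make the role of the reliability degree concrete, the sketch below assumes that it enters as the weight $λ$ of the linear weighted-sum combination, that is, combined = $λ$ × subjective + (1 − $λ$) × objective. This is consistent with the statement that a reliability degree of 1 reflects human judgement entirely, but the exact formula is defined earlier in the article, so the symbol `lam`, the function name and the toy distributions are assumptions.

```python
def combine(objective, subjective, lam):
    """Linear weighted sum of two topic-word distributions.

    objective, subjective -- dicts word -> probability over the same words
    lam                   -- reliability degree of the expert, in [0, 1]
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("reliability degree must lie in [0, 1]")
    return {w: lam * subjective[w] + (1.0 - lam) * objective[w] for w in objective}


# Toy distributions for illustration only.
objective = {"stock": 0.5, "market": 0.3, "weather": 0.2}
subjective = {"stock": 0.6, "market": 0.4, "weather": 0.0}
mixed = combine(objective, subjective, 0.8)
human_only = combine(objective, subjective, 1.0)
```

With a reliability degree of 0.8 the unrelated word keeps a small share of its objective probability, whereas a reliability degree of 1 discards the objective distribution completely.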

Figure 9.

Defeat ratio from iLDA_r to LDA: (a) Reuters and (b) Weibo.

Our experiment also shows that complete dependence on human experts harms model performance. Figure 10 illustrates the results when the reliability degree of human experts is set to 1. The average defeat ratio from iLDA to LDA on Reuters and Weibo decreases to 39.72% and 48.31%, respectively. Figures 9 and 10 indicate that subjective and objective knowledge are both essential for the iLDA model.

Figure 10.

Experiments when reliability degree is 1.

4.3.4. Influences of deterministic strategy and stochastic strategy

The iLDA model proposes two strategies, the deterministic strategy and the stochastic strategy, to obtain the subjective topic-word distribution. This section conducts experiments to test the influence of the two strategies. As shown in Figure 11, the defeat ratios from the stochastic strategy to the deterministic strategy are no less than 0.5 in most cases on the two corpora. The stochastic strategy performs worse than the deterministic strategy in only 7 and 8 of the 25 experiments on the two corpora. Figure 12 illustrates a detailed comparison of the two strategies when the reliability degree is 0.8.

Figure 11.

Influences of deterministic and stochastic strategy: (a) Reuters and (b) Weibo.

Figure 12.

Defeat ratio when reliability degree is 0.8.

From Figure 12, we can see that the stochastic strategy outperforms the deterministic strategy except for Topic 5 on the Weibo corpus. The defeat ratio from the stochastic strategy to the deterministic strategy is 70.42% and 56.88% on Reuters and Weibo, respectively. With the deterministic strategy, although the integration of human knowledge helps us find high-quality topics, a potential defect is that the topics are apt to converge towards locally optimal themes. The stochastic strategy, however, acts as a selector variable for the words to be adjusted. It encodes the uncertainty of human judgement into the model, which better exploits the power of the iLDA model. The experiment indicates that integrating human knowledge and escaping locally optimal themes are both critical for the iLDA model.

The experimental results show that the proposed iLDA model outperforms other methods in terms of topic qualities and document clustering. In the proposed model, low-quality topics are selected and experts are invited to adjust the topic-word distributions of these topics. Because the LDA model can extract topics from unstructured texts and expert knowledge can eliminate the negative influence of low-quality topics, the integration of subjective knowledge from experts and objective knowledge from LDA model leads to the accurate results in the experiments.

5. Conclusion and future work

Knowledge discovery from user-generated content has been a commodity much sought after by industry and academic research. To discover high-quality knowledge, this article proposes an interactive strategy of the LDA topic model which aims to integrate subjective knowledge from human experts and objective knowledge mined by the LDA model. The proposed iLDA model employs a new indicator to measure topic quality, provides two (deterministic and stochastic) methods to derive subjective topic-word distribution and proposes guidelines to draw words for topics automatically based on both the subjective and the objective topic-word distributions. Our experiment shows that the proposed iLDA model can mine high-quality topics and is robust under various conditions.

Topic modelling based on LDA has been a frequently used tool to detect instructive knowledge in data such as genetic information, images and networks. While useful to various fields and industries, LDA-based topic modelling still suffers from inherent limitations such as topic quality, coherence and stability [43,44]. Unlike applied research on LDA-based topic modelling, this article focuses on the inherent limitations of the LDA methodology and proposes strategies to improve topic quality. The proposed iLDA model offers fundamental support for the applications of LDA-based topic modelling.

This article employs deterministic and stochastic strategies to obtain expert knowledge and uses a linear weighted-sum method to integrate objective and subjective knowledge. In terms of future research, one possibility is to study new strategies to obtain subjective knowledge from human experts and new methods to integrate the subjective and objective knowledge. In real applications, massive corpora from multiple sources make it impossible for a single expert to deal with all the unstructured text. Another extension of the proposed model is therefore to design interactive strategies for multiple experts to achieve better analysis performance.

Footnotes

Acknowledgements

We appreciate the constructive comments from the anonymous reviewers. We also thank Prof. Chunhua Sun for his contribution.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Funding

This work was supported by the Major Program of the National Natural Science Foundation of China (71490725), the Foundation for Innovative Research Groups of the National Natural Science Foundation of China (71521001), the National Natural Science Foundation of China (71722010, 91546114, 91746302, 71501057) and The National Key Research and Development Program of China (2017YFB0803303).

References

1. Blei, Jordan MI. Latent Dirichlet allocation. J Mach Learn Res 2003; 3: 993–1022.

2. Wang, Jiang. Automatically building templates for entity summary construction. Inform Process Manag 2013; 49: 330–340.

3. Kar, Nunes, Ribeiro. Summarization of changes in dynamic text collections using latent Dirichlet allocation model. Inform Process Manag 2015; 51: 809–833.

4. Ehsan, Shakery. Candidate document retrieval for cross-lingual plagiarism detection using two-level proximity information. Inform Process Manag 2016; 52: 1004–1017.

5. Boyd-Graber, Satinoff, et al. Interactive topic modeling. Mach Learn 2014; 95: 423–469.

6. Mimno, Wallach, Talley, et al. Optimizing semantic coherence in topic models. In: Proceedings of the conference on empirical methods in natural language processing, Edinburgh, 27–31 July 2011, pp. 262–272. Stroudsburg, PA: ACM.

7. Lewis, Yang, Rose, et al. RCV1: a new benchmark collection for text categorization research. J Mach Learn Res 2004; 5: 361–397.

8. Newman, Lau, Grieser, et al. Automatic evaluation of topic coherence. In: Human Language Technologies: the annual conference of the North American chapter of the Association for Computational Linguistics, Los Angeles, CA, 2–4 June 2010, pp. 100–108. Stroudsburg, PA: ACM.

9. Zhang, Yang, Chen, et al. PPHOPCM: privacy-preserving high-order possibilistic c-means algorithm for big data clustering with cloud computing. IEEE T Big Data. Epub ahead of print 5 May 2017. DOI: 10.1109/TBDATA.2017.2701816.

10. Zhang, Yang, Chen, et al. An improved deep computation model based on canonical polyadic decomposition. IEEE T Syst Man Cy A 2018; 48: 1657–1666.

11. Landauer, Foltz, Laham. An introduction to latent semantic analysis. Dis Process 1998; 25: 259–284.

12. Hofmann. Probabilistic latent semantic analysis. In: Proceedings of the fifteenth conference on uncertainty in artificial intelligence, Stockholm, 30 July–1 August 1999, pp. 289–296.

13. Hyung, Park J-S, Lee. Utilizing context-relevant keywords extracted from a large collection of user-generated documents for music discovery. Inform Process Manag 2017; 53: 1185–1200.

14. De Mauro, Greco, Grimaldi, et al. Human resources for Big Data professions: a systematic classification of job roles and required skill sets. Inform Process Manag 2018; 54: 807–817.

15. Y-H, Chen Y-L, Chou H-L. Opinion mining from online hotel reviews – a text summarization approach. Inform Process Manag 2017; 53: 436–449.

16. Ling, Lyu, King. Ratings meet reviews, a combined approach to recommend. In: Proceedings of the 8th ACM conference on recommender systems, Silicon Valley, CA, 6–10 October 2014, pp. 105–112. New York: ACM.

17. Büschken, Allenby GM. Sentence-based text analysis for customer reviews. Market Sci 2016; 35: 953–975.

18. Jacobs, Donkers, Fok. Model-based purchase predictions for large assortments. Market Sci 2016; 35: 389–404.

19. Mehmood, Anwar, Ali, et al. A novel image retrieval based on a combination of local and global histograms of visual words. Math Probl Eng 2016; 2016: 8217250.

20. Ali, Bajwa, Sablatnig, et al. Image retrieval by addition of spatial information based on histograms of triangular regions. Comput Electr Eng 2016; 54: 539–550.

21. Yuan. 2D-LDA: a statistical linear discriminant analysis for image matrix. Pattern Recogn Lett 2005; 26: 527–532.

22. Zheng, Caiming, Caixian. MMDF-LDA: an improved multi-modal latent Dirichlet allocation model for social image annotation. Expert Syst Appl 2018; 104: 168–184.

23. Mehmood, Mahmood, Javid MA. Content-based image retrieval and semantic automatic image annotation based on the weighted average of triangular histograms using support vector machine. Appl Intell 2018; 48: 166–181.

24. Mehmood, Anwar, Altaf. A novel image retrieval based on rectangular spatial histograms of visual words. Kuwait J Sci 2018; 45: 54–69.

25. Mehmood, Abbas, Mahmood, et al. Content-based image retrieval based on visual words fusion versus features fusion of local and global features. Arabian J Sci Eng 2018; 43: 7265–7284.

26. Blei DM. Probabilistic topic models. Commun ACM 2012; 55: 77–84.

27. Cai, Ajiferuke, et al. Vocabulary size and its effect on topic representation. Inform Process Manag 2017; 53: 653–665.

28. Wallach, Murray, Salakhutdinov, et al. Evaluation methods for topic models. In: Proceedings of the 26th annual international conference on machine learning, Montreal, QC, Canada, 14–18 June 2009, pp. 1105–1112. New York: ACM.

29. Chang, Boyd-Graber, Gerrish, et al. Reading tea leaves: how humans interpret topic models. In: Advances in neural information processing systems, Whistler, BC, Canada, 7–11 December 2009, pp. 1–9. New York: Curran Associates Inc.

30. Yan, Guo, Lan, et al. A biterm topic model for short texts. In: Proceedings of the 22nd international conference on world wide web, Rio De Janeiro, Brazil, 13–17 May 2013, pp. 1445–1456. New York: ACM.

31. Newman, Bonilla, Buntine. Improving topic coherence with regularized topic models. In: Advances in neural information processing systems, Granada, 12–14 December 2011, pp. 496–504. New York: Curran Associates Inc.

32. Roberts, Stewart, Airoldi EM. A model of text for experimentation in the social sciences. J Am Stat Assoc 2016; 111: 988–1003.

33. Basu, Davidson, Wagstaff. Constrained clustering: advances in algorithms, theory, and applications. New York: CRC Press, 2008, p. 129.

34. Andrzejewski, Zhu, Craven. Incorporating domain knowledge into topic modeling via Dirichlet forest priors. In: Proceedings of the 26th annual international conference on machine learning, Montreal, QC, Canada, 14–18 June 2009, pp. 25–32. New York: ACM.

35. Minka. The Dirichlet-tree distribution, 1999, https://tminka.github.io/papers/dirichlet/minka-dirtree.pdf (accessed 2 January 2019).

36. Dennis III. On the hyper-Dirichlet type 1 and hyper-Liouville distributions. Commun Stat Theor Meth 1991; 20: 4069–4081.

37. Chen, Mukherjee, Liu, et al. Leveraging multi-domain prior knowledge in topic models. In: 23rd international joint conference on artificial intelligence (IJCAI), Beijing, China, 3–9 August 2013, pp. 1–7. Palo Alto, CA: AAAI Press.

38. Moschitti, Basili. Complex linguistic features for text classification: a comprehensive study. In: European conference on information retrieval, Sunderland, 5–7 April 2004, pp. 181–196. Berlin: Springer.

39. Nigam, McCallum, Thrun, et al. Text classification from labeled and unlabeled documents using EM. Mach Learn 2000; 39: 103–134.

40. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Mach Learn 2001; 42: 177–196.

41. Bíró, Szabó, Benczúr AA. Latent Dirichlet allocation in web spam filtering. In: Proceedings of the 4th international workshop on adversarial information retrieval on the web, Beijing, China, 22 April 2008, pp. 29–32. New York: ACM.

42. Phan X-H, Nguyen L-M, Horiguchi. Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In: Proceedings of the 17th international conference on world wide web, Beijing, China, 21–25 April 2008, pp. 91–100. New York: ACM.

43. Mehrotra, Sanner, Buntine, et al. Improving LDA topic models for microblogs via tweet pooling and automatic labeling. In: Proceedings of the 36th international ACM SIGIR conference on research and development in information retrieval, Dublin, 28 July–1 August 2013, pp. 889–892. New York: ACM.

44. Koltcov, Nikolenko, Koltsova, et al. Stable topic modeling for web science: granulated LDA. In: Proceedings of the 8th ACM conference on web science, Hannover, 22–25 May 2016, pp. 342–343. New York: ACM.