Combining semantic graph and probabilistic topic models for discovering coherent topics

Abstract

Probabilistic topic models, which typically represent topics as multinomial distributions over words, have been extensively used for discovering latent topics in text corpora. However, because topic models are entirely unsupervised, they may produce topics that are not understandable in applications. Recently, several knowledge-based topic models have been proposed, but they primarily incorporate word-level domain knowledge to enhance topic coherence and ignore the rich information carried by the entities (e.g., persons, locations, organizations) associated with the documents. Additionally, a vast amount of prior (background) knowledge is available as Linked Open Data (LOD) datasets and other ontologies, which can be incorporated into topic models to produce coherent topics. In this paper, we introduce a novel regularization-based entity topic model (RETM), which integrates an ontology with an entity-based topic model (EntLDA) to increase the coherence of the identified topics throughout the topic modeling process. Our experimental results demonstrate the effectiveness of the proposed model in improving the coherence of topics.

Keywords

Statistical learning, topic modeling, topic coherence, Semantic Web, ontologies

1. Introduction

In recent years, considerable effort has been dedicated to revealing the hidden thematic structure of high-dimensional data through statistical and probabilistic techniques. In this context, probabilistic topic models have a long and successful history in statistical data analysis. Probabilistic topic models, such as Latent Dirichlet Allocation (LDA) [7], have been shown to be powerful techniques for analyzing the content of documents and extracting the underlying topics represented in a collection. At a very high level, topic models discover patterns of word co-occurrence in text documents and generate topics from those words to describe the documents. Topic models usually assume that individual documents are mixtures of one or more topics, while topics are probability distributions over words. These models have been extensively used in a variety of text processing tasks, such as word sense disambiguation [8,19], relation extraction [39], text classification [17,21], and information retrieval [37]. Thus, topic models provide an effective framework for extracting the latent semantics from unstructured text collections.

However, because topic models are entirely unsupervised, purely statistical, and data driven, and because they cannot capture correlations, they may produce topics that are neither meaningful and understandable to humans nor interpretable by applications. To cope with this problem, several knowledge-based topic models have been proposed that integrate prior domain knowledge into topic models [3,4,10,20]. These models incorporate domain knowledge to improve the topic identification process and facilitate producing coherent topics. Chemudugunta et al. [10], for example, combined human-defined concepts with unsupervised topic modeling and proposed the Concept-Topic model (CTM). In effect, CTM extends the set of topics by adding human-generated concepts as special topics.

Andrzejewski et al. [3] proposed a model to incorporate the domain knowledge in the form of must-links and cannot-links into LDA. A must-link indicated that two words should be in the same topic, whereas a cannot-link stated that two words should not be in the same topic.

The models in [20,22,30] utilize prior knowledge in the form of seed words to improve topic coherence. In [13], the authors exploit knowledge from multiple domains (Multi-Domain Knowledge) to enhance topic coherence in a new domain. [12] describes a model that employs a specific type of lexical knowledge, namely lexical semantic relations.

Although the aforementioned approaches use some kind of prior knowledge, they essentially treat documents as bags of words and integrate solely word-level prior knowledge into the topic models. Furthermore, some of them assume that the user knows the domain very well and can provide proper knowledge for it, which is not always the case. Improved versions of those models have been proposed [16,28,35] to address these issues. In this context, documents are associated with richer aspects. For instance, news articles convey information about people, locations or events, research articles are linked with authors and venues, and social posts are associated with geo-locations and timestamps. In [35], the authors integrate authorship information into the topic model and discover a topic mixture over the documents and authors. [28] proposed a model to learn the relationship between the topics and the entities mentioned in the articles. In [16], the authors introduced a topic model that links entity mentions in documents to their corresponding entities in the knowledge base.

Moreover, existing topic models do not utilize the vast amount of existing external knowledge bases which are available in the form of ontologies, such as DBpedia [6] and numerous other datasets in Linked Open Data (LOD).¹

¹ http://linkeddata.org/.

Another line of work combines topic modeling with the graph structure of the data. [24] proposed a method to integrate a topic model with a harmonic regularizer [40] based on the network structure of the data. In [14], the authors introduced a topic model that incorporates a heterogeneous information network. Our work differs from previous works in that we utilize the semantic graph of the entities in the ontology to regularize the topic model and discover coherent topics. The underlying intuition is that entities classified into the same or similar domains in the ontology are semantically closely related to each other and should have similar topics. Accordingly, the entities (i.e., ontology concepts and instances) occurring in a document, along with the relationships between them, form a semantic graph, which can be combined with the entity topic model and a regularization framework to improve topic coherence.

In this paper, we propose a topic model which utilizes the DBpedia ontology to enhance the topic modeling process. Our aim is to leverage the semantic graph of concepts in DBpedia and combine their various properties with unsupervised topic models, such as LDA, in a well-founded manner. Although there are existing knowledge-based topic models [10] that use human-defined concept hierarchies along with topic models, they focus on simple aspects of ontologies, i.e., the vocabulary associated with concepts and the hierarchical relations between concepts, and do not consider richer aspects of ontology concepts such as non-hierarchical relations.

This general unified framework has many advantages, such as linking text documents to knowledge bases and LOD and discovering more coherent topics. We first presented our ontology-based topic model, EntLDA model, in [2] where we illustrated that incorporating ontological concepts with topic models improves topic coherence. In this paper, we elaborate on and extend these results. We also extensively explore the theoretical foundation of our ontology-based framework, demonstrating the effectiveness of our proposed model over two datasets.

2. Background

In this section, we introduce the related concepts and definitions that we use throughout the paper.

2.1. Semantic Web

The Semantic Web, initiated by Tim Berners-Lee, is defined as an extension of the current Web through standards developed by the World Wide Web Consortium (W3C). These standards and techniques allow machines to understand and use the knowledge (semantics) behind the Web. In fact, the Web of data (i.e., the Web of resources and the relations between them) enables information to be shared and reused across applications. In order to bring structure to the Web and allow information exchange across applications, the Semantic Web requires a few key technologies. In the following sections, we outline the fundamental Semantic Web technologies that are necessary for achieving this functionality.

2.1.1. Ontology

Ontologies have been designed as a way to express knowledge about a domain in the Semantic Web. Most ontologies are domain specific and are used to describe and represent an area of knowledge by integrating terms (classes) and the relations between those terms (properties). The knowledge can be understood by applications if the terms and the relationships among them are clearly defined. This is done with ontology description languages such as the Resource Description Framework (RDF), with its associated Resource Description Framework Schema (RDFS), and the Web Ontology Language (OWL), which encode knowledge and semantics in a way that computer applications can understand. RDFS provides a standard vocabulary that can be used to describe classes and properties in a specific domain, while RDF has been designed to make statements about the domain’s resources in the form of <subject, predicate, object> triples. RDF and RDFS are the building blocks of the Semantic Web. For example, we can express the fact that “Washington is the Capital of the United States” as an RDF triple: <Washington, isCapitalOf, United States>, whose graph structure is shown in Fig. 1.

Fig. 1.

Graph structure of the example RDF statement.
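As a minimal sketch of how such a statement can be held in memory, the following Python fragment stores triples as plain tuples and indexes them as a directed, edge-labelled graph. The identifiers are shortened illustrative names rather than real URIs, and the helper `build_graph` is hypothetical; a real application would more likely use a dedicated RDF library.

```python
from collections import defaultdict

# Each RDF statement is a (subject, predicate, object) triple.
# The names below are illustrative shorthand, not real URIs.
triples = [
    ("Washington", "isCapitalOf", "United_States"),
]


def build_graph(triples):
    """Index triples as a directed graph: subject -> [(predicate, object)]."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph


graph = build_graph(triples)
print(graph["Washington"])  # [('isCapitalOf', 'United_States')]
```

The subject and object become nodes and the predicate becomes the label of the edge between them, which mirrors the graph view of the statement in Fig. 1.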

2.1.2. Linked Open Data

Linked Open Data (LOD) is about creating typed links between data from various sources [5]. In other words, LOD is a method of publishing structured data in such a way that is interlinked with other data sources. Linked Open Data is based on the standard Web technologies such as HTTP, RDF, and URI. Tim Berners-Lee illustrated a set of rules for publishing linked data on the Web as follows:

Use URIs as the identifiers for things.

Use HTTP so that the things can be looked up.

Provide useful information when people look up a URI, using standards such as RDF, SPARQL, etc.

Include links to other URIs, so that they can find more things.

Since Linked Open Data was introduced, many organizations have published their datasets in the Linked Open Data format. One of the primary datasets in LOD is DBpedia [6]. DBpedia is an ontology (encoded in RDF) containing information extracted from Wikipedia and is publicly available on the Web. The DBpedia ontology provides many advantages: it covers many domains; because it is automatically extracted from Wikipedia at regular intervals, it evolves as Wikipedia changes; it is multilingual and provides localized versions in 125 languages (in all, it contains 3 billion triples, of which 580 million were extracted from the English edition of Wikipedia); and because it is structured, it supports quite complex queries against its data. Hence, it should be feasible to leverage this invaluable knowledge in many data/text mining tasks. In fact, rich knowledge sources such as ontologies in the Semantic Web have been extensively utilized in a variety of data mining and knowledge discovery tasks [9].

2.2. Probabilistic topic models

Probabilistic topic models are a set of algorithms that are used to uncover the hidden thematic structure of a collection of documents. The main idea of topic modeling is to create a probabilistic generative model for the corpus of text documents. In topic models, documents are mixtures of topics, where a topic is a probability distribution over words. The two main topic models are Probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDA). Hofmann [18] introduced pLSA for document modeling. The pLSA model does not provide a probabilistic model at the document level, which makes it difficult to generalize to new, unseen documents. Blei et al. [7] extended this model by introducing a Dirichlet prior on the per-document mixture weights of topics, and called the model Latent Dirichlet Allocation (LDA). In the next section, we briefly describe the LDA method.

Fig. 2.

LDA graphical model.

2.2.1. Latent Dirichlet Allocation (LDA)

The Latent Dirichlet Allocation (LDA) is a generative probabilistic model for extracting thematic information (topics) from a collection of documents. LDA assumes that each document is made up of various topics, where each topic is a probability distribution over words. The graphical model of LDA is shown in Fig. 2 and the generative process is as follows:

1. For each topic $k \in \{1, 2, \dots, K\}$, sample a word distribution $ϕ_{k} \sim \mathrm{Dir}(β)$.

2. For each document $d \in \{1, 2, \dots, D\}$,

(a) sample a topic distribution $θ_{d} \sim \mathrm{Dir}(α)$;

(b) for each word $w_{n}$, where $n \in \{1, 2, \dots, N\}$, in document d,

i. sample a topic $z_{n} \sim \mathrm{Mult}(θ_{d})$;

ii. sample a word $w_{n} \sim \mathrm{Mult}(ϕ_{z_{n}})$.

Here, α and β are the parameters of the symmetric Dirichlet priors. In the LDA model, since words are generated from topics and topics are generated from documents, the probability of a word $w_{i}$ given a document d is defined as:

$$p (w_{i} | d) = \sum_{j = 1}^{K} p (w_{i} | z_{j}) \, p (z_{j} | d) . \tag{1}$$
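The generative process and Eq. (1) can be illustrated with a short simulation. This is only a sketch under simplifying assumptions: every document has the same length, Dirichlet draws are obtained by normalizing Gamma variates from the Python standard library, and the function names (`sample_dirichlet`, `generate_corpus`, `word_given_doc`) are hypothetical.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw from Dirichlet(alphas) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [x / total for x in draws]


def sample_categorical(probs, rng):
    """Draw an index from a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(num_docs, doc_len, K, V, alpha, beta, seed=0):
    rng = random.Random(seed)
    # One word distribution phi_k per topic.
    phi = [sample_dirichlet([beta] * V, rng) for _ in range(K)]
    theta, docs = [], []
    for _ in range(num_docs):
        theta_d = sample_dirichlet([alpha] * K, rng)  # topic mixture of the document
        words = []
        for _ in range(doc_len):
            z = sample_categorical(theta_d, rng)           # topic of this token
            words.append(sample_categorical(phi[z], rng))  # word drawn from that topic
        theta.append(theta_d)
        docs.append(words)
    return phi, theta, docs


def word_given_doc(w, d, phi, theta):
    """Eq. (1): p(w | d) = sum_j p(w | z_j) p(z_j | d)."""
    return sum(phi[k][w] * theta[d][k] for k in range(len(phi)))


phi, theta, docs = generate_corpus(num_docs=3, doc_len=20, K=2, V=5,
                                   alpha=0.5, beta=0.5)
# Summing Eq. (1) over the whole vocabulary gives 1 (up to rounding).
total = sum(word_given_doc(w, 0, phi, theta) for w in range(5))
```

Inference reverses this process: given only `docs`, it recovers plausible values of `phi` and `theta`.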

3. Related work

Probabilistic topic models, and in particular Latent Dirichlet Allocation (LDA) [7], have proven to be effective and widely used techniques in various text processing tasks. Because they are inherently unsupervised and entirely statistical, they do not exploit any prior knowledge.

Recently, several approaches have been proposed that integrate prior knowledge to direct the topic modeling process. For example, [3,19,20,30] apply word-level knowledge into topic models. [3] introduced DF-LDA that uses word-level domain knowledge in the form of must-links and cannot-links in LDA. A must-link indicated that two words should be in the same topic, whereas a cannot-link stated that two words should not be in the same topic. DF-LDA encodes the set of must-links and cannot-links associated with the domain knowledge using a Dirichlet Forest prior, replacing the Dirichlet prior over the topic-word multinomial distributions. In this case the user is allowed to control the strength of the domain knowledge.

Patterson et al. [30] leverage word features as side information to boost topic cohesion. The intuition is to treat word information as features instead of explicit restrictions and to modify the smoothing prior over the topic distributions for words in such a way that correlation is stressed. In this way, the model can learn the prior probability of how words are distributed over different topics based on how similar they are.

[20] described a topic model which uses word-level prior knowledge in the form of sets of seed words in order to find coherent topics. Seed words are user-provided words that represent the topics underlying the corpus. Seed topic information can be utilized to improve the topic-word probability distributions; it can first be transferred to the document level based on the document words and then be used for enhancing document-topic distributions; or it can be combined at the topic level and then at the document level.

Interactive Topic Modeling (ITM) [19] allows the user to incorporate knowledge interactively during the topic modeling process.

Other related works include [13], which introduces the MDK-LDA model to use knowledge from multiple domains to guide the topic modeling process. The knowledge is called an s-set (semantic-set) and refers to a set of words sharing the same semantic meaning in the domain. In this context, each document is represented as a mixture of topics, whereas each topic is a distribution over semantic-sets.

In [12], the authors proposed GK-LDA, a general knowledge-based model, which exploits general knowledge of lexical semantic relations in the topic model. Examples of lexical semantic relations are synonymy, antonymy, hyponymy and adjective-attribute relations. They used synonym, antonym and adjective-attribute relations and showed the advantages of utilizing these relations for discovering coherent topics.

The distinguishing feature of our approach is the use of ontologies as background knowledge in topic models. The works discussed below have also tried to take advantage of ontologies in topic models; however, they use word-level prior knowledge, whereas we exploit ontology concepts (i.e., concept-level knowledge) and their relationships directly in the topic model.

Boyd-Graber et al. [8] introduce the LDAWN topic model, which leverages WordNet knowledge for the word-sense disambiguation task. The basic intuition is that the words in a topic have similar meanings and therefore share paths within WordNet.

Chemudugunta et al. [10] describe CTM, the Concept-Topic model, which combines human-defined concepts with LDA. The key idea in their framework is that topics from statistical topic models and concepts of the ontology are both represented by a set of “focused” words, and they exploit this similarity in their model. In [11], the authors extended the work and proposed HCTM, the Hierarchical Concept-Topic model, in order to leverage the known hierarchical structure among concepts.

In addition to the works mentioned above, [14,24] combined statistical topic modeling with network analysis by regularizing the topic model with a regularization framework based on the network structure.

Our approach is somewhat similar to a few previous works, particularly [8,10,11] in terms of exploiting ontologies in topic models and [14,24] in terms of regularizing the topic model, yet it differs from all of them. In [8], the task is word-sense disambiguation whereas ours is to find coherent topics. CTM [10] uses the tree-structured Open Directory Project² as its ontology and relies on a simple aspect of this concept hierarchy, namely the set of words associated with concepts, and HCTM [11] additionally utilizes the hierarchical structure of ontology concepts to direct the topic model. The models in [14,24] do not consider the entities mentioned in the documents.

² http://www.dmoz.org.

In this paper we propose a novel entity-based topic model, EntLDA, which incorporates the DBpedia ontology into the topic model in a systematic manner. In our model we exploit various properties of concepts, including not only hierarchical relations but also lateral (non-hierarchical) relations between ontology concepts. EntLDA also accounts for entity mentions in documents and their corresponding DBpedia entities as labeled information in the generative process, constraining the Dirichlet priors of the document-entity and entity-topic distributions in order to effectively improve topic coherence.

Fig. 3.

A fragment of the semantic graph from the example text.

4. Problem statement

In this section, the Entity Topic Model based on LDA (EntLDA) and its learning process will be described. We then define the entity network regularization and investigate how to integrate the entity topic model with the regularization framework.

Many topic models like LDA usually ignore the entities associated with the documents in the modeling process because the pure topic models are based on the idea that documents are made up of topics, while each topic is a probability distribution over the vocabulary. Unlike in LDA, in the proposed EntLDA, each document is a distribution over the entities (of the ontology) where each entity is a multinomial distribution over the topics and each topic is a probability distribution over the words. For example, in DBpedia ontology, each entity has a number of topics (categories) assigned to it. Hence, each entity is a mixture of different topics with various probabilities.

We highlight the fact that entities occurring in the document together with the relationships among them can determine the document’s topics. In other words, the underlying intuition behind our model is that documents are associated with entities carrying rich information about the topics of the document and utilizing this significant information is of great interest and can potentially improve the topic modeling and topic coherence.

For instance, the following is a fragment of a recent news article:

U.S. government reports decline in Obamacare individual enrollment. Enrollment in the individual insurance plans created under Obamacare declined to 12.2 million Americans, the U.S. government said on Wednesday as Republican lawmakers and the Trump administration sought to repeal the healthcare law.

As of the end of January, enrollment was down by about 500,000 people from 2016. It is about 1.6 million people short of former President Barack Obama’s goal for 2017 sign-ups, the government said.

The U.S. House of Representatives is working on passing a bill that would gut the 2010 Affordable Care Act, often called Obamacare, in part by replacing the income-based tax credits that helped reduce the monthly premiums for the majority of participants with age-based credits.

The White House and congressional leaders said on Tuesday they were weighing changes to their plan to dismantle Obamacare as Republicans’ questions mounted following an estimate that it would cause 14 million Americans to lose insurance next year.

The finding made it tougher for President Donald Trump to sell his first major piece of legislation, even to fellow Republicans in Congress.

The data released on Wednesday by part of the U.S. Department of Health and Human Services includes people who selected or were automatically enrolled in a plan between Nov. 1 last year and Jan. 31 either through the federal HealthCare.gov website or one of the state-based exchanges.

Of those enrolled, 10.1 million people or 83 percent received the advance premium tax credits, one of the lynchpins of the law alongside the expansion of Medicaid for the poor. About one-third of the enrollees were new to the market.

As Fig. 3 shows, we can recognize the entity mentions (underlined) in the document and induce relationships among them through applying the knowledge from the DBpedia ontology. This leads to the creation of a semantic graph of connected entities that were identified in the document. We combine the structure of such a semantic graph of entities with the probabilistic topic models in order to enhance the coherence of the discovered topics.

4.1. The EntLDA topic model

The novelty of the EntLDA model is in leveraging the potential role of the ontological knowledge associated with entities occurring in a document. This ontological knowledge about the entities (ontology concepts and relationships among them) is fully integrated into the EntLDA topic model. Effectively, this exploits the prior (ontological) knowledge in order to produce coherent topics automatically.

Fig. 4.

Graphical representation of EntLDA; symbols explained in Algorithm 1.

Algorithm 1

EntLDA topic model

Table 1

Notation used in this paper

Symbol	Description
D	number of documents
E	number of entities
K	number of topics
V	number of words
$N_{d}$	number of words in document d
$α_{t}$	asymmetric Dirichlet prior for entity e
β	symmetric Dirichlet prior for topic t
τ	symmetric Dirichlet prior for document d
$z_{i}$	topic assigned to the word at position i in the document d
$e_{i}$	entity assigned to the word at position i in the document d
$w_{i}$	word at position i in the document d
$θ_{e}$	multinomial distribution of topics for entity e
$ϕ_{k}$	multinomial distribution of words for topic k
$ζ_{d}$	multinomial distribution of entities for document d

The graphical representation of EntLDA is shown in Fig. 4 and the generative process is defined in Algorithm 1. The notation used in this paper is summarized in Table 1.

It should be noted that in the generative process for each document d, instead of selecting an entity uniformly from $E_{d}$ as in the author-topic model [35], we draw an entity from a document-specific multinomial distribution $ζ_{d}$ over $E_{d}$ . The reason is based on the assumption that each entity in $E_{d}$ contributes differently in generating the document d. $E_{d}$ is a vector containing all the entities of the document d.
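The generative story described here can be sketched as a small simulation. The sketch assumes what the joint distribution in this section implies: one topic mixture per entity, one word distribution per topic, and one document-specific distribution over the entities in $E_{d}$. For brevity it uses symmetric scalar priors and a fixed document length, and the function names are hypothetical.

```python
import random


def dirichlet(alphas, rng):
    """Draw from Dirichlet(alphas) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [x / total for x in draws]


def draw(probs, rng):
    """Draw an index from a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_entlda(doc_entities, doc_len, E, K, V, alpha, beta, tau, seed=0):
    """doc_entities[d] plays the role of E_d, the entity ids of document d."""
    rng = random.Random(seed)
    theta = [dirichlet([alpha] * K, rng) for _ in range(E)]  # entity -> topics
    phi = [dirichlet([beta] * V, rng) for _ in range(K)]     # topic -> words
    corpus = []
    for E_d in doc_entities:
        zeta_d = dirichlet([tau] * len(E_d), rng)  # document -> its own entities
        tokens = []
        for _ in range(doc_len):
            e = E_d[draw(zeta_d, rng)]  # entity drawn from zeta_d, not uniformly
            z = draw(theta[e], rng)     # topic drawn from that entity's mixture
            w = draw(phi[z], rng)       # word drawn from that topic
            tokens.append((e, z, w))
        corpus.append(tokens)
    return corpus


corpus = generate_entlda(doc_entities=[[0, 2], [1, 2, 3]], doc_len=10,
                         E=4, K=3, V=6, alpha=0.5, beta=0.5, tau=1.0)
```

Replacing the draw from `zeta_d` with a uniform choice over `E_d` would give the entity selection used in the author-topic model.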

According to the model, we can write the joint distribution of all observed and hidden variables as follows:

$$\begin{aligned} & P (w, z, e, ζ, θ, ϕ | E, α, β, τ) \\ & \quad = \prod_{t = 1}^{E} P (θ_{t} | α) \prod_{i = 1}^{K} P (ϕ_{i} | β) \prod_{d = 1}^{D} \Big( P (ζ_{d} | τ, E_{d}) \prod_{j = 1}^{N_{d}} P (e_{d, j} | ζ_{d}) \, P (z_{d, j} | θ_{e_{d, j}}) \, P (w_{d, j} | ϕ_{z_{d, j}}) \Big), \end{aligned}$$

where w, z, e, ζ, θ and ϕ denote the vector versions of the corresponding variables. Integrating out ζ, θ and ϕ, the joint distribution of the words and their topic and entity assignments can be factored into three terms:

$$\begin{aligned} P (w, z, e | α, β, τ, E) & = P (w | z) \, P (z | e) \, P (e) \\ & = \int_{ϕ} P (w | z, ϕ) P (ϕ | β) \, d ϕ \\ & \quad \times \int_{θ} P (z | e, θ) P (θ | α) \, d θ \\ & \quad \times \int_{ζ} P (e | ζ, E_{d}) P (ζ | τ) \, d ζ . \end{aligned} \tag{2}$$

4.2. Inference using Gibbs sampling

In our EntLDA model, two sets of unknown parameters need to be estimated: (1) the E entity-topic distributions θ, the K topic-word distributions ϕ and the D document-entity distributions ζ; and (2) the assigned topic $z_{i}$ and assigned entity $e_{i}$ for each word $w_{i}$. There are a variety of techniques to estimate the parameters of topic models, such as variational EM [7] and Gibbs sampling [15]. In this paper, we utilize the collapsed Gibbs sampling algorithm for EntLDA. Collapsed Gibbs sampling is a Markov Chain Monte Carlo (MCMC) [34] algorithm for sampling from the posterior distribution over the latent variables. Instead of estimating the model parameters directly, we evaluate the posterior distribution on just e and z and then use the results to infer θ, ϕ and ζ.

The posterior inference is defined as follows:

$$\begin{aligned} P (e, z | w, E_{d}, α, β, τ) & = \frac{P (e, z, w | E_{d}, α, β, τ)}{\sum_{e} \sum_{z} P (e, z, w | E_{d}, α, β, τ)} \\ & \propto P (e) \, P (z | e) \, P (w | z), \end{aligned} \tag{3}$$

where

$$P (e) = {\left( \frac{Γ (E τ)}{Γ (τ)^{E}} \right)}^{D} \prod_{d = 1}^{D} \frac{\prod_{j = 1}^{E} Γ (C_{j d}^{ED} + τ_{j})}{Γ \left( \sum_{j'} (C_{j' d}^{ED} + τ_{j'}) \right)}, \tag{4}$$

$$P (z | e) = {\left( \frac{Γ (K α)}{Γ (α)^{K}} \right)}^{E} \prod_{j = 1}^{E} \frac{\prod_{k = 1}^{K} Γ (C_{k j}^{TE} + α_{k})}{Γ \left( \sum_{k'} (C_{k' j}^{TE} + α_{k'}) \right)}, \tag{5}$$

$$P (w | z) = {\left( \frac{Γ (V β)}{Γ (β)^{V}} \right)}^{K} \prod_{k = 1}^{K} \frac{\prod_{w = 1}^{V} Γ (C_{w k}^{WT} + β_{w})}{Γ \left( \sum_{w'} (C_{w' k}^{WT} + β_{w'}) \right)} . \tag{6}$$

For a word token w at position i, its full conditional distribution can be written as:

$$\begin{aligned} & P (e_{i} = j, z_{i} = k | w_{i} = w, z_{- i}, e_{- i}, w_{- i}, E_{d}, α, β, τ) \\ & \quad \propto \frac{C_{j d, - i}^{ED} + τ_{j}}{\sum_{j'} (C_{j' d, - i}^{ED} + τ_{j'})} \times \frac{C_{k j, - i}^{TE} + α_{k}}{\sum_{k'} (C_{k' j, - i}^{TE} + α_{k'})} \times \frac{C_{w k, - i}^{WT} + β_{w}}{\sum_{w'} (C_{w' k, - i}^{WT} + β_{w'})}, \end{aligned} \tag{7}$$

where $C_{w k}^{WT}$ is the number of times word w is assigned to topic k, $C_{k j}^{TE}$ is the number of times topic k is assigned to entity j, and $C_{j d}^{ED}$ is the number of word tokens in document d assigned to entity j. The subscript $- i$ indicates that the contribution of the current word $w_{i}$ being sampled is removed from the counts.
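One way to read Eq. (7) operationally is as a single resampling step for one token. The sketch below is an illustration rather than the authors' implementation: it assumes symmetric scalar priors (so $α_{k} = α$, $β_{w} = β$, $τ_{j} = τ$), stores the counts in nested Python lists, and uses hypothetical function names.

```python
import random


def full_conditional(w, d, E_d, C_ED, C_TE, C_WT, alpha, beta, tau):
    """Unnormalized Eq. (7) for every (entity j in E_d, topic k) pair.

    The counts must already exclude the token being resampled ("-i" counts).
    C_ED[d][j]: tokens of document d assigned to entity j
    C_TE[j][k]: tokens assigned to entity j and topic k
    C_WT[k][w]: times word w is assigned to topic k
    """
    K, V = len(C_WT), len(C_WT[0])
    doc_total = sum(C_ED[d][j] + tau for j in E_d)
    weights = {}
    for j in E_d:
        entity_total = sum(C_TE[j]) + K * alpha
        for k in range(K):
            topic_total = sum(C_WT[k]) + V * beta
            weights[(j, k)] = ((C_ED[d][j] + tau) / doc_total
                               * (C_TE[j][k] + alpha) / entity_total
                               * (C_WT[k][w] + beta) / topic_total)
    return weights


def resample_token(i, d, words, ents, topics, E_d, C_ED, C_TE, C_WT,
                   alpha, beta, tau, rng):
    w, j_old, k_old = words[i], ents[i], topics[i]
    # Remove the current assignment of token i from the counts.
    C_ED[d][j_old] -= 1
    C_TE[j_old][k_old] -= 1
    C_WT[k_old][w] -= 1
    weights = full_conditional(w, d, E_d, C_ED, C_TE, C_WT, alpha, beta, tau)
    pairs = list(weights)
    j_new, k_new = rng.choices(pairs, weights=[weights[p] for p in pairs], k=1)[0]
    # Record the newly sampled assignment.
    C_ED[d][j_new] += 1
    C_TE[j_new][k_new] += 1
    C_WT[k_new][w] += 1
    ents[i], topics[i] = j_new, k_new


# Toy state: one document, two entities, two topics, three word types.
words, ents, topics = [0, 1, 2], [0, 1, 0], [0, 1, 1]
E_d = [0, 1]
C_ED = [[0, 0]]
C_TE = [[0, 0], [0, 0]]
C_WT = [[0, 0, 0], [0, 0, 0]]
for w, j, k in zip(words, ents, topics):
    C_ED[0][j] += 1
    C_TE[j][k] += 1
    C_WT[k][w] += 1

rng = random.Random(0)
for i in range(len(words)):
    resample_token(i, 0, words, ents, topics, E_d, C_ED, C_TE, C_WT,
                   alpha=0.5, beta=0.1, tau=1.0, rng=rng)
```

A full sampler would repeat this sweep over all tokens of all documents for many iterations before reading off the counts.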

After Gibbs sampling, we can easily estimate the topic-word distributions ϕ, entity-topic distributions θ and document-entity distributions ζ by:

$$ζ_{d j} = \frac{C_{j d}^{ED} + τ_{j}}{\sum_{j'} (C_{j' d}^{ED} + τ_{j'})}, \tag{8}$$

$$θ_{j k} = \frac{C_{k j}^{TE} + α_{k}}{\sum_{k'} (C_{k' j}^{TE} + α_{k'})}, \tag{9}$$

$$ϕ_{k w} = \frac{C_{w k}^{WT} + β_{w}}{\sum_{w'} (C_{w' k}^{WT} + β_{w'})}, \tag{10}$$

where $ζ_{d j}$ is the probability of an entity given a document, $θ_{j k}$ is the probability of a topic given an entity and $ϕ_{k w}$ is the probability of a word given a topic.
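Eqs. (8)-(10) all have the same shape: a row of counts is smoothed by the prior and normalized. The following sketch makes this explicit, assuming symmetric scalar priors and the same nested-list count layout as a typical sampler implementation would keep; the function names are hypothetical.

```python
def normalize_with_prior(counts, prior):
    """Return (c + prior) / sum(c' + prior) for one row of counts."""
    total = sum(counts) + prior * len(counts)
    return [(c + prior) / total for c in counts]


def estimate_parameters(C_ED, C_TE, C_WT, doc_entities, alpha, beta, tau):
    # Eq. (8): zeta[d][j], restricted to the entities E_d of each document.
    zeta = []
    for d, E_d in enumerate(doc_entities):
        probs = normalize_with_prior([C_ED[d][j] for j in E_d], tau)
        zeta.append(dict(zip(E_d, probs)))
    theta = [normalize_with_prior(row, alpha) for row in C_TE]  # Eq. (9)
    phi = [normalize_with_prior(row, beta) for row in C_WT]     # Eq. (10)
    return zeta, theta, phi


# Toy counts: one document, two entities, two topics, three word types.
C_ED = [[2, 1]]                # C_ED[d][j]
C_TE = [[1, 1], [0, 1]]        # C_TE[j][k]
C_WT = [[1, 0, 0], [0, 1, 1]]  # C_WT[k][w]
zeta, theta, phi = estimate_parameters(C_ED, C_TE, C_WT, doc_entities=[[0, 1]],
                                       alpha=0.5, beta=0.1, tau=1.0)
print(theta[1])  # [0.25, 0.75]
```

Because the prior is added to every count, no topic, word or entity receives probability zero, even if it was never sampled.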

4.3. Regularization framework for topic models

Resources on the World Wide Web (WWW) and in particular text documents are not only getting richer in content, but they also become extensively interconnected with users and other types of objects. This evolution of the Web leads to a network of data where, in addition to textual information available in documents, we often gain access to the associated network-like structure in the data. Bibliographic data and social networks are such examples where we have both textual documents and a network of multi-typed objects.

Although topic models have proven their utility in document analysis, they usually consider only the textual information and ignore the network structure present in the data. On the other hand, the interactions between objects in the network play an important role in revealing the rich semantics of the network and thus the document. Topic modeling based on network structure (regularized topic modeling) has been shown to be effective in extracting topics and discovering topical communities [14,23,24]. The basic idea is to combine topic modeling and social network analysis, and leverage the power of both the topic models and the discrete regularization, which increases the likelihood of discovering quality topics.

In the following section, we propose an ontology-based regularization framework that integrates the network structure of the entities in the documents with the topic models.

4.4. Regularization-Based Entity Topic Model

Our approach to regularization, which we call the Regularization-Based Entity Topic Model (RETM), combines our EntLDA model with the semantic graph structure of the entities occurring in the documents. The key idea behind this method is that entities in the ontology that are semantically closely related to each other are categorized under the same or similar topics. Thus, we leverage the information in the individual documents, including the entities mentioned in the document text, and integrate it with the graph structure of the ontology by regularizing the topic model based on the entity network. In particular, entities appearing in a document that are semantically related to each other in the ontology should belong to similar topics.

Algorithm 2:

Parameter estimation

Entity network: An entity network associated with a collection of documents D is a graph $G = ⟨ V, E ⟩$, where V is a set of entities occurring in the corpus and E is a set of ontology relations (properties). Each entity $e_{u}$ is considered as a node in the graph. Given two entities $e_{u}$ and $e_{v}$, an edge $⟨ e_{u}, e_{v} ⟩$ between $e_{u}$ and $e_{v}$ exists if both co-occur in the same document and there is a relation p in the ontology that connects them. Even though the edges in the ontology are directed, in this paper we only consider the undirected case. Thus, we define the regularized data likelihood of EntLDA as follows:

$$O_{ξ} (D, G) = - (1 - ξ) L (D) + ξ R (D, G), \tag{11}$$

where $L (D)$ is the log likelihood of the collection D being generated by the EntLDA topic model, $R (D, G)$ is a harmonic regularizer defined on the entity network G, and ξ is the factor that controls the balance between the two terms. The harmonic regularizer is defined as:

$$R (D, G) = \frac{λ}{2} \sum_{⟨ e_{u}, e_{v} ⟩ \in E} w (e_{u}, e_{v}) \sum_{k = 1}^{K} {(p (z_{k} | e_{u}) - p (z_{k} | e_{v}))}^{2}, \tag{12}$$

where $p (z_{k} | e_{i})$ denotes the probability that an entity $e_{i}$ belongs to a topic $z_{k}$, and $w (e_{u}, e_{v})$ is the weight of the edge $⟨ e_{u}, e_{v} ⟩$. We define $w (e_{u}, e_{v})$ as the semantic relatedness between $e_{u}$ and $e_{v}$, for which we adopt the Wikipedia Link-based Measure (WLM) introduced in [38]. Given two DBpedia entities $e_{u}$ and $e_{v}$, the semantic relatedness between them is defined as:

$$w (e_{u}, e_{v}) = 1 - \frac{\log (\max (| E_{u} |, | E_{v} |)) - \log (| E_{u} \cap E_{v} |)}{\log (| Y |) - \log (\min (| E_{u} |, | E_{v} |))}, \tag{13}$$

where $E_{u}$ and $E_{v}$ are the sets of DBpedia entities that link to $e_{u}$ and $e_{v}$, respectively, and Y is the set of all entities in DBpedia.
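To make Eqs. (12) and (13) concrete, the following sketch computes the edge weight from sets of incoming links and the regularizer from a table of entity-topic probabilities. Two choices in it are assumptions not stated in the text: pairs with no shared incoming links get weight 0, and negative relatedness values are clipped to 0. The function names and the dictionary-based edge representation are likewise hypothetical.

```python
import math


def wlm_relatedness(links_u, links_v, num_entities):
    """Eq. (13). links_u and links_v are the sets of entities that link to
    e_u and e_v; num_entities is |Y| and must exceed the smaller set size."""
    common = links_u & links_v
    if not common:
        return 0.0  # assumption: no shared in-links means no relatedness
    larger = max(len(links_u), len(links_v))
    smaller = min(len(links_u), len(links_v))
    distance = (math.log(larger) - math.log(len(common))) / (
        math.log(num_entities) - math.log(smaller))
    return max(0.0, 1.0 - distance)  # assumption: clip negative values to 0


def harmonic_regularizer(edges, theta, lam=1.0):
    """Eq. (12). edges maps an undirected pair (u, v) to w(e_u, e_v);
    theta[u][k] stands for p(z_k | e_u)."""
    total = 0.0
    for (u, v), weight in edges.items():
        total += weight * sum((pu - pv) ** 2 for pu, pv in zip(theta[u], theta[v]))
    return lam / 2.0 * total


weight = wlm_relatedness({1, 2, 3, 4}, {3, 4, 5}, num_entities=100)
theta = [[1.0, 0.0], [0.0, 1.0]]
penalty = harmonic_regularizer({(0, 1): 0.5}, theta)
print(penalty)  # 0.5
```

The penalty is zero when connected entities have identical topic distributions and grows with both the edge weight and the squared difference between the distributions.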

To minimize Eq. (11), we have to find a probabilistic topic model that fits the text collection D and also smooths the topic distributions between neighboring entities in the entity network. In the special case $ξ = 0$, the objective function reduces to the negative log-likelihood of EntLDA with no regularization term. For the general case $ξ > 0$, however, there is no closed-form solution for the complete likelihood function [24]. Thus, we use a two-step algorithm to learn the parameters in Eq. (11). In the first step, we train the model parameters (ζ, θ, ϕ) using the objective function $O_{1} (D, G) = - L (D)$ by the Gibbs sampling algorithm. We set the Dirichlet prior for each entity $e_{i}$ as: $\begin{matrix} α_{e_{i} k} = (1 - ξ) α + ξ \frac{K}{| G_{e_{i}} |} \sum_{⟨ e_{i}, e_{j} ⟩ \in G} θ_{e_{j} k}, \end{matrix}$ where $| G_{e_{i}} |$ is the number of neighbors of entity $e_{i}$ in the entity network G. In the second step, we fix ϕ and ζ and re-estimate θ to minimize $O_{ξ}$ by running an iterative process that updates θ for each entity $e_{u}$ as: $\begin{matrix} (14) & p_{t + 1}^{(n + 1)} (z_{k} | e_{u}) = (1 - γ) p_{t + 1}^{(n)} (z_{k} | e_{u}) + γ \frac{\sum_{⟨ e_{u}, e_{v} ⟩ \in G} w (e_{u}, e_{v}) p_{t + 1}^{(n)} (z_{k} | e_{v})}{\sum_{⟨ e_{u}, e_{v} ⟩ \in G} w (e_{u}, e_{v})}, \end{matrix}$ where $θ_{u k}^{(n + 1)} = p_{t + 1}^{(n + 1)} (z_{k} | e_{u})$ and γ is a coefficient that smooths the topic distribution. The same learning scheme has been used previously in [14,24,36]. The fitting procedure is shown in Algorithm 2.
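The second step can be sketched in a few lines of Python. The sketch below applies the update of Eq. (14) synchronously to every entity, so that each new distribution is a convex combination of the entity's current distribution and the weighted average of its neighbors' current distributions. The function name, the dictionary layout, the fixed number of iterations, and the choice to leave isolated entities unchanged are assumptions for illustration; the stopping criterion of Algorithm 2 may differ.

```python
def smooth_topics(theta, edges, gamma=0.9, n_iter=10):
    """Iterate Eq. (14).

    theta: dict mapping each entity to its list of K topic probabilities.
    edges: dict mapping an undirected pair (u, v) to the weight w(e_u, e_v).
    """
    # Build a symmetric neighbor list because the network is undirected.
    neighbors = {entity: [] for entity in theta}
    for (u, v), weight in edges.items():
        neighbors[u].append((v, weight))
        neighbors[v].append((u, weight))

    for _ in range(n_iter):
        new_theta = {}
        for u, p_u in theta.items():
            total_weight = sum(weight for _, weight in neighbors[u])
            if total_weight == 0:
                # Isolated entity: keep its distribution (an assumption).
                new_theta[u] = list(p_u)
                continue
            new_theta[u] = [
                (1 - gamma) * p_u[k]
                + gamma * sum(weight * theta[v][k] for v, weight in neighbors[u]) / total_weight
                for k in range(len(p_u))
            ]
        # All entities are updated from the values of the previous iteration.
        theta = new_theta
    return theta


# Two hypothetical entities that start with opposite topic assignments.
start = {"e1": [1.0, 0.0], "e2": [0.0, 1.0]}
smoothed = smooth_topics(start, {("e1", "e2"): 1.0}, gamma=0.5, n_iter=1)
print(smoothed)
```

Because each update is a convex combination of probability distributions, the rows of θ remain normalized without an explicit renormalization step.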

5. Experiments and evaluations

In this section, we describe the evaluation of the proposed Regularization Based Entity Topic Model (RETM) in two experiments using two different datasets. In Experiment 1, we compare the final result with three baseline models: LDA [7], EntLDA without regularization framework and GK-LDA [12]. For Experiment 2, we select the most coherent method from Experiment 1 and perform our experiment on a larger dataset.

Among the selected baselines, LDA is the basic topic model for learning topics from the corpus. EntLDA without regularization is the proposed model excluding the regularization term (i.e., $ξ = 0$ in Eq. (11)). GK-LDA is a model that uses word-level lexical knowledge (i.e., synonyms and antonyms) from dictionaries to improve topic coherence. It therefore constrains the words that appear under each topic according to the lexical relations between them. GK-LDA is the most recent work and the closest to our method in terms of leveraging prior knowledge in the model to improve topic coherence, which is why we selected it for our experiments.

5.1. Data sets

We evaluated the proposed approach on two datasets of Reuters³ news articles. The articles were randomly selected for both datasets. The first dataset is a collection of $D = 1243$ news articles spanning six primary categories: Business, Health, Politics, Science, Sports, and Technology. This dataset consists of 239,009 words with an initial vocabulary size of 24,695. We used the DBpedia ontology as our background knowledge and extracted the 5887 entities (named entities) mentioned in the corpus for use in the experiments.

³ http://www.reuters.com/.

The second dataset is a collection of $D = 7105$ news articles. In addition to the categories included in the first dataset, the second dataset also includes the categories Art, Economy, Entertainment, and Global Marketing (ten categories in total). This dataset has 1,539,950 words and an initial vocabulary of size 53,846. The background knowledge extracted from the DBpedia ontology for this dataset consists of 17,355 entities (named entities).

5.2. Experimental setup

Experiment 1. In pre-processing the first dataset, we removed punctuation, stopwords, numbers, and words occurring fewer than 10 times in the corpus. This yielded a vocabulary of size $W = 5226$. For GK-LDA, we followed the data preparation method explained in [12] by running the Stanford POS Tagger⁴ and identifying nouns and adjectives. We then utilized WordNet [25] to produce LR-sets (lexical relation sets). We used the GK-LDA implementation from the author’s website.

⁴ http://nlp.stanford.edu.

We used the Mallet toolkit⁵ to implement the LDA model. The number of topics was empirically set to $K \in \{15, 20, 25, 30, 35\}$, and the hyperparameters α, β and τ were set to $α = 0.1$, $β = 0.1$ and $τ = 50 / K$, respectively. Additionally, we empirically set both the coefficient ξ, which controls the influence of the background knowledge, and the coefficient γ, which controls topic smoothing, to 0.9 because these values produced better results. The parameter analysis is described further in Section 5.3.2. For all the models, we ran the Gibbs sampling algorithm for 500 iterations and computed the posterior inference after the last sampling iteration.

⁵ http://mallet.cs.umass.edu/.

Experiment 2. For the second experiment, we pre-processed the second dataset with the same procedure as in Experiment 1. The only difference is that we compared RETM against the LDA model alone (the best baseline in Experiment 1) on this larger dataset.

5.3. Experimental results

We evaluated our RETM model (EntLDA with regularization) by comparing it to the selected baseline models, using the topic coherence metric to assess the quality of the topics. Topic models are often evaluated using the perplexity measure [7] on held-out test data. However, Chang et al. [9] reported that human judgments can sometimes be contrary to the perplexity measure. Newman et al. [29] also indicated that perplexity has limitations and may not reflect the semantic coherence of the topics learned by the model. Additionally, perplexity only measures how well a model fits the data, which is different from our goal of finding coherent topics.

5.3.1. Quantitative analysis

For the quantitative evaluation, we applied a metric, namely topic coherence score, proposed by [26] for measuring the quality of topics. Arguably, this has become the most commonly used topic coherence evaluation method. Given a topic z and its top T words $V^{(z)} = (v_{1}^{(z)}, \dots, v_{T}^{(z)})$ ordered by $P (w | z)$ , the coherence score is defined as: $\begin{matrix} (15) & C (z; V^{(z)}) = \sum_{t = 2}^{T} \sum_{l = 1}^{t - 1} log \frac{D (v_{t}^{(z)}, v_{l}^{(z)}) + 1}{D (v_{l}^{(z)})}, \end{matrix}$ where $D (v)$ is the document frequency of word v and $D (v, v^{'})$ is the number of documents in which words v and $v^{'}$ co-occurred. It has been demonstrated that the coherence score is highly consistent with human-judged topic coherence [26]. Higher coherence scores indicate higher quality of topics.
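A minimal Python sketch of Eq. (15) is given below. It represents each document as a set of tokens and counts document and co-document frequencies directly; the function name and the input layout are assumptions made for this example, and the toy documents are hypothetical rather than drawn from the datasets.

```python
import math


def topic_coherence(top_words, documents):
    """Eq. (15): coherence of one topic.

    top_words: the top T words of the topic, ordered by P(w | z).
    documents: an iterable of tokenized documents (iterables of words).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all of the given words.
        return sum(1 for doc in doc_sets if all(word in doc for word in words))

    score = 0.0
    for t in range(1, len(top_words)):
        for l in range(t):
            co_freq = doc_freq(top_words[t], top_words[l])
            # D(v_l) >= 1 is assumed, since top words come from the corpus itself.
            score += math.log((co_freq + 1) / doc_freq(top_words[l]))
    return score


# Hypothetical toy corpus of three documents.
docs = [["drug", "patients"], ["drug", "fda"], ["patients", "trial"]]
score = topic_coherence(["drug", "patients", "fda"], docs)
print(score)
```

In this toy corpus, the pairs (patients, drug) and (fda, drug) each contribute log(2/2) = 0, while the pair (fda, patients) never co-occurs and contributes log(1/2), so the score is negative. Scores closer to zero therefore indicate top words that co-occur more often.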

Experiment 1. We performed several experiments with various values of ξ ( $0 < ξ ⩽ 1$ ) for different numbers of topics $K \in \{15, 20, 25, 30, 35\}$, and for the majority of experiments, $ξ ⩾ 0.8$ consistently produced better results. Thus, we empirically set $ξ = 0.9$ for all of the experiments.

Table 2 shows the results for two numbers of topics, $K \in \{20, 25\}$, where the number of top words T ranges from 5 to 20. RETM receives the highest coherence score in every setting except $K = 25$ with $T = 5$, where GK-LDA scores slightly higher, which suggests that it outperforms the other models overall.

Table 2
Topic coherence on top T words (Experiment 1). A higher coherence score means more coherent topics

Topics	Model	$T = 5$	$T = 10$	$T = 15$	$T = 20$	ξ
$K = 20$	LDA	$-236.966$	$-1187.90$	$-3039.50$	$-6018.90$	–
$K = 20$	GK-LDA	$-266.103$	$-1304.60$	$-3181.30$	$-6102.20$	–
$K = 20$	EntLDA	$-264.204$	$-1239.10$	$-3072.70$	$-5999.10$	0
$K = 20$	RETM	$-220.659$	$-1162.70$	$-2919.70$	$-5663.10$	0.9
$K = 25$	LDA	$-385.643$	$-1930.50$	$-4795.30$	$-9361.30$	–
$K = 25$	GK-LDA	$-311.517$	$-1697.50$	$-3916.10$	$-7785.70$	–
$K = 25$	EntLDA	$-392.535$	$-1947.90$	$-4894.0$	$-9123.20$	0
$K = 25$	RETM	$-318.669$	$-1593.50$	$-3859.0$	$-7444.90$	0.9

Table 3 reports the topic coherence averaged over the top T words (with T ranging from 5 to 20) for all models and different numbers of topics. RETM shows a significant improvement (p-value < 0.01 by t-test) over the LDA, GK-LDA, and EntLDA (without regularization) models.

Table 3

Average topic coherence on top T words for various numbers of topics (Experiment 1)

Model	$K = 15$	$K = 20$	$K = 25$	$K = 30$	$K = 35$
LDA	$-2019.096$	$-2620.816$	$-3353.964$	$-4118.186$	$-4884.294$
GK-LDA	$-2091.456$	$-2713.551$	$-3427.704$	$-4301.466$	$-5009.519$
EntLDA	$-2002.758$	$-2643.776$	$-3331.678$	$-4121.085$	$-4839.843$
RETM	$-1870.453$	$-2491.540$	$-3304.017$	$-4040.859$	$-4782.272$
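The relation between Tables 2 and 3 can be checked with a short calculation. The sketch below assumes that each entry of Table 3 is the arithmetic mean of the four corresponding entries of Table 2 (T = 5, 10, 15, 20); the $K = 20$ values reported above are consistent with this reading to within rounding. The variable names are chosen for this example only.

```python
# K = 20 rows of Table 2: coherence at T = 5, 10, 15, 20.
table2_k20 = {
    "LDA": [-236.966, -1187.90, -3039.50, -6018.90],
    "GK-LDA": [-266.103, -1304.60, -3181.30, -6102.20],
    "EntLDA": [-264.204, -1239.10, -3072.70, -5999.10],
    "RETM": [-220.659, -1162.70, -2919.70, -5663.10],
}

# K = 20 column of Table 3.
table3_k20 = {
    "LDA": -2620.816,
    "GK-LDA": -2713.551,
    "EntLDA": -2643.776,
    "RETM": -2491.540,
}

# Mean over the four values of T for each model.
averages = {model: sum(values) / len(values) for model, values in table2_k20.items()}

for model, mean in averages.items():
    print(f"{model}: mean = {mean:.3f}, Table 3 = {table3_k20[model]:.3f}")

# Higher (less negative) coherence is better.
best_model = max(averages, key=averages.get)
print("most coherent at K = 20:", best_model)
```

Since coherence grows in magnitude with T, the average is dominated by the T = 20 column, so differences at small T have little influence on the values in Table 3.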

Figure 5 shows that our RETM model consistently achieves a higher topic coherence score than the baselines. Among the baseline models, LDA works better than GK-LDA and EntLDA without regularization (i.e., $ξ = 0$), which underscores the impact of background knowledge, particularly at the concept level, in the topic model. The reason that non-regularized EntLDA did not outperform LDA is that we added all the entities corresponding to entity mentions in document d to $E_{d}$ without performing any explicit entity disambiguation. Therefore, $E_{d}$ might contain multiple entities for a single entity mention (i.e., ambiguous entities), which adds noise to the topic modeling process. More interestingly, GK-LDA did not work well in our experiments, which might be because the nature of our corpus is quite different from that of the corpus used in [12].

Fig. 5.

Experiment 1: average topic coherence for all models (a higher coherence score means more coherent topics).

Experiment 2. According to the results of Experiment 1, LDA is the best of the baseline models; therefore, we evaluated our model by comparing it to LDA on a larger dataset (i.e., the second dataset). As Table 4 shows, RETM performs better than LDA for different numbers of topics and top words. The overall comparison of the two models in Table 5 and Fig. 6 also confirms that the proposed model performs better on the larger dataset.

Table 4

Topic coherence on top T words (Experiment 2)

Topics	Model	$T = 5$	$T = 15$	ξ
$K = 10$	LDA	$-130.996$	$-1475.867$	–
$K = 10$	RETM	$-109.370$	$-1407.189$	0.9
$K = 20$	LDA	$-268.448$	$-2906.036$	–
$K = 20$	RETM	$-255.200$	$-2823.233$	0.9
$K = 30$	LDA	$-383.676$	$-4802.958$	–
$K = 30$	RETM	$-358.886$	$-4368.152$	0.9
$K = 40$	LDA	$-542.455$	$-6529.937$	–
$K = 40$	RETM	$-531.716$	$-6425.509$	0.9
$K = 50$	LDA	$-699.591$	$-8689.024$	–
$K = 50$	RETM	$-697.091$	$-8272.356$	0.9

Table 5

Average topic coherence on top T words for various numbers of topics (Experiment 2)

Model	$K = 10$	$K = 20$	$K = 30$	$K = 40$	$K = 50$
LDA	$-1222.341$	$-2515.419$	$-4062.686$	$-5540.235$	$-7188.757$
RETM	$-1160.160$	$-2488.948$	$-3918.133$	$-5422.519$	$-6958.338$

Fig. 6.

Experiment 2: average topic coherence between RETM and LDA (a higher coherence score means more coherent topics).

Fig. 7.

The effect of varying parameter ξ in the regularization framework for $K = 20$ .

5.3.2. Parameter analysis

The regularization parameter ξ has a substantial effect on the behavior of the RETM model.

Figure 7 shows how the performance of RETM varies with changing the regularization parameter ξ. As we mentioned in Section 4.4, parameter ξ controls the balance between the data likelihood and the smoothness of the topic distribution over the entity network. When $ξ = 0$ , no background knowledge is integrated with the topic model. When $ξ > 0$ , the regularization framework considers the topic consistency and the semantic relatedness between the entities in the documents which, accordingly, enhances the coherence of topics. As we increase the value of ξ, we rely more on the integration of the background knowledge into the topic model and receive better topic coherence scores. We also set $ξ = 1$ to see whether we achieve better performance if we solely rely on the entity network. But as Fig. 7 illustrates, the topic coherence decreases significantly. Thus, we empirically set $ξ = 0.9$ in all of the experiments.

5.3.3. Qualitative analysis

In this section, we describe some qualitative results to give an intuitive perspective on the performance of the different models. Although many of the topics improve significantly, we show only a few examples. We focus on sample topics from LDA and our RETM model, since the results of the other baselines were inferior. We selected a sample of topics, with the top-10 words for each topic, from an experiment with $K = 20$ topics. Table 6 presents the top words of each topic for LDA and our proposed model. We tried to find the best possible topic matches between the two models. Although both LDA and RETM produce top words for each topic, the topics under RETM are qualitatively more coherent than those under LDA. For instance, as Table 6 shows, the words “reuters”, “reporting” and “editing”, which LDA places among the top words of the topic labeled “Healthcare study”, are unlikely to be related to that topic, and they do not appear among the top words of the same topic learned by RETM. For each topic, we italicized and marked in red the wrong topical words. Topic labeling and the judgment of topical words are subjective tasks, so we do not expect everyone to agree with us; nonetheless, we did our utmost to reach a consensus between two human assessors.

Table 6
Example topics from two domains, along with top-10 words under LDA and RETM models. The first row presents the manually generated labels. Italicized words indicate the words that are not likely to be relevant to the topics

Healthcare study		Internet companies		Sports		National security		U.S. politics	
LDA	RETM	LDA	RETM	LDA	RETM	LDA	RETM	LDA	RETM
drug	drug	company	company	season	season	china	government	obama	obama
patients	cancer	million	technology	game	team	data	security	house	house
drugs	patients	billion	mobile	win	win	security	officials	state	president
treatment	drugs	mobile	google	play	game	government	agency	president	state
fda	treatment	google	apple	league	play	information	data	washington	washington
reuters	fda	market	market	home	league	agency	information	law	republican
percent	study	business	internet	back	editing	states	national	court	government
trial	virus	apple	based	team	cup	national	companies	republican	administration
reporting	cases	reuters	online	club	final	defense	intelligence	healthcare	law
editing	people	corp	business	match	won	chinese	nsa	administration	healthcare

6. Conclusion and future work

In this paper, we proposed an entity topic model, EntLDA, and RETM, which is EntLDA with a regularization framework. Both models integrate probabilistic topic models with the knowledge graph of an ontology in order to improve the discovery of coherent topics. The proposed models utilize the semantic graph of the ontology, including the entities and the relations among them, and integrate this knowledge with the topic model to produce more coherent topics. We demonstrated the effectiveness of this approach in two experiments. In Experiment 1, we evaluated our model against the baseline models (LDA, GK-LDA, and EntLDA) using a collection of 1243 Reuters⁶ news articles categorized into six primary topics. Our model achieved a higher coherence score than all of the baseline models. For Experiment 2, we selected the LDA model (the best baseline in Experiment 1) for comparison on the larger dataset (a collection of 7105 news articles categorized into ten primary topics). The results of Experiment 2 also show that our model significantly outperforms the baseline model on the larger dataset.

⁶ http://www.reuters.com/.

There are many interesting future directions of this work. We are particularly interested in investigating the application of the proposed model in “automatic topic labeling” [1] and text categorization tasks. Additionally, the models can be potentially used for entity classification [27], entity summarization [31,32], and profiling RDF datasets [33] tasks.

References

1. Allahyari and Kochut, Automatic topic labeling using ontology-based topic models, in: 14th International Conference on Machine Learning and Applications (ICMLA), IEEE, 2015.

2. Allahyari and Kochut, Discovering coherent topics with entity topic models, in: 2016 IEEE/WIC/ACM International Conference on Web Intelligence (WI), IEEE, 2016, pp. 26–33. doi:10.1109/WI.2016.0015.

3. Andrzejewski, Zhu and Craven, Incorporating domain knowledge into topic modeling via Dirichlet forest priors, in: Proceedings of the 26th Annual International Conference on Machine Learning, ACM, 2009, pp. 25–32.

4. Andrzejewski, Zhu, Craven and Recht, A framework for incorporating general domain knowledge into latent Dirichlet allocation using first-order logic, in: IJCAI Proceedings – International Joint Conference on Artificial Intelligence, Vol. 2, 2011, pp. 1171–1177.

5. Bizer, Heath and Berners-Lee, Linked data-the story so far, in: Semantic Services, Interoperability and Web Applications: Emerging Concepts, 2009, pp. 205–227. doi:10.4018/978-1-60960-593-3.ch008.

6. Bizer, Lehmann, Kobilarov, Auer, Becker, Cyganiak and Hellmann, DBpedia – A crystallization point for the Web of Data, Web Semantics: Science, Services and Agents on the World Wide Web 7(3) (2009), 154–165. doi:10.1016/j.websem.2009.07.002.

7. D.M. Blei, A.Y. Ng and M.I. Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research 3 (2003), 993–1022.

8. J.L. Boyd-Graber, D.M. Blei and Zhu, A topic model for word sense disambiguation, in: EMNLP-CoNLL, 2007, pp. 1024–1033.

9. Chang, Gerrish, Wang, J.L. Boyd-Graber and D.M. Blei, Reading tea leaves: How humans interpret topic models, in: Advances in Neural Information Processing Systems, 2009, pp. 288–296.

10. Chemudugunta, Holloway, Smyth and Steyvers, Modeling documents by combining semantic concepts with unsupervised statistical learning, in: The Semantic Web – ISWC 2008, Springer, 2008, pp. 229–244. doi:10.1007/978-3-540-88564-1_15.

11. Chemudugunta, Smyth and Steyvers, Combining concept hierarchies and statistical topic models, in: Proceedings of the 17th ACM Conference on Information and Knowledge Management, ACM, 2008, pp. 1469–1470.

12. Chen, Mukherjee, Liu, Hsu, Castellanos and Ghosh, Discovering coherent topics using general knowledge, in: Proceedings of the 22nd ACM International Conference on Conference on Information & Knowledge Management, ACM, 2013, pp. 209–218.

13. Chen, Mukherjee, Liu, Hsu, Castellanos and Ghosh, Leveraging multi-domain prior knowledge in topic models, in: Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, AAAI Press, 2013, pp. 2071–2077.

14. Deng, Han, Zhao, Yu and C.X. Lin, Probabilistic topic models with biased propagation on heterogeneous information networks, in: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2011, pp. 1271–1279. doi:10.1145/2020408.2020600.

15. T.L. Griffiths and Steyvers, Finding scientific topics, Proceedings of the National Academy of Sciences of the United States of America 101(Suppl. 1) (2004), 5228–5235. doi:10.1073/pnas.0307752101.

16. Han and Sun, An entity–topic model for entity linking, in: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Association for Computational Linguistics, 2012, pp. 105–115.

17. Hingmire and Chakraborti, Topic labeled text classification: A weakly supervised approach, in: Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, ACM, 2014, pp. 385–394.

18. Hofmann et al., Probabilistic latent semantic analysis, in: Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, 1999, pp. 289–296.

19. Hu, Boyd-Graber, Satinoff and Smith, Interactive topic modeling, Machine Learning 95(3) (2014), 423–469. doi:10.1007/s10994-013-5413-0.

20. Jagarlamudi, Daumé III and Udupa, Incorporating lexical priors into topic models, in: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, 2012, pp. 204–213.

21. Li, Cardie and Li, TopicSpam: A topic-model based approach for spam detection, in: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2013, pp. 217–221.

22. Lu, Ott, Cardie and B.K. Tsou, Multi-aspect sentiment analysis with topic models, in: 2011 IEEE 11th International Conference on Data Mining Workshops (ICDMW), IEEE, 2011, pp. 81–88. doi:10.1109/ICDMW.2011.125.

23. Ma, Zhou, Liu, M.R. Lyu and King, Recommender systems with social regularization, in: Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, ACM, 2011, pp. 287–296. doi:10.1145/1935826.1935877.

24. Mei, Cai, Zhang and Zhai, Topic modeling with network regularization, in: Proceedings of the 17th International Conference on World Wide Web, ACM, 2008, pp. 101–110. doi:10.1145/1367497.1367512.

25. G.A. Miller, WordNet: A lexical database for English, Communications of the ACM 38(11) (1995), 39–41. doi:10.1145/219717.219748.

26. Mimno, H.M. Wallach, Talley, Leenders and McCallum, Optimizing semantic coherence in topic models, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2011, pp. 262–272.

27. Nadeau and Sekine, A survey of named entity recognition and classification, Lingvisticae Investigationes 30(1) (2007), 3–26. doi:10.1075/li.30.1.03nad.

28. Newman, Chemudugunta and Smyth, Statistical entity–topic models, in: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2006, pp. 680–686. doi:10.1145/1150402.1150487.

29. Newman, Noh, Talley, Karimi and Baldwin, Evaluating topic models for digital libraries, in: Proceedings of the 10th Annual Joint Conference on Digital Libraries, ACM, 2010, pp. 215–224. doi:10.1145/1816123.1816156.

30. Petterson, Buntine, S.M. Narayanamurthy, T.S. Caetano and A.J. Smola, Word features for latent Dirichlet allocation, in: Advances in Neural Information Processing Systems, 2010, pp. 1921–1929.

31. Pouriyeh, Allahyari, Kochut, Cheng and H.R. Arabnia, ES-LDA: Entity summarization using knowledge-based topic modeling, in: Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2017, pp. 316–325.

32. Pouriyeh, Allahyari, Kochut, Cheng and H.R. Arabnia, Combining word embedding and knowledge-based topic modeling for entity summarization, in: 2018 IEEE 12th International Conference on Semantic Computing (ICSC), IEEE, 2018, pp. 252–255. doi:10.1109/ICSC.2018.00044.

33. Pouriyeh, Allahyari, Cheng, H.R. Arabnia, Kochut and Atzori, R-LDA: Profiling RDF datasets using knowledge-based topic modeling, in: 2019 IEEE 13th International Conference on Semantic Computing (ICSC), IEEE, 2019, pp. 146–149. doi:10.1109/ICOSC.2019.8665510.

34. C.P. Robert and Casella, Monte Carlo Statistical Methods, Vol. 319, Citeseer, 2004. doi:10.1007/978-1-4757-4145-2.

35. Rosen-Zvi, Griffiths, Steyvers and Smyth, The author–topic model for authors and documents, in: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, AUAI Press, 2004, pp. 487–494.

36. Tang, Leung, Luo, Chen and Gong, Towards ontology learning from folksonomies, in: IJCAI, Vol. 9, 2009, pp. 2089–2094.

37. Wei and W.B. Croft, LDA-based document models for ad-hoc retrieval, in: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2006, pp. 178–185.

38. Witten and Milne, An effective, low-cost measure of semantic relatedness obtained from Wikipedia links, in: Proceeding of AAAI Workshop on Wikipedia and Artificial Intelligence: An Evolving Synergy, AAAI Press, 2008, pp. 25–30.

39. Yao, Haghighi, Riedel and McCallum, Structured relation discovery using generative models, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2011, pp. 1456–1466.

40. Zhu, Ghahramani, Lafferty et al., Semi-supervised learning using Gaussian fields and harmonic functions, in: ICML, Vol. 3, 2003, pp. 912–919.
