Abstract
Maintaining large collection of documents is an important problem in many areas of science and industry. Different analysis can be performed on large document collection with ease only if a short or reduced description can be obtained. Topic modeling offers a promising solution for this. Topic modeling is a method that learns about hidden themes from a large set of unorganized documents. Different approaches and alternatives are available for finding topics, such as Latent Dirichlet Allocation (LDA), neural networks, Latent Semantic Analysis (LSA), probabilistic LSA (pLSA), probabilistic LDA (pLDA). In topic models the topics inferred are based only on observing the term occurrence. However, the terms may not be semantically related in a manner that is relevant to the topic. Understanding the semantics can yield improved topics for representing the documents. The objective of this paper is to develop a semantically oriented probabilistic model based approach for generating topic representation from the document collection. From the modified topic model, we generate 2 matrices- a document-topic and a term-topic matrix. The reduced document-term matrix derived from these two matrices has 85% similarity with the original document-term matrix i.e. we get 85% similarity between the original document collection and the documents reconstructed from the above two matrices. Also, a classifier when applied to the document-topic matrix appended with the class label, shows an 80% improvement in F-measure score. The paper also uses the perplexity metric to find out the number of topics for a test set.
Introduction
As electronic information which is unorganized and heterogeneous, is growing exponentially on a daily basis one major challenge is to provide efficient access and retrieval mechanisms for this data. Modeling large text corpora unlike small statements like tweets, blog entries, messages and e-mails is necessary. Finding a short description for the collection, can make classification, organizing, summarization, and similarity analysis easy. Many IR techniques like Boolean Model [16], Vector Space Model (VSM) [8] and probabilistic models [4] have gained attention for dealing with these problems [8]. The basic input to all these models is a term- document matrix. The automatic classification system uses pure Bag of Words (BoW) approach [11] for finding relevant features. The basic BoW model does not consider position or order of text and the semantics which helps to uncover more about inter- or intra- document statistical structure. Semantics (meaning) of the terms in the document are not properly identified and represented. As an Example 1, when the BoW model is applied to a statement A large can can hold water, the first term can is a noun and the second term can is a modal in English grammar, but it takes both can as a single term and term frequency is taken as two which is not correct. If tagging is done, the two terms can, can be distinguished and its frequency changes. Also since the English language is ambiguous, words have different meaning based on the context in which it is used (polysemy). For example 2, the word strike has different meaning based on context. Similarly there is synonymy where different words share same sense.
Classification [9] of large document corpus is becoming difficult because of the increase in vocabulary size which may in turn lead to increase in sparsity as all words may not be present in all the documents. BoW has the assumption of ex- changeability feature, i.e. the words in the set can be placed in any order [2]. To address the shortcomings of BoW, dimensionality reduction techniques are also introduced.
Topic modeling is a machine learning tool that can provide a solution to the problem where it automatically group together related terms to a topic [3]. It discover patterns of terms used and connects the documents that exhibit similar patterns. Topic models have emerged as a powerful new technique for finding useful structure in an unstructured collection. If we consider an example data set of various question paper collection, the underlying topics indicate the areas from where all the questions are collected.
Traditional topic models treat terms as independent entities without having predefined information about their word meaning. Topics are inferred by observing only the word occurrence. These terms may not be semantically related. When the semantics are given priority, the topics generated using topic modeling are more meaningful according to the document context than those topics generated by looking at only term occurrences. This can be further utilized for improving the clustering and classification of document corpora. Although topic modeling is an existing approach, there is scope to improve its result by providing semantic information associated with it and between the terms [2].
This paper focuses on improving the results of topic modeling, by introducing various Natural Language Processing (NLP) techniques in the early phase of the LDA model. Given a large set of n documents D ∈ having m terms, reduce m and replace it with more meaningful term- interdependent set of k topics such that k ⪡ m. While this transformation is done, semantics of the words that constitute a sentence and semantics between sentences is also given priority. LSA model is used to verify the results of LDA as well as to find if any interrelationship exist between topics [5]. The document-topic matrix from LDA is also used in place of document-term matrix to perform document classification using SVM [2].
The rest of the paper is divided as follows: Section 2 briefly discusses the related approaches in this field, Section 3 gives an in-depth description about our proposed methodology, Section 4 discuss our experimentation setup with results and Section 5 gives the concluding remarks.
Related work
To distinguish our work we will briefly discuss various related approaches like LSA, LDA in more details. In VSM, the documents are represented as a vectors where vectors that are closer in the vector space are semantically related and vectors that are far apart are semantically distant [8]. In the term frequency- inverse document frequency (tf-idf) scheme [13], documents of arbitrary length is reduced to fixed length list of frequencies. In this method, a vocabulary of terms from the entire corpus is taken and for each document in the corpus. The count of terms is taken according to the occurrence of the terms in each document. While tf-idf method has prominent features like identification of most descriptive term in the document and easy computation, the size of the vocabulary generated by the VSM is large as it takes all the terms from the collection and the matrix generated is sparse, because not all set of vocabulary terms are there in a document. LSA adds a decisive step to document processing [11]. In addition to recording which keywords are present in a document, the technique looks at the document collection as a whole, to see which other documents contain some of those same words.
LSA is a corpus- based method that does not use dictionaries, semantic networks, grammars, syntactic or morphological parsers, and its input is represented only by raw text divided in chunks [15]. Depending on the corpus size a chunk may be a sentence, an utterance in a chat, a paragraph or even a whole document. The method starts from the term-document matrix computed on the corpus segmented into chunks and then SVD applied in order to compute the most important singular values. It produces a representation to a latent semantic space, which uses only the most important (large) k singular values. The value for k depends on the corpus and task, and is usually between 100 and 600, a common choice being 300. This new space is used to compute similarities between different words and even whole documents, practically taking into account that terms that are co-occurring in similar contexts may be considered to be semantically related. LSA is also the most popular approach for dimensionality reduction. It maps documents represented in high dimension to a latent (hidden) lower dimension representation.
Probabilistic Latent Semantic Analysis (pLSA) is another widely used technique to model documents [13]. The main goal is to model co-occurrence data under a probabilistic framework to discover the underlying semantic structure of the data. A document d and a word w
n
are conditionally independent given an unobserved topic z:
The latent/hidden variables (represented by topics/concepts) are associated with the observed variables (represented by documents and terms, from the text domain). Similar to LSA, pLSA aims to factorize the sparse co- occurrence matrix in order to reduce its dimensionality. However, pLSA is usually viewed as a more sound method as it provides a probabilistic interpretation, whereas LSA achieves the factorization by only using mathematical basis (more precisely, LSA uses the singular value decomposition method). The collection is represented in the form of a term- document matrices. The goal of pLSA is to extract the topics and represent the documents as a mixture of topics. The documents are represented in terms of all topics, as it assumes documents as mixture model. The pLSA model is not a generative model and also, the number of parameters to be estimated grows linearly with number of training documents. This growth leads to overfitting problem. For a compact representation of the words that come under a topic, a sparse LSA [14] method was used. In this technique a projection matrix is found out from the inverse singular value matrix and the transpose of the term- dimension matrix. This approach selects only less number of terms to represent the hidden topics.
Latent Dirichlet Allocation (LDA), overcomes this problem by taking the topic mixture weights as k- parameter hidden random variables rather than individual set of parameters. LDA is a generative statistical model of a corpus. It is under the assumption that, each document is represented as a mixture of hidden topics, where each topic is stated over the terms in vocabulary list. LDA assumes the following generative process for each document w in a corpus D [2]. Given the parameters α and β, the joint distribution of a topic mixture θ, a set of N topics z, and a set of M terms w is given by:
Most topic models, such as Latent Dirichlet Allocation are unsupervised. Only the terms in the documents are modelled. The topics are inferred such that it maximizes the likelihood of the collection in supervised LDA [3]. Topic proportions for a document are drawn from a Dirichlet distribution [10]. Terms in the document are obtained by repeatedly choosing a topic assignment from topic proportions and in which a term corresponding to the topic is drawn. Two steps are done for the topic estimation and prediction, variational Expectation (E) step and Maximization (M) step. In variational E step, given the document response, the posterior distributions of latent variables (z) are calculated. In the M-step, the probability of a term to be under a topic is proportional to expected number of the times the term was assigned to that topic. Supervised LDA is described as full generative, replacing the normal linear model [7]. Given a collection, the posterior distribution of the latent variables given the observed documents determines a latent topical decomposition of the collection. There are two factors which determine the topic of the term: the topic distribution of the document and probability of a topic to emit this term. There limitation of it is if a term is not observed recurring enough in a corpus, then it is likely to be as- signed to the dominant topic within the document.
Again in a multinomial model [10], individual documents are represented as vector of counts, i.e., BoW representation which loses its semantics. By assuming a multinomial distribution, documents are assumed to be generated from repeatedly choosing terms from a distribution where an individual term is a binomial distribution, i.e., a fixed distribution.
The above mentioned models LSA, PLSA, sparse LSA and LDA use BoW model that do not consider the order of occurrence of terms in the document as well the semantics. The semantics of terms are needed for meaningful representation of topics. We need to consider the semantics as same terms may have different meanings in different context (term ‘bank’ may differ n different context), there are different terms with same meanings (‘Brave’ and ‘fearless’), some terms should come together to understand the context than individually (‘Romeo and Juliet’) and sometimes removing the pronoun (‘he’, ‘she’, ‘it’, e.t.c) as a stop words can reduce the importance of the noun corresponding to that pronoun in the topic (in sentence ‘Nazir was happy because he saw the show’, term ‘he’ will be removed as stopword but replacing ‘he’ with ‘Nazir’ can give importance to the term to come in the topic).
This paper adds depth to the exiting LDA method by adding semantic extraction of terms and then identifying topics from document collection and also later use this for classification of the corpus.
Proposed methodology
Classification of large document corpus is becoming difficult these days because of the increase, variation and variety of terms in the vocabulary list of these corpora. One major problem that accompanies this is providing an efficient access and retrieval mechanism. Having a semantics-based classification system with a high F-score is always the need of the hour. The traditional topic model (LDA) find topics from the document collection based on the occurrence of the terms, without considering the semantic aspects like synonym usage, word sense, word collocation, pronoun resolution, Named Entity Recognition e.t.c. When semantics is included, the topics generated using LDA can be more meaningful. Our proposed method does a pre-processing step so as to understand the semantics present in the corpus. We divide our work into three phases, phase 1 where various NLP techniques are used as a pre-processing step to topic models, phase 2 and phase 3 which deals with LDA and LSA respectively.
In the existing method of LDA number of topics are given as input. We set the number of topics as the number of eigen values obtained after eigen value decomposition. Our work consider two approaches for generating topics, one which use LDA to extract the topic and other method is to cluster the terms using LSA. We compare the topics generated with these two techniques, where a topic is represented by terms that are semantically related. Given the input corpus, each document in the corpus is taken for tokenization, synonym linking, tagging, Named Entity Recognition, lemmatization, Word Sense Disambiguation, word collocation and pronoun resolution. These steps are mentioned below. These terms are then given as input to LDA/ LSA for identifying the topics. Figure 1 gives an overview of the work presented in our paper.
Phase 1
The input corpus consist of n documents. For each document d in the corpus, the following steps of phase 1 are applied- tokenization, synonym linking, tagging, Named Entity Recognition, lemmatization, Word sense Disambiguation, word collocation and pronoun resolution. These steps are explained below. Tokenization: Each document in the corpus is initially tokenized to generate single units or tokens where each token is extracted when a white space is encountered. This form the initial vocabulary list for the corpus. Pronoun Resolution: Pronoun resolution is the process in which the pronouns are found and replaced with the noun term. An algorithm for anaphora resolution called Hobb’s algorithm is used and is elaborated in [11]. Sentence is parsed into a tree format and an optimal search process is done to find the antecedent. During the search process, the NP node upon which the search is terminated is marked as the antecedent. The pronoun node is replaced with this antecedent term. We use pronoun resolution as step 2 because it becomes important for Named Entity Recognition, tagging and stop word removal.
Word Collocation: A collocation is two or more terms that often go together. Collocation of terms is calculated by Pointwise Mutual Information (PMI). PMI is a measure of how often two terms x and y occur together compared with what we would expect if they were independent. n-grams from terms are obtained which is used to generate a new vocabulary list.
Synonym Link: The terms and their corresponding synonyms (if any exist within the vocabulary list) are linked using WordNet. This helps to identify if a term(s) means exactly or nearly the same term(s) in the vocabulary list. Example: ’bus stop’ has the synonym ’bus station’. As the terms and their synonym can be n- grams, we do this step aftercollocation. Named Entity Recognition (NER): It helps to further strengthen the tagging process by correctly identifying nouns like persons, location, date, time and organization from vocabulary list. Tagging: The process of categorizing terms according to their Parts Of Speech (POS) and assigning them duly is known as POS tagging. There are different taggers available; default tagger gives a default tag initialized to every term. Regular Expression tagger tags terms according to a user-defined regular expression. Unigram Tagger tags a single term with respect to that term. Bigram tagger tag terms according to the previous term. Trigram tagger tags term according to the previous two terms. Stanford PosTagger considers the sentence and tag according to the part of speech. Stanford PosTagger considers the sentence and tag according to the part of speech. Stanford PosTagger has high accuracy and this have been used for our experiment.
Stop Word Removal: From the above output the stop words are identified and then removed. It is always better to remove stop words after tagging, to correctly identify the significance of each term. Lemmatization: Lemmatization normally aims in eliminating inflectional endings, and to determine the lemma (base form) of a given term. Lemmas are defined in a dictionary for each term and retrieved according to the terms and its tag in the term list. These lemmas are then updated to the new vocabulary list.
In the above example, if we use stemming instead of lemmatization, the term enticing will be changed to entic, which is not a meaningful term. So we prefer using lemmas instead of stemming. Word Sense Disambiguation (WSD): This step is included to find if similar terms used in different sentences of the document collection are having the same meaning or are different as in example 2. If they have the same sense, the frequency of the corresponding term is incremented otherwise they are considered as two different terms. For this, the Lesk algorithm is used
1
.
In Table 2 each term has an information part which tells about the synonym id (same for terms that are synonyms), category of the term obtained from NER, tag of the word and word sense of the term obtained by WSD respectively. These information can distinguish the terms in the vocabulary list when taking the frequency of the term in each document for the document-term matrix obtained after phase 1.
Go through each term t in a document d of corpus. Pronoun resolved using Hobb’s algorithm. Find n-grams using PMI for collocation. Link the related terms based on synonym using WordNet. Find the named entity from the terms. Tag each term in document Lemmatize the tagged term according to the tag Find the word sense of term. Index all the terms in the vocabulary list.
Repeat 1 for all documents in Document set D
Create document-term matrix by calculating the count of each terms in document.
V, e= eig, number of topics k= |e|, Number of (Eigen Value> threshold)
Set α = 0.1, β = 0.001; LDA parameters.
Assign each term (w), in the matrix to a topic:
Go through each term w
j
in document vector(d ∈ D) For all topics(t
i
, i ∈ [0,k]) compute P(t
i
| d) and P(w
j
|t
i
) compute P(wj-1w
j
|t
i
)=P(wj-1|t
i
)*P(w
j
|t
i
) and P(wj-2wj-1w
j
|t
i
)=P(wj-2|t
i
) *P(wj-1|t
i
)*P(w
j
|t
i
) Reassign term w
j
with new topic with maximum(P(t
i
| d)* P(w
j
|t
i
)) Reassign bigrams and trigrams with new topic with maximum(P(t
i
| d)*P(wj-1w
j
|t
i
)) and max(P(t
i
| d)* P(wj-2wj-1w
j
|t
i
)) respectively
Repeat steps 6-7 till terms in topic do not change
In our approach, given the vocabulary list as output from phase 1, not only single term but also n-grams are taken from the documents. So, the topics inferred, which is described in terms of terms contains n-grams also, according to the topic likelihood. Latent Dirichlet Allocation (LDA) is a statistical model of a corpus, which is the simplest topic modeling technique. The basic concept is that each document is represented as a mixture of hidden topics, where each topic is stated over the terms in vocabulary list. In reality, only the documents are observable. The goal is to infer the underlying topic structure with these observed data. In LDA, it is assumed that there are k underlying topics according to which documents are created, and that each topic is represented as a multinomial distribution over |C| terms in the vocabulary. A document is generated as mixture of these themes and sampling terms from them. More precisely a document of m terms w = (w1, . . w m ) is generated by following process. First samples are taken from a Dirichlet (α1, . . α k ) distribution. Then, for each of the m terms, a topic t i ∈ (1, . . , k) is sampled from a mulitinomial distribution. Finally, each term w m is sampled, conditioned on the t i th topic, from the multinomial distribution p(w|t i ) [3]. The document-term matrix formed from the two matrices –document-topic and term-topic matrix(both derived from modified LDA) we find the similarity with the original document-term matrix.
Phase 3
In phase 3, LSA is applied on the document-term matrix obtained from phase 1. The document-term matrix is large in size and sparse. So, the document similarity cannot be understood properly. In Table 1, when the document similarity between the document D2 and D3 is taken by using cosine similarity it becomes 0. Actually, the two documents have some similarity because of the terms like flower and spring in them, but this similarity is not captured in the document- term matrix generation which uses frequency of the term to describe it. LSA reduce the matrix using SVD. Matrix is decomposed into three matrices term- dimension matrix, singular value matrix and document- dimension matrix. Every singular value shows the importance of dimensions, where singular values are in the decreasing order. The rank reduction of matrix is expected to combine the dimensions of terms with similar meanings.
After LSA, the sparseness is reduced by a numerical representation and also there occurs a latent relationship between the terms. Here, as D1 contains both spring and flower, we can also indirectly say that D2 and D3 are related. This is done through SVD which replaces zero with a small numerical value which is directly proportional to the number of times they have occurred together in other documents. Also, by applying LSA to the document-topic matrix obtained from LDA, we can cluster the related topics of the document collection.
Go through each term t in a document d of corpus. Pronoun resolved using Hobb’s algorithm. Find n-grams using PMI for collocation. Link the related terms based on synonym using WordNet. Find the named entity of the terms. Tag each term in the document. Lemmatize the tagged term according to the tag Find the word sense of terms. Index all the terms in the vocabulary list.
Repeat 1 for all documents in Document set D
Create document-term matrix (M) by calculating the count of each terms in document.
Decompose M = USV T using SVD.
Group terms in V matrix column wise based on the sign method [5].
Experimental result and analysis
In this section, we conduct four types of experiments on real world data sets like TREC- AP dataset [2], that contains 22057 documents and The Reuters 2 with 8000 documents. The experiments produce result in the form of topics with the probability of each topic to be latently present in a document. Each topic is also described with a set of terms associated with it.
Effect of phase 1 on LDA and LSA
Modified LSA and LDA (with phase 1 introduced) is tested on sample documents from the TREC AP data set with and without phase 1. The k (number of topics to be extracted) was taken as 3 based on the three highest eigen values, and number of terms used to represent each topic was set as 5. The output is represented as the terms which occurs in that topic. The result obtained is shown in Table 3 (without phase 1) and Table 4 (with lemmatization, one of the steps of phase 1). In Table 3 we can see the terms campaigning and campaign repeatedly occurring as a result of LDA whereas in Table 4 this has been replaced by a single term campaign. The vocabulary list is of size 14561 for our approach with phase 1 and 7584 for experiment without phase 1. The output obtained by LSA and LDA applying phase 1 is shown in Table 5 and Table 6 respectively. Comparing Table 3 and 6, we find that the terms generated after introducing phase 1 is more meaningful than without phase 1. We could get matching documents from the TREC AP data set using the sentences reconstructed with the terms belonging to corresponding derived topics. Only the terms with high probability in the topic are obtained. In our experiment, terms corresponding to each topic obtained are represented as- (term and its corresponding tag).
Figure 2 shows the terms representing each topic from the data set obtained using modified LDA. The terms are sized in a descending order based on their probability to occur in the topic. The result shows n-gram terms are also present in each corresponding topic. From the modified topic model, we generate 2 matrices- a document-topic and a term-topic matrix. The reduced document-term matrix derived from these two matrices has 85% similarity with the original document-term matrix i.e. we get 85% similarity between the original document collection and the documents reconstructed from the above two matrices. The similarity was checked by cosine similarity.
Perplexity
Topic modeling being unsupervised make evaluation difficult. It is also difficult to decide the number of topics that may be present in the collection. Perplexity is one measure for topic count identification. LDA is first done on training sample and then perplexity is computed on the test data. It is given by the formula:
Where M is the number of document, w d is word in document d and N d is the number of words in the document. A left- right method [6] can be used to find the perplexity over the words. It is a decreasing function of log likelihood of unseen documents D.
Lower the perplexity better the model. It also helps in fixing the number of topics. Per word perplexity is computed using the Markov chain algorithm.
Perplexity was computed for a test sample of 80 documents from TREC AP corpus and is given in Table 7. Reduced perplexity for k = 5, indicates the number of topics for the sample set.
The document-topic matrix obtained after modified LDA can be given to SVD to find inter-relationship between topics. The document-topic matrix is decomposed into document-dimension matrix (U), singular matrix (S) and topic-dimension matrix (V). Table 8 shows the sample V matrix obtained from sample dataset of TREC AP between 3 topics. The positive values in each row indicate positive relationship between topics and negative values indicate negative relationship betweentopics.
Classification using classifier
Given a document-topic matrix of training document collection of BBCSports dataset which has 2 categories (either BBCSport or not a BBCSport), we represent each document in terms of the topics as its features and associate with this the corresponding label of the document (Table 9).
Using this data classification [2] is done. Based on this result a test sample was used to predict class labels. Only a binary classification is done. The recall and F- measure is computed as shown in Table 9. There is an 80% improvement in prediction result with LDA + phase1 as compared to directly applying the text data or document topic matrix without phase 1 to a classifier (Table 10). Testing is done on 4560 documents with 200 topics.
Compared to the result of LDA without phase 1, modified LDA with phase 1 shows that introduction of semantic analysis prior to topic modeling improves its various aspects like identification of correct number of topics present in the collection, topics and its semantically related terms with respect to the collection, topic inter-relationship and improved classification of documents with respect to topics.
We have presented a probabilistic model to represent a large collection of documents by a short description called topics. During this process, semantics of the sentences are given importance replacing the Bag Of Word concept. Topics are represented explicitly via Dirichlet distribution variable that is repeatedly selected once for a document. In this sense model generates an allocation of the terms in the document to topics. In our proposed method, the semantics of the sentences are taken care by tokenization, pronoun resolution, word collocation, synonym linking, Named Entity Recognition, tagging, lemmatizing and Word Sense Disambiguation. These make the generated term list to be more meaningful. By these methods, generated topics become more meaningful to the documents. In our experiments, LDA has proved to be more accurate in terms of topics. The document-term matrix formed from the two matrices – document-topic matrix and term-topic matrix derived from modified LDA we get 85% similarity with the original document-term matrix. The document-topic matrix generated is evaluated using a classification approach which gives an 80% improvement. Perplexity metric is also used to find out the number of topics for a test set.
Footnotes
Acknowledgments
We express our deep gratitude to our University Chancellor Shri. Mata Amritanandamayi Devi, who is our guiding force and inspiration. We would like to thank the faculty of Computer Science and Engineering Department, Amrita Vishwa Vidhyapeetham, Amritapuri Campus, for their support and encouragement in successfully completing our paper. We also thank our friends for giving us valuable information and our family for their moral support.
