Abstract
This paper analyses topic segmentation based on the LDA (Latent Dirichlet Allocation) model, and performs topic segmentation and topic evolution analysis of the stem cell research literature in PubMed from 2001 to 2012 by combining the HMM (Hidden Markov Model) with co-occurrence theory. Stem cell research topics were obtained with LDA, and expert judgements were made on these topics to test the feasibility of the model classification. Further, the correlation between topics was analysed. HMM was used to predict the evolution of topics across years, and a time series map was used to visualize the evolutionary relationships among the stem cell topics.
1. Introduction
Stem cell technology has become one of the most prominent new technologies because of its great potential in various areas of science. Stem cells are biological cells capable of self-replication and multidirectional differentiation, ranging from the most primitive totipotent zygote cells and their various differentiated forms to organ lineage-specific stem cells. Owing to this potential, and to their ability to maintain the homeostasis of tissues and organs and to carry out injury repair, stem cells are highly valued as seed cells for growing new tissues and organs.
Stem cells have broad application prospects and market potential in tissue and organ repair, biological modelling, new drug research and development, drug potency assessment and toxicity evaluation, among other uses. Stem cell research can therefore be regarded as one of the most important areas for advancing medical and health services. A full description of the topic structure of stem cell biology can help policy-makers identify areas that need funding more accurately. It also helps researchers follow academic frontiers and keep track of international developments, and so provides a reference for policy decisions relevant to stem cell research.
The Latent Dirichlet Allocation (LDA) model can be used to analyse the topics in the stem cell research area. Aimed at uncovering the hidden topic structure in large datasets (including very large text collections or web documents), LDA uses the Dirichlet distribution over latent variables to carry out tasks such as data sampling, parameter estimation and inference. Once the latent topic structure has been obtained, it can support semantic retrieval [1], document clustering [2, 3], filtering of meaningless data [4] and even processing of bioinformatics data [5].
The Hidden Markov Model (HMM) is a statistical model based on the Markov model put forward by the Russian mathematician Andrey Markov in the early twentieth century; it has a rigorous mathematical structure and reliable computing performance. A Markov model is a stochastic process in which each state change depends on the current state rather than on past states. Such a chain of state changes is called a Markov Chain, and the Markov prediction method uses the Markov Chain to calculate probabilities. An HMM is a Markov Chain whose states cannot be observed directly; they can only be inferred from a sequence of observation vectors, each of which is generated by the underlying state according to some probability density distribution. As a result, HMM can be widely applied to the automatic construction of probabilistic models for sequence recognition systems.
This paper combines LDA and HMM and uses a co-occurrence analysis method to predict the evolution trend of the stem cell field. With this mixed model, we can obtain the topic distribution of a stem cell dataset, as well as the evolution process and dynamic relationships of different topics, by exploiting LDA’s ability to extract topics and HMM’s ability to model transitions. In addition, we can draw the evolution map of stem cell topics using interactive visualization software and open source tools to analyse the relationships between these topics, which offers a visual way to understand the contents of stem cell datasets and how they develop.
This paper consists of four sections. Section 1 introduces the background of stem cell research and the combination of the LDA and HMM models. Section 2 describes in detail how the LDA model and the HMM model are combined, and demonstrates the function of this mixed model. Section 3 presents an experiment in which the LDA model is used to explore the research topics of stem cell datasets, the co-occurrence analysis method is then combined with it for a correlation analysis of adjacent years, and finally the HMM model is used to obtain the time trend of the topics. Section 4 draws conclusions and suggests where future efforts will be focused.
2. Methodology
2.1. Topic analysis via co-occurrence theory
In 1955, Eugene Garfield was the first person to propose the use of citation relations to analyse the ideas of scientific literature [6]. In 2001, Garfield and his colleagues introduced a software package called HistCite for bibliometric analysis and information visualization, and with its help, they performed topic evolution analysis [7]. In 1972, inspired by Kuhn’s paradigm theory, Henry Small tried to describe the scientific structure of nuclear physics and its evolution over time. In 1973, Small proposed the concept of co-citation [8]. Robert R. Braam combined index terms and classification codes in ‘Chemical Abstracts’ (CA) and ‘Biological Abstracts’ (BIOSIS) with citation data in the ‘ISI Science Citation Index’, so as to analyse the topic structure of chemical and biological science [9]. In 1986, Serge Bauin used co-occurrence analysis to map the trends in fishery between 1979 and 1981, which was considered the first application of co-occurrence analysis. In the same year, Rip et al. published Mapping the dynamics of science and technology, a book that laid the foundation for the application of co-occurrence analysis in mapping knowledge structures. As content analysis of the literature deepened [10], researchers started to connect statistics to semantic analysis, which made it possible for co-word analysis to reveal more content information. In 1997, Loet Leydesdorff merged latent semantic analysis into co-word analysis [11]. An et al. combined co-occurrence analysis with MeSH, developed by the US National Library of Medicine (NLM). Using improved information entropy, they carried out a MeSH-based co-word analysis of trends in the stem cell field and drew a strategic diagram [12].
Owing to the flexibility and efficiency of co-word analysis in revealing content, many researchers have conducted more in-depth studies. For instance, some combined co-word analysis with visualization and strategic diagrams, with the intention of making the results more intuitive [13]. CiteSpace, an analysis system developed by Chaomei Chen et al., helps users to identify the literature that triggers conceptual changes and to identify important nodes, thus leading to further discoveries regarding the transformation between networks [14].
2.2. LDA topic analysis
David Blei et al. first proposed LDA in 2003 [15]. LDA is a probabilistic topic model that was developed following LSI and PLSI, and that to some degree promoted the later development of topic models. By introducing a Dirichlet prior distribution, LDA provides a more complete generative hypothesis for text: it uses a probability distribution rather than a concrete multinomial distribution function, so that the topics of every article obey a Dirichlet prior over the multinomial, which allows a proper topic set to be obtained for each article. Afterwards, the CTM (correlated topic model), which addresses the correlation between topics, and the dynamic topic model, which focuses on timestamp information, came into being [16–19]. The extended LDA models also include the author–topic model, the author–role–topic model and the online LDA (OLDA) model [20, 21].
There have been many studies of topic segmentation with LDA. The IBM Almaden Research Center mined many compositional characteristics of topics from blog data. They were also able to distinguish topics of casual chat from topics of sudden events, and presented structural feature descriptions of these two kinds of topic together with a topic-judging algorithm based on statistical difference [22]. In 2006, Ding Zhou et al. regarded topic evolution as a Markov process and modelled it via a Markov transition matrix, drawing the conclusion that topic evolution was greatly influenced by the implicit interactions of ‘hidden’ users [22]. In the same year, Qiaozhu Mei et al. defined two topic evolution modes, content evolution and strength evolution [23, 24]. Xuerui Wang et al. incorporated timestamps into LDA and proposed the Topics Over Time (TOT) model [19]. In 2008, Loulwah AlSumait et al. proposed OLDA and applied it in text topic mining and tracking [21]. In 2009, a novel topic model, Labeled LDA, emerged; it improved on regular LDA by integrating the labels of a labelled corpus with the topics under user supervision [25]. In 2011, Zhongwu Zhai et al. proposed a constrained LDA for automatic clustering of synonymous product features in the field of public opinion analysis [26].
2.3. LDA algorithm
According to the process of LDA in text segmentation and the characteristics of scientific literature datasets [27], the algorithm for scientific literature datasets in LDA is described as follows. Suppose the whole scientific literature dataset contains D items of scientific literature and T topics. After pre-processing, the number of keywords for analysis is V, while w and z refer to the numbers of keywords and topics in the whole literature dataset, respectively, d represents the index of literature, expressed as
For the whole literature dataset, there are D*
In LDA, every item of literature has a distribution of overall topics, and each topic has a distribution of overall keywords. Assume
The process of topic segmentation in LDA is shown in Figure 1. Shaded circles represent observed variables and unshaded circles represent latent variables. Arrows represent conditional dependence between two variables. Boxes represent repeated sampling, with the number of repetitions in the lower-right corner of each box. In LDA, there are two parameters that need to be inferred; one is

Figure 1. Process of topic segmentation in LDA.
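The generative assumption behind LDA (every item of literature has a distribution over topics, and every topic has a distribution over keywords) can be illustrated with a minimal sketch. It reuses D, T and V from the description above; the symmetric Dirichlet hyperparameters `alpha` and `beta`, the fixed document length `doc_len` and the function name `generate_corpus` are assumptions introduced here for illustration and are not taken from the original experiment.

```python
import numpy as np

def generate_corpus(D, T, V, doc_len, alpha=0.5, beta=0.5, seed=0):
    """Sample a toy corpus from the LDA generative process.

    D: number of documents, T: number of topics, V: vocabulary size.
    alpha, beta: assumed symmetric Dirichlet hyperparameters (illustrative).
    """
    rng = np.random.default_rng(seed)
    # One keyword distribution per topic (T x V), drawn from a Dirichlet prior.
    phi = rng.dirichlet(np.full(V, beta), size=T)
    # One topic distribution per document (D x T), drawn from a Dirichlet prior.
    theta = rng.dirichlet(np.full(T, alpha), size=D)
    docs = []
    for d in range(D):
        # Draw a topic index for every keyword position in document d.
        z = rng.choice(T, size=doc_len, p=theta[d])
        # Draw each keyword from the distribution of its assigned topic.
        w = np.array([rng.choice(V, p=phi[t]) for t in z])
        docs.append(w)
    return theta, phi, docs
```

Inference in LDA runs in the opposite direction: given only the observed keywords, it estimates the document–topic and topic–keyword distributions that this sketch samples from.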
2.4. HMM
A process has the Markov property if its ‘future’ depends on ‘now’ rather than on the ‘past’. A Markov process whose time and state are both discrete is called a Markov Chain. HMM was developed on the basis of the Markov Chain. Practical problems, however, are usually more complicated than a Markov Chain can describe: what we observe does not correspond one-to-one to the underlying states, but is instead related to them through a group of probability distributions. HMM was introduced to solve such problems [28]. It is a doubly stochastic process, whose two components are:
Markov Chain – describes the transitions between states through transition probabilities;
General Random Process – describes the relationship between the states and the observed sequences through observation probabilities.
The states of an HMM are uncertain and invisible; only the changes in the observed values can be seen, while the changes of state cannot be seen directly. Only through the random process that generates the observed sequences can we indirectly perceive the true states and the hidden sequence. The ‘hidden’ Markov model thus gets its name from its unobservable states.
There are five basic elements in HMM, so it can be described by a five-tuple:
N – The number of states in HMM. The states cannot be observed directly, but transitions can occur between any of them. The N states can be labelled as
M – The number of observations in HMM, representing the number of characteristics that can be observed. The observation set can be labelled as
π – The initial probability distribution in HMM, representing the probability of being in each state at the start, labelled as
A – The probability distribution of the state transition, representing the probability of changing from one state to another, can be labelled as
B – The probability distribution of every state observation, representing the probability of a particular observed value in a particular state, can be labelled as
The composition of HMM, which is shown in Figure 2, is divided into two parts. One is the Markov Chain described by parameters

Figure 2. The composition of HMM.
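To make the roles of π, A and B concrete, the following minimal sketch evaluates the probability of an observation sequence with the standard forward algorithm. The function name `forward_likelihood` and the integer encoding of observations are assumptions made for illustration; the paper does not state which HMM routines were used in the experiment.

```python
import numpy as np

def forward_likelihood(pi, A, B, observations):
    """Probability of an observation sequence under an HMM (forward algorithm).

    pi: initial state probabilities, shape (N,)
    A:  state transition probabilities, shape (N, N), A[i, j] = P(j | i)
    B:  observation probabilities, shape (N, M), B[i, k] = P(k | i)
    observations: sequence of observation indices in range(M)
    """
    # Joint probability of the first observation and each starting state.
    alpha = pi * B[:, observations[0]]
    for o in observations[1:]:
        # Propagate one step through A, then weight by the emission of o.
        alpha = (alpha @ A) * B[:, o]
    # Sum over the final hidden state.
    return float(alpha.sum())
```

Because the hidden states are summed out at every step, the cost grows linearly with the sequence length instead of exponentially with the number of possible state paths.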
2.5. Mixed model
This paper combines the LDA model and the HMM model to construct a topic evolution model, which inherits LDA’s ability to capture topics and HMM’s ability to model transitions between topics, making it easier to observe how topics evolve over time. The whole mixed model is shown in Figure 3.

Figure 3. Process of topic evolution.
As shown in Figure 3, the whole process is made up of five stages. The first stage collects the data: the target literature is retrieved from the database and stored locally in txt format. The second stage pre-processes the local txt data, for example by removing stop words, subheadings and unrelated noise words. The third stage slices the data, dividing it by publication year from 2001 to 2012 into 12 separate series. The fourth stage obtains the topics. By confirming the value of parameters
3. Experimental analysis
In the experiment, the LDA and HMM models were used to process scientific literature data on stem cell research from 2001 to 2012 in PubMed [29]. The details are shown in Figure 4 and are separated into the following five steps:
Step 1: the PubMed database was treated as the source of the experimental data from which literature data in the stem cell field was retrieved and localized. The data were pre-processed, removing irrelevant words and dividing all the data into 12 groups.
Step 2: the data were expressed with a data format that was identifiable by LDA. According to an expert suggestion, the values of
Step 3: the results of step 2 were normalized via the 0–1 method and each dataset was clustered into classes using LDA; co-occurrence theory was then used to analyse the inner and outer relationships between the topics to work out the transition probability.
Step 4: taking the index of the topics as the hidden state and the class label as the visible state, the HMM model was constructed over the topics, and the evolution sequence across the different years was obtained.
Step 5: the topic analysis included analysis of the inner topic cluster, definition of labels for research content by experts, calculation of correlations and the drawing of the correlation map.

Figure 4. The experimental process of topic evolution.
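The pre-processing and slicing in Step 1 can be sketched as follows. The record layout (a `(year, text)` pair), the tiny `STOP_WORDS` set and the function names are hypothetical and serve only to illustrate the idea; the actual stop-word and noise-word lists used in the experiment are not given in the paper.

```python
import re

# Hypothetical, deliberately small stop-word list for illustration only.
STOP_WORDS = {"the", "of", "and", "in", "a", "to", "for", "with", "on", "by"}

def tokenize(text, stop_words=STOP_WORDS):
    """Lower-case the text, keep alphabetic tokens and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stop_words]

def slice_by_year(records, first_year=2001, last_year=2012):
    """Group tokenized records into one data series per publication year."""
    slices = {year: [] for year in range(first_year, last_year + 1)}
    for year, text in records:
        # Records outside the studied interval are ignored.
        if year in slices:
            slices[year].append(tokenize(text))
    return slices
```

Each of the 12 resulting series can then be passed to LDA separately, which is what allows topics from adjacent years to be compared later.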
3.1. Acquiring experiment data
PubMed, the largest biomedical bibliographic database with over 17 million articles and more than 10,000 newly submitted research abstracts every week, provides a strong database for monitoring breakthroughs and trends in biomedical research. The journal literature in PubMed represents the current status of scientific research on medical science including stem cell research; therefore it was used as the data source for the literature analysis. Our experiment used the English database of PubMed to analyse the literature.
The database includes 101,167 items of literature on stem cell research during the time interval between 2001 and 2012. It comprises 12 data series divided by year of publication. The distribution of research papers in the stem cell field used for the experiment between 2001 and 2012 is shown in Table 1.
Table 1. Distribution of papers.
3.2. Procedure of experiment
Based on expert advice, we set the topic number to seven after removing prepositions and useless words from the database, and clustered the annual data with the LDA model. Each year’s data were therefore clustered into seven classes, so the 12 data series were clustered into 84 classes.
First, all clusters are normalized to 0–1 form. If a word appears in several different clusters, it is kept in the cluster in which it has the maximum probability of appearing and deleted from the other clusters. The similarity between identical words is 1. A word that appears in many clusters makes the similarity between topics high and the division between topics unclear, which is why all such cases are normalized to 0–1. After comparing and analysing all the 0–1 results for the stem cell research literature, we attached labels to the clusters and added corresponding topic sequence numbers for the seven clusters of each year. The topic sequence numbers and corresponding cluster labels for 2001 and 2002 are shown in Table 2.
Table 2. Topic labels.
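The 0–1 normalization described above can be sketched in a few lines. The input format (one word-to-probability dictionary per topic) and the function name `binarize_clusters` are assumptions for illustration; when a word has equal probability in several topics, this sketch keeps the first topic, a tie-breaking rule that the paper does not specify.

```python
def binarize_clusters(topic_word_probs):
    """Assign every word to the single topic where its probability is highest.

    topic_word_probs: list of dicts (word -> probability), one dict per topic.
    Returns the sorted vocabulary and a 0-1 membership matrix (topics x words).
    """
    best = {}  # word -> (topic index, probability) of its strongest topic
    for t, probs in enumerate(topic_word_probs):
        for word, p in probs.items():
            # Strict comparison keeps the earlier topic on ties (assumption).
            if word not in best or p > best[word][1]:
                best[word] = (t, p)
    vocab = sorted(best)
    membership = [
        [1 if best[w][0] == t else 0 for w in vocab]
        for t in range(len(topic_word_probs))
    ]
    return vocab, membership
```

After this step each column of the membership matrix contains exactly one 1, so no keyword is shared between topics of the same year.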
Establishing the HMM model requires the solution of three parameter problems.
The first problem is the probability of occurrence of the hidden states, π. It can be understood as the probability of observing each topic.
The second problem is the observation probability distribution, B. It can be understood as the observation probability obtained by normalizing the similarity between topics; the similarity of topics in adjacent years can be obtained using co-occurrence theory.
The third problem is the transition probability between topics, A. The initial value of A can be set equal to B.
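One way to turn topic similarity into the distributions B and A is sketched below. The paper does not state which similarity measure was derived from co-occurrence theory, so the Jaccard overlap of keyword sets used here is an assumption, as are the function names and the uniform fallback for a topic that shares no keyword with any topic of the following year.

```python
import numpy as np

def jaccard(a, b):
    """Keyword overlap between two topics (assumed similarity measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def transition_from_similarity(topics_this_year, topics_next_year):
    """Row-normalize the similarity between adjacent-year topics.

    Each argument is a list of keyword collections, one per topic.
    Returns (A, B), with A initialized to B as described in the text.
    """
    S = np.array([[jaccard(a, b) for b in topics_next_year]
                  for a in topics_this_year])
    row_sums = S.sum(axis=1, keepdims=True)
    # Avoid division by zero for rows without any overlap.
    safe = np.where(row_sums > 0, row_sums, 1.0)
    # Rows without overlap fall back to a uniform distribution (assumption).
    B = np.where(row_sums > 0, S / safe, 1.0 / S.shape[1])
    A = B.copy()
    return A, B
```

Row normalization ensures that, for every topic in one year, the probabilities over the topics of the next year sum to 1, which is required of both A and B in an HMM.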
In order to acquire the training dataset evolved from topics of the HMM model from 2001 to 2006, we must give a threshold value
3.3. Result and analysis of experiment
Comparing the seven topics of one data series with the seven topics of the following data series, we find that some topics may persist from one year to the next, and that two topics of the present year may combine to form a new topic the following year. In addition, new topics might appear the following year. We therefore predicted a path from each present topic to the topic with the highest transition probability in the next year.
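The path prediction just described, following the highest transition probability from year to year, can be sketched as a greedy walk. The function name `predict_path` and the representation of each year-to-year step as one transition matrix are assumptions for illustration; the toy matrices below are invented and are not experimental results.

```python
import numpy as np

def predict_path(start_topic, transitions):
    """Follow the most probable transition from one year to the next.

    start_topic: index of the topic in the first year.
    transitions: list of matrices, transitions[k][i, j] being the probability
                 that topic i in year k becomes topic j in year k + 1.
    Returns the list of topic indices, one per year.
    """
    path = [start_topic]
    for A in transitions:
        # Pick the next-year topic with the highest transition probability.
        path.append(int(np.argmax(A[path[-1]])))
    return path

# Toy example with two topics over three years (invented numbers).
toy_transitions = [
    np.array([[0.2, 0.8], [0.6, 0.4]]),
    np.array([[0.9, 0.1], [0.8, 0.2]]),
]
paths = [predict_path(t, toy_transitions) for t in range(2)]
```

When two starting topics end in the same topic index, as both toy paths do in the last year, the sketch reproduces the merging behaviour discussed in the text.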
Every topic has a lifecycle: as time passes it goes through emergence, development, climax, decline and death. To examine this lifecycle, the topics in 2001 and 2006 were set as the original topics and the evolution path of every topic was predicted. The predicted results are shown in Tables 3 and 4. To trace the evolution of the topics, we drew the evolution maps (shown in Figures 5 and 6) from these starting topics with a time series map based on the results predicted by the HMM.
Table 3. Prediction results from 2001 to 2006.
Table 4. Prediction results from 2007 to 2012.

Figure 5. 2001–2006 Topic evolution.

Figure 6. 2007–2012 Topic evolution.
We can see from Figure 5 that topic 2001_2 and topic 2001_3 merged into the new topic 2002_5 in 2002. This indicates that breast cancer and its clinical studies were merged into a new topic about carcinoma expression. Topics frequently merge into a new topic the following year, so the seven topics in 2001 had merged into only three topics by 2006. As we can see from Figure 6, the topics in 2007 did not merge in 2008, but topic 2008_1 and topic 2008_5 merged into topic 2009_0 in 2009, and the seven topics in 2007 had merged into four topics by 2012.
4. Conclusions
This paper introduces the background of stem cell research and, on the basis of an analysis of stem cell themes, discusses the application of co-occurrence-based topic analysis and the LDA topic partition model to stem cell data. To address the shortcomings of parameter estimation in the LDA model, an ATNLDA (Auto Topic-number LDA) topic segmentation model was constructed by integrating co-occurrence theory and a clustering index. At the same time, the dynamic evolution of the stem cell research literature was modelled by combining the HMM and LDA models. The corresponding experimental data on stem cell research were collected from PubMed publications of 2001–2012, and the evolutionary process was analysed by training the HMM model and using its output to draw the topic evolution map.
This research achieved a dynamic topic evolution model of the stem cell research literature. Topic evolution is an important research direction, and can be applied widely, for example to exploration systems in social networking services and recommendation systems in online stores. The experiment only analyses the merging of different topics, and does not consider the situation in which a topic is divided into further topics the following year. In addition, it does not track the evolution of new topics. All of these factors will be addressed in future research.
Footnotes
Funding
The project was supported by the National Natural Science Foundation of China (no. 71173211), the Soft Sciences Foundation of Fujian Province of China (no. 2014R0091), the National Social Sciences Funding Program of China (no. 13CTQ011), the Social Sciences Foundation of Fujian Province of China (no. 2013C067) and the MOE Project of Humanities and Social Sciences of China (no. 11YJC870027).
