Abstract
With the rise of personalized travel recommendation in recent years, automatic analysis and summarization of tourist attractions has become important for decision making by both tourists and tour operators. To this end, many probabilistic topic models have been proposed for extracting features of tourist attractions. However, existing state-of-the-art probabilistic topic models overlook the fact that tourist attractions tend to have distinct characteristics in specific seasonal contexts. In this article, we propose using seasonal contextual information to refine the characteristics of tourist attractions. Along this line, we first propose STLDA, a season topic model based on latent Dirichlet allocation that can capture meaningful topics corresponding to various seasonal contexts for each attraction. Then, an inference algorithm using Gibbs sampling is put forward to learn the parameters of the proposed model. To verify the effectiveness of the STLDA model, we present a detailed experimental study using real-world textual data collected for tourist attractions. The experimental results show the superiority of STLDA over the basic LDA model in providing a representative and comprehensive summarization of each tourist attraction. More importantly, the model can help improve personalized attraction recommendation.
Introduction
With the rapid development of the tourism market, the demand for intelligent travel services is expected to increase remarkably. The prevalence of the Internet enables everyone to easily access travel-related information from various websites. However, the sustained growth of travel data on the web may overwhelm tourists when they select tourist attractions that meet their personal requirements. Meanwhile, tour operators need to present customized tourist attractions to potential tourists in order to survive in a competitive market and increase profit. Therefore, it is highly desirable to produce a precise analysis and summary of online attraction information, with the objective of providing decision support for both tourists and tour operators.
As an effective tool for precision marketing by tour operators and for assisting tourists’ decision making, personalized recommendation has attracted a great deal of attention and has been widely applied in the tourism domain over the past few years [7, 15, 28]. Personalized attraction recommendation focuses on identifying the most relevant attractions to recommend to tourists. The content-based method is popular in this setting because it caters well to tourists’ needs. The content-based attraction recommendation approach aims to maximize the relevance between tourists’ preferences and attractions’ features. A critical challenge along this line is to obtain a comprehensive understanding of the characteristics of tourist attractions. In recent years, thematic analysis has been actively investigated for feature extraction of tourist attractions and has gradually become an important attraction profiling technique. Topic detection and extraction is a well-studied research area [11, 34, 41, 46] that aims to identify groups of words that form topics from a collection of documents. For topic extraction of tourist attractions, most existing studies develop methods to extract typical topic features for each tourist attraction [17, 38]. Nevertheless, previous studies have ignored the commonsense observation that tourist attractions tend to show distinct features in different seasons, which is valuable contextual information for improving the topic representation of tourist attractions. According to the review by Champiri et al. [9], changing scenarios can cause variations in topic occurrence. Lately, topic mining that considers contextual information has increasingly attracted the attention of scholars.
Two snapshots that illustrate seasonal characteristics in the description documents of attractions.
For topic mining of tourist attractions, attraction textual data differs significantly from other common documents, since the content of an attraction description text often reveals a strong seasonal pattern. This pattern is an intrinsic feature of the attraction and should be treated as important contextual information about it. To illustrate the seasonal characteristics present in attraction description documents, Fig. 1 shows snapshots of two famous tourist attractions in China. Figure 1a is the description text of East Lake Scenic Area from its official website (
To fill this gap, we present a novel probabilistic topic model that detects meaningful topics corresponding to various seasonal contexts for each attraction from a collection of attraction description documents. The proposed Season Topic model based on LDA (STLDA) is a generative probabilistic model that can capture the latent season-dependent topic clusters that naturally occur in attraction documents. As a generative model, the learned topic model is essentially a joint probability distribution over seasonal contextual information and textual data. It specifies a probabilistic process describing how words in attraction documents might be generated when the seasonal feature of each attraction document is taken into account. By including seasonal contextual information, STLDA can model the variations in topic occurrence that reflect changing seasonal contexts, which other probabilistic topic models are unable to capture. As a result, the proposed model can detect representative and comprehensive attributes corresponding to various seasonal contexts for each attraction and represent the content of each attraction description document well.
The rest of this paper is organized as follows. In Section 2, we review prior work related to our study. Section 3 is devoted to the methods, including the basic LDA model and the proposed STLDA model. In Section 4, an inference algorithm using Gibbs sampling for estimating the parameters of the proposed model is discussed in detail. Section 5 presents the experimental results and analysis. Finally, Section 6 gives our conclusions.
In the tourism field, the dramatic increase in the amount of available travel-related information makes it difficult for tourists and tour operators to make travel decisions. As a critical step toward travel recommendation, which enhances the competitiveness of tour operators and alleviates information overload for tourists, thematic analysis of tourist attractions has recently drawn focused research attention. Topic-based feature analysis of a given attraction helps users and tour operators capture the high-level concepts that reveal representative and comprehensive attributes of the attraction, which is beneficial for subsequent attraction selection or tourism planning. For instance, Pang et al. [32] conducted a topic segmentation of popular attractions in the United States using topics extracted from user-generated travelogues on the web. In Yeh and Cheng’s study [45], popular tourist attractions in Taiwan were segmented into nine subject categories (natural, museum, heritage, park, animal, religious site, shopping, nightlife and visitor center) on the basis of the properties of the attractions. In Hao et al.’s study [17], tourist destinations mentioned in travelogues on travel websites were characterized by topics such as desert, museum, seaside and mountain, which were mined from these travelogues. Another related work is Hao et al. [18], in which the authors proposed generating overviews of locations by mining representative topic tags from travelogues. Topic detection is also applied in Shen et al.’s study [38], where the topic features of tourist attractions were mined from user comments on travel websites and then matched with tourists’ preferences to generate personalized attraction recommendations.
Probabilistic topic models have been proposed for topic extraction from textual data and have been successfully applied to a range of text mining tasks in different research fields over the past decade. Their strength lies in automatically discovering meaningful latent topics from large collections of documents while simultaneously representing documents in terms of these topics. Topic models are usually based on the assumption that documents are mixtures of topics, where each topic is a probability distribution over words. Early explorations of topic modeling include the latent semantic analysis (LSA) model [14], the probabilistic latent semantic analysis (PLSA) model [20] and their variants, where PLSA was a useful step toward probabilistic modeling of text. The Latent Dirichlet Allocation (LDA) model [6], proposed by Blei et al., is considered one of the most popular topic models because of its sounder probabilistic foundation. LDA is a well-defined generative probabilistic model that generalizes easily to new documents and improves on PLSA by introducing Dirichlet priors on the model parameters, which mitigates the overfitting problem of PLSA. Since LDA can accurately extract users’ tourism topic preferences as well as the topic features of attractions from travel-related information, it has attracted extensive attention from researchers in personalized travel recommendation over the past few years. For example, Arbelaitz et al. [2] employed LDA to extract topics reflecting tourists’ interests from user-generated content on travel websites, with the aim of promoting a destination to tourists. Hao et al. [17] proposed a location-topic model based on LDA to mine local topics that characterize locations from a large collection of travelogues, and then to recommend travel destinations on the basis of tourists’ travel intentions.
In Jiang et al.’s research [22], topics describing user preferences were extracted from the textual descriptions of photos on social media using an extended LDA model, and personalized attraction recommendation was then performed accordingly. In Shen et al.’s study [38], LDA was used to obtain the topics and topic probability distribution of each attraction from a collection of user comments crawled from travel websites, and the similarities between attractions were then measured for attraction recommendation.
Recently, a promising research direction in topic modeling has been to include contextual information, with the aim of detecting latent topics that reflect the effect of varying contexts. In personalized travel recommendation, incorporating additional contextual information into topic models can better identify topic features of user preferences and attraction characteristics, which can then be used in context-dependent decision support tasks. Time is an essential kind of contextual information in this setting. Tourists’ preferences and requirements may vary over time, leading to changes in travel behavior [10, 40, 42]. Meanwhile, tourist attractions tend to have distinct characteristics in specific time contexts [8]. To this end, several studies have attempted to link time information to topic models. For example, Wang and McCallum [44] presented a probabilistic topic model that uses each document’s timestamp to model time jointly with word co-occurrence patterns, with the aim of extracting a probability distribution over continuous time for each topic. Blei and Lafferty [5] proposed a dynamic topic model based on LDA to capture the evolution of topics over a long period in a large, sequentially organized document collection. In Lu’s study [29], a Probit-Dirichlet hybrid allocation topic model was developed that includes temporal features of documents to detect cyclical topic dynamics reflecting users’ habits in user-generated content, which can then be used to recommend products to users in specific contexts. Liu et al. [27] developed a probabilistic topic model incorporating location and time information, which can extract the topics of each travel package corresponding to its suitable travel time for subsequent personalized travel package recommendation.
Despite recent progress, these time-dependent topic models mainly focus on the long-term evolution of topics across a whole corpus, while the topics of each document remain constant. Specifically, they usually assume that each document in the corpus is associated with one timestamp and that all documents are collected over time. The topic models are then applied to sequentially organized document collections to discover time-sensitive topics. However, this assumption is oversimplified, because one document may exhibit features of more than one time period. The above-mentioned topic models may therefore confound topics belonging to different time contexts within one document, and this is the major gap that motivates the present work. It should be pointed out that the generative process of our proposed model is similar to some extent to those of several topic models in the text modeling domain, such as the Topic-Aspect model [33], the Topic-Link LDA model [26] and the Author-Topic model [36], although the logical structures of these models are entirely different. For example, the Author-Topic model introduces two hyperparameters to model the content of documents and the interests of authors; it thus has only two sets of latent variables to estimate and remains a three-level hierarchical Bayesian model in nature. In the Topic-Aspect model, the authors decompose the generative process of words into a background model and an aspect model, and then use a binary switching variable to determine whether a word is a common background word that appears independently of a document’s topical content or a topical word associated with a topic. Similarly, the Topic-Link LDA model introduces a binary variable to model a link between two documents, with the aim of identifying a set of high-level topics covered by the documents in the collection as well as the social network of the authors.
Methodology
LDA Model
Latent Dirichlet Allocation (LDA) [6] is a generative probabilistic model that captures the implicit topic structure of a collection of documents. It specifies a probabilistic procedure describing how the words in documents are generated. The basic idea is that each document is represented by a specific topic distribution and each topic is characterized by a probability distribution over words. The LDA model is a three-level hierarchical Bayesian model, where topics are associated with documents and words are associated with topics. The hierarchy thus consists of a document layer, a topic layer and a word layer.
Word layer: A word is the basic unit of discrete data, defined to be an item from a vocabulary of size Topic layer: A topic Document layer: A document is a sequence of
Figure 2a shows the graphical model representation of the LDA. In this graphical notation, nodes are random variables and arrows indicate conditional dependencies between two variables. The shaded and unshaded circles represent observed and latent variables respectively, while boxes refer to repeated sampling with the number of samples in the lower right corner of the boxes. It is well known that the Dirichlet distribution is the conjugate prior of the multinomial distribution. Therefore, a Dirichlet prior with parameter
For each topic
Draw a topic-word multinomial distribution For each document
Draw a document-topic multinomial distribution For each word
Draw a topic Draw a word
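The generative steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the symmetric concentration values `alpha` and `beta`, the fixed document length, and the use of integer word ids are all assumptions made for the example.

```python
import random

def sample_dirichlet(rng, concentration, size):
    """Draw from a symmetric Dirichlet by normalising Gamma variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]

def sample_categorical(rng, probs):
    """Draw an index according to a discrete probability vector."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_lda_corpus(num_docs, doc_len, num_topics, vocab_size,
                        alpha=0.5, beta=0.5, seed=0):
    """Generate a toy corpus following the LDA generative process.

    alpha and beta are assumed symmetric Dirichlet concentrations for the
    document-topic and topic-word distributions respectively.
    """
    rng = random.Random(seed)
    # For each topic: draw a topic-word multinomial distribution.
    topic_word = [sample_dirichlet(rng, beta, vocab_size)
                  for _ in range(num_topics)]
    corpus = []
    for _ in range(num_docs):
        # For each document: draw a document-topic multinomial distribution.
        doc_topic = sample_dirichlet(rng, alpha, num_topics)
        doc = []
        for _ in range(doc_len):
            z = sample_categorical(rng, doc_topic)       # draw a topic
            w = sample_categorical(rng, topic_word[z])   # draw a word
            doc.append(w)
        corpus.append(doc)
    return corpus, topic_word
```

Calling `generate_lda_corpus(3, 20, 4, 10)` returns three documents of twenty word ids each, together with the four sampled topic-word distributions.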
Given the parameters
Integrating over
Finally, taking the product of the marginal probability of every document in the corpus, the generative probability of a corpus is defined as follows:
In LDA, two sets of parameters need to be estimated from a collection of documents: the topic distribution of each document and the word distribution of each topic. In practice, only the documents are observed, while the topic structure, including the topics and topic proportions, is hidden. The key issue in LDA is to use the observed documents to infer the latent topic structure. Statistical approaches are therefore used to infer the latent variables that best explain the observed collection of documents. Exact inference of the posterior is intractable in general, so a variety of approximate inference algorithms are used for LDA, including Expectation-Maximization [23], Gibbs sampling [19, 35] and variational approximation [4].
STLDA is a novel probabilistic topic model that extracts topics from a collection of attraction documents by exploiting both the content of the documents and the intrinsic seasonal characteristics of each document. STLDA extends LDA by adding a season layer between the document layer and the topic layer. STLDA is therefore a four-level hierarchical Bayesian model, in which seasonal features are associated with documents, topics are associated with seasonal characteristics, and words are associated with topics. The crucial enhancement of the STLDA model is that it can clearly identify the meaningful topics corresponding to various seasonal contexts for each tourist attraction. As a result, tourist attractions are described more comprehensively and precisely at a fine-grained seasonal level, which benefits further analysis. In this paper, based on the intrinsic seasonal characteristics of each tourist attraction, we assume that the words in attraction documents have distinct seasonal tendencies. The STLDA model is represented as a probabilistic graphical model in Fig. 2b.
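To make the Gibbs sampling option concrete, the following is a minimal sketch of a collapsed Gibbs sampler for the basic LDA model, using the standard count-based update. It is not the STLDA sampler derived later in the paper, and the function name, default hyperparameter values and iteration count are assumptions chosen for illustration.

```python
import random

def lda_gibbs(corpus, num_topics, vocab_size, alpha=0.5, beta=0.1,
              iterations=200, seed=0):
    """Collapsed Gibbs sampling for basic LDA on a corpus of word-id lists."""
    rng = random.Random(seed)
    n_dk = [[0] * num_topics for _ in corpus]              # doc-topic counts
    n_kw = [[0] * vocab_size for _ in range(num_topics)]   # topic-word counts
    n_k = [0] * num_topics                                 # tokens per topic
    z = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(corpus):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)
    for _ in range(iterations):
        for d, doc in enumerate(corpus):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Full conditional, up to a normalising constant.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta)
                    / (n_k[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights, k=1)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    # Point estimates of the document-topic and topic-word distributions.
    theta = [[(n_dk[d][k] + alpha) / (len(doc) + num_topics * alpha)
              for k in range(num_topics)] for d, doc in enumerate(corpus)]
    phi = [[(n_kw[k][w] + beta) / (n_k[k] + vocab_size * beta)
            for w in range(vocab_size)] for k in range(num_topics)]
    return theta, phi
```

The returned `theta` and `phi` correspond to the two sets of parameters named above: one topic distribution per document and one word distribution per topic.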
The graphical model for the LDA and STLDA.
Assume that we have a corpus with a collection of
Generative process of STLDA. For each topic
Notations used in this paper
Example of STLDA model.
As previously mentioned, STLDA considers both the general description of an attraction and the seasonal features present in the attraction document in a unified manner, and can detect the meaningful topics for each attraction in different seasons. Figure 3 is a running example of the STLDA model. As the figure shows, the hierarchy consists of the attraction document layer, the season layer, the topic layer and the word layer. Words constitute a number of topics, and the tourist attraction corresponds to various topics in different seasons, where the weights on the edges indicate the topic occurrence probabilities. For example, in spring the detected topics for the attraction are T7 with probability 0.635 and T20 with probability 0.208, while in winter the attraction corresponds to topics T16, T4 and T26 with probabilities 0.613, 0.184 and 0.136 respectively.
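The four-layer hierarchy described above can be illustrated with a short generative sketch. Because the formal generative process is not reproduced here, the sketch rests on stated assumptions: each document has a distribution over the four seasons, each document-season pair has its own topic distribution, topics are shared word distributions, and all Dirichlet priors are symmetric. The names `generate_stlda_document`, `alpha` and `gamma` are hypothetical.

```python
import random

SEASONS = ["spring", "summer", "autumn", "winter"]

def dirichlet(rng, concentration, size):
    """Symmetric Dirichlet draw via normalised Gamma variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]

def generate_stlda_document(rng, topic_word, doc_len, alpha=0.5, gamma=0.5):
    """Generate one document under the assumed document-season-topic-word chain.

    topic_word: list of per-topic word distributions (shared by all documents).
    alpha, gamma: assumed symmetric priors for topics and seasons.
    """
    num_topics = len(topic_word)
    # Document layer -> season layer: one season distribution per document.
    season_dist = dirichlet(rng, gamma, len(SEASONS))
    # Season layer -> topic layer: one topic distribution per (doc, season).
    topic_dist = [dirichlet(rng, alpha, num_topics) for _ in SEASONS]
    tokens = []
    for _ in range(doc_len):
        s = rng.choices(range(len(SEASONS)), weights=season_dist, k=1)[0]
        z = rng.choices(range(num_topics), weights=topic_dist[s], k=1)[0]
        # Topic layer -> word layer.
        w = rng.choices(range(len(topic_word[z])), weights=topic_word[z], k=1)[0]
        tokens.append((SEASONS[s], z, w))
    return tokens
```

Each generated token carries its season label and topic assignment, which are exactly the quantities that are hidden in real attraction documents and must be inferred.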
Now, the likelihood function for the observed textual data of tourist attractions can be formulated according to our proposed probabilistic generative model. Given the hyperparameters
where
By integrating out the distributions of
where
Finally, the likelihood of the complete corpus
There are three sets of latent variables that need to be estimated in our model, including: the topic distribution of the corresponding per document-season pair
Model inference based on Gibbs sampling
Given a collection of documents, the posterior distribution of the latent variables including
In the case of our model, the target of inference is the posterior distribution of the hidden variables
where
Specifically, in our model, the full conditional probability distribution for a word
where
To apply a Gibbs sampling algorithm, the joint probability distribution of the observed words, topics and season labels assignments of the whole corpus is first derived by dividing this joint distribution into three parts, which is given by:
since the first part
The first part
where
where
Then, the target distribution
where
Analogous to
where the notation
where
Likewise, the third part
where
After deriving the joint probability distribution, the topic assignment
The above formula can then be factored, which is shown as follows:
Noting that, the above derivations exploited one important property of Gamma function
Finally, the full conditional probability distribution for each variable
and
Gibbs sampling will serially draw each variable of
Gibbs sampling procedure of STLDA. Corpus of attraction documents,
By applying Bayes’ rule, the multinomial distributions
where
The probability of topic
For the remaining parameters
The approximate probability of season label
Note that the model parameter
In this section, we evaluate the performance of the proposed STLDA model on real-world travel data. To the best of our knowledge, no prior study has addressed seasonal topic mining of tourist attractions. Therefore, all experimental results of the STLDA model are compared with the original LDA model both qualitatively and quantitatively. In the following experiments, the Gibbs sampling algorithm is used for both the STLDA and LDA models. We run the Markov chains for 1000 iterations to produce samples of the latent variables in each experiment. Previous studies [13, 25, 39] have shown that topic models are not sensitive to hyperparameters and can produce reasonable results with a simple symmetric Dirichlet prior. During Gibbs sampling, we use empirical values for the smoothing parameters
Data collection and pre-processing
We employ the English database of Wikipedia (
We construct an attraction corpus that consists of attraction description documents written in English. Each document in the corpus is associated with a single famous tourist attraction in China, covering 160 unique attractions in total. The selected attractions, which include natural and cultural landscapes, are mainly 5A or 4A tourist attractions as rated by the China National Tourism Administration, where 5A represents the highest level of tourist attraction. Table 2 summarizes our data collection.
Summary of our data collection
Since attraction textual information acquired from the Internet is unstructured and usually noisy, it is necessary to preprocess the original attraction textual data. First, punctuation, numbers and other non-alphabetic characters are removed. Second, all words are lowercased and stop words are removed using the stop word list from the Natural Language Toolkit (NLTK) [3]. Third, to reduce the vocabulary size, low-frequency words that appear fewer than twice in the corpus are filtered out. Then, we use the spell-check function of a word processor to examine our data collection. After preprocessing the textual information of each attraction, the word distribution of each document can be obtained. Finally, the corpus is converted into a data format that can be read by the STLDA and LDA models.
Perplexity, widely used in natural language modeling, is an important indicator of the predictive power of a model [30]. A lower perplexity value means that a higher likelihood is achieved on a test dataset, and thus indicates better generalization performance. Given a test dataset D of M documents, the perplexity value can be calculated as follows:
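The first three preprocessing steps can be sketched as follows. This is an illustrative outline rather than the authors' pipeline: the small `STOP_WORDS` set is a stand-in for the NLTK stop word list, the spell-check step is omitted, and the function names are hypothetical.

```python
import re
from collections import Counter

# Tiny stand-in stop word list; the paper uses the NLTK list instead.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "is", "are", "to", "with"}

def tokenize(text):
    """Strip non-alphabetic characters, lowercase, and drop stop words."""
    cleaned = re.sub(r"[^A-Za-z]+", " ", text)
    return [tok for tok in cleaned.lower().split() if tok not in STOP_WORDS]

def preprocess(documents, min_count=2):
    """Tokenize every document, then drop words seen fewer than min_count
    times in the whole corpus."""
    tokenized = [tokenize(doc) for doc in documents]
    counts = Counter(tok for doc in tokenized for tok in doc)
    return [[tok for tok in doc if counts[tok] >= min_count]
            for doc in tokenized]

docs = [
    "The lake is covered with cherry blossom in spring!",
    "In 2015, cherry blossom festivals attract visitors to the lake.",
]
print(preprocess(docs))
# [['lake', 'cherry', 'blossom'], ['cherry', 'blossom', 'lake']]
```

Words such as "spring" and "festivals" are dropped in this toy run only because each occurs once in the two-sentence corpus; on the full attraction corpus the frequency filter removes far rarer terms.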
where
Perplexity value comparison of STLDA and LDA.
In our experiments, we use perplexity to measure the generalization performance of the proposed model, and the results are shown in Fig. 4. For the attraction corpus, a 10-fold cross-validation scheme is adopted to enhance the reliability of our results. In 10-fold cross-validation, the original dataset is divided into ten disjoint subsets of approximately equal size, and each subset is used in turn for prediction while the remaining nine subsets are used to train the model [21]. As shown in Fig. 4, STLDA yields lower perplexity than LDA for every number of topics, which indicates that STLDA has better predictive power for unseen documents than the original LDA model. Further analysis shows that perplexity improves by about 24.09% on average. This is because STLDA represents the content of new attraction documents well by taking the intrinsic seasonal features of attractions into consideration. From Fig. 4, we can see that the perplexity values of the two models decrease rapidly as the number of topics increases from 10 to 30, while the performance of both models worsens as the number of latent topics increases further from 30 to 100. The experimental results indicate that the optimum number of topics for the attraction corpus is 30.
The statistical significance of the difference in perplexity between the STLDA and LDA models is further assessed using the Wilcoxon signed-rank test. The Wilcoxon test is a nonparametric test that is used when the underlying distribution is unknown [37]. According to the test result, the value of
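The evaluation loop can be sketched as below, assuming the widely used definition of perplexity as the exponential of the negative average per-token log-likelihood, with the probability of a word given by mixing topic-word probabilities under the document's topic distribution. How the topic distribution of a held-out document is obtained is not covered here; `theta` and `phi` are taken as given, and both function names are hypothetical.

```python
import math

def perplexity(test_docs, theta, phi):
    """exp(-sum of token log-likelihoods / number of tokens).

    test_docs: list of documents, each a list of word ids.
    theta[d][k]: assumed topic distribution of test document d.
    phi[k][w]: topic-word distributions learned on the training folds.
    """
    log_lik = 0.0
    n_tokens = 0
    for d, doc in enumerate(test_docs):
        for w in doc:
            p_w = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
            log_lik += math.log(p_w)
            n_tokens += 1
    return math.exp(-log_lik / n_tokens)

def k_fold_indices(num_docs, k=10):
    """Yield (train, test) index lists for k disjoint, near-equal folds.

    Folds are built by striding over the indices without shuffling,
    which is a simplification for the example.
    """
    folds = [list(range(i, num_docs, k)) for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx
```

As a sanity check, a model that assigns uniform probability to a vocabulary of five words has perplexity 5, and splitting 160 documents into ten folds gives 144 training and 16 test documents per fold.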
Running time comparison of STLDA and LDA.
To evaluate the time complexity of our proposed model on attraction corpus, we summarize the running time of STLDA and LDA for different number of topics
In this section, we train the STLDA and LDA models separately on the whole attraction corpus to learn topics. The number of topics is set empirically to 30 according to Section 5.2. Since the representative topical words generated by the STLDA and LDA models are very close, we present the 22 topics that share the same meaning in both models in Table 3, where the topic number
Topics extracted from STLDA for attraction corpus
As can be seen from Table 3, the extracted topics clearly characterize features of attractions, including both natural styles such as woods (topic 7) and snowscape (topic 16) and cultural styles such as playground (topic 21). The representative words of these topics are informative and coherent. For example, in topic 4, words such as relaxing, entertainment, comfort, fun, vacation and resort are related to each other and semantically coherent, conveying the meaning of leisure and entertainment, and we name the topic accordingly. In addition to the meaning of the topical words, we also refer to the classification criteria of Chinese national tourism resources [12] and related studies [17, 24, 32, 45] to name all the topics extracted from the attraction corpus.
Next, we illustrate how STLDA can accurately capture season-dependent topic clusters and improve the topic representation of tourist attractions. In our experiments, topics whose occurrence probability is greater than 0.1 are selected for each tourist attraction. Three tourist attractions, namely Nalati scenic spots, Yuntai Mountain and Zhangjiajie National Forest Park, are chosen as typical examples for comparison with the LDA model. All three are national 5A tourist attractions, located in northwestern, mid-eastern and central China respectively. Table 4 summarizes the topics and topic probability distributions obtained for these tourist attractions using the STLDA and LDA models. The detected topics for each tourist attraction are arranged in descending order of probability, and the probability value of each topic is shown in parentheses.
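The selection rule described above (keep topics with probability greater than 0.1, then sort in descending order) amounts to a simple filter. In the sketch below, the two leading probabilities are the spring values reported for Nalati scenic spots; the remaining entries are made-up filler so that the example has something to discard, and the function name is hypothetical.

```python
def select_topics(topic_probs, threshold=0.1):
    """Keep topics whose probability exceeds the threshold, highest first."""
    kept = [(topic, p) for topic, p in topic_probs.items() if p > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)

# woods and blossom use the reported spring probabilities;
# the other three values are illustrative filler only.
spring = {"blossom": 0.208, "woods": 0.635, "village": 0.07,
          "snowscape": 0.05, "entertainment": 0.037}
print(select_topics(spring))
# [('woods', 0.635), ('blossom', 0.208)]
```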
Topics representation for selected tourist attractions using STLDA and LDA
From Table 4, we can clearly see that the topics and their occurrence probabilities for all three tourist attractions change significantly with the seasons. Take Nalati scenic spots as an example. In spring, the topics detected by STLDA are woods and blossom, with probabilities 0.635 and 0.208 respectively. The representative topic features of Nalati scenic spots in spring are thus woods and flowers, which is consistent with the common understanding that its natural scenery is most prominent in spring. The topic with the highest probability generated by STLDA in summer is cultural activity (0.494). According to the relevant information, the temperature at Nalati scenic spots is agreeable in summer, and the hospitable local Kazakhs often hold a variety of folk activities to show their colorful ethnic culture. These observations show that the topics generated by STLDA are capable of reflecting the real-life features of an attraction in different seasons. For the LDA model, the detected topic with the highest probability is snowscape, followed by woods, village, entertainment and blossom. The topic representations of Nalati scenic spots obtained from the two models are also shown visually in Fig. 6.
The topic distribution of Nalati scenic spots with respect to different seasons.
From further comparison and analysis of the results generated by the two models, three conclusions can be drawn. First, the topics found by STLDA and LDA for the same tourist attraction have a certain degree of similarity, but the topic probability distributions are markedly different. For instance, the results of STLDA and LDA for Yuntai Mountain both include topics such as woods and blossom. The corresponding probabilities obtained from STLDA are 0.648 and 0.121, while those from LDA are 0.220 and 0.176 respectively. Second, STLDA explicitly identifies topic clusters corresponding to various seasonal contexts, while the topic representation of LDA tends to be more general and less coherent. Taking Zhangjiajie National Forest Park as an example, STLDA clearly detects and localizes the snowscape topic in winter and the maple leaves topic in autumn, whereas LDA merges these topics. Not modeling time can confound co-occurring topic patterns and result in an unclear topic representation of tourist attractions. Finally, STLDA detects some topics that are ignored by the LDA model. For Nalati scenic spots, topics found by the STLDA model such as maple leaves, ice sports and cultural activity may be representative features in specific seasons, but they are neglected by the LDA model. If the topics whose occurrence probability is less than 0.1 are examined in the results of LDA, we can see that the probabilities of these ignored topics are 0.088, 0.043 and 0.0002 respectively. By contrast, the probability of the cultural activity topic increases from 0.0002 to 0.494 in STLDA, which makes this topic prominent for Nalati scenic spots in summer. This difference comes from STLDA’s assumption, which takes the intrinsic seasonal features of each tourist attraction into consideration.
Therefore, STLDA can capture latent season-dependent topics at a fine-grained seasonal level, while some meaningful topics are filtered out by LDA because of their extremely low probability at a coarse-grained level.
Comparison of statistical indicators between STLDA and LDA for attractions corpus
To show the dominancy of STLDA over the basic LDA model more intuitively, we evaluate the statistical properties of obtained topics and topic probability distributions of all 160 tourist attractions. We choose five statistical indicators, namely Richness, Coincidence, Diversity, Significance and Volatility, and the results are shown in Table 5. Richness indicator refers to the average number of detected topics for each tourist attraction. Coincidence indicator denotes the average coincident number of topics generated from two models for each tourist attraction. Diversity indicator reflects the average number of extra topics generated from one model over the other model for each tourist attraction. Significance indicator represents the average highest topic probability value of each tourist attraction. Volatility indicator indicates the average standard deviation of topic probability distribution corresponding to each tourist attraction. The SP, SU, AU and WI in Table 5 denote spring, summer, autumn and winter respectively.
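For a single attraction, the five indicators could be computed roughly as follows before averaging over all attractions. Several details are assumptions made for this sketch: detected topics are those with probability above 0.1, the STLDA topic set is the union over seasons, Volatility is a population standard deviation over the full distribution, and the toy probabilities and the name `indicators` are hypothetical.

```python
from statistics import pstdev

def detected(topic_probs, threshold=0.1):
    """Set of topics whose probability exceeds the threshold."""
    return {t for t, p in topic_probs.items() if p > threshold}

def indicators(stlda_by_season, lda_probs, threshold=0.1):
    """Per-attraction indicators; STLDA entries come first in each pair."""
    stlda_topics = set()
    for probs in stlda_by_season.values():
        stlda_topics |= detected(probs, threshold)
    lda_topics = detected(lda_probs, threshold)
    return {
        "richness": (len(stlda_topics), len(lda_topics)),
        "coincidence": len(stlda_topics & lda_topics),
        "diversity": (len(stlda_topics - lda_topics),
                      len(lda_topics - stlda_topics)),
        "significance": ({s: max(p.values())
                          for s, p in stlda_by_season.items()},
                         max(lda_probs.values())),
        "volatility": ({s: pstdev(list(p.values()))
                        for s, p in stlda_by_season.items()},
                       pstdev(list(lda_probs.values()))),
    }

# Hypothetical distributions for one attraction (two seasons for brevity).
stlda = {
    "spring": {"woods": 0.6, "blossom": 0.3, "village": 0.1},
    "winter": {"snowscape": 0.7, "ice sports": 0.2, "woods": 0.1},
}
lda = {"snowscape": 0.4, "woods": 0.3, "village": 0.2,
       "blossom": 0.05, "ice sports": 0.05}
result = indicators(stlda, lda)
print(result["richness"], result["coincidence"], result["diversity"])
# (4, 3) 2 (2, 1)
```

Under these definitions, Diversity equals Richness minus Coincidence for each model, which agrees with the reported averages up to rounding (7.838 − 2.856 ≈ 4.981 and 3.313 − 2.856 ≈ 0.456).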
According to Table 5, the average number of topics generated by STLDA is 7.838, compared with 3.313 for LDA, indicating that STLDA is capable of detecting more topics than LDA. The value of the Coincidence indicator is 2.856; compared with the Richness value of 3.313 for LDA, this shows that the topics identified by STLDA cover most of the topics generated by LDA. Turning to the Diversity indicator, the values for STLDA and LDA are 4.981 and 0.456, which implies that, on average, STLDA captures 4.981 topics per tourist attraction that LDA does not, while failing to detect only 0.456 of the topics found by LDA. The Significance indicator of STLDA is calculated for each season. Its four seasonal values are 0.564, 0.537, 0.536 and 0.563 respectively, all larger than the value of 0.355 for LDA. This result reveals that, for most attractions in a specific season, the topic with the highest probability dominates the topics generated by STLDA. In other words, the results of STLDA clearly indicate the prominent topic to which a tourist attraction belongs. Finally, we investigate the Volatility indicator. The Volatility values of STLDA are larger than that of LDA in all seasons, indicating that the topic probability values generated by STLDA differ markedly from one another. The topics found by LDA, in contrast, tend to have relatively uniform probability values, and consequently there is no obvious discrimination between them.
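As a rough consistency check on these figures, and assuming that Diversity counts the topics detected by one model but not by the other (an interpretation of the indicator definitions, not a formula stated by the authors), Diversity should equal Richness minus Coincidence for each model. The reported values agree up to rounding:

```latex
\begin{align*}
\text{Diversity}_{\text{STLDA}} &\approx \text{Richness}_{\text{STLDA}} - \text{Coincidence} = 7.838 - 2.856 = 4.982 \quad (\text{reported: } 4.981),\\
\text{Diversity}_{\text{LDA}} &\approx \text{Richness}_{\text{LDA}} - \text{Coincidence} = 3.313 - 2.856 = 0.457 \quad (\text{reported: } 0.456).
\end{align*}
```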
Investigating further the topic probability distributions of all 160 tourist attractions generated by the STLDA model, we find that not all tourist attractions show clear seasonality in every season, but all exhibit seasonal characteristics to some extent. For example, Wuyuan, the most beautiful village in China, has prominent topics in spring, summer and autumn. In winter, however, it presents a uniform probability distribution over the topics, with all corresponding probability values below 0.1. Similarly, Thousand Islet Lake shows seasonal characteristics in spring, autumn and winter but does not present distinct seasonality in summer.
These comparative results indicate that the seasonal contextual information captured by STLDA plays a very important role in forming a better topic representation of tourist attractions. By including seasonal contextual information, STLDA can model the variations of topic occurrence that reflect the changing seasonal contexts, which helps us understand more precisely which prominent topic features correspond to a specific season for a given tourist attraction. The most straightforward application of STLDA is personalized attraction recommendation. For example, suppose a tourist plans to travel during the Labour Day holiday. The holiday, which lasts three days, is one of the peak travel seasons every year. Moreover, Labour Day falls on May 1st, which coincides with spring in most areas of China. At this time, the tourist intends to see flowers. Taking the data shown in Table 4 as an example, the probability values of the blossom topic for Nalati scenic spots, Yuntai Mountain and Zhangjiajie National Forest Park generated by LDA are 0.102, 0.176 and 0.179 respectively, and there is no distinct difference between these values. A recommendation based on the results of LDA would therefore give the tour operator no clear basis for preferring one attraction over another. According to the results of STLDA, however, it is apparent that Zhangjiajie National Forest Park should be recommended to the tourist, since the probability value of the blossom topic is as high as 0.678 in spring. Recommending different tourist attractions to users in specific seasonal contexts, on the basis of the seasonal topic features of the attractions, can achieve higher satisfaction for tourists and more profit for tour operators.
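The recommendation example can be sketched in a few lines of Python. This is an illustrative sketch only, not a recommender described by the authors: the function names `rank_attractions` and `top_margin` are hypothetical, and the only data used are the LDA blossom probabilities quoted in the text.

```python
def rank_attractions(topic_prob):
    """Sort attractions by the probability of the requested topic, highest first."""
    return sorted(topic_prob.items(), key=lambda item: item[1], reverse=True)


def top_margin(ranked):
    """Gap between the best and second-best candidate.

    A small gap means the model barely discriminates between them.
    """
    return ranked[0][1] - ranked[1][1]


# LDA probabilities of the blossom topic, as quoted in the text (Table 4).
lda_blossom = {
    "Nalati scenic spots": 0.102,
    "Yuntai Mountain": 0.176,
    "Zhangjiajie National Forest Park": 0.179,
}

ranked = rank_attractions(lda_blossom)
print(ranked[0][0], round(top_margin(ranked), 3))
```

With the LDA values, the top two candidates are separated by only about 0.003. Under STLDA the same ranking would be applied to the spring probabilities of the blossom topic, for which the text reports 0.678 for Zhangjiajie National Forest Park; the spring values of the other two attractions are not given in this section, so they are not filled in here.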
There is a pressing need to summarize tourist attractions from the massive travel-related information available on the Web, with the aim of providing decision support for both tourists and tour operators. Thematic analysis of a given tourist attraction provides a good opportunity to obtain the high-level concepts that reflect the attributes of the attraction. However, the common-sense observation that tourist attractions tend to show distinct features in different seasons has been neglected by existing probabilistic topic models, although it is a valuable reference for improving the topic representation of tourist attractions. Our work addresses this gap by proposing the STLDA model, which uses the seasonal contextual information existing in attraction description documents to model the variations of topic occurrence that reveal the changing seasonal contexts. We then developed an inference algorithm using Gibbs sampling to learn the posterior distributions and model parameters of the proposed model. To the best of our knowledge, there is no similar research result on seasonal topic features of tourist attractions. We therefore conducted an empirical study comparing the performance of STLDA with the original LDA model on real-life textual data of selected tourist attractions. Experimental results demonstrate that STLDA outperforms LDA in terms of perplexity and that the seasonal contextual information contributes to the improved performance of STLDA. The results also show that the STLDA model can effectively capture season-dependent topics and that its topic representations of tourist attractions are much more comprehensive and representative than those of the basic LDA model.
Acknowledgments
This research is supported by the National Natural Science Foundation of China under Grant Nos. 71671038 and 41571133.
