Topic discovery from short reviews based on data enhancement

Abstract

With the rapid development of social media and the mobile Internet, short reviews, such as posts on Weibo and Twitter, have exploded online. Discovering topics from short reviews is significant for many practical applications: it can help identify users’ attitudes and emotions and improve customer satisfaction and the shopping experience. Because reviews are short, their sparsity considerably restricts the quality of topic discovery. To improve the effectiveness of topic discovery, we introduce the concept of data enhancement and strengthen the sentences and words in short reviews according to their importance weights. We then propose a topic model for short reviews based on data enhancement (abbreviated as DE-LDA). We verify the rationality and feasibility of DE-LDA on real datasets. Results show that the proposed method outperforms the benchmarks in topic discovery and also yields better clustering.

Keywords

Short reviews; topic discovery; data enhancement; clustering

1. Introduction

With the rapid development of social media and the mobile Internet, reviews play an increasingly important role in online social networks. Short reviews, such as posts on Weibo and Twitter, have exploded online in recent years. According to Weibo’s 2020 fourth-quarter earnings report, Weibo’s monthly active users increased to 521 million and its daily active users rose to 225 million. Given the hundreds of millions of short reviews sent by users daily, topic discovery from reviews is crucial for practical applications. It can help managers analyze stock price movements [1], study the evolution of public opinion [2], identify users’ emotions [3], and explore the impact of reviews on box office and sales [4, 5].

Topic models are effective and popular methods for discovering topics and converting unstructured data into structured data in normal texts, such as long reviews. They uncover latent topics by capturing document-level word co-occurrence patterns. Generally, a document is modeled as a probability distribution over topics, and a topic as a probability distribution over words. Traditional topic models, such as LDA [30] and pLSA [6], have proved effective for topic discovery in long reviews, but applying them directly to short reviews does not work well. The small number of words in a short review provides little context, which results in sparse data and minimal co-occurrence of feature words. Therefore, using topic models to discover topics from short reviews is a major challenge.

Scholars have studied several ways to apply topic models to short reviews effectively. One common method to alleviate the sparsity problem is to aggregate short reviews into long documents or pseudo-documents before training a standard topic model [7–10]. Although this yields more realistic results than traditional topic models, aggregated documents compromise the integrity and coherence of the original documents. Making strong assumptions about short reviews is another common approach [11–13]. Phelan et al. [11] assume that each short review is made up of a single topic, and Gruber et al. [12] suppose that the words in each sentence come from the same topic. In reality, however, short reviews and sentences may be drawn from multiple topics. Strong assumptions can cause problems, such as losing the flexibility to capture different topic elements in one document and suffering from overfitting. A third way is to introduce external and domain knowledge for short reviews [14], but acquiring external knowledge is difficult, time-consuming, and not universal.

To some extent, these three methods can alleviate the sparsity problem of short reviews and effectively increase word co-occurrence. However, they either destroy the integrity and coherence of documents, impose strong assumptions on documents and sentences, or are time-consuming and hard to generalize. Moreover, they do not distinguish the importance of different sentences or words, although within a single document the importance of sentences and words differs considerably. To alleviate the sparsity of short reviews and increase word co-occurrence, we introduce the concept of data enhancement for short reviews, consider the importance of different sentences and words, and then propose a topic model for short reviews based on data enhancement (referred to as DE-LDA). Data enhancement is a method for insufficient data that makes the most of limited data and produces a substantial amount of information. Using data enhancement in this study can effectively alleviate the sparsity of short reviews.

Compared with usual topic models for short reviews, the major advantages of DE-LDA are that 1) DE-LDA explicitly distinguishes the importance of different sentences and words; and 2) DE-LDA increases word co-occurrence through data enhancement. We first identify the importance of different sentences and words, determine the “stretching” operation based on different importance with the help of data enhancement, and enrich training data. Then, we utilize DE-LDA to model documents and obtain the final distributions of document-topic and topic-word after optimization. We verify the rationality and feasibility of our method in the Weibo and Yelp reviews datasets. The experimental results show that DE-LDA can effectively alleviate the sparsity of short reviews caused by the lack of rich context and word co-occurrence and improve the performance of topic discovery and clustering in practical application.

The rest of the study is organized as follows: related works are described in Section 2. Section 3 introduces the construction and inference of DE-LDA. Section 4 presents the experimental results of DE-LDA, and Section 5 discusses the clustering application. Finally, the last section concludes the study.

2. Related works

With the rapid development of social media, topic models have been applied to analyze content in a variety of tasks. In the absence of topic models designed for short reviews, some researchers use traditional (or slightly modified) topic models directly for analysis [34, 35]. One common method to alleviate the sparsity problem is to aggregate short reviews into long documents or pseudo-documents before training a standard topic model. Some scholars aggregate tweets posted by one user [7] or tweets with the same hashtag into one document [8]. Quan et al. and Zuo et al. assume that each short review is part of a pseudo-document and has the same proportion of topics as the pseudo-document [9, 10]. Although aggregating short reviews into long documents or pseudo-documents yields more realistic results than traditional topic models, aggregated documents compromise the integrity and coherence of the original documents.

Making strong assumptions about short reviews is another common approach. Phelan et al. [11] assume that each short review is made up of a single topic, and Gruber et al. [12] suppose that the words in each sentence come from the same topic. Cheng et al. [13] define an unordered word pair that co-occurs in a short review as a biterm and assume that a biterm is generated from the same topic. In reality, however, short reviews, sentences, and biterms may be drawn from multiple topics. Strong assumptions can cause problems, such as losing the flexibility to capture different topic elements in one document and suffering from overfitting. Another way is to introduce external and domain knowledge for short reviews. An effective method is to use domain knowledge to identify words with high or low probability in each topic and then use the Dirichlet Forest distribution to change the prior [14]. This method implicitly distinguishes the importance of words in documents. However, acquiring external knowledge is difficult, time-consuming, and not universal.

Based on the above, we propose a topic modeling method based on data enhancement, abbreviated as DE-LDA, that does not introduce any external knowledge. The method distinguishes the importance of sentences and words in documents, thereby reducing the impact of text sparsity and improving the effectiveness of topic modeling on short texts.

3. Construction and inference of DE-LDA

3.1 Data enhancement

Data enhancement is a method proposed for cases in which data is insufficient or missing, and it has been applied in the image field [15, 16], the health care field [31], the business field [32], and geophysics [33]. In image recognition, ideal results are easily obtained when a large amount of data is used to train models. To leverage data effectively when data is limited, researchers usually use operations such as scaling, rotating, cropping, and toning to enhance and enrich image training data. For example, cropping an image in different ways, making an object appear in different proportions at different locations, and adjusting brightness, contrast, saturation, and hue are common operations. Data enhancement, which generates equivalent data from limited data, can reduce the dependence of the model on particular attributes and improve its generalization ability. If useful information is extracted effectively from low-quality data, the model can achieve good results [17]. Data enhancement has been proved effective on image datasets [18, 19, 20].

Owing to platform limits, user habits, and other factors, the text posted by users is relatively short. For example, Weibo requires posts to be less than 140 characters. The short and sparse content of short reviews makes the expression of information inadequate and informal; hence, extracting effective information and discovering topics is difficult. To enrich the data, we adapt data enhancement from the image field to short reviews, so that equivalent and effective data is generated from limited datasets to train topic models. In reality, speakers often emphasize key sentences to highlight key points, and authors often repeat words, phrases, or sentences to highlight emotions and convey vital information. Emphasizing and repeating important sentences and words highlights key information and makes it easier to extract.

Given that the units of short reviews are usually sentences and words, we apply data enhancement to sentences and words in this study. Emphasizing and repeating some sentences or words in short reviews is an effective way of applying data enhancement. For example, a speaker’s point, when writing “let’s have lunch together”, is to emphasize having lunch together. After data enhancement, the short review can be regarded as “let’s have lunch together, having lunch together”. Applying data enhancement to short reviews adds information without changing the original information and makes the information easier to extract. In this study, by distinguishing the importance of sentences and words and repeating the important ones, we propose a topic model for short reviews based on data enhancement (abbreviated as DE-LDA).

Figure 1 shows the flow chart of DE-LDA. The process is organized as follows: 1) determining the importance weights of sentences; 2) applying TextRank to calculate the importance weights of words; 3) applying data enhancement based on the importance weights of sentences and words, running DE-LDA to obtain new sentence importance weights, and calculating the rate of change between two iterations; 4) using the importance weights from step 3, repeating from step 2 until the rate of change falls below the threshold; 5) obtaining the final document-topic and topic-word distributions.

Figure 1.

Basic flow chart of DE-LDA.

3.2 Discrimination of enhanced data

When we apply data enhancement to topic modeling for short reviews, we first distinguish the importance of sentences and words and then identify the dataset that needs to be enhanced. We use topic models to measure the importance weight of sentences. Topic modeling assumes that each document consists of hidden topics $z\in[1,K]$ , which are described by a set of words. The word that has a high probability under the topic calculated by topic models can reasonably express the meaning of the topic. Therefore, if a sentence contains a word that has a high probability under the topic, then the sentence is more vital and has a higher importance weight than other sentences [21]. Based on this assumption, we utilize topic-word distribution to measure the importance weight of sentences.

According to the topic-word distribution, we can acquire the top n words under different topics and the sentences in which the top n words are placed. The sentence s consists of $S$ words, $s=\{{w_{1}\cdots w_{S}}\}$ , and the position of words under different topics is $\{{n({w_{1}}),\cdots,n({w_{S}})}\}$ . The smaller the value of $n$ is, the higher the probability of words under topics. Therefore, the weight of sentence s depends on the smallest position of words in sentence s, which is shown as $\text{Min}\{{n({w_{1}}),\cdots,n({w_{S}})}\}$ . The measurement formula that defines the importance of each sentence is shown as follows:

$\displaystyle\pi=\sum\limits_{i=1}^{t}a_{i}\ast\delta_{A_{i}}(\text{Min}\{n(w_{1}),\cdots,n(w_{S})\})\ast\Delta$ (1)

$\displaystyle\delta_{A_{i}}(\text{Min}\{n(w_{1}),\cdots,n(w_{S})\})=\left\{\begin{array}{ll}1&\text{if }\text{Min}\{n(w_{1}),\cdots,n(w_{S})\}\in A_{i}\\ 0&\text{if }\text{Min}\{n(w_{1}),\cdots,n(w_{S})\}\notin A_{i}\end{array}\right.$ (2)

In this study, we divide the positions of words in the topic-word distribution into $t$ intervals $A_{1},\cdots,A_{t}$ . The interval position coefficient $a_{i}$ represents the importance of interval $A_{i}$ . $\Delta$ , the core parameter for data enhancement, indicates the degree to which the current sentence should be enhanced. If $\Delta$ is too small, it cannot show the importance of words; if it is too large, it distorts the document representation. $\delta_{A_{i}}$ is the indicator function of $A_{i}$ . A step function is thus used to measure the importance weight of the sentence. The step function not only reflects the topic information in the sentence but also avoids repeated calculation, because the intervals control the calculation for sentences.
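As an illustration only, Eqs. (1) and (2) can be sketched in Python as follows. The interval bounds and coefficients are the ones reported in Section 4.1; the function names, the use of a single topic's ranked word list, and the treatment of words missing from that list (rank taken as infinite) are assumptions of this sketch, not details given in the text.

```python
import math

# Intervals A_1..A_4 as (low, high] pairs and coefficients a_1..a_4 from Section 4.1.
# A_1 = [1, 10] is written as (0, 10] because ranks are positive integers.
INTERVALS = [(0, 10), (10, 20), (20, 30), (30, math.inf)]
COEFFS = [3, 2, 1, 0]


def min_rank(sentence, topic_ranking):
    """Smallest 1-based position n(w) of any word of the sentence in one topic's ranked word list."""
    rank_of = {w: i + 1 for i, w in enumerate(topic_ranking)}
    # Assumption: a word that is absent from the ranking gets an infinite rank.
    return min((rank_of.get(w, math.inf) for w in sentence), default=math.inf)


def sentence_weight(sentence, topic_ranking, delta=2.5):
    """Eq. (1): pi = sum_i a_i * indicator(min rank in A_i) * delta."""
    r = min_rank(sentence, topic_ranking)
    return sum(a * delta for (low, high), a in zip(INTERVALS, COEFFS) if low < r <= high)
```

Because the intervals do not overlap, at most one term of the sum is non-zero, so a sentence whose best word sits at position 12 receives the weight $a_{2}\ast\Delta$ .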

Although the importance weight of sentences, calculated by the topic model, can locate valid information, the importance of words in the identical sentence is different. TextRank is utilized to measure the importance weight of words in sentences. TextRank is a method that calculates word weights in a sentence to extract text keywords [22]. The process of TextRank in this study is divided into four steps:

1) Identifying the units in sentence s and adding them to the graph model as nodes. Sentence s has S units because it contains S words.

2) Confirming the relationships between units and adding them to the graph model as edges. The relationships between text units are based on co-occurrence between words. If two words occur in the same co-occurrence window cw, whose size is often set between 2 and 10, a relationship between them is established. cw is an important tuning parameter in TextRank and affects the accuracy of the word importance weights.

3) Giving each node an initial value. We set the initial value to 1 in this paper and run the algorithm until it converges.

4) Ordering the nodes by weight and acquiring the final scores. The results are defined as the vector $B$ , $B=\{{({S({w_{1}}),\cdots,S({w_{S}})})}\}$ . $S({w_{\ast}})$ represents the importance score of word $w_{\ast}$ in sentence s.
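The four steps above can be sketched in plain Python as follows. This is a minimal, unweighted and undirected variant of TextRank written for illustration; the damping factor 0.85, the convergence tolerance, and the function name are assumptions of the sketch rather than settings reported in this paper.

```python
from collections import defaultdict


def textrank_words(words, cw=2, damping=0.85, tol=1e-6, max_iter=200):
    """Score the words of one sentence on an unweighted, undirected co-occurrence graph."""
    if not words:
        return {}
    # Steps 1 and 2: words are nodes; two distinct words are linked when they
    # fall inside the same window of cw consecutive tokens.
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for u in words[i + 1:i + cw]:
            if u != w:
                neighbors[w].add(u)
                neighbors[u].add(w)
    nodes = list(dict.fromkeys(words))
    # Step 3: every node starts with the value 1, then scores are updated until convergence.
    score = {w: 1.0 for w in nodes}
    for _ in range(max_iter):
        new = {}
        for w in nodes:
            rank = sum(score[u] / len(neighbors[u]) for u in neighbors[w])
            new[w] = (1 - damping) + damping * rank
        converged = max(abs(new[w] - score[w]) for w in nodes) < tol
        score = new
        if converged:
            break
    # Step 4: the caller can sort the returned scores S(w) to order the words.
    return score
```

With cw set to 2 only adjacent words are linked, so a five-word sentence forms a chain in which the first and last words receive the lowest scores.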

After measuring the importance weights of the sentences and words in short reviews, we use both weights together to enhance the data. To describe topics better through the ordering of words, we use the importance weights to determine how many times each word appears in a document. The higher the words that describe a topic effectively are ranked, the better the quality of the topic is. In this study, we stipulate that the more important a sentence is, the more often its words appear, and the more important a word is, the more frequently it appears. We define the numbers of occurrences of the words in the original sentence s as $({\textit{num}_{w_{1}},\cdots,\textit{num}_{w_{S}}})$ , and the numbers of words added to sentence s after data enhancement as:

$\displaystyle(\textit{num}_{w_{1}}^{\ast},\cdots,\textit{num}_{w_{S}}^{\ast})=(\lceil\pi_{s}\ast S(w_{1})\rceil-\textit{num}_{w_{1}},\cdots,\lceil\pi_{s}\ast S(w_{S})\rceil-\textit{num}_{w_{S}})$ (3)

$\displaystyle\lceil x\rceil=\text{Min}\{n\in Z|x\leqslant n\}$ (4)

We obtain enhanced short reviews after applying data enhancement to original reviews. The set of original words in document d is $W_{d}=\{{\textit{num}_{w_{1}},\cdots,\textit{num}_{w_{V}}}\}$ , and the set of added words after data enhancement is $W_{d}^{\ast}=\{{\textit{num}_{w_{1}}^{\ast},\cdots,\textit{num}_{w_{V}}^{\ast}}\}$ . Thus, the set of words in document $d$ after data enhancement can be regarded as $d=\{{W_{d},W_{d}^{\ast}}\}$ .
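A small sketch of Eqs. (3) and (4) may make the "stretching" operation concrete. The function names are hypothetical, and clamping negative results to zero (so that no original word is ever removed) is an assumption of this sketch; the text does not say how a negative value of Eq. (3) is handled.

```python
import math
from collections import Counter


def added_counts(sentence, word_scores, pi_s):
    """Eq. (3): extra copies per word = ceil(pi_s * S(w)) - current count of w."""
    counts = Counter(sentence)
    # Assumption: negative results are clamped to 0, so original words are never removed.
    return {w: max(0, math.ceil(pi_s * word_scores[w]) - n) for w, n in counts.items()}


def enhance(sentence, word_scores, pi_s):
    """Return the original words (W_d) followed by the added copies (W_d*)."""
    extra = added_counts(sentence, word_scores, pi_s)
    added = [w for w in dict.fromkeys(sentence) for _ in range(extra[w])]
    return list(sentence) + added
```

For example, with a sentence weight of 2.5 and a word score of 0.5, a word that occurs once is repeated $\lceil 1.25\rceil-1=1$ more time.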

3.3 DE-LDA

We propose a topic model for short reviews based on data enhancement (referred to as DE-LDA). Our method explicitly distinguishes the importance of different sentences and words and eliminates the weakness of the traditional topic model for short reviews. It not only alleviates the sparsity in short reviews and increases the number of words that describe topics, but also has an improved and accurate effect on document-topic distribution ${\theta}$ and topic-word distribution $\varphi$ .

In this section, we assume that each document consists of multiple topics and that words are generated by each topic. Specifically, in DE-LDA, topics and words come from two sources: original documents and enhanced documents. Figure 2 shows the generation process.

Figure 2.

Generative process of DE-LDA.

Figure 3.

Probabilistic graphical model of DE-LDA.

Figure 3 shows the probabilistic graphical model of DE-LDA. The joint probability distribution $p_{\textit{DE-LDA}}({w,z|{\alpha,\beta}})$ of DE-LDA can be obtained as follows:

$\displaystyle p_{\textit{DE-LDA}}(W,W^{\ast},Z|\alpha,\beta)=P(W|Z,\beta)P(W^{\ast}|Z^{\ast},\beta)P(Z|\alpha)P(Z^{\ast}|\alpha)$

$\displaystyle\quad=\int P(Z|\theta)P(Z^{\ast}|\theta)P(\theta|\alpha)d\theta\ast\int P(W^{\ast}|Z^{\ast},\varphi)P(W|Z,\varphi)P(\varphi|\beta)d\varphi$ (5)

3.4 Inference by Gibbs sampling

We choose the Gibbs sampler to infer the joint probability distribution of DE-LDA. The Gibbs sampler is a simple and widely used Monte Carlo algorithm. Compared with other inference algorithms for topic models, such as variational inference and maximum a posteriori estimation, the Gibbs sampler improves the accuracy of results by approximating the correct distribution asymptotically and scales easily to large datasets [23]. Figure 4 shows the process of the Gibbs sampler for DE-LDA.

Figure 4.

Process of Gibbs sampler for DE-LDA.

The hidden variables in the joint probability distribution are eliminated through integration, and a topic is then sampled for each word. Once the topic of each word is determined, the parameters in Eq. (5) can be calculated by counting frequencies. Thus, the purpose of using the Gibbs sampler to infer parameters is to calculate the conditional probability of the topic sequence given the word sequence. After inferring the topics for the sets of words $W_{d}$ and $W_{d}^{\ast}$ in document d, we can obtain the final topic-word distribution. The sampling equation is as follows:

$\displaystyle P_{\textit{DE-LDA}}(z_{d,q}|(W_{d}\cup W_{d}^{\ast}),Z_{d,-q},\alpha,\beta)\propto\frac{\textit{num}_{W_{d}}^{(k)}+\textit{num}_{W_{d}^{\ast}}^{(k)}+\alpha-1}{N_{d}+N_{d}^{\ast}+K\alpha-1}\ast\frac{n_{k,-q}+\beta}{n_{k}+V\beta-1}$ (6)

In the above formula, $z_{d,q}$ represents the index of the topic for the qth word in document d; $(W_{d}\mathop{\cup}\nolimits W_{d}^{\ast})$ is the set of words in document d; and $Z_{d,-q}$ represents the set of topic assignments in document d excluding the qth word. $\textit{num}_{W_{d}}^{(k)}$ is the number of words under topic $k$ in $W_{d}$ and $\textit{num}_{W_{d}^{\ast}}^{(k)}$ is the number of words under topic $k$ in $W_{d}^{\ast}$ . $N_{d}$ represents the number of words in $W_{d}$ and $N_{d}^{\ast}$ represents the number of words in $W_{d}^{\ast}$ . $n_{k}$ is the number of words under the kth topic, and $n_{k,-q}$ represents the number of words under the kth topic excluding the qth word. $V$ is the vocabulary size. Finally, Eq. (7) shows the document-topic distribution $\theta_{d,k}$ and Eq. (8) shows the topic-word distribution $\Phi_{k,q}$ .

$\displaystyle\theta_{d,k}=\frac{\textit{num}_{W_{d}}^{(k)}+\textit{num}_{W_{d}^{\ast}}^{(k)}+\alpha-1}{N_{d}+N_{d}^{\ast}+K\alpha-1}$ (7)

$\displaystyle\Phi_{k,q}=\frac{n_{k,-q}+\beta}{n_{k}+V\beta-1}$ (8)
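For orientation, the sketch below runs a standard collapsed Gibbs sampler for LDA over documents that already contain both the original and the added tokens, so each document's topic counts cover $W_{d}$ and $W_{d}^{\ast}$ together. It is not the authors' implementation: it uses the textbook update and the usual smoothed estimates, which differ from Eqs. (6) to (8) in the $-1$ terms, and all names and default values are assumptions.

```python
import random


def gibbs_lda(docs, K, alpha, beta, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenised (already enhanced) documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    wid = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    n_dk = [[0] * K for _ in docs]       # tokens of document d assigned to topic k
    n_kw = [[0] * V for _ in range(K)]   # occurrences of word v assigned to topic k
    n_k = [0] * K                        # all tokens assigned to topic k
    z = []
    for d, doc in enumerate(docs):       # random initial topic assignment
        z.append([rng.randrange(K) for _ in doc])
        for w, k in zip(doc, z[d]):
            n_dk[d][k] += 1
            n_kw[k][wid[w]] += 1
            n_k[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for q, w in enumerate(doc):
                k, v = z[d][q], wid[w]
                # Remove the q-th token from the counts before resampling its topic.
                n_dk[d][k] -= 1
                n_kw[k][v] -= 1
                n_k[k] -= 1
                weights = [(n_dk[d][j] + alpha) * (n_kw[j][v] + beta) / (n_k[j] + V * beta)
                           for j in range(K)]
                k = rng.choices(range(K), weights=weights)[0]
                z[d][q] = k
                n_dk[d][k] += 1
                n_kw[k][v] += 1
                n_k[k] += 1
    # Smoothed count estimates of the document-topic and topic-word distributions.
    theta = [[(n_dk[d][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
             for d, doc in enumerate(docs)]
    phi = [[(n_kw[k][v] + beta) / (n_k[k] + V * beta) for v in range(V)] for k in range(K)]
    return theta, phi, vocab
```

Each row of the returned `theta` and `phi` sums to one, which gives a quick sanity check on the counts.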

After obtaining the document-topic and topic-word distributions, we use the change rate of coherence between two iterations as the stopping criterion. Coherence is a performance metric for evaluating the quality of topics [24]. It is based on the property that words under the same topic often co-occur in the same document. Compared with other metrics, the coherence score corresponds well with human coherence judgments and makes it possible to identify specific semantic problems in topic models without human evaluation or external reference corpora. For a given topic $z$ and its top $T$ words $V^{(z)}=({v_{1}^{(z)},\ldots,v_{T}^{(z)}})$ , ordered by $P(w|z)$ , coherence is defined as follows:

$\displaystyle C(z;V^{(z)})=\sum\limits_{t=2}^{T}\sum\limits_{l=1}^{t-1}\log\frac{D(v_{t}^{(z)},v_{l}^{(z)})+1}{D(v_{l}^{(z)})}$ (9)

In Eq. (9), $D(v)$ represents the number of documents that contain word $v$ , and $D(v,v^{\prime})$ represents the number of documents that contain both word $v$ and word $v^{\prime}$ . For $V^{(z)}=({v_{1}^{(z)},\ldots,v_{t}^{(z)},\cdots,v_{T}^{(z)}})$ , $v_{t}^{(z)}$ is the tth word under topic $z$ , and we set $T$ to 30 in this paper. To avoid taking the logarithm of zero when ${D}({{v}_{t}^{({z})},{v}_{l}^{({z})}})$ is 0, we set the smoothing parameter to 1. Coherence values are negative; the higher the value, the better the quality of the topic.
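Eq. (9) translates almost directly into code. The sketch below is illustrative; the function name is hypothetical, and it assumes that every top word occurs in at least one document, so that $D(v_{l}^{(z)})$ is never zero.

```python
import math


def coherence(top_words, docs):
    """Eq. (9): sum over ordered word pairs of log((D(v_t, v_l) + 1) / D(v_l))."""
    doc_sets = [set(doc) for doc in docs]
    # D(v): number of documents that contain word v.
    df = {v: sum(v in s for s in doc_sets) for v in top_words}
    total = 0.0
    for t in range(1, len(top_words)):
        for l in range(t):
            v_t, v_l = top_words[t], top_words[l]
            # D(v_t, v_l): number of documents that contain both words.
            co = sum(v_t in s and v_l in s for s in doc_sets)
            total += math.log((co + 1) / df[v_l])
    return total
```

The list `top_words` is expected to be ordered by $P(w|z)$ , with the most probable word first.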

During the iterative update, we continue iterating until the change rate between two successive iterations falls below the threshold, and then acquire the final document-topic and topic-word distributions. We set the threshold to 0.03.
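The stopping rule can be sketched as below. The text does not give a formula for the change rate, so the relative change $|C_{new}-C_{old}|/|C_{old}|$ used here is an assumption, as are the function names, the cap on the number of rounds, and the hypothetical callable `step`, which stands for one round of enhancement plus DE-LDA and returns the resulting average coherence.

```python
def change_rate(previous, current):
    """Relative change between two successive coherence values (assumed definition)."""
    return abs(current - previous) / abs(previous)


def iterate_until_stable(step, threshold=0.03, max_rounds=50):
    """Call step() repeatedly until the coherence change rate drops below threshold."""
    previous = step()
    for rounds in range(1, max_rounds + 1):
        current = step()
        if change_rate(previous, current) < threshold:
            return current, rounds
        previous = current
    # Assumption: give up after max_rounds and return the last value seen.
    return previous, max_rounds
```

Because coherence values are negative rather than zero, the division in `change_rate` is well defined in practice.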

4. Evaluation of topic quality

4.1 Experimental setting

To validate the quality of topic discovery by using our method, we apply DE-LDA to several practical short review datasets. Firstly, we explore the best values of ${\Delta}$ and cw and utilize the optimal value for experiments. Then, we evaluate the quality of topics by utilizing quantitative and qualitative methods to estimate and analyze topic-word distribution. In our experimental setting, several traditional topic models are regarded as baselines, which include pLSA (probabilistic Latent Semantic Analysis), LDA (Latent Dirichlet Allocation), Mix (Mixture of Unigram), and BTM (Biterm Topic Model).

Most topic models contain two hyperparameters, $\alpha$ and $\beta$ , which represent the priors of the document-topic and topic-word distributions, respectively. Following previous studies, we set these hyperparameters to 1/K and 0.01, where K is the manually specified number of topics. DE-LDA has an additional hyperparameter that needs to be set: t, the number of intervals introduced in Section 3.2, is set to 4. The four intervals are $A_{1}=[1,10]$ , $A_{2}=(10,20]$ , $A_{3}=(20,30]$ , and $A_{4}=(30,\infty)$ , with $a_{1}=3$ , $a_{2}=2$ , $a_{3}=1$ , and $a_{4}=0$ .

4.2 Dataset

We use two real-world datasets in our experiments: Weibo (http://www.weibo.com) and Yelp reviews (https://www.yelp.com/dataset). For the Weibo dataset, we first identify the categories of the hottest Weibo posts, of which there are 28, and then crawl the content under these categories from April 9, 2016, to June 28, 2016. After preprocessing, such as removing numbers, standard stop words, and special characters, we obtain 50,999 Weibo comments with a vocabulary of 92,851 words and an average document length of 16.5. The Yelp reviews dataset is a public dataset in which each review has a category tag. It contains 599,699 tagged reviews in total, from which we randomly select 50,000 reviews as our dataset. After preprocessing, our dataset has 834 categories and 109,238 distinct words, and the average document length is 35.

4.3 Baselines

We choose four different traditional topic models as our compared methods in this study:

1) pLSA [6] is an early topic model without priors. It selects a topic for each document with a certain probability and selects words from that topic with a certain probability. We implement pLSA using mltool4j (https://code.google.com/archive/p/mltool4j).

2) LDA is the best-known probabilistic topic model; it uses a hierarchical Bayesian graphical model and assumes that the document-topic and topic-word distributions have prior distributions. We use jGibbLDA (http://jgibblda.sourceforge.net) to implement LDA.

3) Mix [25] assumes that each document is generated from a single topic.

4) BTM [21] defines an unordered word pair that co-occurs in the same document as a biterm and assumes that the two words of a biterm are generated from the same topic. For a biterm $b=(w_{i},w_{j})$ , the generation probability of the biterm is

$\displaystyle{P}({b})=\mathop{\sum}\limits_{z}P(z)P({w_{i}{|}z})P(w_{j}|z)$ (10)

We use the program provided by the author (https://github.com/WHUIR/BTM) to implement BTM.
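As a small illustration of the biterm idea and of Eq. (10), the following sketch extracts the biterms of a short document and evaluates $P(b)$ for given distributions. It is unrelated to the BTM program cited above; the function names and the list-of-dictionaries representation of $P(w|z)$ are assumptions.

```python
from itertools import combinations


def biterms(doc):
    """All unordered word pairs that co-occur in one short document."""
    return list(combinations(doc, 2))


def biterm_probability(b, p_z, p_w_given_z):
    """Eq. (10): P(b) = sum_z P(z) * P(w_i|z) * P(w_j|z)."""
    w_i, w_j = b
    return sum(p_z[z] * p_w_given_z[z][w_i] * p_w_given_z[z][w_j] for z in range(len(p_z)))
```

Here `p_z[z]` holds $P(z)$ and `p_w_given_z[z][w]` holds $P(w|z)$ .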

4.4 Evaluation methods

To evaluate the quality of topic discovery by topic models, we use coherence as the evaluation metric. To explore the influence of different $\Delta$ and cw on the results of DE-LDA, we select defeat_ratio [26] as an evaluation metric:

$\displaystyle\textit{defeat\_ratio}=\frac{N_{\textit{DE-LDA}}}{N_{\textit{DE-LDA}}+N_{\textit{LDA}}}$ (11)

where $N_{\textit{DE-LDA}}$ represents the number of topics for which the coherence obtained by DE-LDA is better than that obtained by LDA, and $N_{\textit{LDA}}$ is defined conversely. $N_{\textit{DE-LDA}}+N_{\textit{LDA}}$ equals the number of topics that we set. When the value of defeat_ratio is greater than 0.5, DE-LDA has better topic quality than LDA.
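Eq. (11) can be computed as in the sketch below. How the topics of the two models are paired is not specified in the text, so the sketch simply assumes two equally long lists of per-topic coherence values in matching order, and it counts only strict wins.

```python
def defeat_ratio(coherence_de_lda, coherence_lda):
    """Eq. (11): share of topics on which DE-LDA reaches the higher coherence."""
    pairs = list(zip(coherence_de_lda, coherence_lda))
    n_de_lda = sum(a > b for a, b in pairs)  # topics where DE-LDA is better
    n_lda = sum(b > a for a, b in pairs)     # topics where LDA is better
    # Assumption: exact ties are left out of both counts.
    return n_de_lda / (n_de_lda + n_lda)
```

A value above 0.5 then means that DE-LDA wins on more than half of the compared topics.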

4.5 Best parameters of $\Delta$ and cw

To determine the best values of $\Delta$ and cw, we study these parameters on the Weibo dataset. When measuring the importance weight of sentences, $\Delta$ represents the value of enhancement and reflects the degree to which the current sentence should be enhanced. When measuring the importance weight of words, the co-occurrence window cw has a distinct effect on the relationships between words in the same sentence: if two words are in the same co-occurrence window cw, a relationship between them is built. To explore the influence of different $\Delta$ and cw, we use Eq. (11) and report the average defeat_ratio over topic numbers from 10 to 50.

When exploring the influence of different $\Delta$ on the results of DE-LDA, we fix cw at 2 and vary $\Delta$ from 0.5 to 5 with a step size of 0.5. Figure 5 shows the resulting average defeat_ratio. All values are greater than 0.5, indicating that DE-LDA consistently achieves better topic quality than LDA. The defeat_ratio reaches its highest point when $\Delta$ is 2.5. Hence, the optimal parameter is $\Delta=2.5$ when cw is set to 2.

Figure 5.

defeat_ratio under different $\Delta$ settings.

Figure 6.

defeat_ratio under different cw settings.

The co-occurrence window cw influences the relationships between words. The value of cw is often set between 2 and 10. With $\Delta=2.5$ , we explore the influence of cw by setting it to 2, 3, 4, 5, 6, 7, 8, 9, and 10. Figure 6 shows the values of defeat_ratio under different cw. No obvious changes are observed as cw increases. Moreover, all values are greater than 0.5, and the highest value is reached when cw is set to 2, 7, and 9. This shows that cw has no considerable effect on the topic quality obtained by DE-LDA in this study. Finally, we set cw to 2 in our experiments.

4.6 Quality of topic discovery

To evaluate the quality of topic discovery by DE-LDA, we first assess the quality quantitatively on the reviews in the Weibo and Yelp datasets. We select coherence as the metric and set the number of topics to range from 10 to 50. Table 1 shows the coherence results and their averages.

Table 1
Results of coherence

Corpus	Topic size	DE-LDA	LDA	pLSA	BTM	Mix
Weibo	Topic-10	-4196.9022	-4291.7067	-4327.2470	-4265.8136	-4213.6865
	Topic-20	-3871.7301	-4130.1870	-4247.4965	-4039.3945	-4121.9430
	Topic-30	-3969.4669	-4061.9824	-4173.4582	-3984.8890	-4050.2706
	Topic-40	-3930.7200	-3947.9148	-4101.9045	-3980.9788	-3969.8942
	Topic-50	-3879.9150	-3999.8929	-4033.7333	-3904.2988	-3973.0014
	Average	-3969.7469	-4086.3368	-4176.7679	-4035.0749	-4065.7591
Yelp	Topic-10	-4412.5435	-4451.2375	-4428.0961	-4425.1632	-4455.9510
	Topic-20	-4351.0095	-4411.2493	-4389.6218	-4414.5404	-4453.4602
	Topic-30	-4307.5167	-4382.0581	-4356.7621	-4406.6671	-4432.0919
	Topic-40	-4255.1896	-4362.3157	-4324.7548	-4399.4473	-4411.7570
	Topic-50	-4217.6672	-4355.9979	-4301.9664	-4392.7547	-4419.3578
	Average	-4308.7853	-4392.5717	-4360.2402	-4407.7145	-4434.5236

Table 2

Top 20 words under “health” and “movies”

	DE-LDA	LDA	pLSA	BTM	Mix
Health	TCM, body, health, food, fitness, human body, cough, treatment, porridge, effect, cold, Chinese medicine, stomach, massage, tea, constipation, in vivo, improvement, lung	TCM, tea, porridge, health, effect, results, fire, treatment, Chinese medicine, times, cough, red dates, cold, fresh ginger, spleen and stomach, liver, vinegar, blood tonic, habitus, honey	treatment, sick person, TCM, plan, disease, prevention, fight, inspect, reading, cancer, symptom, fitness, tumors, surgery, illness, diabetes, interview, diagnosis, effect, patient	TCM, health, porridge, times, treatment, effect, tea, skin, massage, cough, red dates, fire, cold, stomach, body, rub, lung, fresh ginger, liver, blood tonic	TCM, treatment, health, food, fitness, body, cough, effect, fight, results, skin, porridge, cold, reading, human body, tea, nutrition, prevention, massage, fruits
Movie	Movie, story, director, exhibition, drama, online, documentary, film, loving, star, exposure, Chinese, fiction, poster, action, North American, prevue, film fest, animation, comedy	Movie, story, drama, link, online, Chinese, loving, thriller, film, comedy, star, Douban, director, compilations, animation, action, world, subtitle, South Korea, short film	Movie, loving, dream, drama, action, online, star, director, story, film, touching, character, beauty, father, handsome, alliance, exhibition, image, comedy, do	Movie, director, exhibition, film, star, exposure, poster, Beijing, set file, shoot, China, portray, hunter, film fest, prevue, Three’s Company, North American, American, story, world	Link, movie, American, drama, captain, director, exhibition, star, story, film, Chinese, subtitle, subtitle team, online, world, game, North American, thriller, comedy, power

On the Weibo dataset, DE-LDA achieves higher coherence than every baseline for all five topic numbers, which indicates that it generates higher-quality topics and extracts topic features better. In terms of average coherence, the baselines rank BTM, Mix, LDA and pLSA, from best to worst. Thus, on the Weibo dataset, the topic models proposed for short reviews outperform the topic models designed for normal reviews. On the Yelp dataset, DE-LDA also achieves higher coherence than every baseline. pLSA outperforms LDA for every number of topics, which suggests that the prior in LDA brings no advantage for the short reviews in this dataset. BTM and Mix have the lowest average coherence on Yelp. Overall, DE-LDA obtains the best coherence on both datasets and therefore the best quality of topic discovery.
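As an illustration of the kind of metric reported in Table 1, the following is a minimal sketch of a UMass-style coherence score in the spirit of Mimno et al. [24]. Whether the values in Table 1 were computed with exactly this variant, and how scores were aggregated over topics, is an assumption; the function name `umass_coherence` and the toy corpus are hypothetical.

```python
import math


def umass_coherence(top_words, documents):
    """UMass-style coherence of one topic (sketch, not the authors' code).

    top_words: words of the topic, ordered from most to least probable.
    documents: iterable of tokenised documents (lists of words).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            denom = doc_freq(top_words[l])
            if denom == 0:
                continue  # assumption: skip words that never occur
            co_occurrences = doc_freq(top_words[m], top_words[l])
            # The +1 smoothing keeps the logarithm finite.
            score += math.log((co_occurrences + 1) / denom)
    return score


toy_corpus = [["tea", "health"], ["tea", "health", "cold"], ["movie", "director"]]
print(umass_coherence(["tea", "health"], toy_corpus))
print(umass_coherence(["tea", "movie"], toy_corpus))
```

Words that co-occur in many documents give a score closer to zero, so a higher (less negative) value indicates a more coherent topic, which matches the reading of Table 1.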

To evaluate the discovered topics qualitatively, we sample some topics for visualization. Table 1 shows that the coherence generally increases with the number of topics, so we set the number of topics to 50 and sample two topics from the Chinese dataset. We first collect the top 5 words of each topic and use them to identify the meaning of the topic. We then randomly choose two topics (i.e., health and movies) from those whose meaning is shared by all methods. For each topic, we list the 20 most representative words. Table 2 presents the top 20 words under the topics “health” and “movies” for the Chinese dataset. The second and third rows are the Chinese words for the selected topics, and the fourth and fifth rows are their English translations.

Table 2 shows that the top 20 words generated by DE-LDA under “health” are closely related to the topic, whereas the top 20 words generated by the benchmarks contain irrelevant words. For example, LDA has “fire” and “times”; pLSA has “fight” and “reading”; BTM has “times”, “fire” and “rub”; and Mix has “fight” and “reading”. The top 5 words generated by DE-LDA are similar to those generated by Mix. The top 5 words of pLSA do not include “health”, and those of BTM include the uninformative word “times”. For the topic “movie”, all words generated by DE-LDA and BTM are related to the topic. Among the remaining methods, LDA has the irrelevant word “link”, pLSA has the unrelated words “image” and “do”, and Mix has the irrelevant words “link” and “game”. Although neither DE-LDA nor BTM produces irrelevant words, the words produced by DE-LDA are more relevant to the topic “movie” than those produced by BTM. From this qualitative evaluation, we conclude that DE-LDA discovers topics of better quality.

5. Clustering application

The quantitative and qualitative evaluations above assess the quality of topic discovery directly. Following earlier work on topic models, clustering is a widely used indirect way to evaluate it. Clustering is also one of the most important applications in mining short reviews. Before clustering, short reviews must be turned into vectors, and using a topic model for this vectorization is a common choice. To evaluate the quality of topic discovery indirectly and to apply the methods to a practical clustering task, we compare the clustering performance obtained with different topic models.

5.1 Clustering based on DE-LDA

We demonstrate how to apply DE-LDA to a short review clustering task. A common way to cluster short reviews is to use words as features, either all words or a subset of the words in the documents. However, using words as features loses semantic information and suffers from the problems of polysemy and synonymy. In this study, we use topics as features. This choice alleviates the problems of polysemy and synonymy and reduces the dimensionality of the review vectors. We use DE-LDA to vectorize short reviews and then apply a clustering algorithm to the vectors. The vector of document $d$ is represented as follows:

$\displaystyle\vec{d}=\left(\frac{\textit{num}_{w_{d}}^{(1)}+\textit{num}_{w_{d}^{*}}^{(1)}+a-1}{N_{d}+N_{d}^{*}+a-1},\ldots,\frac{\textit{num}_{w_{d}}^{(k)}+\textit{num}_{w_{d}^{*}}^{(k)}+a-1}{N_{d}+N_{d}^{*}+a-1},\ldots,\frac{\textit{num}_{w_{d}}^{(K)}+\textit{num}_{w_{d}^{*}}^{(K)}+a-1}{N_{d}+N_{d}^{*}+a-1}\right)$ (12)

where $K$ is the number of topics. Through the calculation of Eq. (7), the set of documents can be defined as $D=\{\overrightarrow{d_{1}},\cdots,\overrightarrow{d_{d}},\cdots,\overrightarrow{d_{D}}\}$.
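The vector in Eq. (12) can be sketched in a few lines of code. The sketch assumes that $\textit{num}_{w_{d}}^{(k)}$ and $\textit{num}_{w_{d}^{*}}^{(k)}$ count the original and the enhanced words of document $d$ assigned to topic $k$, and that $N_{d}$ and $N_{d}^{*}$ are the corresponding totals; this reading, the function name `document_vector`, and the 0-based topic indices are assumptions made for illustration.

```python
from collections import Counter


def document_vector(orig_topics, enhanced_topics, num_topics, a):
    """Topic-feature vector of one document, following Eq. (12) literally.

    orig_topics:     topic index (0..K-1) of each original word.
    enhanced_topics: topic index of each word added by data enhancement.
    num_topics:      K, the number of topics.
    a:               the hyperparameter written as `a` in Eq. (12).
    """
    counts = Counter(orig_topics) + Counter(enhanced_topics)
    denom = len(orig_topics) + len(enhanced_topics) + a - 1
    if denom <= 0:
        raise ValueError("N_d + N_d* + a - 1 must be positive")
    # Components are not renormalised; they follow the equation as written.
    return [(counts[k] + a - 1) / denom for k in range(num_topics)]


# Three original words and two enhanced words, K = 3, a = 1.
print(document_vector([0, 0, 1], [1, 2], num_topics=3, a=1.0))
```

Each document thus becomes a $K$-dimensional vector regardless of its length, which is what allows a standard clustering algorithm to be applied afterwards.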

Table 3

Clustering performance for K-Means

		Weibo			Yelp
	Method	Mean	SD	$T$-test	Mean	SD	$T$-test
Topic 10	DE-LDA	0.9433	0.0048		0.9638	0.0000
	LDA	0.9317	0.0039	0.0000**	0.9630	0.0001	0.0000**
	pLSA	0.9303	0.0007	0.0000**	0.9635	0.0000	0.0000**
	BTM	0.9330	0.0008	0.0000**	0.9636	0.0000	0.0000**
	Mix	0.8819	0.0004	0.0000**	0.9606	0.0063	0.1060
Topic 20	DE-LDA	0.9428	0.00348		0.9647	0.0012
	LDA	0.9362	0.0060	0.0106*	0.9631	0.0000	0.0008**
	pLSA	0.9314	0.0006	0.0000**	0.9635	0.0000	0.0113*
	BTM	0.9338	0.0023	0.0000**	0.9636	0.0000	0.0059**
	Mix	0.9250	0.0054	0.0000**	0.9518	0.0109	0.0014**
Topic 30	DE-LDA	0.9398	0.0036		0.9647	0.0012
	LDA	0.9326	0.0076	0.0201*	0.9631	0.0000	0.0010**
	pLSA	0.9271	0.0053	0.0000**	0.9634	0.0000	0.0118*
	BTM	0.9255	0.0039	0.0000**	0.9636	0.0000	0.0048**
	Mix	0.9180	0.0106	0.0000**	0.9561	0.0098	0.0125*
Topic 40	DE-LDA	0.9286	0.0059		0.9653	0.0018
	LDA	0.8980	0.0160	0.0000**	0.9630	0.0000	0.0018**
	pLSA	0.8948	0.0123	0.0000**	0.9633	0.0000	0.0105*
	BTM	0.8971	0.0131	0.0000**	0.9635	0.0000	0.0053**
	Mix	0.8979	0.0186	0.0002**	0.9510	0.0094	0.0001**
Topic 50	DE-LDA	0.9270	0.0061		0.9656	0.0022
	LDA	0.8544	0.0166	0.0000**	0.9630	0.0001	0.0000**
	pLSA	0.8471	0.0186	0.0000**	0.9632	0.0000	0.0103*
	BTM	0.8463	0.0197	0.0000**	0.9635	0.0000	0.0042**
	Mix	0.8437	0.0182	0.0000**	0.9446	0.0060	0.0000**

The topic distribution of each document can be regarded as a $K$-dimensional vector in Euclidean space. Most clustering algorithms compute distances or similarities between reviews and therefore require feature vectors in Euclidean space. Thus, the output of DE-LDA can be combined with many clustering algorithms. To evaluate the effectiveness of clustering based on DE-LDA for short reviews, we choose K-Means [27] and the Gaussian Mixture Model (GMM) [28] as clustering algorithms. Because each document carries a category tag, we select the Rand index (RI) [29] as the evaluation metric. RI evaluates a clustering by counting the pairs of documents on which the clustering and the tags agree. The equation is as follows:

$\displaystyle RI=\frac{R+W}{R+M+D+W}$ (13)

where R is the number of pairs of similar documents assigned to the same cluster, and W is the number of pairs of dissimilar documents assigned to different clusters; both are correct decisions. M is the number of pairs of similar documents assigned to different clusters, and D is the number of pairs of dissimilar documents assigned to the same cluster; both are wrong decisions. The value of RI ranges from 0 to 1, and a higher value indicates better clustering performance.
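Eq. (13) can be computed directly by enumerating document pairs. The sketch below follows the standard Rand index [29], in which R and W count the pairs on which clustering and tags agree, and M and D count the two kinds of disagreement (similar documents split apart, dissimilar documents put together). The function name `rand_index` is hypothetical, and the quadratic pair loop is chosen for clarity rather than speed.

```python
from itertools import combinations


def rand_index(true_labels, cluster_labels):
    """Rand index of a clustering against ground-truth tags (Eq. 13)."""
    if len(true_labels) != len(cluster_labels):
        raise ValueError("both label lists must have the same length")
    r = w = m = d = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_tag = true_labels[i] == true_labels[j]
        same_cluster = cluster_labels[i] == cluster_labels[j]
        if same_tag and same_cluster:
            r += 1  # similar documents, same cluster
        elif not same_tag and not same_cluster:
            w += 1  # dissimilar documents, different clusters
        elif same_tag:
            m += 1  # similar documents, different clusters
        else:
            d += 1  # dissimilar documents, same cluster
    total = r + w + m + d
    return (r + w) / total if total else 1.0


# Cluster ids are arbitrary: a relabelled perfect clustering still scores 1.0.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

Because only pair agreement matters, the index does not depend on how the clusters are numbered, which makes it suitable for comparing clusterings produced by different topic models.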

5.2 Evaluation of clustering performance

To verify the clustering performance of DE-LDA, we compare it with clustering based on other topic models on the Chinese and English datasets. In a topic model, a document can be approximated by its document-topic distribution, so each document is represented as a vector. We combine two clustering algorithms (K-Means and GMM) with the different topic models, which yields one clustering method for each pair of algorithm and topic model. To match the number of true categories in each dataset, we set the number of clusters to 28 for the Chinese dataset and to 834 for the English dataset. We use the clustering methods based on LDA, pLSA, Mix, and BTM as comparison methods and RI as the evaluation metric. For each number of topics (10, 20, 30, 40, and 50), we report the mean and standard deviation of RI over 10 experiments and use the $T$-test to check whether the difference between each baseline and DE-LDA is significant.
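The reporting protocol above (mean, standard deviation, and a significance test over repeated runs) can be sketched as follows. The text does not state which variant of the test or of the standard deviation was used, so the sample standard deviation and Welch's unequal-variance statistic are assumptions; the function names and the RI values in the example are hypothetical and are not taken from the tables.

```python
import math
import statistics


def summarize(scores):
    """Mean and sample standard deviation of the RI values of repeated runs."""
    return statistics.mean(scores), statistics.stdev(scores)


def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two sets of runs."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    se_a, se_b = var_a / len(a), var_b / len(b)
    if se_a + se_b == 0:
        raise ValueError("both samples have zero variance")
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(a) - 1) + se_b ** 2 / (len(b) - 1)
    )
    return t, df


# Hypothetical RI values from five runs of two methods.
method_a = [0.941, 0.944, 0.939, 0.946, 0.943]
method_b = [0.931, 0.934, 0.929, 0.933, 0.932]
print(summarize(method_a))
print(welch_t(method_a, method_b))
```

Turning the statistic into a p-value requires the CDF of the t distribution, which the standard library does not provide; a statistics package such as SciPy (`scipy.stats.ttest_ind` with `equal_var=False`) can be used for that step.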

Table 3 reports the performance of K-Means clustering based on the different topic models. Clustering based on DE-LDA achieves the highest mean RI for every number of topics on both datasets. On the Weibo dataset, its results are also the most stable as the number of topics grows: the mean RI of DE-LDA only drops from 0.9433 to 0.9270 between 10 and 50 topics, whereas the mean RI of every baseline falls below 0.86 at 50 topics.
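To make the K-Means setting concrete, the following is a minimal sketch of clustering document-topic vectors. It uses plain Lloyd iterations with a deterministic initialisation, not the Hartigan-Wong algorithm cited as [27], and the function name `kmeans` and the toy vectors are hypothetical; it is meant only to show how topic vectors feed into a clustering step.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def kmeans(vectors, k, max_iters=100):
    """Lloyd-style K-Means; returns one cluster index per input vector."""
    # Assumption: the first k vectors serve as initial centroids.
    centroids = [list(v) for v in vectors[:k]]
    labels = None
    for _ in range(max_iters):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        for c in range(k):
            members = [v for v, lab in zip(vectors, new_labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels


# Four toy document-topic vectors over two topics.
topic_vectors = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2], [0.2, 0.8]]
print(kmeans(topic_vectors, k=2))
```

In the experiments the number of clusters equals the number of true categories, so the resulting cluster indices can be compared with the document tags through RI.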

Table 4 reports the performance of GMM clustering based on the different topic models. Clustering based on DE-LDA again achieves the highest mean RI for every number of topics on both datasets.

Table 4
Clustering performance for GMM

		Weibo			Yelp
	Method	Mean	SD	$T$-test	Mean	SD	$T$-test
Topic 10	DE-LDA	0.9343	0.0049		0.9674	0.0003
	LDA	0.9260	0.0011	0.0001**	0.9607	0.0000	0.0000**
	pLSA	0.9262	0.0011	0.0001**	0.9606	0.0000	0.0000**
	BTM	0.9324	0.0003	0.0425*	0.9630	0.0000	0.0000**
	Mix	0.8944	0.0000	0.0000**	0.9306	0.0000	0.0000**
Topic 20	DE-LDA	0.9368	0.0052		0.9673	0.0039
	LDA	0.9263	0.0009	0.0000**	0.9609	0.0000	0.0001**
	pLSA	0.9236	0.0031	0.0000**	0.9594	0.0001	0.0000**
	BTM	0.9330	0.0017	0.0502**	0.9631	0.0000	0.0051**
	Mix	0.9344	0.0000	0.1610	0.9391	0.0000	0.0000**
Topic 30	DE-LDA	0.9378	0.0023		0.9672	0.0014
	LDA	0.9266	0.0006	0.0000**	0.9610	0.0001	0.0000**
	pLSA	0.9205	0.0007	0.0000**	0.9592	0.0002	0.0000**
	BTM	0.9351	0.0006	0.0035**	0.9632	0.0000	0.0000**
	Mix	0.9341	0.0003	0.0000**	0.9393	0.0000	0.0000**
Topic 40	DE-LDA	0.9410	0.0027		0.9679	0.0010
	LDA	0.9276	0.0004	0.0000**	0.9613	0.0000	0.0000**
	pLSA	0.9229	0.0003	0.0000**	0.9574	0.0001	0.0000**
	BTM	0.9353	0.0003	0.0000**	0.9633	0.0000	0.0000**
	Mix	0.9267	0.0037	0.0000**	0.9415	0.0010	0.0000**
Topic 50	DE-LDA	0.9394	0.0004		0.9687	0.0012
	LDA	0.9280	0.0004	0.0000**	0.9620	0.0000	0.0000**
	pLSA	0.9232	0.0007	0.0000**	0.9578	0.0028	0.0000**
	BTM	0.9347	0.0001	0.0000**	0.9613	0.0032	0.0000**
	Mix	0.9215	0.0040	0.0000**	0.9422	0.0000	0.0000**

6. Conclusions

With the rapid development of social media and the mobile Internet, the volume of short reviews has exploded. Mining short reviews and discovering their topics can help managers analyze stock price movements, study the evolution of public opinion and identify users’ emotions. To address the sparsity of short reviews and the scarcity of word co-occurrences, we borrow the idea of data enhancement from the image field and propose a topic model for short reviews based on data enhancement (DE-LDA). We evaluate the quality of topic discovery both quantitatively and qualitatively on the Chinese Weibo and English Yelp datasets and verify the rationality and validity of the model. Clustering based on DE-LDA not only improves the performance of traditional clustering algorithms that use words as features but also strengthens the clustering of short reviews. DE-LDA can be applied to general short reviews on the Internet, provides methodological support for modeling and clustering short reviews, and has both theoretical and practical significance.

Although DE-LDA improves performance for topic discovery from short reviews, the problems of fragmentation and diversification have not been solved. In the future, exploring new methods will be necessary to reduce the sparsity of short reviews and improve the accuracy of topic modeling.

Footnotes

Acknowledgments

This work is supported by the Major Program of the National Natural Science Foundation of China (91846201), the National Natural Science Foundation of China (71872060, 72071069, 71801069, 71802068), and the National Key Research and Development Program of China (2017YFB0803303).

References

1. Dařena et al., Machine learning-based analysis of the association between online texts and stock price movements, Inteligencia Artificial 21(61) (2018), 95–110.
2. Lian, Dong and Liu, Topological evolution of the internet public opinion, Physica A: Statistical Mechanics and its Applications 486 (2017), 567–578.
3. Camacho-Vázquez V.A., Sidorov and Galicia-Haro S.N., Automatic detection of negative emotions within a balanced corpus of informal short texts, Cyberpsychology, Behavior, and Social Networking 21(12) (2018), 781–787.
4. Fan, Che and Chen, Product sales forecasting using online reviews and historical sales data: A method combining the Bass model and sentiment analysis, Journal of Business Research 74 (2017), 90–100.
5. Liu, Feng and Liao, When online reviews meet sales volume information: is more or accurate information always better? Information Systems Research 28(4) (2017), 723–743.
6. Hofmann, Probabilistic latent semantic indexing, in: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999, pp. 50–57.
7. Weng et al., Twitterrank: finding topic-sensitive influential twitterers, in: Proceedings of the Third ACM International Conference on Web Search and Data Mining, 2010, pp. 261–270.
8. Hong and Davison B.D., Empirical study of topic modeling in twitter, in: Proceedings of the First Workshop on Social Media Analytics, 2010, pp. 80–88.
9. Quan et al., Short and sparse text topic modeling via self-aggregation, in: Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
10. Zuo et al., Topic modeling of short texts: A pseudo-document view, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 2105–2114.
11. Phelan, McCarthy and Smyth, Using twitter to recommend real-time topical news, in: Proceedings of the Third ACM Conference on Recommender Systems, 2009, pp. 385–388.
12. Gruber, Weiss and Rosen-Zvi, Hidden topic markov models, in: Artificial Intelligence and Statistics, 2007, pp. 163–170.
13. Andrzejewski, Zhu and Craven, Incorporating domain knowledge into topic modeling via Dirichlet forest priors, in: Proceedings of the 26th Annual International Conference on Machine Learning, 2009, pp. 25–32.
14. Cheng et al., Btm: Topic modeling over short texts, IEEE Transactions on Knowledge and Data Engineering 26(12) (2014), 2928–2941.
15. Baykulov and Gajewski, Prestack seismic data enhancement with partial common-reflection-surface (CRS) stack, Geophysics 74(3) (2009), 49–58.
16. et al., Deep image: Scaling up image recognition, arXiv preprint arXiv:1501.02876, 2015.
17. Wang and Perez, The effectiveness of data augmentation in image classification using deep learning, in: Convolutional Neural Networks Vis. Recognit, 2017.
18. Wong S.C. et al., Understanding data augmentation for classification: when to warp? in: 2016 International Conference on Digital Image Computing: Techniques and Applications, 2016, pp. 1–6.
19. et al., Improved relation classification by deep recurrent neural networks with data augmentation, in: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, 2016.
20. Vasconcelos C.N. and Vasconcelos B.N., Increasing deep learning melanoma classification by classical and expert knowledge based image transforms, CoRR (2017).
21. Yan et al., A biterm topic model for short texts, in: Proceedings of the 22nd International Conference on World Wide Web, 2013, pp. 1445–1456.
22. Lin and He, Joint sentiment topic model for sentiment analysis, in: Proceedings of the 18th ACM Conference on Information and Knowledge Management, 2009, pp. 375–384.
23. Asuncion et al., On smoothing and inference for topic models, in: Proceedings of the Twenty-fifth Conference on Uncertainty in Artificial Intelligence, 2009, pp. 27–34.
24. Mimno et al., Optimizing semantic coherence in topic models, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2011, pp. 262–272.
25. Nigam et al., Text classification from labeled and unlabeled documents using EM, Machine Learning 39(2-3) (2000), 103–134.
26. Liu et al., A crowdsourcing-based topic model for service matchmaking in Internet of Things, Future Generation Computer Systems 87 (2018), 186–197.
27. Hartigan J.A. and Wong M.A., Algorithm AS 136: A K-Means clustering algorithm, Journal of the Royal Statistical Society. Series C (Applied Statistics) 28(1) (1979), 100–108.
28. Zivkovic, Improved adaptive gaussian mixture model for background subtraction, in: Proceedings of the 17th International Conference on Pattern Recognition, 2004, pp. 28–31.
29. Rand W.M., Objective criteria for the evaluation of clustering methods, Publications of the American Statistical Association 66(336) (1971), 846–850.
30. Blei D.M., Ng A.Y. and Jordan M.I., Latent dirichlet allocation, Journal of Machine Learning Research (2003), 993–1022.
31. Faris P.D. et al., Multiple imputation versus data enhancement for dealing with missing data in observational health care outcome analyses, Journal of Clinical Epidemiology 55(2) (2002), 184–191.
32. Wang and Tudoreanu M.E., Utilizing public data for data enhancement and analysis of federal acquisition data, 2018.
33. Baykulov and Gajewski, Prestack seismic data enhancement with partial common-reflection-surface (CRS) stack, Geophysics 74(3) (2009), 49–58.
34. Ramage, Dumais and Liebling, Characterizing microblogs with topic models, Proceedings of the International AAAI Conference on Web and Social Media 4(1) (2000).
35. Wang, Agichtein and Benzi, TM-LDA: efficient online modeling of latent topic transitions in social media, in: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2012, pp. 123–131.

1 http://www.weibo.com.
