Abstract
As an important issue in sentiment analysis, sentence-level polarity classification plays a critical role in many opinion-mining applications such as opinion question answering, opinion retrieval and opinion summarization. Training a supervised classifier on sentences often runs into the data sparseness problem because the texts are short. In this article, to address this problem, we exploit two feature sets learned from external datasets as additional features to enrich the data representation: one is a latent topic feature set obtained using a topic model, and the other is a related word feature set derived using word embeddings. Furthermore, we propose an ensemble approach in which these additional features guide the design of the ensemble members. Experimental results on a public movie review dataset demonstrate that the enriched representations improve the performance of polarity classification, and that the proposed ensemble approach further improves the overall performance.
1. Introduction
Sentiment analysis (SA) is a field of study that analyses people’s opinions, sentiments, attitudes and emotions expressed in written language [1]. With the explosive growth of user-generated opinionated contents such as blogs, reviews, forum discussions and tweets, SA is becoming increasingly important and has grown to be one of the most active research areas in natural language processing (NLP). Among various SA tasks, an important sub-task is polarity classification, which determines if an opinionated piece of text expresses positive or negative sentiment.
Generally, polarity classification has been performed at different granularity levels: word level [2], phrase level [3], sentence level [4, 5] and document level [6–8]. In this article, we focus on polarity classification at the sentence level and treat reviews in a sentence-level dataset as short texts as they generally consist of a few phrases or sentences.
Unlike texts of ordinary length, short texts are usually noisier and sparser. Because of their short length, short texts do not provide enough word co-occurrence or shared context for a good similarity measure, which presents great challenges in clustering and classification. Owing to this data sparseness, standard machine-learning methods usually suffer significant performance degradation and fail to achieve the desired results when they are applied directly to short-text tasks [9, 10].
In traditional short-text classification and clustering tasks, to deal with the data sparseness problem, most existing work tries to enrich the representation of a short text by exploiting additional features derived from various sources, for example, the short-text collection itself [11, 12], extra information obtained from search engines [13], a large external collection of unlabelled documents [9, 14] or an external knowledge base such as WordNet, Open Directory Project and Wikipedia [12, 15, 16].
Inspired by the success of the representation enrichment approach in short-text classification and clustering tasks, we adopt this approach in this work. However, unlike previous work, we explore both latent topics and related words as additional features to enrich the representation of reviews.
Consider the following sentences, in which sentence S1 is labelled with positive polarity, and sentence S2 is used as the test data.
S1. The iPod is great.
S2. I bought iPhone. What a product!
From S1, we can see that ‘iPod’ appears in a review text of positive polarity. Since ‘iPhone’ and ‘iPod’ are likely to be grouped under the same topic named ‘topic_Apple’, if we can identify the common topic shared by S1 and S2, then we can enrich S1 and S2 with topic name as follows:
S1′. The iPod is great. topic_Apple
S2′. I bought iPhone. What a product! topic_Apple
As a result, the shared topic makes S2′ and S1′ more related in a semantic way, and S2′ is more likely to have a positive polarity than S2.
Next, consider the following sentences, in which sentence S3 is labelled with positive polarity, and sentence S4 is used as the test data.
S3. a pleasant movie.
S4. an enjoyable film.
In sentences S3 and S4, the words ‘pleasant’ and ‘enjoyable’ are treated as unrelated in traditional bag-of-words (BOW) representations, whereas they are semantically related in context. The same holds for ‘movie’ and ‘film’. Hence, if we can identify the semantic relatedness of words, then we can enrich S3 and S4 with related words as follows:
S3′: a pleasant movie enjoyable film.
S4′: an enjoyable film pleasant movie.
We can see that S3′ and S4′ share common words. By using BOW representation in a traditional classifier, we can get the same polarity for both sentences.
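The effect of this enrichment on a BOW representation can be illustrated with a minimal sketch. It only reproduces the S3/S4 example above; the helper names `bow` and `shared_words` are hypothetical, and the related words are appended by hand rather than derived from embeddings.

```python
from collections import Counter

def bow(text):
    """Bag-of-words vector as a word -> count mapping (lowercased, whitespace split)."""
    return Counter(text.lower().split())

def shared_words(a, b):
    """Words that occur in the bag-of-words vectors of both texts."""
    return sorted(set(bow(a)) & set(bow(b)))

s3, s4 = "a pleasant movie", "an enjoyable film"
# Enriched versions S3' and S4', with the related words appended by hand.
s3_enriched = s3 + " enjoyable film"
s4_enriched = s4 + " pleasant movie"

print(shared_words(s3, s4))                    # no shared words before enrichment
print(shared_words(s3_enriched, s4_enriched))  # four shared words after enrichment
```

Before enrichment the two sentences share no features, so a BOW classifier trained on S3 has nothing to match in S4; after enrichment they share all four content words.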
From the above examples, we can see that both latent topics and related words are possible ways to alleviate the data-sparseness problem. Since probabilistic topic models have been widely used to perform latent topic analysis, in this work, we also use the models to obtain latent topics. Inspired by the success of continuous space word representations in capturing the semantic similarities in various NLP tasks, we use word embeddings to derive related words. We then utilize latent topics and related words as additional features to enrich the representation of reviews. Furthermore, we come up with an ensemble approach for polarity classification by combining the individual classifiers.
The approach consists of four steps. First, we train a topic model and learn word embeddings on a large external dataset. Second, we enrich the original feature space of the test corpus with two types of additional features: latent topics and related words. Third, we train two classifiers, one on the representations enriched with latent topics and the other on the representations enriched with related words. Finally, an ensemble technique is applied to the two classifiers to obtain a combined classifier.
To the best of our knowledge, no previous work has explored both topic model and word embeddings to alleviate the data-sparseness problem in sentence-level sentiment analysis, and this work is the first attempt towards the problem. We demonstrate that our approach can effectively take advantage of external data to improve the performance. Moreover, since a large external dataset can be gathered easily because huge document collections are widely available from the Web, the approach is suitable and practical for real applications.
The remainder of this article is organized as follows. Section 2 introduces related work. The proposed approach is described in detail in Section 3. The experimental setup and results are presented in Sections 4 and 5, respectively. Finally, Section 6 concludes the paper and outlines directions for future research.
2. Related work
Since our work is closely related to short-text processing and sentiment analysis, in this section we first present related work on short-text classification and clustering, and then related work on sentence-level polarity classification.
2.1. Short-text classification and clustering
Unlike ordinary documents, short texts typically include only a few words or sentences. The resulting sparsity of content brings new challenges to classification and clustering. To deal with the data-sparseness problem, researchers have made considerable efforts to enrich the representation of short texts. Three major approaches have been exploited. The first acquires additional information from the Web by using search engines to evaluate similarity measures for short texts, and then performs short-text classification based on similarity [13, 17]. The second uses world knowledge such as Wikipedia or WordNet to expand feature terms [12, 16]. The third mines latent topics from an existing large corpus using a topic model such as latent Dirichlet allocation (LDA), and then uses the latent topics as features to enrich the representation of short texts [9, 14]. In Phan et al. [9] and Chen et al. [14], the classification performance is significantly improved by enriching the feature vector with relevant latent topics derived from the topic model.
2.2. Sentence-level polarity classification
In the SA field, much work has been done on sentence-level sentiment analysis during recent years. Usually, two types of approaches are proposed to tackle the problem: lexicon-based approaches and machine learning-based approaches.
Lexicon-based approaches assume that the sentiment orientation of a sentence is determined by summing up the orientation scores of all sentiment words in the sentence. For example, a positive word is given a sentiment score of +1 and a negative word is given a sentiment score of −1. Early work using the lexicon-based approach includes Hu and Liu [4]. Contextual valence shifters such as negation and intensification are considered in subsequent works [18, 19]. Thelwall et al. [20] used a lexicon-based classifier called SentiStrength to simultaneously detect positive and negative sentiment strength in short informal text. For each text, SentiStrength assigns a score of 1–5 for positive sentiment strength and a separate score for negative sentiment strength.
Machine learning-based approaches treat the problem as a text classification task. They use machine-learning algorithms and features such as unigrams, bigrams, part-of-speech (POS) tags, etc. to train classifiers. Davidov et al. [21] proposed a supervised learning approach that utilizes 50 Twitter tags and 15 smileys as sentiment labels to avoid the need for labour-intensive manual annotation. Gamon et al. [22] used a semi-supervised learning algorithm based on Expectation Maximization to learn a classifier from a small set of labelled sentences and a large set of unlabelled sentences. By encoding lexical and discourse knowledge as expressive constraints and integrating them into the learning of conditional random field models, Yang and Cardie [5] proposed a context-aware method for analysing sentiment at the level of individual sentences.
In the domain of SA, tweets and short product reviews can be treated as short texts. To alleviate the data sparseness problem, Saif et al. [23] used semantic concept features and sentiment-topic features in Twitter classification and improved the performance. Montejo-Ráez et al. [24] proposed an unsupervised approach that expands the concepts expressed in tweets by applying PageRank over the WordNet graph, and reported that the approach improves overall performance.
3. The proposed approach
3.1. The general framework
In this section, we introduce the framework of our approach. The processing flow of the framework is shown in Figure 1. It is mainly composed of the following steps:

Figure 1. Framework of the proposed approach.
carrying out topic analysis and performing learning of word embeddings on a large external dataset T;
performing topic inference on test corpus;
enriching the training set Dtrain and testing set Dtest with latent topics, then building the classifier Ctopic on the enriched training set;
enriching the training set Dtrain and testing set Dtest with related words learned from word embeddings, then building the classifier Cembeddings on the enriched training set;
getting the final classifier via an ensemble of the classifiers Ctopic and Cembeddings.
In the presented framework, choosing a proper large external dataset is important. For the classification task, we need to collect a dataset that is large enough to cover a lot of words, concepts and topics that are relevant to the classification problem. Besides that, the large external dataset and test dataset should be from the same domain, so that the large dataset is consistent with the training and future unseen data that the sentiment classifier will work with. It should be noted that the large dataset can be gathered easily because huge document collections are widely available on the Web.
3.2. Building classifiers with latent topics
To automatically discover latent topics from a text collection and enrich the representation of short texts, previous work [9, 14] used LDA as the topic model and achieved significant results. In this work, however, we focus on a different task: sentiment analysis. As stated in Pang et al. [7], polarity classification is more difficult than traditional text classification because sentiment is often expressed in a subtle manner that is hard to identify. Naturally, for the polarity classification task, we want to know whether the topics extracted from text collections using the LDA model are still effective for enriching the representation of texts. Moreover, since sentiment polarities intuitively depend on topics, we also check whether sentiment-specific topics extracted using the joint sentiment-topic (JST) model are effective for enriching the text representations. In this section, we first describe the LDA and JST models used for topic analysis. Second, we describe two different integration strategies for enriching the text representations. Finally, we build classifiers on the enriched dataset.
3.2.1. LDA model
As a well-known probabilistic generative model, LDA has been widely used in various NLP tasks to extract latent topics from a given text corpus. The generative graphical model of LDA is shown in Figure 2.

Figure 2. Graphical model of LDA.
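Assume that we have a document collection D in which each document d is a sequence of $N_d$ words. The generative process corresponding to the graphical model shown in Figure 2 is as follows:
For each topic $k \in \{1, \ldots, K\}$, draw a word distribution $\varphi_k \sim \mathrm{Dir}(\beta)$.
For each document $d \in D$, draw a topic distribution $\theta_d \sim \mathrm{Dir}(\alpha)$.
For each word $w_{d,n}$ in document d, with $n \in \{1, \ldots, N_d\}$:
choose a topic $z_{d,n} \sim \mathrm{Mult}(\theta_d)$;
choose the word $w_{d,n} \sim \mathrm{Mult}(\varphi_{z_{d,n}})$.
Here, K is the number of topics, $\mathrm{Dir}(\cdot)$ and $\mathrm{Mult}(\cdot)$ denote the Dirichlet and multinomial distributions, and α and β are the Dirichlet hyperparameters.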
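The standard LDA generative process can be simulated in a few lines. The following is a minimal sketch using only the Python standard library: it draws one word distribution (phi) per topic and one topic distribution (theta) per document from symmetric Dirichlet priors, then samples a topic and a word for every position. The function names, the toy vocabulary and the hyperparameter values are assumptions chosen for illustration; they are not the settings or the toolkit used in the experiments.

```python
import random

def dirichlet(alphas, rng):
    """Sample from a Dirichlet distribution by normalising Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [x / total for x in draws]

def categorical(probs, rng):
    """Draw an index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_corpus(vocab, num_topics, doc_lengths, alpha, beta, seed=0):
    """Generate documents as lists of (word, topic) pairs following LDA."""
    rng = random.Random(seed)
    # One word distribution phi_k per topic.
    phi = [dirichlet([beta] * len(vocab), rng) for _ in range(num_topics)]
    corpus = []
    for n_d in doc_lengths:
        # One topic distribution theta_d per document.
        theta = dirichlet([alpha] * num_topics, rng)
        doc = []
        for _ in range(n_d):
            z = categorical(theta, rng)   # topic assigned to this position
            w = categorical(phi[z], rng)  # word drawn from that topic
            doc.append((vocab[w], z))
        corpus.append(doc)
    return corpus

# Toy example: 2 topics, 2 documents of 5 and 3 words (illustrative values only).
toy_vocab = ["ipod", "iphone", "movie", "film"]
for document in generate_corpus(toy_vocab, 2, [5, 3], alpha=0.5, beta=0.5, seed=1):
    print(document)
```

Topic inference runs this process in reverse: given only the observed words, it estimates the topic distributions and the per-word topic assignments, which are the quantities used to enrich the reviews.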
3.2.2. JST model
By adding an additional sentiment layer between the document and the topic layer, JST can be used to detect sentiment and topic simultaneously from text [27]. The graphical model of JST is represented in Figure 3. The generative process in JST corresponding to the graphical model shown in Figure 3 is as follows:

Figure 3. Graphical model of JST.
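For each sentiment label $l \in \{1, \ldots, S\}$ and each topic $j \in \{1, \ldots, T\}$, draw a word distribution $\varphi_{l,j} \sim \mathrm{Dir}(\beta)$.
For each document d, choose a distribution $\pi_d \sim \mathrm{Dir}(\gamma)$ over sentiment labels and, for each sentiment label l under d, a distribution $\theta_{d,l} \sim \mathrm{Dir}(\alpha)$ over topics.
For each word $w_i$ in document d,
choose a sentiment label $l_i \sim \mathrm{Mult}(\pi_d)$;
choose a topic $z_i \sim \mathrm{Mult}(\theta_{d,l_i})$;
choose the word $w_i \sim \mathrm{Mult}(\varphi_{l_i,z_i})$.
In JST, there are three parameters that need to be estimated: the joint sentiment/topic-document distribution θ, the joint sentiment/topic-word distribution φ and the sentiment-document distribution π.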
3.2.3. Strategies for integrating latent topics into data
Strategy 1: after topic inference, each review d has a topic distribution, which we integrate into the text of d following the strategy presented in Phan et al. [9].
Formula (1) has two parameters, scale and cut-off: scale determines how many times a topic is added to review d, and cut-off is the topic probability threshold. We extend the original review d by adding the latent topics with high probabilities to its content, which makes it richer and more topic-focused. After the latent topics are integrated into the review, a new extended vector is obtained.
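The role of the two parameters can be illustrated with a minimal sketch. It assumes that a topic with probability p at or above the cut-off is appended round(scale × p) times as a pseudo-word; this rounding rule is an assumption made for illustration and is not necessarily the exact form of formula (1). The function name and the `topic_k` token format are likewise hypothetical.

```python
def enrich_with_topics(tokens, topic_probs, cut_off, scale):
    """Append pseudo-words for topics whose probability reaches the cut-off.

    Assumption: a topic k with probability p is added round(scale * p) times.
    """
    enriched = list(tokens)
    for k, p in enumerate(topic_probs):
        if p >= cut_off:
            enriched.extend([f"topic_{k}"] * round(scale * p))
    return enriched

# Toy review with an inferred distribution over K=3 topics (illustrative values).
review = ["ipod", "great"]
print(enrich_with_topics(review, [0.6, 0.3, 0.1], cut_off=0.4, scale=5))
```

With these toy values only topic 0 passes the cut-off, and it is added three times; a larger scale gives the topic features more weight relative to the original words.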
Strategy 2: in a trained topic model, the words grouped under one topic are often informative and coherent, so augmenting the original text with the topic assigned to each word groups related words together and can potentially reduce data sparseness in sentiment analysis. Following the technique described in Saif et al. [23], we consider this as another strategy for enriching the text representation with latent topics. After training a topic model on the large external dataset with the class labels discarded, we perform topic inference on the test dataset using the trained model, which assigns a topic label to each word in the reviews. Note that the JST model assigns each word a sentiment label in addition to a topic label. To avoid introducing too much noise, rather than adding the topic label of every word, we only consider words that may express sentiment. Since most sentiment words are adjectives, adverbs and verbs, we take the adjectives, adverbs and verbs in the reviews and add their topic labels to the original feature space as additional features for classifier training.
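A minimal sketch of this second strategy is given below. It assumes that each token comes with a Penn Treebank POS tag (adjectives start with JJ, adverbs with RB and verbs with VB) and with the topic label assigned to it during topic inference; the function name and the `topic_k` feature format are hypothetical.

```python
SENTIMENT_POS_PREFIXES = ("JJ", "RB", "VB")  # adjectives, adverbs, verbs

def augment_with_word_topics(tagged_tokens, word_topics):
    """Add the topic label of every adjective, adverb and verb as an extra feature.

    tagged_tokens: list of (word, pos_tag) pairs.
    word_topics:   topic id assigned to each token by topic inference.
    """
    features = [word for word, _ in tagged_tokens]
    for (_, tag), topic in zip(tagged_tokens, word_topics):
        if tag.startswith(SENTIMENT_POS_PREFIXES):
            features.append(f"topic_{topic}")
    return features

# Toy example: only 'enjoyable' (an adjective) contributes a topic feature.
tagged = [("an", "DT"), ("enjoyable", "JJ"), ("film", "NN")]
print(augment_with_word_topics(tagged, [2, 5, 1]))
```

For the JST model, the same idea would apply with a combined sentiment-topic label in place of the plain topic id.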
3.2.4. Training the classifier
After integrating latent topics into the data, for the sake of comprehensiveness we choose naive Bayes (NB) and maximum entropy (MaxEnt) as the classification methods for building classifiers on the enriched data. Both are robust and have been applied successfully to a wide range of NLP tasks. In addition, they train quickly and can be used in near-real-time applications.
3.3. Building classifiers with related words
In this section, we describe how to use related words obtained from word embeddings to enrich the training and testing sets, and then build a classification model on the enriched dataset. In the traditional BOW representation, it is common to represent words as indices in a vocabulary. Although this choice is simple and robust, it fails to capture the complex linguistic characteristics of words. With the revival of interest in deep learning, the representation of words as continuous vectors (also called ‘word embeddings’) has been used effectively in a variety of NLP tasks including parsing, language modelling and named entity recognition (NER) [28]. Word embeddings typically represent words as low-dimensional, dense, real-valued vectors. Two log-linear models for learning word embeddings were proposed by Mikolov et al. [29], namely the Continuous Bag-of-Words (CBOW) and Skip-gram models. Figure 4 shows the architectures of the two models.

Figure 4. CBOW and Skip-gram models.
In Figure 4, wt denotes the t-th word in a corpus. The CBOW model predicts the current word from the surrounding context words in a window of size c, whereas the Skip-gram model predicts the surrounding words given the current word wt as input. Mikolov et al. [29] demonstrated that the learned word representations capture syntactic and semantic word similarities. Baroni et al. [30] performed a systematic comparison of context-counting and context-predicting semantic vectors on a wide range of lexical semantics tasks; their results show that context-predicting models trained with the word2vec toolkit achieve impressive overall performance. Hence, in this article, we propose to use word embeddings to capture the semantic word similarities and find related words for feature terms in the reviews.
The word2vec toolkit implements both the Skip-gram and CBOW models for learning word embeddings. In this work, we employ the toolkit to pre-train the word embeddings on the large external dataset. The experimental results in Mikolov et al. [29] show that the Skip-gram model performs better than the CBOW model in identifying semantic relationships among words. Therefore, we employ the Skip-gram model for estimating word embeddings.
After training word embeddings on the large external dataset, our next task is to find semantically related words for the reviews by using the embeddings. Our goal is to use word ws in the review to select word wt in the large external dataset as the closest matching word to enrich review text.
Having obtained vector representations of words, we could determine the semantic relatedness of two words (e.g. ws and wt) as distance between word vectors in a high-dimensional space.
In formula (2),
For the original review d, if we denote those filtered words as
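The selection of related words can be sketched as follows. The sketch assumes cosine similarity as the relatedness measure and appends, for each review word that has an embedding, its single closest word; both choices are assumptions made for illustration, and any additional filtering of candidate words is omitted. The two-dimensional toy vectors are invented and are not learned embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def closest_word(source, embeddings):
    """Return the word, other than source, whose vector is most similar to source's."""
    candidates = [w for w in embeddings if w != source]
    return max(candidates, key=lambda w: cosine(embeddings[source], embeddings[w]))

def enrich_with_related_words(tokens, embeddings):
    """Append the closest embedding neighbour of every token that has a vector."""
    related = [closest_word(t, embeddings) for t in tokens if t in embeddings]
    return list(tokens) + related

# Invented 2-dimensional vectors, for illustration only.
toy_embeddings = {
    "pleasant": [0.9, 0.1],
    "enjoyable": [0.8, 0.2],
    "movie": [0.1, 0.9],
    "film": [0.2, 0.8],
}
print(enrich_with_related_words(["a", "pleasant", "movie"], toy_embeddings))
```

On the toy vectors this reproduces the enriched sentence S3′ from the Introduction: 'pleasant' brings in 'enjoyable' and 'movie' brings in 'film'.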
3.4. Ensemble combination
The ensemble technique, which combines the outputs of base classification models to obtain an integrated output, has become an effective method for polarity classification [31, 32]. Inspired by these works, after obtaining classifiers by using the approaches presented in the previous sections, we exploit the ensemble technique to build an ensemble classifier by combining the class predictions of different classifiers.
By associating each individual prediction value with a weight, we can use the following ensemble method for deriving a new semantic orientation value.
where d denotes the review text and s denotes the class label of d,
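A weighted combination of two classifiers can be sketched as below. The sketch assumes a linear interpolation of the class probabilities, with weight alpha on the topic-based classifier and 1 − alpha on the embedding-based classifier; which classifier receives which weight, and the linear form itself, are assumptions made for illustration. The probability values in the example are invented.

```python
def ensemble_probs(p_topic, p_embeddings, alpha):
    """Weighted average of two classifiers' class-probability dictionaries."""
    return {s: alpha * p_topic[s] + (1 - alpha) * p_embeddings[s] for s in p_topic}

def ensemble_predict(p_topic, p_embeddings, alpha=0.6):
    """Return the class label with the highest combined probability."""
    combined = ensemble_probs(p_topic, p_embeddings, alpha)
    return max(combined, key=combined.get)

# Invented outputs of the two classifiers for one review.
p_topic = {"pos": 0.45, "neg": 0.55}
p_embeddings = {"pos": 0.70, "neg": 0.30}
print(ensemble_predict(p_topic, p_embeddings, alpha=0.6))
```

In this toy case the two classifiers disagree, and the more confident embedding-based prediction outweighs the topic-based one even though it has the smaller weight.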
3.5. The algorithm of the proposed approach
On the basis of the above discussions, a main technical flow has been established for applying our approach to the sentence-level polarity classification problem. The algorithm of the approach is illustrated as follows.
In the framework of our approach, the topic model can be LDA or JST model. Also, when enriching text representation using latent topics, we can adopt one of the two strategies to add latent topics as additional features.
4. Experimental setup
4.1. Dataset description and evaluation metrics
We use two public datasets in our experiments, summarized in Table 1. The first is the sentence polarity dataset, which is used as the test corpus; this sentence-level dataset was introduced in Pang and Lee [33] and consists of 10,662 sentences selected from movie review websites. The other is the large movie review dataset, first proposed by Maas et al. [34] as a benchmark dataset for sentiment analysis. This document-level dataset consists of 100,000 informal movie reviews from the Internet Movie Database and is divided into three parts: 25,000 labelled training instances, 25,000 labelled test instances and 50,000 unlabelled instances. In both the training and test sets there are two types of labels, positive and negative, and the labels are balanced.
Table 1. Summary of the datasets.
Intuitively, since the large movie review dataset and the sentence polarity dataset are from the movie domain, and the former dataset has a much larger size than the latter, it is reasonable to use the former as the large external dataset in our approach. Note that, when the dataset is used as the external dataset, the class labels are discarded.
In the experiments, we divide the sentence polarity dataset into two parts with balanced labels. The first part is used as development set for parameter tuning and it has 2000 reviews. The remaining 8662 reviews are used as training set and test set data. We have two settings, a development setting (DEV) and a test setting (TEST). In the development setting, we run the typical five-fold cross-validation where we train on four folds and test on the other fold, and then average the results. In the test setting, we run with the best configurations yielded from the development setting, and the results are averaged over 10 different training/test splits of the 8662 reviews. For each split, we randomly select 6000 reviews as training set and the remaining 2662 reviews are treated as test set for the final evaluation. The test set is not inspected while we develop the algorithms.
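The repeated random splitting used in the test setting can be sketched as follows. The function name and the seed are hypothetical, and the sketch does not balance the class labels within each split, which a full implementation might want to do.

```python
import random

def make_splits(items, n_splits=10, train_size=6000, seed=0):
    """Create n_splits random (train, test) partitions of items."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_splits):
        shuffled = list(items)
        rng.shuffle(shuffled)
        splits.append((shuffled[:train_size], shuffled[train_size:]))
    return splits

# 8662 review indices: 6000 for training and the remaining 2662 for testing.
splits = make_splits(range(8662))
train, test = splits[0]
print(len(splits), len(train), len(test))
```

Each split is drawn independently, so the same review may appear in the training set of one split and in the test set of another; the reported accuracy is the average over all splits.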
To measure overall performance, we use accuracy, that is, the percentage of reviews in the test collection that are classified correctly.
4.2. Data pre-processing
Before the experiments, we apply the following pre-processing steps to the sentence polarity dataset and the large movie review dataset. First, punctuation, numbers and other non-alphabetical characters are removed. Stop words are also removed using the default stop word list provided in the Mallet toolkit. Second, to reduce the vocabulary size, stemming is performed using Porter’s stemming algorithm. In addition, the reviews in the sentence polarity dataset are annotated with POS tags using the Stanford POS tagger.
4.3. Experiment settings
To enrich the reviews with latent topics derived from the large external dataset, we use the GibbsLDA++ toolkit to estimate LDA models and the JST C++ implementation to estimate JST models on the external dataset. All LDA and JST models are estimated with different numbers of topics (3, 9, 15, 21, 30, 45 and 60); the influence of the number of topics is discussed later in this work. When estimating the LDA and JST models, the hyperparameter α is set to 50/K, where K is the number of topics, the hyperparameter β is set to 0.01, and the number of Gibbs sampling iterations is set to 1200. In the ensemble-based method, a separate weighting parameter, also denoted α, controls the influence of each individual classifier; based on pilot experiments on the development data, it is set to 0.6.
In the experiments, when strategy 1 is used to integrate latent topics extracted from a topic model into the text, the two parameters cut-off and scale need to be set. For integrating latent topic features extracted from the LDA models, we use the following settings, tuned on the development set: {(3, 0.4, 5), (9, 0.13, 20), (15, 0.08, 30), (21, 0.06, 40), (30, 0.04, 50), (45, 0.025, 80), (60, 0.02, 100)}, where (3, 0.4, 5) denotes K=3, cut-off=0.4 and scale=5, and the remaining triples are read in the same way. For integrating sentiment-topic feature sets extracted from the JST models, the following settings, also tuned on the development set, are used: {(3, 0.6, 4), (9, 0.5, 5), (15, 0.4, 6), (21, 0.35, 7), (30, 0.3, 8), (45, 0.25, 9), (60, 0.2, 10)}. Note that, in the JST model, K=21, for example, means that there are seven topics under each of the positive, negative and neutral sentiment labels, so the total number of sentiment-topics is 21.
Word embeddings are learned using the word2vec toolkit. We use the Skip-gram model to train word vectors with the vector dimensionality set to 50. For the context window size, as pointed out in Levy and Goldberg [35], a window of size 5 is commonly used to capture broad topical content, whereas a smaller window of size 2 may miss some important contexts. We therefore also use a window of size 5 around the target word w. To build classifiers, we use the Mallet toolkit implementations of NB and MaxEnt with default parameter settings.
Rather than just comparing the accuracies, we make most of the decisions based on statistical significance test. For simplicity, in the following analysis, unless otherwise specified, statistical significance is evaluated using a paired t-test at the p<0.05 level.
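For reference, the paired t statistic can be computed directly from the per-pair differences. The sketch below uses only the standard library and compares the statistic with the two-sided critical value of Student's t distribution for 9 degrees of freedom at p = 0.05 (2.262), which applies when there are 10 pairs. The accuracy values are hypothetical and are not results from the experiments.

```python
import math
from statistics import mean, stdev

T_CRIT_DF9_P05 = 2.262  # two-sided critical value, 9 degrees of freedom, p = 0.05

def paired_t_statistic(scores_a, scores_b):
    """t statistic of a paired t-test on two equally long lists of scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Hypothetical accuracies (%) of two methods on 10 paired runs.
method_a = [73.1, 72.8, 73.4, 72.9, 73.0, 73.3, 72.7, 73.2, 73.1, 72.9]
method_b = [71.9, 71.7, 72.0, 71.8, 71.6, 72.1, 71.8, 71.9, 72.0, 71.7]

t = paired_t_statistic(method_a, method_b)
print(abs(t) > T_CRIT_DF9_P05)  # True means significant at p < 0.05 for 10 pairs
```

Because the test works on per-pair differences, it accounts for the fact that both methods are evaluated on the same splits or training sizes.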
5. Experiment results and evaluation
In the experiments presented below, for comparing the effectiveness, the following seven different text representation methods are evaluated on the sentence polarity dataset:
BOW (baseline 1) – traditional ‘bag of words’ model with the term frequency (TF) weighting method.
BOW+LDA (baseline 2) – BOW integrated with additional latent topic features learned from the LDA model. The topic distribution is integrated into the original text using the strategy presented in Phan et al. [9], which is called strategy 1 in this work, as described in Subsection 3.2.3.
BOW+JST (baseline 3) – BOW integrated with additional latent topic features learned from the JST model. The topic distribution is integrated into the original text using strategy 1.
BOW+LDAtsgn (baseline 4) – BOW integrated with additional latent topic features learned from the LDA model. Unlike BOW+LDA, the topic distribution is integrated into the original text using the strategy presented in Saif et al. [23], which is called strategy 2 in this work, as described in Subsection 3.2.3.
BOW+JSTtsgn (baseline 5) – BOW integrated with additional latent topic features learned from the JST model; the topic distribution is integrated into the original text using strategy 2.
BOW+RelWord (baseline 6) – BOW integrated with additional related word features learned from the word embeddings using the word2vec toolkit.
BOW+Lex (baseline 7) – BOW integrated with additional features derived from the publicly available sentiment lexicon. The lexicon features include NumPositive and NumNegative. The former denotes the number of positive expressions in the review, and the latter denotes the number of negative expressions in the review. We use the MPQA subjectivity lexicon as the sentiment lexicon. The lexicon contains 2718 positive and 4912 negative words. When recognizing the positive or negative expression, we take into account negation words, and there are 88 negation words defined in the MPQA subjectivity lexicon. Specifically, within the window of two words previous to the polarity word w, if there is a term w1 that belongs to the negation lexicon, then the polarity of the word w is reversed.
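The two lexicon features of BOW+Lex, including the negation rule, can be sketched as follows. The tiny word sets in the example are placeholders rather than entries taken from the MPQA subjectivity lexicon, and the function name is hypothetical.

```python
def lexicon_features(tokens, positive, negative, negations, window=2):
    """Count positive and negative expressions, reversing polarity after a negation."""
    num_pos = num_neg = 0
    for i, word in enumerate(tokens):
        if word in positive:
            polarity = 1
        elif word in negative:
            polarity = -1
        else:
            continue
        # Reverse the polarity if a negation word occurs in the preceding window.
        if any(t in negations for t in tokens[max(0, i - window):i]):
            polarity = -polarity
        if polarity > 0:
            num_pos += 1
        else:
            num_neg += 1
    return {"NumPositive": num_pos, "NumNegative": num_neg}

# Placeholder lexicons, for illustration only.
tokens = "this film is not very good".split()
print(lexicon_features(tokens, {"good"}, {"bad"}, {"not"}))
```

Here 'good' is preceded by 'not' within two words, so it is counted as a negative expression rather than a positive one.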
We also use the two types of features learned from the large-scale external dataset as joint features to enrich text representations, and the following four different methods are evaluated on the sentence polarity dataset:
BOW+BothLDA – BOW integrated with additional features learned from the LDA model and word embeddings as in BOW+LDA and BOW+RelWord.
BOW+BothJST – BOW integrated with additional features learned from the JST model and word embeddings as in BOW+JST and BOW+RelWord.
BOW+BothLDAtsgn – BOW integrated with additional features learned from the LDA model and word embeddings as in BOW+LDAtsgn and BOW+RelWord.
BOW+BothJSTtsgn – BOW integrated with additional features learned from the JST model and word embeddings as in BOW+JSTtsgn and BOW+RelWord.
We then investigate using the ensemble method introduced in Section 3.4 to combine different text representation methods, and the following eight representation combinations are also evaluated on the sentence polarity dataset:
Ensemble-LDA – this is an ensemble of individual BOW+RelWord and BOW+LDA classifiers.
Ensemble-JST – this is an ensemble of individual BOW+RelWord and BOW+JST classifiers.
Ensemble-LDAtsgn – this is an ensemble of individual BOW+RelWord and BOW+LDAtsgn classifiers.
Ensemble-JSTtsgn – this is an ensemble of individual BOW+RelWord and BOW+JSTtsgn classifiers.
Ensemble-BothLDA – this is an ensemble of individual BOW+RelWord and BOW+BothLDA classifiers.
Ensemble-BothJST – this is an ensemble of individual BOW+RelWord and BOW+BothJST classifiers.
Ensemble-BothLDAtsgn – this is an ensemble of individual BOW+RelWord and BOW+BothLDAtsgn classifiers.
Ensemble-BothJSTtsgn – this is an ensemble of individual BOW+RelWord and BOW+BothJSTtsgn classifiers.
5.1. Experiments on test data
In TEST mode, to better understand the effect of the proposed approach on the test data, for each training/test split we train on subsets of the training set of different sizes, starting from 10% of its original size and increasing the size by 10% each time, and evaluate each resulting classifier on the test set only. The classification results of the above methods using NB and MaxEnt as the base classifier are reported in Tables 2 and 3, respectively. In Tables 2 and 3, for each topic number, the value for each method is the average over the 10 different training/test splits, and for each split, the accuracy is averaged over the 10 classifiers that are trained with different-sized training sets and tested on the test set. Ave-Acc denotes the average accuracy of each method over the different topic numbers. Note that BOW, BOW+Lex and BOW+RelWord do not use latent topic information and are therefore independent of the topic number. However, to facilitate comparison with the other methods, we also present their results in Tables 2 and 3, where the value of each of these methods is the same for every topic number. For the purpose of discussion, we also give the detailed results of the classifiers in Tables 4 and 5. Without loss of generality, we only show the results of classifiers with the topic number K set to 21.
Table 2. Accuracies (%) on test data: using NB as the base classifier.
Table 3. Accuracies (%) on test data: using MaxEnt as the base classifier.
Table 4. Accuracies (%) on test data: using NB as the base classifier and the topic number K=21.
Table 5. Accuracies (%) on test data: using MaxEnt as the base classifier and the topic number K=21.
Next, we will evaluate the experiment results from the following aspects:
Whether the related word feature set is effective for enriching the representation of texts.
Whether the latent topic feature set is effective for enriching the representation of texts.
Whether the combination of the two types of additional features helps to improve the performance.
Whether the performance of a polarity classification system can benefit from the ensemble technique and, among the various combination methods, which one performs best.
For simplicity, in the following analysis, unless otherwise specified, we focus on the Ave-Acc result on the test data.
First, in Table 2, we observe that BOW+RelWord yields an improvement of 1.18% (73.03 vs 71.85%) over BOW. Inspecting the detailed results in Table 4, we can see that the improvement is statistically significant at the p<0.01 level using a paired t-test over 10 pairs of different-sized training sets, which confirms the benefit of integrating related words into the representation of review text.
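As a rough illustration of the significance test used here, the following Python sketch computes a paired t statistic over 10 pairs of accuracies, one pair per training-set size. The accuracy values are hypothetical placeholders, not the values in Table 4, and the function name paired_t is an assumption. The constant 3.250 is the standard two-tailed critical value of Student's t distribution for 9 degrees of freedom at p = 0.01.

```python
import math
from statistics import mean, stdev

def paired_t(a, b):
    """Return the paired t statistic and degrees of freedom for samples a, b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    # t = mean difference divided by the standard error of the differences.
    t = mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))
    return t, len(diffs) - 1

# Hypothetical accuracies (%) for 10 training-set sizes; NOT the paper's data.
enriched = [66.1, 68.4, 70.0, 71.2, 72.3, 73.0, 73.8, 74.5, 75.1, 75.9]
baseline = [64.2, 66.9, 68.7, 70.1, 71.4, 72.2, 73.0, 73.9, 74.6, 75.3]

T_CRIT_DF9_P01 = 3.250  # two-tailed critical value, df = 9, p = 0.01
t, df = paired_t(enriched, baseline)
significant = abs(t) > T_CRIT_DF9_P01
print(f"t = {t:.2f}, df = {df}, significant at p < 0.01: {significant}")
```

The same routine applies to the other pairings used in this section (over topic numbers or over training/test splits), provided the critical value is chosen for the corresponding degrees of freedom.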
To answer the second question, we assess the performance of the following text representation methods: BOW+LDA, BOW+JST, BOW+LDAtsgn and BOW+JSTtsgn. In Tables 2 and 3, among the four methods, in terms of Ave-Acc, BOW+JSTtsgn performs statistically significantly better than the other methods using a paired t-test over pairs of different topic numbers, so BOW+JSTtsgn is considered to be the most effective. Also, BOW+JSTtsgn gains an improvement of 1.53% (73.38 vs 71.85%) and 1.34% (71.58 vs 70.24%) over BOW in Tables 2 and 3, respectively. In Tables 4 and 5, we find that BOW+JSTtsgn consistently outperforms BOW at the p<0.01 level using a paired t-test over 10 pairs of different training/test splits, which demonstrates the benefit and potential of integrating external sentiment topic knowledge into the representation of short texts. We also note that, in Tables 2 and 3, when K is changed from 3 to 60, in some cases BOW+LDA does not show a significant difference over BOW in sentence polarity classification. In traditional short-text classification, by contrast, the integration of hidden topics into BOW can always achieve significant improvements over BOW representations with different numbers of topics [9]. Moreover, in traditional short-text classification, BOW+LDA has a much larger margin of improvement than in sentence polarity classification. For example, in Phan et al. [9], when K=50, BOW+LDA achieves an impressive improvement of 23.14% (80.25 vs 57.11%) over BOW when classifying test data. Generally speaking, BOW+LDA is not as effective in sentiment analysis as in traditional short-text classification; we speculate that this is because the LDA model focuses on discovering and analysing the topics of documents without any analysis of the sentiment in the text, which limits the usefulness of the mined topics in sentiment analysis.
When comparing strategy 1 with strategy 2, since BOW+JSTtsgn significantly outperforms BOW+JST in terms of Ave-Acc in Tables 2 and 3, we conclude that strategy 2 is more effective than strategy 1. Furthermore, considering that strategy 2 does not need parameter tuning, we recommend it as the strategy for integrating latent topics extracted from a topic model into the text representation.
When the combination of the latent topics and related words is used as additional features to enrich the representation of texts, the corresponding methods show performance improvements over the enriched representations based on a single type of additional feature. For example, in Tables 2 and 3, in terms of Ave-Acc, BOW+BothJSTtsgn significantly outperforms BOW+JSTtsgn and BOW+RelWord using a paired t-test across different numbers of topics. Thus, the joint use of features learned from the external dataset to enrich text representations is superior to the use of either feature type alone. We believe that the improvements stem from the ability of the combination-based method to better utilize the characteristics of latent topics and related words to generate high-quality semantics for more accurate classification.
Next, we further investigate the effectiveness of the ensemble technique. We find that, when using an ensemble of BOW+RelWord and one of BOW+LDA, BOW+JST, BOW+LDAtsgn and BOW+JSTtsgn, the ensemble-based methods consistently obtain improvements over the individual classifiers. For example, in Table 3, in terms of Ave-Acc, Ensemble-JSTtsgn significantly outperforms BOW+JSTtsgn and BOW+RelWord with an improvement of 0.9% (72.48 vs 71.58%) and 1.63% (72.48 vs 70.85%), respectively. The results show that the individual methods in the ensemble can complement each other, and even the combination of two weak individual methods can improve the overall accuracy. Furthermore, to gain a more comprehensive view of the ensemble method, we also investigate combining BOW+RelWord with one of BOW+BothLDA, BOW+BothJST, BOW+BothLDAtsgn and BOW+BothJSTtsgn. We find that, in most cases, there is no significant difference between the performance of the joint features-based method and that of the corresponding ensemble-based method. We conjecture that, since the methods that use joint extra features already include both latent topic and related word information, BOW+RelWord is not guaranteed to be complementary to the other ensemble component, and the ensemble methods therefore do not show superiority over their components. In general, an ensemble benefits most when its individual classifiers have distinct strengths that complement each other.
Finally, we compare BOW+BothJSTtsgn with Ensemble-JSTtsgn. In Table 2, there is no significant difference between the performance of the two methods in terms of Ave-Acc. However, in Table 3, we find that Ensemble-JSTtsgn significantly outperforms BOW+BothJSTtsgn in terms of Ave-Acc, which provides suggestive evidence that, to achieve the same level of performance, we cannot replace our ensemble approach with a simpler setup that relies on a single classifier using all of the features currently exploited by different members of the ensemble. Overall, among the various methods, Ensemble-JSTtsgn is recommended. In Tables 2 and 3, in terms of Ave-Acc, it achieves an improvement of 2.31% (74.16 vs 71.85%) and 2.24% (72.48 vs 70.24%) over BOW, respectively. With a specific topic number, the improvements can be larger. For example, in Tables 4 and 5, in terms of Ave-Acc, Ensemble-JSTtsgn achieves an improvement of 2.67% (74.52 vs 71.85%) and 2.6% (72.84 vs 70.24%) over BOW, respectively.
An interesting phenomenon observable from Tables 4 and 5 is that the performance improvement of Ensemble-JSTtsgn over BOW tends to decrease as the training set size increases, as shown in Figure 5. A similar trend can be observed for the other methods such as Ensemble-JST, Ensemble-LDAtsgn and BOW+JSTtsgn. This is most likely because, when the training set is small, the data are sparser and the additional features help the classifier to alleviate the data sparseness problem. On the other hand, when the training set becomes larger, the sparseness problem becomes less severe, and the margin of performance improvement achieved using additional features is lower.

Accuracy improvement of Ensemble-JSTtsgn over BOW with different sizes of training data.
5.2. Data-driven features vs knowledge-based features
From the discussion above, we can see that the two types of features learned from the large-scale external dataset are helpful for improving classification performance. In this respect, the work reported in this paper incorporates data-driven features as additional features to enrich the representation of texts. In the SA field, previous works [3, 36] have shown that sentiment lexicons can be treated as informative features for the sentiment analysis task, and sentiment lexicon-based features can be seen as knowledge-based features. In this section, we compare the data-driven features with the knowledge-based features. From Tables 4 and 5, we find that BOW+Lex significantly outperforms BOW using a paired t-test over pairs of different-sized training sets and achieves an improvement of 1.41% (73.26 vs 71.85%) and 1.42% (71.66 vs 70.24%), respectively. Knowledge-based features are thus helpful for improving the performance, which is consistent with the previous works [3, 36]. In Tables 4 and 5, we find that the performances of BOW+Lex and BOW+JSTtsgn show no significant difference. We can conclude that the sentiment topic features derived from the JST model can be as effective as the knowledge-based features. This is reasonable since the JST C++ implementation that we used also incorporates prior knowledge obtained from the MPQA subjectivity lexicon. When comparing BOW+Lex with BOW+RelWord, we find that the former significantly outperforms the latter in Table 5. However, when the two types of data-driven features are used together as a single feature set, BOW+BothJSTtsgn significantly outperforms BOW+Lex.
From the above discussions, we can conclude that, compared with the knowledge-based features, the data-driven features can achieve equal or better performance, which further supports the effectiveness of the data-driven features learned from a large-scale external dataset.
5.3. Analysis of parameters
There are two parameters in the proposed ensemble approach: the combination weight parameter α and the number of topics K. In this section, we will look into the influence of the two parameters on the proposed approach.
Figure 6 shows the results of the representative approaches by varying K from 3 to 60. We can observe that Ensemble-JSTtsgn, which utilizes the two types of features, steadily outperforms the individual feature-based approaches, which means our recommended approach is effective. Similar results can be observed for the other methods such as Ensemble-JST, Ensemble-LDAtsgn and Ensemble-LDA.

Ensemble results with different numbers of topics.
Using NB as the base classifier, the peak performance of Ensemble-JSTtsgn is obtained when the number of topics K is set to 30. Using MaxEnt as the base classifier, the peak performance of Ensemble-JSTtsgn is obtained when the number of topics K is set to 21.
Next, we study the influence of the α setting on the performance of the proposed approach. With K fixed at 21, we perform another set of experiments by varying the value of α from 0.1 to 0.9. The results are shown in Figure 7. We can see that the peak performance of Ensemble-JSTtsgn is achieved when α is set to 0.6. Note that, when α is set to another value, in most cases Ensemble-JSTtsgn still outperforms the other ensemble-based approaches, which also indicates that Ensemble-JSTtsgn is the most suitable for polarity classification.

Ensemble results with different α values.
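To make the role of the combination weight concrete, the following Python sketch sweeps α from 0.1 to 0.9 for a two-member ensemble. It assumes, purely for illustration, that the ensemble linearly interpolates the positive-class probabilities of its two members and that α weights the first member; the exact combination rule, the function names and the toy probabilities below are assumptions rather than a reproduction of the experiments.

```python
def ensemble_prob(p_a, p_b, alpha):
    """Linearly interpolate two positive-class probabilities (assumed rule)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * p_a + (1.0 - alpha) * p_b

def accuracy(probs_a, probs_b, labels, alpha):
    """Accuracy of the interpolated ensemble; label 1 = positive, 0 = negative."""
    preds = [int(ensemble_prob(pa, pb, alpha) >= 0.5)
             for pa, pb in zip(probs_a, probs_b)]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Hypothetical member outputs on four sentences; NOT the paper's data.
probs_topic = [0.9, 0.4, 0.2, 0.7]    # e.g. a topic-enriched member
probs_relword = [0.6, 0.7, 0.4, 0.2]  # e.g. a related-word-enriched member
labels = [1, 1, 0, 1]

# Sweep alpha over 0.1, 0.2, ..., 0.9 as in the parameter analysis.
alphas = [round(0.1 * i, 1) for i in range(1, 10)]
results = {a: accuracy(probs_topic, probs_relword, labels, a) for a in alphas}
best_alpha = max(results, key=results.get)
print(results, best_alpha)
```

In practice the sweep would be run on held-out data rather than on the test set, so that the selected α is not tuned to the data used for the final evaluation.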
5.4. Effect of external dataset size
To explore the effect of external dataset size on classification performance, we construct external datasets of different sizes on the basis of the large movie review dataset. The first is a subset consisting of 40% of the data, randomly selected from the large movie review dataset. For convenience, we name it E-1 and name the full large movie review dataset E-2. Furthermore, to facilitate comparisons with larger external datasets, we make use of the Amazon movie review dataset used in McAuley and Leskovec [37], which spans 15 years and contains 7,911,684 reviews. Specifically, we randomly select 200,000, 300,000 and 400,000 reviews from the Amazon movie review dataset and add them to the large movie review dataset. This yields three further external datasets, which we name E-3, E-4 and E-5, respectively. External dataset statistics after pre-processing are given in Table 6.
External dataset statistics after pre-processing.
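The construction of E-1 to E-5 can be sketched in Python as follows. The function name, the fixed random seed and the toy review lists are assumptions for illustration; in particular, the source does not state whether the Amazon samples for E-3, E-4 and E-5 are nested, so the sketch simply draws each one independently.

```python
import random

def build_external_sets(large_reviews, amazon_reviews, extra_sizes,
                        fraction=0.4, seed=0):
    """Build differently sized external datasets from two review collections."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    # E-1: a random 40% subset of the large movie review dataset.
    n_sub = round(fraction * len(large_reviews))
    sets = {"E-1": rng.sample(large_reviews, n_sub),
            "E-2": list(large_reviews)}
    # E-3..E-5: the full large dataset plus a random sample of Amazon reviews.
    for name, n_extra in extra_sizes.items():
        sets[name] = list(large_reviews) + rng.sample(amazon_reviews, n_extra)
    return sets

# Toy stand-ins; the paper adds 200,000, 300,000 and 400,000 Amazon reviews.
large = [f"large_{i}" for i in range(10)]
amazon = [f"amazon_{i}" for i in range(20)]
sets = build_external_sets(large, amazon, {"E-3": 2, "E-4": 3, "E-5": 4})
print({name: len(reviews) for name, reviews in sets.items()})
```

With the real data, extra_sizes would be {"E-3": 200000, "E-4": 300000, "E-5": 400000}, and pre-processing would follow the sampling step.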
To assess the effect of external dataset size on the performance of our approach, we conduct a set of experiments on the test data with the external datasets presented in Table 6. Representative results are shown in Table 7. For each training/test split, we obtain the average value of 10 classifiers with different-sized training sets on the test data. Results reported in Table 7 are averaged over 10 different splits. We conduct a paired t-test between E-2 and E-1 on the accuracy values for different numbers of topics, and find that the performance of Ensemble-JSTtsgn using E-2 as the external dataset is statistically significantly better than that of Ensemble-JSTtsgn using E-1 as the external dataset. For the other external datasets, E-3, E-4 and E-5, we reach the same conclusion. Moreover, within the set M = {E-2, E-3, E-4, E-5}, there is no significant difference between the performance of Ensemble-JSTtsgn using any x ∈ M as the external dataset and that of Ensemble-JSTtsgn using any other y ∈ M \ {x} as the external dataset. The results for the development data are also consistent with the trends observed on the test data. Owing to space limitations, we do not present detailed experimental results on the development data here.
Accuracies (%) of test data: using different-sized external datasets.
From the above discussions, we can see that, when the external dataset size is small, adding more data (e.g. from E-1 to E-2) to the external dataset can lead to better results. However, when the external dataset size reaches a certain level, adding more data (from E-2 to E-5) does not increase the performance, and the effect of external dataset size on the performance of our approach is minor.
6. Conclusion
In this article, we exploit both latent topics and related words as additional features to enrich the representation of texts, and then obtain an ensemble classifier by using the two additional feature sets to guide the design of different members of the ensemble. The latent topics features are extracted from a large external dataset using a topic model and the related words features are learned from word embeddings trained on the large external dataset. Our approach provides a way to make sparse short reviews more related and topic-focused. We carry out extensive experiments and the results show that the proposed approach is effective for improving the overall performance for sentence-level polarity classification. To summarize, the contributions mainly include three parts:
Through an extensive study of latent topic feature sets learned from topic models in combination with different integration strategies, we find that the latent topic feature set extracted from the JST model is the most effective for enriching the representation. Furthermore, among the integration strategies, Strategy 2 is superior to Strategy 1.
We propose enriching the short-text representation with related words learned from word embeddings.
We propose an effective ensemble approach using the two types of additional features to guide the design of members of the ensemble, and experimental results show that the proposed approach effectively takes advantage of external data and can further improve the performance.
Several avenues remain to be explored in future work. First, although the experiments in this paper are conducted only on an English corpus, we will attempt to apply the proposed approach to corpora in other languages. Second, as pointed out in Tang et al. [28], most existing algorithms for learning continuous word representations typically only model the syntactic context of words but ignore the sentiment of text, so we will investigate sentiment-specific word embeddings in the proposed approach. Finally, we will further apply the approach to various domains (e.g. hotel, travel, camera, laptops).
Acknowledgements
We sincerely thank the anonymous reviewers for their valuable and insightful comments.
Funding
This work was supported by the National Planning Office of Philosophy and Social Science (grant number 14CTQ026).
