Abstract
Sentiment analysis plays an important role in understanding individual opinions expressed in websites such as social media and product review sites. The common approaches to sentiment analysis use the sentiments carried by words that express opinions and are based on either supervised or unsupervised learning techniques. The unsupervised learning approach builds a word-sentiment dictionary, but it requires lengthy time periods and high costs to build a reliable dictionary. The supervised learning approach uses machine learning models to learn the sentiment scores of words; however, training a classifier model requires large amounts of labelled text data to achieve a good performance. In this article, we propose a semisupervised approach that performs well despite having only small amounts of labelled data available for training. The proposed method builds a base sentiment dictionary from a small training dataset using a lasso-based ensemble model with minimal human effort. The scores of words not in the training dataset are estimated using an adaptive instance-based learning model. In a pretrained word2vec model space, the sentiment values of the words in the dictionary are propagated to the words that did not exist in the training dataset. Through two experiments, we demonstrate that the performance of the proposed method is comparable to that of supervised learning models trained on large datasets.
1. Introduction
The advancement of Internet technology has enabled the rise of social media, and the various platforms enable people to share their opinions quickly by posting texts that express their thoughts and intentions. Sentiment analysis involves classifying the polarity of texts such as reviews and has become popular in many fields because it can be used to assess public opinion by automatically analysing large amounts of text posted to websites. In the business world, customer reviews related to services (movies, food, etc.) or products have a significant effect on the decisions of potential consumers and service providers. Potential consumers analyse the positive or negative reviews about a service and use the results as an important aspect of service selection. Service providers use the review analysis results to establish strategies to improve their service quality [1]. In addition, sentiment analysis has been used to analyse relationships between the stock market and people’s sentiments as expressed on social media [2] and to assess public sentiments regarding political issues based on individual opinions posted to Twitter [3]. Recent studies have also conducted a bibliometric analysis of the literature related to sentiment analysis [4] and performed a sentiment analysis of object aspects [5].
Sentiment analysis has conventionally been conducted using two different approaches: unsupervised learning and supervised learning. In unsupervised learning, the polarity of texts is assessed based on a sentiment dictionary that contains the sentiment scores of words [6]. This approach identifies the polarity of a text by combining the sentiment scores of the words that compose the text. Its advantages are that it is intuitive and determines the text polarity quickly [7]. However, the downside of the approach is that it requires creating the sentiment dictionary. Although preconstructed dictionaries are available, the sentiment scores for individual words may change based on the analysis domain [8]. Moreover, building a sentiment dictionary that matches a specific domain involves high costs [9]. In addition, because sentiment dictionaries are created manually, it is difficult to guarantee their reliability [8].
In supervised learning, text polarity is predicted using a trained classifier [10]. This approach relies on machine learning techniques, which are methods of training classifiers from a dataset composed of texts with assigned polarity labels. The supervised learning approach can result in high prediction accuracy [11,12]; however, the accuracy of supervised methods is highly affected by the quality and volume of training data. Significant differences in performance have been observed among various machine learning models used for sentiment analysis [13].
In this article, we propose a semisupervised learning method for sentiment analysis that performs well even when only small amounts of training data are available. The proposed method overcomes the disadvantages of the aforementioned approaches. The method first learns the sentiment scores of words from a small training dataset using a lasso regression; then, it estimates the scores of words not in the dataset from the learned sentiment scores of the training words, using a word2vec representation of words. For this second stage, we propose an adaptive instance-based learning method to expand the dictionary.
The experimental results show that the performance of our proposed semisupervised method is nearly equivalent to that of the machine-learning-based supervised models that require a large volume of training data. In addition, our method outperformed other comparative algorithms, including SentiWordNet, which uses a preconstructed dictionary [14]. The proposed method can build a high-performance dictionary because words with similar usages are highly likely to be near one another in the word2vec vector space; consequently, they are highly likely to have similar sentiment scores. The advantage of the proposed method is that it can perform an appropriate sentiment analysis in a domain while requiring only the minimal effort needed to assign polarity labels to a small amount of training data.
The remainder of this study is organised as follows. Section 2 reviews the relevant studies of unsupervised, supervised and semisupervised sentiment analyses. Section 3 presents the proposed sentiment analysis method. Section 4 presents the experimental results of the proposed method. Section 5 discusses the experimental results. Section 6 concludes this study and discusses future research topics.
2. Related works
In sentiment analysis, an unsupervised learning approach is used to build sentiment dictionaries. Turney [6] calculated the sentiment score of documents from the combined sentiment scores of the words or phrases in those documents. Kennedy and Inkpen [15] performed sentiment analyses using the frequency of adjectives in the text through the term-counting method. The term-counting method categorises texts as positive when more positive words exist than negative words; otherwise, the texts are categorised as negative. Benamara et al. [16] proposed the use of adverbs and adjectives in text polarity detection. Taboada et al. [17] studied a lexicon-based method to estimate the sentiment intensity of words, and Khoo and Johnkhan [18] compared six sentiment lexicons for sentiment categorisation of texts at the document and sentence levels. Hogenboom et al. [19] used emoticons for sentiment analysis. Kontopoulos et al. [20] developed ontology-based techniques to perform a sentiment analysis of Twitter posts. Fernández-Gavilanes et al. [21] proposed an approach to predict sentiment in online texts based on a linguistic sentiment propagation model. Fersini et al. [22] suggested a method to estimate the sentiments of words using the personal relationships discovered from an approval subnetwork of a social network. Yu et al. [23] refined word embedding models (e.g. word2vec and GloVe) using a sentiment dictionary. Deng et al. [24] proposed an adaptive sentiment lexicon model using the pointwise mutual information of seed lexicons and texts. These unsupervised learning models are intuitive, but building their dictionaries occasionally involves high costs, and it is difficult to ensure that the resulting dictionaries are reliable.
Supervised learning converts texts with polarity labels into a document–term matrix and uses the matrix as input to train a machine learning model. Then, the trained model is used to classify the polarity of new texts. Researchers have studied sentiment analysis using various representative machine learning models, including the support vector machine (SVM) [25], ensemble learning [26], stacked denoising autoencoder [27], naïve Bayes algorithms [28], logistic regression [29], convolutional neural network [30] and shrinkage regression [5,31]. These models have achieved good performance in various domains. However, all these models require a considerable amount of labelled data to create high-performance classifiers, and training them demands considerable computing power.
Semisupervised learning uses both sentiment dictionaries and machine learning models. Sindhwani and Melville [32] claimed that when a strong association exists between a document and a word, their sentiment scores will be similar. The authors proposed a bipartite graph representation to find the degree of association between unlabeled documents and labelled words in the dictionary. They argued that the proposed method achieved higher performance levels than machine-learning-based methods when fewer labelled than unlabeled data were available. Li et al. [33] applied voting by multiple classifiers to estimate the polarity labels of unlabeled text data from estimated confidence levels. When the confidence level of the estimated labels was high, the authors assigned the estimated labels to the corresponding unlabeled data and iterated the training of multiple classifiers using the augmented text data. Yang et al. [34] proposed a method that combined the unsupervised latent Dirichlet allocation model with an SVM trained with the text features extracted by a stacked denoising autoencoder to create a sentiment dictionary. Blitzer et al. [35] proposed a method that learns from a small amount of source domain data using supervised learning and then adapts the learned model to another target domain.
The above semisupervised learning methods focus more on supervised than unsupervised learning. Accordingly, the computing issues that occur during supervised learning (e.g. extremely long training times and the large amounts of computer memory required to create a document–term matrix) remain unresolved. In contrast, our semisupervised method completes the dictionary-building process within a few minutes with few memory issues by adopting a two-stage approach to build a sentiment dictionary. Because this approach requires only a small amount of labelled data for training during the first stage of the method, it is possible to train a model in just a few minutes. We assume that a ‘small amount of data’ represents a number (1,000) that humans can label directly without much effort. Therefore, our method can be applied to real domains without requiring significant data labelling effort and without computing issues.
3. The proposed method
3.1. Overview
Figure 1 depicts the steps to create a sentiment dictionary using the proposed method. First, the proposed method preprocesses the entire collection by removing unnecessary terms such as stop words (step 1). Second, a small amount of training text data is selected from the preprocessed corpus (in the experiments, we used restaurant and mobile phone reviews as the training data), and a polarity label is manually assigned to each selected data point (step 2). This small dataset is referred to as the training dataset. The remaining steps form two stages: base dictionary creation and dictionary expansion. The first stage, base dictionary creation, learns the sentiment scores of the words in the training dataset by applying lasso bagging. The result is a base dictionary composed of the training words and their corresponding sentiment scores (step 3). The second stage expands the base dictionary using word2vec and adaptive instance-based learning. Word2vec is a feature representation model that transforms words into vectors in a space such that words with similar uses reside at similar locations in the space [36]. We trained the word2vec model on the review datasets. For the collected dataset (step 4), word2vec maps all the words in the entire collection, including the training data, to a vector space (step 5). For each word not present in the training data, the adaptive instance-based learning model finds the k nearest neighbour words in the vector space that are registered in the base dictionary using the cosine similarity measure (step 6). Then, it estimates the sentiment score of the new word by combining its neighbours’ sentiment scores and the cosine similarity values (step 7). When these two stages are complete, a suitable sentiment dictionary for the analysis has been built.

Figure 1. Procedural overview of the proposed method.
3.2. Preprocessing
Standard text preprocessing removes stop words in the text corpus and applies part-of-speech (POS) tagging, which involves separating the words of sentences into their morpheme units (verb, noun, adjective, etc.). This preprocessing step ensures that words with the same meaning are treated identically even when they are used as different parts of speech or in different tenses. For example, words such as ‘do’, ‘did’, ‘does’ and ‘doing’ have identical meanings but appear in different tenses or parts of speech; they all share the common root ‘do’. Therefore, all the words that originate from ‘do’ are standardised to ‘do’. However, we do not perform POS tagging in this study; instead, we remove only a minimal set of stop words from the text corpus because the ensemble learning process and word2vec will assign similar sentiment scores to words with the same root. A sentiment dictionary with good performance can be created even when the preprocessing step skips POS tagging.
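The minimal preprocessing described above can be sketched as follows. This is only an illustration under stated assumptions: the stop-word list `STOP_WORDS` and the tokenisation rule are hypothetical, because the exact list and tokeniser used in the study are not specified.

```python
import re

# Hypothetical minimal stop-word list; the list used in the study is not specified.
STOP_WORDS = {"a", "an", "the", "is", "it", "and", "of", "to"}

def preprocess(text):
    """Lower-case the text, split it into word tokens and drop stop words.

    No POS tagging or root-form normalisation is applied.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]

tokens = preprocess("This juice is very nice and sweet")
print(tokens)  # ['this', 'juice', 'very', 'nice', 'sweet']
```

Inflected forms such as ‘did’ or ‘doing’ would pass through unchanged here; the method relies on the later stages to give them similar scores.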
3.3. Base dictionary building
3.3.1. Shrinkage regression
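We fit a linear regression after transforming the preprocessed text into a document–term matrix to build a base dictionary. Linear regression is a statistical model that linearly combines the explanatory variables (here, the entries $x_{ij}$ of the document–term matrix for text $i$ and term $j$) to predict the response (here, the polarity label $y_i$):
$$\hat{y}_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} \tag{1}$$
Lasso and ridge, which are shrinkage regression models, add a penalty term on the regression coefficients to the objective function so that both the residual error and the size of the regression coefficients are minimised:
$$\hat{\beta} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2 + \lambda \sum_{j=1}^{p} |\beta_j|^{q} \right\} \tag{2}$$
where $q = 1$ gives lasso and $q = 2$ gives ridge.
The amount of shrinkage in the regression coefficients can be controlled by the penalty parameter $\lambda \ge 0$: the larger $\lambda$ is, the more the coefficients shrink towards zero.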
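To make the shrinkage effect concrete, the following is a minimal sketch of lasso fitted by coordinate descent on a tiny document–term matrix. It assumes the objective is the residual sum of squares plus `lam` times the sum of absolute coefficients, omits the intercept, and uses made-up data; the function names `soft_threshold` and `lasso_fit` are hypothetical, and the solver used in the study is not specified.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; return 0 inside the threshold band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_fit(X, y, lam, n_iter=200):
    """Coordinate descent for: sum_i (y_i - sum_j b_j * x_ij)^2 + lam * sum_j |b_j|.

    X is a list of rows (texts) of term counts; y holds labels such as +1 / -1.
    The intercept is omitted to keep the sketch short (an assumption).
    """
    n_docs, n_terms = len(X), len(X[0])
    beta = [0.0] * n_terms
    for _ in range(n_iter):
        for j in range(n_terms):
            rho = 0.0   # correlation of column j with the residual that ignores term j
            norm = 0.0  # squared norm of column j
            for i in range(n_docs):
                partial = sum(beta[k] * X[i][k] for k in range(n_terms) if k != j)
                rho += X[i][j] * (y[i] - partial)
                norm += X[i][j] ** 2
            beta[j] = soft_threshold(rho, lam / 2.0) / norm if norm > 0 else 0.0
    return beta

# Made-up document-term matrix: rows are texts, columns follow vocab.
vocab = ["great", "terrible", "phone"]
X = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]]
y = [1, 1, -1, -1]
scores = dict(zip(vocab, lasso_fit(X, y, lam=1.0)))
print(scores)  # {'great': 0.75, 'terrible': -0.75, 'phone': 0.0}
```

In this toy case the penalty shrinks the two sentiment-bearing words from ±1 to ±0.75 and leaves the neutral word ‘phone’ at exactly zero, which is the sparsity property that makes lasso coefficients usable as dictionary scores.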
3.3.2. Lasso bagging
Ensemble learning uses multiple base models of identical types: each trained base model estimates the sentiment scores of words in the training dataset. As the base model, the proposed method uses lasso. The trained base models may each estimate different sentiment scores for the same word. In such cases, their average score is assigned to the word. This approach increases the reliability of the base dictionary. We propose the following ensemble algorithm to assign more reliable sentiment scores to the words in the training data:
1. From the training data, texts are randomly sampled with replacement to create a sample. The size of the sample is set to that of the training data. The sample is referred to as a ‘bootstrap sample’.
2. A base model (lasso) is trained using the bootstrap sample.
3. The remaining training data (the texts not drawn into the bootstrap sample) are used to evaluate the performance of the trained model. This step is referred to as validation.
4. Steps 1–3 are repeated once for each base model in the ensemble.
Because of the property of bootstrap sampling (i.e. sampling with replacement), there is a high probability that each text in the original training data belongs to more than one sample. Accordingly, the base regression models will generate different regression coefficients for identical words. The proposed method regards the sum of the regression coefficients as the final sentiment score of the word. Ensemble learning has the advantage of building a stable sentiment dictionary. In addition, the validation step (step 3) enables the minimum performance of the base dictionary to be verified. This matters because a lack of reliability has been criticised as the main disadvantage of dictionary-based sentiment analysis.
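The bagging procedure can be sketched as below. This is a hedged illustration, not the study's implementation: `bagging_scores` accepts any base learner, and the demonstration uses a simple stand-in (`mean_label_fit`, which is not lasso) so that the block stays self-contained. The sketch averages the coefficients over the base models; summing them instead differs only by a constant factor and leaves the signs unchanged.

```python
import random

def bagging_scores(texts, labels, fit, n_models=30, seed=0):
    """Bootstrap the labelled texts, fit one base model per sample and average word scores.

    fit(texts, labels) must return a dict mapping word -> coefficient.
    Returns the averaged dictionary and, per model, the indices held out for validation.
    """
    rng = random.Random(seed)
    n = len(texts)
    totals, held_out = {}, []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]            # step 1: sample with replacement
        coefs = fit([texts[i] for i in idx], [labels[i] for i in idx])  # step 2: train base model
        held_out.append(sorted(set(range(n)) - set(idx)))     # step 3: texts left for validation
        for word, coef in coefs.items():
            totals[word] = totals.get(word, 0.0) + coef
    # A word absent from a bootstrap sample contributes zero for that model.
    averaged = {word: total / n_models for word, total in totals.items()}
    return averaged, held_out

def mean_label_fit(texts, labels):
    """Stand-in base learner (NOT lasso): mean label of the texts containing each word."""
    sums, counts = {}, {}
    for tokens, label in zip(texts, labels):
        for word in set(tokens):
            sums[word] = sums.get(word, 0.0) + label
            counts[word] = counts.get(word, 0) + 1
    return {word: sums[word] / counts[word] for word in sums}

# Made-up labelled texts (+1 positive, -1 negative).
texts = [["great", "phone"], ["great", "screen"], ["terrible", "phone"], ["terrible", "battery"]]
labels = [1, 1, -1, -1]
base_dictionary, held_out = bagging_scores(texts, labels, mean_label_fit)
print(base_dictionary)
```

Replacing `mean_label_fit` with a lasso fit on the document–term matrix of each bootstrap sample would give the lasso bagging described in the text.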
3.4. Expansion of base dictionary
The base dictionary contains the words used during the lasso bagging training. This section explains the procedure to expand the base dictionary using word2vec and adaptive instance-based learning. The expanded dictionary registers all the words in the text corpus in the analysis domain along with their estimated sentiment scores.
3.4.1. Word2vec
Word2vec is an artificial neural network that expresses words as numerical vectors. Its architecture consists of an input layer, a hidden layer and an output layer. Each layer includes nodes, and the nodes in consecutive layers are fully connected with weights. A trained word2vec model outputs a vector for each input word. The training principle is as follows. Given a target word, the network inputs word sequences of finite lengths that appear before and after the target word in a sentence; then, the weights of the network are adjusted so that the output becomes the target word. This weight learning mechanism is iterated for each word in the text corpus. This describes the continuous bag-of-words form of word2vec; the skip-gram form reverses the direction and predicts the surrounding words from the target word.
For example, consider the sentence, ‘This juice is very nice and sweet’, and suppose that word2vec is trained so that the output is ‘nice’ for the inputs ‘This juice is very’ and ‘and sweet’. The learned weights will become a feature vector that represents ‘nice’. If the surrounding words of ‘nice’ and ‘great’ are similar, for example, in restaurant reviews, ‘nice’ and ‘great’ will be located near each other in the vector space. Figure 2 shows an example of word vectors learned from a restaurant review dataset. A visualisation tool referred to as t-distributed stochastic neighbour embedding (t-SNE) [38] was applied to represent the word vectors in the two-dimensional space. In this figure, word cluster 1 includes (dog, cat, cats and dogs), cluster 2 includes (did, didn’t, don’t, can, could and will) and cluster 3 includes (nice, wonderful, great and good). This example demonstrates that word2vec places words with similar uses close to each other in the vector space. We trained a word2vec model (skip-gram) on the review datasets. In addition, we evaluated the pretrained word2vec model in the experiment.

Figure 2. Example of word2vec results in a two-dimensional space.
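The way training examples are formed from a sentence can be illustrated with a short sketch. It only builds the (context words, target word) pairs for the example sentence used in the text and does not train a network; the function name `context_pairs` and the window size of 4 are assumptions chosen so that the whole example sentence fits in the window.

```python
def context_pairs(tokens, window=4):
    """Return one (context_words, target_word) pair per position in the sentence."""
    pairs = []
    for pos, target in enumerate(tokens):
        before = tokens[max(0, pos - window):pos]   # up to `window` words before the target
        after = tokens[pos + 1:pos + 1 + window]    # up to `window` words after the target
        pairs.append((before + after, target))
    return pairs

sentence = "This juice is very nice and sweet".split()
pairs = context_pairs(sentence)
context, target = pairs[sentence.index("nice")]
print(context, "->", target)
# ['This', 'juice', 'is', 'very', 'and', 'sweet'] -> nice
```

Words such as ‘nice’ and ‘great’ that tend to occur with the same context lists receive similar weight updates, which is why they end up close together in the vector space.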
3.4.2. Adaptive instance-based learning
Adaptive instance-based learning is an unsupervised learning algorithm proposed in this article. It calculates the sentiment scores of unregistered words (those that are in the text corpus but not registered in the base dictionary) using the scores of the words that are registered in the base dictionary. Instead of finding the k nearest registered words for each unregistered word, adaptive instance-based learning finds the neighbouring unregistered words for each registered word in the word2vec vector space. The number of neighbouring unregistered words is proportional to the magnitude of the sentiment score of the registered word: the larger the score, the larger the neighbourhood. This approach reflects the fact that registered words with large sentiment scores play a more important role in determining the sentiment scores of neighbouring unregistered words. In addition, the number of neighbours is designed to be affected by the class imbalance ratio between the number of positive texts and the number of negative texts in the text corpus. It is reasonable to increase the number of neighbours for registered words that have the same sentiment sign as that of the majority text label (whether positive or negative). If, instead, the number of neighbours is fixed for every word, it is highly probable that a sentiment score will be assigned to words that do not convey sentiment. Such words are likely to introduce noise when the sentiment of a text is determined. Figure 3 presents the procedure for determining the number of neighbouring unregistered words by considering the magnitude of the sentiment score of a registered word and the class imbalance ratio.

Figure 3. Procedure for determining the number of neighbours in adaptive instance-based learning.
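One possible way to turn the two factors above into a neighbour count is sketched below. The exact rule is given only in Figure 3 and is not reproduced here, so the formula in `neighbour_count` is purely an assumption for illustration: the cap `max_neighbours` is scaled by the relative magnitude of the score, and words whose sign matches the minority class are further scaled down by the minority-to-majority ratio.

```python
def neighbour_count(score, max_abs_score, max_neighbours, n_pos, n_neg):
    """Hypothetical rule (not the rule of Figure 3): scale the cap by score magnitude
    and by class balance, favouring words that agree with the majority label."""
    magnitude = abs(score) / max_abs_score
    majority_is_positive = n_pos >= n_neg
    same_sign_as_majority = (score > 0) == majority_is_positive
    # Words on the majority side keep the full allowance; the others are reduced.
    balance = 1.0 if same_sign_as_majority else min(n_pos, n_neg) / max(n_pos, n_neg)
    return int(round(max_neighbours * magnitude * balance))

# Made-up corpus with 700 positive and 300 negative texts, cap of 100 neighbours.
print(neighbour_count(8.0, 8.0, 100, 700, 300))   # 100 (strong word on the majority side)
print(neighbour_count(4.0, 8.0, 100, 700, 300))   # 50  (half the magnitude)
print(neighbour_count(-8.0, 8.0, 100, 700, 300))  # 43  (strong word on the minority side)
```

Whatever the exact rule, the effect described in the text is the same: words with scores near zero receive few or no neighbours, so they propagate little sentiment.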
For each registered word in the base dictionary, the proposed method finds the neighbouring unregistered words in the word2vec vector space using the cosine similarity measure. Cosine similarity uses the cosine value of the angle between two vectors to measure their degree of similarity. The higher the similarity between a registered word and an unregistered word, the more likely it is that the two words are used in similar ways, which implies that they have similar polarity. For example, if the sentiment score of ‘delicious’ is 8.94 in the base dictionary of restaurant reviews, unregistered words that are closest to ‘delicious’ in the word2vec space are extracted, and the sentiment score of ‘delicious’ is multiplied by the cosine similarity value of each extracted word. As shown in Table 1, words that are close to ‘delicious’ in the word2vec space are ‘yummy’, ‘delish’, ‘tasty’, ‘fabulous’, ‘scrumptious’ and ‘fantastic’ in that order. These words have meanings similar to ‘delicious’ and are used in similar situations.
Table 1. Example of base dictionary expansion (the sentiment score of ‘delicious’ is assumed to be 8.94).
The sentiment score for an unregistered word is the weighted sum of the sentiment scores of its associated registered words. A weight is defined as the cosine similarity value between an unregistered word and a registered word. For example, if ‘yummy’ is selected as a neighbouring word of both ‘delicious’ (which has a sentiment score of 3) and ‘delish’ (with a sentiment score of 2), both of which are registered in the base dictionary, the proposed method sets the sentiment score of ‘yummy’ as follows: sentiment score of ‘delicious’ (3 points) × similarity (0.8) + sentiment score of ‘delish’ (2 points) × similarity (0.7) = 3.8 points.
In the procedure of Figure 3, each registered word is associated with multiple unregistered words; however, unregistered words are not guaranteed to be associated with registered words. In other words, some unregistered words may have no neighbours. For example, suppose that the pronoun ‘I’ is in the text corpus of restaurant reviews but is not included in the training dataset used to build the base dictionary. Word2vec allocates ‘I’ to a position in the space distant from registered words with high sentiment scores (e.g. ‘delicious’, ‘terrible’). Thus, the registered words might not regard the pronoun ‘I’ as a neighbour in adaptive instance-based learning. Although the pronoun ‘I’ might be associated with some registered words in the worst case, it is likely to have a sentiment score near zero because the cosine similarity values will be very small. In contrast, if k nearest neighbour (kNN) were used instead of the adaptive instance-based learning, a nonzero score would be assigned to the pronoun ‘I’ because k is always positive, so every unregistered word receives neighbours.
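The propagation from one registered word to its nearest unregistered words can be sketched as follows. The two-dimensional vectors are made up for illustration (the values in Table 1 are not reproduced), and the helper names `cosine_similarity` and `propagate` are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def propagate(registered_vec, registered_score, unregistered_vecs, n_neighbours):
    """Return {word: (similarity, contribution)} for the closest unregistered words."""
    ranked = sorted(
        ((cosine_similarity(registered_vec, vec), word) for word, vec in unregistered_vecs.items()),
        reverse=True,
    )
    # Each neighbour receives the registered score weighted by its similarity.
    return {word: (sim, registered_score * sim) for sim, word in ranked[:n_neighbours]}

# Made-up two-dimensional vectors; real word2vec vectors have many more dimensions.
delicious_vec = [1.0, 0.0]
unregistered = {"yummy": [0.9, 0.1], "tasty": [0.8, 0.3], "table": [0.0, 1.0]}
contributions = propagate(delicious_vec, 8.94, unregistered, n_neighbours=2)
for word, (sim, contribution) in contributions.items():
    print(f"{word}: similarity={sim:.3f}, contribution={contribution:.2f}")
```

With these toy vectors, ‘yummy’ and ‘tasty’ are selected and ‘table’, which is orthogonal to ‘delicious’, is left out.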
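The weighted sum in this example can be checked with a few lines of code. The function name `expanded_score` is hypothetical; the scores and similarity values are the ones quoted in the example.

```python
def expanded_score(neighbours):
    """Weighted sum of registered scores, with cosine similarity as the weight.

    neighbours maps each registered word to (sentiment score, similarity to the new word).
    """
    return sum(score * similarity for score, similarity in neighbours.values())

yummy_neighbours = {"delicious": (3, 0.8), "delish": (2, 0.7)}
print(round(expanded_score(yummy_neighbours), 2))  # 3.8
```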
The polarity prediction for new texts not in the collected text corpus is conducted as follows. When a word in the text is included in the expanded sentiment dictionary, then the sentiment scores are used to determine the text polarity; otherwise, that word is not considered. If the sum of the sentiment scores is positive, then the text is assigned a positive label; otherwise, it is assigned a negative label.
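The prediction rule amounts to a dictionary lookup and a sum, as the following minimal sketch shows. The dictionary entries and their scores are made up for illustration, and `predict_polarity` is a hypothetical name.

```python
def predict_polarity(tokens, dictionary):
    """Sum the scores of the words found in the dictionary; other words are ignored."""
    total = sum(dictionary[tok] for tok in tokens if tok in dictionary)
    label = "positive" if total > 0 else "negative"
    return label, total

# Hypothetical expanded dictionary with made-up scores.
expanded_dictionary = {"great": 2.5, "survived": 0.4, "freezing": -1.8, "properly": 0.3}

print(predict_polarity(["great", "phone", "survived"], expanded_dictionary)[0])   # positive
print(predict_polarity(["keeps", "freezing", "properly"], expanded_dictionary)[0])  # negative
```

Note that, under the rule as stated, a text whose words are all missing from the dictionary sums to zero and is therefore labelled negative.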
The proposed sentiment analysis method is based on semisupervised learning. The method initially involves supervised learning because the base dictionary is built using the ensemble model, which is trained from a small amount of labelled training data. Then, the method switches to unsupervised learning because the base dictionary expansion is performed without training data; instead, the sentiment scores of the registered words in the base dictionary are used to estimate the scores of unregistered words by measuring their cosine similarity values in the word2vec vector space. The main advantage of the proposed method is that a suitable sentiment dictionary for the analysis domain can be built without a large human effort. In addition, a number of cases exist in which the same word may have different sentiment scores according to the domain. In such cases, word2vec learns different locations for the identical word in the vector space when the role of the word changes. It is not necessary to measure sentiment values for every word in the document. Words that require sentiment values usually appear frequently across documents, so most of them are included in the word2vec model, provided that a large number of documents is used to train it.
4. Experiment
The proposed method was tested using two review datasets: the Amazon product review dataset collected by Blitzer et al. [35] and a mobile phone and restaurant review dataset provided by Kaggle. Table 2 shows detailed statistics for these datasets. A lasso bagging model composed of 30 base models was used to build a base dictionary; then, the base dictionary was expanded using the words in the test data. The maximum number of neighbouring words required for dictionary expansion was set to 10, 30, 50, 100, 300, 500 or 1,000. We trained a 50-dimensional word2vec model for each review dataset, with the window size set to 12 and the minimum count set to 30. We also compared the results with a pretrained word2vec model. The final performance result for each review case was the average of 10 tests.
Table 2. The dataset statistics.
4.1. Blitzer dataset
Blitzer et al. [35] proposed sentiment domain adaptation using a small labelled dataset. In our experiment, we evaluated our method in a single-domain setting. To ensure a fair comparison with previous works, we randomly split each dataset into a training dataset of 1,600 samples and a test dataset of 400 samples. We trained the word2vec model on the unlabeled dataset, built the base dictionaries using labelled training samples and then expanded them using the word2vec model. We repeated the experiment 10 times. We used accuracy as a performance measure, calculated as shown in equation (3), because the data contained equal numbers of positive and negative labels. In equation (3), accuracy refers to the proportion of reviews that were correctly classified:
$$\text{Accuracy} = \frac{\text{number of correctly classified reviews}}{\text{total number of reviews}} \tag{3}$$
4.2. Kaggle dataset
In this experiment, we compared our method with several supervised learning approaches and with an approach that uses a preconstructed dictionary. We compared the methods in two settings: one in which all methods used the same small training dataset, and one in which a large training dataset was used for the supervised learning methods and only a small training dataset was used for our method. In the first experimental setting, we formed a training dataset using 1,000 randomly sampled reviews from the 100,000 total reviews; the remaining 99,000 reviews were used as a test dataset to measure the performance of the methods. In the second experimental setting, we formed a training dataset using 70,000 randomly sampled reviews, and the remaining 30,000 reviews were used as a test dataset for supervised learning. We used the following supervised models on unigram features: a single lasso, a single ridge, an SVM and a random forest. We used SentiWordNet as the preconstructed dictionary. We selected two performance metrics: accuracy and area under the curve (AUC). The AUC is the area under the receiver operating characteristic (ROC) curve.
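Both metrics can be computed directly from the summed sentiment scores of the test texts, as in the following sketch. It assumes labels coded as 1 (positive) and 0 (negative) and a decision threshold of zero; the AUC is computed through its pairwise interpretation (the probability that a randomly chosen positive text receives a higher score than a randomly chosen negative one), which is equivalent to the area under the ROC curve. The data are made up.

```python
def accuracy(labels, scores):
    """Proportion of texts whose predicted label (score > 0 means positive) is correct."""
    predictions = [1 if s > 0 else 0 for s in scores]
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)

def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison; ties count as one half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l != 1]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and summed sentiment scores for four texts.
labels = [1, 1, 0, 0]
scores = [2.0, -0.5, 0.3, -1.0]
print(accuracy(labels, scores))  # 0.5
print(auc(labels, scores))       # 0.75
```

The example also shows why the two metrics can disagree: accuracy depends on the zero threshold, whereas the AUC depends only on how the scores rank the texts.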
4.3. Results and analysis
4.3.1. Results on the Blitzer dataset
Table 3 compares the performance of our method with a baseline for each dataset. The results include average performance indicators and 95% confidence intervals. The baseline is a linear classifier trained without adaptation. We confirmed that there is no statistically significant difference between the baseline and our method, although our method outperformed the baseline on average. In addition, the expanded dictionary acquired through word2vec showed improved performance over the baseline, especially on the Book dataset. In general, the performance of the word2vec model depends on the amount of training data. The Book dataset is suitable for training the word2vec model because it contains a greater amount of unlabeled data that can be learned than the other datasets. On the other hand, for the Electronics and Kitchen datasets, the performance of the expanded dictionary is sometimes lower than that of the base dictionary because the amount of data that can be learned is relatively small. Furthermore, we confirmed that the dictionary performance degraded when the base dictionary was expanded through the pretrained word2vec model.
Table 3. Performance of the proposed method on the Blitzer dataset.
4.3.2. Results on the Kaggle dataset
Figure 4 visualises the base dictionary of the mobile phone review case, which was created using the lasso-based ensemble model and trained with a bootstrap sample of 1,000 reviews. Words with positive and negative sentiment scores are visualised with dark and white circles, respectively. The larger the sentiment score of a word, the larger its circle. Among the words in the bootstrap sample, only a small portion is shown in the dark colour. For example, the positive words include ‘perfect’, ‘excellent’, ‘great’, ‘quickly’ and ‘offer’, and the negative words include ‘limited’, ‘defective’, ‘do not’, ‘upset’ and ‘angry’.

Figure 4. Visualisation of the base dictionary (mobile phone review data).
As described in Section 3.3.2, the ensemble learning process trained 30 base models from bootstrap samples; the performance of each model was then evaluated (using accuracy and AUC) for validation. Through the validation process, we could verify the minimum performance of the base dictionary before the dictionary was expanded. For the mobile phone review case, the average accuracy and AUC were 76% and 84%, respectively. For the restaurant review case, the average accuracy and AUC were 77% and 78%, respectively. Therefore, the lasso-based ensemble learning generated a base dictionary with an accuracy of approximately 77% and an AUC of approximately 80% on average across the two cases.
Figure 5 shows the sentiment levels of words in the expanded dictionary of the mobile phone review case when the maximum number of neighbouring words was set to 100. The expanded dictionary included 5,249 words, and, compared with Figure 4, many more words were assigned significant sentiment scores. In addition, both positive and negative words created clusters, which implies that the positive (or negative) words in a cluster frequently appeared together in reviews of the same polarity label.

Visualisation of the expanded dictionary (mobile phone review data).
As explained in Section 4.2, 1,000 of 100,000 reviews were randomly sampled, and polarity labels were assigned. This training dataset was used to build the base dictionary. To compare the performance before and after the base dictionary was expanded, the remaining 99,000 reviews were used as the test data. Table 4 shows the comparison results which include average performance indicators and 95% confidence intervals. From the results, we confirmed that the expanded dictionary significantly outperformed the base dictionary on the test data. Because the confidence intervals do not overlap, it can be said that there is a statistically significant difference within 95% of the confidence. Although the base dictionary did not exhibit excellent performance because of the small amount of training data, its performance was stable, and it showed similar performance results for the small validation dataset (the training data other than the bootstrap sample) and on the large test dataset of 99,000 reviews. Unlike the Blitzer dataset experiments, the performance of the expanded dictionary through pretrained word2vec was slightly higher than that of the base dictionary. Depending on the domain, using the pretrained word2vec model may result in a slight performance improvement, but we have confirmed that using word2vec after it has learned data from that domain will result in greater performance improvements. This result occurs because word usage in each domain can be slightly different. Furthermore, we confirmed that the adaptive instance-based learning algorithm performs better than the algorithm fixing a number of neighbour words when expanding the dictionary. In particular, in the restaurant review, adaptive learning showed a better performance of 4%–5%, and when the number of neighbour words was fixed, the performance deviation was large as the maximum number of neighbour words was changed. 
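The comparison of confidence intervals over repeated runs can be sketched as follows. How the intervals in Table 4 were computed is not stated, so this sketch assumes a Student's t interval for the mean of 10 runs (critical value 2.262 for 9 degrees of freedom); the accuracy values are made up for illustration and are not the results of the study.

```python
import statistics

T_CRIT_DF9 = 2.262  # two-sided 95% critical value of Student's t with 9 degrees of freedom

def confidence_interval(values, t_crit=T_CRIT_DF9):
    """Return (mean, lower, upper) for a t-based confidence interval of the mean."""
    mean = statistics.mean(values)
    half_width = t_crit * statistics.stdev(values) / len(values) ** 0.5
    return mean, mean - half_width, mean + half_width

def intervals_overlap(a, b):
    """True when two (mean, lower, upper) intervals share at least one point."""
    return a[1] <= b[2] and b[1] <= a[2]

# Made-up accuracies from 10 repeated runs, for illustration only.
base_runs = [0.70, 0.71, 0.69, 0.72, 0.70, 0.71, 0.70, 0.69, 0.71, 0.70]
expanded_runs = [0.76, 0.75, 0.77, 0.76, 0.75, 0.76, 0.77, 0.76, 0.75, 0.76]

base_ci = confidence_interval(base_runs)
expanded_ci = confidence_interval(expanded_runs)
print(intervals_overlap(base_ci, expanded_ci))  # False: the two intervals are disjoint
```

Non-overlapping intervals are a conservative indication of a significant difference; overlapping intervals, by contrast, do not by themselves show that a difference is insignificant.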
Figure 6 shows the ROC curves of the base and expanded dictionaries for each test dataset. For both datasets, the predictions of the base dictionary are biased towards one class (positive), whereas the predictions of the expanded dictionary are balanced between the two classes.
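For readers unfamiliar with the AUC indicator, the following minimal sketch computes it as the fraction of (positive, negative) review pairs in which the positive review receives the higher sentiment score. The function name, scores and labels are hypothetical and serve only to illustrate the metric.

```python
def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical review scores and polarity labels (1 = positive, 0 = negative).
scores = [1.34, 0.34, -0.2, -3.02]
labels = [1, 0, 1, 0]
print(auc(scores, labels))
```

Because the AUC depends only on the ranking of the scores, it is unaffected by a dictionary that shifts all scores towards one class, whereas accuracy at a fixed threshold is not.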
Performance of the proposed method before and after the dictionary expansion.
AUC: area under the curve.

ROC curve of each review dataset when the maximum number of neighbour words is 10.
The expanded dictionary was approximately 5% higher in accuracy and AUC than the base dictionary because sentiment scores were assigned to words that did not appear in the base dictionary. For example, a mobile phone review stating, ‘Great phone! It has survived my son for months, and he is a heavy user’, was evaluated as a negative review with a sentiment score of –2.95 by the base dictionary, but the same review was given a positive score of 1.34 by the expanded dictionary. The sentiment score for ‘great’ was not registered in the base dictionary; however, when the dictionary was expanded, ‘great’ was given a positive sentiment score because, in the word2vec vector space, the word was placed near ‘awesome’, ‘perfect’ and ‘wonderful’, which were registered in the base dictionary. Similarly, the review ‘It keeps freezing and isn’t working properly’ was given a positive sentiment score of 0.34 by the base dictionary but a negative score of –3.02 by the expanded dictionary. Although all the words in this sentence appeared in both the base dictionary and the expanded dictionary, the sentiment scores of the words were corrected during the dictionary expansion process. The word ‘freezing’ was influenced by ‘crashing’ in the base dictionary, and its sentiment score changed from positive to negative during the expansion, thereby allowing the polarity of the review to be correctly determined.
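The effect described above can be sketched in a few lines, assuming that a review is scored by summing the dictionary scores of its words and that words missing from the dictionary contribute nothing. The scoring rule, the tokeniser and all word scores below are illustrative assumptions, not the values learned in our experiments.

```python
import re

def score_review(text, dictionary):
    """Sum the sentiment scores of known words; unknown words contribute 0."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(dictionary.get(tok, 0.0) for tok in tokens)

# Hypothetical word scores for illustration only.
base = {"survived": -0.5, "heavy": -0.8, "awesome": 1.5, "perfect": 1.4}
# The expanded dictionary adds a score for a word absent from the base dictionary.
expanded = dict(base, great=1.6)

review = "Great phone! It has survived my son for months, and he is a heavy user"
print(score_review(review, base))      # negative: 'great' is unknown
print(score_review(review, expanded))  # positive: 'great' now contributes
```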
The performance of the proposed method was compared with those of four supervised learning models and SentiWordNet. Table 5 shows the performance results of the models when using the same small training dataset. The models were trained with 1,000 labelled reviews, and the remaining 99,000 reviews were used as the test data. In this experiment, the proposed method achieved scores approximately 10%–20% higher than those of the supervised learning models in terms of accuracy and AUC. With only 1,000 labelled reviews, the supervised learning models do not have enough training data to predict the 99,000 test reviews, because many features of the test data do not appear in the small training dataset. In addition, since SentiWordNet contains only universal sentiment words, it could not be expected to achieve high performance when applied to a specific domain. Table 6 shows the performance results of the models when sufficient training data were used for the supervised learning models. The supervised learning models were trained with 70,000 labelled reviews (70% of the 100,000 reviews) for each review case. In contrast, in the proposed method, the ensemble model was trained with only 1,000 labelled reviews (1% of the 100,000 reviews). The supervised learning models’ scores were approximately 3%–5% higher than those of the proposed method in terms of accuracy and AUC; however, the model performances were not significantly different. Furthermore, we confirmed that training the supervised learning models takes a very long time. In particular, SVM and random forest took more than a few hours to train as the amount of training data increased. In contrast, our proposed method takes only a few minutes to train. Table 7 summarises the performance comparison results.
Performance comparison when using the same small training dataset for our method, supervised learning and SentiWordNet.
AUC: area under the curve; SVM: support vector machine.
Performance comparison when using the small training dataset for our method and the large training dataset for supervised learning.
AUC: area under the curve; SVM: support vector machine.
Summary of the performance comparison results.
AUC: area under the curve.
This comparison test verifies that the proposed method achieves outstanding performance considering the required size of the training dataset. Therefore, the sentiment analysis method is expected to be useful for applications in domains involving big data analysis. In addition, the method is nearly free of the computer memory issues that occur when large text datasets are converted into a large-scale document–term matrix for training supervised learning models. The training time of the ensemble model is also reasonable because the lasso model does not require intense computation, and the training is completed in minutes.
Figures 7 and 8 show the performance differences according to the maximal number of neighbouring words, which is the only user-defined parameter in the proposed method. In both the mobile phone review and restaurant review cases, no significant performance differences were observed as the maximal number of neighbouring words varied. The highest performance was achieved when the maximum number of words was set to 100–300, and the performance decreased slightly when the number was above 300. If the maximum number of words were increased without bound, sentiment values would be propagated to words that are far from the words in the base dictionary; consequently, those words would be likely to act as noise. We increased the amount of labelled training data from 1,000 to 3,000 using a step size of 1,000.
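The role of this parameter can be illustrated with a simplified sketch. It assumes that a new word receives the cosine-similarity-weighted mean score of its nearest base-dictionary words, capped at a maximum number of neighbours; this is a plain fixed-cap variant, not the adaptive instance-based learning model itself, and the two-dimensional vectors and scores are hypothetical stand-ins for word2vec embeddings.

```python
import math

def unit(deg):
    """Toy 2-D embedding: a unit vector at the given angle in degrees."""
    r = math.radians(deg)
    return (math.cos(r), math.sin(r))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def propagate(word, vectors, base, max_neighbours):
    """Similarity-weighted mean score of the nearest base-dictionary words."""
    sims = [(cosine(vectors[word], vectors[w]), w)
            for w in base if w in vectors and w != word]
    # Keep the most similar words, ignoring non-positive similarities.
    top = [(s, w) for s, w in sorted(sims, reverse=True)[:max_neighbours] if s > 0]
    total = sum(s for s, _ in top)
    if total == 0:
        return 0.0
    return sum(s * base[w] for s, w in top) / total

# Hypothetical embeddings and base-dictionary scores.
vectors = {"great": unit(10), "awesome": unit(15), "perfect": unit(5),
           "terrible": unit(80), "broken": unit(85)}
base = {"awesome": 1.5, "perfect": 1.4, "terrible": -1.6, "broken": -1.2}

print(propagate("great", vectors, base, max_neighbours=2))
print(propagate("great", vectors, base, max_neighbours=4))
```

In this toy setting, raising the cap from 2 to 4 pulls distant negative words into the estimate and weakens the positive score of 'great', which mirrors the noise effect described above.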

Performance according to the maximal number of neighbouring words (mobile phone review).

Performance based on the maximal number of neighbouring words (restaurant review).
Table 8 shows the base and expanded dictionary performances according to the amount of labelled training data. As the training data volume increased from 1,000 to 3,000, the performance of the expanded dictionary increased by only approximately 1%. This result indicates that approximately 1,000 labelled reviews are sufficient to achieve good performance.
Performance comparison according to the amount of labelled training data.
AUC: area under the curve.
5. Discussion
We proposed a semisupervised sentiment analysis method. Because the method predicts the sentiment value of a text using a dictionary, it inherits the disadvantages of dictionary-based methods. For instance, it is more likely to misclassify the sentiment value when the number of words in the text is high. In the food review experiment, the average number of words was about 96 for misclassified texts and about 84 for correctly classified texts. In the mobile phone review experiment, the average number of words was about 69 for misclassified texts and about 17 for correctly classified texts. Intuitively, this can be attributed to the fact that the reviews are the film critics and relatively short. The more words a text contains, the more difficult it is to predict its sentiment value. In particular, it is difficult to predict the sentiment value of a document using only the sentiment values of its words without understanding the context.
Furthermore, a larger dictionary does not necessarily yield higher classification accuracy. In Table 8, as the amount of training data increases, the number of words in the dictionary also increases. However, we confirmed that the accuracy does not increase in proportion to the amount of data, and that if we increase the number of words artificially, the base dictionary grows while the accuracy decreases.
Thus, future studies are required to control the number of words in the dictionary depending on the amount of training data and the length of the text to be predicted.
6. Conclusion
Social media sites are Internet-based platforms that offer a suitable environment for analysing macroscopic public sentiments. Thus, researchers are interested in developing sentiment analysis methods that assess positive and negative public opinions regarding services, products and social issues through individual reviews.
This study proposed a semisupervised sentiment analysis method. The method has two main stages. First, in the supervised learning stage, a lasso bagging ensemble is trained using a small amount of textual data with manually assigned polarity labels, from which a base sentiment dictionary containing the words in the training data is created. Second, in the unsupervised learning stage, adaptive instance-based learning assigns sentiment scores to the words from unlabelled data that were not initially included in the base dictionary (i.e. words not used to train the lasso bagging ensemble) to complete the dictionary expansion. Tests using mobile phone review and restaurant review datasets were conducted, and no significant performance differences were observed between the proposed method and the four supervised learning models trained with a large amount of labelled data.
This study makes the following contributions. First, it provides an efficient way to build a sentiment dictionary for individual domains. In actual domains, most of the collected texts do not have polarity labels; hence, a dictionary that fits a specific domain is difficult to build. The proposed method demonstrated favourable performance despite using only 1,000 labelled data items for each domain. Second, the proposed method simplifies the preprocessing stage by skipping POS tagging, and it requires only a small amount of training data to build a sentiment dictionary. Therefore, using the proposed approach, the process for building the dictionary is fast (in both cases, the dictionaries were completed in a few minutes), and the memory overflow issue that may occur during machine learning with large amounts of training data was not observed. Third, most sentiment analysis methods do not indicate a minimum level of dictionary performance. In the proposed method, however, whenever a base dictionary is built, the ensemble learning method validates the performance of the dictionary before it is expanded. Thus, a minimum reliability score can be obtained for the expanded dictionary. Experiments using review datasets verified that the expanded dictionary achieves a performance approximately 5% higher than that of the base dictionary.
The experiments show that the proposed method performs comparably to the supervised learning models. However, to estimate the performance of the method more reliably, additional case studies are necessary in other domains, such as political and social issues, and the method should be tested with data in languages other than English. Furthermore, word2vec requires a large amount of training data to achieve favourable performance, and its high performance cannot be guaranteed when a large amount of data is not available.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship and/or publication of this article: This work was supported by the Ministry of Education of the Republic of Korea and the National Research Foundation of Korea (No. NRF-2018S1A3A2075114).
