Abstract
Sentiment analysis of messages posted on micro-blogs is helpful in determining the current usefulness and acceptability of target products or services. It is the basis for finding users with similar attitudes. In this paper, we propose a new sentiment similarity technique to analyse Chinese micro-blog accounts. However, the Chinese text features have not been well studied. Therefore, we first chose four types of feature sets and selected principal features by combining information gain and support vector machine techniques. Next, we compared the four types of features to determine which type contributed more than the others. Here we used three classification techniques: decision tree, support vector machines and naive Bayes. Finally, we used the Karhunen–Loève transform and average precision between positive and negative features to measure sentiment similarity. Experimental evaluations demonstrated that this new method is efficient and performs better than the original average distance for Chinese micro-blogs.
1. Introduction
More and more users are sharing real-time news and information sources in micro-blog communities. People post their thoughts and opinions on news and information in micro-blogs every day. ‘What other people think’ has always been an important piece of information for most of us during the decision-making process [1]. Social influences play a key role when people are making a purchase decision. Social networks are new forms of self-representation and communication, and are subject to social behaviour that is different from that in the real world [2]. The recent developments in Web 2.0 have provided more opportunities to investigate web-based social networks. Some examples of such social networks are Facebook, MySpace, LinkedIn, Twitter, MyBlogLog, Flickr, Youtube, Join me, Delicious, RenRen and Sina Weibo. As the new social media, micro-blogs are becoming a popular communication platform. They allow users to post short messages to describe, update and share their current status and opinion [3]. Therefore, the micro-blog platform is becoming an ever-larger social network.
Seeking and extracting useful information from Chinese micro-blog websites also poses significant challenges. Chinese micro-blogs have special posting content and characteristics. People can not only post words, but also music and videos. While some networks like Twitter and Facebook have been well documented, the popular Chinese micro-blog social network, Sina Weibo, has not been well studied [4]. Sina Weibo is the biggest Chinese micro-blog platform. It is a new and promising platform that has almost 300 million users. According to the statistics of Hitwise, the utilization rate and user loyalty of Sina Weibo had surpassed those of Twitter by April 2011. Moreover, there is a vast difference in the content shared in Chinese micro-blogs when compared with a global social network such as Twitter. Most previous works have focused on Twitter, and little emphasis has been given to Chinese social media. That is why we focus on Chinese micro-blogs in this work.
Detecting the current attitude of users can provide information on online services and products. Furthermore, it has other potential applications. A good number of companies have considered opinion mining and sentiment analysis as part of their mission. Sentiment analysis has been used as a sub-component technology of recommendation systems to recommend items that receive much positive text feedback [5, 6]. Sentiment analysis attempts to identify and analyse opinions and emotions. Using sentiment analysis on social networks, users’ words can be classified into positive or negative attitudes on certain topics. A fundamental technology in many current opinion mining and sentiment analysis applications is classification. Diverse methods have been studied for improving sentiment classification performance [7, 8]. The approaches that have been adopted in previous sentiment classification studies can be classified into two categories: machine learning techniques and semantic orientation techniques [9]. The machine learning approach tends to be more accurate, but the semantic orientation approach has better generality [10, 11].
One important issue of sentiment analysis is to define the list of feature sets for opinion classification. Different feature sets are used in different social media research. Recently, several researchers summarized text features into four types: lexical features, syntactic features, structural features and content-specific features [12, 13]. Since different languages have their own special characteristics, the feature set may need to be adapted. According to our experiment platform, we adjusted those features and formed new feature sets that also contain Chinese part-of-speech tag features and micro-blog features.
In our research, we collected data from the biggest Chinese micro-blog, Sina Weibo. In China, there are currently 40 million people suffering from diabetes. Patients, doctors and hospitals open accounts in Sina Weibo to express their ideas and share diabetes information. They make up a large social network through micro-blogging. Therefore, we collected network data on the topic of diabetes. A corpus of 884 text posts that contained positive or negative opinions was extracted from 50 diabetes accounts. We wanted to obtain the attitudes of people and their friends to produce a new network. In this new network, people will have the same attitudes on the same topics. Finding people with the same attitude in the network requires calculation of the degree of sentiment similarity. In this paper, to calculate sentiment similarity, we used average precision based on the Karhunen–Loève transform, which is widely used to analyse data in many fields [14].
The remainder of this paper is organized as follows. A literature review is presented in Section 2. Techniques and experimental processes for feature selection are described in Section 3. Section 4 presents the results of sentiment similarity for Chinese micro-blog texts. Conclusions and future work are discussed in the last section.
2. Literature review
2.1. Sentiment classification
Text mining tries to solve the crisis of information overload by combining techniques from data mining, machine learning, natural language processing, information retrieval and knowledge management. An important part of text information mining is to find out what other people think and what their opinions and emotions are. Therefore, there has been an explosion of interest in opinion mining or sentiment analysis in the literature. David and Pinch published a paper in which the term ‘opinion mining’ appeared, to explain the popularity of the term within web communities in 2003 [15]. For a recent survey of opinion mining, see the paper of Pang and Lee, who summarize the techniques and approaches of this field [1].
One important issue of sentiment analysis is to classify different sentiments. Sentiment classification studies attempt to determine whether a text is objective or subjective, and whether a subjective text contains positive or negative sentiments [7]. The common methods try to classify sentiment into positive and negative [8, 11, 12, 16–20]. This is called binary sentiment classification, that is, classifying each document with two labels. Moreover, Koppel and Schler find that neutral examples are crucial in learning polarity and permit accurate classification [21]. When researchers want to gain attitudes from social media, they consider two polarity sentiments and refine data by removing noise [22, 23]. Some studies have attempted to classify emotions into multiple types, including love, happiness, sadness, anger and horror [24]. Therefore, polarity of sentiment is changed into multiclass problems. Generally, the category set consists of strong positive, positive, neutral, negative and strong negative [25].
2.2. Sentiment similarity
A common approach is to use a sentiment dictionary to find sentiment words and sentences in sentiment searching. In sentiment analysis, several sentiment dictionaries are generally used to assign values to words. SentiWordNet gives each synset of WordNet three sentiment values. SenticNet was developed by Cambria et al. and its concepts are labelled with sentiment values between −1 and +1. WordNet-Affect marks sentiment labels at some synsets in WordNet [26]. Ku et al. created a Chinese sentiment dictionary, the National Taiwan University Sentiment Dictionary (NTUSD), with 11,088 Chinese words. HowNet-VSA is a sentiment dictionary that can be used with both Chinese and English contents [1]. Also, there are two English dictionaries with no labels, called the WeFeelFine dictionary and the NELL dictionary [26].
Sentiment similarity has not received enough attention to date. Turney and Littman proposed sentiment similarity of word pairs to calculate semantic similarity [27]. Hassan and Radev created a graph-based method using WordNet similarity measures [28]. Marneffe et al. used sentiment orientation of the adjectives to calculate the probability of rating [29]. Those works focused on semantic similarity measures that used latent semantic analysis, point-wise mutual information and WordNet-based similarity.
In several papers, the researchers used sentiment similarity instead of classes to analyse users’ opinions [7, 14, 30]. Sentiment similarity was used to identify the authors based on their writing styles [12]. According to Abbasi et al., the new stylometric similarity detection techniques can be applied to assess the degree of similarity between individuals based on writing style. The results were better than those of other techniques such as principal component analysis, N-gram models, Markov models and cross entropy [14]. They used Karhunen–Loève transforms to assess the degree of similarity and a pattern disruption mechanism to determine dissimilarity. Owing to its better performance, we use this technique to calculate sentiment similarity in our experiment.
2.3. Sentiment analysis features
The sentiment analysis features for social media proposed in previous works can be divided into four types as follows.
Lexical features, including character-based and word-based features, such as total number of digit characters, space characters, total different words, word-length frequency and frequency of once and twice occurring words [7,12]. When it comes to Chinese text, total number of non-Chinese characters and average sentence length in terms of words are added in this paper. In total, we chose seven more important lexical features for Chinese micro-blogs.
Syntactic features, including punctuation, N-gram and part-of-speech. Punctuation frequency is combined into feature sets and can improve the performance of user identification [31]. Some kinds of punctuation are very important in sentiment analysis, such as the exclamation and question marks. Unigrams, bigrams and trigrams are commonly used as features in previous works. Some N-grams worked better than others for product-review and movie-review polarity classification [8, 32]. Part-of-speech (POS) is the feature set that counts the numbers of nouns, verbs, adjectives and any other parts of speech. We count 16 types of Chinese POS tags in this paper.
Content-specific features, including content-specific words, function words and structural features. Content-specific words are used to discriminate posted topics as users may use different words on specific topics [13]. For this reason, special words can provide a clue about the user. For example, a diabetes doctor may use ‘type I’, ‘type II’ and ‘blood sugar’ as his keywords. Different function words were shown to have good discrimination in many studies [12, 33]. However, there is no generally accepted set of function words because they have different discriminating power in different fields. Structural features express the writing layout. DeVel introduced several structural features for email [34]. Zheng revised some features and formed 14 structural features for online messages [12]. According to the traits of micro-blogs, we adopted eight structural features, such as number of other users, number of images used and number of URLs used.
Micro-blog features, including emoticons and abbreviations. Emoticons are one of the most powerful signals to differentiate polar positive and negative messages [35]. The English emoticon data set was created by Go et al. for a project at Stanford University by collecting positive and negative emoticons of tweets [23]. Pak et al. queried Twitter for two types of emotions: happy emoticons such as ‘:-)’, ‘:)’, ‘=)’ and ‘:D’ and sad emoticons such as ‘:-(’, ‘:(’, ‘=(’ and ‘;(’ [36]. An abbreviation is a shortened form of a word or phrase for quick writing. For example, the meaning of ‘OMG’ is ‘Oh My God’. It can be treated as an individual token. English abbreviations are easily found in the Internet Lingo Dictionary [37]. As Chinese text has no such abbreviations, we do not use this feature in Chinese micro-blogs.
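As an illustration of how a few of the feature types above could be counted for a single post, the following Python sketch may help. It is not the extraction procedure used in this paper: the function name `extract_features`, the feature names and the regular expressions are all hypothetical, and the assumption that emoticons appear as short bracketed codes (for example `[哈哈]`) is ours.

```python
import re

# Illustrative patterns only; these are assumptions, not the paper's definitions.
CJK = re.compile(r"[\u4e00-\u9fff]")            # common Chinese characters
URL = re.compile(r"https?://\S+")               # URLs embedded in the post
MENTION = re.compile(r"@[\w\u4e00-\u9fff-]+")   # '@user' style mentions
EMOTICON = re.compile(r"\[[^\[\]\s]{1,8}\]")    # bracketed emoticon codes (assumed)


def extract_features(text: str) -> dict:
    """Count a few lexical, syntactic, structural and micro-blog features."""
    urls = URL.findall(text)
    body = URL.sub(" ", text)  # strip URLs so their characters are not counted
    non_space = [ch for ch in body if not ch.isspace()]
    return {
        # lexical features
        "digits": sum(ch.isdigit() for ch in body),
        "spaces": sum(ch.isspace() for ch in text),
        "non_chinese_chars": sum(1 for ch in non_space if not CJK.match(ch)),
        # syntactic (punctuation) features, half-width and full-width forms
        "exclamations": body.count("!") + body.count("！"),
        "questions": body.count("?") + body.count("？"),
        # structural features
        "mentions": len(MENTION.findall(body)),
        "urls": len(urls),
        # micro-blog features
        "emoticons": len(EMOTICON.findall(body)),
    }
```

Each post then becomes one row of counts, which is the form assumed by the feature selection sketches later in this paper.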
3. Techniques and experiments
3.1. Data
We collected data from one Chinese micro-blog, Sina Weibo, which has almost 300 million users. It is the first and the biggest micro-blog website in China. We obtained the data through the Sina API. People share information and opinions on this website constantly. Healthcare is a promising and useful area in which to provide convenient services via social media. Increasing numbers of doctors and hospitals are opening micro-blog accounts to help patients. Among those patients, a relative majority are suffering from diabetes. Therefore, we chose diabetes as our search topic and selected 50 users as our research seeds, including diabetes hospital accounts, prominent diabetes doctors’ validated accounts and diabetes magazines’ validated accounts. They all have many followers who want to obtain diabetes information and treatments. Account information for the 50 users is listed in Table 1. We collected 884 texts about diabetes among their messages from the date that Sina Weibo was released until 30 April 2012. To simplify the sentiment analysis, the number of positive opinions was equal to the number of negative opinions in our dataset.
Table 1. Collected diabetes micro-blogging information.
3.2. Feature selection
3.2.1. Feature set
Syntactic, semantic, link-based and content-specific features are four feature categories that have been used in previous sentiment analysis. For example, syntactic, lexical and content-specific features have been successfully applied to English and Chinese [12, 13, 30]. Zheng et al. [12] created 270 writeprint features for English online messages and 117 for Chinese. However, they did not use POS tags as features, although many researchers have used POS tags in English sentiment analysis. In this paper, we wanted to find out whether POS tags contribute to the analysis of Chinese text. Two POS tagging and word segmentation tools, AntConc and MyTxtSegTag (http://www.corpus4u.org/), were used to process Chinese texts in our experiments. We also identified a set of N-grams, including unigrams, bigrams and trigrams, using AntConc. Emoticons are commonly used in micro-blogs to express emotions. Related works show that emoticons are one of the most powerful signals to differentiate positive from negative messages [35, 37]. We selected three positive and three negative emoticons as features in our dataset. Finally, we used 126 features as the original feature set for the classification experiments. All of the features are shown in Table 2.
Table 2. Adopted feature set in the classification experiments.
After we counted the feature values, several features were equal to zero for all texts. We deleted those features, which made no contribution to classification, and 100 features were left. Among these, we picked the features that were relevant to Sina Weibo text. Some features may be redundant, which reduces the prediction accuracy. Therefore, feature selection should be undertaken to obtain a subset of features that is relevant to our target. In this work, we used a new method to perform feature selection. Because both information gain and support vector machines have been shown to outperform other methods in various cases, we first applied each of the two methods to select features and then blended the two selected feature sets to determine an optimum feature set. We explain this method in the following sections.
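The pruning of all-zero features described above can be sketched in a few lines of Python. This is a minimal illustration under the assumption that each text is a row of numeric feature counts; the helper name `drop_zero_features` and the data layout are hypothetical.

```python
def drop_zero_features(rows, names):
    """Remove feature columns whose value is zero in every text.

    rows  : list of feature vectors, one list of numbers per text
    names : list of feature names, aligned with the columns of `rows`
    """
    # Keep a column only if at least one text has a non-zero value for it.
    keep = [j for j in range(len(names)) if any(row[j] != 0 for row in rows)]
    kept_names = [names[j] for j in keep]
    kept_rows = [[row[j] for j in keep] for row in rows]
    return kept_rows, kept_names
```

A column that is constant at zero cannot separate positive from negative texts, so removing it does not change what any classifier can learn from the data.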
3.2.2. Information gain
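Information gain performs better on feature selection in text categorization than other methods, such as document frequency and mutual information [30]. Information gain is a common approach to selecting features [14]. In previous research, features were selected with information gain >0.0025. In our experiment, we assumed the class to be $C_i$ ($C_1$ = positive and $C_2$ = negative), and the probability of the positive class to be equal to that of the negative class. The input feature is $F_j$ ($F_1, F_2, \ldots, F_m$). Then a simplified version of information gain for our issue is given below:
$$IG(C, F_j) = H(C) - H(C \mid F_j) = 1 - H(C \mid F_j),$$
where $H(C) = -\sum_{i=1}^{2} p(C_i) \log_2 p(C_i) = 1$ is the entropy of the two equally likely classes, and $H(C \mid F_j) = -\sum_{v} p(F_j = v) \sum_{i=1}^{2} p(C_i \mid F_j = v) \log_2 p(C_i \mid F_j = v)$ is the conditional entropy of the class given the value $v$ of feature $F_j$.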
We calculated the information gain of each feature and sorted the features using the WEKA data mining tool. Then, the 11 features whose information gain was >0.0025 were selected, as shown in Table 3.
Table 3. Information gain ranked attributes.
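To make the ranking step concrete, the sketch below computes information gain as class entropy minus conditional entropy and keeps the features above the 0.0025 threshold. It is not the WEKA implementation used in our experiments: it assumes that feature values are already discrete, and the function names and toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """IG(C, F) = H(C) - H(C | F) for one discrete feature F."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Conditional entropy: entropy within each feature value, weighted by frequency.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def select_by_ig(columns, labels, threshold=0.0025):
    """Return (name, score) pairs with information gain above the threshold, best first."""
    scores = {name: information_gain(vals, labels) for name, vals in columns.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, s) for name, s in ranked if s > threshold]
```

With equally many positive and negative labels, `entropy(labels)` evaluates to 1, so the score of each feature is one minus its conditional entropy.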
3.2.3. Support vector machine
Because the support vector machine (SVM) is a very strong classifier and outperforms other classifiers in many fields, we used this method in our experiment to achieve a better feature set. Support vectors are the important vectors that make the distinction between two classes. For two-class classification problems, the basic SVM concepts are briefly described below [38].
Given a training set of instance–label pairs $(x_i, y_i)$, $i = 1, 2, \ldots, m$, where $x_i \in \mathbb{R}^n$ and $y_i \in \{+1, -1\}$.
The support vector technique solves an optimization problem to minimize the training errors.
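For reference, the soft-margin formulation that is commonly solved by SVM implementations such as LIBSVM can be written as follows. This is the standard textbook form, given here as an assumption about the optimization problem meant above; the symbols $w$, $b$, $\xi_i$ and the mapping $\phi$ are the usual ones, with labels $y_i \in \{+1, -1\}$, and are not taken from this paper.

```latex
\min_{w,\, b,\, \xi} \quad \frac{1}{2} w^{T} w + C \sum_{i=1}^{m} \xi_i
\qquad \text{subject to} \quad
y_i \left( w^{T} \phi(x_i) + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, m,

K(x_i, x_j) = \phi(x_i)^{T} \phi(x_j) = \exp\left( -\gamma \, \lVert x_i - x_j \rVert^{2} \right), \quad \gamma > 0 .
```

Here $C$ penalizes the slack variables $\xi_i$ (the training errors), and $\gamma$ controls the width of the radial basis function kernel, which is why these two values are the ones tuned later in this section.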
SVM can also be applied to select features. It gives a set of support vectors that discriminate between the classes with minimum training errors, as shown in Figure 1 [39]. We ranked the support vectors by their importance in the SVM, so that important features were obtained in a new way. Using this method in our experiment, we obtained the lists of support vectors that constitute the SVM-selected attributes:

Figure 1. Example for selecting better support vectors by applying SVM on testing data.
Both information gain and SVM perform well in many cases, but they have different strengths that can influence classification accuracy, and they may select different feature sets. Therefore, we blended the two methods to select Chinese text features: starting from the features obtained by information gain, we added further features chosen by the SVM method.
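One simple way to realize such a blend is sketched below: keep every feature selected by information gain and then add SVM-ranked features, in rank order, until a target size is reached. This is only one plausible reading of the blending step; the exact rule used to build the blended sets is not spelled out here, and the function name `blend_features` and its arguments are hypothetical.

```python
def blend_features(ig_selected, svm_ranked, target_size):
    """Start from the IG-selected features, then add SVM-ranked ones in order.

    ig_selected : feature names chosen by information gain
    svm_ranked  : feature names ordered by SVM importance, best first
    target_size : desired size of the blended feature set
    """
    blended = list(ig_selected)
    seen = set(blended)
    for name in svm_ranked:
        if len(blended) >= target_size:
            break
        if name not in seen:  # skip features both methods already agree on
            blended.append(name)
            seen.add(name)
    return blended
```

Varying `target_size` yields a family of nested feature sets of increasing size, which can then be compared with the same classifiers.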
When using SVM, choosing the optimal input feature subset and setting the best kernel parameters are crucial. These two problems influence the SVM classification accuracy. The kernel parameters, C and γ, can be obtained using LIBSVM tools (http://www.csie.ntu.edu.tw/~cjlin/libsvm/). In our experiments, we chose 10-fold cross-validation. We used IG-11, SVM-18, IGSVM-21, IGSVM-48, IGSVM-67, IGSVM-86 and Original-100 to denote the selection method and the number of features of each feature set. To compare different classification methods, we used the naive Bayes classifier as a control group in our experiments. Precision, recall and F-mean are common metrics to measure classification accuracy.
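For readers unfamiliar with these metrics, the sketch below computes precision, recall and their harmonic mean (the usual F-measure, which we take to be what is called F-mean here) for one class. The function name and label strings are hypothetical, and the code does not reproduce the per-class averaging that a tool such as WEKA may apply.

```python
def precision_recall_f(y_true, y_pred, positive="pos"):
    """Precision, recall and F-measure for the class named `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

In a 10-fold cross-validation, these values would be computed on the held-out fold of each round and then averaged.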
We used those three metrics to discover whether the features selected by our new method outperform other features. Tables 4–6 show the classification accuracies when using J48, SVM and naive Bayes, respectively. When using LIBSVM tools, the optimum parameters were: −G 1.220703125E−4 −C 2048.0, −G 3.0517578125E−5 −C 8192.0, −G 3.0517578125E−5 −C 8192.0, −G 3.0517578125E−5 −C 32768.0, −G 1.220703125E−4 −C 2048, −G 3.05E−5 −C 8192.0 and −G 1.22E−4 −C 1.0 for data IG-11, SVM-18, IGSVM-21, IGSVM-48, IGSVM-67, IGSVM-86 and Original-100, respectively.
Table 4. Classification accuracy when using J48 as classifier.
Table 5. Classification accuracy when using SVM as classifier.
Table 6. Classification accuracy when using naive Bayes as classifier.
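Several of the reported parameter values are exact powers of two, which is what an exponential grid search produces. The sketch below builds such a grid, assuming the exponent ranges that the LIBSVM grid tool uses by default (C from 2^−5 to 2^15 and γ from 2^−15 to 2^3, in steps of 2 in the exponent). The function name is hypothetical, and we do not claim that this was the exact search configuration used.

```python
def libsvm_style_grid(c_exponents=range(-5, 16, 2), g_exponents=range(-15, 4, 2)):
    """Candidate (C, gamma) pairs on an exponential base-2 grid."""
    return [(2.0 ** c, 2.0 ** g) for c in c_exponents for g in g_exponents]


grid = libsvm_style_grid()

# gamma = 2^-15 = 3.0517578125E-5 with C = 2^15 = 32768 lies on this grid,
# as does gamma = 2^-13 = 1.220703125E-4 with C = 2^11 = 2048.
on_grid = (32768.0, 2.0 ** -15) in grid and (2048.0, 2.0 ** -13) in grid
```

Each candidate pair would be scored by cross-validation, and the pair with the best accuracy kept for the final classifier.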
When using J48 to classify the seven data sets, the best result was achieved on the IGSVM-21 feature set. Because we used the LIBSVM tool with optimum parameters to classify those data sets, the F-means of the first four data sets were difficult to distinguish, and the last three results were slightly lower. The SVM classifier outperformed the other two classifiers, and the F-mean of IGSVM-21 was among the top four results. With the naive Bayes classifier, SVM-18 and IGSVM-21 performed much better than the other data sets. Although the classification accuracy of IG-11 and IGSVM-48 was better when using the SVM classifier, their results were worse than those of IGSVM-21 when using the J48 and naive Bayes classifiers. Therefore, IGSVM-21 performed best overall across the three classifiers, and we chose it as the final feature set.
3.3. Feature comparison
Following feature selection, we determined which types of feature made a greater contribution to classification. As can be seen in Table 2, the features fall into four types, denoted F1–F4. The first feature set (F1) contains lexical features only. Syntactic features are added to the first feature set to form the second feature set (F1 + F2). The third feature set (F1 + F2 + F3) contains both content-specific features and the second feature set. The last feature set contains all features (F1 + F2 + F3 + F4). We examined the effect of adding new features to existing ones and determined which type of feature made a greater contribution to classification. We adopted J48, SVM and naive Bayes as the classifiers. Precision, recall, F-mean and receiver operating characteristic (ROC) area are commonly used as metrics. In addition, the growth of the F-mean is calculated to capture how the F-mean changes when new features are added to existing ones.
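The growth calculation can be illustrated as the difference in F-mean between each nested feature set and the one before it. The sketch assumes that growth means an absolute difference; the function name and the scores in the usage note are hypothetical placeholders, not values from Table 7.

```python
def f_growth(f_scores):
    """Absolute F-measure change between consecutive nested feature sets.

    f_scores : F-measures in the order F1, F1+F2, F1+F2+F3, F1+F2+F3+F4
    """
    return [round(later - earlier, 4) for earlier, later in zip(f_scores, f_scores[1:])]
```

For example, with placeholder scores `[0.70, 0.72, 0.80, 0.80]`, the growth values are 0.02, 0.08 and 0.0, which would indicate that the third feature type contributes most and the fourth adds nothing.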
From Table 7, we determined that SVM outperformed the J48 and naive Bayes classifiers. This means that SVM was the best classifier in our experiment for classifying Chinese social media sentiments. When using J48 as the classifier, the F-mean kept increasing as more types of features were added, and the best accuracy was achieved when all features were included. That is to say, all four types of features were useful. The values of F-mean growth were 0.025, 0.12 and 0.05, respectively, when adding F2, F3 and F4 to the existing features. The largest growth came from adding F3 to the feature set. Therefore, content-specific features such as function words and content-specific keywords are much more important for classifying sentiments than syntactic features and emoticons.
Table 7. Accuracy for different feature sets and three different classifiers.
Considering the SVM classifier, the accuracy was best when adding syntactic features to lexical features, which gave an 8% improvement in accuracy. The results then plateaued, even with the addition of content-specific features and emoticon features. Therefore, syntactic features provide an accuracy increase over lexical features alone, whereas the remaining feature types made no further difference under this classifier.
The results of the naive Bayes classifier are similar to those of the J48 classifier. The F-mean results increase with the addition of new features into old feature sets until the last feature set. Content-specific features have higher accuracy growth (4.2%) than syntactic features (0.9%). The result of all features is equal to the result of the third feature set, which means that emoticon features have no effect in classifying the sentiment when using naive Bayes classifier.
4. Results of KLT
We used the Karhunen–Loève transform (KLT) and average distances of positive and negative texts to detect sentiment similarity. The KLT, also known as the Hotelling transform, is widely used in data analysis in many fields. It is a minimum-distortion transform under the mean square error criterion. Because of this property, the KLT is known as the optimal transform. Abbasi et al. have used this technique to detect stylometric similarity in electronic markets [14]. The KLT has advantages over other techniques such as principal component analysis and Markov models. Therefore, we used the KLT to assess the degree of similarity between each pair of diabetes accounts from micro-blogs. The procedure contains three steps:
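Let $\phi_k$ be the eigenvector corresponding to the $k$th eigenvalue $\lambda_k$ of the covariance matrix $\Sigma_x$ of the $N$-dimensional feature vector $x$:
$$\Sigma_x \phi_k = \lambda_k \phi_k, \quad k = 1, \ldots, N.$$
It is expressed in matrix form:
$$\begin{bmatrix} \sigma_{11} & \cdots & \sigma_{1N} \\ \vdots & \ddots & \vdots \\ \sigma_{N1} & \cdots & \sigma_{NN} \end{bmatrix} \phi_k = \lambda_k \phi_k.$$
As the covariance matrix $\Sigma_x$ is symmetric, its eigenvectors can be chosen to be orthonormal, $\phi_i^T \phi_j = \delta_{ij}$, so that they form an orthogonal matrix $\Phi = [\phi_1, \ldots, \phi_N]$ satisfying $\Phi^T \Phi = I$.
The $N$ eigen-equations can be combined as
$$\Sigma_x \Phi = \Phi \Lambda,$$
which can also be expressed in matrix form,
$$\Sigma_x [\phi_1, \ldots, \phi_N] = [\phi_1, \ldots, \phi_N] \begin{bmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_N \end{bmatrix},$$
where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_N)$ is the diagonal matrix of eigenvalues.
Now, given a vector $x$, its KLT is defined as
$$y = \Phi^T x,$$
where the $i$th component $y_i = \phi_i^T x$ is the projection of $x$ onto the eigenvector $\phi_i$.
We see that by this transform, the vector $x$ is expressed in the $N$-dimensional space spanned by the eigenvectors: $x = \Phi y = \sum_{i=1}^{N} y_i \phi_i$.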
The overall similarity between two users is the sum of the average distance between two instances of

Average distances of KLT and non-KLT.
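The transform and the distance comparison can be sketched in plain Python as follows. This is a minimal illustration rather than our implementation: it estimates the covariance matrix from the feature vectors, finds the leading eigenvectors by power iteration with deflation (a simple stand-in for a library eigen-solver), projects vectors onto them and measures Euclidean distance. All function names, the iteration count and the random seed are assumptions.

```python
import math
import random


def covariance(rows):
    """Sample covariance matrix of feature vectors (one vector per row)."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    return [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
            for a in range(d)]


def leading_eigenpairs(cov, k, iters=500, seed=0):
    """Top-k (eigenvalue, eigenvector) pairs of a symmetric matrix."""
    d = len(cov)
    a = [row[:] for row in cov]          # working copy that gets deflated
    rng = random.Random(seed)
    pairs = []
    for _pair in range(k):
        v = [rng.random() + 0.1 for _ in range(d)]
        for _step in range(iters):       # power iteration
            w = [sum(a[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        unit = math.sqrt(sum(x * x for x in v))
        v = [x / unit for x in v]        # unit length even if the loop stopped early
        lam = sum(v[i] * a[i][j] * v[j] for i in range(d) for j in range(d))
        pairs.append((lam, v))
        for i in range(d):               # deflation: remove the component just found
            for j in range(d):
                a[i][j] -= lam * v[i] * v[j]
    return pairs


def klt(x, pairs):
    """Project x onto each retained eigenvector: y_i = phi_i^T x."""
    return [sum(p * xi for p, xi in zip(phi, x)) for _lam, phi in pairs]


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((s - t) ** 2 for s, t in zip(u, v)))
```

Distances between two accounts can then be computed either on the raw feature vectors (non-KLT) or on their projections (KLT), once for their positive texts and once for their negative texts.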
In our experiment, we sorted the distances of all users with positive and negative texts, obtaining two sorted lists of positive and negative similarities. As we collected 50 user accounts, there were 100 instances in the similarity matrices. The gaps between the sorted lists of positive similarity and negative similarity were used as the average precision to evaluate sentiment similarity. Different window lengths were chosen to compare the average precision between non-KLT data and KLT data, as shown in Table 8. The sum and average precision of the KLT data were better than those of the non-KLT data. The precision growth from non-KLT to KLT was 7.5, 6.9 and 0.8%, so the advantage of the KLT was larger when the window length was shorter. The gaps between window lengths were 8.3% (from KLT 20 to KLT 30) and 14.2% (from KLT 30 to KLT 40). Therefore, the precision of this technique improved as the window length became longer.
Table 8. Sentiment similarity comparison results.
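One possible way to turn two sorted lists and a window length into a precision score is sketched below. It counts an item as a hit when its positions in the positive and negative lists differ by no more than the window length. This is an assumed reading of the evaluation described above, not a confirmed definition, and the function name `windowed_agreement` is hypothetical.

```python
def windowed_agreement(pos_ranked, neg_ranked, window):
    """Share of items whose ranks in the two lists differ by at most `window`.

    pos_ranked : items sorted by distance computed on positive texts
    neg_ranked : the same items sorted by distance computed on negative texts
    window     : largest rank gap that still counts as agreement
    """
    neg_rank = {item: r for r, item in enumerate(neg_ranked)}
    hits = sum(
        1
        for r, item in enumerate(pos_ranked)
        if item in neg_rank and abs(r - neg_rank[item]) <= window
    )
    return hits / len(pos_ranked)
```

Under this reading the score can only stay the same or rise as the window grows, which agrees with the trend reported for the longer window lengths.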
5. Conclusions and future work
In this study, we proposed a sentiment similarity technique to analyse similar Chinese micro-blogging accounts. We chose 126 features from Sina Weibo on the topic of diabetes and we obtained several important conclusions.
First of all, the feature set was refined by blending information gain and SVM technique to select the optimum set of features. We used J48, SVM and naive Bayes as three classifiers to evaluate the results of different feature sets. The optimum feature set selected by using this new technique performed better than others. We also demonstrated that SVM (using LIBSVM tool) outperformed J48 and naive Bayes technique in our experiments.
Second, to find out which type of feature is more important, we performed a feature comparison based on the optimum feature set. We found that content-specific features were the most important type, while syntactic features were more important than the other two types (lexical features and emoticon features) for Sina Weibo. Emoticon features made almost no contribution.
Third, we used KLT and average distances of positive and negative texts to detect sentiment similarity. Experiments illustrated that KLT distances were shorter than non-KLT ones. Average precisions of KLT similarities were all better than the non-KLT data. Meanwhile, this technique is much better when the window length is longer.
In terms of future work, the next step is to attempt to use our technique to recommend patients with similar sentiment to accounts in micro-blogs. Those with similar attitudes to the same topic, such as diabetes medicine and techniques, will want to know each other. We will study how to recommend accounts in the diabetes network. In a sentiment similarity network, finding out who shares similar attitudes on the same topic will be meaningful and attractive for patients and doctors.
Acknowledgements
The authors acknowledge the support by the China Scholarships Council (file no. 2011612202). This work is also supported by the National Natural Science Foundation of China (grant no. 71171068).
