CDNB: CAVIAR-Dragonfly Optimization with Naive Bayes for the Sentiment and Affect Analysis in Social Media

Abstract

With the advent of the new information technologies, the growth of online reviews regarding an organization or a company or any other sector has been playing a vital role in improving the sector plans and decisions. The vast significance of the online reviews that determine the sentiment polarity is the hectic challenge of the current scenario. Sentiment classification is a process of classifying the text according to the sentimental polarities of opinions, which has positive or negative. Thus, this article concentrates on presenting a novel method, named CAVIAR-Dragonfly optimization with Extended Naive Bayes (CDNB), for performing sentiment classification and affective state classification. At first, the BITS review from Twitter is subjected to preprocessing, which includes stop word removal and stemming. Then, the next step is the feature extraction, in which all the reviews are converted to a feature vector. After that, all the individual feature vectors are collected to form the feature matrix, which is applied to the proposed C-Dragonfly optimization algorithm, to perform the sentiment classification and affective state classification. The performance of the proposed method is analyzed using the Twitter Sentiment Analysis Training Corpus Data Set based on true positive rate (TPR), true negative rate (TNR), and accuracy. From the analysis, it can be shown that the proposed method yields the maximum TPR, TNR, and accuracy of 89.0934%, 72.3064%, and 79.3591% for sentiment classification and 84.2122%, 66.2187%, and 76.6249% for the sentiment affective state classification.

Introduction

The fast-growing developments in the field of information technology have increased the number of the web users and the Internet users' accounts to 3 billion in the world. The online users comment their views and express their experience on certain products and services via opinion posting in the social networks on various websites, which are very important to take effective decisions by the government, private, or by the consumers.¹ The public reviews made on the social networks enable the government in analyzing better decisions on a social issue and in addition, the social reviews help the company to forecast the market demand and weakness of a product. Moreover, it ensures the consumers to undergo better purchasing experience through the online reviews.^2–4 The expanded growth of the user-generated contents in the website and the social networks such as Twitter, Amazon, and the trip advisors has paved the way for the users to express their opinions and views on products, services, events, and many others, and the power of the social networks has been increased. The above facility of expressing the views online has made the views much valuable that has led the sentiment classification, a hot topic in identifying the sentiment classes of the online texts.⁵

Sentiment classification finds a valuable role in the day-to-day life of each individual, in the political environment, the production and release of certain products,⁶ and commercial activities, which makes the emotion classification, and finding attacks is a challenging task.⁷ Sentiment classification is the process of identifying the emotion whether it is positive or negative opinion for an opinion-laden text, which may be a product review, a blog post, an editorial, and so on.^8,9 The positive or negative opinion is provided by the community regarding a product, a person, a political party, or a policy. The above opinion classification is simply a binary classification carrying two classes, namely the positive and the negative. The presence of complex scenarios introduces the neutral class that makes the task a single-label multiclass classification, or leads to the assessment of the sentiment strength as very positive, positive, fair, negative, and very negative, resulting in ordinal classification.^10,11 Thus, sentiment analysis (SA) is the interesting criteria that classify the opinions and sentiments provided by humans in textual format.^5,12

SA is the critical process that generates the subjective information from the text documents that are available online. Ensemble learning is considered the robust classification method, but a lot of approaches used feature engineering for developing the sentiment classifiers.¹³ Sentence-level sentiment classification is the basic area in SA that aimed at deciding the sentiment polarity based on the textual content.¹⁴ The major concern is to represent the user reviews correctly,¹⁵ and the need for determining the emotions is of current interest. For instance, from the labeled data of a certain content, the adaptive classification of the social emotion depends on the classification of the readers' emotions of the unlabeled data in any other context.¹⁶ Affective text analysis is the interesting area that is relevant for natural language processing, web, and multimodal dialogue applications. The term affective analysis corresponds to the representation of the emotions,¹⁷ and the generated social emotions pave the way for document categorization; it is much useful for the users in selecting the related documents based on the preferences in the emotions, and in addition, it is valuable for a number of applications such as the contextual music recommendation. The research at present is focusing on the automatic prediction of the document emotions. It is known that each and every word neutrally possesses either the pleasant or painful emotion due to its semantic relationship with emotional concepts or categories.¹⁸

This article proposes a novel algorithm named the CAVIAR-Dragonfly optimization with Extended Naive Bayes (CDNB) optimization algorithm, integrating the CAVIAR formula, dragonfly optimization algorithm, and the Naive Bayes (NB) algorithm to perform the optimized classification. Initially, the BITS review from Twitter is taken to perform the classification that enables effective decision-making of BITS. The reviews are subjected to stop word removal in which the stop words that are the useless words are removed from the review. It is then followed by the stemming process that determines the root word for all the created words. Then, the feature matrix is developed with each review being converted to individual feature vector. The feature vector comprises 10 features, namely punctuation marks, number of sentences present in the review, number of elongated words, nondictionary words, numerical values, hashtags of the review, and mean, variance, entropy, and information gain of the score vector. These features are combined to form the feature vector, and they are applied to the proposed CDNB algorithm that performs the sentiment classification and sentiment affect classification using the maximum value of the NB-based objective function and maximum posterior probability principle.

The major contributions of the article are as follows:

C-Dragonfly optimization algorithm: The proposed C-Dragonfly optimization algorithm is the integration of the standard dragonfly optimization algorithm and the CAVIAR formula in which the position update in the dragonfly optimization is based on the CAVIAR formula. The proposed algorithm performs the optimal sentiment classification and sentiment affect classification for which it uses the maximum fitness function.

CDNB classifier: The CDNB classifier is used to determine the optimized model using the maximum posterior probability-based NB principle.

The article is organized as follows: A brief introduction to the article is presented in the Introduction section, and the Motivation section presents the related works with deep discussion of the existing methods and their advantages and elaborates the challenges of the work. In the Proposed Methodology section, the proposed method is discussed in detail with the algorithmic steps, the Results and Discussion section presents the results and discussion, and the Conclusion section concludes the article.

Motivation

This section discusses the motivation behind the development of the proposed CDNB optimization for sentiment and affect classification. Here, recent works related to the sentiment and affect classification are reviewed. Also, the major challenges to be taken for further work are also discussed.

Related works

Lin et al.¹ proposed a method for cross-lingual sentiment classification named Cross-Lingual JST (CLJST) and Cross-Lingual ASUM (CLASUM). Moreover, two unsupervised cross-lingual joint aspect/sentiment models were used for sentiment classification that made use of information/knowledge generated from a source language and contributed toward sentiment classification. The main advantages of the method include better accuracy, and it does not require parallel corpora, machine translation systems, and labeled sentiment texts, but showed poor robustness. Tang et al.¹⁴ proposed a framework for performing the sentence-level sentiment classification, which is termed the joint segmentation and classification framework. The main advantage is that there are no syntactic or polarity annotations during segmentations, but the drawback is that it did not use any of the semantic compositional methods. Bollegala et al.¹⁵ proposed a method named sentiment sensitive embeddings cross-domain. The objectives are framed based on the distributional properties, on the label constraints of the source domain documents, and on the geometric properties present in the unlabeled documents. Sentiment classification accuracies used as the metric joint optimization do not improve over the separately trained objectives, and the joint optimization strategy did not improve the performance. Ji et al.¹⁹ proposed a method using the NB classifier that involves two steps. The personal tweets are differentiated from the nonpersonal tweets in the first step, and the personal negative tweets are removed from the personal non-negative tweets in the second step. The main advantage is that it possesses better measure of concern negation, but the irony in the tweets is not reported.

Shang et al.²⁰ proposed a method for sentiment classification termed the fitness proportionate selection binary particle swarm optimization (F-BPSO) that is the modification of BPSO. It uses fitness sum in the fitness proportionate selection step. The main benefit is that it provides a better feature subset than traditional feature selection techniques, but the addition of the free parameters reduces the accuracy. Rao¹⁶ proposed a multilabeled sentiment topic model called the contextual sentiment topic model that performs the adaptive social emotion classification. It possesses the capacity to distinguish the context-independent topics from both background and contextual themes, but it did not consider the dynamic social emotions. Bao et al.¹⁸ proposed a model termed the latent Dirichlet allocation that initially develops a set of latent topics from the emotions and then extracts the affective terms from each topic. This method considers each term in the document, but is suitable only for small-scale documents. Malandrakis et al.¹⁷ proposed a model termed the fusion of n-gram models that do not require any linguistic resources other than the affective ratings, but it yields lower order n-gram ratings. Table 1 shows the literature review.

Table 1.

Literature review

Authors	Purpose	Methods	Limitations	Advantages	Results
Lin et al.¹	Control the resources in the source language for improving the sentiment classification in the target language.	Cross-Lingual JST (CLJST) and Cross-Lingual ASUM (CLASUM)	Poor robustness.	Does not require parallel corpora, machine translation systems, and labeled sentiment texts.	Experimental results show that this method is applicable to practical sentiment classification applications.
Tang et al.¹⁴	Sentence-level sentiment classification.	Joint segmentation and classification framework	It did not use any of the semantic compositional methods.	There is no syntactic or polarity annotations during segmentations.	This method outperforms pipelined baseline methods in different experimental settings.
Bollegala et al.¹⁵	Cross-domain sentiment classification.	Sentiment sensitive embedding cross-domain	Poor performance.	Some of the combinations of the constraints obtain good results.	This technique outperforms several baselines and previously proposed embedding learning methods when applied to cross-domain sentiment classification.
Ji et al.¹⁹	To address the issue of spreading public concern about epidemics.	NB classifier	The irony in the tweets is not reported.	It possesses better measure of concern negation.	The results support the hypothesis that this method is generalizable to other domains.
Shang et al.²⁰	Introduce a method for feature selection in sentiment classification.	Fitness proportionate selection binary particle swarm optimization	The addition of the free parameters reduces the accuracy.	It provides a better feature subset than traditional feature selection techniques.	The results of the experimentation show that this method is superior to traditional feature selection methods on subjective consumer review data sets.
Rao¹⁶	Social emotion classification.	CSTM	It did not consider the dynamic social emotions.	It possesses the capacity to distinguish the context-independent topics from both background and contextual themes.	The results show that the proposed CSTM outperformed the baseline models by a large margin.
Bao et al.¹⁸	Mining social emotions from affective text.	Latent Dirichlet Allocation	Suitable only for small-scale documents.	This method considers each individual term in the document.	Experimental results depict that this model effectively identifies meaningful latent topics for each emotion.
Malandrakis et al.¹⁷	Affective text analysis.	Fusion of n-gram models	It yields lower order n-gram ratings.	It does not require any linguistic resources other than the affective ratings.	The insertion of bigrams leads to significantly improved performance over the unigram-only models, with accuracy reaching 75.9%.

CSTM, contextual sentiment topic model; NB, Naive Bayes.

Challenges

The various challenges in this research work are enlisted as follows:

The main challenge is that the sentiment classifiers are mainly trained to perform the classification of a similar topic or specific domain. Whenever the test data from different domains arrive, the classifier provides poor classification performance, for which the language and the words that express the sentiments vary depending on the topic.^21,22

Insufficient label data lead to inaccurate training and affect the robustness of the classifier. In addition, they increase cost and time.³

The main challenge is regarding the context-sensitiveness that occurs while using the trained model of a particular topic for any other context.¹⁶

Feature selection is the most hectic challenge²⁰ as the transformed features should have a strong bond on the texts and context in the reviews.

In Ji et al.,¹⁹ the NB classifier was used for sentiment classification, which is applicable only for the data obeying the Gaussian data distribution. To make it suitable for all the data distribution, further optimization is required for the model characteristics.

Proposed Methodology

This section presents the proposed method of performing the sentiment classification and sentiment affect classification using the proposed C-Dragonfly optimization that uses the NB-based principle for classification. The novel CDNB algorithm possesses the following steps: Initially, the features of the review are extracted, and the model is generated by the NB classifier, and the classification is optimized by the CDNB classifier that performs the sentiment classification and the affect classification.

Proposed CDNB for classification

In this section, the algorithmic steps of the proposed CDNB algorithm are presented, and the steps involved in the sentiment and affect classification are discussed. Figure 1 depicts the process of optimization carried out using the CDNB optimization algorithm. The feature matrix generated from online Twitter reviews of BITS is fed to the proposed classifier that initially generates a number of models, and the best model is selected using the optimization algorithm. Once the best model is selected, the class value of the models is derived based on the maximum posterior probability function. For the selection of the best model, the objective function is framed that is based on accuracy, sensitivity, and specificity. The sentiment classification uses two classes, and the sentiment affect classification uses three classes.

FIG. 1.

CDNB Optimization algorithm for sentiment classification and sentiment affect classification. CDNB, CAVIAR-Dragonfly optimization with Extended Naive Bayes.

Training phase

NB model creation

Once the features are extracted from the reviews, the solution model is generated using the NB that employs the maximum posterior probability concept. The representation of the solution models is as follows: $S = \{S_{1}, S_{2}, \dots, S_{l}\}$ (1)

where $S_{1}, S_{2}, \dots, S_{l}$ are the number of the sample models in which the solution model S₁ is generated using the feature matrix, and it consists of the mean and variance values of the features for all the classes. The number of classes is two for sentiment classification, and the number of classes is three for sentiment affect classification. The other sample models, namely $S_{2}, S_{3}, \dots, S_{l}$ are generated based on the maximum/minimum bound condition. In the maximum/minimum bound condition, the mean value is determined using $m e a n = m e a n \pm 5 % o f m e a n$ , and the variance is determined using the condition $v a r i a n c e = v a r i a n c e \pm 5 % o f v a r i a n c e$ . Thus, all the possible samples are generated using the maximum/minimum bound condition from the sample vector S₁.

Optimizing the model

In this step, the generated models are optimally chosen using the novel proposed algorithm that employs the NB-based objective function. The following are the optimization steps.

Solution encoding

The solution encoding is the pictorial representation of the optimal solution determined using the proposed CDNB algorithm. The optimal classification of the training data enables the effective sentimental classification that may be either positive or negative, and in addition enables effective sentiment affect classification, which can be happy, love, or desire in case of positive class and anger, hate, or grief in case of negative class. The length of the solution vector is denoted as $[c \times 10]$ in which c is the number of class. The input to the proposed CDNB algorithm is the feature matrix that carries the M number of feature vectors corresponding to each review, and the dimension of the input is denoted as $[M \times Q]$ . Q is the total number of features and is equal to 10. Figure 2 depicts the solution encoding of the proposed algorithm. Each training data possess 10 features with 2 classes, namely positive and negative class for sentiment classification, whereas 10 features with 3 classes for sentiment affect classification.

FIG. 2.

Block diagram of the proposed CDNB algorithm. NB, Naive Bayes.

NB-based objective function

The objective function depends on three parameters, namely the accuracy, sensitivity or true positive rate (TPR), and specificity or true negative rate (TNR). The accuracy, sensitivity, and specificity should be maintained maximum to maintain a maximum value of the fitness and they are evaluated based on the NB classifier principle. The NB obeys a maximum posterior probability strategy, which is followed in the calculation of accuracy, TPR, and TNR. The formula for the NB-based objective function is as follows: $F i t n e s s, F = \frac{1}{3} \times [A c c u r a c y + T P R + T N R]$ (2) $S e n s i t i v i t y, T P R = \frac{T P}{P} = \frac{T P}{T P + F N}$ (3)

S p e c i f i c i t y, T N R = \frac{T N}{N} = \frac{T N}{T N + F P}

(4)

A c c u r a c y = \frac{T P + T N}{T P + F P + F N + T N}

(5)

where TP, TN, FP, and FN are the true positive, true negative, false positive, and false negative, and TP, TN, FP, and FN are determined to evaluate the fitness parameters. The fitness measure is evaluated for all the models generated in the classifier, and the model with the maximum fitness is selected as the best model.

Step 1: Read the training samples: The input samples are read by the classifier, and the input sample models are generated from the feature vector that carries M number of feature vectors corresponding to the individual reviews. The sample models generated for a feature vector are represented in Equation (1). The training models are developed based on the mean variance and the maximum/minimum bound principle, and this model is applied to the proposed CDNB classifier algorithm.

Step 2: Computing the posterior probability: The objective function is based on the maximum measure of fitness. The parameters of fitness depend on the posterior probability, and therefore, the posterior probability of the generated models is determined. The posterior probability of all the generated models is computed, and the computation is continued for all the sample models. Once the posterior probability of all the samples is computed, then the maximum probability is determined. The class corresponding to the maximum value of the posterior probability is assigned to the training sample.

Step 3: Generating the TP, TN, FP, and FN: Once the class is assigned to all the training samples or the models, the training model is matched with the classification model to determine the FP, FN, TP, and TN. Once the four parameters are determined, the fitness parameters, namely the accuracy, specificity, and sensitivity, are determined.

Step 4: Compute the fitness: The fitness is computed to select the best solution from the available models for which it uses the accuracy, sensitivity, and specificity parameters.

Proposed C-Dragonfly optimization algorithm

The C-Dragonfly optimization algorithm is a metaheuristic optimization algorithm, which is advantageous over the other evolutionary algorithms. The major advantages of the C-Dragonfly optimization algorithm are as follows. The information in the search space is collected and retained for future reference. Moreover, there are only a few control parameters in the search space, and they possess good flexibility. The convergence time of the proposed algorithm is very less due to the integration of the CAVIAR formula in the position update step of the standard dragonfly optimization algorithm. The proposed algorithm eases the optimal results in converging to the best solution. The proposed algorithm is based on the static and dynamic swarming behavior of the dragonflies. The proposed optimization algorithm follows the No Free Lunch (NFL) theorem.²³ Hence, it is more advantageous than the existing optimization algorithms. The static swarm of the dragonfly relies on the concept that the dragonfly takes a back and forth motion for hunting the prey, which is the exploration phase. In the static swarm, some of the dragonflies group themselves in the large group and move in the same direction for a long distance, which is the exploration phase of the dragonflies.

The principles followed in the swarm behavior are as follows. Separation that avoids the collision of the individual flies, alignment that corresponds to the matching of the velocities with each other, and cohesion that aligns the individual flies toward the center of the mass.

(i)

Initialization: The main step in the proposed CDNB algorithm is the initialization step in which the solutions are initialized, and initially, sample models are generated. The main aim of the total population is survival toward the fittest. The population or the sample models are represented in Equation (1).

(ii)

SA²CD parameter update for determining the position: The intention of the swarm is to survive for which the individuals traverse in the direction of availability of food, and they flutter away from the enemies. The five main parameters are updated based on the behavior of the individual swarms once the population is initialized. The parameters updated are separation, alignment, cohesion, attraction, and distraction.

B_{k} = - \sum_{k = 1}^{b} X - X_{k}

(6)

where X is the current position of the individual dragonfly, X_k is the position of the kth neighbor, and b be number of neighboring dragonflies with respect to the reference dragonfly. B_k denotes the separation parameter of the individual dragonflies. $D_{k} = \frac{\sum_{k = 1}^{n} v_{k}}{b}$ (7)

where D_k refers to the alignment of the fth dragonfly and v_k is the velocity of the kth neighbor. $G_{k} = \frac{\sum_{k = 1}^{b} X_{k}}{b} - X$ (8)

where X is the current position of the individual dragonfly and G_k refers to the cohesion of the fth dragonfly.

(iii)

Formulating the step vector: The position of the dragonfly is updated based on the step vector that determines the direction of the dragonfly, and therefore, the step vector of the dragonfly is updated as follows:

\begin{matrix} Δ X_{T + 1} & = (a_{1} . B_{k} + a_{2} . D_{k} + a_{3} . G_{k} + a_{4} . I_{k} + a_{5} . J_{k}) \\ + a_{6} . X_{t} \end{matrix}

(9)

where a₁, a₂, a₃, a₄, and a₅ are the weights of separation, alignment, cohesion, food factor, and enemy factor, respectively. These five weights are responsible for the effective tuning between the exploration phase and the exploitation phase and they balance the tuning process. a₆ denotes the initial weight, T represents the iteration number. B_k, D_k, G_k, I_k, and J_k denote the separation, alignment, cohesion, attraction, and distraction coefficients of the fth dragonfly.

(iv)

Calculating the objective function based on NB: The objective function is formulated based on the accuracy, specificity, and sensitivity shown in Equation (2) and the best solution depends on the maximum value of these parameters that is responsible for the maximum fitness value.

(v)

Updating the position of the food and the enemy: The main concept of the dragonfly optimization is the survival of the best such that they move toward the food and move away from the enemies finding a way for the survival of the dragonfly. The position of the food and the enemy is determined and based on that, the attraction and the distraction parameter is determined. The formula for calculating the attraction and distraction parameter is represented as follows:

I_{k} = X^{*} - X

(10)

where I_k represents the attraction of the dragonfly toward the food and $X^{*}$ is the position of the food. $J_{k} = X^{-} + X$ (11)

$X^{-}$ is the position of the enemy and J_k is the distraction of the dragonfly away from the enemy.

(vi)

A new means of position update: -CAVIAR-based position update of the dragonfly: The position of the dragonfly is based on the step vector and is represented as follows:

X_{T + 1} = X_{T} + Δ X_{T + 1}

(12)

X_{t} = X_{t + 1} - Δ X_{t + 1}

(13)

The position update of the standard dragonfly optimization is presented by Equation (12). The integration of the CAVIAR equation²⁴ imputes the importance of the CAVIAR in the optimization process. The main importance is that the newly designed equation (18) describes the time-varying processes, and it determines the position of the dragonfly that changes position randomly based on the availability of food and the presence of enemies. The CDNB is the simplest method of predicting the current position of the dragonfly, easing the time complexity and converging to the optimal solution with a faster convergence rate. Using the CAVIAR equation, the following equation is obtained: $X_{t} = γ_{0} + \sum_{z = 1}^{q} γ_{z} . X_{t - z} + \sum_{y = 1}^{g} γ_{y} . F |X_{t - y}|$ (14)

From the above equation, t is the prediction variable, and therefore, it is assumed as $t = t + 1$ . Substituting $q = 2$ , $g = 2$ , and $t = t + 1$ in the above equation, we get the following: $X_{t} = γ_{0} + γ_{1} . X_{t} + γ_{2} . X_{t - 1} + γ_{1} F |X_{t}| + γ_{2} F |X_{t - 1}|$ (15)

Substitute Equation (13) in the above equation and now Equation (15) becomes $\begin{matrix} X_{t + 1} & = γ_{0} + γ_{1} . (X_{t + 1} - Δ X_{t + 1}) + γ_{2} . X_{t - 1} \\ + γ_{1} F |X_{t}| + γ_{2} F |X_{t - 1}| \end{matrix}$ (16)

\begin{matrix} X_{t + 1} - γ_{1} . X_{t + 1} & = γ_{0} - γ_{1} Δ X_{t + 1} + γ_{2} . X_{t - 1} \\ + γ_{1} F |X_{t}| + γ_{2} F |X_{t - 1}| \end{matrix}

(17)

\begin{matrix} X_{t + 1} & = \frac{1}{(1 - γ_{1})} \\ [γ_{0} - γ_{1} Δ X_{t + 1} + γ_{2} . X_{t - 1} + γ_{1} F |X_{t}| + γ_{2} F |X_{t - 1}|] \end{matrix}

(18)

When the search mechanism is carried out using the random walk when there are no neighboring solutions, the position of the dragonfly is updated as shown below: $X_{T + 1} = X_{T} + L e v y (L) \times X_{T}$ (19) $X_{T + 1} = X_{T} (1 + L e v y (L))$ (20)

\begin{matrix} X_{T + 1} & = (γ_{0} + γ_{1} . X_{t} + γ_{2} . X_{t - 1} + γ_{1} F |X_{t}| + γ_{2} F |X_{t - 1}|) \\ (1 + L e v y (L)) \end{matrix}

(21)

X_{T + 1} = \frac{1}{(1 - γ_{1})} [(γ_{0} - γ_{1} . Δ X_{t + 1} + γ_{2} . X_{t - 1} + γ_{1} F | X_{t} | + γ_{2} F | X_{t - 1} |) (1 + L e v y (L))]

(22)

L e v y (L) = 0.01 \times \frac{R_{1} \times λ}{{|R_{2}|}^{\frac{1}{α}}}

(23)

λ = {(\frac{Γ (1 + α) \times S i n (\frac{π α}{2})}{Γ (\frac{1 + α}{2}) \times α \times 2 \times (\frac{α - 1}{2})})}^{\frac{1}{α}}

(24)

where $Γ (L) = (L - 1)!$ . T represents the current iteration and L is the dimension of the position vectors. R₁ and R₂ are the random vectors, $α$ is the constant, and the value is 1.5.

(vii)

Termination: The above steps are repeated until the optimal solution is selected. The optimal position of the dragonfly carries the best solution that is processed for affect analysis. The affect analysis and sentiment classification follow all the above steps in determining the best class value and the corresponding affect factors of positive and negative class.

Testing phase

Posterior probability for testing

Whenever the test model arrives, the proposed CDNB classifier algorithm is applied to the test model to determine the positive and negative class in case of sentiment classification, happy, love, and desire for positive class or anger, hate, and grief for negative class. Let us consider the test data as t, and once the test data arrive the posterior probability of the test model is determined using the following formula: $y_{t} = arg {max}_{k \in \{1, \dots, K\}} P (C_{k}) \prod_{d = 1}^{10} P o s t [f_{t}^{d}| C_{k}]$ (25)

where y_t is the posterior probability of the test model and $P (C_{k})$ refers to the probability of the class. The NB classifier follows the maximum posterior probability, the maximum value of the posterior probability is selected, and the corresponding class is assigned to the model. The calculation of the value of $P o s t [f_{t}^{d}| C_{k}]$ depends on the following formula. $P o s t [f_{t}^{d}| C_{k}] = \frac{1}{\sqrt{2 π σ^{2} (C_{k})}} . e^{- \frac{{(b - μ_{e})}^{2}}{2 σ_{c}^{2}}}$ (26)

where C_k is the kth class, $σ$ is the variance value, and $μ_{c}$ is the mean value. The classification results of the test model are performed in an effective way to take proper decision-making.

Proposed CDNB classifier for sentiment and affect classification

In this section, the proposed strategy of sentiment classification and the sentiment affect classification is presented. In general, sentiment classification reveals the ideas, experiences, grievances, problems of a product, or an organization, or government, and can enable effective decision-making process such that it benefits the society. The existing methods of classifying the emotions failed to provide optimal classification results and failed to recognize the emotions of the reviews. Therefore, a novel method named CDNB algorithm is proposed to perform the sentiment classification, and the sentiment affect classification that performs the optimal classification. The main objective of the work is to model the connections between the words and emotions to determine the exact emotions of the reviews that are expressed online. Figure 2 shows the block diagram of the proposed CDNB optimization algorithm to perform classification. Initially, the Twitter review of the BITS is taken for performing the classification. The classification of the BITS review determines the emotion of the person, who expressed the view that promotes taking effective steps in solving the issues of the environment such that it well benefits the organization. At first, the reviews are subjected to stop word removal in which all the meaningless words are removed from the sentence, and then, it is subjected to the stemming process that identifies the root words for all the words in the review sentence. Once the preprocessing step including the stop word removal and stemming is completed, the features are extracted. Each of the reviews is presented as a single-feature vector, and hence, the feature vectors for all the reviews are generated and sent to the proposed classifier that classifies as positive and negative for sentiment classification, happy, love, and desire for positive class or anger, hate, and grief for negative class. The classification result of the proposed classifier is determined optimally, and the results enable the institution to make effective decisions for the enhancement of the BITS environment.

Preprocessing the Twitter reviews of BITS

Consider the M number of reviews taken from the online Twitter reviews of BITS and represented as follows: $R = \{r_{1}, r_{2}, \dots ., r_{n}, \dots r_{M}\}$ (27)

where r_n refers to the nth review among the M number of online BITS reviews. Let us consider the nth review, which consists of m number of words. The number of words in the mth review is represented as follows: $r_{n} = \{w_{1}, w_{2}, \dots, w_{m}\}$ (28)

The words in the reviews are subjected to preprocessing to enable effective emotion identification. The preprocessing stage involves two steps: namely, the stop word removal and the stemming process that processes the words present in the individual reviews and presents it to the feature extraction step.

Stop word removal

To avoid ambiguity and loading problems, the stop words are removed from the reviews such that it enables instant results and eases the process of feature extraction.²⁵ $r_{n} = \{w_{1}, w_{2}, \dots, w_{u}, \dots w_{s}\}; s < m$ (29)

where s is the total number of words present in the nth review after the removal of the stop words. $w_{1}, w_{2}, \dots, w_{u}, \dots w_{s}$ are the words in the review after the removal of the stop words.

Stemming

Stemming is the technique of reducing the inflected (or sometimes derived) words to their word stem, base, or root form, generally a written word form.

Development of the feature matrix

The filtered words are subjected to feature extraction for which all the reviews are converted to a feature vector and each feature vectors carries 10 features. The 10 features for every review are extracted, and the feature vector representation for the nth review is as follows: $A_{n} = [f_{n}^{1} f_{n}^{2} f_{n}^{3} \dots \dots \dots \dots . . f_{n}^{10}]$ (30)

where A_n is the feature vector corresponding to the nth review. $f_{n}^{1} f_{n}^{2} f_{n}^{3} \dots \dots \dots \dots . . f_{n}^{10}$ refers to the feature corresponding to the number of punctuation marks present in the nth review, number of sentences present in the nth review, number of elongated words, nondictionary words, numerical values, hashtags of the nth review, and mean, variance, entropy, and information gain of the score vector. The mean, variance, entropy, and information gain of the score vector are determined from the score vector that is generated by applying the individual words of the review in the SentiWordNet. The individual words in the review applied to the SentiWordNet are represented as follows: $[S P_{u}^{n} S N_{u}^{n}] = S e n t i W o r d N e t [w_{u}^{n}]$ (31)

Once the individual words are applied to the SentiWordNet, two categories of the score, namely the positive score and negative score are updated. Using these score measures, the score vector is obtained based on the below condition: $V_{n}^{u} = \{\begin{matrix} 1; i f {S P}_{u}^{n} > {S N}_{u}^{n} \\ 0; O t h e r w i s e \end{matrix}$ (32)

where ${S P}_{n}^{u}$ and ${S N}_{n}^{u}$ are the positive and negative scores of each word in the review. The score value is updated as 1 when the positive score is greater than the negative score and 0 when the negative score is greater than the positive score. Similarly, the score vector for all the reviews is updated, and the score vector for all the reviews is represented as follows: $V = \{V_{1}^{1}, V_{2}^{2}, \dots, V_{u}^{n}, \dots V_{s}^{M}\}$ (33)

The length of the score vector V is denoted as $[M \times s]$ . The score vector is utilized to determine the following parameters such as mean, variance, entropy, and information gain. The mean of the score vector is as follows: $f_{n}^{7} = \frac{1}{s} \times \sum_{q = 1}^{s} V_{u}^{n}$ (34)

where $V_{u}^{n}$ is the vector value of uth word present in the nth review. s is the total number of words in the individual review after the application of stop word removal. The entropy feature is as follows: $f_{n}^{9} = - \sum_{i = 1}^{U (V_{u}^{n})} P_{j} \times log P_{j}$ (35)

$U (V_{n}^{u})$ is the unique value in the score vector of the individual reviews and P_j is the probability of the unique values in the score vector. The information gain feature is as follows: $f_{n}^{10} = E_{H} (V_{u}^{n}) - C E_{H} (V_{u}^{x, n}, V_{u}^{y, n})$ (36)

where $E_{H} (V_{u}^{n})$ is the holo-entropy corresponding to the score vector of the uth word present in the nth review, and $C E_{H} (V_{u}^{x, n}, V_{u}^{y, n})$ is the conditional entropy of the score vector of the uth word present in the nth review. $C E_{H} (V_{u}^{x, n}, V_{u}^{y, n}) = \sum_{t = 0}^{U (V_{u}^{n})} P_{f} \times E_{H} (V_{u}^{x, n}, V_{u}^{y, n})$ (37)

E_{H} (V_{u}^{x, n}, V_{u}^{y, n}) = W_{H} \times E (V_{u}^{x, n}, V_{u}^{y, n})

(38)

W_{H} = 2 \times (1 - \frac{1}{1 + e^{[- E (V_{u}^{x, n}, V_{u}^{y, n})]}})

(39)

\begin{matrix} E (V_{u}^{x, n}, V_{u}^{y, n}) & = \sum_{v = 1}^{U (V_{u}^{n})} P (V_{u}^{x, n} = x, V_{u}^{y, n} = y) \\ log (V_{u}^{x, n} = x, V_{u}^{y, n} = y) \end{matrix}

(40)

where W_H is the weighing coefficient and E is the entropy parameter. The features mentioned above are computed for all the reviews, and each review is converted into the feature vector of dimension $[1 \times 10]$ and all the individual feature vectors are composed to form the feature matrix. The dimension of the feature matrix is $[M \times 10]$ .

Sentiment classification by CDNB algorithm

The sentiment classification of the BITS review enables effective decision-making, and hence, the proposed method concentrates on the topic to perform an effective sentiment classification that returns any of the two classes, which is either positive or negative. Initially, the features are extracted from the review, and the features are provided to the NB classifier that generates the model based on the mean variance and maximum/minimum bound condition such that the number of possible models is generated. Then the posterior probability of the models is computed to determine the class label such that the class for the model depends on the maximum value of the posterior probability. The class label is either positive or negative. Therefore, the review can be categorized, and this makes better decisions.

Affect classification by CDNB classifier

The sentiment affect classification is performed to classify in which mood the user gives the reviews on BITS. The three classes referred for the positive class are happy, love, or desire, and the classes for the negative review may be anger, hate, or grief.

Figure 3 shows the proposed method of sentiment classification and sentiment affect classification.

FIG. 3.

Pseudocode of the proposed CDNB optimization algorithm for sentiment classification and sentiment affect classification.

Results and Discussion

This section presents the performance analysis of the sentiment classification and the sentiment affect classification methods to prove the effectiveness of the proposed CDNB method along with the comparative analysis of the methods.

Experimental setup

For the experimentation, we utilize the Twitter Sentiment Analysis Training Corpus Data Set,²⁶ which is about the tweet of #BITS PILANI. The implementation is performed in the JAVA programming that runs in the Windows 8 OS with 4 GB RAM personal computer. The two setups used for analysis include the following: Setup 1: In setup 1, the values of $γ_{0} = 0.5$ , $γ_{1} = 0.4$ , and $γ_{2} = 0.3$ and the analysis is carried out. Setup 2: In setup 2, the values of $γ_{0} = 0.9$ , $γ_{1} = 0.8$ , and $γ_{2} = 0.7$ and the analysis is carried out.

Research Questions:

Which preprocessing techniques are available for converting natural language text into appropriate format?

Is preprocessing a successful method to improve classification independent of implementation approach?

How to analyze the performance of the sentiment classification and affective state classification system?

Based on these research questions, the experimentation is done, and the performance of the proposed method is analyzed.

Data set description

The experiment is performed using the Twitter Sentiment Analysis Training Corpus Data Set. The data of the data set are collected from the University of Michigan Sentiment Analysis competition on Kaggle and Twitter Sentiment Corpus by Niek Sanders. The Twitter Sentiment Analysis Data Set contains 1,578,627 classified tweets with each row entered as 1 for positive sentiment and 0 for negative sentiment.

Evaluation metrics

The metrics used for evaluating the effectiveness of the proposed method include TPR, TNR, and accuracy.

Competing methods taken for comparison

The methods taken for comparison include the SentiWordNet,²⁷ NB,¹⁹ and Naive Bayes with the dragonfly algorithm (NBDF). The effectiveness of the proposed method is proven by comparing the methods in terms of the performance metrics, such as TPR, TNR, and accuracy.

Performance analysis

The performance of both the sentiment and affect classification stages is analyzed using the parameters, such as TPR, TNR, and accuracy.

Using the setup 1

Sentiment classification

Figure 4 shows the comparative analysis of the sentiment classification methods in terms of the metrics, such as TPR, TNR, and accuracy. In Figure 4a, the TPR percentage of the classification methods is analyzed. When the percentage of the data is 60, the TPR obtained for the methods such as SentiWordNet, NB, NBDF, and the proposed CDNB is 84.17660%, 85.0419%, 86.4408%, and 87.7628%, respectively. The proposed method shows 4.09%, 3.1%, and 1.51% improved performance than the SentiWordNet, NB, and NBDF. With the increasing percentage of the data, the TPR percentage increases. The increase of the TPR percentage is noted from 60% of training data to 90% of training data. When the percentage of data is 90, the TPR of the methods, such as SentiWordNet, NB, NBDF, and the proposed CDNB, is 86.9399%, 87.9976%, 88.5063%, and 89.7066%, respectively. Here, the proposed method shows 3.08%, 1.91%, and 1.34% improved performance than the existing methods, SentiWordNet, NB, and NBDF, respectively. It is evident that the TPR percentage of the classification methods increases, but the proposed CDNB method of sentiment classification exhibits greater TPR when compared with the other existing methods.

FIG. 4.

Sentiment classification using the setup 1. (a) Data percentage versus TPR. (b) Data percentage versus TNR. (c) Data percentage versus accuracy. TNR, true negative rate; TPR, true positive rate. Color images are available online.

In Figure 4b, the TNR percentage of the classification methods is analyzed. When the percentage of the data is 60, the TNR obtained for the methods such as SentiWordNet, NB, NBDF, and the proposed CDNB is 67.8254%, 69.0331%, 69.7231%, and 71.3336%, respectively. With the increasing percentage of the data, the TNR percentage increases. The increase of the TNR percentage is noted from 60% to 90%, and when the percentage of data is 90, the TNR percentage of the methods such as SentiWordNet, NB, NBDF, and the proposed CDNB is 69.2538%, 70.8516%, 71.8159%, and 72.8849%, respectively. It is evident that the TNR percentage of the classification methods increases, but the proposed CDNB method of sentiment classification exhibits greater TNR when compared with the other existing methods.

In Figure 4c, the accuracy of the classification methods is analyzed. When the percentage of the data is 60, the accuracy obtained for the methods such as SentiWordNet, NB, NBDF, and the proposed CDNB is 74.1228%, 76.0686%, 77.1933%, and 78.3741%, respectively. With the increasing percentage of the data, the accuracy percentage increases. The increase of the accuracy percentage is noted from 60% to 90%, and when the percentage of data is 90, the accuracy percentage of the methods such as SentiWordNet, NB, NBDF, and the proposed CDNB is 76.0767%, 77.6144%, 78.5227%, and 79.555%, respectively. It is evident that the accuracy percentage of the classification methods increases, but the proposed CDNB method of sentiment classification exhibits greater accuracy when compared with the other existing methods. The preprocessing used in the proposed model helps to improve the performance of the classifier.

Sentiment affect classification

Figure 5 shows the comparative analysis of the sentiment affect classification methods in terms of the metrics, such as TPR, TNR, and accuracy. In Figure 5a, the TPR rate of the classification methods is analyzed. When the percentage of the data is 60, the TPR obtained for the methods such as SentiWordNet, NB, NBDF, and the proposed CDNB is 84.1766%, 80.3396%, 81.6248%, and 83.4495%, respectively. With the increasing percentage of the data, the TPR percentage increases. The increase of the TPR percentage is noted from 60% to 90%, and when the percentage of data is 90, the TPR percentage of the methods such as SentiWordNet, NB, NBDF, and the proposed CDNB is 86.9399%, 82.7225%, 83.27%, and 84.9071%, respectively. It is evident that the TPR percentage of the classification methods increases, but the proposed CDNB method of sentiment affect classification exhibits greater TPR when compared with the other existing methods.

FIG. 5.

Sentiment affect classification using setup 1. (a) Data percentage versus TPR. (b) Data percentage versus TNR. (c) Data percentage versus accuracy. Color images are available online.

In Figure 5b, the TNR percentage of the sentiment affect classification methods is analyzed. When the percentage of the data is 60, the TNR obtained for the methods such as SentiWordNet, NB, NBDF, and the proposed CDNB is 67.8254%, 62.5604%, 64.3089%, and 64.4182%, respectively. With the increasing percentage of the data, the TNR percentage increases. The increase of the TNR percentage is noted from 60% to 90%, and when the percentage of data is 90, the TNR percentage of the methods such as SentiWordNet, NB, NBDF, and the proposed CDNB is 69.2538%, 64.8470%, 65.3725%, and 66.44%, respectively. It is evident that the TNR percentage of the classification methods increases, but the proposed CDNB method of sentiment affect classification exhibits greater TNR when compared with the other existing methods.

In Figure 5c, the accuracy of the sentiment affect classification methods is depicted with all the comparative methods. When the percentage of data is 60, the proposed model obtained an accuracy of 74.2602%, which is 0.19%, 1.65%, and 1.65% higher than the accuracy obtained by the existing methods, such as SentiWordNet, NB, and NBDF, respectively. With the increasing percentage of the data, the accuracy percentage increases. The increase of the accuracy percentage is noted from 60% to 90%, and when the percentage of data is 90, the accuracy percentage of the methods such as SentiWordNet, NB, NBDF, and the proposed CDNB is 76.07674%, 74.5915%, 75.6414%, and 76.1204%, respectively. It is evident that the accuracy percentage of the classification methods increases, but the proposed CDNB method of sentiment affect classification exhibits greater accuracy when compared with the other existing methods.

Using the setup 2

Sentiment classification

Figure 6 shows the comparative analysis of the sentiment classification methods in terms of the metrics, such as TPR, TNR, and accuracy. In Figure 6a, the TPR of the classification methods is analyzed. When the percentage of the data is 60, the TPR obtained for the methods such as SentiWordNet, NB, NBDF, and the proposed CDNB is 84.1766%, 86.3232%, 86.7547%, and 87.7552%, respectively. With the increasing percentage of the data, the TPR percentage increases. The increase of the TPR percentage is noted from 60% to 90%, and when the percentage of data is 90, the TPR percentage of the methods such as SentiWordNet, NB, NBDF, and the proposed CDNB is 86.9399%, 87.992%, 87.9972%, and 90.0516%, respectively. It is evident that the TPR percentage of the classification methods increases, but the proposed CDNB method of sentiment classification exhibits greater TPR when compared with the other existing methods.

FIG. 6.

Sentiment classification using the setup 2. (a) Data percentage versus TPR. (b) Data percentage versus TNR. (c) Data percentage versus accuracy. Color images are available online.

In Figure 6b, the TNR percentage of the sentiment classification methods is analyzed. When the percentage of the data is 60, the TNR obtained for the methods such as SentiWordNet, NB, NBDF, and the proposed CDNB is 67.8254%, 68.7915%, 69.6040%, and 71.4195%, respectively. With the increasing percentage of the data, the TNR percentage increases. The increase of the TNR percentage is noted from 60% to 90%, and when the percentage of data is 90, the TNR percentage of the methods such as SentiWordNet, NB, NBDF, and the proposed CDNB is 69.2538%, 69.5766%, 71.8249%, and 72.78692%, respectively. It is evident that the TNR percentage of the classification methods increases, but the proposed CDNB method of sentiment classification exhibits greater TNR when compared with the other existing methods.

In Figure 6c, the accuracy of the sentiment classification methods is depicted with all the comparative methods. When the percentage of the data is 60, the accuracy obtained for the methods such as SentiWordNet, NB, NBDF, and the proposed CDNB is 74.1228%, 77.0156%, 76.5645%, and 77.9685%, respectively. With the increasing percentage of the data, the accuracy percentage increases. The increase of the accuracy percentage is noted from 60% of training data to 90% of training data. When the percentage of data is 90, the accuracy of the proposed CDNB model is 80.3420%, which is 5.31%, 2.82%, and 1.35% higher than the accuracy of the existing methods, such as SentiWordNet, NB, NBDF, respectively. It is evident that the accuracy percentage of the classification methods increases with the increase in the training data, but the proposed CDNB method of sentiment classification exhibits greater accuracy when compared with the other existing methods.

Sentiment affect classification

Figure 7 shows the comparative analysis of the sentiment affect classification methods in terms of the metrics, such as TPR, TNR, and accuracy. In Figure 7a, the TPR of the classification methods is analyzed. When the percentage of the data is 60, the TPR obtained for the methods such as SentiWordNet, NB, NBDF, and the proposed CDNB is 84.1766%, 80.7332%, 81.6768%, and 83.1231%, respectively. With the increasing percentage of the data, the TPR percentage increases. The increase of the TPR percentage is noted from 60% to 90%, and when the percentage of data is 90, the TPR percentage of the methods such as SentiWordNet, NB, NBDF, and the proposed CDNB is 86.9399%, 82.7646%, 83.9067%, and 84.8487%, respectively. It is evident that the TPR percentage of the classification methods increases, but the proposed CDNB method of sentiment affect classification exhibits greater TPR when compared with the other existing methods.

FIG. 7.

Sentiment affect classification using the setup 2. (a) Data percentage versus TPR. (b) Data percentage versus TNR. (c) Data percentage versus accuracy. Color images are available online.

In Figure 7b, the TNR percentage of the sentiment affect classification methods is analyzed. When the percentage of the data is 60, the TNR obtained for the methods such as SentiWordNet, NB, NBDF, and the proposed CDNB is 67.8254%, 62.5021%, 64.43372%, and 64.699%, respectively. With the increasing percentage of the data, the TNR percentage increases. The increase of the TNR percentage is noted from 60% to 90%, and when the percentage of data is 90, the TNR percentage of the methods such as SentiWordNet, NB, NBDF, and the proposed CDNB is 69.2538%, 64.9910%, 66.4977%, and 66.9749%, respectively. It is evident that the TNR percentage of the classification methods increases, but the proposed CDNB method of sentiment affect classification exhibits greater TNR when compared with the other existing methods.

In Figure 7c, the accuracy of the sentiment affect classification methods is depicted with all the comparative methods. When the percentage of the data is 60, the accuracy obtained for the methods such as SentiWordNet, NB, NBDF, and the proposed CDNB is 74.1228%, 73.6704%, 73.5917%, and 76.0811%, respectively. With the increasing percentage of the data, the accuracy percentage increases. The increase of the accuracy percentage is noted from 60% to 90%, and when the percentage of data is 90, the accuracy percentage of the methods such as SentiWordNet, NB, NBDF, and the proposed CDNB is 76.0767%, 74.5251%, 76.0490%, and 77.2505%, respectively. It is evident that the accuracy percentage of the classification methods increases, but the proposed CDNB method of sentiment affect classification exhibits greater accuracy when compared with the other existing methods.

Comparative discussion

The comparative discussion of the classification method is presented in Table 2. The metrics are compared in terms of the average TPR, average TNR, and average accuracy. The average TPR percentage of the sentiment classification using the methods, such as SentiWordNet,²⁷ NB,¹⁹ NBDF, and CDNB, is 85.24201, 87.1521, 87.7192, and 89.0934, respectively. Here, the proposed method shows 4.32%, 2.18%, and 1.54% improvement than the existing methods, such as SentiWordNet, NB, and NBDF, respectively. The average TPR percentage of the sentiment classification using the proposed CDNB method is greater than all other methods. Similarly, the average TPR percentage of sentiment affect classification using the proposed CDNB method is 84.2122, which is 1.15%, 3.13%, and 1.87% greater than other methods. The average TNR percentage for sentiment classification using the SentiWordNet,²⁷ NB,¹⁹ NBDF, and CDNB is 68.29894, 70.2333, 70.6812, and 79.3591, whereas for the affect classification, the average TNR percentage is 63.2989, 63.8265, 65.1921, and 66.2187. The greater accuracy is determined using the proposed method that is 79.359 for sentiment classification and 76.6249 for affect classification.

Table 2.

Comparative analysis of the sentiment classification and affect classification methods

Methods	Sentiment classification			Sentiment affect classification
Methods	Average TPR (%)	Average TNR (%)	Average accuracy (%)	Average TPR (%)	Average TNR (%)	Average accuracy (%)
SentiWordNet ²⁷	85.24201	68.29894	70.0134	83.24201	63.2989	75.0134
NB¹⁹	87.1521	70.2333	77.5748	81.5786	63.8265	74.1978
NBDF	87.7192	70.6812	78.0801	82.6371	65.1921	74.714
CDNB	89.0934	72.3064	79.3591	84.2122	66.2187	76.6249

The bold values represent the best performance.

CDNB, CAVIAR-Dragonfly optimization with Extended Naive Bayes; NBDF, Naive Bayes with the dragonfly algorithm; TNR, true negative rate; TPR, true positive rate.

From the analysis, it can be shown that the proposed CDNB model has maximum performance than the existing methods. The reasons for the improved performance of the proposed model are that it is the straightforward process of predicting the current position of the dragonfly and converges to the optimal solution with a faster convergence rate. Due to the integration of the CAVIAR formula in the position update step of the standard dragonfly optimization algorithm, the convergence time of the proposed algorithm is very less. Also, the proposed model follows the NFL theorem, which increases the overall performance.

Conclusion

This article proposes a novel CDNB model for sentiment classification and affective state classification. The proposed CDNB performs the classification based on the proposed C-Dragonfly optimization and the maximum posterior probability-based NB principle. This novel C-dragonfly algorithm is the integration of the CAVIAR formula in the standard dragonfly optimization algorithm. Once the optimized model is determined using the proposed model, it employs the maximum posterior probability-based classification that classifies the data as positive or negative. In addition, it performs affective state classification that enables the effective decision-making process. The proposed method exhibits maximum TPR, TNR, and accuracy percentages of 89.0934%, 72.3064%, and 79.3591% for the sentiment classification and 84.2122%, 66.2187%, and 76.6249% for the sentiment affective state classification. In the future, we will utilize more data sets to analyze the performance of the proposed classifier.

Footnotes

Author Disclosure Statement

No competing financial interests exist.

Funding Information

No funding was received from any organization or agency.

Abbreviations Used

References

Lin

, Jin

, Xu

, et al. An unsupervised cross-lingual topic model framework for sentiment classification. IEEE/ACM Trans Audio Speech Lang Process. 2016; 24:432–444.

Liu

, Bi

, Fan

. A method for multi-class sentiment classification based on an improved one-vs-one (OVO) strategy and the support vector machine (SVM) algorithm. Inf Sci. 2017; 394:38–52.

, Huang

, Yuan

. Domain-specific sentiment classification via fusing sentiment knowledge from multiple sources. Inf Fusion. 2017; 35:26–37.

Philip

, Mukesh

. Hybrid tracking model for multiple object videos using second derivative based visibility model and tangential weighted spatial tracking model. Int J Comput Intell Syst. 2016; 9:888–899.

Araque

, Platas

, Rada

JFS

, Iglesias

. Enhancing deep learning sentiment analysis with ensemble techniques in social applications. Expert Syst Appl. 2017; 77:236–246.

Hernández

, Sanchez

, Sánchez

, et al. Security attack prediction based on user sentiment analysis of Twitter data. In: Proceedings of the IEEE International Conference on Industrial Technology (ICIT), Taipei, Taiwan, 2016. IEEE, pp. 610–617.

Phu

, Dat

, Tran

VTN

, et al. Fuzzy C-means for English sentiment classification in a distributed system. Appl Intell. 2017; 46:717–738.

Ohana

, Tierney

. Sentiment classification of reviews using SentiWordNet. In: Proceedings of the 9th IT&T Conference, Dublin, Ireland: Dublin Institute of Technology, 22nd.–23rd, 2009.

Anastasia

, Budi

Twitter sentiment analysis of online transportation service providers. In: Proceedings of the 2016 International Conference on Advanced Computer Science and Information Systems (ICACSIS), Malang, Indonesia, 2016. IEEE, pp. 359–365.

10.

Gao

, Sebastiani

. From classification to quantification in tweet sentiment analysis. Soc Netw Anal Min. 2016; 6:19.

11.

Mirjalili1

Dragonfly algorithm: A new meta-heuristic optimization technique for solving single-objective, discrete, and multi-objective problems. Neural Comput Appl. 2016; 27:1053–1073.

12.

Karanasou

, Ampla

, Doulkeridis

, Halkidi

Scalable and real-time sentiment analysis of twitter data. In: Proceedings of the 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW), Barcelona, Spain, 2016. IEEE, pp. 944–951.

13.

Onan

, Koruko glu

, Bulut

. A hybrid ensemble pruning approach based on consensus clustering and multi-objective evolutionary algorithm for sentiment classification. Inf Process Manage. 2017; 53:814–833.

14.

Tang

, Qin

, Wei

, et al. A Joint Segmentation and Classification Framework for Sentence-Level Sentiment Classification. IEEE/ACM Trans Audio Speech Lang Process. 2015; 23:1750–1760.

15.

Bollegala

, Mu

, Goulermas

. Cross-domain sentiment classification using sentiment sensitive embeddings. IEEE Trans Knowl Data Eng. 2016; 28:1–14.

16.

Rao

Contextual sentiment-topic model for adaptive social emotion classification. IEEE Intell Syst. 2016; 31:41–47.

17.

Malandrakis

, Potamianos

, Iosif

, Narayanan

. Distributional semantic models for affective text analysis. IEEE Trans Audio Speech Lang Process. 2013; 21:2379–2392.

18.

Bao

, Xu

, Zhang

, et al. Mining social emotions from affective text. IEEE Trans knowl Data Eng. 2012; 24:1658–1670.

19.

, Chun

, Wei

, Geller

. Twitter sentiment classification for measuring public health concerns. Soc Netw Anal Min. 2015; 5.

20.

Shang

, Zhou

, Liu

. Particle swarm optimization-based feature selection in sentiment classification. Soft Comput. 2016; 20:3821–3834.

21.

Liu

, Cheng

, Li

. TASC:Topic-adaptive sentiment classification on dynamic tweets. IEEE Trans Knowl Data Eng. 2015; 27:1696–1709.

22.

, Huang

, Song

. Structured microblog sentiment classification via social context regularization. Neurocomputing. 2016; 175:599–609.

23.

Wolpert

, Macready

. No free lunch theorems for optimization. IEEE Trans Evol Comput. 1997; 1:67–82.

24.

Engle

, Manganelli

. CAViaR: Conditional autoregressive value at risk by regression quantiles. J Bus Econ Stat. 2004; 22:367–381.

25.

Hassan

, Miriam

, Yulan

, Harith

On stop words, filtering and data sparsity for sentiment analysis of Twitter. In: Proceedings of LREC 2014, Ninth International Conference on Language Resources and Evaluation, Reykjavik, Iceland, 2014. European Language Resources Association, pp. 810–817.

26.

Twitter Sentiment Analysis Dataset. Available at: http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22 (accessed December 5, 2018).

27.

Esuli

, Sebastiani

Sentiwordnet: A publicly available lexical resource for opinion mining. In: Proceedings of the 5th Conference on Language Resources and Evaluation (LREC’06), Genoa, Italy, 2006. European Language Resources Association, pp. 417–422.