Feature selection for sentiment analysis using hybrid multiobjective evolutionary algorithm

Abstract

As the volume of data continues to grow, text classification is becoming increasingly important, since most of this data exists in the form of text. Effective data preparation is essential to extract sentiment from this vast amount of text, as irrelevant and redundant information can obscure valuable insights. Feature selection is an important step in the data preparation phase because it eliminates irrelevant and insignificant features from a very large feature set. There exists a large body of work on feature selection for image processing, but limited research has been done for text data. While some studies recognize the significance of feature selection in text classification, there is still a need for more efficient sentiment analysis models that optimize feature selection and reduce computational cost. This manuscript aims to bridge these gaps by introducing a hybrid multi-objective evolutionary algorithm as a feature selection mechanism, combining the power of multiple objectives and evolutionary processes. The approach combines two feature selection techniques within a binary classification model: a filter method, Information Gain (IG), and an evolutionary wrapper method, Binary Multi-Objective Grey Wolf Optimizer (BMOGWO). Experimental evaluations are conducted across six diverse datasets. The proposed model achieves a reduction of over 90 percent in feature size while improving accuracy by nearly nine percent. These results showcase the model’s efficiency in terms of computational time and its efficacy in terms of higher classification accuracy, which improves sentiment analysis performance. This improvement can be beneficial for various applications, including recommendation systems, reviews analysis, and public opinion observation. However, it is crucial to acknowledge certain limitations of this study, including the need for broader classifier evaluation and scalability considerations with larger datasets. These identified limitations serve as directions for future research and the enhancement of the proposed approach.

Keywords

Feature selection; sentiment analysis; multi-objective optimization; evolutionary algorithms

1 Introduction

This is the big data era, in which an immense amount of data is generated on a daily basis. According to Forbes, experts believe that 463 exabytes of data will be generated daily worldwide by 2025 [1]. This data comes from diverse domains such as education, health, and finance, and exists in the form of text, audio, video, graphs, etc. We can categorize this data for useful insights and predictions. The majority of the data is generated in the form of text, such as tweets, product reviews, news reports, and website articles.

Sentiment analysis, a form of text categorization, serves as a pivotal tool for extracting sentiments and opinions from this vast corpus of textual data. Sentiment analysis plays a major role in many real world applications such as recommendation systems [2], reviews analysis [3], public opinion monitoring [4] etc. Yet, the analysis of sentiments is not without its challenges, including the use of slang, typographical errors, and more. In data mining, the quality of the analysis is intrinsically tied to the quality of the data [5]. Consequently, effective sentiment analysis necessitates thorough data preparation to eliminate redundancy and irrelevance from raw data.

Feature selection is a data preparation technique that identifies a subset of relevant, informative features from an initially large feature set. This serves to reduce dimensionality, optimize memory utilization, and enhance performance [6]. Feature selection methods generally fall into three categories: filter-based, wrapper-based, and embedded methods. Filter-based techniques assess features using mathematical scoring metrics such as Information Gain (IG) [7], Chi-square [8], Pearson Correlation [9], and others. Wrapper-based methods employ machine learning algorithms to assess features, and they often outperform filter methods in predictive performance. Embedded methods, on the other hand, incorporate feature selection within machine learning algorithms, offering computational efficiency.

Feature selection is inherently an optimization problem with dual objectives: minimizing the feature set size while maximizing classification accuracy. Much of the existing research on feature selection focuses on single-objective optimization problems, with limited attention devoted to multi-objective scenarios [10]. Additionally, the combination of filter and wrapper methods is a relatively unexplored terrain. To our knowledge, only one prior study [11] has endeavored to apply a hybrid multi-objective algorithm to feature selection in the context of sentiment analysis.

In this paper, we present a solution to the feature selection problem using a hybrid multi-objective feature selection model that combines Information Gain (IG) and Binary Multi-Objective Grey Wolf Optimizer (BMOGWO) [12]. The proposed model is evaluated on the Stanford Sentiment Treebank (SST) dataset, four Amazon product reviews datasets, and the Sentiment140 dataset. Our results are compared against state-of-the-art feature selection methods like IG and NSGA-II, illustrating that our model significantly reduces the subset of features while concurrently improving classification accuracy.

1.1 Research objectives

The study seeks to address the challenges associated with data preparation for sentiment analysis, particularly in dealing with irrelevancy and redundancy in textual data. The study aims to combine filter based and wrapper based methods for feature selection. The research objectives of this study revolve around recognizing the importance of text classification, addressing data preparation challenges, exploring feature selection for text data, and advocating for more efficient sentiment analysis models.

1.2 Research contributions

The primary contributions of our research can be summarized as follows:

Pioneering the fusion of Information Gain (IG) with Binary Multi-Objective Grey Wolf Optimizer (BMOGWO) for sentiment analysis.

Harnessing BMOGWO as a wrapper function that leverages filter-based IG scores within its updating mechanism.

Empirical evidence from a diverse range of datasets, establishing the superior performance of our model in comparison to state-of-the-art feature selection methods.

The remainder of the paper is organized as follows. Section 2 describes recent studies on feature selection. Section 3 presents the proposed feature selection model. Section 4 discusses the experimental setup and results. Section 5 presents the limitations of this study. Section 6 presents the conclusion and future work.

2 Literature review

Sentiment analysis has received significant attention from a large number of studies due to its important role in real-world applications in various domains [13]. To provide a comprehensive understanding, we summarize the main methods of sentiment analysis, describe Information Gain (IG) and its features, discuss related work on sentiment analysis and feature selection using evolutionary algorithms, address existing research gaps, and explain the purpose behind this research.

2.1 Mainstream methods of sentiment analysis

Sentiment analysis has three mainstream methods as discussed in the following:

Rule-Based Systems: Rule-based systems employ predefined linguistic rules to identify sentiment expressions, offering transparency but potential limitations in adaptability and flexibility [14, 15]. These methods are not popular as they cannot be scaled to large datasets.

Lexicon-based Approaches: These methods use predefined sentiment dictionaries to assess sentiment in text [16]. These methods lack the context information for analyzing sentiment of text.

Machine Learning Techniques: Machine learning models like Support Vector Machines, Naive Bayes, and deep learning models such as BERT are widely used for sentiment analysis [17, 18]. They excel at capturing text patterns but often require extensive labeled data.

In this study, we use machine learning based methods for sentiment analysis. Feature selection plays a pivotal role in improving the performance of machine learning models. The feature selection methods used in this study are discussed in the next section.

2.2 Information Gain (IG) and BMOGWO

Feature selection has two mainstream approaches: the filter-based approach and the wrapper-based approach. The filter-based approach involves the steps of feature scoring, feature ranking, and feature selection. Wrapper-based methods generate different feature subsets, evaluate a model on each subset, and then select the subset that gives the best performance.

In this study, we use Information Gain (IG) [7] and Binary Multi-Objective Grey Wolf Optimizer (BMOGWO) [19, 20] for feature selection. IG is an important metric for feature selection and text classification. It is a filter-based feature selection approach. Its key features include its ability to identify the most informative features for classification tasks, contributing to better model performance. We have chosen IG for feature selection due to its proficiency in reducing uncertainty in the target variable, a key aspect in semantic contexts. IG’s capability to highlight the most informative features is crucial for effective binary classification, ensuring focus on pertinent data aspects. Furthermore, IG’s alignment with BMOGWO’s objectives enhances the optimization process, ensuring a balance between model complexity and performance.

IG is defined in detail in Section 3.2. BMOGWO is an optimization algorithm whose features include its binary nature, the use of a sigmoid transfer function, and competitive results on feature selection problems. It is a wrapper-based approach for feature selection and is discussed in detail in Section 3.3.

2.3 Related work on feature selection for text classification

Recent studies, such as the work by Abiodun et al. [21], offer invaluable insights into feature selection methods for text classification. Their systematic review of over 200 articles underscores the promise of metaheuristics and hybridized versions to optimize feature selection, advocating for the use of hyper-heuristics to enhance classification accuracy. Their comprehensive assessment of feature selection methods serves to clarify the intricate details of accurate method implementation. Deng et al. [22] conducted a detailed review of feature selection techniques in the context of sentiment analysis. Their work delves into the advantages and drawbacks of these methods while clarifying the most suitable classifiers to accompany them. Additionally, they explore areas like multi-label feature selection, online feature selection, and stability measures for feature selection methods.

Gokalp et al. [23] contributed an algorithm based on an iterated greedy approach for feature selection. Utilizing pre-calculated scores from filter methods and the Multinomial Naïve Bayes classifier, they conducted experiments on various datasets, including public sentiment and Amazon product reviews, demonstrating the superiority of their algorithm over existing alternatives. Deniz et al. [11] introduced a groundbreaking hybrid feature selection method that addressed the problem with a two-objective approach. They employed Information Gain Filtering (IGF) as a filter method, followed by NSGA-II as a wrapper method. Their experiments, employing text representation techniques like Bag-of-Words (BoW) and Glove, showcased the remarkable performance of the hybrid method, IGF+NSGA-II, outshining its individual components.

Banka and Dara [24] proposed a binarized particle swarm optimization (HDBPSO) algorithm, leveraging Hamming distance to reduce dimensionality in feature selection. Their comparative experiments underscored the algorithm’s effectiveness in discovering feature subsets with competitive classification accuracy. Shang et al. [25] introduced a binary particle swarm optimization (FS-BPSO) algorithm that employed fitness sum proportionate selection. By enhancing velocity updating and evaluating individual features, it addressed the limitations of standard BPSO. Experiments on consumer review datasets validated its ability to select high-quality feature subsets.

Shunmugapriya P and Kanmani S [26] combined ant and bee colony optimization to create the AC-ABC Hybrid for feature selection, mitigating weaknesses in both algorithms. Their novel approach demonstrated improved global search efficiency and avoidance of stagnation issues. Liu et al. [27] tackled imbalanced and high-dimensional datasets using multi-objective ant colony optimization. Employing the Bootstrap method, they compared their model with six state-of-the-art algorithms, revealing its superior performance. Tawhid and Dsouza [28] proposed a hybrid algorithm called HBBEPSO, leveraging the BAT algorithm for exploration and PSO for exploitation. Experiments on 20 UCI repository datasets demonstrated its efficacy in identifying optimal feature subsets. Mafarja and Mirjalili [29] presented a hybrid method combining Whale Optimization Algorithm (WOA) with simulated annealing to enhance feature selection. Their method improved classification performance and was competitive with three wrapper-based methods on 18 UCI repository datasets. Al-Tashi et al. [12] introduced BMOGWO, a binary version of the Multi-Objective Grey Wolf Optimizer (MOGWO) using a sigmoid transfer function for feature selection. Comparative experiments against MOGWO with a tanh transfer function, NSGA-II, and MOPSO demonstrated BMOGWO’s competitive results with reduced computational cost.

The collective body of research in feature selection methods and hybrid algorithms provides a rich foundation for advancing the field of sentiment analysis and text classification, bridging the gap between theoretical frameworks and practical implementation.

2.4 Existing research gaps and motivations for this research

Despite these advances, research gaps persist, notably the need for more interpretable and efficient sentiment analysis models that optimize feature selection. This manuscript aims to bridge these gaps by proposing a novel approach integrating IG and BMOGWO in sentiment analysis. This research is motivated by the dual objective of addressing these existing gaps and enhancing the practicality of sentiment analysis methods. Information Gain (IG) and the Multi-Objective Grey Wolf Optimizer (MOGWO) offer distinct advantages for feature selection in sentiment analysis. IG is computationally efficient, model-independent, and interpretable; it reduces dimensionality and requires no classifier training, since it is computed directly from feature values and class labels. MOGWO is flexible for optimization, handles multiple objectives, combines global and local search, adapts to problem spaces, and enhances model performance through optimal feature subset selection. When used together, IG and MOGWO provide a well-rounded approach for efficient and effective feature selection in sentiment analysis, addressing various challenges and objectives. The ultimate goal is to facilitate informed decision-making across diverse applications, from marketing to customer feedback analysis and social media monitoring. This research contributes to the evolution of sentiment analysis methods and their real-world utility.

3 Proposed methodology

This section describes the feature selection procedure for sentiment analysis and introduces the proposed feature selection approach. Feature selection is a crucial data preparation technique for extracting valuable features from the feature set. This process not only accelerates the training phase of classification models but also enhances their learning capabilities. Figure 1 shows the workflow of our proposed model. The overall methodology can be summarized in the following steps.

Preprocess the dataset

Apply filter based feature selection using information gain measure

Apply wrapper based feature selection by applying binary MOGWO algorithm on selected features from previous step

Apply machine learning algorithm for text classification (sentiment analysis) using selected subset of features from previous steps and calculate classification accuracy

The above mentioned steps are explained in detail in the following sections.

Fig. 1

The proposed feature selection model.

3.1 Preprocessing

Preprocessing is the foundational stage of any sentiment analysis algorithm, and it can make or break a model’s performance; it is a crucial step in a classification problem [30]. This step removes features that do not contribute to the classification problem, leaving a more meaningful subset of features. The resulting reduction in dimensionality shortens the training process and simplifies the data structure. The preprocessing phase consists of four steps, followed by the Bag-of-Words text representation:

3.1.1 Lowercase conversion

Every word in the data is converted to lowercase in this step. This step is to reduce sparsity in the data so the model considers two words similar regardless of their case.

3.1.2 Punctuation removal

During this step, the data undergoes a cleaning process to eliminate any punctuation marks. This step is imperative to ensure that the classifier doesn’t misinterpret or treat punctuation marks in a manner similar to words.

3.1.3 Tokenization

In this step, the sentences are tokenized, which involves identifying the individual words. These words are split apart to be used in the next phase.

3.1.4 Stop words removal

In this step, stop words are removed from the data. Stop words are the very frequent words occurring in documents that do not provide meaningful information about the actual text. Removal of stop words leaves us with valuable words for further processing.

3.1.5 Bag-of-Words (BoW)

The Bag of Words (BoW) method is employed as a text representation scheme, yielding fixed-length vectors to represent textual data. BoW creates a vector for each sentence, irrespective of the word order within the data. In this approach, every word functions as a feature, and each BoW vector encapsulates the content of a specific sentence. The BoW representation is the chosen approach in our work.
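The four preprocessing steps and the BoW representation described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the stop-word list is a tiny hypothetical one, tokenization is assumed to be a simple whitespace split, and the BoW vectors are assumed to hold raw word counts.

```python
import string
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (assumption for illustration).
STOP_WORDS = {"a", "an", "the", "is", "it", "this", "of", "and", "to", "was"}

def preprocess(sentence):
    """Lowercase, strip ASCII punctuation, tokenize on whitespace, drop stop words."""
    lowered = sentence.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    return [tok for tok in tokens if tok not in STOP_WORDS]

def bag_of_words(sentences):
    """Return (vocabulary, vectors): one fixed-length count vector per sentence."""
    tokenized = [preprocess(s) for s in sentences]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

vocab, X = bag_of_words(["This movie was GREAT, great fun!", "The plot is dull."])
# vocab is the sorted list of surviving words; each row of X counts them per sentence.
```

Every vocabulary entry becomes one feature, so the vector length equals the vocabulary size regardless of sentence length or word order.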

3.2 Filter-based feature selection

This section delves into the initial feature selection methods applied in this study. Our first approach involves the use of Information Gain (IG) as a filter-based method. IG assesses each feature’s information content, quantifying how much valuable information each feature carries. For a single feature, denoted as F, Information Gain is calculated as follows:

$IG (N, F) = Entropy (N) - \sum_{u \in U} \frac{| N_{u} |}{| N |} Entropy (N_{u})$ (1) where N is the set of all instances in the data, F is a single feature, and U is the set of unique values taken by F. $N_{u}$ is the subset of N in which F takes the value u. The entropy of a set S is calculated as: $Entropy (S) = - \sum_{c \in C} p_{c} \log_{2} p_{c}$ (2) where C represents the set of all classes and $p_{c}$ is the ratio of the instances belonging to class c over all instances in S.

After calculating the Information Gain of all features, we compute a threshold and remove the features whose IG values fall below it. The threshold is the median of the IG values of all features: we sorted the IG values into quartiles and set the threshold at the second quartile (50th percentile), so that only the top 50 percent of features by IG are retained. This balances model efficiency and effectiveness. A higher threshold, such as the third quartile, would risk excluding crucial discriminative features, whereas a lower threshold, such as the first quartile, would clutter the model with low-impact features. This step keeps the features with higher predictive power. We refer to this process as Information Gain Filtering (IGF).
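As a rough sketch of Information Gain Filtering, the snippet below computes the entropy and the information gain of each feature and then keeps the features whose score is not below the median. The function names and the toy data are assumptions made for illustration; features are passed as columns of observed values.

```python
import math
from collections import Counter
from statistics import median

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def ig_filter(columns, labels):
    """columns maps a feature name to its values; keep features with IG >= median IG."""
    scores = {name: information_gain(vals, labels) for name, vals in columns.items()}
    threshold = median(scores.values())
    kept = [name for name, score in scores.items() if score >= threshold]
    return scores, threshold, kept

# Toy data (hypothetical): four sentences, label 1 = positive, 0 = negative.
labels = [1, 1, 0, 0]
columns = {"great": [1, 1, 0, 0], "plot": [1, 0, 1, 0], "dull": [0, 0, 1, 0]}
scores, threshold, kept = ig_filter(columns, labels)
```

In this toy case "great" separates the classes perfectly and "plot" carries no information, so the median rule drops only "plot".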

3.3 Wrapper-based feature selection

This section introduces the second feature selection method, which leverages the Binary Multi-Objective Grey Wolf Optimizer (BMOGWO) algorithm. This approach is categorized as a wrapper-based feature selection technique. Before delving into the algorithm, it is important to establish a clear understanding of what constitutes a multi-objective optimization problem. Additionally, we explore how to assess the results using Pareto fronts as a key evaluation metric.

3.3.1 Multi-objective optimization and Pareto front

Feature selection can be formulated as a multi-objective optimization task aiming to optimize two objectives: reducing the features set size and maximizing classification accuracy. We can define this multi-objective optimization problem as: $minimize f_{1} (x) = | M |, M \subseteq N$ (3) $maximize f_{2} (x) = accuracy (M)$ (4) where N denotes the whole features set and M is the subset of N i.e., the selected feature subset. Accuracy gives us the percentage of correctly predicted data points from all the data points.

In an optimization problem involving multiple objectives, we can get a solution set containing all the possible solutions to the problem. Within this set, one solution may excel in terms of one objective, while another surpasses it in a different objective. To visually illustrate this concept within the context of our specific problem, we present sample solutions in Fig. 2.

Fig. 2

Sample solutions for the multi-objective optimization problem.

The objective in our problem is to simultaneously minimize the number of features and maximize classification accuracy. Figure 2 serves as an illustrative example of solutions for this multi-objective problem, specifically feature selection. The green solutions (S1 to S4) in the figure represent the Pareto front, consisting of non-dominated solutions. No other solution is better than any of them in both objectives.

In contrast, the blue solutions are classified as dominated solutions. This means that at least one other solution within the Pareto front dominates them in both objectives. For instance, S3 surpasses S5 in both criteria, as it exhibits a reduced feature count and higher accuracy. This can be mathematically expressed as:

$f_{1} (S 3) < f_{1} (S 5)$ (5) $f_{2} (S 3) > f_{2} (S 5)$ (6)

In these equations, S3 dominates solution S5, demonstrating its superiority in both feature reduction and classification accuracy.

The Pareto front solutions hold a unique position in that they are not dominated by any other solution, including those within the Pareto front itself. Let us illustrate this with solutions S1 and S2 as examples. S1 excels over S2 in terms of feature count, having a smaller number of features. Meanwhile, S2 outperforms S1 in accuracy. As each surpasses the other in a different objective, neither dominates the other. Thus, the solutions on the Pareto front represent the best possible trade-offs in the context of the defined objectives. Our primary aim in a multi-objective problem is to identify and characterize this Pareto front in accordance with the specified objectives.
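The dominance relation and the extraction of the Pareto front can be sketched in a few lines. The coordinates below are hypothetical values chosen only to mimic the layout of S1 to S5, and the check uses the common convention that a solution dominates another if it is no worse in both objectives and strictly better in at least one, which includes the strict case of Eqs. (5) and (6).

```python
def dominates(a, b):
    """a and b are (num_features, accuracy) pairs; fewer features and higher accuracy win."""
    no_worse = a[0] <= b[0] and a[1] >= b[1]
    strictly_better = a[0] < b[0] or a[1] > b[1]
    return no_worse and strictly_better

def pareto_front(solutions):
    """Return the solutions that no other solution dominates."""
    return [s for s in solutions if not any(dominates(o, s) for o in solutions)]

# Hypothetical (num_features, accuracy) values for S1..S5.
S1, S2, S3, S4, S5 = (10, 0.70), (20, 0.78), (30, 0.83), (45, 0.85), (40, 0.80)
front = pareto_front([S1, S2, S3, S4, S5])
```

With these values S3 dominates S5, while S1 and S2 do not dominate each other, so the front consists of S1 to S4.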

3.3.2 Binary Multi-Objective Grey Wolf Optimizer (BMOGWO)

For wrapper-based feature selection, we apply the Binary Multi-Objective Grey Wolf Optimizer (BMOGWO). Each solution is represented by a wolf as [f₁, f₂, . . . , f_k], where k denotes the total number of features and f_k corresponds to the kth feature of the dataset. The total number of features gives us the length of a solution. This is a binary optimization problem, implying that a feature can be selected (1) or not (0). A sample solution is shown in Fig. 3. In the figure, we can see that the solution has selected features 2, 4, 5 and 7. So, the number of features (first objective value) of this solution is 4. The other features, i.e., 1, 3 and 6, are removed and the classifier is trained on the selected features, which yields the accuracy (second objective value) for this solution.

Fig. 3

A sample solution.
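To make the binary encoding concrete, the sketch below applies the sample solution of Fig. 3 to a single BoW vector and computes the first objective. The helper names and the example vector are assumptions, and `accuracy_fn` is a placeholder for training and scoring a classifier on the selected features.

```python
def project(vector, mask):
    """Keep only the entries of a BoW vector whose mask bit is 1."""
    return [value for value, bit in zip(vector, mask) if bit == 1]

def objectives(mask, accuracy_fn):
    """Return (number of selected features, accuracy) for one wolf."""
    return sum(mask), accuracy_fn(mask)

# Sample solution from Fig. 3: features 2, 4, 5 and 7 out of 7 are selected (1-based).
mask = [0, 1, 0, 1, 1, 0, 1]
selected = [i + 1 for i, bit in enumerate(mask) if bit == 1]

# Hypothetical BoW vector with 7 features, reduced to the 4 selected ones.
reduced = project([3, 0, 1, 2, 0, 5, 1], mask)

# Placeholder accuracy function (assumption): a real one would train a classifier.
n_features, accuracy = objectives(mask, accuracy_fn=lambda m: 0.0)
```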

In our BMOGWO implementation, the algorithm proceeds as follows. Initially, we generate a random population. Then each search agent is evaluated based on the objective functions. We calculated fitness values using two criteria: the number of selected features and the classification accuracy obtained with the Multinomial Naive Bayes (MNB) classifier. We determined the number of selected features by counting the features included in each model variant. For classification accuracy, we assessed each model variant with the MNB classifier. This approach balances a lean model with fewer features against predictive performance, which is essential in multi-objective optimization tasks.

The non-dominated solutions are retrieved from the wolves to initialize the archive. Then, the three best solutions, namely alpha, beta and delta, are selected from the archive. The selection of these leaders and the check for whether the archive is full are done by a controller. The three leaders guide the other wolves towards promising search areas, in the hope of finding the global best or near-global best solutions. The leader selection strategy picks the least crowded regions of the search space and chooses one of the non-dominated solutions in such a region as alpha, beta or delta. Each hypercube is assigned a probability of being chosen in the roulette-wheel selection method. This probability is defined below: $P_{i} = \frac{b}{S_{i}}$ (7) where b is a constant greater than 1 and $S_{i}$ is the number of Pareto front solutions in the ith segment.
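A minimal sketch of the roulette-wheel leader selection is given below. It assumes that the archive has already been grouped by hypercube (segment), that the weights of Eq. (7) are normalized so they sum to one, and that b = 4; none of these choices is taken from the experimental setup, and the function names are hypothetical.

```python
import random

def segment_probabilities(segment_counts, b=4.0):
    """Eq. (7): weight b / S_i per occupied segment, normalized to sum to 1."""
    raw = {seg: b / n for seg, n in segment_counts.items() if n > 0}
    total = sum(raw.values())
    return {seg: p / total for seg, p in raw.items()}

def select_leader(archive_by_segment, rng, b=4.0):
    """Pick a segment by roulette wheel, then a random solution inside it."""
    counts = {seg: len(sols) for seg, sols in archive_by_segment.items()}
    probs = segment_probabilities(counts, b)
    segments = list(probs)
    chosen = rng.choices(segments, weights=[probs[s] for s in segments], k=1)[0]
    return rng.choice(archive_by_segment[chosen])

# Hypothetical archive: segment "A" holds one solution, segment "B" holds three.
archive = {"A": [(10, 0.70)], "B": [(20, 0.78), (30, 0.83), (45, 0.85)]}
probs = segment_probabilities({seg: len(s) for seg, s in archive.items()})
alpha = select_leader(archive, random.Random(0))
```

Because the weight is inversely proportional to the number of solutions in a segment, the sparsely populated segment "A" is three times as likely to be chosen as "B", which favors less crowded regions.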

Following the initialization phase, the iterations commence. In each iteration, the positions of the wolves are updated. Subsequently, the continuous values obtained during this update are converted into binary form. This binarization is performed with a sigmoid transfer function, as follows: $x_{d}^{t + 1} = \begin{cases} 1 & \text{if } sigmoid \left( \frac{x_{1} + x_{2} + x_{3}}{3} \right) \geq rand \\ 0 & \text{otherwise} \end{cases}$ (8) where $x_{d}^{t + 1}$ is the updated binary grey wolf position in dimension d after iteration t, rand is a random number drawn uniformly from [0, 1], and sigmoid(a) is calculated as follows: $sigmoid (a) = \frac{1}{1 + e^{- 10 (a - 0.5)}}$ (9)

$x_{1}$, $x_{2}$ and $x_{3}$ are defined by the following equations: $x_{1}^{d} = \begin{cases} 1 & \text{if } (x_{α}^{d} + {bstep}_{α}^{d}) \geq 1 \\ 0 & \text{otherwise} \end{cases}$ (10) $x_{2}^{d} = \begin{cases} 1 & \text{if } (x_{β}^{d} + {bstep}_{β}^{d}) \geq 1 \\ 0 & \text{otherwise} \end{cases}$ (11) $x_{3}^{d} = \begin{cases} 1 & \text{if } (x_{δ}^{d} + {bstep}_{δ}^{d}) \geq 1 \\ 0 & \text{otherwise} \end{cases}$ (12) where bstep for the alpha, beta and delta wolves is defined as: ${bstep}_{α, β, δ}^{d} = \begin{cases} 1 & \text{if } {cstep}_{α, β, δ}^{d} \geq rand \\ 0 & \text{otherwise} \end{cases}$ (13) where rand is a random number drawn uniformly from [0, 1], and ${cstep}_{α, β, δ}^{d}$ is a continuous-valued step size for dimension d, calculated using a sigmoid function as shown below: ${cstep}_{α, β, δ}^{d} = \frac{1}{1 + e^{- 10 (A_{1}^{d} D_{α, β, δ}^{d} - 0.5)}}$ (14)

Here $\vec{A}$ is a coefficient vector, defined as: $\vec{A} = 2 \vec{a} \cdot \vec{r_{1}} - \vec{a}$ (15) where $\vec{a}$ is a vector that decreases linearly from 2 to 0 over the course of the iterations and can be formulated as: $\vec{a} = 2 - t \cdot \frac{2}{maxIter}$ (16) where maxIter is the maximum number of iterations and $\vec{r_{1}}$ is a random vector taken uniformly from [0, 1]. The distance vector $\vec{D}$ is calculated as: $\vec{D} = | \vec{C} \cdot {\vec{X}}_{p (t)} - {\vec{X}}_{(t)} |$ (17) where $\vec{C}$ is a coefficient vector defined as: $\vec{C} = 2 \cdot \vec{r_{2}}$ (18) where $\vec{r_{2}}$ is a random vector taken uniformly from [0, 1], ${\vec{X}}_{p (t)}$ is the vector representing the position of the prey, and ${\vec{X}}_{(t)}$ is the vector representing the position of the wolf.
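The update rules of Eqs. (8) to (18) can be sketched for a single wolf as follows. This is an illustrative reading of the equations rather than the implementation used in the experiments: it assumes that the prey position in Eq. (17) is the position of the respective leader, that fresh random numbers are drawn for every dimension and leader, and the function names are hypothetical.

```python
import math
import random

def sigmoid(a):
    """Eq. (9): steep sigmoid centred at 0.5."""
    return 1.0 / (1.0 + math.exp(-10.0 * (a - 0.5)))

def leader_bit(leader_value, wolf_value, a, rng):
    """Eqs. (10)-(18) for one dimension and one leader (alpha, beta or delta)."""
    A = 2.0 * a * rng.random() - a                  # Eq. (15)
    C = 2.0 * rng.random()                          # Eq. (18)
    D = abs(C * leader_value - wolf_value)          # Eq. (17), leader taken as prey
    cstep = sigmoid(A * D)                          # Eq. (14)
    bstep = 1 if cstep >= rng.random() else 0       # Eq. (13)
    return 1 if leader_value + bstep >= 1 else 0    # Eqs. (10)-(12)

def update_wolf(wolf, alpha, beta, delta, t, max_iter, rng):
    """Return the new binary position of one wolf at iteration t."""
    a = 2.0 - t * (2.0 / max_iter)                  # Eq. (16)
    new_position = []
    for d in range(len(wolf)):
        x1 = leader_bit(alpha[d], wolf[d], a, rng)
        x2 = leader_bit(beta[d], wolf[d], a, rng)
        x3 = leader_bit(delta[d], wolf[d], a, rng)
        keep = sigmoid((x1 + x2 + x3) / 3.0) >= rng.random()   # Eq. (8)
        new_position.append(1 if keep else 0)
    return new_position

rng = random.Random(0)
new_wolf = update_wolf(
    wolf=[0, 1, 0, 1, 1, 0, 1],
    alpha=[1, 1, 0, 0, 1, 0, 1],
    beta=[1, 0, 0, 1, 1, 0, 0],
    delta=[0, 1, 1, 1, 0, 0, 1],
    t=10, max_iter=200, rng=rng,
)
```

A dimension in which all three leaders select the feature yields sigmoid(1), which is close to 1, so the feature is very likely kept; a dimension in which no leader selects it yields a value close to 0 unless the random step flips some of the leader bits.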

The wolves are again evaluated based on the objective functions and the non-dominated solutions are extracted. We update the archive with the obtained non-dominated solutions. If the archive has no space left, we use the grid mechanism to remove the extra solution(s) before the new solutions are added. The three best solutions are then determined again. If the stopping criterion is met, we return the archive. The flowchart of the proposed model is shown in Fig. 1 and the pseudo-code of the feature selection algorithm is shown in Algorithm 1.

3.4 Machine learning classifier

We utilized the Multinomial Naïve Bayes (MNB) classifier for our experiments, which uses word frequency information. MNB has proven to be very effective in text classification tasks. MNB assumes that, given the class c, the features are independent of one another (conditional independence). MNB is a naïve classification method that uses the Bayes rule. According to the Bayes rule, the probability of a document d belonging to a class c is calculated by: $P (c | d) = \frac{P (d | c) P (c)}{P (d)}$ (19)
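The way MNB applies Eq. (19) to word counts can be illustrated with a small pure-Python sketch. It is not the classifier implementation used in the experiments: Laplace smoothing with alpha = 1 is an assumption, the computation is done in log space, and P(d) is dropped because it is the same for every class and does not affect the predicted label.

```python
import math

def train_mnb(vectors, labels, alpha=1.0):
    """Estimate log P(c) and log P(word | c) from BoW count vectors."""
    n_features = len(vectors[0])
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(labels)):
        rows = [v for v, y in zip(vectors, labels) if y == c]
        log_prior[c] = math.log(len(rows) / len(vectors))
        counts = [sum(col) for col in zip(*rows)]      # word totals within class c
        total = sum(counts) + alpha * n_features       # smoothed denominator
        log_likelihood[c] = [math.log((n + alpha) / total) for n in counts]
    return log_prior, log_likelihood

def predict_mnb(model, vector):
    """Return the class with the highest log P(c) + sum_w count_w * log P(w | c)."""
    log_prior, log_likelihood = model
    scores = {
        c: log_prior[c] + sum(x * w for x, w in zip(vector, log_likelihood[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Toy data (hypothetical); feature order: dull, fun, great, movie, plot.
X = [[0, 1, 2, 1, 0], [1, 0, 0, 0, 1]]
y = ["pos", "neg"]
model = train_mnb(X, y)
label_great = predict_mnb(model, [0, 0, 1, 0, 0])
label_dull = predict_mnb(model, [1, 0, 0, 0, 0])
```

The conditional independence assumption shows up in the sum over per-word log-likelihoods: each word contributes to the score separately, given the class.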

4 Experiments

This section presents the datasets used for experiments, parameter settings, followed by results and discussion.

4.1 Datasets

We performed our experiments on six datasets. The first dataset is the Stanford Sentiment Treebank (SST). The next four are Amazon product reviews datasets (Watches, Mobile Electronics, Gift Card and Digital Video Games). The sixth is the Sentiment140 dataset. We provide the characteristics of all six datasets in Table 1. These datasets are discussed in the following sections.

Table 1
Datasets used in experimental setup

Dataset	Number of instances	Number of features
SST	10754	15163
Amazon Watches	10000	13257
Amazon Mobile Electronics	10000	16271
Amazon Gift Card	10000	10269
Amazon Digital Video Games	10000	19018
Sentiment140	10000	17852

4.1.1 Stanford Sentiment Treebank

This is a well-known dataset used for sentiment analysis in the literature. It was introduced by Socher et al. [31] in 2013 containing movie reviews. We used the SST-5 dataset which contains more than 10,000 samples with labelled training and test sets. We removed the sentences having neutral sentiment as we focus on binary classification. We mapped the negatively labelled sentences i.e., labels 1 and 2 to 1, and positively labelled sentences i.e., 4 and 5 to 2. The total number of instances of both sentiments in the training and test sets is shown in Table 2.

Table 2
Number of instances for each sentiment in each dataset

Dataset	Sentiment	Train set size	Test set size
Amazon Watches	Positive	3722	1278
	Negative	3778	1222
Amazon Mobile Electronics	Positive	3765	1235
	Negative	3735	1265
Amazon Gift Card	Positive	3748	1252
	Negative	3752	1248
Amazon Digital Video Games	Positive	3757	1243
	Negative	3743	1257
SST Dataset	Positive	3610	909
	Negative	3310	912
Sentiment140	Positive	3724	1276
	Negative	3776	1224

4.1.2 Amazon product reviews

We used four Amazon product review datasets, i.e., Watches, Mobile Electronics, Gift Card and Digital Video Games [32]. The original datasets are large, with between 100,000 and 1,000,000 instances each. We cut each of them down to 10,000 instances, with exactly 50 percent positive and 50 percent negative reviews in the whole dataset. The total number of instances of both sentiments in the training and test sets of all four datasets is shown in Table 2.

4.1.3 Sentiment140

This is a widely used dataset in the literature containing 1,600,000 samples [33]. It consists of tweets expressing sentiment about a brand, product or topic. It was also cut down to 10,000 samples, with exactly 50 percent positive and 50 percent negative tweets in the whole dataset. The total number of instances of both sentiments in the training and test sets is shown in Table 2.

4.2 Parameter settings

In this work, we kept the training and test sets of SST as given, i.e., 6920 training samples and 1821 test samples. The other five datasets, i.e., the four Amazon review datasets and the Sentiment140 dataset, were randomly partitioned into 75 percent training and 25 percent test sets. We set the population size (number of wolves) to 100 and the number of iterations to 200. A larger population increases computation time more than additional iterations do, because every additional individual must be evaluated in each iteration, so we limited the population size to keep the computation tractable. We chose 200 iterations based on the convergence behavior observed in preliminary runs: this count gave BMOGWO enough cycles to explore and exploit the solution space and to converge, without unduly extending the computation time. The detailed parameter settings are shown in Table 3.

Table 3
Parameter settings of used algorithms

Parameters	NSGA-II	BMOGWO
Number of iterations	200	200
Population size/Number of wolves	100	100
Archive size	100	100
Crossover ratio	1	-
Mutation ratio	0.02	-
alpha	-	0.1
nGrid	-	10
beta	-	4
gamma	-	2

4.3 Fitness value calculation for feature selection

We computed fitness values from two criteria: the number of selected features and the classification accuracy of the Multinomial Naive Bayes (MNB) classifier. The number of selected features was obtained by counting the features included in each candidate solution, and the classification accuracy was measured by evaluating MNB on those features. Together, the two objectives let us search for models with fewer features that retain high predictive performance, a crucial aspect of multi-objective optimization tasks. To establish the feature selection threshold, we used the median of the Information Gain (IG) values. We organized the IG values into quartiles and placed the threshold at the median, i.e., the second quartile (50th percentile), so that only the top 50% of features by IG were selected. A lower threshold, such as the first quartile, would retain 75% of the features and clutter the model with low-impact ones, whereas a higher threshold, such as the third quartile, would retain only 25% and risk excluding essential discriminative features.
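The filter step and the two-objective fitness described above can be sketched as follows. This is an illustrative sketch under stated assumptions: IG is computed on binary term presence, ties at the median are kept, and `accuracy_fn` is a hypothetical callable standing in for training and evaluating MNB on the selected features. The toy corpus and `toy_accuracy` are invented for the example.

```python
import math
from statistics import median


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts if n > 0)


def information_gain(docs, labels, term):
    """IG of the presence/absence of `term` with respect to the class label."""
    classes = sorted(set(labels))
    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]

    def h(ys):
        return entropy([ys.count(c) for c in classes])

    n = len(labels)
    return h(labels) - len(present) / n * h(present) - len(absent) / n * h(absent)


def ig_median_filter(docs, labels):
    """Keep terms whose IG is at or above the median IG (assumed tie rule)."""
    vocab = sorted({w for d in docs for w in d})
    scores = {w: information_gain(docs, labels, w) for w in vocab}
    threshold = median(scores.values())
    return [w for w in vocab if scores[w] >= threshold]


def fitness(mask, features, accuracy_fn):
    """Two objectives for one candidate: (number of features, accuracy)."""
    selected = [f for f, bit in zip(features, mask) if bit]
    return len(selected), accuracy_fn(selected)


# Toy corpus (hypothetical): each document is a set of tokens.
docs = [
    {"amazing", "film", "the"},
    {"amazing", "plot", "the"},
    {"charming", "film", "the"},
    {"dismal", "film", "the"},
    {"dismal", "plot", "the"},
    {"dismal", "film", "the"},
]
labels = ["pos", "pos", "pos", "neg", "neg", "neg"]
kept = ig_median_filter(docs, labels)


def toy_accuracy(selected):
    # Stand-in for an MNB evaluation; not a real classifier.
    return 1.0 if "dismal" in selected else 0.5


print(kept)
print(fitness([0, 0, 1], kept, toy_accuracy))
```

In the toy corpus, uninformative terms such as "the" and "film" fall below the median and are discarded before the wrapper stage ever sees them.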

4.4 Results

We compare our model with IG and the popular evolutionary algorithm NSGA-II [11]. The Pareto fronts obtained by IGF+NSGA-II and IGF+BMOGWO on all datasets are shown in Fig. 4. The dataset, the original number of features, and the accuracy of MNB on those features without feature selection are specified above each graph. The number of features is represented on the x-axis and the accuracy on the y-axis. Fig. 4 shows that BMOGWO performs significantly better than NSGA-II: it reduces the feature set to a substantially smaller size and achieves greater accuracy than NSGA-II on all datasets.

Fig. 4

Comparison of Pareto fronts obtained by NSGA-II and BMOGWO on all datasets. The title of each panel shows the dataset, the original number of features, and the accuracy without feature selection.
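To make the notion of a Pareto front over the two objectives concrete, the following sketch filters a set of candidate solutions down to the non-dominated ones, minimizing the number of features and maximizing accuracy. The candidate values are invented for illustration and are not experimental results; the function names are likewise hypothetical.

```python
def dominates(a, b):
    """True if solution a = (n_features, accuracy) dominates solution b.

    a dominates b when it is no worse on both objectives (fewer or equal
    features, higher or equal accuracy) and strictly better on at least one.
    """
    no_worse = a[0] <= b[0] and a[1] >= b[1]
    strictly_better = a[0] < b[0] or a[1] > b[1]
    return no_worse and strictly_better


def pareto_front(solutions):
    """Return the non-dominated solutions, sorted by number of features."""
    front = [s for s in solutions if not any(dominates(o, s) for o in solutions)]
    return sorted(set(front))


# Hypothetical (n_features, accuracy) pairs, for illustration only.
candidates = [
    (1200, 0.81),
    (1500, 0.84),
    (1500, 0.80),
    (2600, 0.86),
    (3000, 0.85),
    (900, 0.78),
]
front = pareto_front(candidates)
print(front)
```

Here (1500, 0.80) is dominated by (1200, 0.81), and (3000, 0.85) by (2600, 0.86), so neither appears on the front.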

Figure 5 shows the percentage reduction in the number of features achieved by both algorithms on all six datasets. The proposed model outperforms NSGA-II [11]: our feature selection model achieves a reduction of up to 90 percent, whereas NSGA-II achieves a reduction of only 76 percent.

Fig. 5

Feature reduction on all datasets by NSGA-II [11] and BMOGWO.

Figure 6 shows the percentage improvement in accuracy achieved by both algorithms on all six datasets and confirms the superiority of the proposed model over NSGA-II. Our feature selection model improves classification accuracy by a maximum of nine percent, whereas NSGA-II achieves a maximum improvement of around six percent.

Fig. 6

Accuracy improvement on all datasets by NSGA-II [11] and BMOGWO.

Table 4 shows the computation time for both algorithms on all datasets. On every dataset, BMOGWO needs only about 51 to 54 percent of the time required by NSGA-II.

To compare the performance of both algorithms statistically, we performed the Wilcoxon rank-sum test (Mann-Whitney U test). This non-parametric test was chosen because it does not assume data normality or equal variances, making it suitable for comparing our algorithms’ non-normally distributed accuracy values. The p-values for all datasets are shown in Table 5. All six datasets showed statistically significant differences between the two algorithms, with all p-values <0.05.
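For readers unfamiliar with the test, the sketch below implements a two-sided Wilcoxon rank-sum test with the normal approximation in pure Python. It assigns average ranks to ties but applies no tie or continuity correction, which is a simplification. The accuracy values are hypothetical and do not come from our runs. In practice, a library routine such as `scipy.stats.ranksums` would normally be used instead.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Returns (z, p). Ties receive average ranks; no tie correction is applied.
    """
    combined = sorted((v, i) for i, v in enumerate(list(x) + list(y)))
    ranks = [0.0] * len(combined)
    pos = 0
    while pos < len(combined):
        end = pos
        # Extend `end` over the run of values tied with combined[pos].
        while end + 1 < len(combined) and combined[end + 1][0] == combined[pos][0]:
            end += 1
        avg_rank = (pos + end) / 2 + 1  # mean of the 1-based ranks pos+1..end+1
        for k in range(pos, end + 1):
            ranks[combined[k][1]] = avg_rank
        pos = end + 1

    n1, n2 = len(x), len(y)
    w = sum(ranks[:n1])                      # rank sum of the first sample
    mean = n1 * (n1 + n2 + 1) / 2            # expected rank sum under H0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))     # two-sided p-value
    return z, p


# Hypothetical per-run accuracies for two algorithms (illustration only).
acc_a = [0.86, 0.87, 0.85, 0.88, 0.86, 0.87, 0.89, 0.85]
acc_b = [0.82, 0.83, 0.81, 0.84, 0.82, 0.83, 0.80, 0.81]
z, p = rank_sum_test(acc_a, acc_b)
print(f"z = {z:.2f}, significant at 0.05: {p < 0.05}")
```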

Table 4

Estimated execution times for both algorithms on all datasets

Dataset	BMOGWO (seconds)	NSGA-II (seconds)
SST	4415.72	8681.58
Amazon Watches	3752.66	6913.88
Amazon Mobile Electronics	4781.99	8809.70
Amazon Gift Card	2994.62	5519.32
Amazon Digital Video Games	5643.28	10394.09
Sentiment140	5249.35	9668.96

4.5 Discussion

Our model autonomously extracted a concise yet highly effective set of features from the textual data, significantly contributing to the accuracy of sentiment classification. The key features identified by the model include:

Positive Sentiment Indicators: Words such as “amazing”, “enjoyable”, “fascinating”, “charming”, “engaging”, and “inspiring”. These words often appeared in contexts that expressed approval, admiration, or pleasure, indicating positive sentiments in the texts.

Table 5
Wilcoxon Rank Sum Test p-values. Our proposed method based on BMOGWO is significantly better than NSGA-II [11]. All six datasets showed statistically significant differences between the two algorithms, with all p-values <0.05

SST	Amazon Watches	Amazon Mobile Electronics	Amazon Gift Card	Amazon Digital Video Games	Sentiment140
0.000	0.015	0.009	0.004	0.004	0.016

Negative Sentiment Indicators: On the opposite end, the model identified words like “disappointing”, “abhorrent”, “problematic”, “lackluster”, and “dismal”. These words typically surfaced in negative reviews or critiques, signaling adverse sentiments.

Intensity Modifiers: Words such as “extremely”, “absolutely”, “utterly”, and “incredibly” were extracted as key features for their role in amplifying the sentiment expressed in a statement.

Subjectivity Indicators: The model also picked up on terms like “believable”, “convincing”, “unconvincing”, and “predictable”, which tend to introduce subjectivity into the text.

These extracted features demonstrate the model’s effectiveness in discerning various emotional tones and intensities in text. By focusing on a select yet impactful set of features, the model achieved a balance between accuracy in sentiment classification and computational efficiency. This targeted approach in feature extraction underlines the importance of selecting quality features over quantity, ensuring that the model remains both accurate and agile.

For feature selection in classification tasks, our model consistently outperforms NSGA-II across all datasets.

The results demonstrate the superior performance of our algorithm. Notably, our model achieved a substantial reduction in the number of features. For instance, in the SST dataset, the feature count was reduced from 15,163 to 2,638. Similarly, for the Amazon product review datasets, the feature counts were reduced from the range of roughly 10,000 to 19,000 (Table 1) down to approximately 2,500 to 5,000. On the Sentiment140 dataset, the feature count was reduced from 17,852 to 1,870.
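The percentage reductions implied by the SST and Sentiment140 feature counts quoted above can be checked with a short calculation; the helper name `percent_reduction` is ours, introduced only for this illustration.

```python
def percent_reduction(original, selected):
    """Percentage of features removed relative to the original feature count."""
    return 100.0 * (original - selected) / original


# Feature counts as reported in the text for SST and Sentiment140.
sst_reduction = percent_reduction(15163, 2638)
s140_reduction = percent_reduction(17852, 1870)
print(f"SST: {sst_reduction:.1f}%")
print(f"Sentiment140: {s140_reduction:.1f}%")
```

This gives roughly 82.6 percent for SST and 89.5 percent for Sentiment140, the latter being the source of the "up to 90 percent" figure.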

One of the standout achievements of our model is its selectivity: it retains as few as ten percent of the available features, whereas NSGA-II retains a minimum of 23 percent. Our model also holds an accuracy advantage of nearly 3 percent over NSGA-II: it achieved an increase in accuracy of up to nine percent, while the best result of NSGA-II was an increase of approximately six percent.

BMOGWO distinguishes itself through its adeptness at maintaining diversity when selecting solutions and employs various techniques for this purpose. Additionally, it effectively manages the solutions within the archive via an archive controller. The leader selection strategy plays a pivotal role in preventing BMOGWO from becoming ensnared in local optima, promoting its exploration of the solution space.

BMOGWO outperforms NSGA-II on both objectives by excelling in two critical aspects inherent to evolutionary algorithms: exploration and exploitation. An added advantage of BMOGWO is its streamlined parameter requirements, enhancing its overall efficiency.

The grid mechanism integrated into BMOGWO surpasses the non-dominated sorting method employed by NSGA-II. BMOGWO’s utilization of the three best solutions from the non-dominated set fosters greater diversity when compared to NSGA-II. Notably, in NSGA-II, the best solutions are updated only after a full iteration is completed, whereas BMOGWO updates its best solutions whenever the position of a search agent is altered. This feature significantly bolsters the exploration aspect of the algorithm, further contributing to its effectiveness.

The superior diversity achieved by BMOGWO, in comparison to NSGA-II, can be attributed to the leader selection strategy embedded within BMOGWO. This strategy involves the selection of three leaders for each search agent during every iteration, and it plays a pivotal role in elevating the algorithm’s performance over NSGA-II.

Furthermore, BMOGWO surpasses NSGA-II in terms of both convergence and diversity when seeking solutions. This high convergence is primarily attributable to the leader selection method, which consistently identifies the three best leaders for each search agent. The algorithm’s ability to maintain diversity in solutions can be traced back to the techniques employed in maintaining the archive and selecting leaders. In particular, BMOGWO’s preference for the least crowded regions when choosing leaders significantly contributes to the preservation of diversity within the solutions. An additional advantage of our model is that BMOGWO requires no parameter tuning, which simplifies its use and makes it an efficient choice for feature selection.
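The preference for the least crowded regions can be illustrated with a grid over the objective space. The sketch below is a simplified illustration, not our implementation. It makes four assumptions: both objectives are expressed as quantities to minimize (feature ratio and error rate); the grid bounds are widened by a factor `alpha`; a cell holding n archive members is chosen with probability proportional to exp(-beta · n), one weighting used in some reference implementations; and the three leaders are drawn with replacement. The archive values are invented.

```python
import math
import random
from collections import Counter


def grid_index(point, lows, highs, n_grid):
    """Map an objective vector to its hypercube index in the grid."""
    idx = []
    for v, lo, hi in zip(point, lows, highs):
        width = (hi - lo) / n_grid or 1.0  # guard against a zero-width axis
        idx.append(min(int((v - lo) / width), n_grid - 1))
    return tuple(idx)


def leader_probabilities(archive, n_grid=10, alpha=0.1, beta=4.0):
    """Return each member's cell and the selection probability of each cell."""
    dims = range(len(archive[0]))
    mins = [min(p[d] for p in archive) for d in dims]
    maxs = [max(p[d] for p in archive) for d in dims]
    lows = [lo - alpha * (hi - lo) for lo, hi in zip(mins, maxs)]
    highs = [hi + alpha * (hi - lo) for lo, hi in zip(mins, maxs)]
    cells = [grid_index(p, lows, highs, n_grid) for p in archive]
    occupancy = Counter(cells)
    # Assumed weighting: crowded cells get exponentially smaller weight.
    weights = {cell: math.exp(-beta * n) for cell, n in occupancy.items()}
    total = sum(weights.values())
    return cells, {cell: w / total for cell, w in weights.items()}


def select_leaders(archive, k=3, rng=random):
    """Draw k leaders (with replacement) by roulette wheel over the cells."""
    cells, probs = leader_probabilities(archive)
    leaders = []
    for _ in range(k):
        cell = rng.choices(list(probs), weights=list(probs.values()))[0]
        members = [p for p, c in zip(archive, cells) if c == cell]
        leaders.append(rng.choice(members))
    return leaders


# Hypothetical archive of (feature ratio, error rate) pairs.
archive = [
    (0.10, 0.200), (0.11, 0.199), (0.105, 0.1995),  # three crowded solutions
    (0.40, 0.12),
    (0.80, 0.10),
]
cells, probs = leader_probabilities(archive)
leaders = select_leaders(archive, rng=random.Random(0))
print(probs)
```

With these values, the three clustered solutions share one cell and that cell receives a far smaller selection probability than either of the two sparsely occupied cells, so leaders are drawn mostly from under-explored parts of the front.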

These results show the effectiveness of our approach in achieving feature reduction and enhancing classification accuracy, firmly establishing it as a superior alternative to NSGA-II in feature selection.

4.6 Implications and limitations

The primary implication of the study is the potential for improved sentiment analysis performance. By selecting a subset of relevant features and optimizing classification accuracy, the proposed hybrid algorithm can enhance the overall effectiveness of sentiment analysis models. This improvement can be beneficial for various applications, including recommendation systems, reviews analysis, and public opinion observation. The study highlights the efficiency of the feature selection process by significantly reducing the number of features. This reduction leads to reduced computational requirements. The study suggests new research directions for multi-objective feature selection and hybrid algorithms. The success of this approach may inspire researchers to explore similar techniques in various optimization problems, contributing to the advancement of evolutionary algorithms.

Despite excellent results on improved accuracy and feature set reduction, the proposed method has its limitations. The study primarily uses Multinomial Naive Bayes (MNB) for classification. Evaluating the proposed model with a variety of classifiers would provide a more comprehensive understanding of its generalizability and effectiveness. The study does not explicitly address the scalability of the proposed model. It is essential to assess its performance and efficiency as the size of the dataset and the number of features increase significantly. Imbalanced datasets can pose challenges for sentiment analysis. The study does not explicitly address methods for handling data imbalances, which can impact the model’s effectiveness.

5 Conclusion and future work

In this study, we introduced a novel hybrid multi-objective feature selection algorithm, addressing feature selection as a multi-objective challenge. We harnessed the combined power of IG and BMOGWO and achieved superior results. Our experimentation involved six diverse datasets, including the Stanford Sentiment Treebank, four Amazon product review datasets, and the Sentiment140 dataset. Experimental results showed that the proposed model was able to decrease the feature set size and improve the accuracy. The comparison of our model with NSGA-II showed its superiority in terms of feature reduction and accuracy improvement. Our model achieved up to 90 percent reduction in feature set size and up to nine percent improvement in accuracy. Notably, the Pareto fronts generated by our model exhibited more robust convergence and diversity compared to NSGA-II.

In future work, we aim to combine our proposed model with other metaheuristic algorithms like Particle Swarm Optimization (PSO). Furthermore, we intend to subject our model to evaluation with various machine learning classifiers and an expanded array of datasets.

References

1. Migdal Miki, How big data empowers organizations to work smarter, not harder, https://www.forbes.com/sites/forbestechcouncil//08/23/how-big-data-empowers-organizations-to-work-smarter-not-?sh=41ba2045532f.

2. Kumar, Al-Turjman, Gupta, Seth and Shubham, Reviewer credibility and sentiment analysis based user profile modelling for online product recommendation, IEEE Access 8 (2020), 26172–26189.

3. What are customers commenting on, and how is their satisfaction affected? Examining online reviews in the on-demand food service context, Decis. Support Syst. 142 (2021), 113467.

4. Wang, Can, Kazemzadeh, Bar and Narayanan, A system for real-time Twitter sentiment analysis of US presidential election cycle, Proc. ACL Syst. Demonstrations (2012), 115–120.

5. Hussein D.M.E.-D.M., A survey on sentiment analysis challenges, J. King Saud Univ.-Eng. Sci. 30(4) (2018), 330–338.

6. Guyon and Elisseeff, An introduction to variable and feature selection, Journal of Machine Learning Research (2003), 1157–1182.

7. Ding and Fu, A hybrid feature selection algorithm based on information gain and sequential forward floating search, Journal of Intelligent Computing 9(3) (2018), 93.

8. Sarkar S.D., Goswami, Agarwal and Aktar, A novel feature selection technique for text classification using naïve Bayes, International Scholarly Research Notices (2014), 1–10.

9. Chandrashekar and Sahin, A survey on feature selection methods, Comput. Electr. Eng. 40(1) (2014), 16–28.

10. Xue, Zhang, Browne W.N. and Yao, A survey on evolutionary computation approaches to feature selection, IEEE Trans. Evol. Comput. 20(4) (2016), 606–626.

11. Deniz, Angin and Angin, Evolutionary multiobjective feature selection for sentiment analysis, IEEE Access 9 (2021), 142982–142996.

12. Al-Tashi, Abdulkadir S.J., Rais H.M., Mirjalili, Alhussian, Mohammed G. Ragab and Alqushaibi, Binary multi-objective grey wolf optimizer for feature selection in classification, IEEE Access 8 (2020), 106247–106263.

13. Shayaa, Jaafar N.I., Bahri, Sulaiman, Wai P.S., Chung Y.W., Piprani A.Z. and Al-Garadi M.A., Sentiment analysis of big data: Methods, applications, and open challenges, IEEE Access 6 (2018), 37807–37827.

14. Chikersal Prerna, Poria Soujanya and Cambria Erik, Sentu: Sentiment analysis of tweets by combining a rule-based classifier with supervised learning, in: Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015) (2015), 647–651.

15. Hutto Clayton and Gilbert Eric, Vader: A parsimonious rule-based model for sentiment analysis of social media text, in: Proceedings of the International AAAI Conference on Web and Social Media 8 (2014), 216–225.

16. Bonta Venkateswarlu, Kumaresh Nandhini and Janardhan, A comprehensive study on lexicon based approaches for sentiment analysis, Asian Journal of Computer Science and Technology 8(S2) (2019), 1–6.

17. Zhang Lei, Wang Shuai and Liu Bing, Deep learning for sentiment analysis: A survey, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8(4) (2018), e1253.

18. Hasan Ali, Moin Sana, Karim Ahmad and Shamshirband Shahaboddin, Machine learning-based sentiment analysis for Twitter accounts, Mathematical and Computational Applications 23(1) (2018), 11.

19. Mirjalili Seyedali, Mirjalili Seyed Mohammad and Lewis Andrew, Grey wolf optimizer, Advances in Engineering Software 69 (2014), 46–61.

20. Emary Eid, Zawbaa Hossam M and Hassanien Aboul Ella, Binary grey wolf optimization approaches for feature selection, Neurocomputing 172 (2016), 371–381.

21. Abiodun E.O., Alabdulatif, Abiodun O.I., Alawida, Alabdulatif and Alkhawaldeh R.S., A systematic review of emerging feature selection optimization methods for optimal text classification: the present state and prospective opportunities, Neural Comput. & Applic. 33 (2021), 15091–15118.

22. Deng, Weng and Zhang, Feature selection for text classification: A review, Multimed. Tools Appl. 78 (2019), 3797–3816.

23. Gokalp, Tasci and Ugur, A novel wrapper feature selection algorithm based on iterated greedy metaheuristic for sentiment classification, Expert Systems with Applications 146 (2020), 113176.

24. Banka and Dara, A hamming distance based binary particle swarm optimization (hdbpso) algorithm for high dimensional feature selection, classification and validation, Pattern Recogn. Lett. 52 (2015), 94–100.

25. Shang, Zhou and Liu, Particle swarm optimization based feature selection in sentiment classification, Soft Comput. 20(10) (2016), 3821–3834.

26. Shunmugapriya and Kanmani, A hybrid algorithm using ant and bee colony optimization for feature selection and classification (ac-abc hybrid), Swarm Evol. Comput. 36 (2017), 27–36.

27. Liu, Wang, Ren, Zhou and Diao, A classification method based on feature selection for imbalanced data, IEEE Access 7 (2019), 81794–81807.

28. Tawhid M.A. and Dsouza K.B., Hybrid binary bat enhanced particle swarm optimization algorithm for solving feature selection problems, Appl. Comput. Inf. 16 (2018), 117–136.

29. Mafarja M.M. and Mirjalili, Hybrid whale optimization algorithm with simulated annealing for feature selection, Neurocomputing 260 (2017), 302–312.

30. Uysal A.K. and Gunal, The impact of preprocessing on text classification, Inf. Process. Manag. 50(1) (2014), 104–112.

31. Socher, Perelygin, Wu, Chuang, Manning C.D., Ng A.Y. and Potts, Recursive deep models for semantic compositionality over a sentiment treebank, Proc. Conf. Empirical Methods Natural Lang. Process. (2013), 1631–1642.

32. TensorFlow Datasets, a collection of ready-to-use datasets. https://www.tensorflow.org/datasets.

33. Alec, Bhayani Richa and Huang Lei, Twitter sentiment classification using distant supervision, 2009.