Ant colony optimization for text feature selection in sentiment analysis

Abstract

In sentiment analysis, the high dimensionality of the feature vector is a key problem because it can decrease the accuracy of sentiment classification and make it difficult to obtain the optimum subset of features. To solve this problem, this study proposes a new text feature selection method that uses a wrapper approach, integrated with ant colony optimization (ACO) to guide the feature selection process. It also uses the k-nearest neighbour (KNN) algorithm as a classifier to evaluate and generate a candidate subset of optimum features. To test the subset of optimum features, a dependency relations algorithm was used to find the relationship between the feature and the sentiment word in customer reviews. The feature subset derived using the proposed ACO-KNN algorithm was used as an input to identify and extract sentiment words from sentences in customer reviews. The resulting relationship between features and sentiment words was tested and evaluated to determine the accuracy based on precision, recall, and F-score. The performance of the proposed ACO-KNN algorithm on customer review datasets was evaluated and compared with that of two hybrid algorithms from the literature, namely, the genetic algorithm with information gain and information gain with rough set attribute reduction. The results of the experiments showed that the proposed ACO-KNN algorithm was able to obtain the optimum subset of features and could improve the accuracy of sentiment classification.

Keywords

Sentiment analysis, metaheuristic algorithm, ant colony optimization, k-nearest neighbour, text feature selection

1. Introduction

Information is being continuously added to websites and the amount of data is increasing every second. All sorts of information can be found on various social media platforms, such as Facebook and Twitter, in postings, blogs, discussion forums, and in emails. Such information could help individuals or organizations make decisions about the selection, production, or improvement of quality products, or help in business transactions and product purchases. However, with more data being added to the web every day, a technology is needed to analyse all this information with accuracy and precision. A technology that could help consumers or organizations acquire accurate and useful information in this respect is known as sentiment analysis.

Sentiment analysis combines the fields of text mining, natural language processing, and computer intelligence. In the past few years, sentiment analysis has become the focus of attention of many researchers in these fields [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]. There are three basic problems in sentiment analysis: feature selection, identifying sentiment words, and sentiment classification [10, 11, 12, 13]. Features or aspects are the topics that users comment on or give opinions about. Features are important; failing to identify the features present in the sentences or comments made by users will make it hard to identify the sentiments. Thus, the main problem or challenge in sentiment analysis is to improve the quality of feature selection by finding a way to identify and remove features that are irrelevant or overlapping and to effectively deal with a large feature space [6, 7, 8].

Machine learning is widely used in sentiment classification. According to [8], machine learning can overcome problems of high dimensionality of the feature space by applying a feature selection technique to eliminate irrelevant and noisy features. Indeed, the main problems or challenges in sentiment analysis are the large size of the feature space, as well as irrelevant and overlapping features [6, 7, 8]. Researchers are trying to find a suitable technique to identify and choose the best subset of features from the original feature space to reduce dimensionality and increase the accuracy of the classification process. Feature selection involves finding effective features from a large number of total features in a high-dimension space. This problem remains a challenge for researchers in the field of machine learning for sentiment analysis.

According to [14], it is important to select an optimum subset of features that could represent the actual subset of features to reduce feature size and increase classification accuracy. In [15], a metaheuristic algorithm was used as the feature selection technique to determine the optimum subset of features. This is because the feature selection problem is an NP-hard combinatorial problem that requires an effective and appropriate algorithm for the feature selection process [15, 16, 17, 18]. Previous studies [19, 20, 21, 22] have shown that using the metaheuristic approach for feature selection can solve problems of feature selection and feature reduction in large datasets that contain noisy, overlapping, and irrelevant data. Various techniques have been suggested for feature selection, including metaheuristic approaches, such as the genetic algorithm (GA), ant colony optimization (ACO), and particle swarm optimization (PSO). These techniques have produced good results in obtaining an optimum feature subset.

In [15], a modified discrete PSO algorithm was presented for solving the problem of feature subset selection. The study showed that the proposed discrete PSO algorithm was competitive in terms of classification accuracy and computation performance. On the other hand, [19, 21] implemented ACO and the GA as feature selection techniques for text classification. The results of these two studies showed that ACO was more advantageous compared to the GA in terms of its efficiency and competence in the convergence process, and its strong ability to search the problem space and to successfully obtain the optimum subset of features. Moreover, ACO is a very simple concept and it is computationally inexpensive in terms of memory requirement and speed. Based on these advantages, this paper proposes an integrated text feature selection algorithm based on ACO and KNN to obtain a subset of optimum features and to improve the accuracy of sentiment classification.

The remainder of this paper is organized as follows: Section 2 discusses the existing works on feature selection and the basic concept of ACO. Then, Section 3 presents an overview of the methodology adopted in this study, while Section 4 describes the proposed hybrid ACO and KNN algorithm for feature selection in sentiment analysis. Section 5 discusses the experimental results and Section 6 concludes the paper.

2. Related works

Various feature selection techniques have been suggested and implemented to identify product features that are discussed among users. In [6], the authors proposed a feature extraction method that used both syntactic and stylistic features. Syntactic features include word n-grams, part-of-speech (POS) tags, punctuation, and phrase patterns. These are usually used as a set of features for sentiment analysis. Stylistic features include lexical and structural features, but these types of features have limited usage in sentiment analysis. Their method of selecting a feature involved using a combination of information gain (IG) and GA techniques. The combination of these two algorithms is known as the entropy weighted genetic algorithm (EWGA). In the initial stage of the EWGA, the unigram features would be extracted from the document. The next stage involved feature selection using the IG technique to determine the importance of the features contained in the document.

However, the IG technique has two main weaknesses: first, the threshold value must be determined in advance, and second, it does not consider features that overlap [9]. To overcome these drawbacks, [9] combined IG with the rough set attribute reduction (RSAR) technique. This technique can reduce the number of noisy and irrelevant features. In addition, it has the advantage of being able to consider overlapping features. However, if RSAR is used on its own, it would take a long time to obtain the optimal feature subset. Therefore, by combining IG and RSAR, [9] was able to reduce the number of overlapping features, as well as obtain the minimum feature subset more quickly, which can reduce time complexity during sentiment classification.

A comprehensive study [23] showed that ACO-based feature selection achieved a very promising result. The authors used ACO to search the feature space and select an ‘appropriate’ subset of features. The proposed algorithm used the mutual information evaluation function (MIEF) to measure the local importance of a given feature. In their work, ACO was applied in the classification of speech segments. The Texas Instruments and Massachusetts Institute of Technology (TIMIT) database was used and an artificial neural network (ANN) was applied to classify the features. For this classification, the performance of the proposed ACO in selecting the features was compared with that of the GA and of sequential forward selection (SFS). The authors found that the proposed ACO algorithm achieved a classification accuracy of 84.22%, which was better than the GA and SFS, which only achieved 83.49% and 83.19% accuracy, respectively. Thus, ACO gave superior outcomes compared to the GA and SFS.

In [24], the performance of a proposed ACO algorithm for feature selection in a face recognition system was investigated. The authors used the classifier performance and the length of the selected feature vector as heuristic information. The proposed algorithm was tested using the Olivetti Research Laboratory (ORL) greyscale face image database that contains 400 facial images of 40 individuals. They used two different sets of features, namely, the pseudo Zernike moment invariant (PZMI) and the discrete wavelet transform (DWT) coefficients as a feature vector. In the first step, the PZMI and DWT coefficients were extracted from each facial image and in the second step, the proposed ACO algorithm was used to select the optimal feature subsets. A nearest-neighbour classifier was used to classify the selected features and obtain the mean squared error (MSE), which was used as a performance measure. Overall, the study indicated that the proposed ACO algorithm achieved better performance compared to other tested algorithms.

Meanwhile, [19, 21] showed that ACO can be used to solve problems of feature selection for text categorization. In these works, classifier performance and feature subset length were used to evaluate every feature. The ACO algorithm was applied to the so-called ‘bag of words’, in which a document is treated as a set of words or phrases. Every feature in the text was represented as a vector of term weights. A classification algorithm was used to evaluate and sort the selected subsets of features based on the performance of the classifier and the subset length. In these experiments, the authors assumed that classifier performance was more important than feature subset length. In their proposed algorithms, the value of the pheromone was updated based on the measured classifier performance and the feature subset length. The performance of the algorithm as a text feature selection technique was compared with that of a GA and two statistical methods (IG and chi-square analysis). For these experiments, they used the Reuters-21578 dataset and the results of the experiment showed that ACO had outstanding capabilities and was more efficient than the GA, IG, and chi-square analysis.

In [25], the authors conducted an experiment to explore ACO as a suitable feature selection technique for a bioinformatics dataset. They used classifier performance and the length of the selected feature subset as the parameters to evaluate the performance of their proposed ACO algorithm. In their experiment, the proposed ACO algorithm was compared with the standard binary PSO algorithm. The ACO algorithm, when using only a small subset of selected features, achieved better classification accuracy than the PSO algorithm. The study showed that ACO has higher intelligence, is easy to implement, has a faster computation time, can converge quickly, has a strong search capability in the problem space, and is efficient in finding the minimum subset of features [25]. Another investigation into ACO algorithms [26] found that ACO can select the optimum feature subset when applied to a web page classification problem. The authors used a naïve Bayesian algorithm to measure classification performance. In [27], a two-stage feature selection process was described. During stage one, the position of the terms in documents was determined using the IG technique. Stage two consisted of feature extraction using a principal component analysis (PCA) technique, where a GA functioned as the feature selection technique. The aim was to increase the effectiveness of the text categorization process by applying these two stages to identify only the important terms. The study showed that dimension reduction can be achieved using GA and PCA, with the help of IG in determining the importance of features, thus increasing the success of the text categorization process.

In [28], a binary PSO was combined with KNN to select a feature set in an Arabic-language document. The results showed that the combination of these two techniques could increase the accuracy of text classification. In [29], a combination of ACO with trace-oriented feature analysis (TOFA) was proposed. The advantage of ACO, which uses ants to search for the optimum subset of features in the search space, was combined with the ability of TOFA to analyse large-scale text data. The authors claim that previous studies have shown that TOFA is effective in reducing dimension size. The experimental results showed that the suggested approach could reduce the size of the feature space to a smaller dimension and increase the accuracy of text classification. In [30], a novel heuristic algorithm for feature selection, known as the chaos genetic feature selection optimization (CGFSO), was suggested. The authors proposed a hybrid of the GA and chaos optimization (CO) for feature selection in text categorization. Their results confirmed that the combination of the GA and CO was effective and superior to the other compared algorithms for the text categorization problem. In [31], a combination of ACO for feature selection and a neural network (NN) as a classifier was proposed. The authors utilized the concept of a hybrid search technique to select outstanding features using a subset size determination schema. The study looked at the performance of a variety of hybrid techniques on eight benchmark classification datasets and one gene expression dataset. Their results showed that their proposed algorithm achieved an average accuracy of 98.91%.

In [19], ACO showed higher classification performance compared to GA, IG, and chi-square when used as a feature selection technique in text classification. Based on a literature review conducted by [32] and the experiment in [19], ACO has been proven to have several advantages. First, it can rapidly find the optimum solution, even with numerous iterations. Second, it is efficient in creating the minimum feature subset, and third, it has a powerful exploration capability during a gradual search process towards the optimum solution.

Therefore, the current study has chosen ACO as the feature selection technique for sentiment classification of customer product reviews, which is different from normal text classification. Sentiment classification functions to classify the feature subset acquired from feature selection into groups of positive or negative sentiments. This classification is related to the sentiments expressed by customers when they evaluate each feature of a product they have purchased.

Based on the successes of the methods in [19], a metaheuristic algorithm is proposed in this study for text feature selection based on the characteristics of the words in customer reviews to improve the accuracy of sentiment classification. In this study, the classifier performance and the length of the selected feature subset were taken as heuristic information for ACO. Therefore, the proposed algorithm did not require prior knowledge of features. The proposed ACO-KNN algorithm was applied to the text features of the bag-of-words model in which a document is considered as a set of words, or phrases (called terms). Additionally, each position in the input feature vector corresponds to a given term in the original document.

3. Methodology

Generally, the methodology for sentiment analysis is designed to obtain the optimum feature subset and to classify sentiments accurately. In this study, it comprises five phases, namely, (1) text preprocessing, (2) feature selection, (3) detection of the relationship between the features and sentiment words, (4) sentiment classification, and (5) testing, evaluation, and analysis, as depicted in Fig. 1. The working principles of these phases are explained in the following subsections.

Figure 1.

Methodology of the proposed algorithm.

3.1 Summary of methodology

This section briefly explains the phases of the methodology as applied in this study. First, each sentence present in the dataset went through a text preprocessing phase. Each extracted sentence was saved in a document. Next, a clean-up process was performed on each document that contained a customer review sentence. Then, a feature extraction process was applied to each customer review sentence using part-of-speech tagging (POST). In this work, the following features were given POS tags: the noun (NN), noun-adjective (NN-ADJ), noun-verb (NN-VBN), and verb (VB). The resulting list of extracted features was saved.

Next, a transformation process was applied to convert the feature list into a feature vector because the feature list was in text form, which cannot be interpreted by a classifier. The resulting feature vector was used in the feature selection phase to select the optimum feature subset. The proposed ACO-KNN algorithm was used to select and assess the feature list derived from the POST process.

The optimum feature subset obtained by the feature selection process was used to detect the relationship between the features and sentiment words. This process involved a dependency relations (DR) algorithm from [32]. This algorithm is capable of identifying the relationships between the features and sentiment words present in every sentence in the dataset. Words with adjective and verb tags were examined, and their connections to the features were checked using the DR rule proposed in [32]. The words that were identified as connected were extracted and then compared with the sentiment lexicon to identify the sentiment type, that is, whether the sentiment is positive or negative. The output from the process of identifying feature and sentiment word relations was the relationship (feature, sentiment word, sentiment type).

Next, in the sentiment classification phase, which was manually conducted in this study, each derived relationship (feature, sentiment word, sentiment type) was checked against the information that was present in the dataset. Finally, the testing, evaluation, and analysis phase consisted of several experiments. Detailed explanations of each of the five phases are given below.

3.2 Phase 1: Text preprocessing

The customer review datasets went through a cleaning process to edit the review sentences written by users. This was necessary because the comments were written by regular users who are not language experts. Therefore, the datasets were likely to contain spelling errors; grammatical errors, such as punctuation mistakes and the wrong use of capital letters; slang words that do not exist in dictionaries; and abbreviations or acronyms for common terms. The cleaning process can be performed using standard word processing software, such as Microsoft Word, to correct spelling errors, to amend the first letter of each sentence to a capital letter where necessary, and to ensure that each sentence ends with the correct punctuation, such as a full stop (.), question mark (?), or exclamation mark (!). An example of a sentence written by a user, before and after cleaning, is given below:

Sentence before preprocessing: “the vibrate settting is loud ! ! !”

Sentence after preprocessing: “The vibrate setting is loud!”

In the above sentence, the first letter was a lowercase ‘t’, which was changed to a capital ‘T’ to denote the start of the sentence; the error in the spelling of the word ‘settting’, with an extra ‘t’, was corrected to ‘setting’; and the repeated exclamation marks (‘! ! !’) were amended to appear only once (‘!’). Once the cleaning process was completed, the POST process was performed for each sentence to identify the noun, adjective, adverb, verb, and determiner. This study used the noun, verb, and adjective to identify the features present in sentences, whereas the adjective, noun, verb, and adverb were used to identify the sentiment words. The list of features was extracted from the sentences and placed in a table. Next, this feature list was transformed into numerical values through a feature weighting calculation. For this study, term frequency-inverse document frequency (tf-idf) [33] was used to calculate the weight for each feature.
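The three corrections illustrated in this phase (capitalisation, spelling, and repeated punctuation) can also be expressed as simple rules. The following is only a minimal sketch of such a rule-based clean-up; it is not the procedure used in this study, which relied on word processing software. The function name `clean_sentence` and the small `CORRECTIONS` table are hypothetical and stand in for a real spell checker.

```python
import re

# Hypothetical correction table; a real pipeline would use a spell checker.
CORRECTIONS = {"settting": "setting"}


def clean_sentence(sentence, corrections=CORRECTIONS):
    """Apply the three clean-up rules illustrated in the example sentence."""
    # Collapse runs of the same terminal punctuation mark ("! ! !" -> "!").
    text = re.sub(r"\s*([.!?])(?:\s*\1)+", r"\1", sentence.strip())
    # Remove any space left before a terminal punctuation mark.
    text = re.sub(r"\s+([.!?])", r"\1", text)
    # Replace known misspellings (words with attached punctuation are skipped).
    words = [corrections.get(w.lower(), w) for w in text.split()]
    text = " ".join(words)
    # Capitalise the first letter of the sentence.
    return text[:1].upper() + text[1:]


cleaned = clean_sentence("the vibrate settting is loud ! ! !")
```

Slang words and abbreviations would need additional lookup tables, which are omitted here.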

3.2.1 Feature extraction

In this study, feature extraction consisted of two steps, namely, part-of-speech tagging and a transformation process.

Part-of-speech tagging

This study was focused on the features of products that the reviewers commented on. Hence, the term feature refers to a product’s features, such as camera, size, screen, button, picture, design, and weight. Next, the feature set was extracted from the cleaned review sentences. However, prior to feature extraction, every sentence was parsed using Stanford Parser tools to build the grammatical structure. This process is known as part-of-speech tagging, or POST. Through this process, each word in the sentence was marked according to certain categories and divisions, such as noun, adjective, adverb, verb, determiner, conjunction, pronoun, preposition, and number. An example of a sentence before and after the POST process is shown below:

Before POST: “The screen is very easy to read.”

After POST: “The/DT screen/NN is/VBZ very/RB easy/JJ to/TO read/VB ./.”

Based on a previous study by [2], the normally occurring nouns (NN) and noun phrases (NP) were preserved as features, while the adjective (JJ) or adverb (RB) modifiers describing them were preserved as sentiment words.
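As an illustration of how tagged output of this kind can be split into candidate features and sentiment words, the sketch below parses the `word/TAG` string from the example above. It assumes the tagging has already been done; the function names are hypothetical, and noun phrases (NP), which require a full parse, are not handled.

```python
def parse_tagged(tagged):
    """Split a 'word/TAG word/TAG ...' string into (word, tag) pairs."""
    return [tuple(token.rsplit("/", 1)) for token in tagged.split()]


def split_features_and_sentiment(pairs):
    """Nouns become candidate features; adjectives and adverbs become
    candidate sentiment words."""
    features = [word for word, tag in pairs if tag.startswith("NN")]
    sentiment = [word for word, tag in pairs
                 if tag.startswith("JJ") or tag.startswith("RB")]
    return features, sentiment


tagged = "The/DT screen/NN is/VBZ very/RB easy/JJ to/TO read/VB ./."
pairs = parse_tagged(tagged)
features, sentiment = split_features_and_sentiment(pairs)
```

For the example sentence, the only noun is ‘screen’, and the modifiers ‘very’ and ‘easy’ are retained as sentiment word candidates.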

Transformation process

A text document cannot be directly read and interpreted by a classifier. Therefore, an indexing process that maps the text document into a representation known as a feature vector was performed. The indexing procedure that maps the text document into a compact representation of its contents must be applied uniformly to all the documents.

Equation (1) shows how a text $d_{j}$ is represented by a vector of term weights:

$\displaystyle d_{j}=\{w_{1j},w_{2j},\ldots,w_{|T|j}\}$ (1)

where $T$ is the set of terms (features) that occur at least once in the training dataset, and $0\leqslant w_{kj}\leqslant 1$ represents how much the term $t_{k}$ contributes to the semantics of the document $d_{j}$. Each position of the input feature vector corresponds to a word or phrase. This representation is usually called the bag-of-words model. The weights are computed with the normalised tf-idf function [33], which is defined by Eqs (2) and (3):

$\displaystyle w_{kj}=\frac{tfidf\left({t_{k},d_{j}}\right)}{\sqrt{\sum_{s=1}^{\left|T\right|}\left(tfidf\left({t_{s},d_{j}}\right)\right)^{2}}}$ (2)

where

$\displaystyle tfidf\left({t_{k},d_{j}}\right)=\#\left({t_{k},d_{j}}\right)\cdot\log\frac{\left|{Tr}\right|}{\#_{Tr}\left({t_{k}}\right)}$ (3)

and $\#(t_{k},d_{j})$ is the number of times $t_{k}$ occurs in $d_{j}$. Meanwhile, $Tr$ is the training dataset and $|Tr|$ is the number of documents in it. $\#_{Tr}(t_{k})$ is the number of documents in $Tr$ in which $t_{k}$ occurs. The number of times a term appears in a document (term frequency) is also used on its own as a weighting scheme for textual data.

The tf-idf function is frequently used to represent the terms that are present in documents. It also considers each term’s relationship to the whole document collection [34]. Each dataset contains comment sentences related to particular products, such as cameras, mobile phones, and similar items. Each sentence in the dataset is broken down to form a document, with each document containing a sentence. Each document is represented as a vector and is considered to be a ‘bag’ of words or terms. Figure 2 shows the process of transforming the dataset into a tf-idf representation. Table 1 shows an example set of features for a set of documents.

Table 1

Example set of features for a set of documents

Documents	Feature set
D1	{picture, menu, auto mode, weight}
D2	{picture, LCD, menu, focus, size, zoom}
D3	{LCD, storage, sound, size, weight, zoom}
D4	{menu, focus, storage, sound, weight}
D5	{auto mode, storage, sound, zoom}

Figure 2.

Steps to create tf-idf representation.
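To make Eqs (2) and (3) concrete, the sketch below computes normalised tf-idf weights for the five feature sets listed in Table 1. It is an illustration under stated assumptions rather than the implementation used in this study: $|Tr|$ is taken as the number of training documents (the usual idf definition), the natural logarithm is used (the base cancels out after the normalisation in Eq. (2)), and the function name `tfidf_weights` is hypothetical.

```python
import math


def tfidf_weights(documents):
    """Return one {term: weight} dict per document, following Eqs (2)-(3)."""
    n_docs = len(documents)          # |Tr|, assumed to be the document count
    doc_freq = {}                    # #_Tr(t_k): documents containing t_k
    for doc in documents:
        for term in set(doc):
            doc_freq[term] = doc_freq.get(term, 0) + 1

    weighted = []
    for doc in documents:
        # Eq. (3): term count in the document times log(|Tr| / #_Tr(t_k)).
        raw = {t: doc.count(t) * math.log(n_docs / doc_freq[t])
               for t in set(doc)}
        # Eq. (2): cosine normalisation; absent terms contribute zero.
        norm = math.sqrt(sum(v * v for v in raw.values()))
        weighted.append({t: (v / norm if norm else 0.0)
                         for t, v in raw.items()})
    return weighted


# Feature sets D1 to D5 from Table 1.
documents = [
    ["picture", "menu", "auto mode", "weight"],
    ["picture", "LCD", "menu", "focus", "size", "zoom"],
    ["LCD", "storage", "sound", "size", "weight", "zoom"],
    ["menu", "focus", "storage", "sound", "weight"],
    ["auto mode", "storage", "sound", "zoom"],
]
weights = tfidf_weights(documents)
```

In D1, ‘picture’ occurs in two of the five documents and ‘menu’ in three, so ‘picture’ receives the larger weight; every document vector has unit length after normalisation.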

In Table 2, the matrix has a dimension of $N\times M$, where $N$ is the number of features in the documents and $M$ is the number of documents to be classified. Each matrix element, $m(i,j)$, represents the weight of feature $i$ in document $j$. Table 3 provides an example of a weight matrix for a feature vector for a Nikon product. This matrix gives the relationship values between the features and the related documents. Each document is represented as a vector in an $N$-dimensional vector space.

Table 2

Illustration of feature vector in matrix form

	Documents
		D1	D2	D3	D4	D5	…
Features	Feature ${}_{1}$	m(1,1)	m(1,2)	m(1,3)	…	…	…
	Feature ${}_{2}$	m(2,1)	m(2,2)	…	…	…	…
	Feature ${}_{3}$	m(3,1)	…	…	…	…	…
	Feature ${}_{4}$	…	…	…	…	…	…
	Feature ${}_{5}$	…	…	…	…	…	…
	…	…	…	…	…	…	…

Table 3

Illustration of tf-idf representation in tabular form

	Documents
		D1	D2	D3	D4	D5	D6	$\ldots$
Features	Picture	0.61	0	0.17	0	0.23	0	$\ldots$
	LCD	0.46	0.23	0.12	0.33	0	0	$\ldots$
	Menu	0	0.44	0	0	0	0.22	$\ldots$
	Focus	0	0	0.44	0.47	0	0.56	$\ldots$
	Storage	0	0	0	0	0	0.53	$\ldots$
	Size	$\ldots$	$\ldots$	$\ldots$	$\ldots$	$\ldots$	$\ldots$	$\ldots$
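The layout of Tables 2 and 3 (features as rows, documents as columns) can be produced from per-document weights as sketched below. The helper name `to_feature_document_matrix` is hypothetical, and the input uses only the D1 to D3 entries of the ‘Picture’ and ‘LCD’ rows of Table 3 for illustration.

```python
def to_feature_document_matrix(doc_weights):
    """Arrange per-document weights as an N x M matrix (features x documents)."""
    features = sorted({term for doc in doc_weights for term in doc})
    # Element (i, j) is the weight of feature i in document j, or 0 if absent.
    matrix = [[doc.get(feature, 0.0) for doc in doc_weights]
              for feature in features]
    return features, matrix


# Weights for D1, D2, D3 taken from the first two rows of Table 3.
doc_weights = [
    {"Picture": 0.61, "LCD": 0.46},
    {"LCD": 0.23},
    {"Picture": 0.17, "LCD": 0.12},
]
features, matrix = to_feature_document_matrix(doc_weights)
```

Each column of the resulting matrix is the feature vector of one document, which is the form passed to the feature selection phase.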

3.3 Phase 2: Feature selection

Feature selection is one of the most important aspects of solving a classification problem because it removes irrelevant features to decrease the computational cost [10]. Feature selection is a process of assessing and selecting features to obtain the feature subset that best represents the actual feature list, without changing or affecting the feature quality. Several evaluation criteria need to be considered to obtain the optimum feature subset [19]. According to [35], there are four steps in selecting features, namely, subset generation, subset evaluation, application of stopping criteria, and validation of results. The generation of a subset involves a search procedure to create a suitable feature subset for the assessment process and is based on a search strategy. Each subset is assessed and compared with the previous best one based on several particular evaluation criteria. If the new subset is better, then it replaces the older subset. This process is repeated until the generated subset meets the stopping criteria. The best newly acquired feature subset will then go through a validation process based on prior knowledge, or be subjected to different tests using synthetic and/or real-world datasets [35].
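The generate, evaluate, and stop cycle described in [35] can be sketched generically as follows. This is only an illustration: random subset generation is used as a placeholder for the search strategy (it is not the ACO search proposed in this study), the stopping criterion is a fixed iteration budget, the scoring function `toy_score` is invented for the example, and the final validation step is left to the caller.

```python
import random


def select_features(all_features, evaluate, max_iterations=100, seed=0):
    """Generic subset search: generate, evaluate, and keep the best subset."""
    rng = random.Random(seed)
    best_subset, best_score = None, float("-inf")
    for _ in range(max_iterations):            # stopping criterion
        size = rng.randint(1, len(all_features))
        candidate = frozenset(rng.sample(all_features, size))  # generation
        score = evaluate(candidate)            # subset evaluation
        if score > best_score:                 # replace the older subset
            best_subset, best_score = candidate, score
    return best_subset, best_score


# Toy criterion (an assumption): reward two 'relevant' features and
# penalise subset length.
relevant = {"picture", "zoom"}


def toy_score(subset):
    return len(subset & relevant) - 0.1 * len(subset)


feature_list = ["picture", "menu", "zoom", "weight"]
best, score = select_features(feature_list, toy_score)
```

A metaheuristic such as ACO replaces the random generation step with a guided one, while the evaluate-and-compare logic stays the same.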

According to [35, 36, 37], three feature selection techniques can be used for the classification process: the filter technique, the wrapper technique, and the embedded or hybrid technique. The filter technique is a processing step that is independent of the machine learning algorithm. During the filter process, a relevance score is calculated for each feature and features with low scores are removed. The resulting feature subset is the input for the classification algorithm [37]. In the wrapper technique, the induction algorithm itself is used to evaluate candidate feature subsets to find the best one [18]. Meanwhile, the hybrid technique is a combination of the filter and the wrapper techniques [39, 42], which can be used to handle large datasets [42, 43]. The combined technique can produce the optimum feature subset using a standard optimization technique, such as GA, PSO, or ACO [40]. The feature subset selection process is an NP-hard problem [16, 45]. Therefore, an exhaustive search is, in general, required to guarantee the optimum solution. The metaheuristic approach is suitable because it does not need to explore the entire solution space [42]. The experiment reported in this study used a hybrid technique that combined ACO and KNN to obtain the optimum subset of features.
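The difference between the filter, wrapper, and hybrid techniques can be illustrated with a small sketch. The relevance scores, the threshold, and the evaluation function below are all invented for the example; in a real wrapper, `evaluate` would train and test a classifier. The wrapper step here enumerates every subset, which is feasible only for a handful of features and is exactly the cost that a metaheuristic search avoids.

```python
from itertools import combinations


def filter_step(scores, threshold):
    """Filter: keep features whose relevance score reaches the threshold."""
    return [f for f, s in scores.items() if s >= threshold]


def wrapper_step(candidates, evaluate):
    """Wrapper: score every non-empty subset and return the best one."""
    subsets = (frozenset(c)
               for r in range(1, len(candidates) + 1)
               for c in combinations(candidates, r))
    return max(subsets, key=evaluate, default=frozenset())


def hybrid_select(scores, threshold, evaluate):
    """Hybrid: filter first, then run the wrapper on the survivors."""
    return wrapper_step(filter_step(scores, threshold), evaluate)


# Hypothetical relevance scores and evaluation function.
scores = {"picture": 0.9, "menu": 0.2, "zoom": 0.7, "weight": 0.6}
relevant = {"picture", "zoom"}


def toy_score(subset):
    return len(subset & relevant) - 0.1 * len(subset)


chosen = hybrid_select(scores, 0.5, toy_score)
```

With these values, the filter removes ‘menu’, and the wrapper then prefers the pair ‘picture’ and ‘zoom’ over the larger subset that also contains ‘weight’.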

3.3.1 K-nearest neighbour (KNN)

KNN is one of the simplest machine learning algorithms. It classifies a new data point according to the classes of its $k$ nearest neighbours in the training data. The best $k$ value depends on the data; generally, a higher $k$ value reduces the effect of noise during classification. Nonetheless, it also makes the boundaries between classes less distinct. A cross-validation technique can be used to find the optimum $k$ value and the best parameters for a model. In this study, the value of $k$ was determined by the number of features involved in the process of feature selection. The Euclidean distance was used to calculate the distance between two points. The KNN classifier was used to evaluate the feature subsets acquired by the ants visiting each node (feature). Each feature has its own tf-idf value, which is stored in a feature vector set. Each feature vector is considered as a node that the ants must pass through to acquire food. In this study, the dataset was divided into two sets, namely, a training set and a test set. In the initial stage, the training dataset only contained A nodes and the test dataset only contained B nodes. When the ants moved from the A nodes to the B nodes, the distance between the A and B nodes was calculated using the Euclidean distance formula, and the performance of the A and B nodes was measured.

The MSE value derived from the classification process was used to assess the classification performance. The MSE value for the two nodes (A node and B node) was compared with the MSE value for the previous subset (A node). A decreased MSE value meant that the performance had increased. In that case, the B node was chosen and placed in the training data subset list. Subsequently, the ants moved to the next node, the C node, which was then treated as the test dataset. The distances between node C and node A, and between node C and node B in the training dataset, were calculated. Then, the MSE value for the group of A, B, and C nodes was calculated. If this MSE value was lower than the previous MSE value, the C node was selected and placed in the training data subset list. Otherwise, the C node was not selected and the ants continued on to the next node. The process was repeated until the ants had visited all the nodes.
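One possible reading of this procedure is a sequential search in which a feature (node) is kept only if adding it lowers the MSE of a KNN classifier that uses the Euclidean distance. The sketch below follows that reading and should not be taken as the exact ACO-KNN procedure: features are visited in index order rather than by ants, $k$ is fixed at 1, labels are binary, and the toy data and function names are invented for the example.

```python
import math


def knn_predict(train_X, train_y, x, features, k=1):
    """Majority vote among the k training rows closest to x (Euclidean)."""
    def distance(row):
        return math.sqrt(sum((row[f] - x[f]) ** 2 for f in features))

    order = sorted(range(len(train_X)), key=lambda i: distance(train_X[i]))
    nearest = order[:k]
    votes = sum(train_y[i] for i in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def subset_mse(train, test, features, k=1):
    """MSE between predicted and true labels using only `features`."""
    (train_X, train_y), (test_X, test_y) = train, test
    errors = [(knn_predict(train_X, train_y, x, features, k) - y) ** 2
              for x, y in zip(test_X, test_y)]
    return sum(errors) / len(errors)


def forward_select(train, test, n_features, k=1):
    """Start from feature 0 (node A); keep a later feature only if it
    lowers the MSE of the current subset."""
    selected = [0]
    best = subset_mse(train, test, selected, k)
    for f in range(1, n_features):
        mse = subset_mse(train, test, selected + [f], k)
        if mse < best:
            selected, best = selected + [f], mse
    return selected, best


# Toy data: feature 0 is constant, feature 1 separates the classes,
# and feature 2 is noise.
train = ([[0.5, 0.0, 0.9], [0.5, 0.1, 0.1], [0.5, 0.9, 0.8], [0.5, 1.0, 0.2]],
         [0, 0, 1, 1])
test = ([[0.5, 0.05, 0.15], [0.5, 0.95, 0.85]], [0, 1])
selected, best = forward_select(train, test, n_features=3)
```

On the toy data, adding feature 1 reduces the MSE from 0.5 to 0, so it is kept; feature 2 cannot reduce the MSE further and is skipped.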

3.3.2 Ant colony optimization for feature selection in sentiment analysis

Ants are insects that help one another to look for food and bring it back to their nest. The ants have the ability to find the best routes to get food by producing pheromone. This chemical substance adheres to the soil, thus serving as a route marker and as the medium of communication among ants. Ants also use this marker to find their way back to their nest. They can also use the marker to determine which route is the best one to use for transporting food to the nest.

This natural phenomenon was the inspiration for the development of the algorithm known as the artificial colony of ants [43]. This algorithm is a metaheuristic approach that can be used to solve various combinatorial optimization problems. Ant colony optimization (ACO) is a popular algorithm used for solving optimization problems, such as the travelling salesman problem (TSP), job scheduling, the vehicle routing problem, the quadratic assignment problem, feature selection, and bioinformatics [19, 20, 25, 32, 33, 34]. In the TSP, ACO was used to find the shortest route between cities [32, 34].

Feature selection is an important step in finding the optimum feature subset because it produces high classification accuracy and is able to maintain the original feature set. The definitions of the terminologies used in the ACO algorithm are as follows [24]:

• Expression of the problem in graph form: The problem must be designed as a graph with a set of nodes and edges between nodes.
• Heuristic value: This value is used to measure the movement of the ants between the nodes in the graph.
• Pheromone update rule: A rule for updating the pheromone levels on the edges is required as well as a corresponding evaporation rule. There are two types of pheromone update rule, namely, a local pheromone update rule and a global pheromone update rule.
• Probabilistic transition rule: This rule is used to determine the probability of an ant traversing from one node to the next node in the graph.

To apply an ACO algorithm for feature selection in sentiment analysis, three main issues must be addressed [19]. These are explained in the following subsections.

Expression of the problem in graph form

The feature selection goal needs to be expressed in a format that is suitable for ACO, which requires the problem to be represented as a graph [19, 33]. Thus, the nodes in the graph represent features, and the edges between them denote the choice of the next feature. The search for the optimal feature subset is made by an ant traversing the graph and visiting the minimum number of nodes that satisfies the traversal stopping criteria [19].

Figure 3 provides an example of this process, where the ant is currently at node $F_{1}$ and has to choose which feature to add next to its path. As shown by the solid arrows, it chooses feature $F_{5}$ next based on a transition rule and then, $F_{9}$ . Upon arriving at $F_{9}$ , the current subset becomes { $F_{1}$ , $F_{5}$ , $F_{9}$ }. This subset is deemed to fulfil the traversal stopping criteria and is an example where high classification accuracy has been achieved.

Figure 3.
Representation of subset construction by ACO for feature selection.

Heuristic value

A heuristic value is also called heuristic information or, as in [19], heuristic desirability. A heuristic value is optional, but is often needed for an algorithm to achieve high performance [47]. A heuristic value, $\eta_{i}$ , can be derived from the feature subset evaluation function using a classification algorithm [19, 23], for example an entropy-based measure [48], KNN, or Naïve Bayes classifier [25]. This study used the MSE of the classification results obtained from the KNN algorithm as a heuristic value, as explained in Section 3.3.1.

According to [47], in most studies on ACO algorithms, the probabilities for selecting the next nodes are called transition probabilities. In this study, ants were placed on $n$ nodes, each representing a feature, and the nodes were not connected to each other. The ants moved from one node to the next by using transition probabilities that followed a particular probabilistic transition rule (PTR). This basic ACO concept was applied in the text feature selection process in this study, where the ants’ movement from node $n_{1}$ to node $n_{2}$ was based on the PTR formula, shown by Eq. (4):

$\displaystyle P_{i}^{k}(t)=\left\{\begin{array}{ll}\frac{[\tau_{i}(t)]^{\alpha}\cdot[\eta_{i}]^{\beta}}{\sum_{u\in J^{k}}[\tau_{u}(t)]^{\alpha}\cdot[\eta_{u}]^{\beta}}&\text{if}\ i\in J^{k}\\ 0&\text{otherwise}\end{array}\right.$ (4)

where:

$\eta_{i}=$ heuristic value;

$\tau_{i}=$ pheromone value; and

$J^{k}=$ set of features that ant $k$ has not yet visited.

The PTR value is a combination of the heuristic value and the pheromone value, which is related to feature $i$ , and parameters $\alpha$ and $\beta$ . The values of parameters $\alpha$ and $\beta$ , $\alpha>$ 0 and $\beta>$ 0, define the relative importance of the pheromone value and heuristic information [47]. According to [19], a good balance between exploitation and exploration can be achieved by selecting suitable values for parameters $\alpha$ and $\beta$ . If $\alpha=$ 0, pheromone information is not used and the search process is abandoned. If $\beta=$ 0, the movement process is abandoned. When all the ants have completed their visits, the pheromone level at each node or feature needs to be updated by using a predetermined pheromone update rule. In this study, the pheromone was updated using the formula from [19], as discussed in the next subsection.
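The transition rule in Eq. (4) can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, `eta` is treated as an abstract positive desirability (how the MSE is mapped to $\eta_{i}$ is not assumed here), and the default values of `alpha` and `beta` are simply those listed in Table 6.

```python
import random

def transition_probabilities(tau, eta, allowed, alpha=1.0, beta=0.1):
    """Eq. (4): probability of moving to each feature; 0 outside `allowed`."""
    weights = {i: (tau[i] ** alpha) * (eta[i] ** beta) for i in allowed}
    total = sum(weights.values())
    return [weights[i] / total if i in weights else 0.0
            for i in range(len(tau))]

def choose_next(probabilities, rng=random):
    """Sample the index of the next feature from the transition probabilities."""
    return rng.choices(range(len(probabilities)), weights=probabilities)[0]

# Hypothetical pheromone and heuristic values for three features;
# feature 2 has already been visited, so it is excluded from `allowed`.
probs = transition_probabilities(tau=[1.0, 2.0, 1.0],
                                 eta=[1.0, 1.0, 1.0],
                                 allowed={0, 1})
print(probs, choose_next(probs))
```

With equal heuristic values, the feature carrying twice the pheromone is twice as likely to be chosen, and the visited feature has probability 0.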

Pheromone update rule

The pheromone update rule in the ACO algorithm is the most important aspect of the feature selection process. This study used two types of pheromone update rule [31]: (a) a local pheromone update rule that is applied by each ant while constructing solutions; and (b) a global pheromone update rule that is applied by the global-best ant, i.e., the ant that has made the best route since the start of the search in the current iteration.

Every time an ant visits a node, it adjusts the pheromone level by using the local pheromone update rule, shown by Eq. (5):

$\displaystyle\Delta\tau_{i}^{k}(t)=\left\{\begin{array}{ll}\phi\cdot\gamma\left(S^{k}(t)\right)+\frac{\varphi\cdot\left(n-\left|S^{k}(t)\right|\right)}{n}&\text{if}\ i\in S^{k}(t)\\ 0&\text{otherwise}\end{array}\right.$ (5)

where:

$n=$ quantity of features;

$S^{k}(t)=$ feature subset found by ant $k$ at iteration $t$ ;

$|S^{k}(t)|=$ feature subset length by ant $k$ at iteration $t$ ;

$\gamma(S^{k}(t))=$ classification performance;

$\phi=$ relative weight of classifier performance ( $\phi=$ 0.8);

$\varphi=$ relative weight of feature subset length ( $\varphi=$ 0.2);

$\phi\in$ [0, 1];

$\varphi=1-\phi$ .
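As a worked illustration of Eq. (5), consider hypothetical values that are not taken from the experiments: $n=10$ features, a subset of length $|S^{k}(t)|=4$, and a classification performance of $\gamma(S^{k}(t))=0.85$. With $\phi=0.8$ and $\varphi=0.2$, each feature in the subset receives

$\displaystyle\Delta\tau_{i}^{k}(t)=0.8\cdot 0.85+\frac{0.2\cdot(10-4)}{10}=0.68+0.12=0.80$

For the same performance, a shorter subset of length 2 would receive $0.68+0.2\cdot 8/10=0.84$, which shows how the second term rewards smaller subsets while the first term, weighted by $\phi=0.8$, dominates the deposit.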

Once all the ants have created their routes, only the overall best ant that has created the shortest route from the visit starting point was allowed to update the pheromone by using the global pheromone update rule shown by Eq. (6):

$\displaystyle\Delta\tau_{i}^{g}(t)=\left\{\begin{array}{ll}\phi\cdot\gamma\left(S^{g}(t)\right)+\frac{\varphi\cdot\left(n-\left|S^{g}(t)\right|\right)}{n}&\text{if}\ i\in S^{g}(t)\\ 0&\text{otherwise}\end{array}\right.$ (6)

where:

$n=$ quantity of features;

$S^{g}(t)=$ feature subset found by ant $g$ at iteration $t$ ;

$|S^{g}(t)|=$ feature subset length by ant $g$ at iteration $t$ ;

$\gamma(S^{g}(t))=$ classification performance;

$\phi=$ relative weight of classifier performance ( $\phi=$ 0.8);

$\varphi=$ relative weight of feature subset length ( $\varphi=$ 0.2);

$\phi\in$ [0, 1];

$\varphi=1-\phi$ ;

Equation (7) was used for all nodes, where $S^{k}(t)$ or $S^{g}(t)$ is the feature subset and $|S^{k}(t)|$ or $|S^{g}(t)|$ is the length of the feature subset found by ant $k$ or $g$ at iteration $t$ . Two parameters, $\phi$ and $\varphi$ , were used to control the relative weights of the classifier performance and the feature subset length, respectively. The two parameters have different impacts on the feature selection process. In this study, classification performance was more important than feature subset length, so the parameter settings stated in [19] were adopted, where $\phi=$ 0.8 and $\varphi=$ 0.2. The pheromone was updated according to the measured classification performance, $\gamma(S^{k}(t))$ or $\gamma(S^{g}(t))$ , and the length of the subset itself. By this definition, all ants can update the pheromone. Therefore, the process of depositing the new pheromone values of the ants and of pheromone evaporation is performed for all nodes based on Eq. (7) in the next iteration:

$\displaystyle\tau_{i}(t+1)=(1-\rho)\,\tau_{i}(t)+\sum_{k=1}^{m}\Delta\tau_{i}^{k}(t)+\Delta\tau_{i}^{g}(t)$ (7)

where:

$m=$ quantity of ants in each iteration;

$\rho=$ pheromone trail decay coefficient ( $0<\rho<1$ );

$g=$ best ant at each iteration;

$S^{g}(t)=$ feature subset found by ant $g$ at iteration $t$ ;

$|S^{g}(t)|=$ feature subset length by ant $g$ at iteration $t$ ;

$\gamma(S^{g}(t))=$ classification performance.
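The combined effect of evaporation and the two deposit rules can be sketched as below. This is a hedged illustration, not the authors' code: the function names are hypothetical, the deposit for each ant follows the form of Eqs. (5) and (6), and the best ant is assumed to be the one with the highest classification performance.

```python
def deposit(gamma, subset_len, n, phi=0.8):
    """Pheromone deposit in the form of Eqs. (5) and (6), with varphi = 1 - phi."""
    varphi = 1.0 - phi
    return phi * gamma + varphi * (n - subset_len) / n

def update_pheromone(tau, ant_subsets, ant_scores, rho=0.2, phi=0.8):
    """Evaporate, add every ant's deposit, then add the best ant's deposit."""
    n = len(tau)
    new_tau = [(1.0 - rho) * t for t in tau]           # evaporation
    for subset, gamma in zip(ant_subsets, ant_scores):
        amount = deposit(gamma, len(subset), n, phi)   # deposit by ant k
        for i in subset:
            new_tau[i] += amount
    # Assumption: the best ant g is the one with the highest performance.
    g = max(range(len(ant_subsets)), key=lambda k: ant_scores[k])
    amount = deposit(ant_scores[g], len(ant_subsets[g]), n, phi)
    for i in ant_subsets[g]:
        new_tau[i] += amount
    return new_tau

# Hypothetical iteration: two ants over four features.
print(update_pheromone([1.0, 1.0, 1.0, 1.0],
                       ant_subsets=[{0, 1}, {2}],
                       ant_scores=[0.5, 1.0]))
```

In this toy iteration, feature 3 is visited by no ant, so its pheromone only evaporates, while feature 2 is reinforced twice because it belongs to the best ant's subset.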

The main purpose of pheromone evaporation is to avoid stagnation, a condition in which all the ants build the same solution. Evaporation thus keeps the ants exploring, which helps them to obtain the optimum feature subset.

3.3.3 General process of ACO-KNN for feature selection in sentiment analysis

The general ACO-KNN feature selection process applied in this study is depicted in Fig. 4. The process started by declaring all the variable values used in the ACO-KNN algorithm. Next, the ants were placed randomly on the graph, with the total number of ants on the graph being the same as the number of features in the dataset. Then, the ants started to chart their routes from their original positions and visited the nodes until they met the stopping criteria that had been set. If the stopping criteria were not met, the pheromone values were updated and the process was repeated with a newly created set of ants. The function of ACO is to guide the feature selection process, while KNN functions as a classifier to evaluate the candidate subsets of features.
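The overall loop described above can be outlined as follows. This is a simplified sketch under several assumptions: `evaluate` is a hypothetical callback standing in for the KNN evaluation and returns a score in [0, 1], the heuristic values are flat, one ant starts on each feature, and an ant stops as soon as adding a feature no longer improves its score. The default parameter values are those listed in Table 6.

```python
import random

def aco_select(n_features, evaluate, n_iters=20,
               alpha=1.0, beta=0.1, rho=0.2, phi=0.8, seed=0):
    """Sketch of an ACO wrapper loop; evaluate(subset) returns a score in [0, 1]."""
    rng = random.Random(seed)
    tau = [1.0] * n_features          # initial pheromone
    eta = [1.0] * n_features          # flat heuristic: an assumption of this sketch

    def quality(score, size):
        # Same form as Eqs. (5) and (6): performance plus a reward for short subsets.
        return phi * score + (1.0 - phi) * (n_features - size) / n_features

    best_subset, best_score, best_quality = [], 0.0, float("-inf")
    for _ in range(n_iters):
        ants = []
        for start in range(n_features):            # one ant per feature
            subset, score = [start], evaluate([start])
            while len(subset) < n_features:
                allowed = [i for i in range(n_features) if i not in subset]
                weights = [tau[i] ** alpha * eta[i] ** beta for i in allowed]
                nxt = rng.choices(allowed, weights=weights)[0]
                new_score = evaluate(subset + [nxt])
                if new_score <= score:              # assumed stopping criterion
                    break
                subset, score = subset + [nxt], new_score
            ants.append((subset, score))

        tau = [(1.0 - rho) * t for t in tau]        # evaporation
        for subset, score in ants:                  # deposits by every ant
            for i in subset:
                tau[i] += quality(score, len(subset))
        g_subset, g_score = max(ants, key=lambda a: quality(a[1], len(a[0])))
        for i in g_subset:                          # extra deposit by the best ant
            tau[i] += quality(g_score, len(g_subset))

        if quality(g_score, len(g_subset)) > best_quality:
            best_quality = quality(g_score, len(g_subset))
            best_subset, best_score = sorted(g_subset), g_score
    return best_subset, best_score

def toy_evaluate(subset):
    """Hypothetical evaluator: features 1 and 3 are useful, the rest add noise."""
    chosen, relevant = set(subset), {1, 3}
    return max(0.0, len(chosen & relevant) / 2 - 0.05 * len(chosen - relevant))

print(aco_select(5, toy_evaluate))
```

In practice, `toy_evaluate` would be replaced by a function that trains and scores the KNN classifier on the candidate subset, which is what makes the procedure a wrapper approach.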

3.4 Phase 3: Detection of relationship between the features and sentiment words

After the optimum feature subset has been obtained, the next step was to identify the relationship between each feature in the optimum feature subset and the sentiment word in a customer review sentence. This study used the hybrid role of typed dependency relations (TDR) and part-of-speech tagging (POST) algorithm [49], which was designed to identify the relationships between the words present in a sentence. This approach used the Stanford Typed Dependencies to obtain grammatical relationships in a sentence. In [49], a list of combination rules was created using TDR and POST to identify the relationship between the feature and sentiment word present in a sentence, as shown in Table 4.

Table 4
Rules for typed dependency relations and POST for NSUBJ (adopted from [49])

TDR	POS Tags	(Feature, sentiment word)
NSUBJ	(JJ/VBN $\to$ NN/NNS)	(camera, perfect)
(NSUBJ $\leftarrow$ )( $\to$ ADVMOD)	(NN $\leftarrow$ (VBZ/JJ) $\to$ RB)	(autofocus, well)
(NSUBJ) $\to$ (AMOD)	(JJ $\to$ (NN) $\to$ JJ/VBN)	(manual mode, rich)
(NSUBJ) $\to$ (NMOD)	(JJ $\to$ (NN) $\to$ NN)	(picture quality, excellent)

Figure 4.

Overall process of proposed ACO-KNN algorithm.

The algorithm developed in [49] can detect up to five TDR layers when identifying the relations between features and sentiment words. A direct relationship between one word and another word in a sentence is defined as a ‘one layer TDR’, whereas a relationship that spans more than one TDR layer is known as a ‘multiple layer TDR’.

The following describes the process of identifying the relationship between a feature (F) and a sentiment word (S), in one layer and in multiple layers.

Example:

i) One layer:

Figure 5.

The dependency relationship for the sentence: This recorder is perfect.

Sentence 1 $=$ “This recorder is perfect.”

Dependency relationship $=$ {det(recorder-2, This-1), nsubj(perfect-4, recorder-2), cop(perfect-4, is-3)}

The word “recorder” is the feature. The word “perfect” is the sentiment word. The “recorder” is a noun, which is the subject. The word “perfect” is the complement of the copular verb. In this relationship, the TDR layer is NSUBJ(perfect/JJ $\to$ recorder/NN).
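The one-layer NSUBJ rule from Table 4 can be illustrated with a small matcher. This sketch does not call the Stanford parser: the dependency triples and POS tags for Sentence 1 are written out by hand, and the function name and data layout are assumptions made for the example.

```python
def extract_nsubj_pairs(dependencies, pos_tags):
    """Apply the one-layer rule NSUBJ(JJ/VBN -> NN/NNS).

    dependencies: list of (relation, head, dependent) triples.
    pos_tags: mapping from word to POS tag.
    Returns a list of (feature, sentiment word) pairs.
    """
    pairs = []
    for relation, head, dependent in dependencies:
        if (relation == "nsubj"
                and pos_tags.get(head) in {"JJ", "VBN"}
                and pos_tags.get(dependent) in {"NN", "NNS"}):
            pairs.append((dependent, head))
    return pairs

# Hand-written parse of "This recorder is perfect."
deps = [("det", "recorder", "This"),
        ("nsubj", "perfect", "recorder"),
        ("cop", "perfect", "is")]
tags = {"This": "DT", "recorder": "NN", "is": "VBZ", "perfect": "JJ"}
print(extract_nsubj_pairs(deps, tags))  # [('recorder', 'perfect')]
```

Multiple-layer rules would chain several such relations, for example by following an AMOD or XCOMP link from the word matched by NSUBJ.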

ii) Multiple layers:

Figure 6.

The dependency relationship for the sentence: The video quality works excellent.

Figure 6 shows a combination of three TDR layers (AMOD-NSUBJ-XCOMP), which is categorized as ‘multiple TDR layers’ (video/JJ $\leftarrow$ quality/NN $\leftarrow$ works/VBZ $\leftarrow$ excellent/JJ). In this relationship, the POS tagging for the feature is the combination of an adjective and a noun (JJ-NN), while the POS tagging for the sentiment word is an adjective (JJ).

3.5 Phase 4: Sentiment classification

In this study, sentiment classification was manually checked. Output from phase 3, which was the pair list (feature, sentiment word), was manually checked against the information present in the customer review dataset. Checking has to be done to ensure that the previously extracted features and sentiment words are similar to the information present in the existing dataset.

3.6 Phase 5: Testing, evaluation, and analysis

3.6.1 Dataset

The data used in this study were from customer review datasets that have been used in [1, 2]. The customer review datasets were from the Amazon website and covered five different types of electronic products (see Table 5). Each review consisted of a review title, product features, and opinion strength. Some review sentences did not contain any features; this research focused only on review sentences that contained product features and opinion words. The datasets were manually annotated by [1]. For example, the sentence, “This camera is perfect for an enthusiastic amateur photographer” would receive the tag “camera [ $+$ 2]” because camera is the product feature. A number [ $+n$ ] or [ $-n$ ] was added to denote the sentiment strength and its polarity, positive or negative; in this case, the sentiment was positive. Additionally, this research used a sentiment lexicon (positive and negative sentiment words) collected by [1], which can be downloaded for free from the authors’ website.

Table 5
Summary of product features in customer review datasets

Dataset	Number of manual product features	Number of review sentences
Nikon	70	346
Nokia	100	546
Apex	104	739
Canon	100	597
Creative	170	1716

The five electronic product datasets in Table 5 were used for comparisons and to evaluate the effectiveness of the proposed algorithm. The prepared experimental data were divided according to certain percentages. Generally, based on the data mining approach, a dataset would be split into two groups, namely, training data (90%) and test data (10%). To avoid any discrepancy that might arise due to the selection of only certain samples, this data allocation process was repeated 10 times using random sample selection, so that the training data and the test data became interchangeable. These data were represented in 10 training datasets and 10 test datasets, with a 90% training and 10% test allocation without overlap between the two. This process is known as 10-fold cross-validation. The value 10 shows that the data have been tested comprehensively and are a suitable representation for generating the best error prediction [50]. The overall customer review dataset consisted of five electronic product datasets. Each dataset has 10 divisions based on different percentages, which were 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, and 100%. Each percentage contained a different dataset that did not overlap. This means that the proposed ACO-KNN algorithm and the baseline algorithms (IG-GA and IG-RSAR) had to be run 10 times for each dataset, with different data percentages.
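The 90%/10% allocation repeated 10 times can be sketched as a standard 10-fold split. This is a generic illustration, not the exact sampling procedure of this study: the function name, the shuffling seed, and the way indices are assigned to folds are assumptions.

```python
import random

def ten_fold_splits(n_samples, n_folds=10, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

# Example with the number of review sentences in the Nikon dataset (Table 5).
for train, test in ten_fold_splits(346):
    print(len(train), len(test))
```

Each fold holds roughly 10% of the sentences, and the training and test indices of a given split never overlap.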

3.6.2 Baseline algorithms

For the two baseline algorithms, a combination of IG and GA, and a combination of IG and RSAR were chosen for the following reasons:

1. The GA is a metaheuristic technique that has been proven to be an effective feature selection technique in [6] for sentiment analysis. Furthermore, IG was combined with GA in [6] and was found to be able to filter and choose quality features. Therefore, the combination of the GA and IG techniques was deemed effective. Thus, this combination was considered to be potentially useful as a baseline algorithm in this study. By comparing the proposed ACO-KNN algorithm with the IG and GA hybrid, the effectiveness and ability of the ACO metaheuristic and that of the GA as a feature selection technique for analytical research on sentiments could be evaluated.
2. The RSAR was merged with IG in [9] as a feature selection technique for sentiment analysis. The experimental results showed that the combination of IG and RSAR was able to produce a good classification accuracy that ranged between 78.1% and 87.7%. The combined techniques were able to reduce redundant features and produced a minimum set of features of good quality. Thus, this combination was also thought to be suitable as a baseline algorithm to test the effectiveness of the proposed ACO-KNN algorithm as a hybrid feature selection technique in sentiment analysis.

3.6.3 Evaluation

For the evaluation, all the reviews were read and evaluated by human beings. An optimum feature subset was extracted from the reviews using the ACO-KNN algorithm. Similarly, the sentiment words in the reviews were extracted and identified as either positive or negative sentiments using the DR algorithm, as described in Section 3.4. The effectiveness of the proposed algorithm was measured according to three metrics, namely, precision ( $p$ ), recall ( $r$ ), and F-score. These evaluation measures were used to assess the accuracy of the combined proposed ACO-KNN algorithm and DR algorithm for detecting and acquiring the right feature (frequent noun) and right sentiment word [51]. Precision ( $p$ ), recall ( $r$ ), and F-score are defined as:

$\displaystyle p=\frac{TP}{TP+FP}$ (8)

$\displaystyle r=\frac{TP}{TP+FN}$ (9)

$\displaystyle\textit{F-score}=2\cdot\frac{p\cdot r}{p+r}$ (10)

where true positive (TP) is the number of reviews from which the algorithm correctly extracts the right feature and right sentiment words, false positive (FP) is the number of reviews from which the algorithm falsely extracts the wrong features and wrong sentiment words, while false negative (FN) is the number of reviews from which the algorithm fails to identify any features or sentiment words.
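Equations (8)–(10) translate directly into code. The function name and the example counts below are hypothetical; returning 0 when a denominator is zero is a convention chosen for this sketch.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall, and F-score as defined in Eqs. (8)-(10)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Hypothetical counts: 80 correct pairs, 20 wrong pairs, 10 missed reviews.
p, r, f = precision_recall_f(tp=80, fp=20, fn=10)
print(f"p={p:.3f}, r={r:.3f}, F-score={f:.3f}")
```

For these counts, precision is 0.8, recall is 80/90, and the F-score is 160/190.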

4. Proposed hybrid ACO-KNN algorithm for feature selection in sentiment analysis

The ACO algorithm has been suggested as a feature selection technique because it is effective in exploring the entire search space and can reduce the dimension of the overall feature space. On the other hand, the KNN algorithm works well as a classifier to evaluate the candidate feature subsets by using MSE as a heuristic value. In other words, this study had utilised a wrapper approach because the ACO had wrapped the classifier to guide the search during the feature selection [18]. Thus, the combination of the two algorithms should be able to produce the optimum feature subset. The proposed ACO-KNN algorithm is explained in Fig. 7.

5. Experimental set-up

A series of experiments was conducted to show the utility of the proposed feature selection algorithm. All experiments were run on a machine with an Intel Core 2.10 GHz CPU and 512 MB RAM. The proposed ACO-KNN algorithm, and the IG-GA and IG-RSAR baseline algorithms, were implemented in the Java programming language. The testing and evaluation did not take into consideration the running time needed to create a feature subset. Testing only took into account the capability of the ACO-KNN algorithm to create a feature subset that is both relevant and of high quality. The combined algorithm should also be able to use the feature subset to identify the relationship between the feature and sentiment word present in a customer review sentence. The output was the pairing of a feature and a sentiment word. To test the accuracy of the created output, it was evaluated using precision ( $p$ ), recall ( $r$ ), and F-score values (see Section 3.6.3).

Table 6
ACO parameter settings

Parameter	Value
Initial pheromone for all features, $\tau$	1
$\alpha$	1
$\beta$	0.1
Pheromone decay parameter, $\rho$	0.2
Relative weight of classifier performance, $\phi$	0.8
Relative weight of feature subset length, $\varphi$	0.2

Figure 7.

Algorithm 1: ACO-KNN algorithm.

5.1 Experiment 1: Analysis of precision, recall, and F-score for five datasets

The capability of ACO-KNN as an effective feature selection technique in creating the optimum feature subset was tested through an evaluation of classification performance based on precision ( $p$ ), recall ( $r$ ), and F-score values. The output was a pair consisting of a feature and a sentiment word, and each pair was manually checked against the information in the customer review dataset. Evaluation was in terms of accuracy, and feature and sentiment word matching, as explained in Section 3.6.3.

To show the utility of the proposed ACO-KNN algorithm, it was compared with IG-RSAR [9] and IG-GA [6]. This section presents the performance results of ACO-KNN as a text feature selection method for sentiment analysis. The algorithm was evaluated for three performance metrics, namely, precision ( $p$ ), recall ( $r$ ), and F-score. Table 6 shows the parameters used for ACO in the proposed algorithm. The values for these parameters were determined through several scientific tests that were repeatedly run on the ACO-KNN algorithm using the customer review datasets.

The proposed algorithm had achieved the highest performance with the parameters in Table 6. However, to ensure that these values were optimum and suitable, a follow-up study was conducted to verify them.

Table 7 shows the parameter settings for the IG-GA algorithm, as stated in [6]. The EWGA was used for the feature selection process. The EWGA used the IG heuristic to weight the various sentiment features. The IG values were then integrated into the two basic parameters of the GA, which are crossover and mutation operators [6]. The IG heuristic was used to select features based on the threshold value. Then, the IG heuristic was applied in the EWGA crossover procedure to improve the quality of the newly generated solution. The mutation operator was used to factor the attribute IG into the mutation probability and the mutation probability of a bit was set to mutate from 0 to 1 based on the feature’s IG [6].

Table 7
IG-GA parameter settings.

Parameter	Value
IG
Threshold	0.0025
GA
Crossover probability	0.6
Mutation probability	0.01

The IG-RSAR algorithm did not require additional parameter settings. However, for IG, all features with a value greater than 0 were selected [9].

Table 8 shows the experimental results of the proposed algorithm, in terms of precision ( $p$ ), recall ( $r$ ), and F-score in comparison with the results of the IG-GA and IG-RSAR algorithms on five customer review datasets.

Table 8

Performance (precision, recall, and F-score) of ACO-KNN, IG-GA, and IG-RSAR on customer review datasets

Dataset	ACO-KNN			IG-GA			IG-RSAR
	$p$	$r$	F-score	$p$	$r$	F-score	$p$	$r$	F-score
Nikon	89.2	92.7	90.7	74.1	76	74.1	72.3	84	77.3
Nokia	81.3	84.6	82.7	73.3	61.6	65.9	62.8	61	61.6
Apex	71.5	71.8	71.5	63	60.5	61.6	62.4	60.5	60.6
Canon	80.6	86	83.1	78.8	83.4	80.5	76.2	85.3	80.2
Creative	84.8	86	85.4	82.6	84.5	83.5	78.3	80.2	79.2
Average	81.5	84.2	82.7	74.4	73.2	73.1	70.4	74.2	71.8

Table 8 clearly shows that the proposed ACO-KNN algorithm has the ability to obtain higher precision and recall values compared to the IG-GA and IG-RSAR algorithms.
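The Average row of Table 8 can be re-derived from the per-dataset values with a short script. The data below are copied from Table 8; the dictionary layout and the function name are only illustrative.

```python
# Per-dataset values from Table 8, in the order Nikon, Nokia, Apex, Canon, Creative.
table8 = {
    "ACO-KNN": {"p": [89.2, 81.3, 71.5, 80.6, 84.8],
                "r": [92.7, 84.6, 71.8, 86.0, 86.0],
                "F": [90.7, 82.7, 71.5, 83.1, 85.4]},
    "IG-GA": {"p": [74.1, 73.3, 63.0, 78.8, 82.6],
              "r": [76.0, 61.6, 60.5, 83.4, 84.5],
              "F": [74.1, 65.9, 61.6, 80.5, 83.5]},
    "IG-RSAR": {"p": [72.3, 62.8, 62.4, 76.2, 78.3],
                "r": [84.0, 61.0, 60.5, 85.3, 80.2],
                "F": [77.3, 61.6, 60.6, 80.2, 79.2]},
}

def column_means(table):
    """Mean of each metric per algorithm, rounded to one decimal place."""
    return {alg: {metric: round(sum(vals) / len(vals), 1)
                  for metric, vals in cols.items()}
            for alg, cols in table.items()}

for alg, means in column_means(table8).items():
    print(alg, means)
```

The computed means match the Average row, and they also show that ACO-KNN leads on all three metrics when averaged over the five datasets.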

Figures 8–10 show the average precision, recall, and F-score values, respectively, for the three algorithms. From Figs 8 and 9, the average values for precision and recall of IG-GA were 74.1% and 76% (Nikon), 73.3% and 61.6% (Nokia), 63% and 60.5% (Apex), 78.8% and 83.4% (Canon), and 82.6% and 84.5% (Creative). The corresponding values for IG-RSAR were 72.3% and 84% (Nikon), 62.8% and 61% (Nokia), 62.4% and 60.5% (Apex), 76.2% and 85.3% (Canon), and 78.3% and 80.2% (Creative). Therefore, the combination of IG and GA was generally more effective than the combination of IG and RSAR: its overall average precision (74.4% versus 70.4%) and F-score (73.1% versus 71.8%) were higher, although IG-RSAR had a slightly higher overall average recall (74.2% versus 73.2%). Both baseline algorithms recorded their lowest precision, recall, and F-score on the Apex dataset.

Figure 8.

Comparison of average precision for ACO-KNN, IG-GA, and IG-RSAR.

Figure 9.

Comparison of average recall for ACO-KNN, IG-GA, and IG-RSAR.

Nonetheless, the proposed algorithm had outperformed both of these algorithms in terms of precision and recall for all five datasets. Thus, it is clear that ACO-KNN has the ability to choose the best features from real-value datasets, even those that contain a mixture of irrelevant or redundant features.

Figure 10.

Comparison of average F-score for ACO-KNN, IG-GA, and IG-RSAR.

Moreover, Fig. 10 shows that the proposed algorithm was able to achieve a higher F-score compared to the other two algorithms. This result indicates that ACO-KNN was effective in extracting features and superior to the baseline algorithms.

5.2 Experiment 2: Analysis of effect of number of selected features on sentiment classification performance (precision, recall, and F-score)

This section provides the results of Experiment 2, which was conducted to evaluate the performance of ACO-KNN, IG-GA, and IG-RSAR in selecting features for the feature subset. The purpose of conducting Experiment 2 was to investigate whether the number of features in the feature subset would affect the sentiment classification performance. Figures 11–13 and Table 9 show the results of the comparison between the classification performances of ACO-KNN, IG-GA, and IG-RSAR for five types of customer review dataset in terms of average precision, recall, and F-score against the number of features selected.

Figure 11 clearly shows that the proposed algorithm (ACO-KNN) was able to produce the best solution with between five and nine features, and a high classification performance of 71% to 92% based on precision, recall, and F-score values.

Figure 11.

Comparison of average for precision, recall, and F-score and the number of features selected by ACO-KNN.

In contrast, as shown in Fig. 12, the average classification performance of IG-GA was between 60.5% and 85.3% when an average of six to 11 features was selected. Moreover, Fig. 13 shows that ACO-KNN was better than IG-RSAR, which had a classification performance of between 60% and 85% when six to 11 features were selected. Thus, the ACO-KNN algorithm was deemed capable of selecting a smaller number of features that are of good quality, so it can produce a better classification performance compared to the IG-GA and IG-RSAR algorithms.

The following Table 9 displays the comparison of the overall average for the ACO-KNN, IG-GA, and IG-RSAR algorithms. It was clear that the ACO-KNN algorithm was capable of acquiring, on average, the minimum number of features (i.e., six) compared to the other two algorithms. It was able to achieve a high classification performance of 81.5% of precision, 84.2% of recall, and 82.7% of F-score. In contrast, the IG-GA algorithm had only achieved 74.4% of precision, 73.2% of recall, and 73.1% of F-score with nine features. IG-RSAR was able to obtain 70.4% of precision, 74.2% of recall, and 71.8% of F-score with nine features. Thus, Experiment 2 had proven that the ACO-KNN algorithm is capable of acquiring the minimum number of features with high classification performance.

Table 9

Average values of performance and number of features selected by ACO-KNN, IG-GA and IG-RSAR

Dataset	ACO-KNN				IG-GA				IG-RSAR
	$p$	$r$	F-S	No. of features	$p$	$r$	F-S	No. of features	$p$	$r$	F-S	No. of features
Nikon	89.2	92.7	90.7	5.1	74.1	76.0	74.1	6.5	72.3	84.0	77.3	6.6
Nokia	81.3	84.6	82.7	5.6	73.3	61.6	65.9	11.7	62.8	61.0	61.6	11.1
Apex	71.5	71.8	71.5	5.6	63.0	60.5	61.6	6.7	62.4	60.5	60.6	7.4
Canon	80.6	86.0	83.1	9.0	78.8	83.4	80.5	10.5	76.2	85.3	80.2	10.0
Creative	84.8	86.0	85.4	6.4	82.6	84.5	83.5	10.9	78.3	80.2	79.2	10.1
Average	81.5	84.2	82.7	6.3	74.4	73.2	73.1	9.3	70.4	74.2	71.8	9.0

Figure 12.

Comparison of average for precision, recall, and F-score and number of features selected by IG-GA.

Figure 13.

Comparison of average for precision, recall, and F-score and number of features selected by IG-RSAR.

Overall, using ACO-KNN for feature selection improved the classification performance while selecting the minimum number of features. As previously explained, feature selection is a very important task in choosing the best features.

5.3 Discussion

This study showed the capability and efficiency of the proposed ACO-KNN algorithm, which was suggested as a suitable feature selection technique for sentiment analysis. To determine its effectiveness, the performance of ACO-KNN was compared with that of two other hybrid algorithms, IG-GA and IG-RSAR. The results have shown that ACO-KNN could achieve the highest performance with the parameters shown in Table 6. The values of these parameters were appropriate for the experiments that were conducted. However, the experiments were not designed to investigate whether these parameters would be the optimum values for ACO. Therefore, an in-depth investigation is necessary to determine the optimum parameter values needed to implement ACO as a feature selection technique in sentiment analysis research. The parameter values can help in obtaining the minimum, optimum, and high-quality feature subsets.

The natural phenomenon of ants looking for the shortest route between their nest and food source is the main basis of the ACO algorithm. This idea was used as the guideline for the ants to look for the correct path, which is critical for high-quality problem solving. Thus, ACO was able to provide a guideline for ants during the feature subset building process by determining the subset size. When applied to the search space, the ACO algorithm has the capacity to produce a feature subset. Furthermore, a heuristic function can be used to assess the quality of the features that the ants visit. A good heuristic function could help in solving the problem of feature selection by ACO. This study used the KNN algorithm as the heuristic function for each feature that the ants visit. Each feature has its own heuristic value that helps the ants decide which node to choose.

Several experiments were conducted to assess the capability of the ACO-KNN algorithm when applied to five different datasets (see Table 5). Based on the results in Sections 5.1 and 5.2, ACO-KNN was found to be highly competitive with the baseline algorithms, to converge quickly, to have a strong search process in the problem space, and to be efficient in searching for the minimum and optimum feature subset.

This study had implemented the wrapper approach in the feature selection process. The KNN algorithm was used as a classifier to select and evaluate the candidate features using MSE as a heuristic value. The ACO algorithm was used to guide the search in the feature selection process. In other words, ACO had wrapped KNN in the feature selection process.

In ACO, if the optimum feature subset is not acquired, a pheromone update process would take place, a set of new ants would be created and the process would be repeated in further iterations. The search for the optimum feature subset is done through the ants’ traversal of the graph. When an ant traverses a path and has selected a feature, the feature is not selected again for the same path. This limitation was implemented in the ACO algorithm in this study to avoid selecting the same feature again. Therefore, the selected feature subset will have different features.

The advantages of the ACO algorithm, such as its powerful ability to explore the feature space, its robustness, and the ease with which it can be combined with other classifiers, such as KNN, nearest neighbour, the entropy-based measure, or the rough set dependency measure, could enable it to create the best feature subset to facilitate the search for the optimum quality feature subset. The role and advantages of ACO in assessing and selecting high-quality, optimum feature subsets, can produce good sentiment classification, as shown by the results presented in Sections 5.1 and 5.2.

Based on the results, the ACO-KNN combination is a viable way to search for and assess features, and it is competitive with other algorithms in producing optimum, minimal, and high-quality feature subsets that increase classification accuracy.

6. Conclusion

This paper proposed a wrapper approach in which an ACO algorithm was wrapped around a KNN classifier to evaluate candidate feature subsets, using the MSE as the heuristic value. In other words, the proposed ACO-KNN algorithm guided the feature selection process and used KNN to evaluate the candidate subsets of features. Extensive experiments were conducted to evaluate the performance of the proposed ACO-KNN algorithm in finding salient features in different datasets. In the proposed hybrid algorithm, the MSE of classification and the feature subset length were considered suitable measures for evaluating the performance of the algorithm. Based on the results, the algorithm was able to select the optimal feature subset without prior knowledge of the features. The computational results indicated that the proposed ACO-KNN algorithm could achieve good performance with a smaller number of features. The main contribution of this paper is that, to the best of the authors' knowledge, it is the first work to use ACO in this way for feature selection in sentiment analysis. The advantages of ACO are that it is fast and efficient in searching for an optimum solution with a minimum-length feature subset. Additionally, ACO is a good method for guiding the feature selection process, and it can be readily combined with another algorithm that acts as a classifier to find a good solution. The hybrid ACO-KNN algorithm showed promising performance in terms of precision, recall, and F-score. It performed better than IG-GA and IG-RSAR on every dataset except Apex. Future work should therefore seek parameter settings that are more suitable for the ACO part of the hybrid algorithm, so that the ants are guided to the best subset of features.

Acknowledgments

The authors gratefully acknowledge Universiti Pertahanan Nasional Malaysia and the Ministry of Education Malaysia, as well as the Fundamental Research Grant Scheme, for supporting this research project through grant no. FRGS/1/2016/ICT02/UKM/01/2.

