Abstract
Despite great advances in spam detection, spam remains a major problem that has affected the global economy enormously. Spam attacks are commonly perpetrated through digital platforms with a large electronic audience, such as email, microblogging websites (e.g. Twitter), social networks (e.g. Facebook), and review sites (e.g. Amazon). Different spam detection solutions have been proposed in the literature; of these, Machine Learning (ML) based solutions are among the most effective. Nevertheless, most ML algorithms suffer from high computational complexity, so some studies introduced Nature Inspired (NI) algorithms to further improve the speed and generalization performance of ML algorithms. This study presents a survey of recent ML-based and NI-based spam detection techniques to empower the research community with information that is suitable for designing effective spam filtering systems for emails, social networks, microblogging, and review websites. The recent success and prevalence of deep learning show that it can be used to solve spam detection problems. Moreover, the availability of large-scale spam datasets makes deep learning and big data solutions (such as Mahout) very suitable for spam detection. However, few studies have explored deep learning algorithms and big data solutions for spam detection. Besides, most of the datasets used in the literature are either small or synthetically created. Therefore, future studies can consider exploring big data solutions, big datasets, and deep learning algorithms for building efficient spam detection techniques.
Introduction
Despite technological and security improvements in spam detection systems, spam remains a persistent problem across the globe. In 2019, the total number of email users worldwide was estimated at over 3.9 billion, and it is expected to exceed 4.2 billion in 2022 [168]. This implies that over 50% of the world population currently communicates via email. Furthermore, the numbers of monthly active Facebook, Twitter, Snapchat and WhatsApp users as of January 2020 are estimated at 2.45 billion, 340 million, 382 million and 1.6 billion, respectively [47]. These growing numbers have undoubtedly contributed to the increase in cybercrime on electronic platforms. Figures 1–3 show examples of spam messages on Facebook, Twitter, and WhatsApp, respectively.

Example of spam message on Facebook [46].

Example of spam message on Twitter [149].

Example of WhatsApp scam [30]. Such scams can install malware on devices or obtain personal information from victims.
Given the challenges posed by spam attacks, there is an obvious need for fast and effective spam detection techniques. Many spam detection and prevention techniques have been introduced in the literature, including blacklist, whitelist, rule-based and Machine Learning (ML) based techniques. Of these, ML-based techniques are among the most effective and accurate. However, some of them suffer from high computational complexity, which grows with the size of the dataset. Therefore, some studies introduced Nature Inspired (NI) based data reduction techniques to reduce the computational complexity of ML algorithms, thus making them faster and more effective for real-time spam detection. This study, therefore, presents a survey of recent ML and NI-based approaches for spam detection. A sizable number of surveys on spam detection exist; however, most of them focused on one type of spam. Some surveys focused on online review spam detection, while others focused on email spam or social network spam. The authors in [50,57] presented surveys of different ML methods used for classifying and detecting online review spam. Crawford et al. [50] also provided information on the performance of different classification techniques used for review spam detection. Bandakkanavar et al. [20] presented a survey of different techniques that have been proposed for review spam detection, with a particular focus on sentiment classification. Hussain et al. [78] presented a review of various feature extraction techniques used in online review spam detection methods. They also outlined some performance measures popularly used to evaluate spam detection methods. Dou [58] presented a survey of methods that used graph-based and deep learning algorithms to solve spam detection in online social network and review platforms.
Kabakus and Kara [84] presented a review of different Twitter spam detection methods. They also outlined some common features used to identify spam on Twitter. Moreover, they presented some new features that can be used to improve the effectiveness of spam detection on Twitter. The authors in [90,174] presented a survey on spam detection in online social networks. They also provided a discussion on the performance of spam detection techniques. Moreover, they presented some datasets and specific techniques used to design spam filters for Twitter social network. Wu et al. [185] presented a comparative study of state-of-the-art spam detection techniques for Twitter social network. They provided a discussion of some feature selection techniques used for online spam detection and outlined some puzzling challenges in Twitter spam detection techniques. Besides, Reddy and Reddy [137] provided a survey on spam detection methodologies used in social networking websites, such as Facebook, WhatsApp and YouTube. They provided information on various spam detection techniques and their implementation details. They also provided information on their performance and various datasets used for spam detection.
Apart from social network and online reviews, some surveys [28,178] focused on email spam. Blanzieri and Bryl [31] provided an overview of effective ML-based spam email detection solutions. Moreover, they provided different evaluation metrics for spam email detection and various approaches used in both commercial and non-commercial spam detection systems. Dada et al. [52] also presented a systematic review of some ML-based approaches used to tackle spam email detection. They examined the various applications of ML methods to the filtering process used by popular email providers, including Gmail, Yahoo and Outlook. They identified crucial concepts and various research trends in spam filtering. Saadat [142] presented a review of some existing spam filtering methods. They provided information on the classification and evaluation methods used by these techniques. They also provided a comparison between the traditional methods and ML-based methods used for spam email detection. Guzella and Caminhas [74] presented a survey on recent ML-based approaches used for spam email detection. They focused on both textual and image-based approaches used for spam detection. They emphasized the importance of considering specific characteristics of spam detection problems (specifically, concept drift) when designing spam filters. Bhowmick and Hazarika [28] presented a survey on some effective content-based approaches used in the literature for spam email detection. They presented a discussion on the impacts and effectiveness of ML-based spam email filters.
As shown in the review above, a sizable number of spam detection surveys have been presented in the literature, but most of them focused on one type of spam. This survey is different from other surveys in the following ways:
This study presents a survey of recent techniques that have been proposed for four types of spam, including email spam, web spam, social network spam, and review spam. This survey will equip the research community with valuable insights that can be used to design effective and fast state-of-the-art spam detection models for curbing the spread of spam attacks in different electronic platforms.
This survey focuses on two effective content-based spam detection techniques: NI and ML-based techniques. These two techniques are typically combined to design effective, fast, and accurate hybrid models for spam detection. NI algorithms are used to select optimal features, instances, or parameters for improving the training speed and predictive accuracy of ML algorithms. Besides, they are used to select effective classification rules for spam classifiers.
This survey also presents a discussion on the performance and effectiveness of existing spam detection techniques and outlines their strengths and weaknesses. Moreover, this survey provides some useful information on various popular datasets and performance measures suitable for evaluating spam detection techniques. It also outlines some recommendations and future research directions. As shown in the survey, standard neural network-based techniques do not consider the hierarchical relationships between features. Capsule neural networks have the potential to better model the hierarchical relationships between features, making them suitable for natural language processing problems. Future studies can consider exploring the implementation of capsule neural networks for improved spam detection. Also, most studies focused on supervised learning, which requires a large amount of labelled data. Future studies can also consider exploring semi-supervised learning approaches.
The remaining part of this paper is structured as follows: Section 2 provides some useful insights on the various challenges posed by web spam, review spam, social network spam, and email spam. It also outlines the characteristics of the different types of spam and some features that can be extracted from them. Section 3 provides a survey on different ML-based and NI-based spam detection techniques, while Section 4 provides a discussion on the surveyed techniques. Section 5 reports a comparative analysis of the performance of different ML-based spam detection techniques, and Section 6 provides a summary and some future research directions.
Spam is an evolving threat that is spreading to virtually all kinds of electronic platforms. This threat has enormously affected thousands of internet users worldwide. Figure 4 shows that as of December 2019, a total of over 142 million US dollars had been lost to electronic scams, including phishing, investment scams, identity theft, and hacking [3]. The remaining part of this section provides some background information on web spam, online review spam, social network spam, and email spam, respectively.

Amount lost to spam for the year 2019 [3].
Given a web graph with different nodes and edges, where each node refers to a webpage and each edge refers to a link between webpages, search engines are designed to rank webpages based on the structure of their links [180]. However, results displayed by search engines are no longer reliable. This is because spammers have devised different manipulative techniques to promote spam webpages and falsely increase their rankings in search engines. These techniques are called webspam.
Characteristics of webspam
Webspam can be classified based on the techniques used to increase webpage ranking. The main types include content spam, link spam, and cloaking spam [94]. Spam webpages contain misleading contents, such as trending topics, keywords, and popular terms that attract many clicks from users [115]. Furthermore, some spam webpages contain anchor links and hidden links. Hidden links can be used to conceal usernames and blog comments. Spam webpages also contain misleading associations to webpages. Some spammers create a network of pages that are closely connected to increase their link-based score and bypass the ranking algorithms of search engines. Some spammers also design techniques (called cloaking spam) that can display different versions of a website to users and search engines. Typically, a legitimate version of the website is sent to the search engine for ranking and indexing, while the malicious version is displayed to users.
Problems caused by web spam
Webspam degrades the quality of search results by suggesting unwanted webpages and misleading users to fake websites. It also causes search engines to waste computational resources and storage space when processing irrelevant webpages. Some spam webpages contain malicious links to malware or viruses that can damage users’ computers.
Features to be exploited for web spam detection
Webspam detection systems can be designed to automatically detect spam webpages and remove them from search results. The key to achieving this is to determine certain features that are good indicators of legitimate and illegitimate webpages. These features are related to the contents or links in webpages [15]. Content-based features are related to the words on the webpage, such as the number of words in a webpage, the number of words in a webpage title, and the average length of words in a webpage. Popular words and trending topics in each webpage can also serve as good indicators of legitimate webpages. These words can be extracted and compared to the actual content of the webpage. Webpages that contain words that do not match the actual content can be flagged as spam webpages. Link-based features include the number of links in a webpage, IP addresses in links, and the number of page entry links. These features can be extracted from links in webpages and used to build web spam classifiers.
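The content-based and link-based features listed above can be illustrated with a minimal Python sketch. The function names, the tokenization rule, and the regular expressions are assumptions made for illustration; they are not taken from any of the surveyed systems, and a production extractor would use a proper HTML parser.

```python
import re

# Illustrative (assumed) patterns: a simple word tokenizer, an href extractor,
# and a check for URLs whose host is a raw IPv4 address.
WORD_RE = re.compile(r"[A-Za-z0-9']+")
HREF_RE = re.compile(r'href="([^"]+)"', re.IGNORECASE)
IP_URL_RE = re.compile(r"https?://\d{1,3}(?:\.\d{1,3}){3}")


def content_features(title, body):
    """Word-count features of the kind described in the text."""
    body_words = WORD_RE.findall(body)
    title_words = WORD_RE.findall(title)
    avg_len = (
        sum(len(w) for w in body_words) / len(body_words) if body_words else 0.0
    )
    return {
        "num_words": len(body_words),
        "num_title_words": len(title_words),
        "avg_word_length": avg_len,
    }


def link_features(html):
    """Link-count features: total links and links that point at a raw IP."""
    links = HREF_RE.findall(html)
    return {
        "num_links": len(links),
        "num_ip_links": sum(1 for url in links if IP_URL_RE.match(url)),
    }


page = '<a href="http://192.168.0.1/win">x</a> <a href="https://example.com">y</a>'
print(content_features("Cheap pills online", "Buy cheap pills now"))
print(link_features(page))
```

The resulting dictionaries could be concatenated into one feature vector per webpage before training a classifier.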
Some webpages may contain resources (such as images and scripts) that are loaded from external links. These webpages could be spam webpages [94]. Some of them may contain JavaScript code written by spammers to steal personal information from users. JavaScript code can also be used to facilitate cloaking spam. Spammers can use it to send malicious content to users with JavaScript-enabled browsers, and legitimate content to search engines. JavaScript tags and functions can be extracted from webpages, analyzed, and used to build effective methods for identifying cloaking-based webspam. Also, large datasets (containing the different types of webpage features) can be collected and used to build ML models for webspam detection.
Online review spam
Review spam refers to reviews that are wholly or partially untrue [162]. Companies and business owners use online reviews as a means of promoting their products or services. Moreover, individuals use online reviews for opinion sharing and purchase decisions. However, due to the growing tendency of posting fake reviews to publicize certain products or slander certain brands of competitors, review spam detection has surfaced as an essential challenge to tackle. Business owners can pay people to write good reviews about their products or to write false reviews about their competitors’ products.
Characteristics of review spam
Fake or truthful reviews can be identified based on their content. Truthful reviews generally contain expressions that are more tangible, providing specific details on spatial configurations [125]. On the other hand, fake reviews contain more details of the product or service offered by the spammer [37], which may be more informative than those written by genuine customers. Further, truthful reviews generally contain both positive and negative sentiments, while false reviews generally contain negative sentiments [96]. Besides, fake reviews are relatively short and full of extreme (or one-sided) criticism or praise [37]. Also, fake reviewers talk less about themselves; they focus more on the aspects of their deception that will increase their credibility [125]. Fake reviews can also be identified by the attributes of the reviewer. Some spam reviewers are known to consistently write spam [96]. Also, the profiles of spam reviewers are often new and unverified, with little personal information [37]. Besides, fake reviews typically receive few votes [37].
Problems caused by review spam
Consumer reviews play a major part in buyers’ decisions. It is estimated that about 16% of all reviews on the Yelp platform are fake [105], 33% of all TripAdvisor reviews are fake [29], and over 50% of the reviews of some Amazon products are fake. These fake reviews can mislead customers into making wrong purchase decisions. They can also affect the reputation of businesses and consequently lead to a loss of customers.
Features to be exploited
Different features can be used to build effective review spam detection models, including bag-of-words features, Linguistic Inquiry and Word Count (LIWC) output, and Part of Speech (POS) tags. In the bag-of-words approach, individual words or small groups of words from sentences are used as features. These features are known as n-grams, and they are extracted by selecting one (unigram), two (bigram) or three (trigram) adjacent words from a given document. In the POS tagging approach, words are grouped into different parts of speech based on their definition and the context in which they appear in a sentence. To use POS features, keywords belonging to different parts of speech can be extracted from various reviews and used to build classifiers. For example, Ott et al. [125] used the following POS features: nouns, adjectives, prepositions, determiners, verbs, adverbs, and pronouns. Their experiments show that truthful reviews contain more nouns, adjectives, prepositions, and determiners, while fake reviews contain more verbs, adverbs, and pronouns.
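The n-gram extraction described above can be sketched in a few lines of Python. This is a minimal illustration that assumes lowercasing and whitespace tokenization; the function names are hypothetical, and real pipelines usually add punctuation handling and stop-word filtering.

```python
from collections import Counter


def ngrams(text, n):
    """Return the list of n adjacent-word sequences found in `text`."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(text, n):
    """Count how often each n-gram occurs (a bag-of-words style feature map)."""
    return Counter(ngrams(text, n))


review = "The room was clean and the room was quiet"
print(ngrams(review, 1))         # unigrams
print(bag_of_ngrams(review, 2))  # bigram counts
```

Each distinct n-gram becomes one dimension of the feature vector, which is why the feature space grows quickly as n increases.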
LIWC is a text analysis software package in which users can create their own word dictionaries to analyse the language relevant to them. A typical example of an LIWC feature is first-person singular pronouns. Fake reviews are usually associated with decreased usage of first-person singular pronouns. This is because spammers talk less about themselves due to lack of experience or to dissociate themselves from the deception [125]. Another LIWC feature that can be exploited is space, especially for hotel and restaurant reviews. Lack of spatial details in reviews is a good indicator of fake reviews. Sentiment is another good LIWC feature that can be used to build spam classifiers. The presence of negative sentiments is a good indicator of false reviews.
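As a rough illustration of a dictionary-based feature of this kind, the sketch below computes the share of first-person singular pronouns in a review. It does not use LIWC itself; the pronoun list, the tokenizer, and the function name are assumptions chosen only to show the idea.

```python
import re

# Assumed stand-in for a LIWC-style category dictionary (not the LIWC lexicon).
FIRST_PERSON_SINGULAR = {"i", "me", "my", "mine", "myself"}


def first_person_rate(text):
    """Fraction of tokens that are first-person singular pronouns."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in FIRST_PERSON_SINGULAR)
    return hits / len(tokens)


print(first_person_rate("I loved my stay"))
print(first_person_rate("Great hotel, great staff"))
```

A low rate on its own proves nothing; as the text notes, such features are most useful when combined with others.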
All the linguistic features outlined above (that is, LIWC and POS tags) are not very good standalone identifiers for review spam [114]. However, they can be combined with bag-of-words features to build effective spam detection models for review spam detection [97].
Social network spam
Social network spam refers to spam content directed at users of social networking platforms, such as Facebook, WhatsApp, Instagram, and Twitter. It was estimated that about 40% of social network accounts are used for spam [91]. Spammers use these platforms to send malicious comments to multiple users.
Characteristics of social network spam
Unlike email messages and online reviews, content posted to microblogging websites (such as Twitter, Instagram, and Facebook) is restricted to a limited number of characters. Besides, this content contains many platform-specific attributes, such as idioms, shortened URLs, hashtags, mentions, and abbreviations. Moreover, messages on microblogging platforms are very unstructured and noisy [4]. Also, spam messages can be sent multiple times (from one account) to multiple users within a short period [170]. These messages can also be sent simultaneously from multiple spam accounts to multiple users [34]. Furthermore, some social network spam messages contain symbols and punctuation marks in place of letters. These messages may also contain punctuation marks inside words (such as “g.o.o.d”, instead of “good”). Besides, some social network messages contain malicious links. If users click on such a link, it can lead them to phishing websites designed to steal personal and sensitive information. It can also cause malware and viruses to be downloaded to users’ computers [35]. These unique characteristics pose a huge challenge to current spam detection models for microblogging websites.
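One pre-processing step suggested by the “g.o.o.d” example is to strip punctuation that has been inserted between single letters before extracting features. The following Python sketch is one possible way to do this; the regular expression, the set of separator characters, and the function name are assumptions, and the rule will also collapse legitimate dotted abbreviations such as “U.S.A”.

```python
import re

# Two or more "letter + separator" pairs followed by a final letter,
# e.g. "g.o.o.d" or "f-r-e-e". The separator set is an assumption.
_OBFUSCATED = re.compile(r"\b(?:[A-Za-z][.\-_*]){2,}[A-Za-z]\b")


def deobfuscate(text):
    """Remove separators inserted between the letters of a single word."""
    return _OBFUSCATED.sub(
        lambda match: re.sub(r"[.\-_*]", "", match.group(0)), text
    )


print(deobfuscate("this deal is g.o.o.d and f-r-e-e"))
```

Running the normalizer before tokenization lets word-based features match the intended word rather than a string of isolated letters.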
Problems caused by social network spam
Social network spam can cause public panic and confusion around trending topics. Spammers can also take advantage of trending topics by posting comments that contain links, to mislead users to fake websites and defraud them of thousands or even millions of dollars. Social network spam can also be used as a means of bullying, threatening, or harassing online users. It can also be used for identity theft.
Features to be exploited
One of the first actions that spammers typically execute is to send many friend requests to their victims. Generally, only a fraction of users will accept friend requests from strangers. Therefore, a good way to tackle social network spam is to build detection systems that can identify spam profiles with a large number of sent friend requests and a small number of friends. Another way to identify spam profiles is to train classifiers on features that can identify profiles that accepted a large number of friend requests from strangers. Furthermore, an analysis performed by Stringhini et al. [163] indicates that legitimate profiles send out more messages than spam profiles. This is another potential feature that can be used to train classifiers. Researchers can study the characteristics of social network spam and build effective anti-spam solutions for social network sites.
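The “many requests sent, few friends” signal can be expressed as a simple ratio feature. The sketch below is only illustrative: the function names and both thresholds (`min_requests`, `max_ratio`) are hypothetical values, not figures reported in the surveyed studies, and in practice the ratio would be fed to a classifier alongside other features rather than used as a hard rule.

```python
def friend_ratio(requests_sent, friends):
    """Friends per friend request sent (0.0 when no requests were sent)."""
    return friends / requests_sent if requests_sent else 0.0


def looks_suspicious(requests_sent, friends, min_requests=100, max_ratio=0.1):
    """Flag profiles that sent many requests but have few friends.

    Both thresholds are assumed values for illustration only.
    """
    return (
        requests_sent >= min_requests
        and friend_ratio(requests_sent, friends) <= max_ratio
    )


print(looks_suspicious(requests_sent=500, friends=10))
print(looks_suspicious(requests_sent=50, friends=40))
```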
Email spam
Email spam is one of the persisting challenges facing thousands of individuals and organizations. Email spam refers to an unsolicited email sent out in bulk to a random list of recipients.
Characteristics of email spam
Spam emails have some characteristics that uniquely identify them. The senders of spam emails usually hide their identity. They use different means to do so, such as IP addresses, images, and HTML tags. Moreover, spam emails are usually sent to multiple people at the same time. Also, spam emails sometimes contain unpleasant content and an urgent tone. They are sent with malicious intent to steal personal and sensitive information from users. Some spam emails contain URLs that can lure users to fake websites or download viruses and malware to users’ devices.
Problems caused by email spam
Spam emails may be free to send, but they are costly to their recipients and to various service providers. Spam email costs service providers and organizations billions of dollars in lost bandwidth [92]. Besides, it costs them employee productivity, time, resources, and storage space [166]. Moreover, spam emails expose users to unpleasant content, and they provide a means for the distribution of malicious software, like Trojans and worms.
Features to be exploited for email spam
Spam emails can be tackled by training ML models on features that can effectively segregate spam emails from legitimate emails. These features can be identified in different parts of an email, including email header and message content. Email headers contain information about an email, such as email route, transmission date, sender, and recipient ID. Most spammers send spam messages by falsifying some of this information. For example, they can use some sophisticated Hypertext Markup Language (HTML) functionalities to hide spam contents, such as character-entity encoding and URL encoding (Constales.book 2005). Character encoding provides a means to include special characters in HTML documents, while URL encoding provides a means to hide information in links that can be transmitted over the internet. Researchers can build models that can identify spam contents in email headers and URLs. These models can be trained on features that are extracted from different parts of email headers, including sender address, recipient address, email subject, etc. These models should also be trained on features that can identify encoded information in URLs. Table 1 shows a summary of some selected ML-based and NI-based spam detection techniques in the literature.
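To make the two encodings mentioned above concrete, the following Python sketch decodes URL-encoded and character-entity-encoded text so that hidden words become visible to a feature extractor. It uses only the standard library (`urllib.parse.unquote` and `html.unescape`); the function name and the sample strings are assumptions for illustration.

```python
import html
from urllib.parse import unquote


def decode_obfuscated(text):
    """Undo URL (percent) encoding, then HTML character-entity encoding."""
    return html.unescape(unquote(text))


# Character-entity encoding hiding the word "free".
print(decode_obfuscated("Get &#102;&#114;&#101;&#101; prizes"))
# URL encoding hiding the path segment "win".
print(decode_obfuscated("http://example.com/%77%69%6e"))
```

Comparing the raw and decoded forms can itself serve as a feature, since heavily encoded text that decodes to ordinary words suggests an attempt to evade keyword filters.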
Summary of spam detection techniques
ML-based techniques are arguably among the best techniques that can be used to effectively handle spam detection. This is because of their ability to automatically classify datasets by searching and extracting hidden patterns from them [19]. Figure 5 shows an example of spam detection using ML and NI algorithms. As shown in the figure, NI algorithms are typically used to extract relevant features from training datasets. Based on these features, ML models are then built for spam detection. Different ML algorithms have been used to design spam detection systems, including SVM, RVM, ANN, NB, KNN, LR, DT, SAIS, and RF. Most of these algorithms produced different results, which were largely influenced by the quality of the dataset and the suitability of the features and parameters used to train them. The next subsections present a survey of spam detection techniques that have been designed for social network spam, review spam, spam emails, and webspam.

Spam detection using NI and ML algorithms.
Generally, companies are concerned with customer feedback and reviews because feedback is one of the factors considered by customers when purchasing certain products or services. Unfortunately, online reviews are not very reliable and trustworthy, because spammers can manipulate these reviews to promote or devalue products and services. They can create fake or deceitful reviews for profit. This practice is typically known as review spam.
Feature selection techniques for review spam detection
Thanks to Natural Language Processing (NLP), fake reviews can be detected using ML algorithms. One of the primary issues in review spam detection is the lack of differentiating words (or features) that can be used to accurately classify online reviews as fake or real [50]. A typical approach is the bag-of-words approach, where the presence of individual words or groups of words is used as distinguishing features. However, findings from some studies indicate that this approach is not enough to train a classifier for effective review spam detection. Many other feature extraction approaches have been adopted in the literature. The authors in [83,96,113] used individual words from online reviews as features. Shojaee et al. [153] used lexical features (such as word-based features), while Ott et al. [124] used unigram and bigram term-frequencies as features.
Asghar et al. [16] considered features from three domains: opinion spam (review-based features), opinion spammer (reviewer-based features) and item spam (product-based features). They designed a rule-based feature weighting technique for classifying review spam in the context of big data. It consists of two components for computing feature values and feature weights. They also introduced a spam score computation technique for computing and assigning scores to each feature based on its priority and role in spam detection. Furthermore, they introduced a new set of spamicity-related features and combined them with the rule-based weighting scheme for classification of review spam. The spamicity-related features were designed based on the spam probabilities of different words. The technique was evaluated on an Amazon-based dataset that consists of 142.8 million product reviews. Moreover, they used the Natural Language Toolkit (NLTK) Python platform, which provides a suite of text processing libraries for classification, tagging, stemming, and other pre-processing tasks. Results from the evaluation show that the proposed weighting scheme achieved a classification accuracy of 98% and improved the accuracy of review spam detection by 3%.
Shojaee et al. [153] introduced a new technique for review spam detection using lexical and syntactic features. As mentioned above, lexical features are word-based features, while syntactic features are features that represent the writing style of the reviewers, such as the occurrence of function words or punctuation. Based on the features, they designed two ML models using SVM and NB classifiers. Moreover, they compared the performance of the combined feature set to the performance of using either lexical or syntactic features. The results revealed that the hybrid features produced the best performance, achieving an F-measure of 84% using SVM.
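To show how word-based (lexical) features feed a classifier such as NB, the sketch below implements a small multinomial Naive Bayes model with Laplace smoothing on unigram counts. It is a generic textbook formulation written for illustration, not the implementation used in [153]; the class name, the whitespace tokenizer, and the toy training reviews are all assumptions.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over unigram counts with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for doc, label in zip(docs, labels):
            tokens = doc.lower().split()
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, doc):
        tokens = doc.lower().split()
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.class_counts.items():
            # log prior + sum of smoothed log likelihoods
            score = math.log(n_docs / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for token in tokens:
                score += math.log((self.word_counts[label][token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, made-up training data for illustration only.
train_docs = [
    "best hotel ever amazing amazing",
    "amazing perfect best stay",
    "room was small but bathroom clean",
    "bathroom small location near station",
]
train_labels = ["fake", "fake", "truthful", "truthful"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict("amazing best hotel"))
print(model.predict("small bathroom"))
```

Syntactic features such as function-word or punctuation counts could be appended to the same token stream or handled as additional inputs to an SVM.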
Ott et al. [124] introduced another technique for review spam detection using n-gram based features. They built an SVM model using bigram and unigram features (extracted from negative reviews) and achieved an accuracy of about 86%. Moreover, they evaluated the classifier on both positive and negative reviews and observed that the accuracy of the classifier slightly reduced. This suggests that separating spam review detection into positive sentiment and negative sentiment spam review detection is helpful. Furthermore, based on the findings in [82], review spam detection techniques can be improved if n-gram features are combined with other types of features, such as bag-of-word features.
Jindal and Liu [82] introduced an ML technique for identifying opinion spam in reviews. They constructed a dataset consisting of 5.8 million product reviews from Amazon, generated by over 2 million users. They extracted several features from the dataset and built LR, SVM and NB models with the features. The models were evaluated, and they achieved promising results. As shown in their study, text-based features alone cannot be used to build effective models for review spam detection; adding other types of features improves the performance of classifiers. However, as the number of features increases, the dimensionality of the dataset increases, leading to an increase in computational complexity and possibly to overfitting [73]. Therefore, there is a need to build models with a balanced speed-accuracy trade-off. This is an area for future research.
Review spam detection techniques for multiple domains
Li et al. [97] argued that existing supervised ML methods are trained on features that are specific to certain domains. Therefore, they introduced a classification framework for detecting review spam in different domains, specifically restaurants, hotels, and doctors. The framework was based on the Sparse Additive Generative Model (SAGE) originally introduced by [61]. The authors generated a combination of different models using SAGE and SVM. Moreover, they investigated different feature extraction techniques and observed that general features like LIWC output and POS tag frequencies perform better than unigram features when performing cross-domain classification. However, unigram features produce a better result when performing intra-domain classification (e.g. restaurant reviews only). This indicates that a generic model cannot effectively detect review spam in different domains. Therefore, cross-domain classifiers require features that can effectively identify spam in different review domains.
Review spam detection techniques for different languages
Online reviews are not restricted to one language; reviewers have the liberty to write reviews in different languages. Although some features (e.g. word features) will change per language, many of the features will remain unchanged [50]. Abu Hammad [2] introduced an approach for spam detection in Arabic online reviews. They noted that the class distribution of online reviews is imbalanced. Given this, they created a new dataset (which was very imbalanced) and applied Random Oversampling (ROS) and Random Undersampling (RUS) techniques to balance the class distribution in the dataset. Furthermore, they trained SVM and NB classifiers on the balanced dataset and discovered that NB produced the best performance, achieving an F-measure of 99.59%.
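The two resampling strategies named above can be sketched as follows. This is a generic illustration of ROS and RUS, not the procedure used in [2]; the function names, the fixed random seed, and the toy data are assumptions.

```python
import random
from collections import Counter


def _group_by_class(samples, labels):
    groups = {}
    for sample, label in zip(samples, labels):
        groups.setdefault(label, []).append(sample)
    return groups


def random_oversample(samples, labels, seed=0):
    """ROS: duplicate random minority samples until all classes match the largest."""
    rng = random.Random(seed)
    groups = _group_by_class(samples, labels)
    target = max(len(items) for items in groups.values())
    balanced = []
    for label, items in groups.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        balanced.extend((sample, label) for sample in items + extra)
    return balanced


def random_undersample(samples, labels, seed=0):
    """RUS: keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_class(samples, labels)
    target = min(len(items) for items in groups.values())
    balanced = []
    for label, items in groups.items():
        balanced.extend((sample, label) for sample in rng.sample(items, target))
    return balanced


reviews = ["a", "b", "c", "d", "e"]
classes = ["ham", "ham", "ham", "ham", "spam"]
print(Counter(label for _, label in random_oversample(reviews, classes)))
print(Counter(label for _, label in random_undersample(reviews, classes)))
```

Oversampling keeps all the original data at the cost of repeated minority examples, while undersampling discards majority examples; resampling is normally applied to the training split only.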
Review spam detection techniques for spammer groups
Most studies in the domain of review spam detection proposed techniques for identifying fake reviews written by individual reviewers. However, few studies have attempted to identify fake reviews written by a group of malicious reviewers (called spammer groups). The authors in [113] introduced a technique for identifying spammer groups. They created a labelled group dataset and used frequent itemset mining [7] to search for spammer groups. Moreover, they evaluated the group dataset on some supervised ML algorithms (SVM, LR and Support Vector Regression), and the results indicate that applying supervised learning techniques to identify spammer groups is not very effective. They therefore introduced a relation-based technique for detecting spammer groups, called Group Spam Rank (GSRank). The technique is designed to rank the candidate groups based on their likelihood of being spam. The technique was evaluated, and results indicated that the proposed approach outperforms other baseline approaches for identifying spammer groups.
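The idea of using frequent itemset mining to find candidate spammer groups can be illustrated by treating each product as a transaction whose items are the reviewers who reviewed it. The sketch below counts fixed-size reviewer sets by brute force rather than using the algorithm of [7], and it is not the GSRank method; the function name, `group_size`, `min_support`, and the toy data are all assumptions.

```python
from collections import Counter
from itertools import combinations


def candidate_groups(reviewers_by_product, group_size=2, min_support=3):
    """Return reviewer sets that co-reviewed at least `min_support` products."""
    support = Counter()
    for reviewers in reviewers_by_product.values():
        # Each product contributes at most once to any reviewer set.
        for group in combinations(sorted(set(reviewers)), group_size):
            support[group] += 1
    return {group: count for group, count in support.items() if count >= min_support}


# Toy, made-up data: product id -> reviewers who reviewed it.
reviews = {
    "p1": ["u1", "u2", "u3"],
    "p2": ["u1", "u2"],
    "p3": ["u2", "u1", "u4"],
    "p4": ["u3", "u4"],
}
print(candidate_groups(reviews))
```

Frequent co-reviewing only yields candidates; ranking which of them behave like spammer groups is the step that a method such as GSRank addresses.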
Semi-supervised learning for review spam detection
Most studies focused on building supervised learning approaches. A key challenge in these approaches is obtaining ground truth for the training dataset. Given this, some studies introduced semi-supervised learning approaches that rely on both labelled and unlabeled datasets. Stanton and Irissappane [161] proposed a semi-supervised GAN for spam detection in online reviews (called spamGAN). The technique consists of three modules, namely: generator, discriminator, and classifier. The generator module (which acts as the reinforcement learning agent) is used to generate new sentences (called fake sentences) belonging to either spam or non-spam class. The discriminator module is used to distinguish between fake and real sentences. It informs the generator (via rewards) whether the sentence it generates is realistic or not. The competition between the generator module and discriminator module improves the quality of the sentences that are generated. The classifier module contains a classifier that is trained on the original dataset and the fake sentences produced by the generator. The classifier’s performance on the fake sentences is used as feedback to improve the quality of sentences produced by the generator. The technique was evaluated on a dataset obtained from TripAdvisor [125], and it outperformed other supervised learning approaches, achieving an F1 score of 86.8%.
Aghakhani et al. [6] introduced another method for review spam detection, called FakeGAN. Unlike standard GANs that use one generator model and one discriminator model, FakeGAN consists of two discriminator models and one generator model. The generator is modelled as the reinforcement learning agent. The two discriminator models are trained simultaneously to estimate the intermediate action-value (using Monte Carlo search algorithm) and pass it as a reward to the generator. The authors claim that using two discriminator modules allows the generator to learn the data distribution in both truthful and deceptive reviews. The technique was evaluated on TripAdvisor hotel reviews, and results show that it produced similar accuracy compared to supervised learning approaches.
Review spam detection technique for positive and negative reviews
Ott et al. [125] designed three ML-based techniques for detecting review spam. They created a new dataset using Amazon Mechanical Turk [12] and TripAdvisor [171]. The dataset consists of 400 truthful reviews and 400 fake reviews. In a different study [124], they created another dataset of the same size and combined it with the first dataset. The dataset contains negative reviews. They extracted three groups of features from the dataset and used them to train two classifiers: SVM and NB. The result revealed that SVM outperforms NB, achieving a classification accuracy of 89.8%.
Machine learning techniques for social network spam detection
In recent times, cybercriminals have diverted their attention to social platforms, such as social media, microblogging websites, and mobile chat platforms. Therefore, some researchers and developers are designing improved techniques for combating the spread of spam contents on mobile and social networks. Different social network spam detection techniques have been proposed in the literature. This section presents a survey of some ML-based social network spam detection techniques.
Conventional machine learning techniques for social spam detection
Aswani et al. [17] designed a hybrid spam detection technique for identifying spam profiles in social networks. They examined over 1.8 million tweets from 14,235 Twitter users (using social media analytics) and extracted several content-based features from the tweets, including hashtag frequency, retweet count, URL count, and unique word count. Moreover, they extracted user-profile-based features from the tweets, such as follower count, tweet count, friend count, etc. Besides, they introduced a novel set of features, called semantic features, and extracted them from the tweets. Finally, based on the extracted features, they designed a hybrid approach using Levy-Flight Firefly Algorithm (LFA), Chaotic Optimization Algorithm (COA) and K-means algorithm to cluster the Twitter users into spam and non-spam.
Dutta et al. [60] introduced a feature selection technique to improve the detection of spam profiles in an online social network. Previous feature selection techniques ignore the mutual dependencies between features, which might lead to the selection of highly correlated features. Therefore, Dutta et al. [60] introduced a graph-based greedy approach to select features with mutual dependencies. They used graph theory to model the dependencies between features, then used the concept of RST to select a feature subset. They evaluated the technique on five datasets, and they focused on Tweets that have links since spam is mostly circulated through links to malicious websites. Moreover, they extracted several attributes (using their attribute selection technique) from the datasets, including Tweet text-based attributes, URL-based attributes, and user-profile-based attributes. Finally, based on the extracted attributes, they built different classification models (using eight ML algorithms) for classifying spam from non-spam.
Adewole et al. [4] introduced a unified framework for spam account detection on the Twitter microblogging website. The framework consists of 10 ML algorithms and one bio-inspired algorithm (EA). The bio-inspired algorithm was used to identify a reduced set of features for spam account detection in the Twitter social network. Besides, they introduced a set of new graph-based features and combined it with some existing features identified in the literature. Based on these features, they trained 10 ML algorithms, namely: RF, J48, ADTree, SVM, MLP, AdaBoost, Decorate, LogitBoost, Bayes Network, and Random committee. The results indicate that the LogitBoost classifier produced the best result for spam account detection, achieving an accuracy of 93.2%.
Alsaffar et al. [11] performed a comparative analysis of seven learning algorithms for Twitter spam detection, namely RF, NB, Bayesian Network, SVM, KNN, MLP, and RNN. They compared the performance of the algorithms based on two test options: 10-fold cross-validation and percentage split. They performed the analysis on a Twitter dataset consisting of 10,000 instances and 12 features: account age, number of followers, number of following, number of user favourites, number of lists, number of tweets, number of retweets, number of hashtags, number of user mentions, number of URLs, number of characters, and number of digits. The analysis indicated that RF outperforms all the compared ML algorithms including RNN. The poor performance of RNN is likely due to the size of the dataset used for training. Moreover, the result showed that 10-fold cross-validation produces a better result than the percentage split.
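The difference between the two test options can be made concrete with a small sketch that only produces index splits. The function names and the default split fraction are assumptions, and no shuffling is performed, which a real evaluation would normally add.

```python
def percentage_split(n, train_fraction=0.66):
    """Single hold-out split: the first part trains, the remainder tests."""
    cut = int(n * train_fraction)
    indices = list(range(n))
    return indices[:cut], indices[cut:]

def k_fold_indices(n, k=10):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    indices = list(range(n))
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

splits = list(k_fold_indices(20, k=10))
hold_train, hold_test = percentage_split(10, train_fraction=0.5)
```

With a percentage split the reported score depends on one particular partition, whereas k-fold cross-validation averages over k partitions, which may partly explain why the two options yield different results on a dataset of this size.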
TinyURL is a link shortening web service that provides short aliases for long URLs. TinyURL can be exploited by spammers to transmit malicious content in social networks. Padmanabhan et al. [126] introduced a technique for identifying spam in TinyURLs, specifically focusing on Twitter tweets. They extracted a set of reduced features from different tweets and used these features to train three ML classifiers: LR, DT, and SVM. The results showed that SVM outperforms the other classifiers in identifying spam from TinyURLs.
Deep learning techniques for social spam detection
Most of the feature selection approaches proposed in the literature used hand-crafted features. However, hand-crafted feature selection approaches can be cumbersome and time-consuming. Besides, most of the existing techniques are based on traditional ML algorithms; very few studies focused on deep learning algorithms. Deep learning algorithms are popularly used for image classification or regression problems. They can also be used for Natural Language Processing (NLP) problems, such as social network spam detection. Ma et al. [106] introduced an approach that can automatically identify rumours from microblogs using deep learning. The authors assumed that people exposed to dubious claims tend to dispute its truthfulness by commenting on it or forwarding the claim. This generates a continuous stream of posts and a long-distance dependency of evidence. Given this sequential nature of text streams in a social network, Ma et al. [106] used RNN to learn the variations of contextual information posted by users over a period. They achieved this by modelling the social context information of an event as a variable-length time series. The authors evaluated the technique on two datasets collected from Twitter and Sina Weibo, consisting of millions of rumours and non-rumour posts, and it achieved a prediction accuracy of 91.0%.
Generally, spam patterns change over time, and this change affects the performance of existing ML-based classifiers. This phenomenon is known as spam drift. Due to the problem of spam drift, ML algorithms cannot efficiently identify spam activities in real-life settings [184]. Wu et al. [184] proposed a deep learning-based technique for identifying spam in social networks. They collected different Tweets and used the WordVector approach to convert the tweets to feature vectors. Based on the feature vectors, they built a CNN-based classification model for identifying Twitter spam. The results obtained show that deep learning-based techniques perform better than text-based spam filtering methods. Jain et al. [80] introduced a hybrid deep learning framework for spam classification in a social network. The framework consists of two algorithms, namely CNN and Long Short-Term Memory (LSTM). They used CNN to extract the relevant n-gram features from text sentences and used LSTM to capture long-term and high-level dependencies of these features. The authors used word2vec to convert texts to feature vectors. They also used semantic dictionaries (such as WordNet and ConceptNet) to extract the closest semantic word to a given word. This enables the classification model to learn semantic vector representations of words. The proposed architecture was evaluated on a Twitter dataset, and it achieved an accuracy of 95.48%.
Semi-supervised learning for social spam detection
One of the latest trends in social network spam is semi-supervised learning for big social data analysis. Most studies rely heavily on a labelled dataset and ignore the huge volume of unlabelled data that is currently available. Semi-supervised techniques are well suited for harnessing the insights in both labelled and unlabelled datasets.
Li et al. [99] introduced a semi-supervised framework for spam detection in a social network. The framework is designed to iteratively train a classifier by using a social graph. They extracted features from a small labelled dataset and used them to train an initial classifier. Further, they designed a ranking model to generate trust and distrust samples based on the social relationship of users in the small dataset. Moreover, they used the initial classifier to label the top-ranked users and selected the high-confidence users as the new dataset. The new dataset is then used to re-train the initial classifier. These steps are repeated until the classifier cannot be further improved. The technique was evaluated, and the results show that it can effectively identify social spammers in situations where labelled data is insufficient. [194] introduced another semi-supervised framework for social spam detection that combines co-training with a k-medoids clustering algorithm. They used the k-medoids clustering algorithm to generate their initial dataset. Furthermore, they designed a co-training classification framework that takes advantage of the behaviour-based and content-based features of users and uses them for incremental training. They compared the performance of the proposed framework with other supervised learning algorithms and it produced promising results.
Sedhai and Sun [151] proposed a semi-supervised framework for spam detection on Twitter. The framework is divided into two modules, namely: spam detection module and model update module. The spam detection module operates in real-time mode and it is designed to detect different types of tweets using four detectors. The first detector is designed to identify tweets that contain blacklisted URLs. The second detector identifies tweets that are near duplicates of pre-labelled tweets. It uses the MinHash algorithm [33] to achieve this. The third detector detects reliable tweets. Tweets are considered to be reliable if they do not contain spam words and if they are posted by reliable users. The fourth detector uses three classifiers (LR, NB, and RF) to detect tweets that are not processed by any of the first three detectors. The second module of the framework (i.e. model update module) is used to update the models of the four detectors accordingly. It uses the new data extracted from the first module to generate a new dataset for re-training the classification models. The incremental training gives the framework the ability to capture new vocabulary and new spam words. The framework was evaluated on a fraction of the HSpam14 dataset [150] and its performance was compared to three ML algorithms: NB, LR, and RF. The result shows that it outperformed the three classifiers in terms of F1 score.
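The near-duplicate detector relies on MinHash, whose core idea can be sketched briefly: the fraction of positions at which two MinHash signatures agree estimates the Jaccard similarity of the underlying shingle sets. The sketch below is illustrative only; the shingle length, the number of hash functions, and the use of seeded MD5 digests as the hash family are assumptions, not details taken from [151].

```python
import hashlib

def shingles(text, k=3):
    """Set of k-word shingles of a tweet (a single shingle if the text is short)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the minimum hash value over the set."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int(hashlib.md5(f"{seed}:{item}".encode()).hexdigest(), 16)
            for item in items
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

known_spam = minhash_signature(shingles("win a free phone now click here today"))
near_copy = minhash_signature(shingles("win a free phone now click here tonight"))
unrelated = minhash_signature(shingles("the council meeting has moved to friday"))

near_score = estimated_jaccard(known_spam, near_copy)
far_score = estimated_jaccard(known_spam, unrelated)
```

A tweet would then be flagged as a near duplicate when its estimated similarity to a pre-labelled spam tweet exceeds a chosen threshold.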
Niranjan Koggalahewa et al. [117] proposed a fully unsupervised approach for identifying spammers in a social network based on peer acceptance. Peer acceptance refers to the degree to which a user is accepted by his peers. It can be obtained from common shared interests over numerous posts. The authors generated the peer acceptance of several users from an unlabelled dataset containing post contents and built an unsupervised ML model for identifying spammers. Although the approach did not outperform supervised learning techniques, it produced promising results that can be further improved in future studies.
Machine learning techniques for web spam detection
Webspam is a spam detection problem that has affected the effectiveness of search engines. It leads to undesirable web pages being displayed to users and causes search engines to waste a significant amount of computational resources and storage space. Given this, some studies introduced effective techniques for web spam detection. There are three popular approaches used to identify web spam, namely: graph-based approach, content-based approach, and cloaking-based approach. Graph-based approaches are based on the web graph or hyperlinks of webpages. A web graph describes the directed links between different pages of the world wide web. The content-based approaches are based on the contents of web documents, while cloaking-based approaches are designed to identify cloaking spam. Cloaking refers to a search engine optimization technique in which the content submitted to the search engine is different from the content displayed to the web user (through the web browser).
Graph-based techniques for web spam detection
One of the popular approaches used to identify webspam is the graph-based approach. Many graph-based web spam detection approaches exploit the hyperlink structure between webpages to identify webspam [180]. In the graph-based approach, webpages are the nodes and links between the webpages are the edges. Considering a web graph and estimating the trustworthiness of a webpage in terms of its validity, webspam can be identified. Sattar et al. [148] introduced a technique for web spam detection based on a web graph. In the study, the authors introduced some graph-based features for web spam detection. Based on these features, they built different classification models for web spam detection using five ML algorithms, namely: SVM, KNN, LR, NB and RF. Moreover, they evaluated the technique on two datasets, and the results show that graph-based features achieve similar or better results compared to content-based features for web spam detection.
The PageRank algorithm is another graph-based technique that can be used for web spam detection. Classical PageRank algorithms are typically designed to evenly assign weights to links, thereby disregarding the authority of webpages. Yu et al. [189] introduced an improved PageRank algorithm based on web page differentiation (DPR). The algorithm is designed to assign weights to webpages based on their page authority. The authority of each webpage is evaluated based on the number of links on a webpage. Also, Yu et al. [189] introduced a page-based clustering technique (called DPK-Means); they combined the proposed DPR technique with K-means clustering. The technique was evaluated and the result shows that the DPR algorithm outperforms existing page ranking algorithms. Moreover, the result shows that the DPK-Means algorithm produced better performance than K-means clustering. Some studies in the literature designed TrustRank and Anti-TrustRank algorithms for web spam detection. TrustRank algorithms are link-based spam detection algorithms that follow the principle that legitimate webpages tend to point to other legitimate webpages. Likewise, Anti-TrustRank algorithms are link-based web spam detection algorithms that are designed on the principle that spam webpages are likely to be referenced by other spam pages [180]. Whang et al. [180] designed an asynchronous Anti-TrustRank algorithm for web spam detection using the Gauss-Seidel algorithm. The algorithm is designed to reduce the number of arithmetic operations compared to the classical synchronous Anti-TrustRank methods. The algorithm was evaluated on a real-world Web graph collected from NAVER Corporation (a popular search engine company in Korea), and the result shows that it outperforms synchronous Anti-TrustRank algorithms.
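The Anti-TrustRank principle can be illustrated with a minimal synchronous sketch: distrust starts at known spam pages and is propagated backwards along hyperlinks, in the manner of a PageRank computation on the reversed graph with teleportation restricted to the spam seeds. This is the classical synchronous form and not the asynchronous Gauss-Seidel variant of [180]; the function name, damping factor, iteration count, and toy graph are assumptions.

```python
def anti_trust_rank(out_links, spam_seeds, damping=0.85, iterations=50):
    """Propagate distrust from spam seeds to the pages that link to them.

    out_links maps each page to the list of pages it links to.
    Returns a distrust score per page (higher means more suspicious).
    """
    pages = set(out_links) | {q for targets in out_links.values() for q in targets}
    # in_links[q] lists the pages that point to q in the original graph.
    in_links = {p: [] for p in pages}
    for p, targets in out_links.items():
        for q in targets:
            in_links[q].append(p)
    seed_mass = {p: (1.0 / len(spam_seeds) if p in spam_seeds else 0.0) for p in pages}
    score = dict(seed_mass)
    for _ in range(iterations):
        updated = {}
        for p in pages:
            # Each page q that p links to shares its distrust equally among
            # all pages linking to q; p collects its share from every such q.
            received = sum(score[q] / len(in_links[q]) for q in out_links.get(p, []))
            updated[p] = (1 - damping) * seed_mass[p] + damping * received
        score = updated
    return score

web = {"a": ["spam"], "b": ["a"], "c": ["good"], "good": [], "spam": []}
distrust = anti_trust_rank(web, spam_seeds={"spam"})
```

In this toy graph, page `a` links directly to the spam seed and page `b` links to `a`, so both inherit some distrust, while `c` and `good` receive none. An in-place (Gauss-Seidel style) variant would reuse freshly updated scores within the same sweep instead of building a separate `updated` dictionary.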
Content-based techniques for web spam detection
Another popular approach used for webspam detection is the content-based approach. Some of these approaches focused on extracting features that can effectively distinguish between spam and legitimate webpages. Asdaghi and Soleimani [15] introduced a feature selection technique for web spam detection using a novel backward elimination feature subset selection technique. The main idea of the technique is to measure the impact of eliminating a subset of features on the performance of a classifier, instead of eliminating a single feature. The goal of the technique is to select the largest subset of features whose omission maximizes the classifier’s performance. Moreover, the authors introduced a performance metric particularly suitable for an unbalanced dataset. The technique was evaluated on the WEBSPAM-UK2007 dataset and seven ML algorithms, and the result showed that RF outperformed all the algorithms, achieving an accuracy of 94.86%. Some studies [93,128] introduced a new set of link-based and content-based features for improved web spam detection, including word length, number of webpage words, and number of stop words in the webpage title. These features were extracted from the content of webpages and used to train ML models for classifying spam webpages.
Lu et al. [103] designed a hybrid technique for web spam detection based on ensemble classifiers. Firstly, they used the under-sampling technique to convert the unbalanced dataset to balanced dataset. Furthermore, they used the ICA algorithm to select feature subsets from the WEBSPAM-UK2006 dataset. Finally, based on the selected features, they built a model for webspam classification using an ensemble of C4.5 decision tree classifier. In another study [104], the same authors designed a similar approach for web spam detection. However, in this study, they used CSA for feature selection.
Many of the webspam detection techniques proposed in the literature are based on ML algorithms; very few explored deep learning algorithms. Makkar et al. [107] introduced a deep learning-based framework for web spam detection using RNN. The framework consists of two preprocessing algorithms (PCA, Recursive Feature Elimination (RFE)) and RNN. PCA was used for dimension reduction, RFE was used for feature selection, and RNN was used to build the classification model. The framework was evaluated, and the result shows that the preprocessing algorithms improved the classification accuracy by over 24%.
Cloaking-based techniques for webspam detection
Some studies [41,93,94,182] introduced different techniques for identifying cloaking spam. [94] introduced a set of new cloaking-based features suitable for training webspam detection models. They also proposed an improved multi-class SVM webspam detection technique that can identify three types of webspam, namely: content-based, link-based, and cloaking-based spam. The technique was trained on three datasets, and it produced a classification accuracy of 92.6%, 94.5%, and 94.8% for the three datasets, respectively. Najork [116] introduced a method for identifying cloaked web servers. In the method, an object was obtained by sending a request to a web server from a crawler, and a second object was obtained by sending a request to the server from a web browser. If there is a disparity between the two objects, then the webserver is classified as cloaked.
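The comparison at the heart of this kind of cloaking check can be sketched as follows. The sketch assumes that the two copies of a page (one served to a crawler, one served to a browser) have already been fetched; the term-set Jaccard similarity and the threshold are assumptions chosen for illustration, not the disparity measure used in [116].

```python
import re

def term_set(html):
    """Lower-cased alphanumeric terms appearing in a page copy."""
    return set(re.findall(r"[a-z0-9]+", html.lower()))

def looks_cloaked(crawler_copy, browser_copy, min_similarity=0.8):
    """Flag a page when the crawler and browser copies share too few terms."""
    a, b = term_set(crawler_copy), term_set(browser_copy)
    if not a and not b:
        return False  # two empty copies show no disparity
    similarity = len(a & b) / len(a | b)
    return similarity < min_similarity

served_to_crawler = "cheap pills buy now best pills online"
served_to_browser = "welcome to my gardening blog"
flagged = looks_cloaked(served_to_crawler, served_to_browser)
```

Legitimate dynamic pages (rotating adverts, timestamps) also differ between two fetches, so a practical detector would need to separate that normal variation from deliberate cloaking.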
Wu and Davison [182] proposed three methods for identifying cloaking spam. In the first method, two copies of URLs were extracted from the web browser (
Chang et al. [41] introduced an ML-based technique for identifying cloaking URLs based on features of URL redirects. The method identifies fake webpage contents by first using an online system to transmit the URL of a queried website to a client (such as a mobile device). The mobile device then transmits a URL log back to the online system, which includes the URLs that the mobile device accessed when it was requesting the contents from the website. The online system then extracts a feature from at least one URL in the URL log, and then feeds the feature into an ML model that was trained to identify cloaked websites. The model was trained on features of URL redirects. The model generates a score that indicates the likelihood that a website is a cloaked website. If the score is greater than a specified threshold, then the website is classified as cloaked.
Semi-supervised learning for webspam detection
Karimpour et al. [86] proposed a semi-supervised learning approach for webspam detection using the Expectation-Maximization (EM) algorithm. The EM algorithm is used to learn a classification model from a small set of labelled data and a large set of unlabeled data. The small dataset is used to build an initial classification model. The classification model is then used to label the samples in the unlabeled dataset as spam or non-spam. After labelling, the new samples are added to the training process and used to re-train the classification model. The process continues until there are no new labels to predict. The method was evaluated on the WEBSPAM-UK2007 dataset and the results show that it outperforms three supervised learning approaches (NB, C4.5, and Bayesian network), achieving an F-Score of 86%.
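The iterative labelling loop described above can be sketched generically. The sketch is a hard-label self-training loop rather than a full EM procedure with soft assignments, and it is not the implementation of [86]: the one-dimensional nearest-centroid classifier, the confidence measure, and the threshold are stand-ins chosen to keep the example self-contained.

```python
def fit_centroids(labelled):
    """Toy classifier: the mean feature value of each class."""
    sums, counts = {}, {}
    for x, y in labelled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict_centroid(model, x):
    """Return the nearest class and a confidence in [0.5, 1.0]."""
    ranked = sorted(model, key=lambda y: abs(x - model[y]))
    near, far = ranked[0], ranked[-1]
    d_near, d_far = abs(x - model[near]), abs(x - model[far])
    total = d_near + d_far
    confidence = 1.0 if total == 0 else 1.0 - d_near / total
    return near, confidence

def self_train(labelled, unlabelled, threshold=0.8, max_rounds=10):
    """Repeatedly add confidently predicted samples to the training set and re-train."""
    labelled, pool = list(labelled), list(unlabelled)
    model = fit_centroids(labelled)
    for _ in range(max_rounds):
        confident, uncertain = [], []
        for x in pool:
            label, confidence = predict_centroid(model, x)
            if confidence >= threshold:
                confident.append((x, label))
            else:
                uncertain.append(x)
        if not confident:
            break  # no new labels to add, so stop
        labelled.extend(confident)
        pool = uncertain
        model = fit_centroids(labelled)
    return model, labelled, pool

model, labelled, remaining = self_train(
    labelled=[(0.0, "ham"), (10.0, "spam")],
    unlabelled=[1.0, 9.0, 5.2],
)
```

In this toy run the two clear-cut samples are absorbed in the first round, while the ambiguous sample stays unlabelled because its confidence never reaches the threshold.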
Zhang et al. [193] introduced a novel semi-supervised learning approach for webspam detection, called harmonic functions based semi-supervised learning. In the study, they represented labelled and unlabeled web pages as vertices in a web graph. The learning problem is then modelled as a Gaussian random field on the web graph, where the mean of the random field is described by different harmonic functions that can be computed using the matrix method. They evaluated the technique on the WEBSPAM-UK2006 dataset and the result shows that it is effective and can benefit from a large volume of unlabeled data.
Tian et al. [169] introduced another semi-supervised approach for webspam detection. The approach begins by extracting different features from a labelled dataset. The features are then used to train an initial classifier. Furthermore, the initial classifier is used to classify the new samples in the unlabeled dataset. The labelled samples are then added to the training process for re-training. In the study, the authors used a combinatorial feature-fusion method to reduce the feature size of the dataset. The technique was evaluated, and experimental results show that the hybrid approach outperformed three ML algorithms, producing an AUC score of 93.1%, 62.8%, and 77.2% for ADTree, SVM, and NB algorithms, respectively.
Machine learning techniques for spam email detection
Many ML algorithms have been used to build effective models for spam email detection. This section provides a review of some related works in spam email detection.
Conventional machine learning techniques for spam email detection
SVM is one of the popular ML algorithms that has been used to design improved spam email detection systems. It is one of the prevalent supervised ML algorithms, with robust theoretical background, excellent classification accuracy and good generalization performance [23]. [8,55,65,165,191] designed different spam email detection techniques using SVM and NI algorithms (such as ACO, CSA, BA). Some studies used the NI algorithms to extract relevant features or instances, while some studies used the algorithms to optimize SVM parameters for improved performance. The selected features, instances and parameters were then used to build effective SVM models for spam email detection.
ANN is another popular algorithm that has been used for spam detection. Nosseir et al. [120] designed a character-based spam email detection approach using ANN. They extracted different words from email datasets and divided them into three groups, based on their character size. Specifically, they considered words containing three, four, and five characters, such as AIR, POET, and CLAIM. They also divided the words into “bad” and “good” words, where 50% of the words were assigned to “bad” words and the other 50% were assigned to “good” words. They built different classification models based on each group and the five-character neural network produced the best result, achieving an FP rate of 99.90% for good words and 99.85% for bad words. In another study, Kufandirimbwa and Gotora [92] designed a spam email classifier using ANN and Perceptron Learning Rule (PLR). They extracted different features from the header and body parts of emails. Based on the features, they developed two classification models using ANN and PLR. They evaluated the models on a small dataset, and it produced an FP rate of 97.14%. The result looks promising; however, the model was built with small datasets, hence it may not generalize well to unseen samples.
A good number of studies used NB classifier for spam email classification because of its effectiveness for text classification. For example, Bhagyashri et al. [27] and Teli and Biradar [166] designed a spam detection technique based on NB algorithm. They used the Term Frequency (TF) algorithm to select important keyword-based features and then used NB algorithm to build a classification model based on the selected features. Rusland et al. [141] performed an analysis on NB algorithm and reported that its performance is affected by email type and dataset size. They trained NB classifier on two datasets of different sizes and the results show that it performs better on datasets with small sizes compared to datasets with large sizes. Awad and ELseuofi [18] compared the performance of six ML algorithms on the SpamAssassin dataset. They extracted 100 word-based features from the dataset and used them to train the six ML algorithms, namely: NB, k-NN, ANNs, SVMs, AIS, and Rough sets. The evaluation shows that NB produced the best results, achieving a prediction accuracy of 99.46%.
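To illustrate why NB suits word-based spam features, the following is a minimal multinomial NB sketch with Laplace smoothing over whitespace-separated tokens. It does not reproduce the TF-based keyword selection of [27,166] or any other surveyed system; the function names and the four-message toy corpus are assumptions.

```python
import math
from collections import Counter

def train_nb(messages, labels):
    """Count messages per class and word occurrences per class."""
    class_docs = Counter(labels)
    word_counts = {c: Counter() for c in class_docs}
    for text, c in zip(messages, labels):
        word_counts[c].update(text.lower().split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return class_docs, word_counts, vocabulary

def predict_nb(model, text):
    """Pick the class with the highest log prior plus log word likelihoods."""
    class_docs, word_counts, vocabulary = model
    total_docs = sum(class_docs.values())
    best_class, best_score = None, -math.inf
    for c in class_docs:
        score = math.log(class_docs[c] / total_docs)
        denominator = sum(word_counts[c].values()) + len(vocabulary)
        for word in text.lower().split():
            if word in vocabulary:  # words never seen in training are ignored
                score += math.log((word_counts[c][word] + 1) / denominator)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

emails = ["win free prize now", "free cash win", "meeting at noon", "lunch meeting tomorrow"]
labels = ["spam", "spam", "ham", "ham"]
nb_model = train_nb(emails, labels)
```

For example, `predict_nb(nb_model, "free prize")` returns `"spam"` on this toy corpus, because both words occur only in the spam messages.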
Ensemble-based techniques for spam email detection
Some studies introduced ensemble-based techniques for spam email detection. Trivedi and Dey [172] introduced an ensemble-based technique, consisting of three classifiers, namely: Boosted Bayesian, Boosted NB and SVM. They used a committee selection mechanism to combine the decisions of the three classifiers. In the mechanism, SVM is the president, while Boosted NB and Boosted Bayesian classifiers are the committee members. They used a greedy stepwise method to select features and built an ensemble-based email classification model with the features. In another study, Gupta et al. [72] introduced an ensemble-based spam detection method consisting of four classifiers, namely: Gaussian NB, Multinomial NB, Bernoulli NB, and DT. They used a voting classifier to evaluate the accuracy of different combinations of classifiers in the ensemble. The results obtained show that the use of a voting classifier produces better accuracy than individual classifiers. However, ensemble classifiers are slower than single classifiers because they involve the combination of multiple classifiers.
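A hard-voting ensemble of the kind described above can be sketched in a few lines. The three rule-based functions below are hypothetical stand-ins for trained classifiers and are not taken from [72] or [172]; only the voting step is the point of the example.

```python
from collections import Counter

def majority_vote(classifiers, message):
    """Return the label predicted by most classifiers (hard voting)."""
    votes = [classify(message) for classify in classifiers]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical stand-in classifiers, each looking at one crude cue.
def has_link(message):
    return "spam" if "http" in message else "ham"

def mentions_free(message):
    return "spam" if "free" in message.lower() else "ham"

def is_shouting(message):
    return "spam" if message.isupper() else "ham"

ensemble = [has_link, mentions_free, is_shouting]
verdict = majority_vote(ensemble, "FREE gift at http://x.example")
```

An odd number of voters avoids ties in the two-class case. The sketch also makes the cost noted above visible: every message is passed through every member of the ensemble.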
Other ML algorithms have been used to design spam email classification systems, however, some of them have not been fully explored, such as RoF. Shuaib et al. [157] performed a comparative analysis of 14 classification algorithms. They evaluated the algorithms on spambase dataset [76], and the analysis showed that RoF produced the best result, achieving a prediction accuracy of 94.2%. The authors did not use any feature selection method in their study; hence the accuracy can be further improved by exploring feature selection methods. Table 2 shows a summary of all the ML-based detection techniques surveyed in this paper.
Semi-supervised learning for spam email detection
In the literature, many supervised learning algorithms have been studied for email spam detection; however, little work has been done on semi-supervised learning for email spam detection. Li et al. [98] proposed a semi-supervised learning technique for email classification that leverages both labelled and unlabeled datasets. The technique consists of three main stages, namely: feature preparation, training, and classification. In the first stage, features were extracted from emails and converted into two sets of features: internal feature set and external feature set. The internal feature set contains features that are related to the content of an email, while the external feature set contains features that are related to routing and forwarding. In the second stage, a semi-supervised learning algorithm was used to build classification models based on the labelled features extracted from the first stage. The built classification models were used to automatically label and classify unlabeled instances. In the last stage, a decision is made by classifying the emails as spam or non-spam. The technique was evaluated on the spambase dataset, and the result shows that it outperforms four other supervised ML approaches, achieving a prediction accuracy of 88.84%. Meng et al. [111] introduced a technique for email classification using a semi-supervised learning approach and data reduction. The data reduction technique is used to select optimal features for training, while the semi-supervised learning algorithm is used to improve the classification accuracy of spam detection models by automatically using unlabeled datasets. They evaluated the technique on three datasets and the result shows that it improved the classification accuracy of email detection and reduced false positives.
Whissell and Clarke [181] introduced a semi-supervised approach for email spam filtering using the k-means clustering algorithm. They considered a specific scenario for semi-supervised spam filtering, in which a large amount of data is available for training but only a small number of labels can be obtained for it. They presented two approaches for this scenario, both starting with a clustering of the training emails. The first approach uses the true labels of the medoids of each email cluster to train a spam classifier. The trained classifier is used to filter the test data. The second approach is similar to the first; however, it uses the true label of each cluster’s medoid as the label for each email in the cluster. The technique was evaluated on two email datasets and the result shows that it outperforms other semi-supervised techniques. In the experiments, different numbers of clusters were evaluated, and the best result was achieved when the number of clusters was set to 50.
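The second approach (propagating each medoid's label to its whole cluster) can be sketched as follows. The clustering step is assumed to have been done already; the Jaccard distance over word sets and the `ask_label` oracle, which stands for obtaining a true label from a human, are assumptions made for illustration and are not details from [181].

```python
def jaccard_distance(a, b):
    """1 minus the Jaccard similarity of the word sets of two emails."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return 0.0 if not union else 1.0 - len(set_a & set_b) / len(union)

def medoid(cluster, distance=jaccard_distance):
    """The member with the smallest total distance to the other members."""
    return min(cluster, key=lambda c: sum(distance(c, other) for other in cluster))

def propagate_medoid_labels(clusters, ask_label, distance=jaccard_distance):
    """Request one true label per cluster (for its medoid) and copy it to all members."""
    labelled = []
    for cluster in clusters:
        label = ask_label(medoid(cluster, distance))
        labelled.extend((email, label) for email in cluster)
    return labelled

clusters = [
    ["free money now", "free money", "free money today"],
    ["project meeting", "project meeting notes", "meeting"],
]
# Hypothetical oracle standing in for a human annotator.
oracle = lambda email: "spam" if "free" in email else "ham"
training_set = propagate_medoid_labels(clusters, oracle)
```

Only two labels are requested here, yet six emails end up labelled; the quality of the result depends on how pure each cluster is.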
Summary of ML-based techniques
NI-based methods are generally used to solve complex real-world problems, such as feature selection, instance selection and parameter optimization. They are becoming popular in the domain of ML applications because they are easy to implement and they rely on some basic concepts [156]. Also, they can bypass local optima and can be used in a wide range of applications [156]. Each NI algorithm is designed with different strategies; however, their basic principles remain similar [156]. Their goal is to search for a solution that maximizes or minimizes a user-specified objective function. NI algorithms generally take note of the current optimum and execution is terminated based on different user-defined conditions, such as the number of iterations. In the context of ML speed optimization, NI algorithms are designed to select a reduced set d of features or instances, that produces the highest possible predictive accuracy, where d is a subset of a larger dataset. Three techniques can be used to select relevant features from datasets, namely: the wrapper technique, filter technique, and hybrid technique. The main difference between the three techniques is in their method of selection. The wrapper-based feature selection methods use a classifier or mining algorithm to evaluate the accuracy of different subsets in the feature selection stage [122]. The filter-based techniques do not depend on a classifier for subset evaluation; features are selected based on a fitness function [122]. The hybrid technique combines the two methods by using their evaluation benchmarks in different search stages [104].
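The difference between filter-based and wrapper-based selection can be sketched as follows. The example is illustrative only: the filter score (absolute difference between the class-conditional feature means), the nearest-centroid classifier used as the wrapper's evaluator, and the greedy forward search are all assumptions chosen for brevity. In the NI-based techniques surveyed here, the greedy loop would be replaced by the search strategy of the NI algorithm.

```python
def filter_scores(X, y):
    """Filter approach: score each feature without training a classifier.

    Score = |mean over spam rows - mean over ham rows| (illustrative fitness).
    """
    spam = [row for row, label in zip(X, y) if label == 1]
    ham = [row for row, label in zip(X, y) if label == 0]
    return [
        abs(sum(r[j] for r in spam) / len(spam) - sum(r[j] for r in ham) / len(ham))
        for j in range(len(X[0]))
    ]


def centroid_accuracy(X, y, subset):
    """Wrapper fitness: training accuracy of a nearest-centroid classifier
    that only sees the features listed in `subset`."""
    def centroid(label):
        rows = [r for r, l in zip(X, y) if l == label]
        return [sum(r[j] for r in rows) / len(rows) for j in subset]

    c0, c1 = centroid(0), centroid(1)
    correct = 0
    for row, label in zip(X, y):
        d0 = sum((row[j] - c) ** 2 for j, c in zip(subset, c0))
        d1 = sum((row[j] - c) ** 2 for j, c in zip(subset, c1))
        correct += (1 if d1 < d0 else 0) == label
    return correct / len(y)


def wrapper_forward_selection(X, y, max_features):
    """Wrapper approach: grow the subset while classifier accuracy improves."""
    selected, best = [], 0.0
    while len(selected) < max_features:
        candidates = [j for j in range(len(X[0])) if j not in selected]
        if not candidates:
            break
        acc, j = max(
            ((centroid_accuracy(X, y, selected + [j]), j) for j in candidates),
            key=lambda t: t[0],
        )
        if acc <= best:  # no improvement: stop searching
            break
        selected.append(j)
        best = acc
    return selected, best
```

The filter function touches each feature once, whereas the wrapper trains and evaluates a classifier for every candidate subset, which is why wrapper methods are usually more accurate but more expensive.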
The following are some NI algorithms that have been used in the literature for spam detection: ACO, PSO, GA, EA, CS, SFLA, IWO, BA, FFA, ICA, FFO, AIS, CSA and SAIS. The next subsections present a survey of some NI-based techniques that have been used to improve the performance of spam detection classifiers.
Nature-inspired based techniques for email spam
Wang et al. [179] proposed a hybrid feature selection technique for email classification, called Document frequency and Term frequency combined Feature Selection (DTFS). The technique consists of two feature selection algorithms, namely Optimal Document Frequency-based Feature Selection (ODFFS) and Optimal Term Frequency-based Feature Selection (OTFFS). The two algorithms were used to select optimal features based on different predetermined thresholds. The authors argued that thresholds are dataset-dependent; hence, they used the harmony search optimization algorithm to select the optimal threshold for each dataset. This threshold improves the trade-off between the features selected by ODFFS and OTFFS. The feature subsets selected by the two algorithms were combined and used to train SVM and NB. They compared the performance of the technique with other conventional feature selection algorithms, such as term frequency-based information gain, and the experimental results show that DTFS outperformed the other algorithms. Karthika and Visalakshi [87] introduced another feature selection technique for improving spam email classifiers. They used ACO to select a subset of spam email features, and the selected features were used to train an SVM classifier and improve its prediction accuracy. [191] combined PSO with ANN for feature selection. They used PSO to select relevant features and ANN to evaluate the fitness of the feature subset in different iterations. The best feature subset was then used to build an SVM model for spam email classification. [156] used WAO to select important features from two email datasets: Spambase and Enron-Spam [112]. The WAO algorithm selected 55 of the 58 features in the Spambase dataset and 426 of the 1054 features in the Enron-Spam dataset. The selected features were then combined with the RoF algorithm for the classification of emails.
Behjat et al. [24] introduced a feature selection technique for detecting spam emails. They used GA to reduce the feature dimensionality of the LingSpam dataset, and the reduced dataset was used to train a Multi-Layer Perceptron (MLP). The results show that GA reduced the dataset dimensionality and improved the prediction accuracy of the MLP classifier. In another study, [138] used FFA to select optimal features for email spam classification. The selected features were used to train an NB classifier, and experimental results show that the proposed FFA-based technique outperforms PSO and NN in terms of classification accuracy. Some studies [75,140] introduced different NI-based spam pattern-finding techniques for automatically identifying spam content. These techniques use different NI algorithms (such as GP) to extract spam patterns from datasets. Based on the extracted patterns, spam classification models were built. The models were evaluated, and the results showed that GP can generate spam patterns suitable for effectively classifying spam content.
In addition to feature selection, some studies introduced instance selection techniques for improving the speed of ML algorithms. Instance selection techniques are used to remove superfluous and harmful instances from training datasets. Superfluous instances are instances that contribute negligibly to the classification accuracy of a classifier, while harmful instances are instances that lead to increased FP and FN rates [32]. Because superfluous and harmful instances contribute little to classification accuracy, discarding them does not have a negative impact on the overall performance of ML algorithms [32]. Very few studies designed instance selection techniques, even though these techniques can effectively improve the performance of spam detection systems. Akinyelu and Adewumi [8] introduced two NI instance selection techniques based on the CS algorithm and the bat algorithm. The two algorithms were used to select reduced sets of instances for SVM speed optimization. The algorithms were evaluated on different datasets, including spam email detection datasets, and the results showed that instance selection techniques can be used in combination with ML algorithms to produce improved spam detection models. In another study, Anwar et al. [14] designed an ACO-based instance selection technique (called ADR-Miner) for improving the prediction accuracy of classification models. They used ACO to select relevant instances from 20 datasets. Based on the selected instances, they trained different classification algorithms, and the results show that ACO improved the prediction accuracy of the classification models.
Nature-inspired based techniques for webspam
NI-based feature selection has not been fully explored in the domain of web spam detection. Lu et al. [104] are among the few authors who used NI algorithms for feature selection in web spam detection. They used CSA to design a feature selection technique for web spam detection. First, they used under-sampling to handle the data imbalance problem in the WEBSPAM-UK2006 dataset. Next, CSA was applied to the balanced dataset to select optimal feature subsets. The selected feature subsets were used to build an ensemble of C4.5 decision tree classifiers. The results show that CSA improved the prediction accuracy of the ensemble classifier by over 5%. In a different study, Lu et al. [103] introduced a similar method for web spam detection, using ICA to select optimal features for building decision tree models.
Nature-inspired based techniques for review spam
Rajamohana et al. [134] introduced a technique for selecting optimal features in website reviews using the adaptive binary flower pollination algorithm (BFPA) and the NB algorithm. They used BFPA to select features and the NB algorithm as the objective function. They evaluated the technique, and the experimental results showed that it selected informative features and produced high classification accuracy. Rajamohana and Umamaheswari [132] introduced a hybrid feature selection technique using BPSO and BFPA. According to the authors, the hybrid approach overcomes the slow convergence rate of PSO and takes advantage of the ability of BPSO to generate new solutions that improve the feature search process. The authors used BPSO to select a set of features, and then passed the best solutions found by the algorithm to BFPA for further optimization. Finally, BFPA outputs the best solution it obtains. The hybrid approach was evaluated on SVM, NB, and k-NN, and experimental results show that it improved on the fitness values of standard BPSO and BFPA by 4.42% and 2.45%, respectively.
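To make the role of a binary swarm in feature selection concrete, the following is a minimal sketch of a generic binary PSO, in which each particle is a 0/1 mask over the features and a sigmoid transfer function turns velocities into bit probabilities. It is not the hybrid BPSO and BFPA method of [132]; the parameter values (`w`, `c1`, `c2`, the velocity clamp of 6.0) and the `toy_fitness` function are assumptions for illustration. In a wrapper setting, the fitness would instead be the accuracy of a classifier such as NB trained on the selected features.

```python
import math
import random


def binary_pso(fitness, n_features, n_particles=10, n_iter=30,
               w=0.7, c1=1.5, c2=1.5, seed=0):
    """Generic binary PSO that maximizes `fitness` over 0/1 feature masks."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(n_features):
                r1, r2 = rng.random(), rng.random()
                v = (w * vel[i][d]
                     + c1 * r1 * (pbest[i][d] - pos[i][d])
                     + c2 * r2 * (gbest[d] - pos[i][d]))
                vel[i][d] = max(-6.0, min(6.0, v))            # clamp velocity
                prob = 1.0 / (1.0 + math.exp(-vel[i][d]))     # sigmoid transfer
                pos[i][d] = 1 if rng.random() < prob else 0
            fit = fitness(pos[i])
            if fit > pbest_fit[i]:                            # update personal best
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:                           # update global best
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit


# Hypothetical fitness: reward two "informative" features, penalize extras.
RELEVANT = {0, 2}


def toy_fitness(mask):
    hits = sum(1 for j in RELEVANT if mask[j])
    extras = sum(mask) - hits
    return hits - 0.5 * extras
```

A hybrid scheme of the kind described above would pass the best masks found by this loop to a second optimizer as its starting population.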
Rajamohana and Umamaheswari [133] proposed a hybrid feature selection approach for review spam detection using an improved version of the binary PSO algorithm and SFLA. Initially, PSO was used to select a set of optimal features. Further, the optimized feature set was provided as an input to SFLA to further reduce the feature set. The final feature subset was used to train three classifiers: NB, k-NN, and SVM. Experiments show that the hybrid approach reduced the feature set by over 50% and improved the classification accuracy of k-NN and NB. In addition to feature selection, some authors introduced rule-based systems for review spam detection. Rule extraction techniques are designed to discover classification rules for prediction. Manaskasemsak and Rungsawang [108] introduced a rule-based algorithm for web spam detection. They used ACO to extract relevant classification rules from the WEBSPAM-UK2006 and WEBSPAM-UK2007 datasets. The extracted rules were used to distinguish between non-spam and spam hosts. The technique was evaluated, and the results show that it outperformed popular rule-based classification algorithms, achieving an AUC score of 0.899 and 0.784 for WEBSPAM-UK2006 and WEBSPAM-UK2007, respectively.
Nature-inspired based techniques for social network spam
Aswani et al. [17] introduced a hybrid method for identifying spam profiles in social networks using a bio-inspired technique and social media analytics. They combined the Levy-Flight Firefly Algorithm (LFA) with K-Means for clustering Twitter users into spam and non-spam. The LFA was used to select the best clusters from the K-Means algorithm. The authors claimed that combining LFA with the K-Means algorithm prevented K-Means from falling into a local optimum, thus obtaining a globally optimal solution for computationally intensive problems. The chaotic optimization algorithm (COA) was also used to tune certain coefficients of the LFA. Barushka and Hajek [21] introduced another NI-based technique for spam detection in social networks. They used a multi-objective evolutionary algorithm to minimize the misclassification cost and select relevant features for spam filtering. The selected features were used to train different ML algorithms, including Deep Neural Networks (DNN), RF, and NB. The results show that the DNN-based technique produced the best result.
Table 3 shows a summary of some NI spam detection techniques and the specific problems they solved (e.g. feature selection, instance selection, rule extraction or parameter optimization).
Summary of nature-inspired spam detection techniques
Many studies have contributed towards the fight against spam. As shown in Fig. 6, most of the studies focused on reducing spam in emails, while some studies focused on spam detection in other electronic platforms. Furthermore, as shown in Fig. 7, many studies designed feature engineering techniques for improving the effectiveness of spam detection filters in social networks, online reviews, web pages, and spam emails. Besides, some studies developed techniques for extracting classification rules for spam detection, while other studies introduced parameter optimization techniques for optimizing the performance of classifiers.

Summary of spam detection problems solved in the literature.

Summary of spam detection techniques proposed in the literature.
Many studies proposed different types of features and feature engineering techniques for spam detection in emails, web pages, microblogging sites, and online reviews. One of the most commonly used techniques is the bag-of-words technique, where features consist of individual words or groups of words extracted from each sample in the dataset. Some studies used other features, including lexical features, n-gram features, POS tags, and LIWC, while others used a combination of different features to build spam detection models. Experiments performed by Li et al. [97] showed that combining general features (such as POS and LIWC) with bag-of-words features produces a better model than using the bag-of-words approach alone. Moreover, Mukherjee et al. [114] noted that using linguistic features (such as n-grams and POS) to build review spam filters is not very effective in real-world scenarios, and that using only POS unigram or n-gram features to build classification models is not very effective either. Therefore, the features and feature engineering techniques used to build spam detection classifiers must be chosen carefully. Different feature engineering techniques have been used in the literature, such as TF-IDF and information gain. However, some of them did not significantly improve the performance of spam detection systems [114]. Therefore, as shown in Table 3, some studies introduced NI-based feature engineering algorithms for improved spam detection in different electronic platforms. Some of the popular NI algorithms that were used in the literature include ACO, FPA, PSO, DT, GP, KHA, CSA, BA, WOA, FFA, FFO, ICA, IWD, AIS, and EA. In addition to feature selection, some studies used NI algorithms to select optimal parameters for training ML-based spam detection classifiers. Results achieved by these techniques show that NI algorithms can be used to select optimal and effective features and parameters for improving the predictive accuracy or training speed of spam detection classifiers in different domains.
Supervised, semi-supervised, and unsupervised techniques for spam detection
Most studies in the literature focused on supervised learning approaches; however, some studies introduced unsupervised and semi-supervised learning techniques for detecting review spam [96,113] and spam emails [44,88,131,147,169]. These approaches have advantages and disadvantages. Supervised learning approaches are reliable and accurate because their input data are well known and labelled. However, they can be more complex than unsupervised techniques, computationally expensive, and ineffective for real-time spam filtering, especially when a large volume of data is processed. Unsupervised learning approaches can be less complex and more effective than supervised learning approaches, because they can model hidden patterns in data in real time and swiftly track the evolution of spam attacks. Moreover, supervised learning approaches mainly depend on labelled datasets for learning. Labelling these datasets can be very cumbersome, expensive, and prone to error, especially when big datasets are processed (which is the case in this fourth industrial revolution era), and mislabelled data can affect the effectiveness and overall performance of a classifier. Conceptually located between supervised and unsupervised learning approaches is the semi-supervised approach. Semi-supervised learning takes advantage of the strengths of supervised and unsupervised learning techniques and overcomes their shortcomings. It harnesses the large volume of unlabelled data available and combines it with a smaller set of labelled data. It is very useful for building improved models when labelled data are insufficient. Experimental results achieved by some studies [151,193] show that semi-supervised learning approaches can be more effective than some supervised learning approaches for real-time spam detection in social networks. Therefore, there is an obvious need for big data solutions that can handle dynamic, big, and growing unlabelled datasets. These approaches may improve the performance of spam filtering systems in speedily detecting incoming spam attacks. As mentioned above, some studies introduced unsupervised learning approaches for spam detection; however, most of them did not explore big data solutions. This is a good research direction that future studies can explore.
Rule-based and tree-based techniques for spam detection
Some studies introduced rule-based techniques for identifying spam emails [75,109,146,155], for filtering malicious online reviews [83] and for improving the detection capacity of web spam filters [108]. The idea of rule-based systems is to capture knowledge and encode it as rules. Rule-based classifiers are easy to design and easy to interpret, and they can classify new instances quickly. However, their prediction accuracy is not as reliable as that of other classifiers, such as SVM, RF, and CNN. Moreover, rule-based classifiers are static: their filtering capacity does not change when spam patterns change. Given this, they cannot be relied upon as standalone classifiers for spam detection, because they can be easily bypassed by new and sophisticated spam attacks. In addition, advanced rule-based systems can be expensive and time-consuming to maintain, and it can be challenging to add rules to systems with large knowledge bases. Rule-based techniques can be used in combination with other spam detection techniques to achieve improved performance. Furthermore, some studies proposed tree-based classifiers (such as DT and RF) for spam detection in emails [55,88,156,195], social networks [4], SMS [45] and web pages [103,104]. Tree-based classifiers are good spam detectors and generally produced better performance than rule-based classifiers. Compared to other classifiers, tree-based classifiers require little data preparation. Unlike other algorithms (such as CNN and SVM), they do not require normalization and scaling of data, and missing values do not significantly affect the building of DTs. Besides, unlike black-box classifiers (such as ANN, CNN and SVM), insights from tree-based classifiers can be easily interpreted and explained to non-experts. However, tree-based models also have some drawbacks. A small change in data can lead to a change in the tree structure and consequently affect the stability of the model.
Nature Inspired techniques for spam detection
Figure 8 shows how frequently different NI algorithms have been applied in the literature. As shown in the figure, PSO, ACO and GA have been widely used for spam detection. GA is a good optimization technique suitable for optimal dimension reduction [167]. GA is also an effective optimization algorithm suitable for parameter optimization. It can be used in combination with ML algorithms to design robust and fast spam detection solutions. However, GA is time-consuming [195], and it requires more parameter tuning [77]. Moreover, GA cannot effectively search for a perfect solution, and it is not suitable for local optimization [59]. Memetic Algorithm (MA) is a good substitute for GA. It is an improved optimization algorithm that has not been fully explored in the literature for spam detection. MA uses a local search technique together with GA to search for improved solutions [59]. Moreover, it can handle local optimization [59]. Another popular NI algorithm used in the literature is PSO. It has fewer parameters than GA, and it does not have complex, time-consuming operators such as crossover and mutation [183]. In PSO, time is mainly consumed during fitness function evaluation [177]. Besides, PSO is quicker than GA in locating an optimal solution [177]. However, the execution time of PSO is affected by data size and feature size. ACO is another widely used NI algorithm for spam detection. It can be used in combination with other techniques to obtain better performance. However, an increase in the population size and other parameters of ACO will increase its computational time [62]. RST is another algorithm that has been applied to spam detection. It is efficient when applied to datasets with little noise and few features. However, it is time-consuming and it does not have good data reduction capacity [177].

NI algorithms used for spam detection.
Figure 9 shows how often different ML algorithms have been used in the literature. As shown in the figure, SVM, NB, C4.5 and NN are the four most widely used algorithms. The performance analysis carried out by Yu and Xu [188] revealed that SVM produced better results than NN and NB. However, the performance of SVM is affected by the number of support vectors [188]. The training speed and accuracy of SVM can be improved by combining it with NI algorithms. NB is another popular algorithm that is widely used for spam detection. However, NB-based techniques can be easily bypassed [27]. Furthermore, the performance of NB is affected by large feature sets [10] and changes in the class ratio (e.g. the spam to ham ratio) [1]. C4.5 is another widely used ML algorithm. It is suitable for real-world problems because of its ability to effectively handle incomplete datasets and datasets with discrete and continuous values. Moreover, C4.5 is time-efficient and can be used to build small or large DTs. The NN algorithm is also widely used in the literature. However, NN is not a good stand-alone spam classifier [92,197]. Besides, the NN algorithm takes a long time to train, and its accuracy is affected by dataset and feature size [71].

ML-based techniques used for spam detection.
Most of the classification algorithms used in the literature are conventional ML algorithms. Few studies designed deep learning-based spam detection techniques. Jain et al. [80] combined two deep learning algorithms (CNN and Long Short-Term Memory neural network) and trained them on the semantic vector representations of words from two knowledge bases, namely WordNet and ConceptNet. Besides, Feng et al. [68] proposed a deep learning approach for improving spam detection in social networks using CNN. [184] introduced a deep learning approach that can be used for learning the syntax of Tweets on Twitter. They used WordVector to convert different Tweets to feature vectors and then used the features to build a CNN model. Some of these studies achieved promising results, which shows the potential of deep learning algorithms for spam detection. Deep learning spam solutions can be implemented in a distributed environment (such as Hadoop) using big data solutions (such as Spark and Mahout). Doing this can improve the speed of learning and the generalization performance of spam detection models, since the models will be exposed to a large volume of data. Besides, designing big data solutions can improve the effectiveness of spam filters and give them the capacity to keep up with the fast, evolving rate of spam attacks.
Scalability issues in spam detection classifiers
Most of the studies did not report on scalability and resource requirements, which are very important for building spam detection solutions. Most studies used conventional ML algorithms (such as SVM, NB, and C4.5), while some of them used deep learning algorithms (such as DNN, CNN, and RNN). Generally, some conventional ML algorithms have scalability issues [100]. They are not effective enough to process and utilize the large volume of features present in high-dimensional datasets. Moreover, deep learning algorithms require a tremendous amount of resources (such as memory) when processing large volumes of data. Since most techniques in the literature were built with small-scale datasets, their memory requirements are not very high; hence, their scalability to large-scale datasets has not been fully evaluated. Therefore, they do not properly represent real-world scenarios, where millions of data samples are generated regularly. Considering the rise in network speed, the low cost of storage devices, and the increase in data usage, big datasets are now generated by many internet users. This implies that there is a need to design more scalable spam detection solutions that allow fast computation on large-scale datasets in cost-effective ways. Big data solutions, such as Apache Mahout and Spark, can be used to accomplish this task.
Despite all the techniques that have been proposed in the literature, spam detection remains a major problem that has not been fully solved. The world is seeking smart solutions that can tackle the evolving patterns of spam attacks: solutions that can adapt to changing spam patterns and accurately protect users from falling victim to spam attacks. Table 4 provides a summary of different existing spam detection techniques.
Summary of anti-spam detection techniques
Different performance measures have been used in the literature to evaluate the effectiveness of spam detection models, including accuracy, precision, recall, F-Measure, Geometric Mean, area under curve, total cost ratio, batting average, root mean square, specificity, sensitivity, positive prediction value, negative prediction value, and correlation coefficient. Table 5 shows a summary of some performance measures used in the literature and the domains in which they were used. Generally, the choice of performance measure depends on the ML algorithm and the learning problem at hand. As shown in Table 5, the performance of spam detection techniques (on different platforms) is most commonly evaluated using accuracy, F-Measure, precision, and recall. The choice of performance measures should be considered carefully when evaluating spam detection models.
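Several of the measures listed above can be derived from the four entries of a binary confusion matrix, treating spam as the positive class. The sketch below uses the standard textbook definitions; the function name `spam_metrics` and the choice to return 0.0 when a denominator is zero are assumptions made for this illustration.

```python
import math


def _ratio(num, den):
    """Safe division: returns 0.0 when the denominator is zero (assumption)."""
    return num / den if den else 0.0


def spam_metrics(tp, fp, tn, fn):
    """Common measures from confusion-matrix counts (spam = positive class)."""
    precision = _ratio(tp, tp + fp)      # positive prediction value
    recall = _ratio(tp, tp + fn)         # sensitivity
    specificity = _ratio(tn, tn + fp)
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "negative_prediction_value": _ratio(tn, tn + fn),
        "f_measure": _ratio(2 * precision * recall, precision + recall),
        "g_mean": math.sqrt(recall * specificity),
    }
```

For example, with 40 true positives, 10 false positives, 45 true negatives and 5 false negatives, the accuracy is 0.85 and the precision is 0.8, while the recall is 40/45 (about 0.889). On imbalanced datasets, accuracy alone can be misleading, which is one reason measures such as F-Measure and Geometric Mean are reported alongside it.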
Performance measure for spam detection techniques
Datasets are the backbone of ML-based spam detection techniques; without them, ML models cannot be trained or tested. A good number of dataset repositories are available for the empirical analysis of ML algorithms. Table 6 provides a list of some publicly available datasets that were used in the literature to train and evaluate spam detection and classification models for emails, web spam, social network spam, and review spam. As shown in the table, the following datasets have been popularly used for evaluating email spam techniques: SpamAssassin, Spambase, Enron, and TREC. Similarly, for web spam, most studies evaluated their techniques on WEBSPAM-UK2006 and WEBSPAM-UK2007. For review spam, most of the studies evaluated their techniques on a ‘gold standard’ dataset collected by [124,125]. The dataset contains positive and negative reviews.
Dataset information
Data collection and annotation are important parts of building ML models. The reliability and sample distribution of datasets are very important issues that must be considered when collecting a dataset. Data samples in a dataset are expected to be representative of their real-world distribution so that developers can build unbiased ML models with good generalization performance. Although a massive amount of data is available on the internet, collecting and labelling a large-scale dataset to train a model can be challenging, error-prone, and complex [194]. One option is to employ crowdsourcing services, such as Amazon Mechanical Turk (AMT). These services grant anyone access to unknown online workers willing to complete small tasks. Ott et al. [125] used AMT to generate a dataset for review spam. They created a pool of tasks (called human intelligence tasks) and allocated them to different online workers. Each online worker can use a maximum of 30 minutes to work on a task. After completing and submitting the task, they are paid $1. As reported by Ott et al. [125], a total of 400 acceptable fake reviews were collected in 14 days. If it took 14 days to collect just 400 reviews (which is a small number), then it will likely take many more days (or even months) to collect a large-scale dataset. This illustrates how challenging it can be to collect large-scale datasets. In another study, Yadav et al. [186] also used the crowdsourcing approach to collect a large-scale spam dataset. They ran an incentivized scheme on their campus and awarded food coupons to participants who sent a certain number of SMS spam messages to their server. They gathered nearly 4000 samples from users in less than two months. Another option for obtaining large-scale datasets is to generate synthetic samples, where fake samples are generated from truthful samples. As an example, Sun et al. [164] used this technique to create a dataset for review spam detection. They designed an algorithm that can generate a synthetic review from a pool of truthful reviews. They evaluated this method, and the results showed that synthetic datasets can be used to build effective models for spam detection. [63,135] also used SMOTE to generate synthetic samples from existing datasets.
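The core step of SMOTE can be sketched in a few lines: a synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its nearest minority-class neighbours. The sketch below is a simplified illustration of that interpolation step, not the implementation used in [63,135]; the function name `smote`, the default `k=2`, and the fixed random seed are assumptions, and at least two minority samples held as separate list objects are required.

```python
import math
import random


def smote(minority, n_new, k=2, seed=0):
    """Generate `n_new` synthetic minority samples by interpolation.

    minority -- list of numeric feature vectors from the minority class
    k        -- number of nearest minority neighbours to choose from
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of `base` (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic
```

Because every synthetic point lies between two existing minority samples, the generated data stay inside the region already covered by the minority class, which is also why such samples may not capture genuinely new spam behaviour.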
After data collection, labels must be assigned to each of the data samples. Typically, researchers use one of two methods to obtain annotations for datasets: manual annotation or crowdsourced annotation. In manual annotation, two or more annotators are employed to manually label the data samples [83,95]. The degree of consistency between the annotations is computed (using techniques such as Cohen's kappa [125]), and the annotations are accepted or rejected based on the consistency measure [152]. In crowdsourced annotation, researchers pay voluntary workers to contribute samples. For example, [123–125] used Amazon Mechanical Turk to employ several workers to write artificial reviews for 20 hotels from TripAdvisor. [186] sent food coupons to voluntary users who contributed spam samples to their server.
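As a worked illustration of the consistency check mentioned above, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below follows this standard definition; the function name and the convention of returning 1.0 when chance agreement is already perfect are assumptions.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same samples."""
    n = len(labels_a)
    # Observed agreement: fraction of samples given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)
```

For instance, if two annotators agree on 8 of 10 samples and each labels 4 samples as spam and 6 as ham, then p_o = 0.8, p_e = (4*4 + 6*6) / 100 = 0.52, and kappa = 0.28 / 0.48, which is about 0.58.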
These methods produced good results; however, they have their shortcomings. Manual annotation is time-consuming and challenging. Also, there is a high probability of false-positive annotation, because spam samples cannot be easily identified by users who do not have enough knowledge of the domain [114]. In addition, the employed annotators might not do a very good job if the incentive provided for annotation is poor [114]. On the other hand, crowdsourced samples may not be representative of real-world samples [114]. Moreover, since researchers have no control over the quality of crowdsourced annotators, spammers can dominate the annotations by randomly assigning labels to samples without looking at the samples themselves [136]. This can increase the cost of acquiring annotations and degrade the quality of the annotated dataset [136]. [136,152] proposed techniques that can be used to improve the annotation of spam datasets.
Dataset issues in spam detection
As shown in Table 4, most of the review spam detection techniques in the literature did not yield very good results, possibly because of insufficient training data. Some studies trained their technique on a small-scale “gold-standard” dataset [124] consisting of 800 instances, while some trained their techniques on highly imbalanced datasets, such as Yelp [187]. Moreover, some studies built their models with synthetically created datasets (such as those collected through Amazon Mechanical Turk (AMT) [12]). Generally, small-scale and imbalanced datasets produce poor models that cannot generalize well to unseen data. Moreover, training models on synthetic samples is a challenge, because synthetic spam samples do not properly represent real-world spam samples. Experiments performed by Mukherjee et al. [114] revealed that models built on synthetic samples produced poor generalization performance. They trained a model and evaluated it on two datasets: a synthetically created dataset (from AMT) and a real-world dataset (from Yelp). The synthetic dataset produced an accuracy of 87.8%, while the real-world dataset produced an accuracy of 65%. This drop of about 22.8 percentage points suggests that models built on synthetically created samples cannot be used to effectively filter real-world spam samples. Future research can consider designing improved models for online review spam detection. These techniques can be built with large-scale datasets consisting of real-world data samples. Moreover, sampling techniques (such as ROS and RUS) can be explored to improve the sensitivity of the models towards spam detection. Besides, future research should avoid using crowdsourced reviews to build detection models for review spam, because crowdsourced fake reviews do not perfectly represent real-world fake reviews [114].
Typically, real-world datasets are imbalanced; they consist of a large proportion of “normal” samples and a much smaller proportion of “abnormal” samples. Besides, the cost of misclassifying abnormal instances is arguably higher than the cost of misclassifying normal instances [42]. Therefore, constructing classifiers from imbalanced datasets can reduce the sensitivity of the classifier to the minority class and lead to biased accuracy and overfitting. Sampling techniques can be used to improve the quality of datasets, and some studies introduced them as an approach to tackling imbalance in spam detection. Habib et al. [75] used the Synthetic Minority Over-sampling Technique (SMOTE) to generate synthetic samples of the minority class for training. Moreover, [2,103,104] used under-sampling techniques to remove instances of the majority (or over-represented) class. Besides, Abu Hammad [2] used an over-sampling technique (ROS) to generate more copies of the minority (or under-represented) classes. The two approaches (i.e. under-sampling and over-sampling) can also be combined to effectively tackle the imbalance problem in datasets, as shown in [42]. Penalized classification algorithms (such as penalizedSVM and penalizedLDA) can also be used to handle imbalance problems in datasets. These algorithms impose an extra cost on errors made on minority classes during training. Furthermore, many studies ignored the use of ensemble learning techniques for spam detection, such as Bagging or Boosting. These techniques are very suitable for improving performance on imbalanced or noisy datasets [56].
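The two basic sampling strategies discussed above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the implementation used in any of the cited studies; the function names, the toy data, and the choice to balance every class to the same size are assumptions made for the example.

```python
import random
from collections import Counter


def _group_by_label(samples, labels):
    """Collect samples into a dict keyed by class label."""
    groups = {}
    for sample, label in zip(samples, labels):
        groups.setdefault(label, []).append(sample)
    return groups


def random_oversample(samples, labels, seed=0):
    """ROS: duplicate randomly chosen minority samples until every
    class is as large as the largest class."""
    rng = random.Random(seed)
    groups = _group_by_label(samples, labels)
    target = max(len(items) for items in groups.values())
    pairs = []
    for label, items in groups.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        pairs.extend((item, label) for item in items + extra)
    rng.shuffle(pairs)
    return [s for s, _ in pairs], [l for _, l in pairs]


def random_undersample(samples, labels, seed=0):
    """RUS: randomly discard majority samples until every class is
    as small as the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_label(samples, labels)
    target = min(len(items) for items in groups.values())
    pairs = []
    for label, items in groups.items():
        pairs.extend((item, label) for item in rng.sample(items, target))
    rng.shuffle(pairs)
    return [s for s, _ in pairs], [l for _, l in pairs]


# Toy data (placeholder values): 70 normal samples and 10 spam samples.
samples = [f"sample_{i}" for i in range(80)]
labels = ["normal"] * 70 + ["spam"] * 10

_, over_labels = random_oversample(samples, labels)
_, under_labels = random_undersample(samples, labels)
print(Counter(over_labels))
print(Counter(under_labels))
```

In practice, resampling is normally applied to the training portion only, so that the evaluation data keeps the class distribution the classifier will face after deployment.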
Comparative analysis of ML-based spam detection techniques
This section presents a comparative analysis of different ML-based spam detection techniques. The goal of the comparison is to give readers an approximate idea of the performance of spam detection techniques in the literature. Figures 10–13 show the results obtained by various ML-based spam detection techniques in the literature. The figures show, at a glance, the potential of each ML algorithm in handling spam detection for emails, social networks, and microblogging websites. Moreover, the figures show the best ML algorithms for each of the spam detection problems considered in this survey. The next section presents the methodology used for the comparative analysis.

Predictive accuracy of ML algorithms for spam email.

Predictive accuracy of ML algorithms for social network spam.

Predictive accuracy of ML algorithms for web spam.

Prediction accuracy of ML algorithms for review spam.
The studies did not all evaluate their techniques on the same dataset. Moreover, most of them did not report the same performance metrics; hence, it is difficult to report a perfect comparison of all the techniques. Given this, the following were taken into consideration during the comparison: dataset, spam platform, performance metric, ML algorithm used, and feature selection techniques.
Dataset
The studies reviewed in this paper did not all use the same dataset; therefore, to ensure a fair comparison, the comparison is largely based on studies that used the same dataset. As shown in this study, many studies used the SpamAssassin [160] and Spambase [76] datasets for spam email evaluation, while many studies used WEBSPAM-UK2006 [38] and WEBSPAM-UK2007 [39] for web spam evaluation. Moreover, many studies used a ‘gold standard’ dataset [124,125] for review spam evaluation. Therefore, in the performance analysis, we focused on studies that used the SpamAssassin dataset, Spambase dataset, WEBSPAM-UK2006 dataset, WEBSPAM-UK2007 dataset, and the ‘gold standard’ dataset, so that we can capture as many studies as possible in the analysis.
Spam platform
This paper considered four types of spam namely: email spam, web spam, social network spam, and review spam. Therefore, performance analysis was performed (separately) for the four spam types. That is, spam email detection techniques were compared to each other, web spam detection techniques were compared to each other, review spam detection techniques were compared to each other, and social network spam detection techniques were compared to each other.
ML algorithms
Different ML algorithms were considered in the literature, and they produced different results. In this analysis, for each spam platform, the performance of different ML algorithms was analyzed and compared to each other.
Feature selection
As shown in the study, different features and feature selection techniques were introduced in the literature. In the analysis, the performance of ML algorithms before and after feature selection is presented. This shows the impact and significance of features and feature selection techniques on ML classifiers.
Performance metric
Most of the techniques reviewed in this study did not report the same performance metrics, hence it is difficult to present a perfect comparison of all the techniques. As shown in Table 5, most of the studies used prediction accuracy and F-measure as their performance metric. In view of this, for each platform, prediction accuracy and F-measure are the performance metrics featured in the analysis.
Prediction Accuracy: Prediction accuracy is the most commonly used metric for measuring the performance of ML classifiers. Accuracy can be defined as the ratio of correctly predicted samples to the total number of samples in the dataset. The formula for accuracy is shown in (1) below.
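In the standard confusion-matrix notation, which is assumed here (TP and TN are the numbers of correctly classified spam and non-spam samples, FP and FN the numbers of non-spam samples classified as spam and spam samples classified as non-spam), the definition above is commonly written as:

```latex
\begin{equation}
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}
\end{equation}
```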
F-Measure: The F-measure (also called F1-score) can be defined as the harmonic mean of Recall and Precision. The equation for F-measure is shown in (2).
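For the balanced case usually meant by F1-score, where Precision and Recall are weighted equally (an assumption, since the weighting is not stated here), the F-measure takes the following form:

```latex
\begin{equation}
\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{2}
\end{equation}
```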
The equation for recall and precision are shown in (3) and (4).
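With spam taken as the positive class and the usual confusion-matrix counts assumed (TP: spam correctly flagged, FP: non-spam wrongly flagged, FN: spam missed), recall and precision are commonly written as:

```latex
\begin{equation}
\text{Recall} = \frac{TP}{TP + FN} \tag{3}
\end{equation}

\begin{equation}
\text{Precision} = \frac{TP}{TP + FP} \tag{4}
\end{equation}
```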
Performance analysis
This section presents a performance analysis of different techniques designed for spam email, web spam, social network spam, and review spam. This comparison should foster the understanding of the techniques that are best suited for building spam detection models.
Spam email
This sub-section presents the performance analysis of different ML-based spam email detection techniques. As shown in Table 5, some studies evaluated their techniques on the SpamAssassin dataset. Among these studies, Meli et al. [109] achieved the best result, followed by [18]. Meli et al. [109] designed a rule-based spam email detection technique using linear GP, and it achieved an accuracy of 99.83%. Although this technique produced a high prediction accuracy, it is not highly recommended as a standalone classifier, because rule-based systems require regular updates and can be easily bypassed by sophisticated spam attacks. The NB algorithm evaluated in [18] achieved the second-best result for the SpamAssassin dataset. The authors in [18] performed a comparative analysis of six ML algorithms, and the analysis showed that the NB algorithm outperformed the other algorithms, achieving an accuracy of 99.46%. Moreover, some studies evaluated their techniques on the Spambase dataset. Among these techniques, Shuaib et al. [156] produced the best result, achieving a prediction accuracy of 99.99%, while other techniques achieved accuracies between 81.25% and 96.45%. This shows that Shuaib et al. [156] outperformed the other techniques by at least 3.54 and at most 18.74 percentage points in prediction accuracy. Shuaib et al. [156] designed a hybrid spam detection technique using the RoF classifier and WOA. They used WOA to select relevant features for building effective models for spam email detection. This underscores the effectiveness of using NI algorithms for feature selection.
Generally, the performance of ML classifiers improves when feature selection techniques are applied. Different NI optimization algorithms have been used to select features from datasets, including ACO, WOA, CSA, BA, GA, and PSO. In [156], RoF achieved an accuracy and F-measure of 94.2% before feature selection, and 99.86% after feature selection. In [87] and [8], SVM achieved accuracies of 79.5% and 96.66% before feature selection, and 81.25% and 96.96% after feature selection, respectively. In [195], C4.5 achieved an accuracy of 91.76% before feature selection and 94.27% after feature selection. As these results show, feature selection improved the accuracy of the ML classifiers in every case, with gains ranging from 0.3 to about 5.7 percentage points. In some cases, these techniques also reduced the feature size significantly. In [195], the number of features was reduced by more than 85%, from 57 features to 7 features. In [64], the number of features was reduced by more than 50%, from 6919 to 3400. In both cases, the reduced dataset produced better prediction accuracy than the original dataset.
The quality of the features used to train an ML classifier also plays a major role in its performance. In [157], the NB algorithm produced an accuracy of 88.5% when trained on features extracted from the Spambase dataset. The same classifier (i.e., NB) in [18] achieved an accuracy of 99.46% when trained on features extracted from the SpamAssassin dataset. Moreover, in [87], SVM produced an accuracy of 79.5% when trained on features extracted from the Spambase dataset, and it achieved a higher accuracy of 96.96% in [18] when trained on features extracted from the SpamAssassin dataset. This suggests that the SpamAssassin dataset contains good features for identifying spam emails.
Different evaluation methods were adopted in the literature, including 10-fold cross validation, 20-fold cross validation, and percentage split. Among these methods, 10-fold cross validation generally produced the best result. In [11], RF produced 95.7% accuracy when 10-fold cross validation was used, and its accuracy dropped to 95.3% when a 66% percentage split was used. Moreover, BN produced 88.13% accuracy when 10-fold cross validation was used, and its accuracy dropped to 87.38% when the percentage split evaluation method was used. Furthermore, MLP produced an accuracy of 86.05% when 10-fold cross validation was used, and a lower accuracy of 85% when the percentage split method was used. Although 10-fold cross validation does not always outperform the percentage split method, in most cases it does. However, it is highly recommended that different evaluation techniques are tested, and that the technique that produces the best result is used.
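The difference between the two evaluation protocols lies in how the sample indices are partitioned. The following is a minimal sketch using only the standard library; the function names and the seed handling are assumptions for illustration, and real experiments would typically also stratify the folds by class.

```python
import random


def percentage_split(n_samples, train_fraction=0.66, seed=0):
    """Single hold-out split: one training set and one test set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = int(round(n_samples * train_fraction))
    return indices[:cut], indices[cut:]


def k_fold_splits(n_samples, k=10, seed=0):
    """k-fold cross validation: every sample is used for testing
    exactly once and for training in the other k - 1 rounds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


# Toy example with 100 samples.
train_idx, test_idx = percentage_split(100)
print(len(train_idx), len(test_idx))

for round_no, (train, test) in enumerate(k_fold_splits(100, k=10), start=1):
    print(round_no, len(train), len(test))
```

Because cross validation averages over k test folds, its estimate depends less on one particular random split than a single percentage split does, which is one reason the two protocols can report slightly different accuracies for the same classifier.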
Some classifiers did not produce very good results for spam email detection. In [157], NB produced a prediction accuracy of 88.5% and an F-measure of 88.5%. In [22], Radial Basis Function (RBF) produced an accuracy and F-measure of 82.61% and 0.828 respectively, while NB produced an accuracy and F-measure of 89.85% and 0.828 respectively. In [141], NB produced an accuracy and F-measure of 72.57% and 77% respectively. These poor results could be due to the evaluation technique, parameter selection technique, instance selection technique, or feature selection technique used in the various studies. On the other hand, some classifiers performed moderately well, while some produced very good results.
Some classifiers produced good results when trained on fewer instances and attributes. In [141], the NB classifier produced an accuracy and F-measure of 82.88% and 60% respectively when trained on a dataset with 4601 emails and 58 features. In [141], the same classifier (i.e. NB) produced a lower accuracy and F-measure of 72.57% and 77% respectively when trained on a larger dataset containing 9324 emails and 500 features. In addition, the diversity of the dataset plays a key role in the performance of ML classifiers. A classifier will produce improved generalization performance if it is trained on a large-scale dataset collected from multiple sources. For instance, in [49], the NB classifier produced an accuracy of 97% when it was trained on a dataset containing emails collected from 150 users, and the same classifier in [141] produced a reduced accuracy of 77% when it was trained on a dataset that was obtained from one email account. This shows that the performance of classifiers changes with respect to dataset size, feature size, and dataset variability.
The performance of the classifiers is largely dependent on the dataset used. The analysis presented in this section is based on the Spambase and SpamAssassin datasets. These two datasets are the most popular datasets used in the literature for spam email evaluation.
Social network
This section presents the performance analysis of social network spam detection techniques. Different ML models were designed for spam detection in social network sites. Some models achieved promising results. In [11], RF produced an accuracy and F-measure of 95.7%, while in [4] RF produced an accuracy and F-Measure of 93.2%. Moreover in [80], KNN produced an accuracy and F-Measure of 91.96% and 91.38%, NB produced an accuracy and F-measure of 92.07% and 91.74%, and SVM produced an accuracy and F-Measure of 93.14% and 92.97%. In [4], SVM produced an accuracy and F-Measure of 87.90% and 87.90%, while BN produced an accuracy and F-Measure of 84.2% and 84.3% [4]. The low performance produced by some of the techniques could be because of the quality of features and parameters used to build the models. Undoubtedly, these models require improvements if we want to design a secured webspace for social network users.
Feature selection also plays a major role in improving the results of ML-based social network spam detection classifiers. In some cases, feature selection techniques improved the accuracy of classifiers by over 33%. In [80], CNN produced an accuracy of 93.91% when evaluated on a Twitter dataset containing 14000 features. The same classifier (CNN) produced a better accuracy of 95.48% when evaluated on the same dataset with a reduced number of features – 8000 features. In [4], EA was used to reduce the feature size of a dataset from 69 to 18, and the reduced dataset produced better result compared to the original dataset. The accuracy of RF was improved from 92.8% to 93.2%, while the accuracy of SVM was improved from 87.8% to 88.3%. Moreover, the accuracy of BN was improved from 88.1% to 88.9%. The authors in [60] designed an RST-based feature selection technique and applied it to five social network datasets containing different number of features. The technique reduced the features in the first dataset from 63 features to 15 features. The feature selection technique also reduced the features in the second and third dataset from 60 features to 36 features, and from 41 features to 5 features, respectively. Moreover, the feature selection technique reduced the features in the fourth and fifth dataset from 60 features to 5 features, and from 23 features to 6 features, respectively. The five reduced datasets were evaluated on different ML classifiers and they produced improved results. For dataset 1, the accuracy and F-Measure of RandomTree was improved from 82.5% and 0.822 before feature selection to 85.80% and 0.862 after feature selection. Moreover, for dataset 2, the accuracy and F-Measure of NB classifier was improved from 50.4% and 0.536 before feature selection to 84.39% and 0.862 after feature selection. 
Moreover, for dataset 3, the accuracy and F-Measure of NB were improved from 90.1% and 0.908 to 99.51% and 0.995, while the accuracy and F-Measure of RBFNetwork were increased from 89.18% and 0.892 to 96.56% and 0.966. Furthermore, for dataset 4, the accuracy and F-measure of RBFNetwork were improved from 89.7% and 0.879 to 97.44% and 0.974, while the accuracy and F-Measure of NB were increased from 75.23% and 0.742 to 99.62% and 0.996. As shown in the results, feature selection techniques improved the accuracy of spam classifiers by over 33 percentage points, as in the case of dataset 2.
Some studies used deep learning algorithms to build spam detection models. The authors in [11] designed a deep learning-based model using RNN, and it produced an accuracy of 80%. The accuracy is not very high because of the size of the dataset used to train the algorithm; deep learning algorithms require a large number of data instances for improved performance. Jain et al. [80] designed a hybrid deep learning-based technique using CNN and LSTM (called SSCL). The hybrid technique outperformed traditional ML classifiers, achieving an accuracy and F-measure of 95.48% and 97.13%, respectively. The hybrid technique also outperformed standard CNN and LSTM, which produced accuracies of 92.73% and 92.93% respectively.
Review spam
This section presents the performance analysis of review spam detection techniques. Many ML models have been designed for review spam detection. The classification method introduced in [16] achieved the best accuracy of 98%, followed by SVM, which achieved an accuracy of 89.8% [125]. Moreover, the performance of classifiers varies with the type of review (positive or negative). In [153], SVM produced 88.4% accuracy for positive reviews and 86.0% accuracy for negative reviews.
Some studies introduced different features, and these features contribute individually to the prediction accuracy of ML classifiers. Asghar et al. [16] trained different ML classifiers on features from different entities, including the reviewer, the review, and the product. The results show that the classifiers produced the best result when trained on features from all three entities. The classifiers also produced good accuracy when trained on reviewer-based features alone, which suggests that features related to the reviewer are strong indicators of review spam. Li et al. [97] evaluated different ML classifiers using unigram features, LIWC features and POS features, and the features had different impacts on the performance of the classifiers. SVM produced an accuracy and F-measure of 78.5% and 77.8% when trained on unigram features. It produced an accuracy and F-measure of 74.5% and 75.9% when trained on LIWC features, and a lower accuracy and F-measure of 73.5% and 75.1% when trained on POS features. In [125], SVM produced 73% accuracy for POS features, 76.8% accuracy for LIWC features, and 88.4% accuracy for unigram features. As shown in the results, unigram features produced the best result. Unigram features produce even better results when combined with other features. Ott et al. [125] combined the unigram features and LIWC features. They trained SVM on the hybrid features, and it produced an improved accuracy of 89.8%, compared to 88.4% when it was trained on only unigram features.
As with spam email and social network spam detection, feature selection also plays an important role in classifying spam reviews. Rajamohana and Umamaheswari [133] introduced a hybrid feature selection technique using a combination of iBPSO and SFLA. They applied the technique to a dataset containing over 2400 features, and it reduced the feature size to 642 features. Moreover, iBPSO and SFLA were separately applied to the same dataset, and they reduced the feature size to 772 features and 723 features respectively. As shown, the hybrid feature selection technique achieved better feature reduction than the individual algorithms. Furthermore, SVM, KNN, and NB were trained on the reduced feature sets, and they produced better results than on the original dataset. The feature set produced by the hybrid technique achieved an accuracy of 94.97% for NB, 92.12% for KNN, and 91.25% for SVM. It is noteworthy that SFLA selected more features than the hybrid technique but produced lower accuracy: 85.5% for NB, 87.07% for KNN, and 89.84% for SVM. Similarly, iBPSO produced lower accuracy than the hybrid technique: 82.5% for NB, 85.05% for KNN and 88.38% for SVM. This shows that multiple NI algorithms can be combined to design effective feature selection techniques for review spam detection.
Some studies introduced different types of weighting schemes for review spam detection. These weighting schemes are designed to assign more weight to high-risk features. Results showed that ML classifiers produced improved results when trained on features with high weights. Mukherjee et al. [113] introduced a weighting scheme for selecting features for spammer groups. They also introduced an unsupervised model called GSRank, and the model outperformed traditional ML algorithms. The model was evaluated on a dataset containing product reviews from Amazon, and it produced an AUC score of 0.93, while SVM and LR produced lower AUC scores of 0.83 and 0.79 respectively. Asghar et al. [16] introduced a classification method that uses a weighting scheme to calculate spam scores for product reviews. In the study, each review is classified as spam or non-spam based on the spam score. Results showed that the classification method outperformed traditional ML classifiers when evaluated on an Amazon dataset. It produced an accuracy and F-Measure of 98%. In contrast, SVM produced a lower accuracy and F-measure of 78% and 0.70 when evaluated on the same dataset. KNN also produced a lower accuracy and F-measure of 74% and 0.69, respectively.
Web spam
This section presents the performance analysis of web spam detection techniques. Combating web spam is becoming a great challenge for many search engines. In view of this, many anti-spam techniques have been developed, and some of them produced promising performance. Based on experiments performed by Sattar et al. [148], SVM produced the best result, achieving an accuracy of 100%, followed by LR with an accuracy of 99.73%. In [104], the C4.5 ensemble classifier produced an accuracy and F-measure of 94.66% and 0.35 respectively. NB produced the lowest performance for web spam detection, with an accuracy of 76.1%. The results reported in some studies show that some ML classifiers produced very good accuracy but have time complexity problems. For instance, Asdaghi and Soleimani [15] compared the performance of different ML algorithms, and the comparison shows that MLP and SVM produced the best result in terms of accuracy, but the worst result in terms of time complexity. In the same study, KNN produced the best result in terms of time complexity, followed by Random Tree and the NB classifier. The results reported in this paragraph are based on experiments performed on the WEBSPAM-UK2007 dataset.
Effectively identifying web spam requires ML classifiers to be trained on specific features, including graph-based features and content-based features. In [148], graph-based features produced the best performance, followed by content-based features. In [148], SVM and RF achieved an accuracy of 100% when trained on graph-based features. Moreover, in [15], NB produced an accuracy of 92% when trained on graph-based features. SVM and KNN also achieved accuracies of 99.97% and 99.81% when trained on content-based features. On the other hand, in [148] and [15], the NB classifier produced very poor results when trained on content-based features. In [148], NB produced an accuracy and F-measure of 36.009% and 0.471 respectively. In [15], NB produced an accuracy of 14%. This shows that the choice of features plays a key role in effectively identifying web spam.
Feature selection techniques also improve the performance of web spam classifiers. Sattar et al. [148] used information gain and gain ratio to reduce the feature set of the WEBSPAM-UK2007 dataset from 274 features to 201 features, and the reduced dataset produced better results than the original dataset. Asdaghi and Soleimani [15] introduced a feature selection technique using the backward elimination approach. The technique was applied to the WEBSPAM-UK2007 dataset, and it reduced the number of features by over 90%, from 275 to 27. The original and reduced datasets were evaluated on the NB classifier, and the results show that the feature selection technique improved the accuracy of the NB classifier from 70% to 73%. Makkar et al. [107] designed a framework for web spam detection consisting of feature selection and instance selection techniques. The framework used RFE for feature selection and PCA for instance selection. Experimental results show that the framework improved the accuracy of the RNN classifier by over 24%.
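Information gain, one of the filter criteria mentioned above, scores a feature by how much knowing its value reduces the entropy of the class label. The sketch below ranks discrete features by information gain; it is an illustration under stated assumptions (discrete feature values, hypothetical feature names and toy data), not the procedure used in [148], and continuous features would first need to be binned.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the entropy that remains after
    splitting the samples by the value of one discrete feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remaining = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remaining


def rank_features(rows, labels, names):
    """Return (name, gain) pairs sorted from most to least informative."""
    gains = [
        (name, information_gain([row[i] for row in rows], labels))
        for i, name in enumerate(names)
    ]
    return sorted(gains, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: two binary features per web page.
names = ["has_many_outlinks", "even_page_length"]
rows = [(1, 0), (1, 1), (1, 0), (0, 1), (0, 0), (0, 1)]
labels = ["spam", "spam", "spam", "normal", "normal", "normal"]

ranking = rank_features(rows, labels, names)
print(ranking)
```

A filter-based selector would then keep the top-ranked features (or those above a chosen threshold) and retrain the classifier on the reduced feature set.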
Some studies introduced hybrid techniques for web spam detection. Makkar et al. [107] introduced a hybrid deep learning-based framework for web spam detection using RNN, RFE and PCA. Experimental results show that the hybrid framework achieved good results. Lu et al. [104] proposed a hybrid web spam detection technique using CSA for feature selection, under-sampling for dataset balancing, and an ensemble decision tree classifier. Experiments performed on an imbalanced dataset (the WEBSPAM-UK2006 dataset) show that the hybrid technique outperforms other ML classifiers. It achieved an improved accuracy and F-measure of 89.68% and 0.92, compared to the NB classifier, which produced a lower accuracy and F-measure of 69.37% and 0.47 respectively. Lu et al. [103] also proposed a hybrid approach using ICA for feature selection, an under-sampling technique for dataset balancing, and the C4.5 decision tree classifier. Experiments performed on the WEBSPAM-UK2006 dataset show that the hybrid technique improved the accuracy of the C4.5 classifier by over 16 percentage points, from 72.77% to 89.73%. It also improved the F-measure from 0.76 to 0.93. This shows that hybrid techniques can be used to improve the performance of web spam detection. It is also noteworthy that the improved result is partly due to the under-sampling technique used to balance the WEBSPAM-UK2006 dataset in [103]. The WEBSPAM-UK2006 dataset used in [103] is very imbalanced; the ratio of normal web pages to spam web pages is 7:1. This shows that under-sampling techniques can be used to improve the quality of imbalanced datasets, and consequently the performance of ML algorithms.
Conclusion and future research direction
Spam is an evolving problem that has affected millions of individuals across the globe. The world is currently living in an era of attractive internet technologies, such as email, social networks, microblogging websites, and review websites. Although these technologies have made life easier and more pleasurable for users, they have also exposed many individuals to cyber-attacks, such as spam. Many cybercriminals are now taking advantage of the vulnerability of users and launching cyber-attacks on electronic platforms. Given this, researchers have developed effective techniques for securing these platforms. This paper presents a survey of ML-based and NI-based spam detection techniques for emails, social network platforms, review sites and search engines. More types of spam detection techniques exist; therefore, future surveys can expand the scope of this study by reviewing other types of spam detection techniques. This will further broaden the understanding of the research community and provide a platform for developing improved spam detection systems.
As shown in the survey, some studies designed feature engineering techniques and classification rules, while some introduced instance selection and parameter optimization techniques. Most of these techniques are typically used to select top-quality features, instances, and classification rules for improved spam detection. Also, as shown in Table 4, some techniques did not perform very well, therefore, there is room for more improvements. The next four subsections provide some recommendations for future research.
Review spam
Large volumes of data are regularly generated on different review sites, such as Booking.com, Yelp, and TripAdvisor. Insights from these big datasets can be used to improve the performance of spam filtering systems. Most studies built their models on small datasets [96,125], implying that these models cannot generalize well in a real-world scenario. Generally, classifiers that are trained on large datasets will produce better generalization performance. Classifiers such as deep learning algorithms can handle big datasets, but they have not been fully explored for spam detection. Future research can focus on designing robust models that are trained on large-scale datasets with top-quality features and instances. These algorithms can be trained on both textual and visual spam contents. Moreover, big data techniques and tools (such as Mahout and Spark) can be explored to effectively maximize the insights in big datasets.
Some of the datasets used to evaluate the review spam detection techniques are synthetic in nature, most likely due to the difficulty in labelling datasets and the scarcity of spam instances [96]. Training a classifier on synthetic datasets does not produce realistic results, because synthetic datasets do not accurately represent real-world spam samples. Future studies can focus on collecting datasets that are representative of real-world samples. Obtaining a large-scale labelled dataset can be challenging and cumbersome; however, it will provide a good platform for researchers and developers to design effective models for improved spam detection.
Most of the review spam detection techniques proposed in the literature were designed to identify fake reviews written by individual reviewers, however, these techniques cannot detect reviews written by a group of malicious reviewers, called spammer groups. Moreover, the supervised learning approach cannot effectively handle spammer groups [7]. Future work can focus on designing techniques that can effectively handle the detection of spammer groups. Moreover, future work can design generic techniques that can identify both individual spam reviews and spammer groups.
Priority-based ranking of spam features is one of the key challenges in review spam detection [16]. This is because of the difficulty in deciding the priority level of many spam-related features. Previous works on priority-based ranking focused on the graph-based model [119,175], however, such approaches are inefficient when a hybrid feature set is used [67]. Therefore, efficient priority-based ranking techniques are required. These techniques can be designed to rank features based on their importance in a particular domain. Asghar et al. [16] introduced a technique in this direction, but more work is still required.
Social network spam
The number of active Facebook, Twitter, Snapchat, and WhatsApp monthly users as of January 2020 is estimated to be 2.45 billion, 340 million, 382 million and 1.6 billion, respectively [47]. Although social networks have connected millions of users, they have also exposed millions of users to cyberattacks. These platforms require efficient techniques that can protect users from spam attacks. Very few spam detection techniques exist for social network spam detection; hence, this is a potential research opportunity for interested researchers. Future research can train models on image-based features. Normally, social network platforms should have an indisputable visual appearance that makes them unique; they should have some sort of logo (or trademark) that is peculiar to them. This image-based feature (i.e. the logo) can be used to train deep learning algorithms for effective identification of fake websites.
The performance of current studies is limited by the informal communication methods adopted by users in social networks. Users are not restricted to English vocabulary, nor are they restricted to a standard style of communication. Thus, analyzing the content of messages and their semantics (the meaning of the content) can be challenging [17]. This informal communication style also gives spammers more liberty to spread spam content without being detected. More effective methods that can identify spam content based on non-English vocabularies are required. A large dataset of prevalent vocabularies currently used in popular social networks, such as Twitter and Facebook, can be collected and used to build effective spam detection models.
Network-based metrics can also be used as good spam indicators in social networks [17]. Metrics such as groups and popularity can be used to identify spam profiles in social networks. These metrics can help in identifying the relationship between spammers and their in-degree and out-degree. They can also be used to identify the relationship between spammers and the possible groups that are formed in their network [40]. In-degree and out-degree metrics show how an individual is connected to others and therefore indicate the popularity of a user. For example, an individual such as the president of a country will have high in-degree and out-degree values because of the large number of people connected to them in their network and the large number of views they attract. These metrics can be analyzed and used to build effective social network spam detection systems. Furthermore, analysis of how spammers interact and engage with the target audience can also provide useful insights into the behaviour and techniques of spammers [17].
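As a minimal sketch of the degree metrics discussed above, the following Python snippet counts in-degree and out-degree from a list of directed "follows" edges. The edge list, the account names, and the helper `degree_counts` are hypothetical and are used only for illustration; they are not taken from the surveyed studies.

```python
from collections import Counter

def degree_counts(edges):
    """Return (in_degree, out_degree) counters for directed (source, target) edges."""
    in_degree = Counter()
    out_degree = Counter()
    for source, target in edges:
        out_degree[source] += 1  # source follows (links to) another account
        in_degree[target] += 1   # target is followed (linked to) by another account
    return in_degree, out_degree

# Hypothetical follow graph: (follower, followed)
edges = [
    ("alice", "president"),
    ("bob", "president"),
    ("carol", "president"),
    ("spam_bot", "alice"),
    ("spam_bot", "bob"),
    ("spam_bot", "carol"),
    ("president", "alice"),
]

in_degree, out_degree = degree_counts(edges)
for user in sorted(set(in_degree) | set(out_degree)):
    print(user, "in:", in_degree[user], "out:", out_degree[user])
```

In this toy graph, the hypothetical account `spam_bot` follows many accounts but is followed by none. Such an asymmetry is one pattern that degree metrics could surface, although degree alone should not be treated as a reliable spam indicator.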
Most studies on social network spam are based on supervised learning approaches [194], which require annotated datasets that can be costly, complex, and laborious to create. Semi-supervised learning for big social data analytics is one of the latest trends in social spam detection. Future work can focus on developing semi-supervised learning approaches for big datasets because experimental results achieved by some studies show that semi-supervised learning approaches can achieve results comparable to those of supervised learning approaches for real-time spam detection in social networks.
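To illustrate how a semi-supervised approach can make use of unlabeled messages, the sketch below implements self-training, one simple semi-supervised strategy, on top of a nearest-centroid classifier. This is an assumption chosen for brevity and not the method of any surveyed study. The two numeric features, the data points, the `confidence_ratio` parameter, and all function names are hypothetical.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length feature tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def sq_dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def self_train(labeled, unlabeled, confidence_ratio=2.0, max_rounds=10):
    """Iteratively pseudo-label unlabeled points that lie clearly closer to one class centroid."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        spam_c = centroid([x for x, y in labeled if y == "spam"])
        ham_c = centroid([x for x, y in labeled if y == "ham"])
        confident, remaining = [], []
        for x in pool:
            d_spam, d_ham = sq_dist(x, spam_c), sq_dist(x, ham_c)
            if d_ham >= confidence_ratio * d_spam:
                confident.append((x, "spam"))
            elif d_spam >= confidence_ratio * d_ham:
                confident.append((x, "ham"))
            else:
                remaining.append(x)  # ambiguous: leave unlabeled
        if not confident:
            break  # nothing new was learned in this round
        labeled.extend(confident)
        pool = remaining
    return labeled, pool

# Hypothetical features: (links per message, fraction of capital letters)
labeled = [((5.0, 0.8), "spam"), ((0.0, 0.1), "ham")]
unlabeled = [(4.0, 0.7), (1.0, 0.2), (2.5, 0.45)]

labeled_out, pool_out = self_train(labeled, unlabeled)
print(len(labeled_out), "labeled examples;", len(pool_out), "left unlabeled")
```

Only points that are clearly closer to one centroid receive a pseudo-label; ambiguous points stay in the pool. This reflects a common design choice in self-training: wrongly pseudo-labeled examples can degrade the model, so a conservative confidence rule is usually preferred.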
Camacho et al. [36] proposed four dimensions (or performance measures) that can be used to evaluate the capacity of different techniques in performing social network analysis tasks, namely: pattern and knowledge discovery, information fusion and integration, scalability, and visualization. The first dimension (pattern and knowledge discovery) helps us to determine the capacity of our algorithms to discover and gather knowledge from an online social network. The second dimension (information fusion and integration) checks the capacity of algorithms to fuse different sources of data, such as audio, video, and image data. The third dimension (scalability) helps us to determine the capacity of algorithms to work with a large volume of data in adequate time. The fourth dimension (visualization) helps us to determine the capacity of our techniques to visualize, filter, and adequately represent the information stored in a network. Based on the four dimensions, Camacho et al. [36] defined a set of metrics that allow researchers to determine the maturity level of their techniques in social network analysis. Future studies can use these metrics to evaluate the effectiveness and maturity level of their techniques or frameworks.
Webspam
The main types of webspam addressed in the literature are content-based spam, link-based spam, and cloaking-based spam [94]. Most of the current techniques address only one type of spam. Therefore, there is a need for effective techniques that can identify multi-class webspam. Also, most of the current techniques focus on link-based or content-based spam, so there is a need for novel techniques that can effectively handle cloaking-based spam.
Deep learning algorithms can be used to design effective webspam detection solutions. However, deep learning algorithms require large-scale datasets. As shown in Table 6, some of the popular datasets used in the literature (such as WEBSPAM-UK2006 and WEBSPAM-UK2007) are small and outdated; they were collected over 12 years ago. These datasets also suffer from class imbalance. For example, in the WEBSPAM-UK2007 dataset, the ratio of spam webpages to normal webpages is 1:17. Future studies can focus on collecting large-scale and balanced datasets suitable for building improved deep learning-based webspam detection systems. Such systems can increase user satisfaction by effectively detecting spam resources and preventing the ranking of illegitimate web pages. Research in this domain is ongoing, and it requires more input from interested researchers.
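To make the effect of a 1:17 class ratio concrete, the short Python sketch below computes the spam fraction and a set of inverse-frequency class weights. The weighting formula, total / (number of classes × class count), is a commonly used heuristic and is an assumption here rather than something prescribed by the surveyed studies. The counts are placeholders scaled to the 1:17 ratio, not the actual dataset sizes.

```python
def balanced_class_weights(class_counts):
    """Inverse-frequency weights: total / (n_classes * count) for each class."""
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: total / (n_classes * count) for label, count in class_counts.items()}

# Placeholder counts reflecting only the 1:17 spam-to-normal ratio
counts = {"spam": 1, "normal": 17}

spam_fraction = counts["spam"] / sum(counts.values())
weights = balanced_class_weights(counts)

print(f"spam fraction: {spam_fraction:.3f}")
for label, weight in weights.items():
    print(f"weight[{label}] = {weight:.3f}")
```

With this ratio, only about 5.6% of pages are spam, so a classifier that labels every page as normal would already reach roughly 94.4% accuracy. This is why accuracy alone can be misleading on such datasets, and why reweighting or rebalancing is often considered.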
Spam email
Substantial work has been done to improve the performance of spam email classifiers. The most successful techniques are the content-based methods designed to classify emails based on their content [52]. Many studies did not explore deep learning algorithms, which could be considered the future of spam email classification. Deep learning techniques allow spam filters to learn effectively from previous experience without explicit programming [190]. They also give spam filters the ability to extract valuable features from large-scale datasets. They are more effective than conventional ML algorithms (such as SVM, NB, and KNN) because the performance of deep learning algorithms increases as the training size increases. They also exploit the computational power of modern CPUs and GPUs. Future research should consider designing deep learning-based techniques for spam email detection. These algorithms can also be trained on image-based features in emails for improved performance.
Concept drift is a problem currently faced by spam filtering techniques: while researchers are trying to design improved spam email filters, spammers are also developing new techniques to evade them. Therefore, there is a need to design techniques that can effectively handle the evolving and unforeseen changes in spam features. Conventional ML algorithms cannot efficiently handle the concept drift problem; they require expert knowledge to work effectively. Future research can explore deep adversarial learning algorithms as a possible solution to the concept drift problem [52].
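As a rough illustration of how concept drift might be noticed in practice, the sketch below monitors a filter's recent error rate over a sliding window and flags possible drift when it exceeds a baseline by a margin. This is a simple monitoring heuristic, not the deep adversarial approach mentioned above and not a method from the surveyed studies. The class name, parameters, and message stream are all hypothetical.

```python
from collections import deque

class ErrorRateDriftMonitor:
    """Flag possible concept drift when the recent error rate exceeds a baseline by a margin."""

    def __init__(self, baseline_error, window_size=50, margin=0.10):
        self.baseline_error = baseline_error
        self.margin = margin
        self.window = deque(maxlen=window_size)  # keeps only the most recent outcomes

    def update(self, predicted, actual):
        """Record one prediction outcome; return True if drift is suspected."""
        self.window.append(predicted != actual)
        if len(self.window) < self.window.maxlen:
            return False  # not enough evidence yet
        recent_error = sum(self.window) / len(self.window)
        return recent_error > self.baseline_error + self.margin

monitor = ErrorRateDriftMonitor(baseline_error=0.05, window_size=20, margin=0.10)

# Hypothetical stream of (predicted, actual) labels: the filter is correct for
# 40 messages, then spammers change tactics and spam is misclassified as ham.
stream = [("spam", "spam")] * 40 + [("ham", "spam")] * 10

drift_at = None
for index, (predicted, actual) in enumerate(stream):
    if monitor.update(predicted, actual) and drift_at is None:
        drift_at = index

print("drift first suspected at message index:", drift_at)
```

A monitor of this kind only signals that the data distribution may have changed; it needs true labels (for example, from user feedback) and does not itself adapt the model. Retraining or an adaptive learning method would still be required once drift is suspected.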
Some of the proposed spam detection techniques used static approaches, such as blacklists or whitelists. These techniques cannot effectively handle zero-day attacks, and they are known to perform poorly in terms of false positives (FPs) [43]. Moreover, they require regular updates and human intervention, because spam filtering rules are expected to be updated regularly [77]. It is therefore highly recommended that blacklists and whitelists should not be used as stand-alone techniques; they can be used as supplements to other techniques. Some studies also designed rule-based systems for spam email detection. However, rule-based systems are neither reliable nor dynamic. Because they depend on specific rules, they require regular updates and can be easily bypassed by sophisticated spam attacks. Therefore, future studies can focus on designing dynamic rule-based systems for detecting new and evolving target concepts as they occur.
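The recommendation to use blacklists and whitelists only as supplements can be sketched as a layered filter: the static lists act as a fast first pass, and senders that appear in neither list are passed to a learned, content-based model. In the Python sketch below, `layered_filter`, `toy_classifier`, the cue words, the threshold, and the addresses are all hypothetical; `toy_classifier` is only a stand-in for a trained model.

```python
def layered_filter(sender, message, whitelist, blacklist, classifier, threshold=0.5):
    """Use static lists as a first pass and defer unknown senders to a learned model."""
    if sender in whitelist:
        return "ham"
    if sender in blacklist:
        return "spam"
    # Unknown sender (e.g. a new campaign): rely on the content-based score.
    return "spam" if classifier(message) >= threshold else "ham"

def toy_classifier(message):
    """Hypothetical stand-in for a trained model: fraction of words that are spam cues."""
    cues = {"free", "winner", "prize", "urgent"}
    words = message.lower().split()
    return sum(word in cues for word in words) / len(words) if words else 0.0

whitelist = {"colleague@example.org"}
blacklist = {"known-spammer@example.net"}

results = [
    layered_filter("colleague@example.org", "free prize inside",
                   whitelist, blacklist, toy_classifier),
    layered_filter("known-spammer@example.net", "meeting notes attached",
                   whitelist, blacklist, toy_classifier),
    layered_filter("new-sender@example.com", "urgent winner free prize",
                   whitelist, blacklist, toy_classifier),
    layered_filter("new-sender@example.com", "meeting notes attached for review",
                   whitelist, blacklist, toy_classifier),
]
print(results)
```

The ordering of the checks is itself a design choice. Here the whitelist takes precedence, which keeps trusted senders from being blocked but would also let through spam sent from a compromised whitelisted account.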
General recommendations
In recent years, some studies developed models for semi-supervised neural networks and deep generative learning. However, semi-supervised learning approaches perform better than supervised learning approaches only in specific cases [101,173]. In other cases, the performance degradation could be caused by the introduction of unlabeled data [173]. The reported benefits could also be influenced by the supervised learning algorithms used as baselines to evaluate semi-supervised learning techniques. These supervised learning algorithms are weak, which gives a biased perspective on the benefits of incorporating unlabeled data [121]. The performance degradation in semi-supervised learning is more significant than the performance improvement [173]. This clearly shows that much work is still required on semi-supervised learning methods for machine learning problems, such as spam detection.
Some of the proposed techniques used neural network-based algorithms, such as CNN and RNN. Traditional neural network-based techniques require a massive volume of training samples to produce very good generalization performance. This is particularly true for CNN, where the pooling operations discard positional information and do not consider the hierarchical relationships between features [143]. Capsule neural networks have the potential to better model the hierarchical relationships between consecutive layers. The success recorded by recent studies [176,192,196] shows that capsule neural networks have good potential for solving NLP problems, such as spam detection in social networks and online reviews. This is a potential direction for future research.
As observed in this survey, most studies focused on designing spam detection techniques for one domain (and not multiple domains). One reason is that the features used to identify spam in emails are different from the features used to identify spam in social networks or online reviews. Also, the parameters used to build classification models for different domains differ. It will be interesting to explore the possibility of designing unified frameworks that can filter spam in multiple domains. These frameworks should have different components that can perform instance selection, feature selection, parameter optimization, and classification for multiple domains. Doing this should provide a platform for spam detection in multiple domains. It should improve the effectiveness, training speed, and prediction accuracy of spam detection in multiple domains, and reduce its computational complexity.
Spam detection is not restricted to the English language; however, this survey shows that most studies focused on designing spam detection techniques for messages written in English. The studies in [165] and [2] are among the few that designed spam detection techniques for non-English languages, including Chinese and Arabic. Therefore, future research can explore designing spam detection techniques for other languages.
Acknowledgments
I would like to acknowledge the remarkable contribution of my late supervisor, Professor Aderemi Adewumi. His support and contribution are highly appreciated.
