Abstract
Malicious users increasingly try to lure consumers by posting fraudulent marketing URLs on social networking sites. Such malicious URLs, posted by fake users in Social Networking Services (SNS), are hard to identify, so there is a need to detect them. To this end, this paper proposes an SNS Fraudulent Detection (SFD) scheme. The proposed SFD scheme includes Deterministic Finite Automata Tokenization (DFA-T) and a Web Crawler (WC) based Neuro-Fuzzy System (WC-NFS). DFA-T extracts the URL features and calculates a Penalty Score (PS) based on the malicious words in the extracted URL. The URL features extracted by DFA-T, together with the PS, are fed into WC-NFS. Subsequently, the WC computes a numeric WC-Index (WCI) value for each URL, which is also fed into WC-NFS. An existing URL dataset is used to identify malicious web links, and suitable machine learning techniques are used to classify the URLs. The experimental results show that the proposed SFD achieves 92.6% accuracy in separating benign from malicious URLs, which is higher than the existing methods it is compared with.
Introduction
Social Networking Services (SNS) such as Instagram, Twitter, Facebook and YouTube connect people with similar interests [2]. Many users are active on social media and turn to SNS for purchases, which gives scammers an opportunity to attract consumers. Consumer Electronics (CE) are electronic products used in day-to-day life [17], and advertisers market their electronic products in SNS. Online advertisements for CE purchases may be fraudulent: an attacker posts a CE product advertisement in SNS to attract consumers [5]. Attackers lure consumers with irresistible posts, and genuine consumers may fall prey to such advertisements by clicking the malicious URL posted by the adversary. Such adversaries either campaign to sell a fraudulent CE product or steal user information [19]. In a targeted attack, some attackers use stolen or leaked user information to analyze the type of CE purchases a user makes [16]. Reportedly, 46% of consumers have experienced such fraudulent activity on social media platforms, which leads to phished websites, as shown in Fig. 1 [23].

Fraudulent activity experienced while making a purchase.
Many such links lead to phished websites. Figure 2 shows the number of phishing sites detected in the third quarter of 2020, as reported by the public or consumers; more than 50% of the phishing cases were fraud reports related to SNS purchases. Online Social Networking (OSN) users are susceptible to multiple security risks [8].

Phishing sites reported in 2020.
Figure 3 compares the phishing attack cases of 2020 with those of 2015–2019; 2020 saw a massive number of attacks compared with the previous five years. Malicious users might launch an attack to steal the user's sensitive information, such as username, password or credit-card details, or to steal money by invoking a fraudulent purchase [3]. Some attackers impersonate an organization and steal information from consumers [14]. Such attacks can be identified by inspecting the features of the posted web page and differentiating scam links from genuine links [4]. Some purchase links posted in SNS are analyzed using lexical features of the URL, but using only lexical features is ineffective [12]. Web page features such as URL-based, content-based and host-based features can be used for precise prediction [11]. SFD is proposed to ensure the trustworthiness of websites involved in CE purchases and to verify the identity of users advertising in such vulnerable SNS.

Comparison of malicious activity 2020 vs. 2015–2019.
The paper is organized as follows. Section II explains the prior research and the novelty of SFD. Section III elaborates the proposed SFD in detail. The experimental results are discussed in Section IV, and the paper is concluded in Section V.
Digital marketing is an integral part of social networking, and a majority of consumers are moving towards purchasing electronic products through SNS such as Instagram and Twitter. As SNS-based purchases of consumer products increase, fraudulent posts also increase. Some fraudsters post URLs with malicious domain names to sell counterfeit products [13]. The malicious domain name can belong to an already existing phished webpage or to a webpage created specifically for SNS consumers [15]. A considerable amount of research exists on detecting malicious web pages. Figure 4 shows the taxonomy for malicious web page detection.

Taxonomy for malicious web page detection.
Blacklists and whitelists [18] are the most popular approaches for finding malicious URLs. They hold lists of blocked and genuine URLs, but such verification is not accurate for CE purchase fraud, since the lists are updated manually and do not contain recently compromised websites. Machine learning approaches such as Support Vector Machine and Random Forest overcome the disadvantage of manual list updating, since they detect malicious URLs based on web parameters. Some heuristic-based approaches are also used for URL classification, but machine learning based approaches outperform them in speed [21]. CANTINA+ is a heuristic-based approach that uses URL-based features to differentiate malicious and benign URLs [20].
Some existing classification approaches were based on ranking algorithms [3]. One approach used a link analysis algorithm to calculate page rank [3] values based on the web traffic of a particular page, popular sites and frequently visited sites, which was found to be inaccurate. Google Indexing [3] uses Google crawlers and gives the index of the webpage. Alexa Reputation [3] ranks the webpage based on its popularity.
A Learning Automata-based Malicious Social Bot Detection (LA-MSBD) [1] approach used URL features and trust models to detect malicious bots in Twitter. LA-MSBD calculated the direct and indirect trust of Twitter users based on a Bayesian learning approach and the belief value given by other users. The belief value was found to be considerably low for newly created posts, which led to the misclassification of genuine posts as malicious. Moreover, a URL embedded (UE) algorithm [2] detected malicious websites based on the correlation and coefficients of URLs. The experimental analysis showed that it consumed a large amount of space to store the embedded model, since UE mapped each URL to its distributive representation. Another existing method, MALicious Tweets in Parallel (MALTP) [4], used tweet graphs and metapaths to separate malicious from genuine tweets, which resulted in an increased misclassification rate. Furthermore, MALTP failed to verify the trustworthiness of posts given by legitimate account holders, which led to increased malicious posts. Fraud detection can be modeled as a classification problem with case-based reasoning or as a knowledge extraction task with imbalanced classes. Models based on different artificial intelligence techniques, including neural networks, copula models, decision trees and others, have been used to identify fraudulent payment activity [22].
SpoofGuard prevents web spoofing and phishing using URL heuristics. ProGuard [7], a promotion-based malicious account detection approach for online social networks, collects the behavioral features of advertising accounts to classify them as malicious; it is not very efficient when an attack is launched from a new account with no information. Li et al. proposed a scoring-based system to identify malicious users based on privacy scores [9]. Some approaches detected malicious URLs by inspecting the keywords in URLs [10]. As a step towards malicious post detection, a cognitive identification framework was designed to identify malicious users in SNS for CE product advertisement [6] based on cognitive trials of users, such as psychological, activity and social interaction trials. However, this existing approach failed to resist fraudster advertising. Therefore, this paper presents a novel SNS fraud detection approach named SFD, which addresses these disadvantages and classifies a webpage posted in SNS as malicious or benign. SFD includes DFA Tokenization, a self-adaptive decision-making unit added after consumer URL extraction, which continuously iterates and finds a finite set of actions. The research focuses on designing a highly competent web crawler to extract a WCI feature, which plays a major role in accurately detecting newly launched malicious pages.
The WC-NFS takes input from DFA-T, WC and web traffic analyzer to precisely predict the posted URL as malicious or legitimate.
The proposed SFD detects fraudulent posts using URL and web features. The SFD contains DFA Tokenization, a Web Traffic Analyzer and a Web Crawler. In these phases, six features are extracted: Subdomain (SD), Pathdomain (PD), Penalty Score (PS), GoogleIndex (GI), Web Crawler Index (WCI) and WHOIS. Initially, Deterministic Finite Automata Tokenization (DFA-T) tokenizes the posted URL and extracts the SD and PD features; the penalty score is then calculated based on the malicious words in the URL. Subsequently, the Web Traffic Analyzer (WTA) extracts the GoogleIndex (GI) value. Finally, the WCI is computed by the web crawler. Based on these features, the SFD classifies a URL posted for a CE product as malicious or benign. Figure 5 shows the overall workflow of SFD. SFD begins with DFA-T, followed by WTA and finally WC-NFS, as explained below:

Overall workflow SFD.
Deterministic Finite Automata Tokenization (DFA-T) is performed after consumer URL extraction. DFA-T produces a response on each state transition. It is used to tokenize the contents of URLs. The structure of a URL is partitioned as follows: <protocol>://<SubDomain>.<Primary Domain>.<Top-Level-Domain>/<Path-Domain>. Table 1 explains the structure of a URL with an example.
URL structure explanation for https://bestbuy.electronics.com/site
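The partition described above can be illustrated with a short Python sketch. It is not the paper's implementation: the function name `split_url` and the rule that the last host label is the top-level domain and the one before it the primary domain are assumptions made for illustration, and multi-label suffixes such as `co.uk` are not handled.

```python
from urllib.parse import urlsplit

def split_url(url):
    """Split a URL into the five parts named in the text."""
    parts = urlsplit(url)
    labels = (parts.hostname or "").split(".")
    # Assumption: the last label is the TLD, the one before it is the
    # primary domain, and everything earlier is the subdomain.
    tld = labels[-1] if len(labels) >= 2 else ""
    primary = labels[-2] if len(labels) >= 2 else labels[0]
    sub = ".".join(labels[:-2])
    return {
        "protocol": parts.scheme,
        "subdomain": sub,
        "primary_domain": primary,
        "top_level_domain": tld,
        "path_domain": parts.path.lstrip("/"),
    }

print(split_url("https://bestbuy.electronics.com/site"))
```

For the example URL of Table 1, the sketch yields `https`, `bestbuy`, `electronics`, `com` and `site` for the five parts.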
Here the DFA-T divides the input URL into tokens. The character sequence of the URL is read as lexeme. The tokenization is performed as shown in Fig. 6. The extracted URL is given for tokenization. The URL is split as a character sequence in the input buffer and a Pointer Value (PV) scans the buffer and increments itself after scanning. The state transition occurs based on the input buffer data. A state transition occurs when a malicious character is scanned or a delimiter is found.

Tokenization process. S1 is the starting state, S2 and S3 are final states, SI is the invalid state, Sm is the malicious state and SP is a special character.
When the tokenizer encounters a delimiter, the sequence is stored as a token in the symbol table. The tokenizer checks for the occurrence of '@', '-' and dots and stores the information in the database. Legitimate sites do not frequently use '@' and '-'; when an '@' is encountered in the URL, all the contents to its left can be discarded. The malicious word transition takes place when such words are encountered. After tokenization, DFA-T stores the identified valid tokens, lexemes and malicious characters present in the URL of the CE purchase. The DFA-T stores the information about the sub domain, primary domain, top level domain and path domain separately as N tokens, along with the M malicious characters and their counts. The DFA-T extracts the information from the symbol table. Algorithm 1 explains the functioning of DFA-T.
Tokenization is performed and the results are stored in the symbol table. DFA-T extracts the SD, PD and MS values, and the penalty score is computed from the MS value.
The WTA computes the GoogleIndex of the URL posted in SNS, which is used in the proposed SFD. Google-indexed pages are web pages visited by Google bots; during indexing, the bots analyze the contents of the site. The GoogleIndex feature is an important web feature, since frequently used CE purchase sites in SNS have a high probability of being indexed.
WTA also uses the WHOIS feature. WHOIS provides information about the registration and expiration of a web page. The information about malicious sites is unstable compared with that of legitimate sites. Some malicious sites have IP addresses as part of the URL, and the WHOIS feature is used to identify such web pages. The webpage registration date is normalized to a numerical value and fed into WC-NFS as the WHOIS feature.
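A minimal sketch of this single-pass scan is given below. It is a simplification, not Algorithm 1: the delimiter set, the `MALICIOUS_WORDS` list and the scoring rule (one point per malicious word, '@' or '-') are hypothetical choices, since the source does not give the penalty formula.

```python
DELIMITERS = set("/.?=&:")                      # assumed delimiter set
MALICIOUS_WORDS = {"free", "login", "verify"}   # hypothetical word list

def dfa_tokenize(url):
    """Scan the URL one character at a time and emit a token at each delimiter."""
    tokens, lexeme = [], []
    counts = {"@": 0, "-": 0, ".": 0}
    pv = 0                                      # pointer value into the input buffer
    while pv < len(url):
        ch = url[pv]
        if ch in counts:
            counts[ch] += 1
        if ch == "@":
            # Assumption: content to the left of '@' is dropped.
            tokens, lexeme = [], []
        elif ch in DELIMITERS or ch == "-":
            if lexeme:
                tokens.append("".join(lexeme))
                lexeme = []
        else:
            lexeme.append(ch.lower())
        pv += 1
    if lexeme:
        tokens.append("".join(lexeme))
    return tokens, counts

def penalty_score(tokens, counts):
    """Assumed scoring: one point per malicious word, '@' or '-'."""
    word_hits = sum(1 for t in tokens if t in MALICIOUS_WORDS)
    return word_hits + counts["@"] + counts["-"]

tokens, counts = dfa_tokenize("http://secure@free-login.example.com/verify")
print(tokens, counts, penalty_score(tokens, counts))
```

In this example the tokens left of '@' are dropped, three malicious words are matched, and one '@' and one '-' are counted, giving a score of 5 under the assumed rule.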
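The source does not state how the registration date is normalized. One plausible reading, sketched below, maps the domain age to a value in [0, 1]; the function name `whois_feature`, the ten-year cap and the value 0.0 for a missing record are all assumptions. The WHOIS lookup itself is left out, so the registration date is passed in directly.

```python
from datetime import date

def whois_feature(registered, today, cap_days=3650):
    """Map domain age to [0, 1]; None (no WHOIS record) maps to 0.0."""
    if registered is None:
        return 0.0
    age_days = max((today - registered).days, 0)
    # Assumption: ages beyond cap_days (about ten years) saturate at 1.0.
    return min(age_days / cap_days, 1.0)

print(whois_feature(date(2020, 1, 1), date(2021, 1, 1)))
```

Under this scheme a recently registered domain, which the text associates with unstable malicious sites, receives a value close to 0.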
Web crawlers
Web Crawlers (WC), also referred to as Spider Bots (SB), crawl the consumer URLs and their interconnected pages and links. The WC crawls from one website to another, going through all the links until all the web indexes are crawled. Many attackers who intend to attract consumers to a purchase do not index all the web links; they focus on advertising electronic products. The WC gets its input from the DFA tokenizer and computes a numerical value from all existing indexes with the sub domain name. The WC value is passed to the Neuro-Fuzzy system along with the features extracted by the DFA tokenizer. Figure 7 explains the working of web crawlers. First the URL is extracted, and then the interlinks of the URL are extracted and traversed.

Web crawler for web indexing.
The crawler computes a URL rank based on the successfully crawled links and the computed URL rank is stored as WCI value in the database.
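The crawl-and-rank step can be sketched as a breadth-first traversal. The source does not give the rank formula, so this sketch assumes the WCI is the fraction of attempted links that were crawled successfully; `fetch_links` is a hypothetical callback that stands in for the page fetcher and returns a list of links or `None` on failure.

```python
from collections import deque

def web_crawler_index(start_url, fetch_links, max_pages=100):
    """Breadth-first crawl; returns crawled_ok / attempted as a value in [0, 1]."""
    queue, seen = deque([start_url]), {start_url}
    attempted = succeeded = 0
    while queue and attempted < max_pages:
        url = queue.popleft()
        attempted += 1
        links = fetch_links(url)      # hypothetical: list of links, or None on failure
        if links is None:
            continue
        succeeded += 1
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return succeeded / attempted if attempted else 0.0

# In-memory stand-in for a website: page "d" is linked but cannot be fetched.
site = {"a": ["b", "c"], "b": ["a"], "c": ["d"]}
print(web_crawler_index("a", site.get))  # 0.75
```

Passing the fetcher as a parameter keeps the traversal logic testable without network access; a browser-driven fetcher could be substituted for `site.get`.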
WC-NFS combines neural networks and fuzzy logic. It operates on the extracted features, including those from the web crawler, and acquires its knowledge automatically by back propagation. The proposed WC-NFS uses six parameters as input for classifying benign and malicious URLs: sub domain, primary domain, Penalty Score, WC index value, GoogleIndex value and WHOIS information. The first layer of the NFS is the input layer, where all six features are passed as vectors. The input vector xi is extracted using DFA-T and WC. Two of the six input features are text strings that need to be represented as real values for the input nodes; the remaining features are numerical. Some features will not have a crisp value, e.g. GoogleIndex might not exist for some URLs; in such cases "null" values are given as inputs. The second layer is fuzzification. The vector values are fuzzified using the sigmoid function, which produces Li and Mi, where i varies from 1 to 6. The Legitimate node is computed using
A sigmoid function for output node Oo is used for activation
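The equations for these layers are not reproduced here, so the following forward pass is only a rough sketch of the described structure. The membership form, the parameter names `centers`, `slopes`, `w_legit` and `w_mal`, the choice of Mi as the complement of Li, and the substitution of a default for null inputs are all assumptions; in a trained system the parameters would be learned by back propagation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nfs_forward(x, centers, slopes, w_legit, w_mal, null_value=0.0):
    """One forward pass: fuzzify six inputs, aggregate, squash to (0, 1)."""
    # Missing features (e.g. no GoogleIndex) are replaced by an assumed default.
    x = [null_value if v is None else v for v in x]
    # Fuzzification: L_i is the "legitimate" membership, M_i its complement.
    L = [sigmoid(s * (v - c)) for v, c, s in zip(x, centers, slopes)]
    M = [1.0 - l for l in L]
    legit = sum(w * l for w, l in zip(w_legit, L))
    mal = sum(w * m for w, m in zip(w_mal, M))
    return sigmoid(legit - mal)       # above 0.5 is read as benign in this sketch

ones = [1.0] * 6
print(nfs_forward([0.9] * 6, [0.5] * 6, [4.0] * 6, ones, ones))
```

A feature for which larger values indicate a malicious URL, such as the penalty score, would need a negative slope in this parameterization.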
Table of notations
SFD is tested on several datasets and evaluated for different ratios of malicious and benign URLs, as shown below.
Malicious and Legitimate URLs dataset are collected from three websites.
The SFD is tested with the collected datasets. The classifiers are trained and compared for benign-to-malicious ratios of 80 : 20, 70 : 30 and 60 : 40. About 11K URLs are collected from PhishTank and 22K URLs are taken from Malicious_n_nonmalicious URLs; 80% of the data is used for training and 20% for testing. The proposed SFD is programmed in Python. First, DFA tokenization is performed: the URL is tokenized and the necessary features are extracted. Web crawlers are then designed to crawl over the web indexes and the total WCI is found; the WC is implemented using the Python Selenium web driver. The features from DFA-T and WC are given as input to the NFS, which classifies each URL as benign or malicious.
The performance of the proposed SFD is evaluated based on accuracy, sensitivity and specificity. Accuracy is the proportion of URLs, benign or malicious, that are classified correctly.
Sensitivity is the rate of correctly detected legitimate URLs.
Specificity is the rate of correctly detected malicious URLs.
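These three metrics can be computed from the confusion counts as in the sketch below. It follows the definitions in the text by treating the legitimate (benign) class as the positive class; the function name `evaluate` and the label strings are illustrative.

```python
def evaluate(y_true, y_pred, positive="benign"):
    """Accuracy, sensitivity and specificity with `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Rate of correctly detected legitimate URLs.
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        # Rate of correctly detected malicious URLs.
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }

y_true = ["benign", "benign", "benign", "malicious", "malicious"]
y_pred = ["benign", "benign", "malicious", "malicious", "benign"]
print(evaluate(y_true, y_pred))
```

For this toy input, three of five URLs are classified correctly, two of three benign URLs are detected, and one of two malicious URLs is detected.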
The performance of SFD based on the evaluation metrics is given in Table 3. Figure 8 compares the sensitivity and specificity of proposed SFD with other models.

(a) Specificity (b) Sensitivity comparison of SFD with other models.
Performance of SFD
The proposed SFD is compared with existing classifiers: Support Vector Machine (SVM), Decision Tree (DT), K-Nearest Neighbor (KNN), Random Forest (RF) and Extreme Learning Machines (ELM). The comparison is done using the collected datasets and is shown in Table 4. The detection rate of SFD is better than that of the other classifiers. The overall detection rate is shown in Fig. 9. The proposed SFD classifies benign and malicious URLs with an accuracy of 92.6%, which is higher than the other classifiers.

Comparative analysis of existing models with proposed SFD.
Comparison analysis
The web crawling index of the collected URLs was analyzed. The analysis shows that the WCI of malicious URLs is below 0.5 and the WCI of benign URLs is above 0.5. Benign URLs have a high index rank, as shown in Fig. 10.

Web crawler indexing.
Table 5 shows the testing results for 20% of the collected datasets. The testing results were recorded for multiple test runs. The number of URLs predicted as benign and the number of misclassified URLs are also listed in the table. SFD was tested for different combinations of malicious and benign URLs, as shown in Fig. 11. Figure 11(a) shows the accuracy for the 80 : 20 benign vs. malicious ratio, Fig. 11(b) for the 70 : 30 ratio and Fig. 11(c) for the 60 : 40 ratio; in all three, SFD accuracy is higher than that of the other classifiers.

Accuracy evaluation for different ratios of benign and malicious URLs: (a) 80 : 20, (b) 70 : 30, (c) 60 : 40.
Testing results for URL ratio 70 : 30 (benign:malicious)
ELM performed close to SFD, whereas the detection rates of SVM, KNN and decision tree were low. The classification rate of random forest was high, but ELM and the proposed SFD outperformed RF.
The SNS Fraudulent Detection (SFD) scheme detects fraudulent consumer product marketing in social networking services. The SFD combines DFA-T, WTA and WC-NFS to detect malicious URLs posted by fraudsters. SFD is tested with URLs collected from SNS sites and compared with other classifiers, namely SVM, DT, KNN, RF and ELM. SFD classifies malicious URL posts more accurately than the other classifiers, reaching an accuracy of 92.6%. The proposed SFD is also compared on sensitivity and specificity, and the comparison shows that it performs well on all the evaluation metrics.
