Abstract
Online social networks (OSNs) such as Facebook and Twitter are used by millions of people around the world to communicate with others. Removing fake accounts improves the protection that OSNs offer. An OSN is modelled as a set of nodes and links, and this model is used to identify fake profiles on Twitter. This paper proposes a novel technique to detect spam profiles, together with a classifier that classifies the profile images from the dataset. The malicious profile detection technique identifies fake profiles using a Twitter crawler that extracts data from each profile. Feature set analysis is carried out over groups of related features. User behavior detection uses an adjacency matrix to measure the similarity between friends' profiles. A multi-variant Support Vector Machine classifier with a kernel function is developed for efficient classification. Compared with the well-known ECRModel, ISMA and DeepLink techniques, the proposed technique achieves a 2.5% higher detection rate, 220 s less computation time and 3.1% higher accuracy.
Introduction
Online social networks (OSNs) are a common communication tool that allows members to establish real relationships and activities. OSNs store a large amount of data that can be analyzed in various contexts [1]. Social network analysis applies social mining techniques to tasks such as similarity identification, expert detection and trust analysis [2]. Different kinds of OSN analysis have been used to synthesize these results. Structural similarity is defined according to the network topology, whereas semantic similarity is defined over the content shared on the social network; both are used to compute similarity values between profiles [3].
Online social networks have been widely used by people around the world to communicate with friends and relatives. The basic components in the hierarchy of social networks are the individual users [4]. Online social networks have succeeded in constructing a network of trust [5]. A user is more likely to respond to a Twitter follower’s message than a message from a stranger [6].
Spamming is the practice of sending unsolicited bulk messages, especially advertisements, indiscriminately. Information sharing through URL (Uniform Resource Locator) shortening services is an important feature of OSNs [7]. Twitter plays a very important role in today's online social networking scenario. It has about a billion registered users, and over 500 million tweets are sent on average every day. This study concentrates mainly on Twitter [8]. It provides a statistical approach to identifying spam profiles on Twitter and identifies a set of 17 features that help in detecting malicious profiles. The extracted features were fed into classification algorithms, which yield detection rates and false positive rates. In an experiment, the entire feature set was used to train the classifiers and their accuracy was tested using 10-fold cross-validation; the classification yielded a detection rate of 99.5% for Twitter [9].
Fake accounts are removed after they have been identified by enhanced techniques in OSNs. Similarity-based profile analysis has been combined with IP addresses in scalable approaches to identify groups of fake accounts created by the same attacker [10]. Common supervised machine learning methods are used to detect malicious accounts. One framework discovers fake accounts from the graph of friends in the social network. Pattern recognition techniques identify fake accounts using the number of followers of every account [11]. A forwarding message tree combined with suitable features has been used to identify the relationship between real and fake accounts. Similarity measures sometimes fail to identify fake accounts, because real users may also be similar to fake users [12]. Overfitting can occur when a large amount of data is processed. In some cases the underlying assumption is wrong and the similarity measure fails on the dataset.
The main contributions of the paper are as follows. A privacy-preserving model is used to identify anomaly attacks dynamically. Malicious profile detection on Twitter is built on a Twitter crawler. Feature set analysis segregates the features into several types. User behavior detection uses the adjacency matrix and the trust value.
Related works
Existing systems employ many different approaches to detect spam on online social networking sites. All detection techniques aim to increase the detection accuracy and reduce the false positive rate. One study presents a suspicious URL detection system for the Twitter stream [13]. In another significant work on identifying spam in OSNs, honey-profiles were created representing different ages, nationalities, etc. Based on activity on Facebook, MySpace and Twitter, six features were developed to differentiate spam profiles from regular profiles. The researchers created a social honeypot to attract spam spreaders on Twitter, whose profiles were analyzed to derive a group of features for classification [14].
The detection rate has been enhanced by using a new feature set that analyzes tweets together with elements such as user name, colour scheme, background, and profile picture/avatar. Social spam detection has been described as a scalable, online spam detection system for social media; it was applied to a Welsh-speaking Twitter network expressed through the non-symmetric “follow” relationship. In that research, a total of 6 user-based and content-based features were identified to separate spam and benign accounts [15]. A related method identifies features in two categories, content-related and tweet-related malicious behaviors [16].
Most earlier approaches dealt with tweet-level detection of spam on Twitter and considered only a limited set of statistical features. This study performs profile-level detection using an enriched set of 17 features that help to identify the malicious nature of a Twitter profile [17]. These features can be used in practice to decide whether a Twitter profile is malicious and, if so, to block it and thus stop the spam from propagating through the Twitter network. Anomalous accounts in OSNs are normally detected by techniques that identify variations in a user's activity [18]. A sudden change in the data access pattern or in user behavior allows the server to flag a suspicious account. If this mechanism fails, the spam account will infect the system with fake details. Learning-based approaches are used to analyze the dataset. The learning method trains on the feature information while computing the classification for each user [19].
Fake profile detection may use dynamic information obtained through learning techniques and behavioural analysis. Community detection techniques identify suspicious activities through an intruder detection strategy. A common social behavior model explores the user profile to address detection-related issues. User behavior in OSNs has been analyzed with a generalized model that classifies a particular user. Other techniques identify suspicious profiles with a horizontal classification technique. An innovative sensing technique has been implemented to detect anomalies [20].
Several techniques implement an effective authentication procedure to solve issues such as data security, for example through key agreement techniques [21]. Constructing an OSN requires a framework that addresses malicious profiles by providing an authentication technique. For user anonymity, procedures consider secure authentication [22], propagation techniques [23], physical location identification [24], and interaction with combined community-related OSNs [25]. Traditional methods such as CAPTCHA are authentication procedures that an application uses to detect and remove fake accounts and to prevent fake profile creation [26].
A pre-stage attack prevention technique detects fake profiles using augmented social graphs. The enhanced model allows every node to add connections to its 2-hop neighbours using a contribution-based heuristic. Providing OSN security is challenging because it requires optimization that scales as the system grows significantly [27]. Anomaly detection based on behavioral profiles requires a framework that builds a profile from behavior information such as tweet time, content, and proximity between users, and then analyzes the level of violation against that profile. OSN users interact with other users, and these interactions can be analyzed for content similarity and compromise attacks [28].
Spam detection in OSNs is an emerging research area that aims to increase accuracy and minimize suspicious attacks, especially on Twitter. Feature sets have been developed for analyzing tweets and follow relationships in category-based malicious behavior detection techniques. These features can be used to recognize a malicious profile, which improves system performance. Many existing techniques fail to provide high accuracy and a high spam detection rate. This paper proposes a privacy-preserving model for detecting anomaly attacks in a real-time scenario. Twitter data is categorized into tweets, @mentions and hash-tags. Seventeen features are identified and segregated into 5 distinct groups for spam detection. The Twitter crawler and feature set analysis are used to perform the classification. When the trust value is calculated, every attribute contributes differently to the final trust value. To form the trust value, every collected behavior value is assigned to a feature of the user behavior, establishing the relationship between the adjacency matrix and the weight values.
Proposed work
To overcome the problems that arise in identifying fake accounts, this paper proposes a method that increases the efficiency of identification. Similarity measurement is used to identify the strength of the relationship between friends' accounts. The overfitting problem is addressed by the feature extraction technique. A resampling technique is used to balance the datasets. A graph-based adjacency matrix is used to identify the similarity between profiles. The proposed framework detects the identity categories of the OSN architecture. In contrast to existing techniques, the proposed anomaly detection methodology addresses anomalous user profiles and extracts a large number of features from the Twitter dataset. User profiles contain personal data such as the user name and phone number. The private security model must identify attackers who use fake profiles to steal data and information. Trustworthiness and protection issues are addressed using an efficient and dynamic OSN architecture. The privacy-preserving model is constructed to detect anomaly attacks dynamically; Fig. 1 shows the OSN communication model.

OSN communication model.
Identifying fake profiles on Twitter is the main problem in constructing OSNs. The main characteristics of Twitter lead to specific features that help in labelling a profile as malicious or benign. These features were identified in the Twitter network and their details were crawled using an HTML parser. The Twitter features are listed and discussed, and their values were extracted from all the collected Twitter profiles. The extracted features were used to identify whether an account is benign or spam. The statistical features take different values that typically indicate whether the profile is spam or not. Predicting an OSN attack depends on the classification of anomalies and needs several OSN attributes for training, gathered from the details of users connected to the server. An OSN consists of a group of users as nodes and the relationships between them as links. The set of nodes is $Node = \{node_1, node_2, \ldots, node_n\}$ and the set of links is $Link = \{link_1, link_2, \ldots, link_n\}$; the interactions between members produce the social relations.
Twitter data and characteristics
A set of profiles was collected from the Twitter social network, including manually classified benign and spam profiles. A total of 405 profiles were taken from Twitter, of which 211 were benign and 194 were malicious. These accounts were the main source of our data. At most 3200 tweets from a single profile were considered for analysis, even if the user had tweeted more than this. Only public profiles were used for collecting the data and performing the analysis.
Figure 2 shows the characterization of Twitter data. The main characteristics of Twitter from which specific features can be identified and collected include: Tweets: messages through which information can be disseminated by sharing a link or writing a message of no more than 140 characters. @mentions: used to address somebody. Hash-tags: popular topics discussed in tweets; the topic is preceded by a hash (#), hence the name. URLs: URLs can be shared in tweets.

Twitter data characterization.
The features are broadly classified into interaction-related, tweet-related, URL-related, tag/@mention-related and age-related features. These features are listed in Table 1; eleven of them were used previously and the remaining six are novel features proposed in this study. The interaction-related feature introduced in this study is the Follower of Following (FoFo) ratio (F2). The FoFo ratio captures the friendship details of a user as well as the user's popularity with others on Twitter. A large number of following and a small number of followers mark an account as suspicious. The tweet-related features introduced in this study are API count (F7) and tweeting rate (F8). The API count is considered because a larger number of API-generated tweets makes the account more suspicious. Tweeting rate is the ratio of the total number of tweets to the age of the account. A high tweeting rate is typical of a malicious user, so it is captured and used.
Interaction related features
The URL-related feature introduced is the API URL ratio (F12). The age-related features in this work are the age of the account (F16) and the following rate (F17). The age of the account is the number of months that the user has been on Twitter. Most spammers keep a Twitter account for only a short span of time. Together, these newly introduced features and the previous 11 features increase the accuracy of spam detection.
Classifying a profile as spam or not involves several steps that must be executed before its malicious nature can be confirmed. The architectural design of the work on the Twitter social network is given in Fig. 3, which shows the actual sequence of steps executed to identify malicious profiles. It starts with identifying the features in the Twitter network that help to detect malicious profiles. Various public profiles were selected at random from the Twitter public directory. These constituted the majority of the benign profiles used to produce the dataset that trains the classifier. A carefully examined, manually collected set of spam profiles was also used to produce the dataset. After the profiles were identified, a Twitter crawler was written to crawl the required data from the Twitter profiles.

Malicious profile detection.
A Twitter crawler is used to parse each Twitter profile and extract the required data. The crawler parses every character in the profile and checks for specific characters, words or sequences of characters. Based on the presence or absence of these, certain actions are taken and decisions made. Using the crawler, the required features can be extracted from the Twitter profile; this constitutes the next step. In the subsequent step, the extracted features are formatted to create the Twitter dataset, which is used to train the classification algorithms for detecting malicious profiles. These algorithms classify the profiles as malicious or benign based on the feature set fed to them.
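The character-level checks described above can be sketched with regular expressions. This is only an illustration, not the crawler used in the study: the patterns, the function name `extract_entities` and the choice to work on plain tweet text rather than on profile HTML are all assumptions.

```python
import re

# Hypothetical patterns; the study does not give the crawler's actual rules.
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")
URL_RE = re.compile(r"https?://\S+")


def extract_entities(tweet_text):
    """Return the hash tags, @mentions and URLs found in one tweet."""
    urls = URL_RE.findall(tweet_text)
    # Strip URLs first so that fragments such as "#section" inside a link
    # are not counted as hash tags.
    text_without_urls = URL_RE.sub(" ", tweet_text)
    return {
        "hashtags": [h.lower() for h in HASHTAG_RE.findall(text_without_urls)],
        "mentions": [m.lower() for m in MENTION_RE.findall(text_without_urls)],
        "urls": urls,
    }
```

Lower-casing the tags and mentions is a design choice made here so that `#Free` and `#free` count as the same hash tag when unique counts are taken later.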
Feature set analysis
The feature set forms the basis of classification; hence this study analyzes the features mentioned in the previous sections. The specific values of each feature were examined, and for every feature certain conclusions were reached that aid in classifying the profiles. A detailed analysis of every feature and its contribution to identifying spam content is presented in Fig. 4.

Feature set analysis.
Number of followers/following (F1): A larger number of followers indicates that more people trust the account, which suggests that the profile is trustworthy. The following number indicates how many accounts this account follows. Spammers usually follow a large number of profiles to gain popularity.
FoFo ratio (F2): This ratio captures the friendship details of the user as well as the user's popularity with others on Twitter. A large number of following and a small number of followers mark the account as suspicious. A high FoFo ratio is an indication that the profile is malicious. The FoFo ratio is computed using Equation (1).
If the FoFo ratio is less than 0.09, set the spam flag. Count the total number of hash tags in the profile; if the count exceeds 785, set the spam flag. Let n be the total number of unique hash tags in the profile.
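The two threshold checks above can be written as a short function. Equation (1) is not reproduced in this text, so the sketch assumes that the FoFo ratio is followers divided by following, which is the reading under which a value below 0.09 corresponds to many following and few followers. The function name and the handling of accounts that follow nobody are illustrative choices.

```python
def interaction_flags(followers, following, total_hashtags):
    """Apply the FoFo-ratio and total-hash-tag thresholds to one profile."""
    # Assumption: FoFo = followers / following (Equation (1) is not shown).
    if following == 0:
        # An account that follows nobody cannot fall below the threshold.
        fofo = float("inf")
    else:
        fofo = followers / following
    spam = fofo < 0.09 or total_hashtags > 785
    return {"fofo": fofo, "spam": spam}
```

For example, an account with 50 followers that follows 1000 others has a ratio of 0.05 and is flagged, while one with 500 followers and 400 following is not.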
Total number of hash tags (F3): The number of hash tags indicates how much the profile interacts with a large community. The total number of hash tags will be high for a malicious profile: a spammer can spread malicious content more easily with a hash tag, because it is visible to a larger community. Table 1 lists the interaction-related features: the number of followers, the FoFo ratio and the total number of hash tags.
Number of unique hash tags (F4): Benign users typically use a variety of hash tags, whereas spammers use the most popular hash tags, since these are seen by most people and make it easier to spread malicious content. The number of unique hash tags is therefore smaller for malicious profiles.
Maximum frequency of hash tags (F5): This value provides no information on its own but is useful together with the hash tagging rate. A large maximum value along with a high average hash tagging rate indicates the malicious nature of a profile. Let $freq(x_i)$ be the frequency of occurrence of hash tag $x_i$. The maximum frequency of hash tags in the profile is computed using Equation (2).
Average frequency of hash tags (F6): A high average value along with a small number of hash tags highlights the malicious nature of the profile. The average value of the hash tag frequency histogram of the profile (F6) is computed using Equation (3).
If the maximum frequency of hash tags in the profile (F5) exceeds 63 and F6 > 19, set the spam flag.
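The hash tag features F3 to F6 can be derived from a frequency count. Equations (2) and (3) are not reproduced in this text, so the sketch below assumes that F5 is the largest per-tag frequency and that F6 is the total number of hash tags divided by the number of unique hash tags. The function name is hypothetical.

```python
from collections import Counter


def hashtag_features(hashtags):
    """Compute F3-F6 from the list of hash tags used by one profile."""
    counts = Counter(hashtags)            # freq(x_i) for every hash tag x_i
    f3 = len(hashtags)                    # total number of hash tags
    f4 = len(counts)                      # number of unique hash tags
    f5 = max(counts.values(), default=0)  # assumed form of Equation (2)
    f6 = f3 / f4 if f4 else 0.0           # assumed form of Equation (3)
    spam = f5 > 63 and f6 > 19
    return {"F3": f3, "F4": f4, "F5": f5, "F6": f6, "spam": spam}
```

A profile that used one tag 70 times and another 10 times has F5 = 70 and F6 = 40, so both conditions hold and the flag is set.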
API count (F7): When many tweets are generated automatically and a large number of these automated tweets have a URL embedded in them, the profile is likely to be malicious. This feature, together with the API URL ratio (F12), indicates the malicious nature of a profile.
Tweeting rate (F8): A high tweeting rate indicates that the profile is malicious: in a short span of time, a malicious profile posts a large number of tweets in order to spread spam quickly. The tweet-related features are listed in Table 2.
Tweets related features
Total number of URLs shared by a user (F9): Malicious profiles tend to share a huge number of URLs. Most of these URLs are repetitive, as a small number of unique URLs are shared repeatedly by these profiles. A very high value for this feature indicates that the profile is malicious.
Total number of unique URLs (F10): The number of unique URLs is very small for a malicious profile. This feature, together with the total number of URLs shared by a profile, models the malicious behavior of the profile. A spammer has a small number of URLs that spread spam content, and the malicious profiles circulate these spam URLs widely.
Average frequency of the URLs (F11): A high value for this feature indicates that a small number of distinct URLs are shared a large number of times, which is indicative of malicious behavior. Extract the total number of unique URLs in the profile (F10). If F9 > 2500 and F10 < 30, set the spam flag. Compute the average URL frequency (F11) using Equation (4).
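The URL features follow the same pattern. Equation (4) is not reproduced in this text; the sketch assumes that the average URL frequency is F9 divided by F10, which matches the description of few distinct URLs shared many times. The function name is hypothetical.

```python
def url_features(urls):
    """Compute F9-F11 from the list of URLs shared by one profile."""
    f9 = len(urls)                  # total number of URLs shared
    f10 = len(set(urls))            # number of unique URLs
    f11 = f9 / f10 if f10 else 0.0  # assumed form of Equation (4)
    spam = f9 > 2500 and f10 < 30
    return {"F9": f9, "F10": f10, "F11": f11, "spam": spam}
```

For instance, 3000 shared links that cycle through only 20 distinct addresses give F11 = 150 and set the flag.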
API URL ratio (F12): This ratio indicates how many of the automated tweets contain a URL. A high ratio indicates that the profile is malicious: a spammer with multiple accounts to manage would resort to automation to spread URLs more easily. Table 3 lists the URL-related features. If the tweet rate exceeds 195, set the spam flag. Find the total number of tweets whose source is the API (F7) and compute the API URL ratio (F12) using Equation (5).
URL related features
The @mention count (F13): A large number of @mentions shows that the user is interacting with a huge number of people many times. Benign users typically do not do this, as they communicate with and address only a small number of people. Thus, a large value suggests that the profile is malicious. Calculate the total number of @mentions used by the account (F13). If F13 exceeds 2000, set the spam flag.
Unique @mentions (F14): This feature along with the @mention rate can indicate the malicious nature of a profile. Extract the number of unique @mentions in the profile (F14).
The @mention rate (F15): On Twitter, a high @mention rate together with a large following or follower count indicates that the profile is malicious. An extremely small mention rate is another indicator. The number of @mentions per friend (F15) is computed using Equation (6).
If (followers > 1000) or (following > 1000), then if (F15 > 100) or (F15 < 2.5), set the spam flag. Extract the total number of URLs shared by the profile (F9). Table 4 lists the @mention-related features.
@mention related features
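The @mention rules can be sketched as follows. Equation (6) is not reproduced in this text, so the sketch assumes that "per friend" means per account followed, that is, F15 = F13 / following; the function name is hypothetical.

```python
def mention_features(mentions, followers, following):
    """Compute F13-F15 and apply the @mention thresholds to one profile."""
    f13 = len(mentions)                        # total @mentions
    f14 = len(set(mentions))                   # unique @mentions
    # Assumption: a "friend" is an account the profile follows.
    f15 = f13 / following if following else 0.0
    spam = f13 > 2000
    # The rate thresholds only apply to well-connected accounts.
    if followers > 1000 or following > 1000:
        spam = spam or f15 > 100 or f15 < 2.5
    return {"F13": f13, "F14": f14, "F15": f15, "spam": spam}
```

Under this reading, an account with more than 1000 followers that follows nobody has a rate of zero and is flagged by the lower threshold, which may or may not be what the study intends.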
Age of the account (F16): The age of the account together with the number of tweets of the profile gives the tweeting rate. A high rate indicates that the profile is malicious. Similarly, the age of the account along with the number of following indicates the following rate of a profile.
Following rate (F17): A spammer follows many accounts in a short time to gain popularity. This is captured by the following rate; a high value indicates that the profile is malicious. Table 5 lists the age-related features. Find the age of the account (F16) and compute the following rate (F17) using Equation (7).
Age related features
If F17 > 100, set the spam flag. Extract the total number of tweets posted by the account and compute the tweet rate using Equation (8).
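A minimal sketch of the age-related checks is given here. Equations (7) and (8) are not reproduced in this text; the sketch assumes that both rates divide a count by the account age in months, as the feature descriptions suggest, and it combines the F17 threshold with the tweet-rate threshold of 195 stated earlier. Clamping the age to at least one month is an illustrative choice to avoid division by zero.

```python
def age_features(age_months, following, tweets):
    """Compute the following rate (F17) and tweet rate for one profile."""
    # Assumption: age is measured in months; very new accounts count as 1.
    age = max(age_months, 1)
    f17 = following / age        # assumed form of Equation (7)
    tweet_rate = tweets / age    # assumed form of Equation (8)
    spam = f17 > 100 or tweet_rate > 195
    return {"F16": age, "F17": f17, "tweet_rate": tweet_rate, "spam": spam}
```

An account that is two months old and already follows 500 profiles has F17 = 250 and is flagged.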
User behavior detection is based on a graph model. An adjacency matrix is constructed for the graph of social network accounts, and similarity measures between friends are computed between the nodes. The similarity matrix is built with measures such as the Jaccard coefficient and the Dice coefficient. The similarity measures are identified once this process is complete. The adjacency matrix is computed in Equation (9).
where $am_{ij}$ is the matrix entry that gives the evaluation factor used to identify the similarity between nodes. The evaluation set is the group of evaluation elements, represented as a vector with a specific level for each attribute. The evaluation element vector $EV = \{ev_1, ev_2, \ldots, ev_n\}$ is the component of the trust level of the evaluation set.
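To illustrate how an adjacency matrix supports friend-similarity measurement, the sketch below builds a binary matrix from a list of links and compares the friend sets of two nodes. Equation (9) is not reproduced in this text; a symmetric 0/1 matrix is an assumption, and the study's entries may carry ratings instead. Function names are hypothetical.

```python
def adjacency_matrix(n, links):
    """Build a symmetric 0/1 adjacency matrix for n nodes."""
    am = [[0] * n for _ in range(n)]
    for i, j in links:
        am[i][j] = 1
        am[j][i] = 1  # assumption: friendship is treated as undirected
    return am


def friend_similarity(am, i, j):
    """Jaccard similarity between the friend sets of nodes i and j."""
    friends_i = {k for k, linked in enumerate(am[i]) if linked}
    friends_j = {k for k, linked in enumerate(am[j]) if linked}
    union = friends_i | friends_j
    return len(friends_i & friends_j) / len(union) if union else 0.0
```

With the links (0,1), (0,2), (1,2) and (2,3), nodes 0 and 3 share one friend out of two distinct friends, giving a similarity of 0.5.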
The weight vector identifies the contribution of each attribute value to the overall activity. The highest value indicates the most efficient overall behavior (Fig. 5). The qualitative characteristics are described by three quantities: Entropy ($Entro_n$), the qualitative concept of the expected level of distribution; Expectation ($Exp_x$), the measurement of the uncertainty for the entropy; and Hyper Entropy ($HyEntro_e$), the entropy of the uncertainty measurement with an index value. User behavior has been reported to be a complex task with uncertainty. The rating range provided by the adjacency matrix is $[rating_{mini}, rating_{maxi}]$. The Expectation value is computed using Equation (10).

Trust value formation.
The value of Entropy is computed in Equation (11).
The value of Hyper Entropy is computed in Equation (12).
The factor group $FG = \{Exp_{x_i}, Entro_{n_i}, HyEntro_{e_i}\}$ describes the behavior of the user within the interval $\alpha_i$; the adjacency matrix is computed using Equation (13).
Every activity has a different impact on the assessment result, which is reflected in the weight value of the user behavior in OSNs. The objectivity is computed in Equation (14).
The Entropy can be computed in Equation (15).
The value of
The weight value is computed using Equation (17).
where $rating_{ij}$ is the most common component in the formation of the adjacency matrix. The entropy with the weight value is computed when there is no difference between the behavior attributes at the rating level. The evaluation vector EV is computed by multiplying γ by β. Interference caused by similar factors in the OSN is eliminated; the interference factor η is used to adjust the computed evaluation result. The final evaluation vector is computed in Equation (18).
The computed trust score relates to the overall activity of the user; related factors are eliminated to give a better picture of user behavior in OSNs in the current situation. The value of β is computed from the user activation in Equation (19). Once the matrix is formed, the classification method is applied to assign labels to particular nodes. The class containing the input image details is determined by the individual classifier functions, and the fusion decision is computed for the multi-variant function.
A set of hyperparameters is tuned, since it directly governs the behavior of the proposed technique and has a significant impact on the measured performance; the tuning is carried out over a large number of experiments.
An Artificial Intelligence based Support Vector Machine, called the Multi-variant Support Vector Machine classifier, is used to classify the OSN dataset. This classifier maximizes the margin in feature space with respect to non-linear boundaries, which requires a kernel function. The kernel function is computed in Equation (20).
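Equation (20) is not reproduced in this text, so the kernel actually used is unknown. As an illustration of what a kernel function for non-linear boundaries looks like, the sketch below implements the widely used Gaussian (RBF) kernel; the choice of RBF, the function name and the default `gamma` are assumptions, not the study's settings.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)
```

The kernel returns 1 for identical feature vectors and decays towards 0 as they move apart, which lets a support vector machine separate classes that are not linearly separable in the original feature space.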
The base level contains multiple variant classifiers that perform the classification. The probability of the input image from the dataset is computed using Equation (21).
The weighted values are added to determine the classification. The social network classifies edge-related constraints from the initial information when classifying Twitter or Facebook data; edges are included to analyze the links between nodes as friends. Finally, the class of the input image is generated for the evaluation process, as shown in Fig. 6.

Multi-variant Support Vector Machine.
The Kaggle dataset [29] is used for spam detection of profiles in OSNs. A cross-validation based performance model is used for evaluation, and the proposed technique is compared with the related techniques ECRModel [30], ISMA [31] and DeepLink [32]. The ROC plot is used to evaluate the classifier over several threshold values without fixing any single threshold. Figure 7 shows the ROC curve with AUC for the proposed technique. Based on the analysis, an algorithm was developed that provides the set of steps involved in detecting malicious profiles. If F7 > 50 and F12 > 0.8, set the spam flag; the account is then a spam profile. According to the setting of the spam flag, it can be decided whether the profile is spam or benign.

ROC curve.
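The API rule and the final flag-based decision can be sketched as follows. Equation (5) is not reproduced in this text; the sketch assumes that F12 is the fraction of API-sourced tweets that contain a URL, and that a profile is labelled spam as soon as any rule has set the flag. The tweet representation and function names are hypothetical.

```python
def api_features(tweets):
    """tweets: list of (source, has_url) pairs for one profile."""
    api_has_url = [has_url for source, has_url in tweets if source == "API"]
    f7 = len(api_has_url)                      # number of API-sourced tweets
    # Assumption: F12 = API tweets containing a URL / all API tweets.
    f12 = sum(api_has_url) / f7 if f7 else 0.0
    return {"F7": f7, "F12": f12, "spam": f7 > 50 and f12 > 0.8}


def classify(flags):
    """Label a profile from the spam flags set by the individual rules."""
    return "spam" if any(flags) else "benign"
```

A profile with 60 API tweets, 54 of which carry a link, has F12 = 0.9 and is labelled spam even if every other rule leaves the flag unset.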
For the sample profile, the classification result is a spam profile because the total number of @mentions exceeds 2000. A real-time example of Twitter profile classification is shown in Table 6.
Twitter profile classification
The Jaccard coefficient in Fig. 8 measures the similarity of binary data as the size of the overlapping area; it is computed in Equation (22).

Jaccard Coefficient.
where μ denotes the binary value and ϑ is the truth value used for the comparison.
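Equation (22) is not reproduced in this text. For reference, the textbook form of the Jaccard coefficient, written with the symbols above and assuming that μ and ϑ are read as the sets of positive entries in the predicted and true labelling, is:

```latex
J(\mu, \vartheta) = \frac{|\mu \cap \vartheta|}{|\mu \cup \vartheta|}
```

The value ranges from 0 (no overlap) to 1 (identical sets).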
The Dice coefficient in Fig. 9 is used to evaluate the classification; it gives the percentage of pixels of the predicted spam profile from the dataset and is computed in Equation (23).

Dice Coefficient.
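Equation (23) is not reproduced in this text. A minimal sketch of the standard Dice coefficient on two binary label vectors is given below, assuming 1 marks a spam prediction or a true spam label; the function name and the convention of returning 1.0 when both vectors are empty of positives are illustrative choices.

```python
def dice_coefficient(predicted, truth):
    """Standard Dice coefficient: 2 * |P and T| / (|P| + |T|)."""
    if len(predicted) != len(truth):
        raise ValueError("label vectors must have the same length")
    overlap = sum(1 for p, t in zip(predicted, truth) if p and t)
    total = sum(1 for p in predicted if p) + sum(1 for t in truth if t)
    # Convention: two vectors with no positives agree perfectly.
    return 2 * overlap / total if total else 1.0
```

The Dice coefficient D and the Jaccard coefficient J of the same pair are related by D = 2J / (1 + J), so they rank classifiers identically.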
The detection rate is the percentage of spam detected in the dataset. Fig. 10 shows that the proposed technique has the highest detection rate among the compared methods. The detection is based on the statistical features of a profile.

Detection rate.
The computational time is the time needed to complete the specified computational processes and reflects the quality of the classification by the classifiers. Fig. 11 shows that the proposed technique requires less computational time than the related techniques.

Computational time.
Accuracy measures how correctly the method predicts the spam profiles. The comparison in Fig. 12 shows that the proposed technique achieves the highest accuracy among the compared techniques.

Accuracy.
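Detection rate, false positive rate and accuracy can all be derived from the four confusion-matrix counts. The sketch below uses the usual definitions, with spam as the positive class; the function name and the example counts are illustrative and are not results from the study.

```python
def evaluation_metrics(tp, fn, fp, tn):
    """Metrics from confusion-matrix counts, with spam as the positive class."""
    detection_rate = tp / (tp + fn)        # share of spam profiles caught
    false_positive_rate = fp / (fp + tn)   # share of benign profiles flagged
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return {
        "detection_rate": detection_rate,
        "false_positive_rate": false_positive_rate,
        "accuracy": accuracy,
    }
```

With 90 spam profiles caught, 10 missed, 5 benign profiles wrongly flagged and 95 correctly passed, the detection rate is 0.90, the false positive rate 0.05 and the accuracy 0.925.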
Figure 13 shows the spam detection for the feature set, which is divided into 5 types: F1 to F3 are interaction-related features, F4 to F8 are tweet-related features, F9 to F12 are URL-related features, F13 to F15 are @mention-related features, and F16 and F17 are age-related features. The experimental result shows that the proposed technique has higher spam detection than the other techniques.

Spam detection for features.
Evaluating OSNs requires 4 types of resolution for the performance evaluation. First, population-level measurements cover the entire social network within a specified time period. Second, community-level measurements capture the interactions within communities. Third, user-level measurements capture the interactions through user activity. Finally, content-level measurements are maintained over the node-based simulated network. Figure 14 shows the overall performance of every model according to the average metric value; the proposed technique performs well compared with the other techniques at every resolution.

Average metric value.
This paper proposed a novel spam detection technique for detecting and classifying malicious profiles in OSNs. The malicious profile detection technique detects fake profiles with a Twitter crawler. Feature set analysis extracts the features for classification, and the multi-variant Support Vector Machine is used to improve classification accuracy. Future work will explore the handling of hierarchical links between malware accounts. The hierarchy provides the threat source, and adaptive loss functions are used to generate the loss function according to the neural network concept. An innovative regulator is required to produce effective results for the testing and training processes.
