Abstract
Location-based social networks are becoming a unique platform for understanding user behaviors and providing pervasive services in intelligent environments. However, fake users or accounts can undermine user analytics and lower the value of the applications and services intended for real users. Mining a large Foursquare dataset and related Twitter accounts, we tested different user features with the goal of classifying fake users. Experiments demonstrate an accuracy of over 95% in detecting fake users. Filtering out these fake users reduces the error rate of a location-based activity predictor by 4.4% and, if applied to a recommender system, avoids wasting 35% of the delivered coupons or promotion codes.
Introduction
Location-based social networks (LBSNs) are playing an increasingly important role in the lives of millions of people as a part of current intelligent environments. Used with good intention, they are powerful platforms with remarkably rich information about users that can be used to provide invaluable services that are not available elsewhere. Some examples are social recommendations, social health challenges, understanding user interests and serving tailored advertisements and offers, predicting trends in a population and intentions/plans of individuals.
In addition to legitimate users, many LBSNs have fake users. These can include spammers trying to promote their business, disgruntled users trying to bully or attack others, and users trying to game the system for some reward (e.g. becoming a user with privileges). In the Foursquare [16] sample used in this work, 6.5% of the users were fake; this is consistent with other social networks (e.g. NBC reports [27] that about 8% of Facebook users are fake).
Despite the relatively small number, the fake users cause significant damage to the social network data veracity and its operators. For example:
Fake users alienate legitimate users and make them less likely to use the network.
Fake users consume a disproportionate amount of resources. In our sample, for instance, fake users consumed 60% of resources even though they made up only 6.5% of users. Thus, even though the damage caused by any individual fake user is negligible, collectively they significantly increase the cost of managing and operating a social network.
Fake users can disrupt other LBSN operations, such as user modeling, prediction, and targeted advertising. For example, in our sample, including fake users would decrease the prediction accuracy by 4.4%. In addition, despite being only 6.5% of the population, fake users would receive 35% of the coupons or promotion codes in a targeted ad campaign.
Social network operators are well aware of these problems. For example, 27% of the Fake user accounts we identified in our dataset were closed within less than a year. Closing user accounts requires the involvement of a system administrator, and is potentially costly as it may cause a dispute or even open the company to a lawsuit. The fact that the operators chose to close those accounts anyway indicates that fake users present a real problem for an LBSN operator. Given the significance of this growing problem, we were surprised to find that there has been little (published) work in the community that tackles it systematically. This paper attempts to bridge this important gap, with specific reference to classifying Fake users in a large Foursquare dataset, combined with related Twitter accounts. We believe that our approaches can be generalized to other LBSNs that expose similar spatio-temporal features.
There are several challenges in detecting these Fake users. Firstly, we have found that reliable data samples of Fake users are key to developing accurate detection algorithms (via supervised learning). We came to this conclusion after attempting, but failing to solve the problem satisfactorily without any samples (i.e. using manually defined rule sets and unsupervised learning techniques such as clustering). Nevertheless, the acquisition of enough credible samples is challenging. We are not aware of any LBSN providers that release information on Fake users they may have correctly (or incorrectly) flagged. Meanwhile, manually labeling users is prohibitively expensive and unreliable. In this paper, we have adopted a novel crowd-sourcing approach to obtain reliable Fake user samples at low cost.
Secondly, a user’s Location-based Social Network (LBSN) data contains abundant information ranging from check-in histories, through comments, friends, to various profile data. Extracting the right features from this information for Fake user detection is critical to success and non-trivial. We have used our intuition, observations and experimentation to define a highly effective feature set.
Lastly, the Fake user detection problem is a binary classification problem where many supervised learning algorithms may apply. Without extensive comparison, it is difficult to determine which method is most suitable. We have conducted a comparative study that has yielded interesting insights in this regard.
Our research contributions can be summarized as follows:
An end-to-end supervised learning approach for Personal/Fake binary classification in LBSNs, including obtaining labeled user samples, feature extraction and selection.
A novel crowd-sourcing approach to obtain reliable Fake user samples at low cost.
A comprehensive evaluation of our end-to-end approach, including feature selection, and supervised and unsupervised algorithms.
Quantification of the value of Fake user detection using different real-world LBSN use cases and scenarios, including: early detection of Fake users, improvement of predictive analysis accuracy, lowering costs in recommender systems based on predictive analysis, and resource consumption reduction by avoiding high-activity Fake users (i.e. spammers).
The remainder of the paper is organized as follows: Section 2 reviews the related work. A description of the Foursquare dataset and the noise problem in LBSNs is presented in Section 3. Section 4 describes our model and methods to classify users in the dataset. The performance of these models is evaluated in Section 5, and its business impact is evaluated in Section 6. Finally, conclusions and future work are presented in Section 8.
Related work
Time-location information for user activity modeling was first obtained through experiments with GPS devices, cellular tower triangulation, WiFi logs, or the like. González et al. analyze in [18] a six-months location dataset of 100,000 cellphones characterizing a high degree of temporal and spatial regularity in human activities. Ashbrook and Starner presented in [4] a model of user location prediction based on meaningful locations extracted from a reduced number of users carrying a GPS device. Mobility prediction using GPS is evaluated in different works. For example, Krumm and Horvitz [21] infer driver destinations from a history of driver’s destinations, along with data about driving behaviors (types of destinations, driving efficiency and trip times).
In contrast to GPS time-location information, LBSNs offer a novel and dynamic source of data on real user activities to the research community. Most LBSNs have gaming aspects that make their use more common and attractive for people, thus generating a wide set of semantically described locations. Lindqvist et al. [25] analysed qualitatively and quantitatively how and why people use an LBSN like Foursquare. Their interview results point to fun, self-presentation, exploration and coordination with friends as the main reasons to use Foursquare. Cramer et al. study in [13] Foursquare check-ins from a sociological perspective, including motivations for sharing the location and the gamification component. Finally, Guo et al. propose in [19] the paradigm of Mobile Crowd Sensing and Computing, extending the vision of participatory sensing by crowdsourcing sensory data from mobile devices and user-contributed data from mobile social networking services.
A number of authors use LBSNs as core engines for recommendation systems. Noulas et al. explore in [28] the opportunities of Foursquare to be used as dataset to enable recommendation systems (including the concept of activity and place transitions). User localization and mobility in LBSNs is studied from different perspectives. Cheng et al. model human mobility patterns in [11] by analyzing the spatial, temporal, social and textual aspects associated with different LBSN datasets (Twitter, Foursquare, Gowalla, …). Scellato et al. provide in [31] LBSN socio-spatial analysis, including Brightkite, Gowalla and Foursquare data. Location information is obtained retrieving four million public tweets sharing the aforementioned location, from 250,000 different users and 1.5 million locations. Social information such as the friends network is also included in the analysis. Cranshaw et al. describe in [14] a methodology to understand the dynamics of a city using Foursquare. Quercia et al. tested in [30] whether established sociological theories of real-life networks hold in Twitter. More recently, Twitter and Foursquare have been used in [32] and [24] to detect whether the users tweeted from work or home based on semantic similarity and spatial autocorrelation, and predict user interest, respectively.
LBSNs allow an almost real-time analysis of a large number of users’ locations and activities, which is far easier and more extensible than providing GPS devices to a few test users. However, they might introduce undesired behaviors in the dataset due to the presence of fake users. For instance, Boshmaf et al. [7] and Aiello et al. [2] introduced bots in real LBSNs which actually socialized with real users and obtained private data from them. Next, we review recent works where LBSN data has been used to detect fake reviews (i.e. tip spamming) and fake accounts in different LBSNs.
Summary of most relevant Fake user detection related work
Temporal and spatio-temporal features are used in detecting fake reviews in online review sites. Temporal pattern discovery is used in [35] to analyze hotel reviews, discovering temporal patterns depending on the number of reviews per user. Vasconcelos et al. [34] crawled Foursquare to characterize user behavior based on tip information. Using an expectation-maximization clustering algorithm, they classified users into four groups, one of which contained tip spammers. In [1], machine learning techniques to detect tip spammers are presented, which correctly distinguish legitimate users from tip spammers with 89.8% accuracy. A tip spamming classification approach in a local Brazilian LBSN is presented in [12], using data labeled by the LBSN moderators. Zhang et al. analyze in [39] different features from review LBSNs plus location entropy to detect fake or spam reviews, achieving an 85% accuracy based on manual labeling. Spatio-temporal patterns are used in [23] to detect opinion spam based on the location of restaurants and the theoretical travel speed of the users providing reviews, achieving 85% accuracy. Finally, Chen et al. measure and analyze in [10] fake tips in Foursquare based on sentiment analysis, although no prediction or classification is provided.
The presence of fake users or accounts in LBSNs has received attention recently. In [17] an information-theoretic approach is used to classify user activity on Twitter, based on the retweeting activity. Among the classified activities, the authors found automatic/robotic activity and parasitic advertising. However, these approaches rely on activities such as retweeting that are rare or absent from most LBSNs, where the emphasis is instead on location or co-location. In [8] one of the principal LBSNs in Spain is used to find fake accounts, correctly detecting ∼90% of these accounts using social-graph properties. In [37], a honeypot-based solution to identify game cheaters in LBSNs is proposed, since the incentives for these behaviors can be either monetary rewards earned by unlocking special offers or virtual rewards (gamification). The strategy of cheating in LBSNs, including Foursquare, driven by user activity rewards or gamification is introduced in [6]. That work reports the phenomenon, but does not propose a way of detecting it automatically. Xuan et al. proposed in [36] a method for malicious account detection based on machine learning, with 89% performance. Papalexakis et al. analyze in [29] anomalies in Foursquare check-in behavior based on tensor decomposition. However, no performance metrics for classification or prediction are provided.
Table 1 summarizes the most relevant methods, together with their goal, used database, performance and labeling approach. In this work, we propose an automatic supervised learning classification algorithm that achieves 95% accuracy in identifying fake users automatically from spatio-temporal information (Foursquare), and user profiles (Twitter), based on a novel crowd-sourced labeling approach.
Some users link their Foursquare check-ins to Twitter accounts, making them publicly available. The corresponding public broadcasts in Twitter can be listened to in order to build a database of check-in events over time. The dataset used in this work consists of more than 180 million public Foursquare check-ins, with no geographic restrictions, by about 6 million users from 2010 to 2011. We refer readers interested in the dataset acquisition to [38]. The dataset was later updated with data from 2012 to 2014 following the same procedure as in [38]. Since August 2014 the check-in function has been removed from Foursquare and integrated into a new application called Swarm.
Each check-in event consists of a unique ID, the ID of the user who checked in, and further information about the venue, including the full name, street address, city, state, postal code, country, latitude, longitude, a tier 1 Foursquare-assigned category and a tier 2 category (e.g. “Food/Coffee Shop”).
This data is later combined with Twitter users’ account information. Hence, we can relate both LBSNs user information. Twitter accounts often provide profile details which can help to understand user behavior [30]. Retrieving user’s profile through Twitter application programming interface (API) can return four options:
User has made the account private.
User has closed or deleted this account.
User has been suspended (by Twitter).
Access to user profile including: account name, profile description, number of friends/followers, etc.
For options 1 and 2 no further information can be obtained. Option 3 implies that Twitter has closed the account because of a misuse of the service; hence, the account probably belongs to a Fake user [17]. The fourth option includes all public profile information from Twitter accounts. By analyzing the profile contents, a keyword frequency analysis can be performed, looking for patterns or trends. Furthermore, whether the account has zero friends or followers is also considered. This bias in the social graph is also exploited in [8] to detect fake accounts in LBSNs. Note that we do not collect tweets not associated with Foursquare, nor any retweet information as in [8] and [17].

Fake user spatio-temporal behavior in the Foursquare dataset. Different color dots represent different tier-1 activities in the Foursquare dataset. (a) Temporal representation of a Foursquare Fake user check-ins. Horizontal axis represents time of day, vertical axis represents consecutive days. (b) Spatial representation of a Foursquare Fake user check-ins. Horizontal axis represents Longitude and vertical axis represents Latitude.
Features summary from Foursquare and Twitter dataset
Noise is generated in LBSNs in the form of fake users. We look for behaviors that appear to be generated by automatic software agents. Our dataset, for example, shows users reporting the exact same location at fixed times for several months, or users constantly moving between different continents.
The data retrieval system can also add noise to the dataset. Looking for Foursquare check-ins in the Twitter feed can result in wrong records if not properly handled. A Twitter user can retweet her Foursquare-generated tweets, or reply to other users’ Foursquare tweets. These practices generally create the illusion of different check-ins from a single place or activity.
Let us imagine a scenario where users hacking the Foursquare API are constantly reporting specific commercial locations for advertising purposes, or users are retweeting multiple friends’ Foursquare information. Since the ultimate goal of retrieving LBSN information is to create preference or activity user models, there is an obvious interest in classifying users into Personal and Fake. Fake user data can then be filtered out from the dataset to improve the Personal user models and avoid wrong user models. In the remainder of this paper we refer to Personal and Fake users, the latter being those we intend to filter out from our dataset.
Figure 1 shows the temporal representation of one Fake user's activity (a), and the geographic representation of the same activity (b). Notice the regular activity intervals within 100 days, plus the geographic distribution where Europe, America, and south-east Asia can be easily identified, meaning that this user has apparently been almost everywhere in the world in that short period.
Methodology
In this section we detail the proposed methods to classify users as either Personal or Fake. First, Foursquare users’ data is abstracted into four features identifying human behavior from the Foursquare and Twitter datasets (summarized in Table 2). Second, users are labeled as Fake or Personal via crowd-sourcing. Finally, different supervised and unsupervised algorithms are proposed for the classification task.

Spatial transitions over 300 km/h are considered Fake behavior.
As detailed in Section 3, we combine Foursquare and Twitter data for all the users in the dataset. To homogenize the dataset with regard to user information (for further analysis and processing) we transform the check-in events per user to a set of unique features. These features combine user’s activity to single values, summarizing relevant information for the Personal/Fake user analysis. The goal is to represent all possible Fake users behaviors in the system, including different activity, spatio-temporal and keyword patterns:
L1: Humans sleep on average about 8 hours. During sleeping time the user is not reporting their activity in LBSNs. This is a basic feature of Personal users with respect to Fake users, which can be constantly reporting activity aided by software intended for this purpose. L1 represents the user’s averaged maximum lapse of time (hours) without check-in events per day. The most irregular users spend almost one week without activity.
L2: Speed between check-ins (derived from the time a user takes to transition from one reported location to another) is limited by the available transportation methods, the plane being the fastest one. L2 represents the rate of the user’s transitions over average plane speed (300 km/h) (cf. Fig. 2). We use this feature to capture users hacking their locations, e.g. for advertising or becoming Mayor in a fraudulent way. A similar concept is used in [6] to report fraud in Foursquare, and in [23] to uncover fake user reviews.
L3: Foursquare is an LBSN where users report their location (or activity) once [26]. Several check-ins reporting the same location or activity at consecutive (or almost consecutive) times are not considered normal Personal user activity. Instead, a Fake user can hack the LBSN API for their own purposes, or can retweet activity, that is, take an original Foursquare message and resend it to other users. L3 represents the number of check-ins reported from 0 to 1 seconds after the previous check-in. This feature does not consider spatial diversity, which is considered in L2.
L4: Represents the frequency count for predefined keywords found in fake accounts: 4sq, foursquare, RT, jumper, broadcasting, buzz, zip code, postcode, followers count:0, and friends count:0 [8]. The rationale is that the more references to these words there are, the higher the probability that the account belongs to a Fake user.
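As an illustration of how the spatio-temporal features L2 and L3 could be computed from a user's check-in history, the following minimal Python sketch may help. It is not the implementation used in this work: the function names, the `(unix_seconds, latitude, longitude)` tuple layout, and the use of great-circle (haversine) distance are assumptions made for the example; only the 300 km/h threshold and the 0 to 1 second window come from the feature definitions above.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0
SPEED_LIMIT_KMH = 300.0  # L2 threshold stated in the text


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi = p2 - p1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def l2_fast_transition_rate(checkins):
    """Fraction of consecutive check-in pairs implying a speed over 300 km/h.

    `checkins` is a time-sorted list of (unix_seconds, lat, lon) tuples
    (hypothetical layout chosen for this sketch).
    """
    pairs = list(zip(checkins, checkins[1:]))
    if not pairs:
        return 0.0
    fast = 0
    for (t1, la1, lo1), (t2, la2, lo2) in pairs:
        hours = (t2 - t1) / 3600.0
        dist = haversine_km(la1, lo1, la2, lo2)
        # A move with no elapsed time counts as faster than any plane.
        if dist > 0 and (hours <= 0 or dist / hours > SPEED_LIMIT_KMH):
            fast += 1
    return fast / len(pairs)


def l3_rapid_repeats(checkins):
    """Number of check-ins reported 0 to 1 seconds after the previous one."""
    return sum(1 for a, b in zip(checkins, checkins[1:]) if 0 <= b[0] - a[0] <= 1)
```

For example, a user who checks in in Madrid, then in New York one hour later, and again one second after that would yield an L2 rate of 0.5 (one of two transitions is too fast) and an L3 count of 1.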
Labeling users via crowd-sourcing for ground truth
There is no formal ground truth to label LBSN users as Personal or Fake. Hence, data extraction is necessary to evaluate our Personal/Fake user model and methods. To assign these labels we used Amazon Mechanical Turk (MT) [3]. This crowd-sourcing tool provides remote workers who solve tasks such as classifying pictures or checking web links. MT workers were selected from those with a rating above 95% and more than 100 completed tasks. The methodology used to label a user as either Personal or Fake was as follows:
We provided detailed instructions on how to classify a Foursquare user on Personal, Fake, or unclassifiable (if the account was closed by the user) based on their timeline and Foursquare activity.
Each Twitter account is analyzed by three different MT workers resulting in a, b, c, d labeling options (see Table 3).
Each MT worker received 20 twitter account links to label. Two of these were previously labeled accounts by the authors of this work for correctness validation.
After MT labels were assigned, up to 60 random examples per option were selected and manually checked by the authors of this work for correctness validation.
Mechanical Turk labeling results on Foursquare data
Only labels provided by MT workers who correctly labeled the two control users were considered, resulting in 1,445 (89%) out of the total submitted accounts. Note that the probability of correctly classifying a user by chance with this method is under
We checked the degree of coincidence of MT labels with our manual labeling, achieving 96.6% coincidence. Hence, this crowd-sourcing labeling method is a good approximation of manual labeling.
Unlabeled accounts are not considered for the rest of this work. However, we tested on these accounts (taking the majority rule) the accuracy of a logistic regression classifier trained on 60 samples of each label, achieving 86.8% coincidence.
These results lower the uncertainty of the Personal/Fake classification (dependent variable), which nevertheless will always be based on human judgment. Accounts labeled in options a and d from Table 3 are used as ground truth for the remainder of this work.
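The label aggregation described in this subsection can be sketched as follows. This is an illustrative Python sketch rather than the procedure's actual code: it assumes that the options kept as ground truth correspond to the cases where all three workers agree, and the function names and data layout (`(worker, account, label)` tuples, per-worker control answers) are hypothetical.

```python
def trusted_workers(control_answers, control_truth):
    """Return the workers who labeled all of their control accounts correctly.

    control_answers: {worker: {control_account: label}}  (hypothetical layout)
    control_truth:   {control_account: label} as labeled by the authors
    """
    return {
        worker
        for worker, answers in control_answers.items()
        if answers and all(control_truth.get(acc) == lab for acc, lab in answers.items())
    }


def unanimous_labels(votes, trusted, n_required=3):
    """Keep accounts for which n_required trusted workers gave the same label.

    votes: iterable of (worker, account, label) tuples.
    Returns {account: label}; accounts with mixed or missing votes are dropped.
    """
    per_account = {}
    for worker, account, label in votes:
        if worker in trusted:
            per_account.setdefault(account, []).append(label)
    return {
        account: labels[0]
        for account, labels in per_account.items()
        if len(labels) == n_required and len(set(labels)) == 1
    }
```

Dropping any account without three agreeing votes from trusted workers trades sample size for label reliability, which matches the intent of using only the agreed options as ground truth.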
Different algorithms are tested to train our system in the task of user classification based on the aforementioned features. On the one hand, we test well-known supervised learning algorithms using Weka [20], based on a two-class classification into Personal and Fake users. Logistic Regression (due to its simplicity and computational inexpensiveness) and Support Vector Machines (due to their flexibility and parametrization capabilities) are used in this work. We have also evaluated other supervised learning algorithms such as Naïve Bayes, Random Forest, J48 and PART, but their performance was inferior to that of the supervised learning algorithms we present.
Besides supervised learning algorithms, we also tested unsupervised algorithms like k-means clustering (also due to its simplicity), deterministic thresholds and a baseline, with different numeric parameters, or a selection of them (cf. Table 2). Next, we summarize the algorithms tested in this work. We refer readers interested in further information on the algorithms used to [20] and [5].
Standard Logistic Regression (LR): is a widely used learning algorithm due to its computational inexpensiveness at running time. LR algorithms can be used in scenarios where binary states or classes are differentiated. After training, we used LR with the function
Two-class Classification-Support Vector Machine (SVM): The aim of SVM is to find the best classification function to distinguish between members of the two classes using hyperplane separation, and maximal margin between the two classes. Radial kernel is used with epsilon value of 0.1 in the loss function after grid search procedure [33]. Classification classes are weighted according to the distribution found in our dataset. In this work we used the LibSVM library [9]. For the remainder of this work we refer to two-class classification-Support Vector Machine as SVM.
k-means clustering (KM):
Heuristic Rule (HR): Thresholds for L1, L2, L3 and L4 features are defined by applying the central limit theorem [15] to a small subset of 131 users, including 61 Fake users and 70 Personal users. We extracted thresholds for each feature by finding the means of the sampling distributions with a 90% confidence.
Baseline (BL): Classifies users as Personal with 95% probability and Fake with 5% probability (as distributed in the dataset).
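To make the logistic regression step concrete, the following is a minimal, self-contained Python sketch of a two-class logistic regression trained by batch gradient descent. It is not the Weka implementation used in this work, and it does not reproduce the trained coefficients; the learning rate, number of epochs, 0.5 decision threshold and the assumption that features are scaled to a comparable range are all choices made for the example.

```python
from math import exp


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)


def train_logistic_regression(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss.

    X: list of feature vectors (e.g. scaled L1..L4), y: list of 0/1 labels
    where 1 stands for Fake and 0 for Personal.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x, threshold=0.5):
    """Return 1 (Fake) if the estimated probability reaches the threshold."""
    p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    return 1 if p >= threshold else 0
```

With a toy training set in which Fake users have high values on two scaled features and Personal users have low values, `train_logistic_regression` followed by `predict` separates the two groups; real feature vectors would first need scaling, since the four features have very different ranges.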

Logistic Regression shows better overall results. See Section 4.3 for algorithm descriptions and abbreviations.
We evaluate the accuracy of LBSN users’ classification into Personal and Fake, by using the features and methods detailed in Section 4. We present the main highlights of the classification evaluation, with supervised learning Logistic Regression classification obtaining the best evaluation metrics. We excluded from the present analysis the possibility that Fake users might adapt to classifiers over time. Solving this scenario, e.g. by classifiers incremental retraining or features incremental refinement, is proposed as future work at the end of this document.
Evaluation methodology
We divide our data into a training set and a test set using 10-fold cross validation. Correctly classifying a Fake user is considered a True Positive (TP), and correctly classifying a Personal user is considered a True Negative (TN). Classifying a Personal user as Fake is a False Positive (FP), while the inverse is a False Negative (FN). We use the following metrics to assess the quality of Fake user classification: Precision = TP/(TP + FP), Recall = TP/(TP + FN), F-score = 2 × Precision × Recall/(Precision + Recall), Accuracy = (TP + TN)/All, and False Positive Rate = FP/(FP + TN).
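The metric definitions above translate directly into code. The short Python sketch below is illustrative only; the function name and the convention of encoding Fake as 1 and Personal as 0 are assumptions of the example, and ratios with an empty denominator are reported as 0.0 by choice.

```python
def classification_metrics(y_true, y_pred):
    """Compute the evaluation metrics from 0/1 labels (1 = Fake, 0 = Personal)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Report 0.0 when the denominator is empty instead of raising.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f_score": ratio(2 * precision * recall, precision + recall),
        "accuracy": ratio(tp + tn, len(pairs)),
        "fpr": ratio(fp, fp + tn),
    }
```

For instance, with 4 Fake and 6 Personal users, of which 3 Fake users are detected and 1 Personal user is wrongly flagged, the function returns a precision, recall and F-score of 0.75, an accuracy of 0.8 and a false positive rate of 1/6.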
The different supervised learning algorithms tested in our methodology (cf. Section 4) are evaluated using these metrics, and compared to advanced baselines including unsupervised learning algorithms and customized rule-based methods.
Overall performance results
After iterating the methodology defined in Section 4 the overall results are as follows:
The best performing method is Logistic Regression (LR).
Clustering performs 28% worse compared to LR.
The heuristic rule (HR) performs poorly, 32% below LR.
The best single-feature rule performs 18% worse compared to LR.
The most useful feature is L3 (time equal).
Figure 3 summarizes the Precision, Recall, F-score, and False Positive Rate (FPR) evaluation metrics for the selected supervised learning methods (after feature selection and training), unsupervised clustering (after feature selection), advanced baselines and basic baselines. As depicted in the figure, the best overall metrics (i.e., the best classification method for the analyzed problem) are obtained by Logistic Regression (LR) given all features of our proposed model. The heuristic rule (HR) performs 32% worse than LR. Hence, simple thresholds based on a single feature or a combination of features perform significantly worse, which justifies the use of supervised or unsupervised algorithms to analyze the problem of Fake user detection in LBSNs.
Figure 4 summarizes the learning methods and feature selection overall results for the classification task. All methods show a classification accuracy over 95%, as depicted in Fig. 4(d). With a relatively small sample set our method correctly classifies Personal users, which is an important requirement from a LBSN managing perspective [8]. However, we shall look to other metrics like precision, recall and F-score to evaluate the Fake users classification.

All methods show a classification accuracy over 95%. Logistic Regression (using all features) and SVM (without L1) are the learning methods showing the best overall results.

Despite the unbalanced number of positive and negative examples in the tested dataset, the bias in the outcomes of our experiments is not significant, based on our proposed model.
Precision, depicted in Fig. 4(a), is the rate of TP examples among all examples detected as positive. Hence, good precision means few Personal users have been classified as Fake. As stated before, this is important since an LBSN company does not want to delete or close the accounts of Personal users with no reason to do so [8]. Our methods show 100% precision. Recall, depicted in Fig. 4(b), is the rate of TP examples among all positive examples, and is close to 80% for the supervised algorithms given a specific feature set. F-score, depicted in Fig. 4(c), is the harmonic mean of precision and recall. LR shows the best F-score of all tested algorithms (F-score = 87.6%), while a more complex method like SVM performs 2% worse in its best configuration (if the L1 feature is ignored, and positive and negative examples are weighted accordingly).
Focusing on the learning algorithms, a feature selection analysis gives meaningful insights about the features and model utilization. That is, what are the relevant parameters helping to classify Fake users in a LBSN. Specifically, we compare the algorithms using all defined features (L1–4), ignoring one feature at a time (L2–4, L1,3–4, L1–2,4, L1–3, and L1–4) to determine the effectiveness of each feature for the proposed method.
The effectiveness of each feature in the system is summarized in Fig. 4. Based on the results, excluding the L3 feature worsens all metrics for the supervised learning algorithms, followed by L2, and then L4 and L1 to a lesser extent. Thus, L3 and L2, the features representing the spatio-temporal characteristics of the dataset (cf. Table 2), proved to be the most relevant ones in the analysis. The relevance of the L2 and L3 features can also be observed in the LR coefficients (cf. Section 4.3), where L2 has a higher coefficient due to its range of values compared to L3. Nevertheless, the best overall results are obtained considering all features for LR and SVM, while KM performs better when L4 is not used for clustering. This difference is explained by the robustness of the supervised methods, which are able to learn from each of the provided features. In contrast, k-means (the unsupervised approach), being based on Euclidean distance, commonly returns poor clustering performance when the number of features or dimensions increases.
Summing up, L3 is the most useful feature in the Fake user classification task, but recall that it is not enough by itself to classify different Fake user patterns with good metrics, as shown in Figure 3. Hence, using the features together with supervised learning algorithms yields better classification metrics. Based on this analysis, LR fits the Fake user classification problem well. We have not detected in our dataset any significant issues, such as multi-collinearity or correlated error terms of the dependent variable, that would prevent our method from achieving good performance. Moreover, LR offers good scalability (since it has low computational complexity [38]), and thus we consider it unnecessary to explore more complex methods in the present analysis, although this is considered for future work.
Model sensitivity to dataset examples size
As introduced in Section 1, about 8% of users in online social networks such as Facebook are Fake users [27], which is consistent with our dataset, where 6.5% of users have been identified as Fake. This imbalance makes it difficult to populate a balanced Fake-Personal model and could therefore bias the results when analyzing the dataset. Figure 5 depicts the Precision, Recall, F-score and FPR evaluation metrics using the selected classification algorithm (LR) for 77 positive examples and different numbers of negative examples (a), and for different numbers of positive examples and 1,185 negative examples (b).
Our proposed model is quite stable regardless of the number of positive and negative examples in the dataset using Logistic Regression classification, as shown in Fig. 5. For instance, the FPR is highly related to the number of negative examples, but as shown in Fig. 5(a) the FPR varies by just 1% even when the number of negative examples increases by a factor of 10. Likewise, the F-score changes by less than 10% when the number of positive examples is increased, as shown in Fig. 5(b), and by ∼1% from 40 positive examples on.
Hence, the unbalanced number of positive and negative examples is not adding any relevant bias in the outcomes of our experiments, based on our proposed model.
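For reference, the four metrics tracked in Fig. 5 can be computed from confusion-matrix counts as in the sketch below. The function name `classification_metrics` and the counts in the example are hypothetical; only the class sizes (77 positive and 1,185 negative examples) come from the text.

```python
def classification_metrics(tp, fp, tn, fn):
    """Precision, recall, F-score and false positive rate from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if (precision + recall) else 0.0)
    # FPR depends only on the negative (Personal) examples.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"precision": precision, "recall": recall,
            "f_score": f_score, "fpr": fpr}

# Hypothetical counts for 77 positive (Fake) and 1,185 negative (Personal) examples.
metrics = classification_metrics(tp=70, fp=12, tn=1173, fn=7)
```

Because recall is computed only over positive examples and FPR only over negative ones, these two metrics are not directly affected by the class ratio, whereas precision and F-score mix both classes.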
Application to industrial scenarios
There is an obvious business interest in filtering out Fake accounts, since these accounts decrease the commercial value of the service and annoy real users. Improving advertising click-through rates, return on investment (ROI) or other business-driven metrics are some of the goals of LBSN-based companies. In this section we explore business impact evaluation metrics based on estimations from the user classification, the dataset features, and the contextual prediction system of Zhang et al. [38].
Catching spammers early
Through a specific API provided by Twitter/Foursquare, we were able to verify that 27% of the Fake users in our dataset had been suspended (cf. Section 4.1). These Fake users were probably suspended due to (heavy) spamming, as observed in [17]. We evaluated these accounts with our user model and 100% of them were correctly flagged, which also supports the correctness of the crowd-sourced labeling process (cf. Section 4.2). Not only is our approach effective in catching these spammers, it can also catch them early, before they have posted many messages and caused considerable social disruption. This is because the features we use in our approach are not history-dependent by definition, so we can correctly label a user once we have a few check-in data points and some profile information.
Improving predictive analytics accuracy
Zhang et al. proposed in [38] a location-based personalized contextual prediction system using the Foursquare dataset. This technique reaches an accuracy of 76% and produces 10× fewer false alarms than naïve methods. In the present work we use the same prediction system over the same Foursquare dataset, but with Fake users filtered out.
Table 4 summarizes the performance metrics of the location-based personalized contextual prediction system using the Foursquare dataset, and the same dataset without Fake users. The Equal Error Rate (EER) is the error rate at the operating point where the fraction of false positives equals the fraction of false negatives. The results show that with proper training, that is, after filtering out Fake users, the EER improves by 1.4 percentage points, which corresponds to a relative improvement of 4.4%. Similarly, the area under the ROC curve (AUC) is 2% higher after filtering out Fake users.
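The EER definition given above can be made concrete with a short sketch. It is an approximation under stated assumptions: the function name `equal_error_rate`, the threshold sweep over observed scores, and the example scores and labels are all hypothetical and are not taken from the prediction system in [38].

```python
def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping thresholds over the observed scores.

    scores: predicted confidence that an example is positive.
    labels: 1 for positive examples, 0 for negative examples.
    Returns the mean of FPR and FNR at the threshold where they are closest.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best_gap, eer = None, None
    for threshold in sorted(set(scores)):
        fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 0)
        fn = sum(1 for s, l in zip(scores, labels) if s < threshold and l == 1)
        fpr, fnr = fp / negatives, fn / positives
        gap = abs(fpr - fnr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (fpr + fnr) / 2
    return eer

# Hypothetical scores and ground-truth labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
eer = equal_error_rate(scores, labels)
```

With a finite set of scores the two error rates may never be exactly equal, so this sketch reports their average at the closest point; interpolating along the ROC curve is a common alternative.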
Filtering out Fake users increases AUC and decreases EER

Rate of coupons delivered to Fake users for Shop and Food venues (Tier 1, Foursquare categorization), applied to the Foursquare dataset.
Companies using prediction systems to deliver promotion codes or coupons may sort their predictions by confidence and target the top-ranked users, since the most confident predictions are the most likely to be correct. These are the users most likely to buy the product (when a company has a limited number of coupons to deliver) or to provide better feedback when testing a beta (since a product at the beta stage may only allow a limited number of testers).
Using the same activity prediction system [38], Fig. 6 shows, for a given number of deliverable codes k, the rate of promotion codes or coupons that go to Fake users and are thus lost. As can be seen, predictions for Fake users are disproportionately more confident than those for Personal users. This is because Fake users tend to report check-ins focused on specific activities or locations, with a high degree of consistency. For example, for the top 1,000 predictions, about 35% of coupons would be delivered to Fake users, despite the fact that only 6.5% of users in the tested dataset are Fake.
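The quantity plotted in Fig. 6 can be sketched as the fraction of Fake users among the k most confident predictions. The function name `fake_rate_at_k` and the example predictions are hypothetical; they only illustrate how over-confident predictions for Fake users inflate the loss rate at small k.

```python
def fake_rate_at_k(predictions, k):
    """Fraction of the k most confident predictions that belong to Fake users.

    predictions: list of (confidence, is_fake) pairs, one per predicted user.
    """
    top = sorted(predictions, key=lambda p: p[0], reverse=True)[:k]
    if not top:
        return 0.0
    return sum(1 for _, is_fake in top if is_fake) / len(top)

# Hypothetical predictions: Fake users are few but cluster at high confidence.
predictions = [
    (0.99, True), (0.97, True), (0.95, False), (0.90, True), (0.85, False),
    (0.80, False), (0.70, False), (0.60, False), (0.50, False), (0.40, False),
]
rate_top4 = fake_rate_at_k(predictions, 4)
rate_all = fake_rate_at_k(predictions, len(predictions))
```

Comparing the rate at a small k with the rate over all predictions gives the kind of gap described in the text, where the share of Fake users among the top predictions far exceeds their share of the dataset.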
Each LBSN user has a cost in terms of data storage, power or other computing resources. Fake users not only worsen the user models in the dataset, but also have a relevant maintenance cost, since they are usually active. In the Foursquare dataset used in this work, Fake users perform about 2.5× more check-ins than Personal users. This number results from our dataset of users with more than 500 check-ins, which itself contains fairly productive users. The productiveness of Fake users also implies a considerable increase in the disk space needed to store their activity. In our dataset, Fake users take up to 60% of disk space. These check-ins are useless or even harmful to the applications within the LBSN, yet they require CPU cycles to process and network bandwidth to receive (e.g. from a user’s Foursquare app) and distribute (to other users).
Besides simple resource wasting, this might cause other undesirable effects. For instance, people may see that the most active users are bots and decide not to participate in the service. Similarly, a system that recommends users with similar interests to follow will tend to pick the most active users, and a disproportionately large fraction of them are bots.
Discussion
In this work, we presented an empirical approach for fake user detection in location-based social networks. Our method relies on the following steps: data mining, crowd-sourced labeling, and supervised machine learning. Next, we discuss each of the proposed steps and compare them with related work.
The data mining procedure consisted of listening to public broadcasts of Foursquare check-ins on Twitter (cf. Section 3) to build a database of check-in events over time. Besides the ease of retrieving data through Twitter’s API, combining two social networks allowed us to obtain a richer set of user data. That is, not only spatio-temporal features but also information linked to each user’s account helped us to improve the classification metrics (cf. Section 4.1). Among the related work, only Aggrawal et al. [1] used Twitter for Foursquare data collection and feature extraction (although their goal is to detect different spam tipping behaviors). Zang et al. combined Yelp and the Chinese consumer review site Dianping [39]. Nevertheless, this method also has drawbacks. Besides the complexity of combining different LBSNs, we may also face sampling problems, since not all Foursquare users publicly broadcast their check-ins. However, the analyzed related work that directly crawls Foursquare does not report actual metrics on fake user detection (cf. Table 1).
Large machine learning experiments face the labeling problem, that is, how to reliably label samples to train the system with a ground truth. On the one hand, actual ground truth can only be provided by human judgment. On the other hand, in a real LBSN manual labeling would require a large staff and considerable resources. We adopted a crowd-sourced approach based on aggregated human judgment, where experienced remote workers of Amazon MT labeled the accounts. Redundancy and crosschecking measures were adopted to reduce the labeling uncertainty (cf. Section 4.2). In the related work, only Li et al. [23] and Xuan et al. [36] used crowd-sourced labeling to analyze Dianping.
Finally, we evaluated a number of supervised and unsupervised machine learning techniques, as well as feature selection and model sensitivity to dataset size. Multivariate LR returned the best classification metrics (cf. Section 5). We also demonstrated that the unbalanced number of real and fake accounts does not add any relevant bias to our results. Although our results outperform similar related work ([1,8,36]), we also chose LR for its simplicity and efficiency in binary classification, while classification into more than two groups may be improved using other methods like kNN or random forest, as in Aggrawal et al. [1].
It is worth mentioning that the differences in classification metrics are not significant (less than 6%), especially given that the datasets are different. Moreover, fake user behavior may change in the future, making other methods more suitable. Nevertheless, compared to the related work, our proposal is the only end-to-end approach covering the three aforementioned steps (automatic mining, labeling and evaluation). We also contributed the first quantification of the value of correctly detecting fake users or accounts, by providing metrics based on real-world examples (cf. Section 6).
Conclusion
Location-based Social Networks (LBSNs) are currently a target of fake users, that is, users or accounts not behaving in a human manner. The presence of Fake users is detrimental to LBSN ecosystems and their value in intelligent environments. This paper presents a first approach for detecting fake users in LBSNs. We have shown that our approach is highly accurate in filtering out these users and leads to notable improvements in the quality of several LBSN services and use cases. We believe that our methods can be applied not only to Foursquare or Twitter, but also to other similar LBSNs.
To solve the Fake user detection problem, we have defined a method to parametrize Foursquare users (combined with their Twitter accounts) into different features that capture different Fake user behaviors, and we have checked these features using a crowd-sourcing tool. Then, we have tested different supervised learning algorithms and feature combinations looking for the best classification metrics. A Logistic Regression algorithm has been selected based on the evaluation metrics.
Building on our classification method, we have demonstrated the value of Fake user detection with business-related metrics. For instance, experiments with a prediction system using the clean dataset (after filtering out Fake users) show an improvement in prediction accuracy. Further experiments show the suitability of our method for classifying spammer accounts, and for filtering out Fake users to improve cost/gain and reduce resource consumption. Our experience with this work has suggested a number of ideas for future work:
Feature extraction improvement: Current features reflect personal characteristics of the users, helping to detect non-human users. Improving the construction of these features would help to increase the accuracy of our method.
Incremental adaptation of classifiers: Fake users might adapt to classifiers, especially if they are bots or software agents trying to gain some benefit from LBSNs. Incremental adaptation or retraining of classifiers, feature refinement and game theory techniques may improve Fake user detection methods.
Scalability: Experiments in this work have been performed using a subset of our Foursquare dataset. As future work, we want to increase the dataset size of our experiments to test the scalability of our proposal.
Generalization to other Location-based Social Networks: Unlike other LBSNs, Foursquare does not provide arbitrary GPS locations but semantic locations linked to a position. Hence, similar LBSNs would benefit from our proposed method. Moreover, given the location features included in general social networks (e.g. geolocation of pictures), it would be interesting to adapt our method to generic LBSNs. By doing so, we could add a much richer set of features, e.g. using Natural Language Processing to extract user prosodic markers.
Privacy: Accessing LBSN accounts for user modeling can become a privacy threat in scenarios where the data of a few users model all of the system’s activity. Sampling techniques can reduce this threat by diluting each user’s input to the system.
Acknowledgements
This work is partially supported by the Spanish Ministry of Economy and the FEDER regional development fund under the projects SINERGIA (TEC2015-71303-R), SMARTGLACIS (TIN2014-57364-C2-2-R), and Obra Social “la Caixa”-ACUP through project 2011ACUP00261.
