Abstract
To take advantage of the full range of services that online social networks (OSNs) offer, people commonly open several accounts on diverse OSNs where they leave lots of different types of profile information. The integration of these pieces of information from various sources can be achieved by identifying individuals across social networks. In this article, we address the problem of user identification by treating it as a classification task. Relying on common public attributes available through the official application programming interface (API) of social networks, we propose different methods for building negative instances that go beyond usual random selection so as to investigate the effectiveness of each method in training the classifier. Two test sets with different levels of discrimination are set up to evaluate the robustness of our different classifiers. The effectiveness of the approach is measured in real conditions by matching profiles gathered from Google+, Facebook and Twitter.
1. Introduction
Today, people spend part of their social life on the web, creating a virtual environment where they can find and meet friends, share and create information and be engaged in a variety of social activities. The ability to gather the public traces left during these online activities would lead to a deeper understanding of the user’s identity and behaviour. This would improve service provisioning, enable service customisation and cross-domain recommendation and give rise to other services in different domains.
Even though we are dealing with information that users have explicitly designated as ‘public’, chasing them throughout different social networks is still a challenging research task. In fact, not all online services force users to specify their real identity, nor do they adopt platform-specific user’s data (profile fields, friendships and tags), all which makes the field-matching difficult. In this scenario, the identification of the same user across multiple social platforms would represent a key step, for it would facilitate the data gathering process.
How might the identification of a user be successful in practice when considering different social platforms? The answer heavily depends on the available data on the target OSNs. Accessing this kind of public data is not a straightforward task and gives rise to privacy concerns. To overcome these privacy issues, our proposed approach for identifying users across Google+/Facebook (GF) and Google+/Twitter (GT) relies on public data made available by their application program interfaces (APIs). Among all fields returned by the APIs, we consider common fields only. This leads us to adopt a dynamic set of identification properties which flexibly adapts according to the systems under study, as opposed to the adoption of a predefined set like commonly assumed in literature [1].
In this article, we propose an approach aimed at finding a match among identities across online social networks (OSNs) based on minimum common profile fields available through the APIs. We select the most effective features based on the common fields and then use automated classifiers to match the input profiles. According to the literature, in carrying out the identification task, random construction is the only way to construct negative instances, that is, pairs of profiles corresponding to two different identities. The question which arises is as follows: are random negative instances representative of the population with respect to training the classifier? To answer this question, we construct negative instances in three different ways and evaluate our trained classifiers on two different test sets. The accuracy and robustness of the approach are evaluated on two novel datasets collected from Google+, Facebook and Twitter. They consist, respectively, of 8000 GF users and 2400 GT users. 1 Moreover, we analyse the applicability of the learnt classifiers in a real scenario built on the GF users’ neighbourhoods. By comparing the proposed method with the approaches in Vosecky et al. [2] and Carmagnola and Cena [1], we show that our method is able to identify users with a significantly higher degree of accuracy.
2. Related work
Several database research studies and information retrieval communities have focused efforts on matching entities across different data sources [3–7]. While addressing similar problems, they do not match accounts directly across social networks.
In this section, we summarise the studies related to the identification of individuals across different social media sites. We divide these studies into three different categories according to the information they rely on: username-based identification, profile-based identification and network- and content-based identification.
2.1. Username-based identification
Solutions based on username start from the assumption that the username is the minimum common factor available on several OSNs. Consequently, the different methods rely exclusively on features extracted from the strings composing usernames. Zafarani and Liu [8] proposed a methodology based on users’ behavioural patterns when selecting their usernames. They demonstrate that the environment, the personality and the human limitations result in choices of the username that are not random but have redundancy. Identification features are constructed on endogenous and exogenous factors and on patterns due to human limitations. Using logistic regression, they obtained 0.930 and 0.927 of accuracy in the identification of the same user, before and after a feature selection step. To the authors’ knowledge, Zafarani and Liu’s [8] work is one of the most complete research studies on this topic since their proposed features cover most aspects of username creation.
The same authors [9] introduced a simple but interesting approach for finding other possible usernames that a user may select when he or she is registering on a social media. They empirically provided evidence on the possibility of identifying corresponding identities across various communities using usernames and a search engine. The approach starts by searching for a given username on Google to find a set of keywords that might represent possible usernames in the target social networks. Then this set is extended by adding/removing common prefixes and suffixes to/from its members. Finally, in order to filter out usernames, the existence of each username in the set is checked through the URLs that reside in the target community domain by searching on Google. An accuracy of 0.66 is reported. Since the accuracy of the approach depends heavily on the set of candidate usernames, its construction presents several challenges of its own. Furthermore, the authors rely only on Google search results to determine the correctness of the identification.
Perito et al. [10] as well have explored the possibility of linking users’ profiles simply by looking at their usernames. To link profiles that correspond to the same identity, they estimate the uniqueness of a username by exploiting both a language model theory and Markov-chain techniques. For each username, the learnt binary classifier checks all possible usernames in a list for similarities. This makes the approach hard to use on a large scale.
Since the username is available in all social sites, username-based identification approaches do not cope with challenges associated with the public information of users (e.g. heterogeneousness, incompleteness and falseness). However, the exclusive usage of features extracted from usernames results in poor performances when two or more people have the same name or when users differentiate their usernames due to matters of privacy and so on.
2.2. Profile-based identification
Carmagnola and Cena [1] addressed the subject of user identification for cross-system personalisation among user-adaptive systems. They performed some analyses for defining a set of identification properties, the importance factors of each property and the relative thresholds. The proposed algorithm, which compares user profile attributes based on the assigned importance factors, is applied to users belonging to three user-adaptive systems developed in their research group. In the small test they used, 64 cases were positive matching (‘identified’) while 16 were negative (‘not identified’). The results of the algorithm were, respectively, ‘identified’ in 59 out of 64 cases and ‘not identified’ in 14 out of 16 cases.
Vosecky et al. [2] proposed a similar threshold-based approach for comparing profile fields. They defined a weightingvector to control the influence of each profile attribute on the overall similarity. To compare user profile attributes they used exact, partial and fuzzy string matching and achieved an accuracy of 0.83.
Motoyama and Varghese [11] discussed a method for searching and matching individuals on Facebook and MySpace using profiles attributes. Their proposed method considers attributes as bags of words and calculates the similarity between two accounts as the number of common words between profile attributes. It does not account for common entities that have slightly different names.
2.3. Network- and content-based identification
Iofciu et al. [12] proposed an approach for user identification based on usernames and tags that users assigned to images and bookmarks. They also suggested various strategies for comparing the profiles of two users and were able to achieve an accuracy of about 0.6. Since matching accuracy via tags mainly depends on the number of tags assigned by a user, the accuracy of their identification rose from about 0.6 to about 0.8 by aggregating users’ profiles from different sources.
Peled et al. [13] introduced an algorithm based on machine learning techniques to match two users’ profiles from two different OSNs. The classifiers utilised three types of features: name-based features, general user info based features and topological-based features including the number of mutual friends and mutual friends of friends of two users. Since the computation of the number of mutual friends strictly depends on the identification task, they just count the number of friends with identical names in both circles of friends. Their evaluation of the contribution of each feature in the classification process shows that the name-based features are the most important.
Goga et al. [14] conducted an investigation of the reliability of matching user’s profiles across real-world OSNs. They proposed a framework to understand how profile attributes used in the matching schemes affect the overall reliability. They used only public attributes, such as names, usernames, location, photos and friends. These public attributes are not necessarily available through the APIs of the OSNs.
Jain et al. [15] proposed two identity search algorithms based on content and network attributes. They improved the traditional identity search algorithm founded on the attributes of the user profile. An average precision of 0.83 for the identity resolution process using profile, network and content identity search methods and an image-based identity matching method is obtained.
Malhotra et al. [16] applied automated classifiers for identifying profiles belonging to the same user across social networks. They used profile attributes and connections of users on different social networks to generate the digital footprints of the users. Their study indicates that user identification yields a large number of false positives. Some other studies also leverage user connections to match accounts on social networks [17–20].
The approaches in this category rely on data not retrievable by the APIs, so making them difficult to apply. For instance, Google+ API does not expose any resource to get the in/out neighbours, while Facebook requires the user’s permission to access his or her friends.
Our approach belongs to the profile-based category since it exploits fields common to the social media, in addition to the features extracted from the usernames. In section ‘Evaluation’, we compare our method with two of the profile-based approaches. Moreover, in all the approaches based on classification techniques, the only way to construct negative instances is random selection; we show, however, that random construction in real scenarios yields a high rate of false positives. To overcome these limitations, in this article we propose different methods for constructing negative instances. We do so by acting on the level of similarity. Finally, we investigate the effectiveness of each approach for identifying a user across GF or GT.
3. Dataset
On most social networks, users complete the basic information associated with their profiles by inserting links to other external web resources. Most of the time, these links refer to the different accounts users adopt in other social network sites. This behaviour makes possible the collection of the accounts associated with an individual person.
Operationally, we began by gathering profile information from Google+; in particular we exploit the technique by Gonzalez et al. [21] to retrieve Google+ accounts. From the Google+ sitemap file, we randomly extracted more than 1 million user profiles. We analysed the links referred to on each profile page, although only 2% of those users had indicated useful links. Although we limited our attention to links pointing to Facebook and Twitter, the method can easily include other social platforms. Depending on the social site a link pointed to, we retrieved the profile information through the specific API, namely, Facebook Graph API for Facebook and Twitter API for Twitter. The entire gathering process is depicted in Figure 1, where, as a last step, we select profiles containing Latin1 characters only. Finally, the different profile information associated with the same Google+ user are stored in the ‘Profile DB’, so that each record contains a sequence of profiles Ps, where s denotes the name of the social network.

Data collection process. Once the links to other profiles have been obtained, the ‘URL Validator’ verifies whether the links are still valid and point to an account through the APIs. The ‘Profile Extractor’ gets the public information from each social site, while the ‘Latin1 Charset Filter’ detects the profiles containing Latin1 characters only. Each entry in the ‘Profile DB’ contains the profiles associated with each valid Google+ profile.
Unfortunately, profiles in ‘Profile DB’ cannot be directly analysed due to two main weaknesses: (1) the data returned by each API are different from one another, in terms of field name and semantics (structural and semantic heterogeneity); (2) some fields are empty or unavailable since users may not fill some common fields, may change a setting or may make some info private (data incompleteness). Both issues were analysed. Results are reported in Figure 2. Regarding structural and the semantic heterogeneity, in Figure 2 we report – on the y-axis – all the fields returned by each social network API. Although the number and the meaning of the fields are heterogeneous across the social sites, for each pair of social sites we are able to identify some common fields that mainly involve data about username. The available common fields between Google+ and Facebook include username, 2 first name, last name and gender. By contrast, in Google+ and Twitter profiles, common fields are username, 3 full name, location and description.

Percentage of missing values in the fields of each social site.
To deal with data incompleteness, in Figure 2, we report the percentage of empty or missing fields grouped by social site. We observe that fields related to the name of the user – such as last, first and full name – contain information that is valid for most of the profiles in each social site. Second, even though the Google+ API offers more public fields than other social sites, its users tend neither to fill these fields nor keep them private. By contrast, on Twitter, the same users are more prone to share information about where they live and what they are like.
We address the above issues by applying a data cleansing phase which removes the profiles with incomplete or missing data in the common fields. The process generates two datasets that we will use for building and testing the different user-identification solutions:
GF dataset. GF dataset contains pairs of Google+ and Facebook profiles with common fields properly set. Initially, the number of Google+ profiles linked to a valid Facebook account amounted to 14,000; however, after the cleansing phase, we discarded pairs (missing values included) and thus obtained 8000 valid ones.
GT dataset. As with the GF dataset, this collection contains valid pairs of profile from Google+ and Twitter. Initially, it consisted of 8600 GT profile pairs, but after discarding pairs (missing values included) we ended up with 2400 valid ones.
3.1. Username composition
We exploit the users’ profiles to investigate how people in Google+, Facebook and Twitter use their real names to compose usernames. Our analyses are based on the first, last and full names provided by each user in his or her profiles. These analyses are helpful for selecting the most efficient identification features, in particular they help find redundant fields.
We identify four elements which contribute to the composition of the username: the first name, the last name and the full name, denoted as FirstN, LastN and FullN, respectively, while the fourth element (FullNnoSpace) is obtained by eliminating spaces from the full name. Then, we verify whether one of the above elements or their concatenations is/are substrings of the original username. Results are shown in Table 1.
Username composition. Basic substrings and their main concatenations contained in the username
One of the most remarkable results is that 100% of Google+ usernames contain the first name. We also observe that more than 45% of Facebook usernames contain no part of the real name (i.e. first name and last name). As expected, the most common way to generate Google+ usernames is by adding a space between the first name and the last name (97.91%). Since the Twitter API just returns the full name of users, the analyses on Twitter usernames do not include first name and last name. In our dataset, 29% of Twitter usernames contain their full names. The results show that Google+ is considered as a more formal platform by users. After these empirical observations, in section ‘Feature extraction’, we select the most appropriate features.
4. Methodology
In this section, we present the methodology for solving the identification of the same identity across different social media. First, we formalise the identification problem as a classification task. Then, we illustrate how we build the feature set. We show that the construction of the negative instances in the generation of the training set plays an important role. In fact, we stress that the random construction of negative pairs results in a classifier which in a real application incorrectly assigns a lot of profiles to a single user. To reduce the number of false positives, we select negative instances by different methods. Finally, we propose different ways to construct training and test sets on top of which we evaluated different learning algorithms.
4.1. Problem definition
People leave plenty of various profile data and information on diverse OSNs. These fingerprints may be exploited to identify individuals across social networks and to integrate these sources to get an all-around vision of the users. However, the amount of publicly available information about users’ profiles is limited by the privacy setting and the APIs released by social media. Despite the limited availability of information, a common set of attributes exists for each pair of social sites. Usernames always represent the minimum common factor, but most of the time, gender, location and/or a short description of the person are made public by the APIs. Our methodology relies on these common attributes, which are available through social networks APIs, to identify users across OSNs.
In the user identification, we ask whether given two profiles Ps1 and Ps2 from two different social sites s1 and s2, we can specify whether or not they belong to the same individual. This corresponds to learn an identification function f(Ps1, Ps2) such that
where the function f (.,.) depends on the set of common attributes between Ps1 and Ps2. An identification function can be learned using a supervised learning technique that employs the selected features. As required by the supervised approach, the labelled data are based on the datasets in ‘Profile DB’ (see section ‘Dataset’), while the learning algorithms are the most adopted ones.
4.2. Feature extraction
As in most of the learning frameworks, the choice of the appropriate features impacts the performance of the identification. Features are used as measures to indicate whether two profiles on different OSNs are similar in terms of behavioural patterns in the construction of usernames, in the composition of the short description and in the definition of the location. According to the available common profile fields through the APIs, most of the fields are string, so we adopt the main measures for string fuzzy matching. Specifically, we use the following metrics for comparing profile fields:
Exact match (EM). We use a string comparison function to check the equality of data fields. We expect that this feature will be important for the identification; in fact, Zafarani and Liu [9] have shown that individuals tend to choose the same username.
Longest common substring (LCS). Since adding prefix or suffix to a username, name and so on is a common behaviour across social networks, it can be detected using LCS. We normalise this measure by dividing it by the average length of the two original strings to get a value in [0, 1].
Longest common sub-sequence (LCSS). We use the normalised LCSS for detecting abbreviations.
Levenshtein distance (LD). The Levenshtein algorithm (also called Edit-Distance) calculates the minimum number of edit operations that are necessary to modify one string to obtain another string [22]. It accounts for swapped letters, name shortening or the automatic composition of the username; for example, the username suggested by Google+ or Facebook is a concatenation of the first name, a dot and the last name.
Jaccard similarity (JS). To compute alphabet overlaps, we use JS, defined as the size of the intersection divided by the size of the union of the sample sets.
Cosine similarity with tf-idf weights. The cosine similarity between two documents measures their similarity in terms of the angle between their representations in the Vector Space Model, where for each term in the document set, we compute its tf-idf.
We employ EM, LCS, LCSS, LD and JS to compare name-based fields; otherwise, EM, LCS and JS are computed to evaluate the similarity of the ‘location’ field. The ‘description’ fields that users provide about themselves are first tokenised by removing punctuations and stop words; then we compute the cosine similarity between the two token sets. Finally, the EM is applied to compare the ‘gender’ field.
Figure 3 shows the distribution of some features for profiles pairs belonging to same identity. In the figure we also report the distribution of corresponding features for random pairs that do not belong to the same individual. These analyses are based on the GF dataset.

(a) LCSS distribution of first name pairs, (b) LCS distribution of last name pairs and (c) JS distribution of username pairs.
4.3. Training and test sets
Once the feature set is designed, the learning framework provides for the construction of training and test datasets to learn the identification function and then evaluate its performance in approximating the real one.
The datasets GF and GT only contain positive instances, that is, pairs of profiles corresponding to the same identity. Yet we need negative instances to train the different binary classifiers. In the literature, random construction is the means of choice for constructing negative instances. It includes creating negative instances by randomly producing pairs (
We applied multilayer perceptron and random forest learning algorithms to perform the classification task. The latter also assigns a probability to the instance to be evaluated, that is, the probability that a given pair of profiles in different social sites refers to the same identity. We compared the learning techniques through the standard measures: accuracy, precision, recall and F-measure. Finally, the positive instances were randomly split into training/test sets with a 70:30 ratio, and then we inserted the random negative instances. That results in training datasets TrainGF.1 and TrainGT.1 and in test datasets TestGF.1 and TestGT.1. The performances reported in Table 3 (TrainGF.1-TestGF.1 and TrainGT.1-TestGT.1 rows) show that both the multilayer perceptron and the random forest get very good performances for the identification task obtaining 0.95–0.96 as the F-measure.
Given the high precision and the low number of false positives, we tested our method in a real scenario. We applied the classifier trained on the training set obtained from the GF dataset to find the overlapping of the neighbourhoods of users in Facebook and Google+. As representative of a general trend, here we focus on a random user who has 199 friends in his or her Facebook network and 112 persons in his or her Google+ network including his or her followers and followings. For each Google+ neighbourhood of the random user, we performed the identification task with respect to all his or her neighbours in Facebook. Namely, we want to discover common friends in Facebook and Google+. In our crawled dataset, for the Facebook neighbourhoods of a random user only username and full name are available, while for his or her Google+ neighbourhoods just usernames are available. We eliminated some features in our training set to fit the current instances fields. Since in this real application we do not have a ground-truth about the profile matching, we used the number of matched profiles for each candidate user as the main criterion for comparing and evaluating our classifier. As shown in Figure 4, most of Google+ friends have been matched with more Facebook friends. Specifically, the average amount of false positives for the classifier trained on TrainGF.1 is 6. In general, we found analogous results for most of the accounts we retrieved.

False positive rate.
By analysing TrainGF.1, we observed that with the exception of fields like ‘gender’ and ‘location’, most of the random negative instances have completely different values for the corresponding fields. For example, in all 5600 randomly constructed negative pairs based on the GF dataset, there were only four pairs with the same first name and one pair with same last name. However, in a real application, the identification algorithm must identify a candidate user in a group of users who usually belong to the same language, culture or region. In general, the candidates show a high similarity/homophily resulting in a significant amount of similar fields. Therefore, applying a classifier trained on a dataset containing only random negative instances may result in a high rate of false positives. Thus, we propose two further methods to construct negative instances by taking into account these issues, as shown in Figure 5.

The architecture of the proposed approach.
In the first method (Train2 in Figure 5), we build negative instances to provide a medium level of difficulty for the discrimination. To do this, 50% of negative instances are constructed randomly. The remaining 50% are set up in a way so that each negative instance has similar values at least in one field. This ensures that the number of negative instances with similar values is the same for each field.
In the second method (Train3 in Figure 5), we construct negative instances to provide a more difficult setting. All negative instances are set up in a way so that each negative instance has similar values at least in one field. This ensures that the number of negative instances with similar values is the same for each field.
We select our required instances from all possible negative instances that can be constructed randomly from positive instances. We also create two different test sets to evaluate the accuracy and robustness of our different classifiers. We separate 30% of positive instances for testing. Our first test set includes these positive instances and the same number of random negative instances. The second test set includes positive and negative instances that are difficult to discriminate, that is, each positive instance has different values in at least one field and each negative instance has similar values in at least one field. Both test sets are balanced.
With respect to the identification between Google+ and Facebook profiles, we build the training datasets denoted as TrainGF.1, TrainGF.2 and TrainGF.3, respectively. All the training sets include 11,200 instances: 5600 negative instances and 5600 positive instances.
Whereas all the negative instances in TrainGF.1 are produced randomly, the negative instances in TrainGF.2 include 2800 (50%) randomly constructed pairs, 1400 (25%) pairs with same ‘first name’ and 1400 (25%) pairs with same ‘last name’. 50% of pairs with same ‘first name’ or ‘last name’ have a same ‘gender’: 1400 pairs. We do not consider similarity in fields like ‘username’, because we found no negative pairs with same ‘username’ in all the possible randomly constructed negative instances. Moreover, personal information (name, gender, etc.) has a main role in selecting usernames by individuals. Therefore, negative instances with similarity in some personal information fields may have a kind of similarity regarding ‘username’. In addition, negative instances in TrainGF.3 contain 2800 (50%) pairs with the same ‘last name’ and 2800 (50%) pairs with the same ‘first name’. Finally, 50% of negative pairs in TrainGF.3 have a same ‘gender’.
As for the construction of the test sets from the GF dataset, the first test set – TestGF.1 – includes 2400 positive instances and 2400 random negative instances. To create the second test set (TestGF.2), we select positive instances that are not equal in ‘first name’, ‘last name’ or ‘gender’. Negative instances include (50%) pairs with the same ‘last name’ and (50%) pairs with the same ‘first name’, while 50% of negative pairs in TestGF.2 have the same ‘gender’.
We apply the same procedure for the identification task between Google+ and Twitter. In this case, 50% of negative instances in TrainGT.2 are built randomly. Finding instances with exactly the same value for the field like ‘Full name’ is not possible since there is no match between the fields reporting first and last name. Therefore the remaining 50% negative instances are built in such a way so as to contain 33.3% pairs similar in the first part of the full name, 33.3% of instances similar in the last part of the full name and 33.3% of instances with same location. Also, 50% of these instances have similarity in the ‘description’ field bigger than the average cosine similarity of ‘description’ field value in all the positive instances. Finally, all negative instances in TrainGT.3 are constructed in a way so that each instance has similar values in at least one field.
To set up the test sets, the first testing set (TestGT.1) includes 50% positive instances and 50% random negative instances, while the second testing set (TestGT.2) includes positive instances that are different in at least one field. All negative instances are built so as to have comparable values in at least one field according to the same method adopted for constructing 50% of negatives instances in TrainGT.2.
5. Evaluation
Before evaluating the training sets on the different test sets, we validate our approach using 10-fold cross-validation. As in the test-bed application, we apply two different learning techniques: multilayer perceptron and random forest. Our aim is to verify whether different learning algorithms can further improve the learning performance. These techniques have different learning biases, and so we expect to observe different performances for the same task. As seen in Table 2, results are not significantly different among these methods. When sufficient information is available in features, the user-identification task is not sensitive to the choice of the learning algorithm.
Results of the evaluation on our datasets using 10-fold cross-validation
Using 10-fold cross-validation, we get reasonably accurate results for both classification techniques on different datasets. We evaluate the effectiveness of each method in training the classifier on two test sets with different levels of discrimination. Table 3 shows the detailed results of applying the different learning techniques on the datasets.
Results of different classification techniques on the datasets
As shown in Table 3, we confirm that the construction of negative instances in a random way is not a robust method. This is due to the fact that performances vary greatly from one test set to the other. The training on TrainGF.1 and TrainGT.1 results in the F-measure of 0.962 and 0.957 on TestGF.1 and TestGT.1 and in a worse F-measure of 0.551 and 0.842 on TestGF.2 and TestGT.2, respectively. TestGF.1 and TestGT.1 contain random negative instances, while TestGF.2 and TestGT.2 include pairs that are difficult to discriminate.
A different behaviour characterises the classifiers trained through the second method (Train2) for building the training set. We observe the same patterns in the evaluation results on both GF and GT datasets. In fact, classifiers trained on datasets built based on the second method, TrainGF.2 and TrainGT.2, get the best F-measure on the test sets and exhibit more robustness, while classifiers trained on TrainGF.1 and TrainGT.1 show a high variation on the different test sets.
We also consider tests on unbalanced datasets. Table 4 shows detailed results of (75/25) and (25/75) unbalanced test sets. Although results of unbalanced test sets are close to results of balanced train and test sets, our second method shows better results also in the unbalanced setting. In fact, the average results of unbalanced test sets including 25% positive instances and 75% negative instances (TestGF.12, TestGF.22, TestGT.12 and TestGT.22) are slightly better than results of balanced tests sets. The results on (90/10) and (10/90) unbalanced test sets are reported in Appendix 1. Again, they reveal the effectiveness of our second method versus the others.
Results of different classification techniques on unbalanced test sets ((75/25) and (25/75))
As shown in Table 3, results are not significantly different between the two learning algorithms. In our experiments, Random forest shows slightly better results in most of the cases; consequently, its results on balanced datasets will be the reference method in the following experiments. Specifically, we analyse (1) how the classifiers, trained with different training sets, perform in the test-bed application and (2) whether our proposed method for user identification outperforms other methods in the literature.
5.1. Finding candidate users
We repeated the test-bed experiment, namely, overlapping the neighbourhoods, adopting the classifiers trained on the different sets. The results reported in Figure 4 show that the classifier trained on TrainGF.2 exhibits the best result, with three matches per neighbour on average, while the classifier trained on TrainGF.3 obtains the worst result, with 21 matches per neighbour on average.
Since the Random forest algorithm gives us the probability of a candidate username belonging to an individual, we can rank the candidates so as to verify whether or not the first positions contain the right profile. Specifically, for each user u in the Google+ neighbourhood, let M be the set of neighbours in Facebook identified by the classifier, and for each node m ∈ M, let Pm be the probability that m is the same individual as u. We selected as a correct matching the m with the highest probability. Both the Google+ and Facebook neighbourhoods were checked manually for overlapping users. Methods for manual checking were as follows: checking for equal names, checking for similar profile pictures and other available information. In 87.75% of the cases, the right profile happened to be in first position using TrainGF.2. This percentage falls to 10% and 82% when using TrainGF.1 and TrainGF.3, respectively. All experiments in this section show that the classifier trained on the dataset constructed through our second method (Train2) exhibits better performances in a real scenario.
5.2. Comparison with existing algorithms
The average performance of the classifiers trained on datasets based on the second method (TrainGF.2 and TrainGT.2) is better than the others. Now we need to compare the performance of the classifiers trained on these datasets with some acceptable approaches in the literature. Thus, we consider Vosecky et al. [2] and Carmagnola and Cena [1] methods for comparison. For implementing these approaches, we have to calculate the required parameters and thresholds of TrainGF.2 and TrainGT.2 and then evaluate the methods on the test sets.
In Carmagnola and Cena [1], each identification property is characterised by three parameters to calculate the importance factor of each property in the identification process: (1) level of uniqueness (UL), which represents how much a property may assume the same value across different users. This property is directly related to the capability of identifying the user; (2) values per user (VpU), representing the possibility for a feature to be provided with different values for a unique user in the systems he or she interacts with and (3) misleading level (ML), expressing the probability, for a feature, to be provided with a false value. VpU and ML are inversely related to the ability to identify the user. We report the values assumed by these three factors in Table 5. After calculating the importance factor of each property using UL, VpU and ML, we use the following formula to combine the importance factors values of all the matching properties
where p, q, m and n represent the importance factor of each identification property whose values match. The IF must exceed the importance factor threshold for the user to be considered as identified user.
Calculated parameters for Carmagnola and Cena’s approach
UL: level of uniqueness; VpU: values per user; ML: misleading level.
Using the parameters and threshold of 0.82 calculated on TrainGF.2, the F-measure of 0.816 and 0.490 is obtained on TestGF.1 and TestGF.2, respectively. Moreover, by calculating the parameters and the threshold of 0.45 on TrainGT.2, the best F-measure of 0.659 and 0.602 is obtained on TestGT.1 and TestGT.2, respectively. The main weakness of the Carmagnola and Cena’s approach is that it uses only the EM as similarity measure. That also results in the lower accuracy on TestGT.1 than on TestGF.1. Moreover, since in our GT dataset almost half of the available fields are not name-based, including ‘location’ and ‘description’, the EM is not a good choice for comparing these fields.
Since TrainGT.2 and TestGT.2 contain negative instances with similarity in the first part or last part of full names, better results are observed on TestGT.2 than TestGF.2. It can be observed that our method outperforms the method of Carmagnola and Cena by 0.111, 0.304, 0.274 and 0.308 on TestGF.1, TestGF.2, TestGT.1 and TestGT.2, respectively.
In Vosecky et al. [2], a similarity vector V is defined as V (Ps1, Ps2) = < v1, v2, …, vn >, such that vi = compi(fi,Ps1, fi,Ps2), where fi,Ps1 is the ith field of profile Ps1, for each vi, 0 ≤ vi ≤ 1, |V| = |Ps1| = |Ps2|. Three categories of field matching are distinguished: exact matching, partial matching and fuzzy matching. A weight vector is defined to control the influence of each profile attribute on the overall similarity. In line with Vosecky’s method, we selected the weights and thresholds on our datasets, as reported in Table 6.
Vosecky’s weight vectors computed on our datasets
By calculating thresholds of 1.15 and 0.45 on TrainGF.2 and TrainGT.2, we observe that the experiments on TestGF.1, TestGF.2, TestGT.1 and TestGT.2 show the best F-measure of 0.852, 0.535, 0.816 and 0.781, respectively. Our proposed method outperforms Vosecky et al.’s on the TestGF.1, TestGF.2, TestGT.1 and TestGT.2 by 0.075, 0.259, 0.117 and 0.129, respectively. As shown in Table 7, Vosecky’s method exhibits better results if compared with Carmagnola and Cena’s approach since the former exploits different similarity scores through a name matching function (VMN).
Performance comparison of the profile-based approaches
These experiments reveal that not only our proposed approach is robust and achieves good results on the different test sets, but also that it outperforms both Vosecky et al.’s and Carmagnola and Cena’s approaches.
5.3. Features importance analysis
Until now, we have proposed different features measuring similarity between two fields. In this section, we analyse how the feature’s importance changes among datasets using Information Gain Ratio.
The most important features of TrainGF.1 are all based on first name. This result shows the flaw of the random selection approach as the number of negative instances with same first name is too small to properly perform the learning task. In TrainGF.2, we constructed enough negative instances with the same first and last names. In this case, features based on other attributes such as username ranked higher on the list, which is a more reasonable result. The most important features of TrainGF.3 are based on username. In TrainGT.1, TrainGT.2 and TrainGT.3, the most important features are all based on full name.
All these results show the importance of name-based features (username, first name, last name and full name) in the identification process. The five top features on the datasets are presented in Table 8.
Most important features using the random forest classifier
LCS: longest common substring; LCSS: longest common sub-sequence; JS: Jaccard similarity; LD: Levenshtein distance.
6. Conclusion
In this article, we have proposed an innovative methodology for connecting people across GF and GT. Our method adopts minimal common information available through the official APIs of Facebook, Twitter and Google+ to drive features that can be used by supervised learning to effectively connect users across different OSNs. Relying on information available through APIs, our approach reduces privacy concerns as well as difficulties in collecting user information. By focusing on all common properties that characterise the systems under study, rather than on a predefined set of properties, we show that a better identification process can be achieved. It is one which simply uses available information across different platforms.
We constructed negative instances in three different ways, going beyond the commonly adopted random selection to evaluate the robustness of our identification algorithm on different datasets. Results show that the approach can lead to a very effective identification method and methodology for building reliable datasets. Moreover, we analysed the success of our method in a real scenario built on GF neighbourhoods.
Experiments show the advantages of the proposed method in comparison to previous methods; they also indicate that constructed features contain adequate information for connecting corresponding users.
Future work includes defining a framework for user identification across multiple OSNs and analysing the benefit of connecting users in different domains.
Footnotes
Appendix 1
Results of different classification techniques on unbalanced test sets ((90/10) and (10/90))
| Train dataset | Test dataset | Distribution (%) | Random forest |
Multilayer perceptron |
||||||
|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy | Precision | Recall | F-measure | Accuracy | Precision | Recall | F-measure | |||
| TrainGF.1 | TestGF.11 | 90–10 | 93.86 | 0.962 | 0.939 | 0.945 | 94.26 | 0.961 | 0.943 | 0.948 |
| TestGF.12 | 10–90 | 98.78 | 0.988 | 0.988 | 0.988 | 97.97 | 0.981 | 0.980 | 0.980 | |
| TestGF.21 | 90–10 | 76.19 | 0.821 | 0.762 | 0.789 | 78.57 | 0.841 | 0.786 | 0.811 | |
| TestGF.22 | 10–90 | 54.09 | 0.865 | 0.541 | 0.628 | 55.45 | 0.875 | 0.555 | 0.639 | |
| TrainGF.2 | TestGF.11 | 90–10 | 87.39 | 0.944 | 0.874 | 0.893 | 87.05 | 0.942 | 0.871 | 0.891 |
| TestGF.12 | 10–90 | 98.58 | 0.986 | 0.986 | 0.986 | 98.45 | 0.984 | 0.985 | 0.984 | |
| TestGF.21 | 90–10 | 71.82 | 0.929 | 0.718 | 0.776 | 70.23 | 0.922 | 0.702 | 0.764 | |
| TestGF.22 | 10–90 | 94.09 | 0.940 | 0.941 | 0.940 | 94.09 | 0.940 | 0.941 | 0.940 | |
| TrainGF.3 | TestGF.11 | 90–10 | 85.97 | 0.938 | 0.860 | 0.882 | 85.63 | 0.939 | 0.856 | 0.880 |
| TestGF.12 | 10–90 | 96.69 | 0.968 | 0.967 | 0.967 | 98.58 | 0.986 | 0.986 | 0.986 | |
| TestGF.21 | 90–10 | 67.85 | 0.927 | 0.679 | 0.745 | 65.87 | 0.926 | 0.659 | 0.729 | |
| TestGF.22 | 10–90 | 93.18 | 0.930 | 0.932 | 0.931 | 95.00 | 0.947 | 0.950 | 0.947 | |
| TrainGT.1 | TestGT.11 | 90–10 | 93.99 | 0.963 | 0.940 | 0.946 | 94.82 | 0.966 | 0.948 | 0.953 |
| TestGT.12 | 10–90 | 97.30 | 0.975 | 0.973 | 0.974 | 96.48 | 0.970 | 0.965 | 0.966 | |
| TestGT.21 | 90–10 | 87.05 | 0.876 | 0.871 | 0.873 | 88.24 | 0.884 | 0.882 | 0.883 | |
| TestGT.22 | 10–90 | 61.05 | 0.883 | 0.611 | 0.688 | 62.10 | 0.893 | 0.621 | 0.696 | |
| TrainGT.2 | TestGT.11 | 90–10 | 91.09 | 0.953 | 0.911 | 0.922 | 89.85 | 0.947 | 0.899 | 0.912 |
| TestGT.12 | 10–90 | 97.51 | 0.976 | 0.975 | 0.975 | 95.85 | 0.963 | 0.959 | 0.960 | |
| TestGT.21 | 90–10 | 88.96 | 0.939 | 0.890 | 0.904 | 87.76 | 0.936 | 0.878 | 0.895 | |
| TestGT.22 | 10–90 | 94.21 | 0.944 | 0.942 | 0.943 | 92.63 | 0.930 | 0.926 | 0.928 | |
| TrainGT.3 | TestGT.11 | 90–10 | 90.68 | 0.945 | 0.907 | 0.918 | 89.64 | 0.942 | 0.896 | 0.910 |
| TestGT.12 | 10–90 | 90.89 | 0.944 | 0.909 | 0.919 | 88.19 | 0.932 | 0.882 | 0.898 | |
| TestGT.21 | 90–10 | 89.20 | 0.940 | 0.892 | 0.906 | 88.48 | 0.943 | 0.885 | 0.901 | |
| TestGT.22 | 10–90 | 94.21 | 0.947 | 0.942 | 0.944 | 95.26 | 0.950 | 0.953 | 0.951 | |
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
This research project received grant from the University of Milan on local research funds.
