Abstract
User-generated content on the microblogging social network Twitter continues to grow and carries a significant amount of information. Semantic analysis offers the opportunity to discover and model latent interests in users’ publications. This article focuses on the problem of uncertainty in users’ publications, which has not been treated previously. It proposes a new approach for users’ interest discovery from uncertain information that augments traditional methods with possibilistic logic. Possibility theory provides a solid theoretical basis for treating incomplete and imprecise information and for inferring reliable expressions from a knowledge base. More precisely, this approach uses a product-based possibilistic network to model the knowledge base and to discover possibilistic interests. The DBpedia ontology is integrated into the interest discovery process to select the significant topics. The empirical analysis and the comparison with the best-known methods show the significance of this approach.
Introduction
Online social media have become essential tools for virtual interaction between individuals. Users interact in these virtual spaces for several reasons, such as creating relationships, forming communities, and sharing knowledge. Analyzing these interactions makes it possible to discover knowledge that is relevant for understanding users’ behaviour and for improving social networking services. Users’ interest discovery from textual publications refers to the process of identifying the relevant terms, entities or concepts that represent a user’s profile.
Users’ interest discovery is a fundamental task for many applications, such as personalized information [28] (e.g. finding the relevant publications for each user in social media), social recommendation [34], and the analysis of users’ opinions and behaviours [27, 31]. However, interest discovery faces many difficulties, such as data heterogeneity, noisy data, and the imprecision and incompleteness of the information contained in user publications. Indeed, users’ publications in online social networks are ungrammatical and noisy, and we therefore cannot guarantee parses for these texts [30]. A user’s publication is usually characterized by its reduced size (e.g. Twitter allows only 140 characters per message), by the ambiguity of some words, and by the lack of the context in which the message is posted [24]. Furthermore, social networks bring together individuals with different backgrounds and different interests. The heterogeneity and openness of virtual social spaces are the main sources of their richness. The analysis of users’ publications must be able to take this diversity of information into account. Classical approaches, such as word collocation, word and term co-occurrence, part-of-speech tagging and other statistical approaches, are not sufficient for analyzing user publications [23] and characterizing latent terms [26]. Algorithms such as entity recognition and concept extraction have been widely used to discover user interests. The main goal of these algorithms is to recognize the entities and concepts in the text and to determine those that are most representative of the user’s interests. Other approaches have exploited semantic resources (a thesaurus or an ontology) to disambiguate the returned entities. Extracting significant topics from users’ publications requires an advanced tool for managing incompleteness and imprecision. In this work, we focus on the uncertainty in users’ publications.
We augment traditional approaches with possibilistic logic, which provides a solid theoretical basis for modelling uncertain information. This approach, IDF-Tweet (Interest Discovery From users’ Tweet), differs from others by taking into account the uncertainty of information in online social networks. We use the DBpedia ontology, which contains concepts identified from the Wikipedia encyclopedia and covers several domains such as music, literature, and technology. Using possibility theory, the DBpedia ontology is transformed into a possibilistic semantic resource that allows the relevant topics to be deduced for each user publication.
The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 presents the possibilistic logic and the semantic resource used in this work. Section 4 details our approach for users’ interest discovery. Section 5 presents the experiments and the evaluation results. Finally, Section 6 concludes with a summary of the obtained results.
Related work
Discovering users’ interests in online social networks is a challenge for researchers in the knowledge discovery field. Several approaches have been proposed in the literature for this purpose. They can be divided into three large families: those based on social relationships [13, 14, 15, 16], those based on user folksonomies [7, 8, 9, 10], and those based on textual publications [17, 18, 19, 20].
Approaches using social relationships are based on the idea that a user’s behaviour is strongly influenced by the behaviour of his neighbours in his community. In the work [13], each hidden interest, called a sensitive interest, has a set of possible values. The probability that a user ‘u’ has an interest ‘i’ is given by a ratio between the number of users in his community and the number of users with this interest. Mislove et al. [14] propose a method for inferring a user’s interests from his social links. Each interest has an affinity value in each community, determined by the number of links sharing this interest. This affinity is used to determine the user’s hidden interests. Wen and Lin [15] use the LDA model to build two matrices: the first represents the relationships between members and topics, and the second represents the influence between users. The users’ interests are deduced using the product matrix. The work [16] is based on the idea that the interests of each user follow those of the experts whom this user follows. It discovers the experts’ interests from the lists specified by users (each user can specify an interest list for experts) and then infers the user’s interests. Approaches for interest discovery from social links do not take into account the specific interests of each user. In addition, they do not take into account the errors that can be generated by the heuristics used, as in the case where a user forms social relations with members who are far from his interests.
Social bookmarking is a functionality offered by social networks such as Flickr, Twitter, and del.icio.us. It allows users to describe their favourite web pages using a set of tags. The classification proposed in [7] shows the existence of seven main tag classes: specific, generic, context, synonyms, invented, organizational and subjective. Only the last class allows the discovery of users’ interests. The work [10] proposes an approach to discover users’ interests from tag frequency and co-occurrence. The representative tags are grouped together to form a unique interest. Each interest is represented by a set of frequent tags that appear together in many publications. Michlmayr and Cayzer [8] use a co-occurrence graph between tags. The part of the graph representing the user’s interests includes only the arcs with a high co-occurrence. The approach suggested in [11] discovers users’ interests by matching tags to Wikipedia pages and WordNet thesaurus entries. It allows the representative categories of the user’s interests to be identified. The analysis of users’ folksonomies allows knowledge about social network users to be identified, including their interests and preferences. However, these approaches encounter some difficulties, such as the lack of context when a tag appears and language problems (the combination of words, the language used, etc.). Thus, the folksonomy generated by users in social networks does not contain enough information for discovering users’ interests.
Approaches for users’ interest discovery based on entity identification in users’ publications are the closest to this work. Indeed, concept extraction from online social networks has been widely studied in recent works, and several approaches have been proposed for solving this problem. They can be divided into two groups: those based on natural language processing (NLP) tools [17, 18, 19, 22] and those based on semantic resources, mainly DBpedia and Wikipedia [20, 25, 30, 32].
Some approaches use concept extraction for identifying entities in user publications. An entity is an element in the text that can be a person, a place, a topic, etc. In [18], the author uses NLP tools, essentially POS tagging, and the topic model LabelLDA for identifying entities in users’ tweets. The approach [19] addresses the recognition of players’ named entities and of the interesting micro-events within a sports event. It uses textual publications with hashtags and annotations provided by users to compute the similarity between the vectors characterizing players and tweets. In the approach [17], the author uses part-of-speech tagging for entity recognition in tweets; a POS tagger is trained with the help of a new labelling scheme and a feature set that captures the unique characteristics of tweets.
For entity extraction from users’ publications, several tools have been proposed by the community, such as SpotLight, AlchemyAPI, OpenCalais, Extractiv and Zemanta. A comparative study of these approaches is available in [22]. These tools are used with semantic resources (SRs) to extract concepts from the text. In the approach [30], Wikipedia is used to discover topics of interest in users’ publications. The words shared between the publication and the entities allow the entities in this publication to be discovered and disambiguated. The work [20] uses news media such as BBC and CNN to discover user interests related to daily news. The entities are identified by the OpenCalais API, which exploits various semantic resources including the DBpedia ontology. The approach [35] considers both the textual publications and the users’ activity (retweet, mention) to identify personal interests and community interests. The TwiNER approach [25], proposed for entity recognition on the Twitter social network, is based on two stages: the first is the segmentation of tweets into fragments, and the second is the matching with the Wikipedia hierarchy. The approach [32] uses the free SpotLight tool to extract concepts from users’ publications using the DBpedia ontology. The Spreading Activation theory is used to activate the important nodes (concepts) in the DBpedia graph.
These approaches use NLP tools to extract topics from users’ publications. They represent concepts without considering the relations between them or other information available in DBpedia, such as descriptions and related concepts. In addition, these works treat the publications of each user separately and find only individual and subjective interests. This paper proposes a hybrid approach that combines textual publications, tags, and annotations for discovering common user interests. The imprecision and incompleteness of users’ publications are taken into account by combining statistical computing, dictionary lookup, and possibilistic logic. The product-based possibilistic network model provides two measures (Possibility and Necessity) that explain several aspects of concept relevance. The importance of terms and words in the possibilistic model is given by two complementary measures: the first eliminates the unrepresentative words and terms, and the second reinforces the importance of the meaningful ones. The important information in DBpedia nodes is used for computing the importance of words and terms and for modelling the possibilistic network. Our objective is to improve on the precision and recall of the latest approaches by taking imprecision and incompleteness into account, in order to find the most important and common user interests. The next section presents the different steps of possibilistic users’ interest discovery from incomplete and imprecise information.
Possibilistic theory and semantic resource
Basics of possibilistic logic
Our approach, IDF-Tweet, finds its theoretical framework in possibilistic logic and, more precisely, in the possibilistic network. This framework allows the different elements linked to a given problem to be modelled while taking into account the uncertainty of the processed information. This section gives a brief presentation of the basic concepts of this theory and of its use for users’ interest discovery.
Possibilistic theory
Possibilistic theory, proposed by [1] for modelling incomplete and imprecise knowledge, is an extension of fuzzy logic. It assigns weighted possibility distributions over the set of interpretations of a specific problem. In an application context, possibilistic logic allows a system to represent the received knowledge with possibility distributions reflecting the strength of belief in formulas. It thus allows reasoning and deducing other interpretations with possibilistic inference [3].
In formal terms, let
The possibilistic conditioning
The directed possibilistic network can be seen as the result of the fusion of one-formula knowledge bases. Each formula corresponds to a conditional possibility in the directed possibilistic graph [2]. Like a probabilistic network, a possibilistic network is a directed acyclic graph modelling cause and effect between the variables of a given problem.
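For reference, the standard textbook definitions that this section relies on can be summarized as follows. This is a sketch in common notation, not the paper’s own formulas: the symbols \Omega, \pi, \Pi, N and \mathrm{Par}(\cdot) are assumptions and may differ from the notation used elsewhere in the paper.

```latex
% A possibility distribution over the set of interpretations \Omega (normalized)
\pi : \Omega \to [0,1], \qquad \max_{\omega \in \Omega} \pi(\omega) = 1

% Possibility and necessity of an event A \subseteq \Omega
\Pi(A) = \max_{\omega \in A} \pi(\omega), \qquad N(A) = 1 - \Pi(\bar{A})

% Product-based conditioning on an event B with \Pi(B) > 0
\pi(\omega \mid B) =
\begin{cases}
  \dfrac{\pi(\omega)}{\Pi(B)} & \text{if } \omega \in B \\[4pt]
  0 & \text{otherwise}
\end{cases}

% Chain rule of a product-based possibilistic network over X_1, \dots, X_n
\pi(X_1, \dots, X_n) = \prod_{i=1}^{n} \Pi\bigl(X_i \mid \mathrm{Par}(X_i)\bigr)
```

Necessity is the dual of possibility: an event is certain to the degree that its complement is impossible, which is why a concept can be fully possible (\Pi = 1) and still have a low necessity.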
Semantic resources
DBpedia is one of the most important semantic resources available on the web and lies at the core of open data. The purpose of this project is to use the names of the categories and pages of the Wikipedia encyclopedia and to organize them in an ontology by identifying the classes and the relations between them. This SR is used in several approaches for extracting knowledge from text. DBpedia is available in several languages and contains over 1.5 million concepts for the English language, divided into four main categories: people, works, organizations, and places. For each concept in these categories, the DBpedia ontology provides additional information such as the related concepts and the link to the Wikipedia category or page. A description of the concept is given by the property ‘dbo:abstract’, the semantically related concepts by the property ‘rdfs:seeAlso’, the same concept in another language by ‘owl:sameAs’, and the terms describing the subject of the concept by ‘dct:subject’. Other information related to each topic is also available, such as the author, genre, and external web links.
In our work, each topic can be represented by a concept from the semantic resource, and each concept is indexed by the terms given by the properties ‘dct:subject’ and ‘rdfs:seeAlso’. The terms of each concept form the second layer of the possibilistic network; the third layer is built from the words extracted from these terms. The three sets used to construct the possibilistic network are the following:
A set of concepts: Exists in DBpedia. A set of terms: Exists in DBpedia. A set of words: A new set.
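The three sets described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `CONCEPT_TERMS` excerpt is hypothetical toy data standing in for terms that would be read from ‘dct:subject’ and ‘rdfs:seeAlso’, and the function names `tokenize`, `build_layers` and `candidate_concepts` are introduced here for the example.

```python
import re
from collections import defaultdict

# Hypothetical toy excerpt: concept -> terms, standing in for the values of
# dct:subject / rdfs:seeAlso that would be read from DBpedia.
CONCEPT_TERMS = {
    "Miles_Davis": ["American jazz trumpeters", "Cool jazz"],
    "Roger_Federer": ["Swiss male tennis players", "Wimbledon champions"],
}


def tokenize(text):
    """Lower-case a term or a publication and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_layers(concept_terms):
    """Derive the word layer from the term layer.

    Returns (term_words, word_concepts): the words of each term, and for
    each word the set of concepts reachable through some term.
    """
    term_words = {}
    word_concepts = defaultdict(set)
    for concept, terms in concept_terms.items():
        for term in terms:
            words = tokenize(term)
            term_words[term] = words
            for word in words:
                word_concepts[word].add(concept)
    return term_words, word_concepts


def candidate_concepts(publication, word_concepts):
    """Concepts that share at least one word with the publication."""
    found = set()
    for word in tokenize(publication):
        found |= word_concepts.get(word, set())
    return found


term_words, word_concepts = build_layers(CONCEPT_TERMS)
print(candidate_concepts("Listening to jazz tonight", word_concepts))
```

Only the word set is new: concepts and terms are taken as given, and the words are derived by splitting the terms, which matches the description of the third layer.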
Possibilistic logic-based interest discovery
General architecture
Fig. 1. User’s interest discovery.
Our approach for user interest discovery from imprecise and incomplete information has three main tasks (Fig. 1). The first one is the construction of the semantic possibilistic resource, which has two main steps. In the first step, the concepts of the DBpedia ontology are modeled with a possibilistic network by identifying the variables, the relationships, and the influences between them. For a given concept
The second task in the interest discovery process is the inference from the possibilistic networks after the matching between users’ publications and concepts. The publication node represents the instance in the possibilistic network and the information received by the system. The possibilistic inference is used to propagate this information in the possibilistic networks. It allows the relevance of the root node to be determined, this node containing the concept
The last task in the interest discovery process is the construction of the interest-graph from the relevant concepts. The construction is made as follows: the nodes in the graph represent the returned concepts and the arcs represent the co-occurrence between them; two nodes are linked if they co-occur in users’ publications. The graph containing the relations between users and returned concepts is a bipartite graph G
where
The concept extraction process consists of selecting the relevant concepts for a given text out of a controlled vocabulary. The terminological resource simplifies the selection process by grouping the concepts of a specific domain in a single source. However, other works such as [4, 5] have shown the existence of irrelevant concepts for the processed texts. Furthermore, concept extraction encounters other problems, mainly the problem of under-generation related to exact search (relevant concepts are ignored) and the problem of over-generation related to approximate search. The possibilistic model takes these different points into account. Indeed, the relevance of a concept is determined by its necessity degree (certainty). More precisely, a concept is relevant for a user publication if it has the highest necessity degree. An algorithm for approximate inference in the possibilistic network allows us to compute the necessity degree. We control the relevance and the number of returned concepts using a threshold on the necessity value. We use four conditional possibility measures for each node in the possibilistic network, relative to each of its parents, to cover the different possible cases and to take into account the imprecision of the heuristics used.
The concept extraction algorithm aims to determine the similarity between concepts and text after matching the two objects (Concept, Publication). The user publications that interest us contain text, annotations (when a user mentions another user) and tags (Fig. 2). Each piece of information given in a user publication can be an important indicator of his interests. The user gives his opinion or explains a topic that interests him with a short text in the publication. The words must be linked to this subject. Each user can also mention other users using an annotation (a term preceded by @ containing a user identifier). These annotations usually contain not only the usernames of friends but also the username of a public personality of interest, a book title, a song title, an organization name, etc. In the database collected from Twitter, 33% of users’ annotations are linked to concepts that exist in the DBpedia ontology. Annotations in publications therefore provide interesting information on user interests. The third form of information in user publications is the tags, which contain a term composed of one or more keywords (term preceded with
The three classes of words considered in user publication are the following: (
After this classification, each publication P is represented as a set of q pairs:
Our objective then is to discover the latent interests in the users’ publications. Each user
Fig. 2. User publication from the Twitter social network.
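The three kinds of information carried by a publication (plain words, annotations and tags) can be separated with a few regular expressions. The following is a minimal sketch, not the paper’s implementation: the function name `split_publication` and the patterns are assumptions, with `@` marking annotations as described above and `#` marking tags as on Twitter. External links are dropped, since the Twitter data set does not consider them.

```python
import re

ANNOTATION = re.compile(r"@(\w+)")   # user mentions, e.g. @milesdavis
TAG = re.compile(r"#(\w+)")          # tags, e.g. #jazz
URL = re.compile(r"https?://\S+")    # external links, ignored here


def split_publication(text):
    """Split a publication into plain words, annotations and tags."""
    text = URL.sub(" ", text)
    annotations = ANNOTATION.findall(text)
    tags = TAG.findall(text)
    # Remove annotations and tags so that only the plain text remains.
    plain = TAG.sub(" ", ANNOTATION.sub(" ", text))
    words = re.findall(r"[a-z]+", plain.lower())
    return {"words": words, "annotations": annotations, "tags": tags}


parts = split_publication("Great set by @milesdavis tonight #jazz #live http://t.co/x")
print(parts)
```

Keeping the three classes apart makes it possible to weight them differently later, since an annotation or a tag may be a stronger indicator of interest than an ordinary word.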
The first layer in the possibilistic network is the concepts layer (Fig. 3). It includes the concepts selected by the publication P (those that share at least one word with P). The second layer is the terms layer (Section 3.2); these two layers represent the DBpedia ontology. The third layer is the possibilistic layer, which is built around the semantic resource.
Each root node is a binary random variable that can take a value in
Each term node can take a value in that describes the relevance of term Ti to its parent. The same way is used to model the variables
Possibility distribution
The nodes representing the concepts in the possibilistic network do not have a conditional possibility distribution, since they are root nodes. The only available information is the number of times the concept appears in the users’ publications. This frequency allows us to define a possibility distribution over the selected concepts as follows:
Where
Terms possibility distribution
Fig. 3. Possibilistic model architecture.
The possibility distribution over the term nodes is determined using their syntactic features, such as the size of the terms (Table 1). For example, given a concept that contains a set of terms, the term that contains fewer words carries more specific information and is more significant and more representative of its parent node.
Where
The measure
Where
For the variables
Where
Words possibility distribution
Possibilistic inference
Input: PSN: possibilistic semantic resource; P: user publication
Output: S: set of relevant concepts ranked by relevance
begin
    S := …
    for each concept … ∈ S do
        W := …    // shared words between P and the concept
        … := all possible configurations of T
        …(… P) := 1
        max := 0
        for each configuration … do
            for each term … do
                for each word … ∈ W do
                    …
                end for
            end for
            if … > max then
                max := …
            end if
        end for
        … max
        for each word … ∈ W do
            …
        end for
        N(…) := …
    end for
    Sort S according to the degree of Necessity.
    Return the first k elements of S    // k is determined from the experimental study
end.
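The inference procedure above can be illustrated with a small runnable sketch. It is not the paper’s Eq. (11): the conditional possibility tables, the restriction to one parent term per word, and the names `joint_possibility`, `necessity` and `rank_concepts` are all assumptions made for the example. The sketch enumerates every term configuration, combines local possibilities with the product, keeps the maximum, and then applies product-based conditioning to obtain the necessity of the concept given the publication. The enumeration is exponential in the number of terms, so it is meant for toy networks only.

```python
from itertools import product


def joint_possibility(concept_state, prior, term_cpt, word_cpt, children, evidence):
    """Possibility of (concept_state AND evidence).

    Tables are keyed by (child_state, parent_state). The value is the
    maximum, over all term configurations, of the product of the local
    conditional possibilities.
    """
    terms = list(term_cpt)
    best = 0.0
    for config in product([True, False], repeat=len(terms)):
        value = prior[concept_state]
        for term, term_state in zip(terms, config):
            value *= term_cpt[term][(term_state, concept_state)]
            for word in children[term]:
                # evidence[word] is True if the word occurs in the publication
                value *= word_cpt[word][(evidence[word], term_state)]
        best = max(best, value)
    return best


def necessity(prior, term_cpt, word_cpt, children, evidence):
    """N(concept | evidence) = 1 - Pi(not concept | evidence)."""
    pos = joint_possibility(True, prior, term_cpt, word_cpt, children, evidence)
    neg = joint_possibility(False, prior, term_cpt, word_cpt, children, evidence)
    norm = max(pos, neg)  # Pi(evidence)
    if norm == 0.0:
        return 0.0
    return 1.0 - neg / norm


def rank_concepts(networks, k):
    """Sort concepts by necessity and keep the first k."""
    scored = [(name, necessity(**net)) for name, net in networks.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical toy tables: one term with one word per concept.
TERM_TABLE = {(True, True): 1.0, (False, True): 0.3,
              (True, False): 0.2, (False, False): 1.0}
WORD_TABLE = {(True, True): 1.0, (False, True): 0.4,
              (True, False): 0.1, (False, False): 1.0}

jazz_net = {
    "prior": {True: 1.0, False: 1.0},
    "term_cpt": {"cool jazz": TERM_TABLE},
    "word_cpt": {"jazz": WORD_TABLE},
    "children": {"cool jazz": ["jazz"]},
    "evidence": {"jazz": True},      # the word appears in the publication
}
tennis_net = {
    "prior": {True: 1.0, False: 1.0},
    "term_cpt": {"tennis players": TERM_TABLE},
    "word_cpt": {"tennis": WORD_TABLE},
    "children": {"tennis players": ["tennis"]},
    "evidence": {"tennis": False},   # the word does not appear
}

print(rank_concepts({"Jazz": jazz_net, "Tennis": tennis_net}, k=1))
```

Words that are absent from the publication enter the product through their `False` state, which is one simple way of letting non-shared words lower the relevance of a concept.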
The evaluation process computes the relevance of each candidate concept from two measures: Possibility and Necessity. We use the inference Eq. (11) proposed in [33]; it takes into account the words that are not shared between the two objects (Publication, Concept). The algorithm “Possibilistic Inference” describes the procedure for selecting the concepts relevant to a given publication.
Where
The relevance of each concept is determined using the Necessity degree given by the following equation:
Data collection
The evaluation of this approach for users’ interest discovery is performed on two data sets collected from two different social networks. The first one is the LiveJournal social network, which allows user profiles (interests, friends, publications, etc.) to be retrieved. The second social network is the Twitter microblogging site.
LiveJournal users can create social relationships and share messages with friends. Each user specifies his/her interests using concepts predefined by the site or freely chosen text. LiveJournal gives crawlers access to most user profiles, including the publications, the social relationships and the interests of each user. We collected the publications and interests of 150 users, choosing short texts (between 20 and 65 words per publication), with an average of 80 publications per user (Table 3). The evaluation with LiveJournal profiles allows us to assess the reliability and precision of our approach and to compare them with those of an existing method. We evaluate our results against the collected user interests: an interest discovered with our approach is relevant for a user if it exists in his/her profile.
The second database used to evaluate this approach is collected from the Twitter microblogging site. We developed a crawler to collect user profiles from Twitter. First, we search for user profiles using keywords such as music, jazz, sport, and tennis. Next, we collect the publications found in each user profile. The publications are collected in chronological order until an average of 95 publications per user is reached. We choose English publications since we use the English version of DBpedia. They contain text, annotations, tags and links to external resources; in this database, external links are not considered. Using this method, we saved more than 14250 publications from 150 profiles (Table 4). The second database does not include the interest lists needed for the evaluation of our approach. For this reason, we use a simple tweet recommendation system in which each user is modelled by his/her interests and each tweet is modelled by the concepts extracted from it (at least 6 concepts). The tweets used in the recommendation system are collected randomly with our crawler from profiles other than those of the original users. Users and tweets are represented using the vector model: each vector contains the concepts (or the interests, for users), and the importance of each concept is the certainty value (Necessity) given by the possibilistic concept extraction approach. The cosine similarity determines the relevance of each candidate tweet for users. This recommendation system allows the precision and recall of our approach to be determined; the approach is efficient if the recommendation process returns the relevant tweets.
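The recommendation step used for the Twitter evaluation can be sketched as follows. This is an illustration under assumptions: vectors are stored as dictionaries mapping concepts to necessity values, the function names `cosine` and `recommend` are introduced for the example, and the weights are made-up numbers rather than results from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse concept vectors (dicts)."""
    dot = sum(weight * v[concept] for concept, weight in u.items() if concept in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend(user_vector, tweets, top_n):
    """Return the ids of the top_n tweets most similar to the user."""
    ranked = sorted(tweets.items(),
                    key=lambda item: cosine(user_vector, item[1]),
                    reverse=True)
    return [tweet_id for tweet_id, _ in ranked[:top_n]]


# Hypothetical user interests and tweets, weighted by necessity.
user = {"Jazz": 0.8, "Trumpet": 0.5}
tweets = {
    "t1": {"Jazz": 0.7, "Trumpet": 0.6},
    "t2": {"Tennis": 0.9},
    "t3": {"Jazz": 0.6, "Tennis": 0.6},
}
print(recommend(user, tweets, top_n=2))
```

A tweet that shares no concept with the user gets a similarity of zero, so it can only be recommended when fewer than `top_n` tweets overlap with the user’s interests.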
LiveJournal data set
Twitter data set
Overall, the two data sets used in the evaluation process have different characteristics. On the one hand, publications on the Twitter microblogging site are characterized by their small size, but they are rich in tags and annotations. On the other hand, LiveJournal users rarely use tags and annotations, but their publications are longer than Twitter publications.
We use the three best-known metrics to evaluate our approach for interest discovery from uncertain information: Precision, Recall, and F-measure. The precisions P@5, P@10, P@20 and P@30 represent, respectively, the mean precision values at the top 5, 10, 20 and 30 returned elements (interests or tweets). The recall is defined as the ratio between the number of relevant elements found and the number of relevant elements that exist in the database. The F-measure is a synthetic indicator that combines precision and recall.
An element can be a tweet from Twitter publications or an interest from LiveJournal profiles. For a given user u, a returned interest is relevant if it exists in his list of interests (relevance(
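As a concrete reading of these metrics, the sketch below computes precision at k, recall and the F-measure for one user. It assumes the balanced F-measure (the harmonic mean of precision and recall), which the text does not state explicitly; the function names and the toy interest lists are hypothetical.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k returned elements that are relevant (k > 0)."""
    hits = sum(1 for element in ranked[:k] if element in relevant)
    return hits / k


def recall(returned, relevant):
    """Relevant elements found, over all relevant elements in the database."""
    if not relevant:
        return 0.0
    return len(set(returned) & set(relevant)) / len(relevant)


def f_measure(precision, recall_value):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall_value == 0.0:
        return 0.0
    return 2.0 * precision * recall_value / (precision + recall_value)


# Hypothetical ranked output for one user and the interests in the profile.
ranked = ["jazz", "tennis", "music", "cinema", "travel"]
relevant = {"jazz", "music", "books"}

p5 = precision_at_k(ranked, relevant, 5)
r = recall(ranked, relevant)
print(p5, r, f_measure(p5, r))
```

The reported P@5 to P@30 values would then be obtained by averaging `precision_at_k` over all users for each value of k.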
Fig. 4. Necessity and centrality: thresholds analysis.
Fig. 5. Discovered interests and concepts per user.
Interest discovery results
The interest discovery process contains two main steps responsible for filtering the returned concepts. The first one is the ranking of concepts using the Necessity degree. In the second step, we select the first relevant concepts to build the interest-graph in order to identify the most common nodes. For the two steps, we chose the two thresholds that maximize the precision (Fig. 4): 0.47 for the concept relevance and 1.483 (centrality/1000) for the interest centrality. The number of concepts used varies according to the values of necessity and centrality. Fig. 5 shows the distribution of the returned concepts and interests for users. The variation of the concepts shows that the number of concepts returned in the first filtering step is rather high (between 120 and 300 concepts). Increasing the threshold considerably reduces the number of concepts, but without giving importance to the centrality of these concepts. For this reason, we use a second filtering step based on the interest-graph to select the most popular and representative interests (between 13 and 33 interests per user).
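The two filtering steps can be sketched as follows. This is a minimal illustration under assumptions: concepts are linked when they co-occur in a publication, closeness centrality is computed by breadth-first search as the number of reachable nodes divided by the sum of their distances, and the centrality threshold of 0.6 is a made-up value, since the scale behind the reported 1.483 (centrality/1000) is not specified here. The necessity threshold 0.47 is the one reported above; all function names are hypothetical.

```python
from collections import deque
from itertools import combinations


def filter_by_necessity(scored, threshold):
    """Step 1: keep the concepts whose necessity reaches the threshold."""
    return {concept for concept, value in scored.items() if value >= threshold}


def build_interest_graph(publications_concepts):
    """Link two concepts when they co-occur in at least one publication."""
    graph = {}
    for concepts in publications_concepts:
        for concept in concepts:
            graph.setdefault(concept, set())
        for a, b in combinations(sorted(set(concepts)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def closeness_centrality(graph, node):
    """Reachable nodes divided by the sum of their shortest-path distances."""
    distance = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for neighbour in graph[current]:
            if neighbour not in distance:
                distance[neighbour] = distance[current] + 1
                queue.append(neighbour)
    total = sum(distance.values())
    if total == 0:
        return 0.0
    return (len(distance) - 1) / total


def select_interests(graph, min_centrality):
    """Step 2: keep the most central concepts of the interest-graph."""
    return {node for node in graph
            if closeness_centrality(graph, node) >= min_centrality}


# Hypothetical relevant concepts found in three publications.
publications = [["Jazz", "Trumpet"], ["Jazz", "Blues"], ["Blues", "Guitar"]]
graph = build_interest_graph(publications)
print(filter_by_necessity({"Jazz": 0.8, "Tennis": 0.2}, 0.47))
print(select_interests(graph, min_centrality=0.6))
```

On this toy path graph the two middle concepts have a closeness of 0.75 and the two end concepts 0.5, so only the middle ones are kept as common interests.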
Necessity distribution for Dist-1 and Dist-2.
In the possibilistic model, we use statistical computing, such as word frequency and word importance in the document corpus, together with the morphological characteristics of the terms in the semantic resource (such as the term size), to compute the possibility distributions. We evaluate the usefulness of the possibility distributions in our interest discovery model by comparing the results obtained with two different distributions. The first one is the possibility distribution (Dist-1) of the terms (Table 1) used in our approach. The second one is a distribution (Dist-2) that ignores the features of the semantic resource terms (
The possibilistic theory used for concept extraction in the interest discovery approach provides two evaluation measures: Possibility and Necessity. The first measure indicates the possibility that a candidate concept is relevant, whereas the second confirms the relevance of the given concept. We compare the results returned by the two measures on the two databases. Table 5 shows the importance of the Necessity measure (certainty) compared with the Possibility measure: More than
Comparison between possibility and necessity
Comparison between two possibility distributions
Comparison with IHG approach
We compare our approach with the IHG (Interest Hierarchy Generator) approach [32], proposed by P. Kapanipathi for user interest discovery. It uses an entity recognition tool, e.g. SpotLight, for identifying the primitive interests, together with the Wikipedia hierarchy. The SpotLight method [21] is mainly based on a vector model proposed for the information retrieval problem, where it is used for determining the similarity between documents and a query [9]. It relies essentially on the DBpedia semantic resource and has been used by many earlier approaches to extract concepts from web content. In the IHG approach, the concepts returned by SpotLight are used by the spreading activation algorithm in the Wikipedia hierarchy. The algorithm begins with the activation of the primitive nodes in the hierarchy. In the second step, this activation propagates to similar nodes in order to infer the hierarchical interests of each user. In our approach, we use the possibilistic model recently proposed for the information retrieval problem. It provides several measures that take into account the various factors that may affect the entity identification process. The interest-graph is constructed to select the most important interests. The comparison between the IHG approach and IDF-Tweet shows that our approach provides better precision and recall than IHG with SpotLight.
The difference between the two values of F-measure (
Conclusion
This paper deals with a recent problem linked to users’ interest discovery from uncertain information. Interest discovery in social networks is an essential task for several applications, such as information retrieval, social recommendation, and community finding. Incompleteness and imprecision are the main characteristics of user publications on social media. To address this problem, we proposed an approach based on possibility theory and semantic resources. We use structural indexes in the graph to select the important and common interests. Possibilistic logic, as an extension of fuzzy logic, has recently been used in several problems where the information to be treated is unreliable. It allows the knowledge base to be modelled and new assertions to be inferred when new information arrives. We use product-based possibilistic networks to model the task of topic identification from user publications. Approximate inference with the maximum and product operators, which guarantees a polynomial response time, is used to determine the relevance of the concepts. In the last step, we construct an interest-graph based on the co-occurrence between concepts. We identify the most important interests in the graph using closeness centrality. This allows user interest discovery to be performed adequately.
A wide experimental study shows the role of the different parameters of our method, in particular the possibility distributions and the meaning of Possibility and Necessity in possibilistic logic. The possibility distribution that we used is based on statistical computing, such as word frequency, and on term features. The possibilistic network thus allows us to take into account different characteristics and factors that can be helpful in discovering interests in users’ publications. We used two data sets for the comparison with the IHG approach. This approach uses the SpotLight tool, which is based on the vector model, and the spreading activation algorithm in the Wikipedia hierarchy. In our method, the assignment of possibility and necessity values to each word and term in the semantic resource allows the degrees of representativeness and certainty of these two elements to be quantified. In addition, the use of the interest-graph to filter the returned concepts allows the popular and significant items to be selected for all users.
This possibilistic algorithm can be used in many applications and can help the scientific community interested in knowledge discovery. Moreover, online social networks are promising sources of information that can be used for discovering knowledge about users. Future work will concentrate on tweet stream analysis on microblogging social networks. A first direction is the dynamic discovery of users’ interests from publication streams. Indeed, millions of publications are shared in the virtual social space. The real-time analysis of these publications allows them to be organized and changes in knowledge to be discovered. In this case, we must be able to determine trends and changes in user behaviour over time. Identifying changes in users’ interests allows users’ opinions and preferences to be followed and changes in behaviour to be recognized.
