Possibilistic interest discovery from uncertain information in social networks

Abstract

User-generated content on the microblogging social network Twitter continues to grow and carries a significant amount of information. Semantic analysis offers the opportunity to discover and model the latent interests in users’ publications. This article focuses on the problem of uncertainty in users’ publications, which has not previously been treated. It proposes a new approach for discovering users’ interests from uncertain information that augments traditional methods with possibilistic logic. Possibility theory provides a solid theoretical basis for treating incomplete and imprecise information and for inferring reliable expressions from a knowledge base. More precisely, this approach uses a product-based possibilistic network to model the knowledge base and to discover possibilistic interests. The DBpedia ontology is integrated into the interest discovery process to select the significant topics. An empirical analysis and a comparison with the best-known methods show the significance of this approach.

Keywords

Interest discovery, topic extraction, social network, DBpedia, possibilistic network, uncertainty

1. Introduction

Online social media have become essential tools for virtual interaction between individuals. Users interact in these virtual spaces for several reasons, such as creating relationships, forming communities, and sharing knowledge. Analysing these interactions reveals knowledge that helps to understand user behaviour and to improve social networking services. Users’ interest discovery from textual publications refers to the process of identifying the relevant terms, entities or concepts that represent a user’s profile.

Users’ interest discovery is a fundamental task for many applications, such as personalized information [28] (e.g. finding the relevant publications for each user in the social media), social recommendation [34], and the analysis of users’ opinions and behaviours [27, 31]. However, interest discovery faces many difficulties, such as data heterogeneity, noisy data, and the imprecision and incompleteness of the information contained in user publications. Indeed, users’ publications in online social networks are ungrammatical and noisy, so parses of these texts cannot be guaranteed [30]. A user’s publication is usually characterized by its reduced size (e.g. Twitter offers just 140 characters per message), by the ambiguity of some words, and by the lack of context in which the message is posted [24]. Furthermore, social networks bring together individuals with different backgrounds and different interests. The heterogeneity and openness of virtual social spaces are the main sources of their wealth, and the analysis of users’ publications must be able to take this diversity of information into account. Classical approaches, such as word collocation, word and term co-occurrence, part-of-speech tagging and other statistical approaches, are not sufficient for analysing user publications [23] and characterizing latent terms [26]. Algorithms such as entity recognition and concept extraction have been widely used to discover user interests. Their main goal is to recognize the entities and concepts in the text and to determine those that are most representative of user interests. Other approaches have exploited semantic resources (thesaurus or ontology) to disambiguate the returned entities. Extracting significant topics from users’ publications requires an advanced tool for managing incompleteness and imprecision. In this work, we give importance to the uncertainty present in users’ publications.
We augment traditional approaches with possibilistic logic, which provides a solid theoretical basis for modelling uncertain information. Our approach, IDF-Tweet (Interest Discovery From users’ Tweet), differs from others by taking into account the uncertainty of information in online social networks. We use the DBpedia ontology, which contains concepts identified from the Wikipedia encyclopedia and covers several domains such as music, literature and technology. With possibilistic theory, the DBpedia ontology is transformed into a possibilistic semantic resource from which the relevant topics of each user publication can be deduced.

The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 presents the possibilistic logic and the semantic resource used in this work. Section 4 details our approach for users’ interest discovery. Section 5 presents the experiments and the evaluation results. Finally, Section 6 concludes with a summary of the obtained results.

2. Related work

Discovering users’ interests in online social networks is a challenge for researchers in the knowledge discovery field. Several approaches have been proposed in the literature for this purpose. They can be divided into three large families: those based on social relationships [13, 14, 15, 16], those based on user folksonomies [7, 8, 9, 10], and those based on textual publications [17, 18, 19, 20].

Approaches using social relationships are based on the idea that a user’s behaviour is strongly influenced by the behaviour of the neighbours in his community. In [13], each hidden interest, called a sensitive interest, has a set of possible values. The probability that a user ‘u’ has an interest ‘i’ is estimated from the ratio between the number of users in the community of ‘u’ and the number of users with this interest. Mislove et al. [14] propose a method for inferring user interests from social links. Each interest has an affinity value in each community, determined by the number of links sharing this interest, and this affinity is used to determine the user’s hidden interests. Wen and Lin [15] use the LDA model to build two matrices: the first represents the relationships between members and topics, and the second represents the influence between users. Users’ interests are deduced from the product of the two matrices. The work in [16] is based on the idea that the interests of each user follow the interests of the experts that this user is following. It discovers the experts’ interests from the lists specified by users (each user can specify an interest list for experts) and then infers the user’s interests. Approaches that discover interests from social links do not take into account the specific interests of each user. In addition, they ignore the errors that the heuristics may introduce, as in the case where a user forms social relations with members who do not share his interests.

Social bookmarking is a functionality offered by social networks such as Flickr, Twitter and del.icio.us. It allows users to describe their favourite web pages with a set of tags. The classification proposed in [7] shows the existence of seven main classes of tags: specific, generic, context, synonyms, invented, organizational and subjective. Only the last class allows user interests to be discovered. The work in [10] proposes an approach that discovers user interests from tag frequency and co-occurrence. Representative tags are grouped together to form a single interest, so that each interest is represented by a set of frequent tags that appear together in many publications. Michlmayr and Cayzer [8] use a co-occurrence graph between tags, in which the part representing the user’s interests includes only the arcs with a high co-occurrence. The approach in [11] discovers users’ interests by matching tags to Wikipedia pages and WordNet thesaurus entries, which identifies the categories representative of the user’s interests. The analysis of user folksonomies thus yields knowledge about social network users, including their interests and preferences. However, these approaches encounter difficulties such as the lack of context in which a tag appears and language problems (word combinations, the language used, etc.). As a result, the folksonomy generated by users in social networks does not contain enough information for discovering users’ interests.

Approaches that discover users’ interests by identifying entities in users’ publications are the closest to this work. Indeed, concept extraction from online social networks has received growing attention in recent work, and several approaches have been proposed for this problem. They can be divided into two groups: those based on natural language processing (NLP) tools [17, 18, 19, 22] and those based on semantic resources, mainly DBpedia and Wikipedia [20, 25, 30, 32].

Some approaches use concept extraction to identify entities in user publications. An entity is an element of the text that can be a person, a place, a topic, etc. In [18], the author uses NLP tools, essentially POS tagging, and the topic model LabelLDA to identify entities in users’ tweets. The approach in [19] addresses the recognition of players’ named entities and of interesting micro-events within a sports event. It uses textual publications together with the hashtags and annotations provided by users to compute the similarity between vectors characterizing players and tweets. In [17], the author uses part-of-speech tagging for entity recognition in tweets, training a POS tagger with a new labelling scheme and a feature set that captures the unique characteristics of tweets.

For entity extraction from users’ publications, the community has proposed several tools, such as SpotLight, AlchemyAPI, OpenCalais, Extractiv and Zemanta. A comparative study of these tools is available in [22]. They are used together with semantic resources to extract concepts from text. In [30], Wikipedia is used to discover topics of interest in users’ publications: matching the words of a publication with the words of entities makes it possible to discover and disambiguate the entities in this publication. The work in [20] uses news media such as the BBC and CNN to discover user interests related to daily news. The entities are identified by the OpenCalais API, which relies on various semantic resources including the DBpedia ontology. The approach in [35] considers both the textual publications and the users’ activity (retweets, mentions) to identify personal and community interests. The TwiNER approach [25], proposed for entity recognition on the Twitter social network, has two stages: the segmentation of tweets into fragments, followed by matching against the Wikipedia hierarchy. The approach in [32] uses the free SpotLight tool to extract concepts from users’ publications using the DBpedia ontology. Spreading Activation theory is then used to activate the important nodes (concepts) in the DBpedia graph.

These approaches rely on NLP tools to extract topics from users’ publications. They represent concepts without considering the relations between them or the other information available in DBpedia, such as descriptions and related concepts. In addition, they treat the publications of each user separately and find only individual and subjective interests. This paper proposes a hybrid approach that combines textual publications, tags and annotations to discover common users’ interests. The imprecision and incompleteness of users’ publications are taken into account by combining statistical computing, dictionary lookup and possibilistic logic. The product-based possibilistic network model provides two measures (Possibility and Necessity) that explain several aspects of concept relevance. The importance of terms and words in the possibilistic model is given by these two complementary measures: the first eliminates unrepresentative words and terms, and the second reinforces the importance of meaningful ones. The information attached to DBpedia nodes is used to compute the importance of words and terms and to model the possibilistic network. Our objective is to improve on the precision and recall of recent approaches by taking imprecision and incompleteness into account, in order to find the most important and common users’ interests. The next section presents the steps of possibilistic users’ interest discovery from incomplete and imprecise information.

3. Possibilistic theory and semantic resource

3.1 Basics of possibilistic logic

Our approach IDF-Tweet finds its theoretical framework in possibilistic logic, and more precisely in the possibilistic network. This framework makes it possible to model the different elements linked to a given problem while taking into account the uncertainty of the processed information. This section gives a brief presentation of the basic concepts of this theory and of its use for users’ interest discovery.

3.1.1 Possibilistic theory

Possibilistic theory, proposed in [1] for modelling incomplete and imprecise knowledge, is an extension of fuzzy logic. It assigns a weighted possibility distribution to the set of interpretations of a specific problem. In an application context, possibilistic logic allows a system to represent the knowledge it receives with possibility distributions that reflect the strength of belief in formulas. It thus allows reasoning and deducing other interpretations through possibilistic inference [3].

In formal terms, let $\Omega$ be the universe of discourse containing all possible interpretations. The knowledge about the value of a variable $X$ is described by a function $\pi:\Omega\rightarrow[0,1]$, the possibility distribution over this variable. Let $u\in\Omega$ be a possible value of $X$. The possibility $\pi(u)=1$ means that the proposition $(X=u)$ is fully compatible with the available information, whereas $\pi(u)=0$ means that $(X=u)$ can be excluded from the knowledge set. The values $\pi(u)=\pi(\neg u)=1$ express total ignorance and a lack of information about the value of $X$. Consequently, the fact that a proposition is possible does not imply that its negation is impossible. A second measure, the necessity measure denoted by $N$, reflects the certainty of a proposition. Let $A\subseteq\Omega$ be a set of values. Given the knowledge available to a system, the possibility $\Pi(A)$ reflects the degree to which it is plausible that the value of $X$ is among those in $A$, while the necessity $N(A)$ reflects the degree to which the available knowledge implies that the value of $X$ must be among those in $A$:

$\displaystyle\Pi(A)=\max_{u\in A}\pi(u)\quad\text{and}\quad N(A)=1-\Pi(\Omega\setminus A)$ (1)
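As an illustration, the two measures of Eq. (1) can be sketched in Python over a finite universe. The distribution values and topic names below are hypothetical and are not taken from the experiments; the function names are ours.

```python
def possibility(pi, subset):
    """Pi(A): the largest pi(u) over u in A (0.0 for the empty set)."""
    return max((pi[u] for u in subset), default=0.0)


def necessity(pi, subset):
    """N(A) = 1 - Pi(complement of A in the universe)."""
    complement = set(pi) - set(subset)
    return 1.0 - possibility(pi, complement)


# Hypothetical possibility distribution over four candidate topics.
pi = {"music": 1.0, "sport": 0.6, "politics": 0.2, "cinema": 0.0}
A = {"music", "sport"}

print("Pi(A) =", possibility(pi, A))
print("N(A)  =", necessity(pi, A))
```

In this example the set A is fully possible because it contains a value with possibility 1, and it is necessary only to the extent that every value outside A is implausible.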

3.1.2 Possibilistic conditioning

The possibilistic conditioning $\Pi(\cdot|b)$, inspired by Bayes’ theorem in probabilistic logic, allows a system to revise its possibility distributions when new information is received, based on the available knowledge. It consists of updating the original knowledge with the arrival of new, certain information. There are two definitions of possibilistic conditioning [2]. In the qualitative setting, conditioning orders the interpretations on a finite scale, and the minimum operator is used to determine the new distribution $\Pi(\cdot|b)$. In the numerical setting [29], conditioning determines the plausibility of new knowledge by combining (aggregating) independent possibility distributions. In both cases, the certainty degree (Necessity) is determined in the same way:

$\displaystyle N(q|p)=1-\Pi(\neg q|p)$ (2)
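The two forms of conditioning can be illustrated with a small Python sketch. The source does not spell out the conditioning formulas, so the code assumes the usual textbook definitions on a finite universe: in the product-based form each value inside the event is divided by the possibility of the event, and in the min-based form the most plausible values inside the event are raised to 1 while the others are kept. The distribution is hypothetical.

```python
def condition(pi, event, mode="product"):
    """Condition a possibility distribution on an event (a set of values).

    Assumed definitions (not given explicitly in the text):
      product: pi(u|b) = pi(u) / Pi(b) for u in b, 0 otherwise
      min:     pi(u|b) = 1 if pi(u) == Pi(b), pi(u) if pi(u) < Pi(b), 0 outside b
    """
    height = max(pi[u] for u in event)  # Pi(b)
    if height == 0:
        raise ValueError("cannot condition on an impossible event")
    conditioned = {}
    for u, p in pi.items():
        if u not in event:
            conditioned[u] = 0.0
        elif mode == "product":
            conditioned[u] = p / height
        else:
            conditioned[u] = 1.0 if p == height else p
    return conditioned


def conditional_necessity(pi, q, p, mode="product"):
    """N(q|p) = 1 - Pi(not q | p), as in Eq. (2)."""
    cond = condition(pi, p, mode)
    return 1.0 - max((v for u, v in cond.items() if u not in q), default=0.0)


# Hypothetical distribution over three topics.
pi = {"music": 1.0, "sport": 0.6, "politics": 0.3}
evidence = {"sport", "politics"}

print(condition(pi, evidence, "product"))
print(condition(pi, evidence, "min"))
print(conditional_necessity(pi, {"sport"}, evidence, "product"))
```

Both forms exclude the values outside the evidence; they differ only in how the remaining degrees are rescaled, which is why the necessity of Eq. (2) can take different values under each form.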

3.1.3 Possibilistic network

The directed possibilistic network can be seen as the result of the fusion of one-formula knowledge bases. Each formula corresponds to a conditional possibility in the directed possibilistic graph [2]. Like a probabilistic network, a possibilistic network is a directed acyclic graph that models the cause-and-effect relations between the variables $\{A_{1},\ldots,A_{n}\}$ of a given problem. The domains associated with the variables $A_{i}$ are $D=\{D_{1},\dots,D_{n}\}$; here the variables are binary, and their values are called events or interpretations. The nodes of the possibilistic network represent the variables and the arcs represent the relationships between them. If there is an arc from node $A_{j}$ to node $A_{i}$, the first node is called the parent of the second. For each node in the graph, a conditional possibility $\Pi(a_{i}|Par(A_{i}))$ is associated with each of its instances and each configuration of its parents (where $a_{i}$ is an instance of $A_{i}$ and $Par(A_{i})$ is a possible configuration of the parent nodes). When new information arrives, inference in the possibilistic network determines the possible state (with the Possibility measure) and the certain state (with the Necessity measure) of each node in the graph. Inference algorithms designed for probabilistic networks, such as the junction tree or arithmetic circuits, are suitable for possibilistic inference, but most of them run in exponential time. This work uses an approximate inference procedure that runs in polynomial time to improve the response time of our algorithm.

3.2 Semantic resources

DBpedia is one of the most important semantic resources available on the web and is at the core of open data. The purpose of this project is to take the names of categories and pages from the Wikipedia encyclopedia and organize them in an ontology by identifying the classes and the relations between them. This semantic resource is used in several approaches for extracting knowledge from text. DBpedia is available in several languages and contains over 1.5 million concepts for the English language, divided into four main categories: people, works, organizations, and places. For each concept in these categories, the DBpedia ontology provides additional information such as the related concepts and the link to the Wikipedia category or page. A description of the concept is given by the property ‘dbo:abstract’, the semantically related concepts by the property ‘rdfs:seeAlso’, the same concept in another language by ‘owl:sameAs’, and the terms describing the subject of the concept by ‘dct:subject’. Other information related to each topic is also available, such as the author, genre and external web links.

In our work, each topic is represented by a concept from the semantic resource, and each concept is indexed by the terms given by the properties ‘dct:subject’ and ‘rdfs:seeAlso’. The concepts form the first layer of the possibilistic network, the terms of each concept form the second layer, and the third layer is built from the words extracted from the terms. The three sets used to construct the possibilistic network are the following:

A set of concepts: Exists in DBpedia.

A set of terms: Exists in DBpedia.

A set of words: A new set.
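The three sets can be illustrated with a minimal Python sketch. It assumes that the terms of each concept have already been retrieved from DBpedia into a plain dictionary; the concept names, the terms and the tokenization by whitespace are hypothetical simplifications.

```python
def build_layers(concept_terms):
    """Derive the concept, term and word sets from a concept -> terms mapping."""
    concepts = set(concept_terms)
    terms = set()
    words = set()
    term_words = {}  # term -> list of its words (third-layer links)
    for term_list in concept_terms.values():
        for term in term_list:
            tokens = term.lower().replace("_", " ").split()
            terms.add(term)
            term_words[term] = tokens
            words.update(tokens)
    return concepts, terms, words, term_words


# Hypothetical excerpt standing in for 'dct:subject' / 'rdfs:seeAlso' values.
concept_terms = {
    "Jazz": ["Jazz music", "American styles of music"],
    "Tennis": ["Racket sports"],
}

concepts, terms, words, term_words = build_layers(concept_terms)
print(sorted(concepts))
print(sorted(terms))
print(sorted(words))
```

Only the word set is new with respect to DBpedia: it is obtained by splitting the existing terms, which is what later allows the words of a publication to be matched against concepts.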

4. Possibilistic logic-based interest discovery

4.1 General architecture

Figure 1.

User’s interest discovery.

Our approach for user interest discovery from imprecise and incomplete information has three main tasks (Fig. 1). The first is the construction of the possibilistic semantic resource, which has two main steps. In the first step, the concepts of the DBpedia ontology are modelled with a possibilistic network by identifying the variables, the relationships and the influences between them. For a given concept $C_{i}$, the possibilistic network contains three layers that represent its information: the concept itself, its terms, and the words of these terms. This representation allows us to match users’ publications with the selected concepts and to determine the relevance of each concept through possibilistic inference. The second step of the first task is the calculation of the conditional possibility distributions for each node of the possibilistic network. These distributions are determined from statistics such as word frequency, the importance of words in the document corpus, and the morphological features of the terms in the semantic resource. A document corpus collected from the DBpedia ontology through the property ‘dbo:abstract’ is used for this task.

The second task in the interest discovery process is inference in the possibilistic networks after users’ publications have been matched with concepts. The publication node is the instantiated node of the possibilistic network and represents the information received by the system. Possibilistic inference propagates this information through the possibilistic networks. It determines the relevance of the root node containing the concept $C_{i}$ using the conditional possibility and necessity ( $\Pi(C_{i}|P)$ and $N(C_{i}|P)$ ).

The last task in the interest discovery process is the construction of the interest-graph from the relevant concepts. The construction is as follows: the nodes of the graph represent the returned concepts and the arcs represent the co-occurrence between them, so that two nodes are linked if they co-occur in users’ publications. The graph containing the relations between users and returned concepts is a bipartite graph G $=$ (U, C, E), where U is the set of users, C is the set of returned concepts and E contains the edges between nodes of U and nodes of C. Let M be the adjacency matrix of G, with one row per concept and one column per user. The matrix S $=\textit{MM}^{t}$ is then the adjacency matrix of the interest-graph, where each value represents the degree of co-occurrence between two concepts (the diagonal elements contain the occurrence frequency of each concept) [6]. The objective of this step is to find, among the returned concepts, those that are most representative of users’ interests. We assume that a concept is more likely to represent an ‘interest’ if it is repeated in several users’ publications and is important compared with the other returned concepts (i.e. it is coherent with a large number of concepts and can be representative of most users). We identify the important nodes of the interest-graph using closeness centrality [12], computed here in its harmonic form. A node is more central when its distances to all other nodes are smaller:

$\displaystyle H(n_{i})=\sum_{j\neq i}\frac{1}{d(n_{i},n_{j})}$ (3)

where $n_{j}$ and $n_{i}$ are nodes in the graph and $d(n_{j},n_{i})$ is the length of the shortest path between them.
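The construction of the interest-graph and the centrality of Eq. (3) can be sketched as follows. The sketch assumes that M has one row per concept and one column per user, that two concepts are linked whenever their co-occurrence is non-zero, and that unreachable nodes contribute nothing to the sum; the small matrix is hypothetical.

```python
from collections import deque


def cooccurrence(M):
    """S = M M^T for a concept-by-user 0/1 matrix given as a list of rows."""
    n = len(M)
    return [[sum(a * b for a, b in zip(M[i], M[j])) for j in range(n)]
            for i in range(n)]


def harmonic_centrality(S):
    """H(n_i) = sum over j != i of 1 / d(n_i, n_j), with d from breadth-first search."""
    n = len(S)
    scores = []
    for src in range(n):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if v != u and S[u][v] > 0 and v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        scores.append(sum(1.0 / d for node, d in dist.items() if node != src))
    return scores


# Hypothetical data: rows are the concepts Jazz, Blues, Tennis; columns are 3 users.
M = [
    [1, 1, 0],  # Jazz   appears for users 0 and 1
    [0, 1, 1],  # Blues  appears for users 1 and 2
    [0, 0, 1],  # Tennis appears for user 2
]

S = cooccurrence(M)
print(S)                       # [[2, 1, 0], [1, 2, 1], [0, 1, 1]]
print(harmonic_centrality(S))  # [1.5, 2.0, 1.5]
```

In this toy graph the concept Blues is the most central because it co-occurs with both other concepts, which matches the intuition that a representative interest is coherent with many returned concepts.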

4.2 Possibilistic network

The concept extraction process consists of selecting the relevant concepts for a given text from a controlled vocabulary. The terminological resource simplifies the selection process by grouping the concepts of a specific domain in a single source. However, other works such as [4, 5] have shown that irrelevant concepts are returned for the processed texts. Furthermore, concept extraction encounters other problems, mainly under-generation, related to exact search (relevant concepts are missed), and over-generation, related to approximate search. The possibilistic model takes these different points into account. Indeed, the relevance of a concept is determined by its necessity degree (certainty). More precisely, a concept is relevant for a user publication if it has the highest necessity degree. An algorithm for approximate inference in the possibilistic network allows us to compute the necessity degree. We control the relevance and the number of returned concepts using a threshold on the necessity value. For each node in the possibilistic network and each of its parents, we use four conditional possibility measures to cover the different possible cases and to take into account the imprecision of the heuristics used: $\Pi(X|Y)$ , $\Pi(\neg X|Y)$ , $\Pi(X|\neg Y)$ and $\Pi(\neg X|\neg Y)$ , where X is a node and Y is its parent.

The concept extraction algorithm aims to determine the similarity between concepts and text after matching the two objects (concept, publication). The users’ publications of interest to us contain text, annotations (when a user annotates another user) and tags (Fig. 2). Each piece of information given in a user publication can be an important indicator of the user’s interests. The user gives an opinion or explains a topic of interest with a short text in the publication, and the words of this text must be linked to that subject. Each user can also mention other users using an annotation (a term preceded by @ that contains a user identifier). These annotations usually contain not only the usernames of friends but also the username of a public personality of interest, a book title, a song title, an organization name, etc. In the database collected from Twitter, 33% of users’ annotations are linked to concepts that exist in the DBpedia ontology. Annotations in publications therefore provide interesting information on user interests. The third form of information in user publications is the tag, a term composed of one or more keywords and preceded by $\#$ . We check the existence of each tag in the WordNet thesaurus in order to use only correct keywords.

The three classes of words considered in a user publication are the following: ( $L_{1}$ ) word-annotation, ( $L_{2}$ ) word-tag and ( $L_{3}$ ) word-text. The first class characterizes the most important words in the publication and the last characterizes the least important words. Each word $W_{i}$ has an importance value $\partial_{i}$ that depends on the class to which it belongs:

$\displaystyle\partial_{i}=\begin{cases}1&\text{if }W_{i}\in L_{1}\\ \frac{1}{2}&\text{if }W_{i}\in L_{2}\\ \frac{1}{3}&\text{if }W_{i}\in L_{3}\end{cases}$ (4)

After this classification, each publication P is represented as a set of q pairs:

$\displaystyle P=\left\{(W_{1},\partial_{1}),(W_{2},\partial_{2}),\dots,(W_{q},\partial_{q})\right\}$ (5)
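The representation of Eqs. (4) and (5) can be sketched in Python as below. The sketch is a simplification under stated assumptions: tokens are split on whitespace, surrounding punctuation is stripped, multi-keyword tags are not split, and the WordNet check on tags is omitted. The example tweet is hypothetical.

```python
def weight_publication(text):
    """Return the publication as a list of (word, importance) pairs.

    Importance follows Eq. (4): 1 for annotations (@), 1/2 for tags (#),
    1/3 for ordinary text words.
    """
    pairs = []
    for token in text.split():
        word = token.strip(".,!?:;\"'()")
        if not word:
            continue
        if word.startswith("@"):
            weight = 1.0        # class L1: word-annotation
        elif word.startswith("#"):
            weight = 1.0 / 2    # class L2: word-tag
        else:
            weight = 1.0 / 3    # class L3: word-text
        core = word.lstrip("@#").lower()
        if core:
            pairs.append((core, weight))
    return pairs


# Hypothetical tweet.
tweet = "Listening to @MilesDavis tonight #jazz"
for word, weight in weight_publication(tweet):
    print(word, round(weight, 3))
```

The resulting pairs are the input of the matching step: a word carries the same weight wherever it is later compared with the terms of a concept.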

Our objective is then to discover the latent interests in users’ publications. Each user $u_{i}$ has a set of $j$ publications $u_{i}=\left\{P_{i}^{1},P_{i}^{2},\dots,P_{i}^{j}\right\}$ . We extract all the words that exist in the entry terms of the semantic resource to build a possibilistic layer around the concepts and terms. This layer makes it possible to match publications with concepts and to compute the relevance of each selected concept. Each publication $P$ can select a large number of concepts. We therefore sort the returned concepts by their relevance degree, determined by possibilistic inference.

Figure 2.

User publication from twitter social network.

4.2.1 Architecture

The first layer of the possibilistic network is the concept layer (Fig. 3). It includes the concepts selected by the publication P (those that share at least one word with P). The second layer is the term layer (Section 3.2); these two layers represent the DBpedia ontology. The third layer is the possibilistic word layer built around the semantic resource.

Each root node is a binary random variable that can take a value in $\text{Dom}(C_{j})=\left\{c_{j},\neg c_{j}\right\}$ :

$C_{j}=c_{j}$ means that the concept $C_{j}$ is relevant to the publication,

$C_{j}=\neg c_{j}$ means that the concept $C_{j}$ is irrelevant to the publication.

Each term node $T_{i}$ is likewise a binary variable that takes a value in $\text{Dom}(T_{i})=\left\{t_{i},\neg t_{i}\right\}$ describing the relevance of the term $T_{i}$ to its parent. The variables $W_{k}$ are modelled in the same way. For consistency within the possibilistic network, the user publication is also a binary variable that takes a value in $\text{Dom}(P)=\left\{p,\neg p\right\}$ . The instantiation of the node P propagates through the word layer and the term layer towards the root nodes.

4.2.2 Possibility distribution

The nodes that represent the concepts in the possibilistic network have no conditional possibility distribution, since they are root nodes. The only available information is the number of times each concept appears in the users’ publications. This frequency allows us to define a possibility distribution over the selected concepts as follows:

$\displaystyle\Pi(C_{i})=\begin{cases}\frac{f(c_{i})}{\max_{j}f(c_{j})}&\text{if }C_{i}=c_{i}\\ \frac{1+\max_{j}f(c_{j})-f(c_{i})}{\max_{j}f(c_{j})}&\text{if }C_{i}=\neg c_{i}\end{cases}$ (6)

Where $f$ ( $c_{i}$ ) is the frequency of the concept $c_{i}$ in the user publications.
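A minimal Python sketch of Eq. (6) is given below. The concept names and frequencies are hypothetical, and each selected concept is assumed to appear at least once.

```python
def concept_prior(freq):
    """Map each concept to (Pi(c_i), Pi(not c_i)) following Eq. (6)."""
    f_max = max(freq.values())
    return {
        concept: (f / f_max, (1 + f_max - f) / f_max)
        for concept, f in freq.items()
    }


# Hypothetical frequencies of three selected concepts in a user's publications.
freq = {"Jazz": 4, "Blues": 2, "Tennis": 1}
for concept, (pi_c, pi_not_c) in concept_prior(freq).items():
    print(concept, pi_c, pi_not_c)
```

The most frequent concept receives possibility 1 of being relevant, while a concept that appears only once receives possibility 1 of being irrelevant.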

Table 1

Terms possibility distribution

$\Pi(T_{i}|C_{j})$	$c_{j}$	$\neg c_{j}$
$t_{i}$	$\alpha_{ji}$	$1-\beta_{ji}$
$\neg t_{i}$	1	1

Figure 3.

Possibilistic model architecture.

The possibility distribution over the term nodes is determined from their syntactic features, such as the size of the terms (Table 1). For example, given a concept that contains a set of terms, the term with the fewest words carries the most specific information and is the most significant and the most representative of its parent node.

$\displaystyle\Pi(t_{i}|c_{j})=\frac{1+\left|T_{\max}\right|-\left|T_{i}\right|}{\left|T_{\max}\right|}=\alpha_{ji}$ (7)

where $\left|X\right|$ is the cardinality of $X$ and $T_{\max}$ is the term with the maximum size among the terms of the concept $C_{j}$ .

The measure $\Pi(t_{i}|c_{j})$ is the possibility that expresses the significance of the term $t_{i}$ for the concept $c_{j}$ . The term $t_{i}$ is certainly significant for the concept $c_{j}$ if the value of the necessity $N(t_{i}|c_{j})$ is higher than a given threshold:

$\displaystyle N(t_{i}|c_{j})=1-\Pi(t_{i}|\neg c_{j})=\frac{\left|\left\{t_{i}\cap c_{j}\right\}\right|}{\left|T_{\max}\right|}=\beta_{ji}$ (8)

where $\left\{t_{i}\cap c_{j}\right\}$ is the set of words shared between $t_{i}$ and $c_{j}$ . In other words, as the size of $\left\{t_{i}\cap c_{j}\right\}$ increases, the term $t_{i}$ becomes more representative of the concept $c_{j}$ (Table 1).
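The two term measures of Eqs. (7) and (8) can be sketched in Python as follows. The sketch assumes that the size of a term is its number of words and that the words of a concept are the words of its label; both are our reading of the definitions, and the concept and terms are hypothetical.

```python
def term_measures(concept_label, terms):
    """Return {term: (alpha, beta)} for the terms of one concept.

    alpha = (1 + |T_max| - |T_i|) / |T_max|    (possibility, Eq. (7))
    beta  = |words shared with the concept| / |T_max|   (necessity, Eq. (8))
    """
    concept_words = set(concept_label.lower().split())
    tokens = {term: term.lower().split() for term in terms}
    t_max = max(len(words) for words in tokens.values())
    measures = {}
    for term, words in tokens.items():
        alpha = (1 + t_max - len(words)) / t_max
        beta = len(concept_words & set(words)) / t_max
        measures[term] = (alpha, beta)
    return measures


# Hypothetical concept with three terms.
measures = term_measures("Jazz music", ["Jazz", "Jazz music genres", "American music"])
for term, (alpha, beta) in measures.items():
    print(term, round(alpha, 3), round(beta, 3))
```

The shortest term obtains the highest possibility, whereas the necessity rewards the term that shares the most words with the concept, so the two measures rank the terms differently.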

For the variables $(W_{k})_{1\leqslant k\leqslant c}$ , the value of $\Pi(w_{k}|t_{i})$ is determined from the size of term $t_{i}$ (Table 2), the weight class of the word $w_{k}$ (Section 4.2) and its importance in the document corpus (descriptions in the DBpedia ontology).

$\displaystyle\Pi(w_{k}|t_{i})=\frac{1}{\left|t_{i}\right|}\cdot\partial_{k}=\delta_{ki}$ (9)

$\displaystyle N(w_{k}\rightarrow t_{i})=\frac{\textit{idf}_{k}}{\log(D)}\cdot\partial_{k}=\Phi_{ki}$ (10)

where $\textit{idf}_{k}$ (inverse document frequency) represents the importance of the word $w_{k}$ in the document corpus and D is the number of documents (Table 2).
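Eqs. (9) and (10) can be illustrated with the short Python sketch below. The exact form of the inverse document frequency is not given in the text, so the sketch assumes the common definition log(D / df), where df is the number of documents containing the word; the numbers are hypothetical.

```python
import math


def word_measures(term, word_weight, doc_freq, n_docs):
    """Return (delta, phi) for one word of one term.

    delta = (1 / |t_i|) * weight               (possibility, Eq. (9))
    phi   = (idf / log(D)) * weight            (necessity, Eq. (10))
    with idf assumed to be log(D / doc_freq).
    """
    size = len(term.split())
    delta = word_weight / size
    idf = math.log(n_docs / doc_freq)
    phi = idf / math.log(n_docs) * word_weight
    return delta, phi


# Hypothetical case: the tag word "jazz" (weight 1/2) in a three-word term,
# found in 10 of the 1000 abstracts of the corpus.
delta, phi = word_measures("Jazz music genres", 0.5, doc_freq=10, n_docs=1000)
print(round(delta, 4), round(phi, 4))
```

With this definition the necessity stays between 0 and the class weight of the word: it is largest for words that are rare in the corpus and vanishes for words that occur in every document.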

Table 2

Words possibility distribution

$\Pi(W_{k}|T_{i})$	$t_{i}$	$\neg t_{i}$
$w_{k}$	$\delta_{ki}$	$1-\Phi_{ki}$
$\neg w_{k}$	1	1

4.2.3 Concept relevance

Algorithm 1
Possibilistic inference

Input: PSR: possibilistic semantic resource; P: user publication
Output: S: set of relevant concepts ranked by relevance

begin
  $S\leftarrow\{C_{i}:C_{i}\in\mathrm{PSR}\ \text{and}\ \exists w_{k}\in C_{i},\ w_{k}\in P\}$
  for each concept $C_{i}\in S$ do
    $W\leftarrow P\cap C_{i}$ // words shared between P and $C_{i}$
    $T\leftarrow\{T_{j}:T_{j}\in C_{i}\ \text{and}\ \exists w_{k}\in T_{j},\ w_{k}\in W\}$
    $\theta^{T}\leftarrow$ all possible configurations of T
    max $\leftarrow 0$
    for each configuration $\theta^{t}\in\theta^{T}$ do
      $\Pi(\neg c_{i}|P)\leftarrow 1$ // restart the product for this configuration
      for each term $T_{k}\in\theta^{t}$ do
        for each word $w_{j}\in W$ do
          $\Pi(\neg c_{i}|P)\leftarrow\Pi(\neg c_{i}|P)\cdot\Pi(w_{j}|T_{k})$
        end for
        $\Pi(\neg c_{i}|P)\leftarrow\Pi(\neg c_{i}|P)\cdot\Pi(T_{k}|\neg c_{i})$
      end for
      if $\Pi(\neg c_{i}|P)>$ max then
        max $\leftarrow\Pi(\neg c_{i}|P)$
      end if
    end for
    $\Pi(\neg c_{i}|P)\leftarrow$ max
    for each word $w_{j}\in W$ do
      $\Pi(\neg c_{i}|P)\leftarrow\Pi(\neg c_{i}|P)\cdot\Pi(w_{j}|\neg c_{i})$
    end for
    $\Pi(\neg c_{i}|P)\leftarrow\Pi(\neg c_{i}|P)\cdot\lambda$ // $\lambda=1-\lambda_{1}$, Eq. (12)
    $N(c_{i}|P)\leftarrow 1-\Pi(\neg c_{i}|P)$
  end for
  sort S by decreasing degree of Necessity
  return the first k elements of S // k is determined from the experimental study
end

The evaluation process computes the relevance of each candidate concept from two measures: Possibility and Necessity. We use the inference Eq. (11) proposed in [33], which takes into account the words that are not shared between the two objects (publication, concept). Algorithm 1 (“Possibilistic inference”) describes the procedure for selecting the concepts relevant to a given publication.

$\displaystyle\Pi(c_{i}|P)=\max_{\theta^{t}\in\theta^{T}}\left(\Pi(P|\theta^{t})\,\Pi(\theta^{t}|c_{i})\,\Pi(c_{i})\right)\lambda$ (11)

where $\theta^{T}$ is the set of possible configurations of the term nodes, $\theta^{t}$ is one possible instantiation, and the variable $\lambda$ is given by:

$\displaystyle\lambda=\begin{cases}\lambda_{1} & \text{for } \Pi(c_{i}|P),\\ 1-\lambda_{1} & \text{for } \Pi(\neg c_{i}|P)\end{cases}$ (12)

$\lambda_{1}=\left(\frac{\left|P\cap C_{j}\right|}{\left|P\right|}\cdot\frac{\left|P\cap C_{j}\right|}{\left|C_{j}\right|}\right)^{2}$
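
As a minimal sketch of the weight $\lambda$ in Eq. (12), the Python fragment below computes $\lambda_{1}$ from the word overlap between a publication and a concept. It assumes that $|P|$ and $|C_{j}|$ count distinct words, which the text does not state explicitly; the function names `lambda_1` and `lambda_weight` are illustrative and do not come from the original work.

```python
def lambda_1(publication_words, concept_words):
    """Overlap weight: (|P n C| / |P| * |P n C| / |C|) ** 2.

    Assumption: P and C are treated as sets of distinct words.
    """
    p, c = set(publication_words), set(concept_words)
    if not p or not c:
        return 0.0  # no overlap can be measured on an empty set
    shared = len(p & c)
    return ((shared / len(p)) * (shared / len(c))) ** 2


def lambda_weight(publication_words, concept_words, negated=False):
    """Eq. (12): lambda_1 for Pi(c|P), 1 - lambda_1 for Pi(not c|P)."""
    l1 = lambda_1(publication_words, concept_words)
    return 1.0 - l1 if negated else l1


# Illustrative data only: two of the four publication words match
# two of the three concept words.
publication = ["jazz", "concert", "tonight", "paris"]
concept = ["jazz", "concert", "music"]
weight = lambda_weight(publication, concept)
```

The squared product penalizes concepts that share only a small fraction of their words with the publication, and publications that are only marginally covered by the concept.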

The relevance of each concept is determined using the Necessity degree given by the following equation:

$\displaystyle N(c_{i}|P)=1-\Pi(\neg c_{i}|P)$ (13)
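
The inference of Algorithm 1 and Eqs. (11) to (13) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: each term node is assumed to be binary (instantiated positively or negatively), the three callables `pi_word_term`, `pi_term_not_c` and `pi_word_not_c` are hypothetical stand-ins for the possibility distributions, and the configurations are enumerated exhaustively for clarity, which is only practical for a small number of terms.

```python
from itertools import product


def necessity(words, terms, pi_word_term, pi_term_not_c, pi_word_not_c, lam_neg):
    """Return N(c|P) = 1 - Pi(not c|P) for one candidate concept.

    words          -- words shared by the publication and the concept (W)
    terms          -- terms of the concept that contain a shared word (T)
    pi_word_term   -- hypothetical callable (word, term, state) -> Pi(w|T=state)
    pi_term_not_c  -- hypothetical callable (term, state) -> Pi(T=state|not c)
    pi_word_not_c  -- hypothetical callable (word) -> Pi(w|not c)
    lam_neg        -- the weight 1 - lambda_1 of Eq. (12)
    """
    best = 0.0
    # Assumption: a configuration assigns True/False to every term node.
    for states in product((True, False), repeat=len(terms)):
        value = 1.0  # reset for each configuration
        for term, state in zip(terms, states):
            for word in words:
                value *= pi_word_term(word, term, state)
            value *= pi_term_not_c(term, state)
        best = max(best, value)  # max operator over configurations
    for word in words:
        best *= pi_word_not_c(word)
    pi_not_c = best * lam_neg
    return 1.0 - pi_not_c  # Eq. (13)


def rank_concepts(scores, k):
    """Sort concepts by decreasing necessity and keep the first k."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [concept for concept, _ in ranked[:k]]
```

The product operator combines the degrees inside one configuration and the maximum selects the most plausible configuration, which mirrors the product-based network used in the approach.
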
5. Evaluation

5.1 Data collection

The approach is evaluated on two data sets collected from two different social networks. The first comes from LiveJournal, which allows user profiles (interests, friends, publications, etc.) to be retrieved. The second comes from the Twitter microblogging site.

LiveJournal users can create social relationships and share messages with friends. Each user specifies his/her interests using concepts predefined by the site or freely chosen text. LiveJournal gives crawlers access to most user profiles, including the publications, social relationships and interests of each user. We collected the publications and interests of 150 users, keeping short texts (between 20 and 65 words per publication) and an average of 80 publications per user (Table 3). The evaluation on LiveJournal profiles lets us assess the reliability and precision of our approach and compare it with an existing method. The discovered interests are evaluated against the collected user interests: an interest discovered by our approach is relevant for a user if it exists in his/her profile.

The second data set is collected from the Twitter microblogging site. We developed a crawler to collect user profiles from Twitter. First, we search for user profiles using keywords such as music, jazz, sport and tennis. Next, we collect the publications found in each user profile, in chronological order, until we reach an average of 95 publications per user. We keep English publications only, since we use the English version of DBpedia. The publications contain text, annotations, tags and links to external resources; external links are not considered in this data set. Using this method, we saved more than 14250 publications from 150 profiles (Table 4). The second data set does not include a list of interests against which our approach can be evaluated. We therefore use a simple tweet recommendation system in which each user is modelled by his/her interests and each tweet is modelled by the concepts extracted from it (at least 6 concepts). The tweets used in the recommendation system are collected randomly by our crawler from profiles other than those of the original users. Users and tweets are represented using the vector model: each vector contains the concepts (or interests, for users), and the weight of each concept is the certainty value (Necessity) given by the possibilistic concept extraction approach. The cosine similarity determines the relevance of each candidate tweet for a user. This recommendation system allows us to determine the precision and recall of our approach: the approach is effective if the recommendation process returns the relevant tweets.
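
The recommendation step described above can be sketched as follows. This is a minimal illustration that assumes users and tweets are stored as sparse dictionaries mapping a concept name to its Necessity weight; the names `cosine` and `recommend`, and the sample data, are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse concept vectors (dicts)."""
    dot = sum(weight * v[concept] for concept, weight in u.items() if concept in v)
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_u * norm_v)


def recommend(user, tweets, k):
    """Return the identifiers of the k tweets most similar to the user."""
    ranked = sorted(tweets, key=lambda tid: cosine(user, tweets[tid]), reverse=True)
    return ranked[:k]


# Illustrative data only: weights play the role of Necessity degrees.
user = {"Jazz": 0.9, "Tennis": 0.6}
tweets = {
    "t1": {"Jazz": 0.8, "Concert": 0.5},
    "t2": {"Football": 0.7},
}
top = recommend(user, tweets, 1)
```

Because the weights are Necessity degrees, a concept that is extracted with low certainty contributes little to the similarity, even when it appears in both vectors.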

Table 3
LiveJournal data set

Number of publications	12000
Average number of publications/user	80
Average publication length	25
Number of users	150
Number of interests	3900
Average number of interests/user	26

Table 4

Twitter data set

Number of publications	14250
Average number of publications/user	95
Average publication length	8
Number of users	150
Publications for recommendation system	3000

Overall, the two data sets used in the evaluation have different characteristics. On the one hand, publications on Twitter are short but rich in tags and annotations. On the other hand, LiveJournal users rarely use tags and annotations, but their publications are longer than Twitter publications.

5.2 Evaluation metric

We use three well-known metrics to evaluate our approach to interest discovery from uncertain information: Precision, Recall and F-measure. The precisions P@5, P@10, P@20 and P@30 are the mean precision values over the top 5, 10, 20 and 30 returned elements (interests or tweets), respectively. Recall is the ratio between the number of relevant elements found and the number of relevant elements that exist in the database. The F-measure is a synthetic indicator that combines precision and recall.

$\displaystyle\textit{P@X}=\frac{\sum_{i=1}^{n}P_{i}@X}{n},\quad P_{i}@X=\frac{\sum_{j=1}^{X}\text{relevance}(E_{j}^{i})}{X}$ (14)

$\displaystyle R=\frac{\sum_{i=1}^{n}R_{i}}{n},\quad R_{i}=\frac{\text{number of returned relevant elements}}{\text{number of relevant elements}}$ (15)

$\displaystyle F=\frac{2*(P*R)}{(P+R)},P=\frac{\sum P@X}{4}$ (16)

An element can be a tweet from the Twitter publications or an interest from the LiveJournal profiles. For a given user u, a returned interest $i_{j}$ is relevant (relevance( $i_{j}$ ) $=$ 1) if it exists in the user’s list of interests, and irrelevant (relevance( $i_{j}$ ) $=$ 0) otherwise.
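
Eqs. (14) to (16) can be sketched for a single user as follows; averaging over users is then a plain mean. The function names are illustrative, and the ranked list and relevant set are assumed to hold comparable identifiers (interest names or tweet identifiers).

```python
def precision_at(ranked, relevant, x):
    """P_i@X of Eq. (14): relevant elements among the top x, divided by x."""
    return sum(1 for element in ranked[:x] if element in relevant) / x


def recall(returned, relevant):
    """R_i of Eq. (15): returned relevant elements over all relevant ones."""
    if not relevant:
        return 0.0
    return len(set(returned) & set(relevant)) / len(relevant)


def f_measure(precisions, r):
    """Eq. (16): P is the mean of the P@X values (four cut-offs in the paper)."""
    p = sum(precisions) / len(precisions)
    if p + r == 0.0:
        return 0.0
    return 2.0 * p * r / (p + r)


# LiveJournal values reported for the Necessity measure in Table 5.
f_livejournal = f_measure([0.7025, 0.6349, 0.5974, 0.5142], 0.6325)
# f_livejournal is approximately 0.6222, the F-measure reported in Table 5.
```

Applying Eq. (16) to the precision and recall values of Table 5 reproduces the reported LiveJournal F-measure, which confirms that P in Eq. (16) is the mean of the four P@X values.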

Figure 4. Necessity and centrality: thresholds analysis.

Figure 5. Discovered Interests and Concepts per user.

5.3 Results and discussion

5.3.1 Interest discovery results

The interest discovery process contains two main steps that filter the returned concepts. The first step ranks the concepts by their Necessity degree. In the second step, we select the top-ranked concepts to build the interest-graph and identify the most common nodes. For these two steps, we chose the two thresholds that maximize the precision (Fig. 4): 0.47 for the concept relevance and 1.483 (centrality/1000) for the interest centrality. The number of concepts retained varies with the necessity and centrality values. Figure 5 shows the distribution of returned concepts and interests per user. The number of concepts returned by the first filtering step is rather high (between 120 and 300 concepts). Increasing the necessity threshold considerably reduces the number of concepts, but it does not take the centrality of these concepts into account. For this reason, we add the second filtering step, which uses the interest-graph to select the most popular and representative interests (between 13 and 33 interests per user).
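
The two filtering steps can be sketched as follows. This is a rough illustration under several assumptions: the interest-graph is an undirected, unweighted co-occurrence graph given as a symmetric adjacency dictionary, closeness is normalized as (reachable nodes) / (sum of shortest-path distances), and the centrality threshold used in the example is arbitrary because the scale of the threshold reported above (centrality/1000) is not reproduced here. All names are hypothetical.

```python
from collections import deque


def closeness(graph, node):
    """Closeness centrality of one node, computed by breadth-first search."""
    distance = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for neighbour in graph.get(current, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[current] + 1
                queue.append(neighbour)
    total = sum(distance.values())
    # Assumption: normalize by the number of reachable nodes.
    return (len(distance) - 1) / total if total else 0.0


def select_interests(necessity, graph, necessity_threshold, centrality_threshold):
    """Step 1: keep concepts by Necessity. Step 2: keep central ones."""
    candidates = {c for c, n in necessity.items() if n >= necessity_threshold}
    # Restrict the co-occurrence graph to the retained concepts.
    subgraph = {
        c: [n for n in graph.get(c, ()) if n in candidates] for c in candidates
    }
    return sorted(
        c for c in candidates if closeness(subgraph, c) >= centrality_threshold
    )


# Illustrative data only.
necessity = {"Jazz": 0.8, "Music": 0.7, "Blues": 0.6, "Tax": 0.2}
graph = {
    "Jazz": ["Music", "Tax"],
    "Music": ["Jazz", "Blues"],
    "Blues": ["Music"],
    "Tax": ["Jazz"],
}
interests = select_interests(necessity, graph, 0.47, 0.9)
```

In this toy graph the first step removes the low-necessity concept, and the second step keeps only the concept that is closest to all the others.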

Figure 6. Necessity distribution for Dist-1 and Dist-2.

5.3.2 Possibility theory evaluation

In the possibilistic model, the possibility distributions are computed from statistics, such as word frequency and word importance in the document corpus, and from morphological characteristics of the terms in the semantic resource (such as term size). We evaluate the usefulness of these possibility distributions in our interest discovery model by comparing the results obtained with two different distributions. The first (Dist-1) is the possibility distribution of the terms (Table 1) used in our approach. The second (Dist-2) ignores the features of the semantic resource terms ( $\Pi(T_{i}|C_{j})=$ 1). Table 6 shows a large difference between the results returned by the two distributions on both data sets; e.g. on the LiveJournal data set, P@5 rises from 0.2615 with Dist-2 to 0.7025 with Dist-1. Figure 6 shows the necessity distribution of the first 1000 returned concepts for the two possibility distributions. With Dist-1, the necessity degree ranges from 0.97 down to 0.64; with Dist-2, it ranges from 0.67 down to 0.48. This illustrates the importance of Dist-1 for distinguishing between relevant and irrelevant concepts.

Possibility theory, as used for concept extraction in the interest discovery approach, provides two evaluation measures: Possibility and Necessity. The first measure indicates the extent to which it is possible that a candidate concept is relevant, whereas the second measure expresses the certainty of that relevance. We compare the results returned by the two measures on the two data sets. Table 5 shows the advantage of the Necessity measure (certainty) over the Possibility measure: the F-measure rises from 0.3257 to 0.6222 on LiveJournal and from 0.2758 to 0.5256 on Twitter. This result supports our choice of possibilistic logic for modelling the concept extraction problem. In the rest of the assessment, Necessity is used as the primary measure for evaluating the relevance of concepts.

Table 5
Comparison between possibility and necessity

Measures	Possibility		Necessity
	LiveJournal	Twitter	LiveJournal	Twitter
P@5	0.3821	0.3451	0.7025	0.6231
P@10	0.3314	0.3034	0.6349	0.5043
P@20	0.2485	0.2646	0.5974	0.4224
P@30	0.1294	0.0994	0.5142	0.3175
R	0.4041	0.3031	0.6325	0.6015
F-measure	0.3257	0.2758	0.6222	0.5256

Table 6

Comparison between two possibility distributions

Measures	Dist-2		Dist-1
	LiveJournal	Twitter	LiveJournal	Twitter
P@5	0.2615	0.2142	0.7025	0.6231
P@10	0.1638	0.1432	0.6349	0.5043
P@20	0.1146	0.1044	0.5974	0.4224
P@30	0.0968	0.0930	0.5142	0.3175
R	0.3221	0.3105	0.6325	0.6015
F-measure	0.2130	0.1917	0.6222	0.5256

Table 7

Comparison with HIG approach

Measures			IDF-tweet
	HIG		Possibility		Necessity
	LiveJournal	Twitter	LiveJournal	Twitter	LiveJournal	Twitter
P@5	0.6632	0.5235	0.3821	0.3451	0.7025	0.6231
P@10	0.5576	0.4468	0.3314	0.3034	0.6349	0.5043
P@20	0.4872	0.3799	0.2485	0.2646	0.5974	0.4224
P@30	0.4107	0.2987	0.1294	0.0994	0.5142	0.3175
R	0.5635	0.5268	0.4041	0.3031	0.6325	0.6015
F-measure	0.5460	0.4625	0.3257	0.2758	0.6222	0.5256

5.3.3 Comparison with other approaches

We compare our approach with the HIG (Hierarchical Interest Graph) approach [32], proposed by P. Kapanipathi et al. for user interest discovery. It uses an entity recognition tool, e.g. DBpedia Spotlight, to identify the primitive interests, together with the Wikipedia hierarchy. The Spotlight method [21] is mainly based on a vector model proposed for the information retrieval problem, where it is used to determine the similarity between documents and a query [9]. It relies essentially on the DBpedia semantic resource and has been used by many earlier approaches to extract concepts from web content. In the HIG approach, the concepts returned by Spotlight feed a spreading activation algorithm over the Wikipedia hierarchy. It begins by activating the primitive nodes in the hierarchy. In a second step, this activation propagates to similar nodes in order to infer hierarchical interests for each user. In our approach, we use the possibilistic model recently proposed for the information retrieval problem. It provides several measures that take into account the various factors that may affect the entity identification process. The interest-graph is then constructed to select the most important interests. The comparison between the HIG approach and IDF-Tweet (Table 7) shows that our approach, with the Necessity measure, achieves better precision and recall than HIG with Spotlight.

The difference between the F-measure values (Table 7: 0.6222 against 0.5460 on LiveJournal and 0.5256 against 0.4625 on Twitter) summarizes the advantage of the Necessity measure over the Spotlight method.

6. Conclusion

This paper deals with a recent problem: users’ interest discovery from uncertain information. Interest discovery in social networks is an essential task for several applications, such as information retrieval, social recommendation and community finding. Incompleteness and imprecision are the main characteristics of user publications on social media. To address this problem, we proposed an approach based on possibility theory and semantic resources. We use structural indexes in the graph to select the important and common interests. Possibilistic logic, as an extension of fuzzy logic, has recently been used in several problems where the information to be treated is unreliable. It allows modelling a knowledge base and inferring new assertions as new information arrives. We use product-based possibilistic networks to model the topic identification task from user publications. Approximate inference with the maximum and product operators, which guarantees a polynomial response time, is used to determine the relevance of the concepts. In the last step, we construct an interest-graph based on the co-occurrence between concepts and identify the most important interests in the graph using closeness centrality. This allows user interest discovery to be performed adequately.

An extensive experimental study shows the role of the different parameters of our method, in particular the possibility distributions and the meaning of the Possibility and Necessity measures in possibilistic logic. The possibility distribution that we used is based on statistics such as word frequency and term features. The possibilistic network thus allows us to take into account different characteristics and factors that can be helpful in discovering interests in users’ publications. We used two data sets for the comparison with the HIG approach, which uses the Spotlight tool, based on the vector model, and a spreading activation algorithm over the Wikipedia hierarchy. In our method, assigning possibility and necessity values to each word and term in the semantic resource quantifies the degree of certainty with which these two elements are representative. Finally, using the interest-graph to filter the returned concepts selects the popular and significant items for all users.

This possibilistic algorithm can be used in many applications and can help the scientific community interested in knowledge discovery. Moreover, online social networks are promising sources of information that can be used for discovering knowledge about users. Future work will concentrate on tweet stream analysis on microblogging social networks, starting with dynamic user interest discovery from publication streams. Indeed, millions of publications are shared in the virtual social space. Real-time analysis of these publications makes it possible to organize them and to discover changes in knowledge. In this case, we must be able to determine trends and changes in user behaviour over time. Identifying changes in user interests makes it possible to follow users’ opinions and preferences and to recognize changes in behaviour.

References

1. Zadeh L.A., Fuzzy sets as a basis for a theory of possibility, Fuzzy Sets and Systems 1(1) (1978), 3–28.

2. Benferhat et al., Possibilistic logic bases and possibilistic graphs, In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, 1999, pp. 57–64.

3. Dubois and Prade, Possibilistic logic: a retrospective and prospective view, Fuzzy Sets and Systems 144(1) (2004), 3–23.

4. Omri M.N., Pertinent knowledge extraction from a semantic network: application of fuzzy sets theory, International Journal on Artificial Intelligence Tools 13(3) (2004), 705–719.

5. Tuason et al., Biological nomenclatures: a source of lexical knowledge and ambiguity, In Pacific Symposium on Biocomputing 9 (2004), 238–249.

6. Mika, Ontologies are us: A unified model of social networks and semantics, In The Semantic Web-ISWC, 2005, pp. 522–536.

7. Carmagnola, Cena and Gena, User modeling in the social web, In Knowledge-Based Intelligent Information and Engineering Systems, 2007, pp. 745–752.

8. Michlmayr and Cayzer, Learning user profiles from tagging data and leveraging them for personal (ized) information access, 16th International World Wide Web Conference (WWW), 2007, pp. 1–7.

9. Van Canneyt, Schockaert and Dhoedt, Discovering and Characterizing Places of Interest Using Flickr and Twitter, International Journal on Semantic Web and Information Systems (IJSWIS) 9(3) (2013), 77–104.

10. Guo and Zhao Y.E., Tag-based social interest discovery, In Proceedings of the 17th International Conference on World Wide Web, 2008, pp. 675–684.

11. Szomszor et al., Semantic modelling of user interests based on cross-folksonomy analysis, The Semantic Web-ISWC, 2008, pp. 632–648.

12. Okamoto, Chen and Li X.-Y., Ranking of closeness centrality for large-scale social networks, Lecture Notes in Computer Science 5059, 2008, pp. 186–195.

13. Zheleva and Getoor, To join or not to join: the illusion of privacy in social networks with mixed public and private user profiles, In Proceedings of the 18th International Conference on World Wide Web, 2009, pp. 531–540.

14. Mislove et al., You are who you know: inferring user profiles in online social networks, In Proceedings of the Third ACM International Conference on Web Search and Data Mining, 2010, pp. 251–260.

15. Wen and Lin C.Y., On the quality of inferring interests from social neighbors, In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2010, pp. 373–382.

16. Bhattacharya et al., Inferring user interests in the twitter social network, In Proceedings of the 8th ACM Conference on Recommender Systems, 2014, pp. 357–360.

17. Gimpel et al., Part-of-speech tagging for twitter: Annotation, features, and experiments, In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies 2 (2011), 42–47.

18. Ritter, Clark and Etzioni, Named entity recognition in tweets: an experimental study, In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2011, pp. 1524–1534.

19. Choudhury and Breslin J.G., Extracting semantic entities and events from sports tweets, In Proceedings of the 1st Workshop on Making Sense of Microposts, 2011.

20. Abel et al., Analyzing user modeling on twitter for personalized news recommendations, In User Modeling, Adaption and Personalization, 2011, pp. 1–12.

21. Mendes P.N. et al., DBpedia spotlight: shedding light on the web of documents, In Proceedings of the 7th International Conference on Semantic Systems, 2011, pp. 1–8.

22. Saif and Alani, Semantic sentiment analysis of twitter, In The Semantic Web – ISWC 2012, Springer Berlin Heidelberg 2012, pp. 508–524.

23. Java et al., Using a natural language understanding system to generate semantic web content, International Journal on Semantic Web and Information Systems (IJSWIS) 3(4) (2007), 50–74.

24. Habib M.B. and Keulen, Unsupervised improvement of named entity extraction in short informal context using disambiguation clue, 2012, pp. 1–10.

25. et al., Twiner: named entity recognition in targeted twitter stream, In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2012, pp. 721–730.

26. Murnane E.L., Haslhofer and Lagoze, RESLVE: leveraging user interest to improve entity disambiguation on short text, In Proceedings of the 22nd International Conference on World Wide Web Companion, 2013, pp. 1275–1284.

27. Goh K.Y., Heng C.S. and Lin, Social media brand community and consumer behavior: Quantifying the relative impact of user-and marketer-generated content, Information Systems Research 24(1) (2013), 88–107.

28. Chen and Zhang, Extracting Concepts’ Relations and Users’ Preferences for Personalizing Query Disambiguation, International Journal on Semantic Web and Information Systems (IJSWIS) 5(1) (2009), 65–79.

29. Omri M.N., Semantic scales and fuzzy processing for sensorial evaluation studies, In International Conference IPMU, (96) (1996), pp. 1–5.

30. Michelson and Macskassy S.A., Discovering users’ topics of interest on twitter: a first look, In Proceedings of the Fourth Workshop on Analytics for Noisy Unstructured Text Data, 2010, pp. 73–80.

31. Ortigosa, Carro R.M. and Quiroga J.I., Predicting user personality by mining social interactions in Facebook, Journal of Computer and System Sciences 80(1) (2014), 57–71.

32. Kapanipathi et al., User interests identification on twitter using a hierarchical knowledge base, In The Semantic Web: Trends and Challenges, 2014, pp. 99–113.

33. Sendi and Omri M.N., Biomedical Concepts Extraction based Information Retrieval Model: application on the MeSH, In Intelligent Systems Design and Applications (ISDA), 2015, pp. 40–45.

34. Likavec, Osborne and Cena, Property-based Semantic Similarity and Relatedness for Improving Recommendation Accuracy and Diversity, International Journal on Semantic Web and Information Systems (IJSWIS) 11(4) (2015), 1–40.

35. Hoang T.A., Modeling user interest and community interest in microbloggings: An integrated approach, In Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, Cham 2015, pp. 708–721.
