Abstract
Given the large volumes of information that are generated every day in the Web, Adaptive Information Filtering systems have the potential to become a very useful tool to handle such information overload. These systems allow users to focus on documents that actually meet their information needs, while the system discards the rest. Traditionally, these systems assume that terms of a document are not related to each other, and therefore their efficacy is limited. To overcome this limitation, we propose a method for extracting different relations between the terms of the documents that satisfy the information needs of the users, in order to update the system modeling of such needs, and thereby improve its discrimination capability. These relations are based on the co-occurrence of terms at different levels of granularity, such as document, sentences or noun phrases. The experiments conducted indicate the potential of our proposal, which is capable of improving system efficacy, from the beginning and in the long run.
Introduction
Every day a huge volume of information is generated on various topics. This phenomenon makes it possible to find a response on the Web to almost any information need. However, this volume of information is also a challenge for users, when it comes to discern between what is of interest and what is not [1, 11]. For this reason, users demand software solutions that turn easier to discard the information that is not relevant and to focus on what actually satisfies their interest.
Those systems that allow to discard irrelevant information from an information stream are known as Information Filtering Systems [14]. These systems are characterized by their focus on retrieving documents from a stream, that tentatively respond to the users information needs.
Information filtering systems have a major disadvantage, that is, once they are built, the information criteria extracted can not be modified. The problem with this disadvantage is that, as time goes by, the system becomes obsolete. To avoid this drawback, a new type of systems has emerged, that have the possibility to update the knowledge about the user’s interest. These systems are known as Adaptive Document Filtering. The key issue of these systems is that they depend on the users expressing their information needs. However: How good are we, humans, to express to the machine our interests? The truth is that this is a very complicated task, and we spend a lot of effort to convey to the system what we want to retrieve from the information stream.
The prevailing situation is that the user points out several documents to the system, that attempt to exemplify his information needs. The problem with this solution is that it assumes that the user has several documents that respond to his information needs, but this assumption is often false.
The use of a query is other solution commonly employed. This approach is suggested, mainly, when the user does not have previous information related to the topic, either because the topic itself is new, or because the user is trying to learn on the matter. The problem is that conveying complex ideas is hard by means of a few terms. For a system, creating good user profiles from just a few terms related to the user’s interest is a real challenge, especially when the information needs are very specific or when the boundary between relevant and non-relevant information is not clear.
Usually, in a document filtering system, terms are assumed to be independent of each other, i.e. the occurrence or use of a term does not depend on the use of other. This assumption would be particularly inadequate when we started with very little information to create the user profile, as is the case of a simple query.
In the literature, we find some proposals that do not assume independence between terms, such as Latent Semantic Analysis (LSA) [9], Latent Dirichlet Allocation (LDA) [6] and Word Embedding techniques as Word2Vec [18]. Nevertheless, in adaptive document filtering environments where new information is constantly becoming available and new terms can appear frequently, using those proposals is not adequate because adding new information to these models requires retraining, i.e. repeting an expensive process.
Other solutions involve the use of frequent term sets, like Pattern Taxonomy Mining (PTM) [16]. The use of frequent terms has the advantage that, in order to extract the existing relations between terms, does not require information out of the documents available for the construction of user profiles. Here, we call relations to the co-occurrences of terms in a document.
Existing methods for extracting relations between frequent terms do not take into account the context in which these relations are found. In order to overcome this limitation, in this paper we propose a method for considering the context during the extraction of relations among terms, as a way to refine them, and we apply it for enhancing the adaptive document filtering.
The remainder of this paper is organized as follows. In Section 2, we present the main concepts related to the Adaptive Filtering Task and discuss the state of the art on PTM extraction. Section 3 contains the main elements of our proposal. To evaluate the performance of our proposal, we conduct several experiments that are presented on Section 4. The empirical results and discussion are reported in Section 5, followed by conclusions in the last section.
Adaptive document filtering
Adaptive Document Filtering systems are information filtering systems capable of updating its knowledge related to the users’ needs by means of user feedback. This feedback provides the system with new samples, so that it can adjust the user profile.
In Fig. 1, we show a general scheme of an Adaptive Document Filtering system. This type of system can handle the needs of several users simultaneously. In the figure, a single user is shown for illustrative purpose. When a user has a new information need, he/she must describe his interest to the system. Before starting the document stream processing, the system has to create a new profile to begin paying attention to this request. As new documents arrive, the system analyzes each one of them and provides the user with some documents to decide whether its content is consistent with the information modeled in the profile. After reviewing the submitted documents, the user can explicitly send feedback to the system, identifying which of these retrieved documents really fit his information need and which ones do not. In a real application, the system can take advantage of certain user actions, such as deleting a document without reading it or saving it to disk, to exploit as implicit feedback.

General structure of an Adaptive Document Filtering system.
Some details are important to remark: The user only has access to documents selected (filtered) by the system. The system has to decide, on the fly, to separate the document for the user or not, without storing it for further analysis. This type of systems usually has to start operating with very limited initial information.
Filtering Track at the Text Retrieval Conference (TREC) is the best-known forum for evaluation of information filtering research [29]. In such track, adaptive filtering task, which was designed to model the filtering process from the moment of profile construction, is the most important task [37]. Many of the previous restrictions related to the task were originated on the Filtering Track.
On the way to find a solution to the adaptive filtering task, a variety of techniques and algorithms had been used on different research reports. The first approximations to Adaptive Document Filtering can be divided in two main groups: algorithms based on Information Retrieval and algorithms based on Text Categorization.
Among the algorithms based on Information Retrieval, probably the most employed method is the Rocchio algorithm, some examples can be found in the proposals [5, 33]. Other explored approaches are the Logistic Regression algorithm [32, 36], language models [27], and reinforcement methods [28].
The other group of algorithms employed text categorization methods. Here, we can find the use of Support Vector Machines [7, 27], Winnow classifier [31], and the k-NN family [4].
In more recent research, we can find the use of genetic algorithms [19], subspaces [23] and Active Learning and Dirichlet Process priors [12]. Other authors aboard the problem exploiting the use of metadata and facets, like in [34, 35].
Recent pattern-based models
Most existing text mining methods using term-based representation are subject to the problems of polysemy and synonymy. To avoid these problems, a line of research adopted the use of a pattern-based representation [21]. This has motivated that pattern-based models have gained interest among the Information Filtering research community.
In this line of research, we find works focused on discovering the best pattern for representing the users’ information needs as [2, 22].
One of the most recent and significant pattern-based models was presented by Zhong et al. in [38]. Pattern Taxonomy Model is one of the state-of-the-art pattern based models [10]. In this model, it is assumed that every document is divided into paragraphs, which are the units for discovering the sequential closed patterns, taking into account the relative support of the discovered patterns in the document set. The main problem with this proposal is that assumes that a document to be relevant for a user has to gently tackle the topic. But this does not always happen, sometimes there is only a brief mention of terms of interest for considering a given document as relevant.
Recently, Gao et al. in [10] proposed other information filtering model, named Maximum matched Pattern-based Topic Model (MPBTM) that, unlike other models, assumes that the user interest is formed by more than one topic. To achieve its results, they combine the extraction pattern-based models with LDA for document filtering. The idea of combining LDA and frequent pattern was also explored in EFITM algorithm (Enhanced Frequent Itemsets Based on Topic Modeling in Information Filtering) by Than and Sint [30]. Performance of these proposals is affected in environments where the available information for user profile construction is very reduced. Statistical information extracted with very few information is not reliable, and this situation limits the use of LDA in these environments. Another limitation of LDA is that, in order to add new information to the model, requires access to all training documents; and such training becomes expensive.
Term Relations extraction
Extracting useful information when we have few terms for building a user profile is a very difficult challenge. Nevertheless, initial profile construction is crucial, because if we build an erroneous initial profile, we can get a model that does not fit the user information needs at all.
While frequent sets capture certain co-occurrence relationships that occur among the terms found in documents, they by themselves cannot be enough to successfully specify the information needs of users. The problem with frequent sets is that, on one hand, if we consider frequent sets with several items we can obtain a very specific pattern, probably capable of reaching high precision but, depending on the topic, getting a poor recall. On the other hand, when sets with few items are considered, we probably obtain a high recall but with poor precision. Both situations are harmful, and it is necessary to reach a balance that allows selecting the most relevant documents for the user without being overwhelmed with a high number of false positives. This problem is particularly crucial when a new user profile is created.
The information needs can be quite different from one user to another. While some users can be interested in very general topics, such as education, football, or politics, other users can have very specific interest, such as traffic accidents involving urban buses and fatalities. In the first case, identifying relations of co-occurrences of frequent sets probably will be enough for obtaining acceptable results in the filtering process. The second interest is more specific and extracting straightforward co-occurrences of terms at the level of documents are not enough for expressing the user information needs. Here, we require to extract relations that involve more specific contexts. For instance, the relation of co-occurrence between urban and bus which has to be detected in the same sentence or even in the same noun phrase.
The problem is that a priori, a system does not know how specific or general is the information need provided by the user to decide which context should be considered for extracting co-occurrences. To ameliorate this situation, we do not constraint to handle terms that co-occurred in the same document. We extended this concept for extracting co-occurrence relations with different contexts, such as paragraphs or sentences.
Term relations extracted at document level are less specific, while reducing the context (for instance, at paragraph, sentence or phrase level) for extracting these relations, restrict them to a more specific meaning. This is the central idea that we explore hereon.
Returning to the previous example of urban bus accidents, we can represent, for instance, this information need as follows: varPhi (S, varPhi (NP, urban, bus), crash) varPhi (S, varPhi (NP, urban, bus), suffer, accident)
In this example, we are representing by varPhi (S, a, b) that a and b have to be related at (S)entence level. In a similar way, we can employ varPhi (D, a, b) and varPhi (NP, a, b) for representing relations at document (D) level and noun phrase (NP) level, respectively. Notice that we are considering relations of two or more elements, such as: single terms relations between other terms in a more specific context
We can interpret these relations, and use them to explain the information encoded in a profile in a simple way. For instance, the second example could be interpreted as: the user is interested in sentences, containing the terms suffer, accident, and a noun phrase that includes urban and bus. By convention, we are assuming that a term is always related (co-occurs) with itself, at any level.
To avoid the effect of building a very specific profile at the beginning of the filtering process, we extract these relations under the premise of obtaining a good level of recall without compromising the precision, over the available training set. In order to achieve this goal, the process of extracting relations explores first the more general context, and if with this context we can not differentiate the relevant documents from non-relevant, then we move to the next context level (more specific), and so on. We select the relationships with the least number of terms that distinguish between positive and negative documents. In the same way, we first consider the relations with the largest context, and then evaluate the more restrictive relations. In this way, the algorithm will extract the relations that best differentiate the users’ interest from the rest of the documents without abruptly affecting its recall from the beginning.
In the Fig. 3, we present an example showing how the set of term relations can evolve for adjusting to the user’s needs as new documents are received as feedback.

Example of Document Structure Representation.

Example of the evolution of the STR set according to the feedback received from the user.
Algorithm 1 describes the main steps of our proposal. The algorithm receives as input the available initial documents for the construction of the profile, the set T of terms that belong to at least one positive document and the relation types to be considered, sorted according to the generality of the context used (from the most general to the most specific).
Extraction of Term Relations.
In our algorithm, DF (R, D) denotes a function that receives a relation R and a set of documents D and returns the set of documents of D in which R appears, that is:
Q is a priority queue to maintain the candidate relations, considering as a priority the number of occurrences of the relation R in positive documents P, i.e. |DF (R, P) |.
NR stands for a function that, given a relation candidate, a type of relation Ω
i
, and a term t, returns the new relation to be explored. Let R denote a relation of the form R = Φ (Ω
R
, r1, r2, …, r
k
), we define two operators on such relation R:
The first operator σ is an expansion operator, and allows to add new items to the relation R. While the second operator ρ is the replacement operator that modifies the relation R by replacing its last item with a new one.
So Γ function is defined as follows:
When we find that a relation R adequately discriminates positive documents from negative, we do not need to explore another more specific type of relations with these terms.
The number of relations that can be extracted from a set of documents is too high, but many of them can be noisy or non representative. To reduce the number of relations considered, we eliminate all relations such that the set of positive documents in which a relationship appears is a subset of the occurrences of another relationship. This selection process is performed during the evaluation of function IsValid (Algorithm 2).
Is Valid function.
Note that as long as a relation increases its complexity, its generality decreases, as well as its probability of appearing in further documents. The relation complexity increases when more terms or specific contexts (other relations) are added.
In our proposal, we favor relationships as simple as possible, as long as they allow to differentiate between relevant and non-relevant documents. Selecting simple relationships allow us to adjust the user profile without compromising the recall during the filtering process. Conversely, if we opt for the most complex relationships, we can incur in an overfitting of the model, and consequently a quick degradation of the classifier efficacy.
For reducing noise, we keep only those relations that contain at least one term which better represents the user informations needs. Intuitively, we can select the most frequent terms as those that best describe a topic of interest for a particular user. However, this idea is valid when the profile documents are distributed homogeneously among the subtopics.
Representative_Topic_Terms function
Representative_Topic_Terms function
A topic usually consists of several subtopics, and these subtopics do not necessarily consist of the same number of documents. When we have some subtopics with many documents and others with few ones, and we extract frequent terms ignoring how the documents are distributed among the subtopics, we can extract some frequent terms to describe the topic that are frequent in some subtopics, but not the best for the global topic. When the subtopics that appear in the topic are imbalances, terms that frequently appear in the most represented subtopics are more likely to be selected as frequent, while those that appear in the under-represented subtopics, even when they are important for the general topic, their frequency will probably be lower than those in the larger subtopics. Also, we can omit some terms that appear in several of the documents but representing infrequent subtopics.
For example, let assume that we are interested in aircraft accidents and the regulations to prevent them, and our training documents consist of news related to a particular accident and two other subtopics considered in few documents related to aviation regulations and laws related to security at airports. If we select only frequent terms we will probably select some terms to describe the profile related to the accident, since this subtopic is the best represented, as the place of the accident, the name of the pilots and so on; also, we surely discard some relevant terms of the other two subtopics just because these subtopics are less represented.
To avoid the situation previously described, we first perform a clustering over the documents of interest and then apply the term extraction. A clustering algorithm structures the documents into groups so that documents in the same group are more similar to each other than to documents of the other groups. We assume here that each cluster is a subtopic of the general topic of interest, then this can be expressed as a set of subtopics S ={ S1, S2, ..., S c }.
When we have structured the documents of the topic in subtopics, we can then extract the terms of interest as the most frequent terms among the different subtopics. For that, we first select for each subtopic S
i
their frequent terms F
i
, as detailed in the following expression:
Then we select as representative topic terms RTT, those terms that are frequent in at least a ζ proportion of the subtopics.
The extraction of the representative topic terms is achieved in line 3 of the Algorithm 1 expressed as function 3. Notice that queue Q is first initialized with relations with the more general context considering only these frequent terms. This ensures that only relations that involve at least one representative topic term, are considered.
During the filtering process, we identify two main stages: user profile building and stream documents filtering. In our proposal, the user profile consists of the set of relations extracted as detailed in Algorithm 1. These extracted relations can be interpreted as rules of the form:
Thus, for each document d
i
in the document stream, d
i
is selected as relevant for the user if and only if:
For the sake of efficiency of the matching process between a relation Φ and a document D, we keep the hierarchy structure of the document. In Fig. 2, we present an example of a document representation considering three contexts or levels: document (D), sentence (S) and noun phrase (NP). In the figure, this hierarchy structure is represented by solid lines. We also maintain which terms belong to each context, and represented as dotted lines in Fig. 2). With this structure, the matching process consists of traversing the relation Φ and validating along the way if there exists a node in the hierarchy structure that contains all the terms of Φ.
During the actual filtering process, as documents are being analyzed, and some of them are presented to the user, we can receive from the user some feedback indicating a reliable classification for a given document. This feedback provides the system with a new training sample that can be employed for ensuring a better fitting to the real user information need.
We can obtain from the user two different type of feedback: positive or negative. When we receive positive feedback for a given document, we can interpret that this document, or at least some parts of it, contains information relevant to the user and, in the future, similar documents can be equally relevant. Otherwise, a negative feedback is an indicator that the system should avoid retrieving similar documents since these do not contain relevant information.
When feedback from the user is received, it is not necessary to start the term relations extraction process (Algorithm 1) from scratch. On the contrary, we can take advantage of the results of current relations, but taking into account if we are in presence of positive or a negative feedback. In positive feedback, we can keep current relations and we must only explore new relations that can appear. Also, we need to explore only those candidates that involve, at least, a term occurring in the document supplied as feedback. In case of a negative feedback, the process is slightly different. Firstly, we need to select the relations existing in the supplied document d. These relations are no longer sufficient to differentiate between relevant and non-relevant documents, and we have to place them in the queue Q, because they need to be more specific to differentiate again positive from negative documents. The other relations, that do not exist in d, are maintained in STR.
Experiments
For evaluating our proposal, we employed the Reuter text collection. A common measure for the task was used for comparing our proposal with other approaches.
Experimental dataset
One of the most popular datasets used nowadays is the Reuter Corpus Volume 1 (RCV1) [15], which includes more than 800 000 news articles collected from August 1996 until August 1997. In particular, we employed the same subset used in TREC11 Conference.
For that competition, 100 topics were created. This subset consists of 50 topics created by the NIST assessors, and the other 50 topics were created artificially from intersected topics of the RCV1 dataset. This collection is composed of news and press digest. Topics are quite different from each other in term of the number of relevant documents on the stream, as well as the distribution of these documents over time.
Another characteristic to emphasize about the collection is the fact that, on press digest documents, these are labeled as relevant for a particular topic if any of its short notes refers to the topic, but we do not know beforehand which one of them satisfies the user information need.
Evaluation metric
To evaluate the efficacy of the different proposals, we employ the T11SU measure [26], and consists of a normalization of the TREC linear utility measure (T11U). Linear utility gives two points for every relevant document retrieved (TP) and receives a penalty of one point for each non-relevant retrieved document (FP), and is expressed as:
Normalization is done based on its theoretical maximum, and scaled against the -0.5 value.
T11SU measure is computed as follows:
Being,
Notice that a T11SU value of 0.33 can be achieved by a system that does not retrieve anything. As TREC organizers noticed, a system that achieves this value, is not useful, but it does not waste user’s time either. This value was taken as baseline during the Filtering track in the TREC conferences.
Also, we report results with the traditional F1 measure. F1 measure is a harmonic mean of precision and recall and is computed as follow:
Being, precision the proportion of retrieved documents that are in fact relevant, and recall the proportion of relevant documents that indeed are retrieved.
In the experiments, the documents were previously preprocessed, where tags and stop words were removed and a lemmatization process was applied. Also Named Entity Recognition (NER) was performed and noun phrases were extracted. In particular, we employed the FreeLing 3.0 tools [20] for lemmatization, noun phrases extraction and named entity recognition. For NER, we also utilized JRC resources provided by European Commission’s science and knowledge service1. JRC-Names consists of large lists of names of persons and organizations and their spelling variants in several languages.
We considered three levels as context: Document, Sentence and Noun Phrase. During the experiments, we also limited the size of the relations to four terms.
For the experiments, we followed up the same experiments setting proposed on the TREC 11 conference [25].
We selected the Compact Clustering [24] for the representative topic terms extraction. We selected this clustering algorithm because it does not require to specify beforehand the number of clusters. This algorithm involves a parameter beta, and in our experiments we set this parameter as 0.1. This clustering algorithm is not mandatory in our proposal and another clustering algorithm can be employed.
Comparison with previous pattern-based approaches
Researchers are convinced that for obtaining high quality is mandatory the use of document representations that avoid the term independence assumption of Bag of Words model (BOW). Several approaches have been proposed and one of the most popular are those based on patterns. Pattern based representations are preferred because they are easy to interpret, and facilitate the possibility of explaining the decision of retrieving or not a document for the user.
Among those of the most recent proposals based on patterns are Pattern Taxonomy Model (PTM) and Pattern Deployment Model (PDM). The main difference between these models is that the first uses the whole pattern for representing the documents and the other combines the terms that appear in the patterns in a single centroid. These models do not take into consideration how the terms in the profile are distributed. To ameliorate this drawback, a model known as MPBTM was proposed. This model employs LDA to infer how terms describing the topic are related and the subtopics are presented.
Our proposal differs from these proposals in the fact that none of these models takes into consideration the non-relevant documents for extracting the patterns. Like PTM, we employ the patterns as a whole, but we use this patterns as rules and not combining them in a centroid.
Similar as MPBTM and EFITM, we take into account the subtopics appearing in the topic, but in these models the number of subtopics has to be specified. In contrast, we employ a previous clustering process that does not require to have the number of subtopics.
In Table 1, we show a comparison of the results obtained with three variants of our proposal and other recent pattern based models. All the models were implemented in Python 2.72. In this table, D denotes the use of relations only at the document level. S and NP represent that contexts are considered at sentence and noun phrases level, respectively. In the table, we can notice that our proposal obtains better results than our implementation of the three other models, even with the more general context (D). Global results tend to improve as we consider more specific contexts (D + S and D + S + NP).
Comparison with previous pattern-based methods
Comparison with previous pattern-based methods
Given that the first experiment was a comparison with other methods as the average of the efficacy on the set of topics, we wanted to set a second experiment oriented to analyze the contribution of the use of different type (levels) of contexts. In Fig. 4, we show how the quality obtained by the filtering algorithm evolves when we take into account the different depth levels of the contexts. On axis X, we show months of the document streams and in the Y axis the quality as expressed by the averaged T11SU measure over all the profiles, obtained at the end of the corresponding month. In the figure, again D denotes the use of relations only at document level. S and NP represent that contexts are considered at sentence and noun phrases level, respectively.

Evolution of T11SU measure over time.
From the figure, we can notice some interesting details. First, we can note a tendency to improve the quality of the classifier over time. This is an important element because if at the starting time we build an erroneous user profile, it is very unlikely that the classifier’s quality will improve later on.
Another interesting phenomenon is related to the fact that when we employ more specific context, like sentences or noun phrases, the quality, in comparison with the document context level, tends to improve. This behavior supports our hypothesis that when we include more specific contexts in the analysis, the quality shows an improvement.
Also interesting to remark is that from the beginning, when the information available is very poor, the use of noun phrases achieves better results in comparison to those obtained with documents or sentences contexts. But, as time goes by, the use of a sentence level context has a rapid increment and reaches an efficacy similar to that obtained with the use of noun phrases. This behavior can be motivated by the fact that noun phrases usually associate nouns with adjective modifiers and, this structural constructions are very specific and are likely a good key for differentiating positives from negatives documents. But, a noun can be modified by several adjectives that express the same idea. This situation causes that if the same idea is expressed with a different adjective, the relation will not be satisfied and the document will not be retrieved. For example, the noun phrases: secure plane and safe plane, basically express the same idea but the noun plane is accompanied by two different adjectives. Another behaviour to remark is that of employing only (D)ocument context that in the long run tends to improve but always below the other more specific contexts.
When we have more documents for the user profile construction, we can extract more relations between different nouns and verbs that allow to obtain an improved profile, with a better quality.
Terms occurring in a document are not independent of each other. These have been selected by the author to convey an idea or a topic, and are necessarily dependent on that. To overcome this limitation, there exist several proposals in the literature. Among these proposals we found the pattern based models. In this paper, we propose a new method that analyses in depth the relations that exist among terms for modeling the user profile in a Document Filtering task. In our approach for extracting the most representative terms, we take into account the structure of the topic in subtopics, but unlike models like MPBTM, we do not assume the existance of a predefined number of subtopics. The results obtained showed that the use of context at different levels of generality leads to an improvement of the classifier quality. Results also showed that the use of noun phrases contexts have a particular impact where the number of documents available for the user profile construction is low.
It remains to test whether an intermediate level such as paragraph (between document and sentence) shows a consistent behaviour with the other levels considered in the experiments. We could not take this level into account since the experimental collection does not have a clear separation in paragraphs.
Although the use of relationships allows to overcome the independence between terms, they do not handle the polysemy problem, where different terms could be used for expressing the same idea. In future works we plan to extend our model to handle this phenomenon, possibly by the use of a representation that incorporates latent information such as Random Indexing or Document Occurrence Representation. We foresee that such representations can complement our method.
