Time sensitive blog retrieval using temporal properties of queries

Abstract

Blogs are one of the main forms of user-generated content on the web and are growing rapidly in number. The characteristics of blogs require the development of specialized search methods tuned for the blogosphere. In this paper, we focus on blog retrieval, which aims at ranking blogs with respect to their recurrent relevance to a user’s topic. Although different blog retrieval algorithms have already been proposed, few of them have considered the temporal properties of the input queries. Therefore, we propose an efficient approach to improving relevant blog retrieval using the temporal properties of queries. First, the time sensitivity of each query is automatically computed for different time intervals based on an initially retrieved set of relevant posts. Then a temporal score is calculated for each blog, and finally all blogs are ranked based on their temporal and content relevancy with regard to the input query. Experimental analysis and comparison of the proposed method are carried out using a standard dataset with 45 diverse queries. Our experimental results demonstrate that, under different evaluation measures, our proposed method outperforms other blog retrieval methods.

Keywords

Blog distillation; blog feed search; blog retrieval; time-sensitive blog retrieval

1. Introduction

With the spread of Web 2.0, different technologies have emerged that enable internet users to share their knowledge in a fast and easy manner. This has led to a growing mass of user-generated content on the Web. Weblogs are one of the simplest and most common content generation tools on the Web. A large number of people write about their experiences and opinions in their blogs, which provides a tremendous amount of useful information on the blogosphere. Recently, the number of blogs has been growing extremely fast, and thus this phenomenon cannot be ignored [1]. The total number of blogs in the world is not known exactly; according to BlogPulse¹ (one of the major blog search engines), there were more than 182 million blogs in January 2012. Based on the latest statistics presented by Tumblr,² the total number of blogs was over 248 million (with more than 116.9 billion posts) in August 2015. Also, according to WordPress,³ each month about 53.6 million new posts and 59.3 million new comments are written by WordPress blog users. Owing to the large number of blogs, various interrelated research activities have been carried out to answer users’ information needs, such as opinion retrieval [2–4], topic detection and tracking [5], top news stories identification [6, 7], blog post search [8, 9] and blog retrieval [10–13].

The size of blogosphere and the special characteristics of blogs, such as vulnerability to spam, the informal and conversational language of blogs and their short lifespan, make the search for high-quality content on the blogosphere a challenging task [14]. Information retrieval in the context of the blogosphere usually includes one of the following two main tasks:

Blog post search – the goal is to rank individual blog posts with regard to the topic of the input query. Blog post search is very similar to ad hoc search in traditional information retrieval. This task can be stated as ‘find me blog posts about x’, where x is the topic of the input query [9].

Blog feed search (also known as blog retrieval or blog distillation) – the goal of this task is to rank blogs according to their recurrent relevance to the topic of the query. Rather than blog posts, the retrieval units of this task are whole blogs. In blog feed search, it is supposed that users are searching for blogs to add them to their feed aggregator and follow them constantly. This task is similar to the filtering task in traditional information retrieval. It can be expressed as ‘find me blogs principally and recurrently devoted to concept x’, where x is the topic of the input query [15]. In this paper, we focus on this task.

After the organizers of the TREC conference introduced a related task in 2007 [16, 17], many studies on blog retrieval followed. Solutions used for similar problems, such as ad hoc search methods, expert search algorithms and resource selection in distributed information retrieval environments, have also been adopted for blog retrieval. These solutions are summarized in Section 2.

The temporal properties of blog posts have been used in different ways. Nunes et al. [18] defined two new measures called temporal span and temporal dispersion to evaluate how long and how frequently a blog has been written about a specific topic. Similarly, Macdonald and Ounis [19] used a heuristic measure to capture the recurring interests of blogs over time. Some other approaches considered a higher score for more recent posts [20, 21] and some others, like Keikha et al. [22], proposed a time-based query expansion method.

However, to the best of our knowledge, none of the previous methods has considered the temporal properties of different kinds of input queries. All blog retrieval methods treat all queries in the same way, although queries usually have different temporal properties that can be taken into account for blog retrieval tasks.

Suppose the input query is ‘FIFA World Cup 2010 in Africa’; if the writing time of a post about FIFA World Cup 2006 is ignored, such a post can get a high score in a blog retrieval method. However, this post is not relevant, as the user needs to find posts that are written about the World Cup of 2010. Also, for event-sensitive queries like ‘Oscar Film Awards’, ‘Fajr Film Festival’ or ‘Tehran International Book Fair’, considering the time of the posts is very important; posts that are written at the time of the festival or immediately after the awards are more relevant. Therefore, we propose to classify users’ queries based on their temporal properties as follows:

periodic time-sensitive queries;

event-sensitive queries or queries that are sensitive to a certain time interval;

recency-sensitive queries;

time-insensitive queries.

We propose an approach that leverages the temporal property of queries to retrieve relevant blogs. The proposed approach starts by retrieving relevant posts by use of content-based relevance score. Then it builds a temporal profile for the query according to its importance in different time intervals, and computes a temporal score for each relevant post using this profile. Finally, the content and temporal scores are combined and are used in different retrieval methods to generate the final ranking. Consequently, for a query such as ‘FIFA World Cup 2010 in Africa’, a lower temporal score is assigned to a post written about World Cup 2006. More specifically, in this paper we will:

cross-validate the existing blog retrieval methods on a new collection;

discuss classification of user’s queries based on their temporal properties;

investigate the impact of temporal properties of queries on blog retrieval accuracy;

introduce a method for calculating a temporal score for blog posts with respect to the input query;

propose an efficient time-sensitive blog retrieval algorithm.

The rest of this paper is organized as follows: Section 2 discusses a general review and a classification of blog retrieval methods and our proposed method is presented in Section 3. Then our experimental setup is presented in Section 4, and the results and comparisons are discussed in Section 5. Finally the paper is concluded in Section 6.

2. Related works

In this section, different blog retrieval methods will be reviewed briefly. In the literature, the blog retrieval systems use two different types of representations for blogs:

Global Representation (GR) – builds a virtual document for a blog by concatenating all posts of the blog.

Local Representation (LR) – in this representation, blogs are seen as a collection of posts. The LR model treats each post within a blog as a separate document.

Considering the above representations, blog retrieval methods can be classified into two categories. The first category includes the methods that rank blogs according to the relevancy score of the whole blog; these methods use the GR model of blog representation. The methods in the second category rank blogs based on a combination of the relevancy scores of individual posts weighted by their post importance; these methods represent blogs using the LR model. The next two sub-sections survey the approaches that use the GR and LR models of blogs, respectively, and the third sub-section introduces the blog retrieval methods that use temporal properties of blogs.

2.1. Blog retrieval methods based on GR of blogs

Elsas et al. [23] propose a Large Document (LD) model for blog retrieval. The LD model regards the whole blog as a single document and calculates its relevancy score with regard to any input query.

P_{LD}(B | Q) = P(B) \cdot P_{LD}(Q | B)

(1)

where P(B) is the blog prior, and the query likelihood P_{LD}(Q|B) is estimated with Dirichlet smoothing. Weerkamp et al. [24] use a language modelling framework for blog retrieval. They adopt an expert search model for blog retrieval and name it the blogger model. In their model, the probability of a query being generated by a given blog is estimated by representing the blog as a multinomial distribution of terms. Furthermore, Weerkamp et al. [21] extend the blogger model by using a number of blog-specific features, such as document structure, social structure and temporal structure, in the post prior of the blogger model.

Similar to the LD of Elsas et al. [23], Seo and Croft [25] create a virtual document for a blog by concatenating all posts in a blog. The virtual document is represented using a language model, and the query likelihood of the document for a query Q is used as a ranking function.
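As an illustration of the large-document view, the following minimal sketch scores a blog by concatenating its posts and computing a Dirichlet-smoothed query likelihood in log space. The function name `ld_log_score`, the tokenized input format, the smoothing value `mu=2000.0` and the decision to skip query terms unseen in the collection are assumptions made for this example; they are not taken from the cited papers.

```python
import math
from collections import Counter


def ld_log_score(query_terms, blog_posts, collection_tf, collection_len,
                 mu=2000.0, blog_prior=1.0):
    """Return log(P(B) * P(Q|B)) for one blog treated as a single large document.

    blog_posts is a list of token lists; collection_tf maps term -> frequency
    in the whole collection; collection_len is the total token count.
    """
    # Global representation: concatenate all posts into one virtual document.
    blog_tf = Counter(term for post in blog_posts for term in post)
    blog_len = sum(blog_tf.values())

    log_score = math.log(blog_prior)
    for term in query_terms:
        p_collection = collection_tf.get(term, 0) / collection_len
        if p_collection == 0.0:
            continue  # assumption of this sketch: ignore terms unseen in the collection
        # Dirichlet smoothing of the blog language model.
        p_term = (blog_tf[term] + mu * p_collection) / (blog_len + mu)
        log_score += math.log(p_term)
    return log_score
```

With equal blog lengths, a blog that mentions a query term more often receives a higher log score, which is the behaviour expected of equation (1).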

2.2. Blog retrieval methods based on LR of blogs

In this approach, blog retrieval methods use a ranking function F that is computed based on the individual post scores (PostScore) of each post P in a blog B.

Final score (B) = F (PostScore (P, Q))

(2)

Different solutions for similar problems, such as expert search or resource selection methods, have been adopted for defining the function F. Elsas et al. [23] propose the Small Document (SD) model, which considers each blog as a collection of posts; once the relevancy score of each post has been calculated, a blog B is scored as the sum of the individual scores of its posts weighted by their post centrality.

Score_{SD}(B, Q) = \sum_{P \in B} PostScore(P, Q) \cdot Centrality(P, B)

(3)

where P is a post in the blog B, and $PostScore (P, Q)$ is the relevancy score for each post which is computed based on query terms using Dirichlet smoothing and $Centrality (P, B)$ is the post centrality.

Seo and Croft [25] approach blog retrieval as a resource selection problem. They consider a Pseudo-Cluster Selection (PCS) model for blog representation. PCS is similar to the SD model of Elsas et al.; however, it utilizes a different principle. In PCS, a blog is considered as a query-dependent cluster containing highly ranked blog posts for the input query.

Similar to the SD and PCS models, Lee et al. [26] proposed the Global Evidence Model (GEM) and the Local Evidence Model (LEM). The GEM calculates the final score of a blog B as the average $PostScore$ of the posts P of the blog. Formally, this can be expressed as:

Score_{GEM}(B, Q) = \frac{\sum_{P \in B} PostScore(P, Q)}{n}

(4)

Furthermore, the LEM is calculated in the same way. The difference is that GEM considers every post P in a given blog B, whereas LEM only considers the top K retrieved posts.

Score_{LEM}(B, Q) = \frac{\sum_{P \in \text{top-}K(B)} PostScore(P, Q)}{K}

(5)
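Equations (4) and (5) can be sketched as two small functions over the list of PostScore values of one blog. The names `gem_score` and `lem_score` are hypothetical, and dividing by K even when a blog has fewer than K retrieved posts is an assumption of this sketch rather than a detail confirmed by the text.

```python
def gem_score(post_scores):
    """Equation (4): average PostScore over all n posts of the blog."""
    if not post_scores:
        return 0.0
    return sum(post_scores) / len(post_scores)


def lem_score(post_scores, k):
    """Equation (5): sum of the top-K PostScores of the blog, divided by K."""
    top_k = sorted(post_scores, reverse=True)[:k]
    # Assumption: the divisor stays K even if fewer than K posts are available.
    return sum(top_k) / k
```

For a blog with many weak posts and a few strong ones, `lem_score` is higher than `gem_score`, which is why the two are treated as complementary evidence.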

A linear combination of GEM and LEM provided the best efficiency among the participants of TREC 2008 Blog track. Keikha and Crestani [27] model each post as evidence of a blog’s relevancy to the input queries, and use aggregation methods like Ordered Weighted Averaging operators to combine the evidence. In particular, given a query Q, the score of blog B is estimated as:

Score_{OWA}(B, Q) = \sum_{i=1}^{n_{b}} W_{i} \cdot PostScore(P_{i}, Q)

(6)

where $P_{i}$ is the ith highest scored post in blog B among its $n_{b}$ posts, and the weight vector W determines the behavior of the aggregation. Macdonald and Ounis [19] propose an approach that is similar to the expert finding task. They use different voting models that were applied in Macdonald and Ounis [28] to expert finding. Some aggregation methods that they applied to blog retrieval are used as our baselines and are discussed in more detail in Section 3. Weerkamp et al. [24] use a language modelling framework to propose a blog retrieval method that regards individual posts as indexing units. Rather than directly modelling the blog, individual posts are modelled, and for every blog they sum up the relevance scores of individual posts weighted by their relative importance in the blog. Keikha et al. [29] introduce three types of on-topic diversity (Topical Diversity, Temporal Assortment and Hybrid Diversity) for the posts of blogs and investigate their effect on the performance of blog retrieval methods. They use different existing aggregation methods such as CombSum [19], PCS [25] and SD [23] as baselines in their experiments. The impact of the content similarity between posts of a blog is investigated in Keikha et al. [30], where the relevancy score of a post is smoothed using the association between the post and the other posts of the blog. Gerani et al. [31] use different features of posts such as content, in-links and anchor text, in addition to global features of each blog such as the total number of postings, the number of postings that are relevant to the topic and the cohesiveness of the blog. They utilize a rank learning [32] approach to combine the features into a single retrieval function for estimating the relevancy of blogs.

2.3. Blog retrieval methods based on temporal properties of blogs

Temporal information has been used in ad-hoc information retrieval in many different ways. Jones and Diaz use time-based query profiles for predicting query precision [33]. Dakka et al. [34] incorporate temporal distributions in different language modelling frameworks. They apply several standard normalizations to temporal distributions and use global temporal distributions as a prior. Li and Croft [35] and Efron and Golovchinsky [36] use time-based methods in order to rank information for queries for which recency is an important factor. Peetz et al. [37] propose a query modelling approach where terms are sampled from bursts in temporal distributions of an initially retrieved set of documents. They define a burst to be a time period where an unusually large number of documents is published and propose different approximations for both continuous and discrete bursts. Amodeo et al. [38] select top-ranked documents in the highest peaks as pseudo-relevant, and consider documents outside peaks as non-relevant. They use Rocchio’s algorithm [39] for relevance feedback based on the top-10 documents.

Time has also been used in blog retrieval. Many researchers utilize the time stamp of posts in various ways; Keikha et al. [22] use temporal properties of posts in a query expansion method that selects terms for query expansion based on the relevant days of a given topic.

Nunes et al. use temporal properties of blogs to find relevant blogs [18]. They use temporal span and temporal dispersion as two measures of relevancy over time, and show that these features can improve blog retrieval. Keikha et al. [40] propose a framework, named TEMPER, that selects time-dependent terms for query expansion, generates one query for each point of time and calculates a distance measure based on temporal distributions.

Some models are designed to retrieve blogs based on the frequency of discussion about topics of interest. Such models show improvements over the baselines that solely use the content of the blogs [20, 21]. Macdonald and Ounis try to capture recurring interests of blogs over time [19]. Following the intuition that a relevant blog will continue to publish relevant posts throughout the timescale of the collection, they divide the collection into a series of equal time intervals. Then blogs are scored based on their number of relevant posts in different time intervals. Keikha et al. [41] aim to measure the stability of a blog’s relevance to a query over time. Their idea is that a blog that has many related posts only during a short period of time is not highly relevant. Thus they define TRS (Temporal Relevance Stability), which scores a blog higher if it has more related posts in more time intervals, which is believed to be an indication of greater stability.

In this paper, a new approach is proposed to use temporal properties of queries for blog retrieval. Through different experimental results, we will show that our approach considerably outperforms already existing methods.

3. Temporal-based approach to blog retrieval

Various blog retrieval methods utilize temporal properties of blogs. However, none of them uses the temporal properties of queries; they treat all queries in the same way. In this paper, we propose a temporal-based approach for blog retrieval that uses both the temporal and the content relevancy of blogs with regard to the input query to find an efficient ranking of blogs. Using the proposed approach, we can extend all the existing blog retrieval methods that use the LR of blogs (such as the voting model [19], blogger model [24], posting model [24], small document model [23] and local evidence model [26]) to be time sensitive. Figure 1 shows an overview of the proposed temporal-based approach to blog retrieval. This approach contains the following steps:

Step 1 – retrieve top N relevant posts named R(Q) set.

Step 2 – create a temporal profile for each query using R(Q) set.

Step 3 – calculate a temporal score for each post in R(Q) set based on the temporal profile of query.

Step 4 – use combined temporal and content score of blog posts to make already existing methods sensitive to time.

Figure 1.

An overview of proposed temporal-based approach to blog retrieval.

3.1. Step 1: retrieve top N relevant posts

In the first step of the temporal-based approach, top N relevant posts are identified using the PL2 [42] retrieval method and we call them the R(Q) set. This set contains the candidate relevant posts for the input query. Our experiments on different values of N show that the optimum value is N = 1000. Previous studies show that using the most relevant posts improves the performance of blog retrieval methods [26]. Therefore, we use the top 1000 relevant posts to create the R(Q) set. Figure 2 shows the effect of N on the performance of voting model in the irBlogs dataset.

Figure 2.

Effect of N on the performance of the voting model over the irBlogs dataset.

3.2. Step 2: create temporal profile of queries

After the candidate relevant posts are retrieved, a temporal profile is created for each query according to the occurrence of its relevant posts in each time interval. This profile includes:

Sensitivity(Q,time) – the sensitivity of the input query Q to each time interval time, that is, the importance of time interval time for query Q. Here, time is the publishing time of a post and Q is a given query.

Temporal query type (time-sensitive or time-insensitive) – time-sensitive queries are those for which the majority of relevant posts belong to a specific time interval, whereas time-insensitive queries are those whose relevant posts are uniformly distributed over all time intervals. That is, queries that have similar Sensitivity(Q,time) values for all time intervals are time-insensitive, and queries that have greater Sensitivity(Q,time) values for some time intervals are time-sensitive.

3.2.1. Calculating sensitivity of the input query to different time intervals

To calculate the sensitivity of the input queries to different time intervals, we introduce Sensitivity(Q, time) as the sensitivity of query Q to time interval time. We consider two different ways of calculating it, given in equations (8) and (9).

In equation (8), the sensitivity of query Q to time interval time is the ratio of the number of posts in R(Q) published in that interval to the number of all posts in R(Q). In this equation, P denotes a given post, and $T (Q)$ is the set of the publishing times of all posts belonging to R(Q). The more relevant posts fall in a time interval time, the higher the $Sensitivity (Q, time)$ value will be for that time interval:

T(Q) = \{day_{1}, day_{2}, \dots, day_{n-1}, day_{n}\}

(7)

Sensitivity (Q, time) = \frac{\sum_{P \in R (Q)} TimeRelevancy (P, time)}{\sum_{time' \in T (Q)} \sum_{P \in R (Q)} TimeRelevancy (P, time')}

(8)

Equation (9) differs from equation (8) in that it includes the relevancy score in the calculation as a weighting measure. In this equation, P denotes a given post and $T (Q)$ is the set of publishing times of all posts belonging to R(Q). Also, Score(P,Q) is the content score of post P with regard to query Q, which is calculated using the PL2 model.

Sensitivity (Q, time) = \frac{\sum_{P \in R (Q)} TimeRelevancy (P, time) . Score (P, Q)}{\sum_{time' \in T (Q)} \sum_{P \in R (Q)} TimeRelevancy (P, time') . Score (P, Q)}

(9)

In equations (8) and (9), $TimeRelevancy (P, time)$ is the relevancy of the published time of the post P to the time interval time which is shown in equation (10). As we see in this equation, if the post P is published at the time interval time, the published time of this post is relevant to time interval time.

TimeRelevancy(P, time) = \begin{cases} 1 & \text{if } time_{P} = time \\ 0 & \text{otherwise} \end{cases}

(10)
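Because TimeRelevancy in equation (10) is an indicator, each post in R(Q) contributes to exactly one time interval, so equations (8) and (9) reduce to a normalized (optionally score-weighted) count per interval. The sketch below illustrates this; the function name `sensitivity_profile` and the `(publish_day, content_score)` input format are assumptions made for the example.

```python
from collections import defaultdict


def sensitivity_profile(posts, weighted=False):
    """Return {day: Sensitivity(Q, day)} for the posts of R(Q).

    posts is an iterable of (publish_day, content_score) pairs.
    weighted=False follows equation (8); weighted=True follows equation (9).
    """
    mass = defaultdict(float)
    for day, score in posts:
        # TimeRelevancy(P, time) is 1 only for the post's own day, so the
        # post adds 1 (or its content score) to that single interval.
        mass[day] += score if weighted else 1.0

    total = sum(mass.values())
    if total == 0:
        return {}
    # The denominator of equations (8)/(9) is the mass summed over all of T(Q).
    return {day: m / total for day, m in mass.items()}
```

By construction the returned values sum to 1 over T(Q), so the profile can be read as a distribution of the query's relevant posts over time.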

3.2.2. Temporal classification of queries

In order to find out whether a query is time-sensitive or not, we extract the posting time of the candidate relevant blog posts in R(Q) and draw a posting time distribution histogram as shown in Figure 3. The horizontal axis indicates the posting time of the blog posts and the vertical axis is the number of posts that are published in a specific time.

Figure 3.

Relevant post distribution histograms of (a) the event-sensitive query ‘FIFA 2010 World Cup in Africa’, (b) the recency-sensitive query ‘gold and currencies prices and exchange’, (c) the periodic time-sensitive query ‘Fajr Film Festival’ and (d) the time-insensitive query ‘Classical Music’.

The timeline can be split into time intervals of different sizes, such as a year, month, day, hour or minute. The proper interval depends on the type of social media. For example, for a medium such as Twitter with a high update rate, small time intervals such as an hour or even a minute are reasonable. For blogs, however, a larger time interval such as a day or a week is more appropriate. In this research, we used a day as the time interval for blog posts.
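The choice of interval only changes how post timestamps are bucketed before counting. As a rough sketch, the function below builds the posting-time histogram for daily or weekly intervals; the name `posting_histogram` and the use of ISO calendar weeks for the weekly case are assumptions of this example.

```python
from collections import Counter
from datetime import datetime


def posting_histogram(timestamps, interval="day"):
    """Count posts per time interval; interval is 'day' or 'week'."""
    counts = Counter()
    for ts in timestamps:
        if interval == "day":
            key = ts.date()
        elif interval == "week":
            iso = ts.isocalendar()
            key = (iso[0], iso[1])  # (ISO year, ISO week number)
        else:
            raise ValueError(f"unsupported interval: {interval}")
        counts[key] += 1
    return counts
```

A histogram built this way has the same shape as the distributions in Figure 3: the keys are the time intervals and the values are the numbers of posts published in them.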

Depending on how the relevant posts are distributed over different time intervals, the input queries are divided into two groups of time-sensitive and time-insensitive. The time-sensitive queries have a non-uniform distribution of the posts over the time intervals, indicating that the number of related posts may have a sudden increase in some time intervals.

Time-sensitive queries can be divided into several other subcategories such as:

Event-sensitive queries – these queries are sensitive to certain time intervals, such as ‘FIFA World Cup 2010’, a tournament which was held from 11 June to 11 July 2010 in Africa. Figure 3a shows the distribution of the relevant posts for this query.

Recency-sensitive queries – for these queries, newer posts have a higher probability of relevancy. For example, the query ‘gold and currency prices and exchange’ is recency-sensitive, since the price of gold changes over time and users need to know the latest price. This can also be seen in Figure 3b, in which most related posts are published in the most recent time intervals.

Periodic time-sensitive queries – these queries have recurring spikes in their posting time distribution; that is, periodically there are intervals with a higher number of relevant posts. For example, a query about ‘Fajr Film Festival’ (Iran’s annual film festival), which is held every February in Tehran, is a periodic time-sensitive query, as every year around February there are many posts about it. This can be seen in Figure 3c (in each year, the number of retrieved posts increases in January and February).

Similarly, the time-insensitive queries can be identified by looking at the distribution of their posts as well. Time-insensitive queries have nearly uniform distribution of posts over all time intervals. Figure 3d shows the distribution of the posts for time-insensitive query ‘Classical Music’.

3.3. Step 3: calculating temporal score of posts

A temporal score is calculated for each candidate post in R(Q), based on the exponential distribution in equation (11):

TemporalScore(P, Q, time_{P}) = λ e^{-λ \left( \max_{time \in T(Q)} Sensitivity(Q, time) - Sensitivity(Q, time_{P}) \right)}

(11)

In equation (11), P is the intended post, Q is the input query, $time_{P}$ is the posting time of post P, $\max_{time \in T(Q)} Sensitivity(Q, time)$ is the sensitivity of query Q to the time interval in which the largest number of relevant posts has been published, $Sensitivity(Q, time_{P})$ is the sensitivity of query Q to the publishing time of post P and λ is a decay parameter.

Higher sensitivity of the input query Q to the published time of the post P will result in higher TemporalScore value for the post P. In other words, the spiky points in the relevant post distribution histogram get higher TemporalScores than other points.
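Equation (11) can be sketched directly from a sensitivity profile. In the example below, `sensitivity` is assumed to be a dictionary mapping each time interval in T(Q) to Sensitivity(Q, time), and the default `decay=1.0` is a placeholder value, since the text above does not fix λ.

```python
import math


def temporal_score(sensitivity, post_day, decay=1.0):
    """Equation (11): decay * exp(-decay * (peak sensitivity - sensitivity at post_day))."""
    peak = max(sensitivity.values())
    # Assumption: a post published outside T(Q) has sensitivity 0.
    gap = peak - sensitivity.get(post_day, 0.0)
    return decay * math.exp(-decay * gap)
```

A post published in the peak interval has a gap of zero and therefore receives the maximum score λ; the score decays exponentially as the sensitivity of the post's interval falls below the peak.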

3.4. Step 4: proposed time-sensitive blog retrieval methods

In this section, we will illustrate how to combine the proposed temporal score with the existing blog retrieval methods that use individual post scores. In other words, we will make voting, blogger, posting, SD and local evidence models time sensitive. All the other blog retrieval models (that were mentioned in Section 2.2) which use individual post scores to calculate the final score of the blogs can also be modified in the same way to be time sensitive. Here, we introduce two approaches to make the blog retrieval methods time sensitive:

A linear combination of the temporal score and the content score of blog posts – as mentioned in Section 2.2, most blog retrieval methods utilize the individual post scores PostScore to rank relevant blogs. Since previous studies use solely the content of the posts to calculate PostScore, they do not consider the temporal properties of posts. In our first approach, in order to make blog retrieval methods time sensitive, we use TimeSensitivePostScore instead of PostScore. That is, the final score of each post is calculated as a linear combination of its temporal score and its content score, as shown in equation (12). This approach is used to make the voting and local evidence models time sensitive.

TimeSensitivePostScore(P, Q, time_{P}) = (1 - α) Content_Score(P, Q) + α TemporalScore(P, Q, time_{P})

(12)

In equation (12), α is a weighting coefficient, $TemporalScore (P, Q, time_{P})$ is the temporal score of post P (calculated using equation (11), as explained in the previous section) and $Content_Score (P, Q)$ is the content score, which is calculated by the PL2 model. Note that the content and temporal scores are normalized first.
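The combination in equation (12) can be sketched as follows. The text states that both scores are normalized but does not say how, so min-max normalization over R(Q) is an assumption of this example, as are the function names and the default `alpha=0.5`.

```python
def min_max(values):
    """Scale a non-empty list of values to [0, 1] (assumed normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def time_sensitive_post_scores(content_scores, temporal_scores, alpha=0.5):
    """Equation (12) applied to parallel lists of scores for the posts in R(Q)."""
    content = min_max(content_scores)
    temporal = min_max(temporal_scores)
    return [(1 - alpha) * c + alpha * t for c, t in zip(content, temporal)]
```

Setting `alpha=0` recovers a purely content-based ranking of posts, and `alpha=1` ranks posts by their temporal score alone.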

The temporal score for post centrality – previous blog retrieval methods use measures such as post length or the similarity between a post and a blog to compute post centrality. In our second approach, in order to make blog retrieval methods time sensitive, we introduce $TemporalPostCentrality$, which calculates the centrality of a post using its temporal score. This approach is used to make the blogger/posting method and the small document model time sensitive.

TemporalPostCentrality(P, time_{P}) = TemporalScore(P, Q, time_{P})

(13)

3.4.1. Time-sensitive voting method

Macdonald and Ounis [19] adopt the voting method of Macdonald and Ounis [28] for blog retrieval. It works according to a list of initially retrieved posts for a query, named R(Q), which is supposed to be the set of probably relevant posts. Then each blog’s score is calculated based on the posts that exist in R(Q). In this method, bloggers are considered as experts in different topics. A blogger who is interested in a particular topic blogs regularly about that specific topic and it is highly probable that his/her posts are retrieved in response to a related query. Considering this approach, blog retrieval can be modelled as a voting process: a post that is retrieved in response to a query is considered as a weighted vote for the expertise of its blogger about the query. In Macdonald and Ounis [19], different fusion methods are used to aggregate the weighted votes and finally rank related blogs. ExpCombMNZ is the best fusion method in their experiments which is calculated as follows:

Score_{expCombMNZ}(B, Q) = | R(Q) \cap Post(B) | \cdot \sum_{P \in R(Q) \cap Post(B)} \exp(Content_Score(P, Q))

(14)

where Post(B) is the set of posts from the blog B, $| R (Q) \cap Post (B) |$ is the number of posts from blog B that also exist in R(Q) and $Content_Score (P, Q)$ is relevancy score of the content of post P with regard to the query Q which is calculated using a text retrieval model such as PL2.

In order to present a time-sensitive voting model, we calculate the final score of each post in R(Q) using the TimeSensitivePostScore of equation (12). Therefore, the final score of a blog is calculated using equation (15):

Score_{expCombMNZ}(B, Q) = | R(Q) \cap Post(B) | \cdot \sum_{P \in R(Q) \cap Post(B)} \exp(TimeSensitivePostScore(P, Q, time_{P}))

(15)
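Equations (14) and (15) share the same aggregation and differ only in which post score is passed in. A minimal sketch, assuming the retrieved set R(Q) is given as `(blog_id, post_score)` pairs (a hypothetical input format), is:

```python
import math
from collections import defaultdict


def exp_comb_mnz(retrieved):
    """Aggregate post votes into blog scores as in equations (14) and (15).

    retrieved is an iterable of (blog_id, post_score) pairs, one per post
    in R(Q). Pass content scores for equation (14) or time-sensitive post
    scores for equation (15).
    """
    votes = defaultdict(int)
    exp_sum = defaultdict(float)
    for blog_id, score in retrieved:
        votes[blog_id] += 1                  # contributes to |R(Q) ∩ Post(B)|
        exp_sum[blog_id] += math.exp(score)  # weighted vote of this post
    return {blog: votes[blog] * exp_sum[blog] for blog in votes}
```

The multiplication by the vote count rewards blogs with many retrieved posts, which matches the goal of finding blogs that write recurrently about the topic.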

3.4.2. Time-sensitive posting and blogger method

Weerkamp et al. [24] adapt two expert search models based on language modelling for blog retrieval. The first model, known as the blogger model, estimates the probability of a query given a blog by representing the blog as a multinomial probability distribution over the vocabulary of terms as shown in equation (16):

P(Q | θ_{blogger}(blog)) = \prod_{t \in Q} P(t | θ_{blogger}(blog))^{n(t, Q)}

(16)

In this equation, n(t, Q) represents the frequency of term t in query Q, θ_blogger(blog) is the blog’s language model, and P(t|θ_blogger(blog)) is the probability of term t in the blog’s language model, which is calculated using equation (17):

P(t | θ_{blogger}(blog)) = (1 - λ_{blog}) P(t | blog) + λ_{blog} P(t)

(17)

where P(t) is the probability of term t in the document repository and $P (t | blog)$ is the likelihood of term t in a blog, which is calculated using equation (18):

P (t | blog) = \sum_{post \in blog} P (t | post, blog) P (post | blog)

(18)

Assuming that terms are conditionally independent of the blog given a post, P(t|post,blog) = P(t|post), and P(t|post) is approximated using the standard maximum likelihood estimate. Also, P(post|blog), the importance of a post in a blog, is considered the same for all posts of a blog.
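Equations (16) to (18) can be combined into one short routine. The sketch below works in log space and uses a uniform P(post|blog), as described above; the function name, the tokenized post format, the background model `collection_prob` and the default `lam=0.5` are assumptions made for illustration.

```python
import math
from collections import Counter


def blogger_log_likelihood(query_terms, posts, collection_prob, lam=0.5):
    """Return log P(Q | theta_blogger(blog)) for one blog.

    posts is a non-empty list of token lists; collection_prob maps
    term -> P(t) in the document repository.
    """
    p_post = 1.0 / len(posts)  # uniform P(post|blog)
    log_p = 0.0
    for term, n_tq in Counter(query_terms).items():
        # Equation (18): P(t|blog) = sum over posts of P(t|post) * P(post|blog),
        # with P(t|post) estimated by maximum likelihood.
        p_t_blog = sum(p_post * post.count(term) / len(post)
                       for post in posts if post)
        # Equation (17): interpolate with the collection model.
        p_t = (1 - lam) * p_t_blog + lam * collection_prob.get(term, 0.0)
        if p_t == 0.0:
            return float("-inf")
        # Equation (16): each query term contributes n(t, Q) times.
        log_p += n_tq * math.log(p_t)
    return log_p
```

Replacing the uniform `p_post` with a per-post weight is the only change needed to plug a different post importance into this model.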

In the second model presented in Weerkamp et al. [24], named the posting model, each blog post is modelled instead of the entire blog. The final score of a blog is then the sum of the relevance scores of its individual posts $P (Q | θ_{posting} (post))$, weighted by their relative importance given the blog, that is, P(post|blog). So, we have:

P(Q | blog) = \sum_{post \in blog} P(Q | θ_{posting}(post)) P(post | blog)

(19)

In Weerkamp et al. [24], $P (post | blog)$ is considered a constant for all blog posts, but in this paper, to make the blogger/posting model time sensitive, we use the $TemporalPostCentrality$ defined in equation (13) to calculate the importance of a post. As is evident in Figure 3, for time-sensitive queries, there are time intervals (spiky points) that are far more important than the others. In such time intervals, there is a greater chance of finding relevant posts. Therefore, we use the $TemporalPostCentrality (P, time_{P})$ of each post in a blog to calculate $P (post | blog)$ in the blogger/posting model.
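One possible reading of this step is sketched below for the posting model: the temporal scores of a blog's posts are turned into post weights and used in place of the constant P(post|blog). Normalizing the temporal scores so that they sum to 1 within a blog, and falling back to uniform weights when all scores are zero, are assumptions of this sketch; the text does not state whether such normalization is applied.

```python
def temporal_post_weights(temporal_scores):
    """Turn TemporalPostCentrality values of one blog into P(post|blog) weights."""
    n = len(temporal_scores)
    total = sum(temporal_scores)
    if total == 0:
        return [1.0 / n] * n  # assumed fallback: uniform weights
    return [score / total for score in temporal_scores]


def posting_model_score(post_likelihoods, temporal_scores):
    """Sum of P(Q|theta_posting(post)) weighted by temporal P(post|blog)."""
    weights = temporal_post_weights(temporal_scores)
    return sum(p * w for p, w in zip(post_likelihoods, weights))
```

Under this weighting, a post published in a spiky interval contributes more to the blog score than an equally relevant post published in a quiet interval.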

3.4.3. Time-sensitive small document model (TSSD)

The SD model considers each post within a blog as a separate document. After calculating the relevancy score of each post, a blog B is scored based on the sum of the individual scores of its constituent posts weighted by their post centrality:

Score_{SD}(B, Q) = \sum_{P \in B} PostScore(P, Q) \cdot Centrality(P, B)

(20)

To compute post centrality, the model uses a measure of similarity between the post and the blog. In general, any similarity measure can be used, for example, K-L divergence or cosine similarity. In Elsas et al. [23] the post centrality score is computed based on the geometric mean of term generation probabilities, weighted by their likelihood in the blog language model. Also, in Keikha et al. [30] the authors considered a uniform prior for the blogs, since post centrality uses a uniform distribution over posts in each blog.

In order to obtain a time-sensitive small document model, we replace the post centrality of equation (20) with the $TemporalPostCentrality(P, time_{P})$ of the posts. The final score of a blog is therefore calculated using equation (21):

Score_{TSSD}(B, Q) = \sum_{P \in B} PostScore(P, Q) \cdot TemporalPostCentrality(P, time_{P})

(21)
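The difference between equations (20) and (21) can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the post scores and both kinds of centrality weights are assumed to be precomputed, the dictionary keys and the function `rank_blogs` are hypothetical names, and the numbers are invented toy values rather than results from the experiments.

```python
def rank_blogs(blogs, weight_key):
    """Rank blogs by the sum of PostScore * weight over their posts.

    blogs maps a blog id to a list of posts; each post is a dict with a
    "score" entry (PostScore) and one entry per weighting scheme.
    weight_key="centrality" corresponds to equation (20),
    weight_key="temporal_centrality" to equation (21).
    """
    blog_scores = {
        blog_id: sum(post["score"] * post[weight_key] for post in posts)
        for blog_id, posts in blogs.items()
    }
    return sorted(blog_scores, key=blog_scores.get, reverse=True)

# Toy data (illustrative values only).
toy_blogs = {
    "general_football": [
        {"score": 0.6, "centrality": 0.5, "temporal_centrality": 0.1},
        {"score": 0.5, "centrality": 0.5, "temporal_centrality": 0.1},
    ],
    "world_cup_2010": [
        {"score": 0.4, "centrality": 0.5, "temporal_centrality": 0.7},
        {"score": 0.3, "centrality": 0.5, "temporal_centrality": 0.6},
    ],
}

sd_ranking = rank_blogs(toy_blogs, "centrality")
tssd_ranking = rank_blogs(toy_blogs, "temporal_centrality")
```

With these toy values the content-only weighting places the general blog first, whereas the temporal weighting promotes the blog whose posts fall in the important time intervals.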

3.4.4. Time-sensitive local evidence model (TSLE)

Lee et al. [26] use global and local evidence of blog feeds to calculate blog scores, which correspond to the document-level and passage-level evidence used in passage retrieval. They calculate the final score of a blog as a linear combination of its global evidence and local evidence based on equation (22):

Score_{LEM}(B, Q) = (1 - α) \cdot \left(\frac{\sum_{P \in B} PostScore(P, Q)}{n}\right) + α \cdot \left(\frac{\sum_{P \in top_{k}(B)} PostScore(P, Q)}{k}\right)

(22)

Instead of $PostScore$ in equation (22), we use $TimeSensitivePostScore$ to obtain the time-sensitive local evidence model. The final score of a blog is therefore calculated using equation (23):

Score_{TLEM}(B, Q) = (1 - α) \cdot \left(\frac{\sum_{P \in B} TimeSensitivePostScore(P, Q, time_{P})}{n}\right) + α \cdot \left(\frac{\sum_{P \in top_{k}(B)} TimeSensitivePostScore(P, Q, time_{P})}{k}\right)

(23)
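The combination of global and local evidence in equations (22) and (23) can be sketched as follows. This is a hedged illustration: the per-post scores are assumed to be given (content-only scores for equation (22), time-sensitive scores for equation (23)), the function name `local_evidence_score` is hypothetical, and capping k at the number of available posts is an assumption of this sketch, not something stated in the text.

```python
def local_evidence_score(post_scores, k, alpha):
    """(1 - alpha) * mean of all post scores + alpha * mean of the top-k scores."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not post_scores:
        return 0.0
    n = len(post_scores)
    k = min(k, n)  # assumption: a blog with fewer than k posts uses all of them
    global_evidence = sum(post_scores) / n
    top_k = sorted(post_scores, reverse=True)[:k]
    local_evidence = sum(top_k) / k
    return (1 - alpha) * global_evidence + alpha * local_evidence

# One blog with four scored posts: the global mean is 0.3 and the top-2 mean is 0.55.
example_score = local_evidence_score([0.9, 0.1, 0.2, 0.0], k=2, alpha=0.5)
```

Setting alpha to 0 reduces the score to the average over all posts, while alpha equal to 1 scores a blog only by its k best posts.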

4. Experimental setup

4.1. Dataset

In order to evaluate and compare the proposed time-sensitive blog retrieval methods, we use the irBlogs dataset [43], which has also served as the basis for other studies [44, 45]. It is a standard dataset prepared for the evaluation of blog retrieval methods in the Persian blogosphere. It includes: (a) a set of blogs with their posts and publication times; (b) a set of 45 topics covering different subject categories, prepared in the standard TREC format with a title, a description and a narrative; and (c) relevance judgements (ground truth) on a four-level scale: Highly Relevant, Relevant, Irrelevant and Spam. Table 1 shows some general information about the irBlogs dataset.

Table 1.

Statistics of irBlogs collection

irBlogs’ properties	Count
Number of blogs	473,994
Number of posts	4,846,536
Average number of posts for a blog	10.22
Number of queries	45
Average highly relevant blogs per query	23.28
Average relevant blogs per query	39.40
Number of highly relevant blogs	1048
Number of relevant blogs	1773

The irBlogs dataset has some temporal properties that distinguish it from the TREC datasets. One of them is its query set. Figure 4 shows the percentage of time-sensitive queries compared with time-insensitive queries in the irBlogs dataset. About 80% of the queries are time-sensitive and 82% of the posts have a writing time. Another important property of irBlogs is the distribution of the writing time of the posts, which is depicted in Figure 5. As shown in Figure 5, there are sufficient posts in the dataset for all the time intervals. The queries of the TREC datasets are time-insensitive and therefore not suitable for evaluating our approach, so we chose the irBlogs dataset instead. Also, it has recently been shown that even widely accepted TREC collections are not as reliable as generally assumed [46], so irBlogs can also be useful for cross-validation of existing blog retrieval methods.

Figure 4.

(a) The percentage of posts with/without writing time.

Figure 5.

Distribution of the writing time of irBlogs posts.

4.2. Training

We performed an exhaustive grid search to find the optimal parameters for the proposed methods. In the time-sensitive voting model and the time-sensitive local evidence model, we have two parameters to be trained, λ in equation (5) and α, which balances the weight of the temporal score of equation (7). Time-sensitive blogger/posting and time-sensitive small document models require training for one parameter λ in equation (13), which defines the exponential distribution coefficient.

4.3. Evaluation

In order to evaluate the performance of the proposed approaches, 10-fold cross-validation is performed. The 45 queries of the irBlogs dataset are partitioned randomly. For each partition, the parameters are trained on all the other partitions and the performance on the held-out partition is evaluated with the trained parameters. Thus, in each step we use about 90% of the queries for training and about 10% for testing.
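The evaluation protocol, combined with the grid search of Section 4.2, can be sketched as follows. This is a minimal illustration and not the code used for the experiments: `evaluate` is a hypothetical callback that returns the mean value of the chosen measure for a parameter setting on a list of queries, and `param_grid` is any iterable of candidate settings (for example, (λ, α) pairs).

```python
import random

def k_fold_query_splits(query_ids, k=10, seed=0):
    """Yield (train, test) query lists for k-fold cross-validation."""
    ids = list(query_ids)
    random.Random(seed).shuffle(ids)       # random partitioning of the queries
    folds = [ids[i::k] for i in range(k)]  # k folds of near-equal size
    for i, test in enumerate(folds):
        train = [q for j, fold in enumerate(folds) if j != i for q in fold]
        yield train, test

def cross_validate(query_ids, param_grid, evaluate, k=10, seed=0):
    """Average held-out score, with an exhaustive grid search on each training split."""
    param_grid = list(param_grid)
    fold_scores = []
    for train, test in k_fold_query_splits(query_ids, k, seed):
        # Pick the setting that maximizes the measure on the training queries.
        best_params = max(param_grid, key=lambda params: evaluate(params, train))
        fold_scores.append(evaluate(best_params, test))
    return sum(fold_scores) / len(fold_scores)
```

With 45 queries and 10 folds, each test fold in this sketch holds four or five queries, which is why the 90%/10% split is only approximate.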

Several standard evaluation measures are used, and the parameters are trained separately for each measure. The measures are mean average precision (MAP) as well as a number of precision-oriented measures such as precision at rank 10 (P@10), mean reciprocal rank (MRR) and normalized discounted cumulative gain (NDCG). MAP1 denotes the MAP of a run when only those blogs that are judged as highly relevant are considered to be relevant, and MAP2 denotes the MAP of a run when both highly relevant and relevant blogs are considered to be relevant. We compare the proposed method with some state-of-the-art blog retrieval methods that are listed in Table 2.

Table 2.

The best blog retrieval methods reported in the literature

Name	Blog retrieval method
Blogger	Blogger model (two-stage model) [24]
Posting	Posting model [24]
Voting	Voting model [19]
LE	Local evidence [26]
Large	Large document [23]
RS	Resource selection + diversity penalty [25]
SD	Small document [23]
Temp	Temporal evidence [18]
TEMPER	Temporal relevance feedback method [40]
TRS	Temporal relevance stability [41]
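The evaluation measures described in this section can be sketched in a few lines of Python. This is an illustrative sketch only: the label strings, the function names and the toy run are assumptions made for the example, and NDCG is omitted for brevity. MAP and MRR are the means of `average_precision` and `reciprocal_rank` over all queries.

```python
def relevant_set(judgements, include_relevant):
    """Blogs counted as relevant: MAP1-style (highly relevant only) or MAP2-style."""
    accepted = {"highly_relevant"}
    if include_relevant:
        accepted.add("relevant")
    return {blog for blog, label in judgements.items() if label in accepted}

def average_precision(ranked, relevant):
    """Average of the precision values at the ranks where relevant blogs occur."""
    hits, precision_sum = 0, 0.0
    for rank, blog in enumerate(ranked, start=1):
        if blog in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k ranked blogs that are relevant."""
    return sum(1 for blog in ranked[:k] if blog in relevant) / k

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant blog, or 0 if none is retrieved."""
    for rank, blog in enumerate(ranked, start=1):
        if blog in relevant:
            return 1.0 / rank
    return 0.0

# Toy run for one query (illustrative labels and ranking).
toy_judgements = {"b1": "relevant", "b2": "highly_relevant",
                  "b3": "irrelevant", "b4": "highly_relevant"}
toy_ranking = ["b1", "b2", "b3", "b4"]
ap_map1_style = average_precision(toy_ranking, relevant_set(toy_judgements, False))
ap_map2_style = average_precision(toy_ranking, relevant_set(toy_judgements, True))
```

In the toy run, the stricter MAP1-style judgement gives an average precision of 0.5, while the MAP2-style judgement, which also accepts the blog at rank 1, gives a higher value.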

5. Experimental results

In this section, several blog retrieval methods are analysed based on their top 1000 retrieved blogs and their performances are compared with the proposed methods.

5.1. Cross-validation of the existing blog retrieval methods

Table 3 compares the performance of the blog retrieval methods on the irBlogs dataset. Based on these results, among the non-temporal blog retrieval methods, local evidence [26], resource selection [25] and the large document model [23] are the best. Among the temporal blog retrieval methods, TEMPER [40] performs considerably better than the others. Figure 6 compares the blog retrieval methods based on MAP2 for the irBlogs and TREC 2007 datasets.

Table 3.

Evaluation results for the state-of-the-art blog retrieval methods based on all topics of the irBlogs dataset

Model	MAP1	MAP2	NDCG	MRR	P@10
Voting	0.1666	0.2211	0.4328	0.4242	0.3756
Blogger	0.1616	0.1785	0.3863	0.3911	0.3667
Posting	0.1445	0.1946	0.3922	0.4099	0.3533
Large	0.2551	0.3235	0.5264	0.6155	0.4889
RS	0.2104	0.2981	0.4677	0.3338	0.4689
SD	0.1721	0.2342	0.4478	0.4097	0.3768
LE	0.3822	0.3225	0.5415	0.7698	0.5044
Temp	0.0500	0.0816	0.2598	0.1412	0.1142
TEMPER	0.2448	0.2986	0.4833	0.7041	0.5012
TRS	0.1732	0.1997	0.4289	0.4065	0.3578

Figure 6.

Comparison of the state-of-the-art blog retrieval methods based on MAP2 on the TREC 2007 and irBlogs collections.

As can be seen in Figure 6, local evidence, resource selection and the large document model are ranked the same in both collections, but the ranking of the other methods changes slightly. It should be noted that most of the existing blog retrieval methods do not perform well on the Persian dataset. This has also been reported for text retrieval methods [47], which means that new methods should be developed by considering the characteristics of the Persian blogosphere.

5.2. Evaluation of the proposed time-sensitive methods

Since the irBlogs dataset contains time-sensitive queries (80% of the queries are time-sensitive), it is used for the evaluation of the proposed time-sensitive methods. Table 5 compares our proposed time-sensitive methods with the state-of-the-art temporal blog retrieval methods (listed in Table 4).

Table 4.

List of already existing temporal blog retrieval methods

Name	Description
Time Recency Posting (TRP)	Assumes that the most recent posts are more important than much older posts. To incorporate this intuition, it assigns a recency score to each post in a blog and then uses the normalized recency score as the post importance term in the posting model [21].
Time Recency Blogger (TRB)	Similar to TRP, but uses the normalized recency score as the post importance term in the blogger model [21].
Recurring Vote (RV)	Tries to capture the recurring interests of blogs over time [19].
Temporal Evidence (Temp)	Uses temporal evidence as an extra feature of blogs in addition to their content [18].
TEMPER	A framework that selects different terms for different times and ranks blogs according to their relevancy to the query over time [40].
Relevance Stability (RS)	A probabilistic framework that measures the stability of a blog’s relevance over time [41].
TSV	The proposed time-sensitive voting method
TSB	The proposed time-sensitive blogger method
TSP	The proposed time-sensitive posting method
TSSD	The proposed time-sensitive small document method
TSLE	The proposed time-sensitive local evidence method

In order to test the statistical significance of our results, a paired Student’s t-test is computed over the per-query results at the α = 0.05 level. Statistically significant improvements of the proposed time-sensitive methods over the best existing temporal blog retrieval methods (TRP, RS and TEMPER) are shown by the ↑, ╤ and ▲ symbols, respectively. As shown in Table 5, the proposed TSV and TSLE methods perform considerably better than all existing temporal blog retrieval methods, and in all cases this improvement is statistically significant. The other proposed methods, TSB, TSP and TSSD, also perform considerably better than all the existing temporal blog retrieval methods except TEMPER.

Table 5.

Comparison of already existing temporal blog retrieval methods with proposed time sensitive methods

Model	MAP1	MAP2	NDCG	MRR	P@10
TRP	0.1612	0.2008	0.4204	0.4009	0.3498
TRB	0.0817	0.1113	0.3016	0.3643	0.2100
RV	0.1237	0.1929	0.3689	0.3978	0.2874
Temp	0.0524	0.0957	0.2832	0.2108	0.1539
TEMPER	0.2689	0.3012	0.5121	0.7178	0.5343
RS	0.1879	0.2187	0.4417	0.4327	0.3698
TSV	0.2975 ↑╤ ▲	0.3358 ↑╤ ▲	0.5237 ↑╤ ▲	0.7812 ↑╤ ▲	0.5400 ↑╤ ▲
TSP	0.2623 ↑╤	0.2934 ↑╤	0.5102 ↑╤	0.6998 ↑╤	0.5267 ↑╤
TSB	0.2176 ↑╤	0.2631 ↑╤	0.4749 ↑╤	0.6218 ↑╤	0.5087 ↑╤
TSSD	0.2373 ↑╤	0.2928 ↑╤	0.5097 ↑╤	0.6893 ↑╤	0.5145 ↑╤
TSLE	0.4689 ↑╤ ▲	0.4983 ↑╤ ▲	0.6230 ↑╤ ▲	0.8442 ↑╤ ▲	0.6898 ↑╤ ▲

Statistically significant improvements of time-sensitive blog retrieval methods over TRP, RS and TEMPER at the 0.05 level are shown by ↑, ╤, ▲ symbols respectively.
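The significance test can be sketched with the standard library alone. The sketch below only computes the paired t statistic and its degrees of freedom from two lists of per-query scores; the function name and the toy scores are illustrative assumptions. To obtain a p-value, the statistic would be compared against Student’s t distribution, for example with `scipy.stats.ttest_rel`, which performs the same paired test.

```python
import math

def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic for per-query scores of two methods on the same queries.

    Returns (t, degrees_of_freedom); raises ValueError if the test is undefined.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least two queries")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    variance = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)  # sample variance
    if variance == 0.0:
        raise ValueError("all paired differences are identical")
    return mean_diff / math.sqrt(variance / n), n - 1

# Toy per-query average precision values (illustrative only).
t_stat, dof = paired_t_statistic([0.5, 0.6, 0.7, 0.9], [0.4, 0.4, 0.6, 0.5])
```

Because the test is paired, only the per-query differences matter, so queries on which both methods perform equally well or equally badly contribute nothing to the mean difference.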

Table 6 compares the results of the proposed time-sensitive methods with their corresponding time-insensitive versions based on different measures. As is evident, all of the proposed methods perform considerably better than their counterparts on all of the criteria. The largest gains are achieved on the precision-oriented measures (P@10 and MRR). This means that the proposed methods provide a better ranking of the relevant blogs by listing highly relevant blogs at the top of the results.

Table 6.

Evaluation results for the proposed time-sensitive methods and original blog retrieval methods using time-sensitive test topics of irBlogs

Model	MAP1	MAP2	NDCG	MRR	P@10
Voting	0.1229	0.1541	0.3826	0.4242	0.2353
TSV	0.2975 ▲	0.3358 ▲	0.5237 ▲	0.7812 ▲	0.5400 ▲
Blogger	0.1785	0.1437	0.4080	0.3911	0.1900
TSB	0.2176 ▲	0.2934 ▲	0.5059 ▲	0.6218 ▲	0.5087 ▲
Posting	0.1143	0.1673	0.4020	0.4099	0.2900
TSP	0.2623 ▲	0.2631 ▲	0.4749 ▲	0.6998 ▲	0.5267 ▲
SD	0.1421	0.1842	0.4178	0.3497	0.2978
TSSD	0.2373 ▲	0.2928 ▲	0.5197 ▲	0.6893 ▲	0.5145 ▲
LE	0.2722	0.2425	0.4615	0.6031	0.4278
TSLE	0.4689 ▲	0.4983 ▲	0.6230 ▲	0.8442 ▲	0.6898 ▲

Statistically significant improvements of time-sensitive blog retrieval method over the corresponding time-insensitive method, at the 0.05 level are indicated by ▲.

5.3. Temporal query type analysis

This section analyses which types of queries are best suited to time-sensitive blog retrieval methods. First, we discuss the performance of the proposed methods for time-sensitive and time-insensitive queries, and then we look into their performance for the different types of time-sensitive queries that were discussed in Section 3.2.2.

Figure 7 shows the MAP1 score of the proposed time-sensitive blog retrieval methods for time-sensitive and time-insensitive queries of the irBlogs dataset. As can be seen, all of the proposed methods perform better for time-sensitive queries. An advantage of the proposed methods is that their performance does not degrade for time-insensitive queries, where they match or outperform their time-insensitive counterparts.

Figure 7.

Evaluation results of proposed time-sensitive blog retrieval methods for time-sensitive and time-insensitive topics of irBlogs.

Figure 8 depicts a comparison of MAP1 scores of the proposed time-sensitive methods and the corresponding time-insensitive methods for different types of time-sensitive queries.

Figure 8.

Evaluation results of the proposed time-sensitive blog retrieval methods for different types of time-sensitive test topics of irBlogs.

The time-sensitive posting (TSP) method performs better on event-sensitive queries. For the other two types (i.e. periodic time-sensitive queries and recency time-sensitive queries) its performance is similar to that of the time-insensitive methods. The time-sensitive blogger method (TSB) performs much better for event-sensitive and periodic time-sensitive queries than for recency-sensitive queries. The time-sensitive voting method (TSV) and the time-sensitive local evidence method (TSLE) perform better than the other methods; they improve on all three types of queries and in all cases this improvement is statistically significant, that is to say, the gains of the TSV and TSLE methods are independent of the type of temporal query.

In general, the proposed time sensitive methods perform better for periodic time-sensitive and event-sensitive queries.

5.4. Discussion

5.4.1. Comparison of the proposed time sensitive approaches

In this paper, our aim was not to propose a new blog retrieval method but to propose an approach that makes existing blog retrieval methods time sensitive. It was shown that considering both the content and the time stamp of the posts results in finding more relevant blogs. The accuracy of the proposed approaches therefore depends on the structure of the underlying blog retrieval methods. For example, the proposed TSV method improved the MAP1 of the original voting method by around 142%. The main reason behind this considerable improvement is that the voting method is highly dependent on the list of initially retrieved posts for the input query, named R(Q). Improving the quality of R(Q) can therefore substantially improve the voting method. Our proposed method makes use of this fact by introducing a time-sensitive post score for reordering the posts of the initial R(Q). It assigns a higher score to the posts that are written in the period of time that the input query belongs to; in this way a better R(Q) is obtained for time-sensitive queries, which has a positive impact on the precision of the voting method. Also, the TSLE method uses the top k retrieved posts of R(Q), so our proposed method improves this method in the same way.

We propose two approaches for incorporating time sensitivity in blog retrieval methods: (a) a linear combination of our proposed temporal score with the content score computed by existing methods; and (b) biasing the score of retrieved posts by means of the proposed temporal score. For time-sensitive queries, both proposed approaches improved the results of all the blog retrieval methods under investigation, and in all cases this improvement was statistically significant. Our experimental results show that the first approach is the better choice; it improved the TSLE and TSV methods by 35% and 37%, respectively, over the original blog retrieval methods in terms of NDCG.
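The relative improvements quoted in this subsection can be reproduced from the values in Table 6, assuming that each one is computed as the difference between the time-sensitive and the original score divided by the original score:

```latex
\frac{\mathrm{MAP1}_{TSV} - \mathrm{MAP1}_{Voting}}{\mathrm{MAP1}_{Voting}} = \frac{0.2975 - 0.1229}{0.1229} \approx 1.42

\frac{\mathrm{NDCG}_{TSLE} - \mathrm{NDCG}_{LE}}{\mathrm{NDCG}_{LE}} = \frac{0.6230 - 0.4615}{0.4615} \approx 0.35

\frac{\mathrm{NDCG}_{TSV} - \mathrm{NDCG}_{Voting}}{\mathrm{NDCG}_{Voting}} = \frac{0.5237 - 0.3826}{0.3826} \approx 0.37
```

These correspond to the improvements of around 142% in MAP1 for TSV, and 35% and 37% in NDCG for TSLE and TSV.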

For time-insensitive queries, as stated in Section 5.3, the first proposed approach does not deteriorate the precision of the original methods. In the case of time-insensitive queries, blog posts are distributed uniformly over time (e.g. Figure 3d). Therefore, our proposed TemporalScore is approximately the same for all retrieved posts. This means that the TimeSensitivePostScore is determined almost entirely by the content score of the posts, and the final results are roughly the same as those of the original blog retrieval methods. For the second approach, the precision of TSP and TSSD decreases for time-insensitive queries. In these methods, we compute post centrality using the $TemporalPostCentrality(P, time_{P})$ of the posts. For time-insensitive queries, the $TemporalPostCentrality$ is uniform because the TemporalScore is the same for all the retrieved posts. The precision of TSP and TSSD therefore decreases, since the original versions of these methods use a measure of similarity between the post and the blog to compute post centrality, which is better than a uniform post centrality over the blogs. This observation was also reported in the literature [23, 24].
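The behaviour of the first approach on time-insensitive queries can be illustrated with a small sketch. It assumes that the linear combination has the form (1 − α)·content + α·temporal; this form, the function names and the toy scores are illustrative assumptions and are not taken from equation (7).

```python
def time_sensitive_post_score(content_score, temporal_score, alpha):
    """Assumed linear combination of content and temporal scores."""
    return (1 - alpha) * content_score + alpha * temporal_score

def rank_posts(content_scores, temporal_scores, alpha):
    """Rank post ids by the combined score, highest first."""
    combined = {
        post: time_sensitive_post_score(content_scores[post], temporal_scores[post], alpha)
        for post in content_scores
    }
    return sorted(combined, key=combined.get, reverse=True)

# Toy scores (illustrative only).
content = {"p1": 0.9, "p2": 0.5, "p3": 0.7}
uniform_temporal = {"p1": 0.2, "p2": 0.2, "p3": 0.2}   # time-insensitive query
spiky_temporal = {"p1": 0.05, "p2": 0.9, "p3": 0.05}   # time-sensitive query

content_only = sorted(content, key=content.get, reverse=True)
uniform_ranking = rank_posts(content, uniform_temporal, alpha=0.4)
spiky_ranking = rank_posts(content, spiky_temporal, alpha=0.4)
```

When the temporal score is the same for every post, it adds the same constant to each combined score, so the content-only order is preserved; only a non-uniform temporal score can reorder the posts.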

5.4.2. Very high early precision

It is generally accepted that search engine users mostly look at the first page of the retrieved results. Therefore, we used the MRR and P@10 criteria to evaluate the top retrieved blogs of the proposed methods. The results in Table 5 show that the proposed methods achieve considerable improvement in terms of MRR and P@10, with the TSLE and TSV methods performing better than the other methods on both measures. We inspected the first retrieved blog of these two methods and found that TSLE and TSV retrieved a relevant blog at the first rank for 83% and 75% of the queries, respectively, and that both methods retrieved the first relevant blog after fifth place for only 5% of the queries. This implies that the proposed methods provide a better ranking of the relevant blogs by populating the top of the retrieved list with highly relevant blogs.

5.4.3. Per-topic analysis

In order to find out the main reason behind the better performance of the proposed methods, we looked at the retrieved lists of blogs for all queries. For query number 3, entitled ‘FIFA World Cup 2010 in Africa’, the original voting method ranks a non-relevant blog that published general news about football in first place, while the proposed method places it 219th in the ranking. Instead, the time-sensitive voting method ranks in first place a highly relevant blog that was specially devoted to ‘World Cup 2010 Africa’ (i.e. most of its posts were related to the 2010 FIFA World Cup in Africa), which is clearly due to the proposed temporal score. Another example is a blog that is ranked 55th by the voting method but 3rd by our approach. The blog covered real-time news about the World Cup in Africa. It contained few occurrences of words such as ‘FIFA’, ‘World Cup’ and ‘Africa’, but many analyses of specific players or matches in that tournament. As the voting method uses only the content score of the blogs to rank them, it ranks this highly relevant blog 55th. Since the time-sensitive voting method also uses the temporal score of the posts to rank the blogs, it performs more accurately than the voting method. The same effect was observed for many other blogs in the retrieved lists.

Also, for query ‘Iran’s ninth parliament elections’, the proposed method ranks a highly relevant blog with long posts about the candidates and their future plans about the ninth parliament election at the top of the retrieved list, while the original voting method places this blog in the 18th place.

The voting method ranks a blog fourth only because it contains some posts about the eighth parliament election and keywords related to parliament elections, although the period of that election is irrelevant to the query. The TSV method ranks that blog 67th by considering the publication time of the posts.

Another example is the query ‘Night of worship (Qadr)’, which refers to an event that recurs every year. Our proposed method ranks a highly relevant blog in ninth place, while the original voting method ranks it in 40th place.

According to the analysis conducted, our proposed method can improve a blog’s score in the following situations:

The first situation is a blog that is relevant to the query but whose content does not necessarily contain the query terms, containing instead some equivalent terms or some comments about the query. If only the content score is taken into consideration, such a blog cannot gain a good score. However, if the publication time of the blog posts is considered, the blog can achieve a higher temporal score, which compensates for the low content score.

The second situation is a blog that is not relevant but, owing to an abundance of query words, has a good content score, as in the query ‘Iran’s ninth parliament elections’ discussed before.

The third situation is when queries are event-sensitive, such as queries related to festivals, celebrations, etc. From the user’s perspective, it is crucial that the retrieved blog posts fall within the duration of that event. For example, for the ‘Fajr film festival’, the user prefers blogs that published news while the festival was being held. The proposed method assigns a higher temporal score to such blogs and pushes them to the top of the retrieved list.

6. Conclusion and future works

Time plays a vital role in blog retrieval methods and its importance cannot be ignored. In this paper, we proposed a temporal-based approach to blog retrieval that makes use of temporal properties of the input queries. First, the input queries were divided into time-sensitive and time-insensitive categories based on the distribution of their relevant posts. Then, a time importance score was calculated for each post in the initially retrieved list of relevant posts. We applied the temporal scores to improve the voting, posting and blogger methods. Since the irBlogs collection contains enough time-sensitive queries to guarantee a reliable evaluation, the proposed method was evaluated and compared based on irBlogs. The evaluation results indicated significant improvement of the blog retrieval methods, especially in terms of P@5, P@10 and MRR. The proposed method may be improved even more by applying a temporal query expansion method such as the TEMPER method.

In this paper, the input queries were manually categorized into two categories, time-sensitive and time-insensitive, to simplify the problem. Automatic categorization of the input queries therefore remains a topic for future research. Also, we believe that the proposed method can be used for other related problems, for example, information retrieval in microblogs, in which time is an even more important feature.

Footnotes

Funding

This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors.

References

1. Bar-Ilan. Information hub blogs. Journal of Information Science 2005; 31: 297–307.

2. Seki, Uehara. Opinionated document retrieval using subjective triggers. Journal of the American Society for Information Science and Technology 2011; 62: 861–876.

3. Wang, Luo. Utilizing term proximity for blog post retrieval. Journal of the American Society for Information Science and Technology 2013; 64: 2278–2298.

4. Guo, Wan. Exploiting syntactic and semantic relationships between terms for opinion retrieval. Journal of the American Society for Information Science and Technology 2012; 63: 2269–2282.

5. L-W, Liang Y-T, Chen H-H. Opinion extraction, summarization and tracking in news and blog corpora. In: AAAI spring symposium: Computational approaches to analyzing weblogs, 2006.

6. Lee, Jung H-y, Song, Lee J-H. Mining the blogosphere for top news stories identification. In: Proceedings of the 33rd international ACM SIGIR conference on research and development in information retrieval. New York: ACM, 2010, pp. 395–402.

7. Lee J-H. Identifying top news stories based on their popularity in the blogosphere. Information Retrieval 2014; 17: 326–350.

8. Weerkamp, de Rijke. Credibility improves topical blog post retrieval. Stroudsburg, PA: Association for Computational Linguistics, 2008.

9. Weerkamp, de Rijke. Credibility-inspired ranking for blog post retrieval. Information Retrieval 2012; 15: 243–277.

10. Chenlo, Parapar, Losada, Santos. Finding a needle in the blogosphere: An information fusion approach for blog distillation search. Information Fusion 2015; 23: 58–68.

11. Yun, Lee, Pyun. Correlated blog-page retrieval with structural characteristics. Computer Science and its Applications. Berlin: Springer, 2015, pp. 191–196.

12. Kim, Yun. The blog ranking algorithm using analysis of both blog influence and characteristics of blog posts. Mobile, ubiquitous, and intelligent computing. Berlin: Springer, 2014, pp. 13–17.

13. Kim, Yun. A topic-oriented information retrieval algorithm in the blogosphere. Computer Science and its Applications. Berlin: Springer, 2015, pp. 197–202.

14. Santos Rodrygo. Information retrieval on the blogosphere. Foundations and Trends in Information Retrieval 2012; 6(1): 1–125.

15. Macdonald, Santos, Ounis, Soboroff. Blog track research at TREC. ACM SIGIR Forum. New York: ACM, 2010, pp. 58–75.

16. Macdonald, Ounis, Soboroff. Overview of the TREC 2007 blog track. TREC. Citeseer, 2007, pp. 31–43.

17. Ounis, Macdonald, Soboroff. Overview of the TREC-2008 blog track. DTIC Document, 2008.

18. Nunes, Ribeiro, David. FEUP at TREC 2008 blog track: Using temporal evidence for ranking and feed distillation. DTIC Document, 2008.

19. Macdonald, Ounis. Key blog distillation: ranking aggregates. In: Proceedings of the 17th ACM conference on information and knowledge management. New York: ACM, 2008, pp. 1043–1052.

20. Ernsting, Weerkamp, de Rijke. Language modeling approaches to blog post and feed finding. In: The sixteenth text retrieval conference (TREC 2007) proceedings, 2007.

21. Weerkamp, Balog, de Rijke. Finding key bloggers, one post at a time. In: ECAI, 2008, pp. 318–322.

22. Keikha, Gerani, Crestani. Time-based relevance models. In: Proceedings of the 34th international ACM SIGIR conference on research and development in information retrieval. New York: ACM, 2011, pp. 1087–1088.

23. Elsas, Arguello, Callan, Carbonell. Retrieval and feedback models for blog feed search. In: Proceedings of the 31st annual international ACM SIGIR conference on research and development in information retrieval. New York: ACM, 2008, pp. 347–354.

24. Weerkamp, Balog, de Rijke. Blog feed search with a post index. Information Retrieval 2011; 14: 515–545.

25. Seo, Croft WB. Blog site search using resource selection. In: Proceedings of the 17th ACM conference on information and knowledge management. New York: ACM, 2008, pp. 1053–1062.

26. Lee, S-H, Lee J-H. Utilizing local evidence for blog feed search. Information Retrieval 2012; 15: 157–177.

27. Keikha, Crestani. Linguistic aggregation methods in blog retrieval. Information Processing and Management 2012; 48: 467–475.

28. Macdonald, Ounis. Voting for candidates: adapting data fusion techniques for an expert search task. In: Proceedings of the 15th ACM international conference on information and knowledge management. New York: ACM, 2006, pp. 387–396.

29. Keikha, Crestani, Croft WB. Diversity in blog feed retrieval. In: Proceedings of the 21st ACM international conference on information and knowledge management. New York: ACM, 2012, pp. 525–534.

30. Keikha, Crestani, Carman MJ. Employing document dependency in blog search. Journal of the American Society for Information Science and Technology 2012; 63: 354–365.

31. Gerani, Keikha, Carman, Gwadera, Taibi, Crestani. University of Lugano at TREC 2008 blog track. DTIC Document, 2008.

32. Yue, Finley, Radlinski, Joachims. A support vector method for optimizing average precision. In: Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval. New York: ACM, 2007, pp. 271–278.

33. Jones, Diaz. Temporal profiles of queries. ACM Transactions on Information Systems 2007; 25: 14.

34. Dakka, Gravano, Ipeirotis PG. Answering general time-sensitive queries. IEEE Transactions on Knowledge and Data Engineering 2012; 24: 220–235.

35. Croft WB. Time-based language models. In: Proceedings of the twelfth international conference on information and knowledge management. New York: ACM, 2003, pp. 469–475.

36. Efron, Golovchinsky. Estimation methods for ranking recent information. In: Proceedings of the 34th international ACM SIGIR conference on research and development in information retrieval. New York: ACM, 2011, pp. 495–504.

37. Peetz M-H, Meij, de Rijke. Using temporal bursts for query modeling. Information Retrieval 2014; 1–35.

38. Amodeo, Amati, Gambosi. On relevance, time and query expansion. In: Proceedings of the 20th ACM international conference on information and knowledge management. New York: ACM, 2011, pp. 1973–1976.

39. Rocchio JJ. Relevance feedback in information retrieval. In: The SMART Retrieval System Experiments in Automatic Document Processing. Prentice Hall, 1971, pp. 313–323.

40. Keikha, Gerani, Crestani. Temper: A temporal relevance feedback method. Advances in Information Retrieval. Berlin: Springer, 2011, pp. 436–447.

41. Keikha, Gerani, Crestani. Relevance stability in blog retrieval. In: Proceedings of the 2011 ACM symposium on applied computing. New York: ACM, 2011, pp. 1119–1123.

42. Amati. Probability models for information retrieval based on divergence from randomness. PhD dissertation, University of Glasgow, 2003.

43. AleAhmad, Zahedi, Rahgozar, Moshiri. irBlogs: A standard collection for studying Persian bloggers. Computers in Human Behavior 2015; https://dx-doi-org.web.bisu.edu.cn/10.1016/j.chb.2015.11.038

44. Zahedi, AleAhmad, Rahgozar, Oroumchian. Blog feed search in Persian blogosphere. Information Systems and Telecommunication 2014; 2: 222–231.

45. Sherkat, Rahgozar, Asadpour. Structural link prediction based on ant colony approach in social networks. Physica A: Statistical Mechanics and its Applications 2015; 419: 80–94.

46. Urbano, Marrero, Martín. On the measurement of test collection reliability. In: Proceedings of the 36th international ACM SIGIR conference on research and development in information retrieval. New York: ACM, 2013, pp. 393–402.

47. AleAhmad, Amiri, Darrudi, Rahgozar, Oroumchian. Hamshahri: A standard Persian text collection. Knowledge-Based Systems 2009; 22: 382–387.