BM25-CTF: Improving TF and IDF factors in BM25 by using collection term frequencies

Abstract

In this paper, the use of collection term frequencies (i.e. the total number of occurrences of a term in a document collection) in the BM25 retrieval model is investigated by modifying its term frequency (TF) and inverse document frequency (IDF) components. Using selected examples extracted from TREC collections, it was observed that the informativeness, for retrieval purposes, of terms that share either the same TF (in a document) or the same IDF (in a collection) may be better revealed by using collection term frequencies (CTF). Based on three new heuristics derived from those observations and on deviations from a random Poisson model, collection term frequencies were integrated into the TF and IDF factors. The novel formulations were tested on the TREC-1 to TREC-8 collections in the ad hoc task, for which BM25 was first developed and tested. Consistent and significant improvements were observed in mean average precision (MAP), reaching up to 17.67% for the TREC-8 dataset and 7.16% averaged over all tested collections. These results were considerably better than those of the other surveyed approaches that aim to improve BM25, demonstrating the effectiveness of the proposed heuristics and formulae. The proposed approach requires only additional offline pre-computations and does not entail extra computational complexity for retrieval, while keeping the original spirit and parameter robustness of BM25.

Keywords

BM25 tf-idf collection term frequency information retrieval heuristics TREC collections deviation from randomness

1 Introduction

Okapi BM25 [26] is one of the most accepted retrieval models and has maintained its state-of-the-art status in information retrieval (IR) for nearly 20 years since its inception. From the perspective of information it uses, BM25 is similar to the even better-known cosine tf.idf model [27]. Both models rely on document frequencies of the terms of a query, as collected across the indexed document collection, and their frequencies within each candidate document.

Although several other successful retrieval models such as deviation from randomness [2] and language modeling [20] employ collection term frequencies (CTF), defined as the total number of occurrences of a term in a document collection, BM25 does not make use of these statistics. In the following, we investigate the usefulness of CTF for improving the retrieval performance of BM25. Rather than proposing a radically new retrieval model, we keep the original structure of the BM25 formulation and integrate CTF into its term frequency (TF) and inverse document frequency (IDF) components.

The use of CTF in retrieval functions was introduced by Kwok (1990) and later used by others [13, 19]. In our empirical evaluations, we compare CTF and document frequencies (DF) on TREC data. We observed that terms that occur in the same number of documents in a collection (i.e. same IDF) have different levels of informativeness for retrieval depending on their CTFs. For instance, the terms “London” and “or” occur in roughly the same number of documents (∼213,000) on TREC’s disks 4 and 5, but “or” occurs almost three times more often than “London” in the same collection (considerably different CTFs). In this case, CTFs reveal the higher keywordness of “London” over “or”, which IDF cannot distinguish. A similar pattern occurs for term frequencies in a document w.r.t. CTFs. For instance, “holocaust” and “take” occur 12 times each in document CR92E-7563 (i.e. same TF), but the CTF of the latter is considerably higher than that of the former. These and other patterns occur systematically, which led us to formulate three new IR heuristics that relate TF, IDF and CTF, following the methodology of Fang et al. [6]. These heuristics provide additional support for the proposed retrieval model.

We strive to ground the proposed model on both theoretical and empirical foundations. From a theoretical standpoint, we start from the well-known observation that deviations from a Poisson generative model can be used as measures of the keywordness of words [4]. We integrate these deviations into IDF to derive a new weighting scheme called Poisson IDF (PIDF). Additionally, we integrate the ICTF factor [12] into this scheme. Given that both IDF and ICTF have plausible theoretical and empirical foundations [1, 22], rather than selecting only one of them, our model combines both, along with the proposed PIDF, in a global weighting scheme, comparable to IDF, called Boosted-IDF (BIDF). Similarly, we integrate deviations from another model, with assumptions similar to those of Poisson, into the TF component, in a new model named Boosted-TF (BTF).

An experimental evaluation of the proposed modified BM25 formula is performed on the ad hoc task, using the documents and queries/topics from the TREC-1 to TREC-8 conferences [10]. In these experiments, the models are compared by using their optimal parameters (k₁ and b) for each TREC collection in an experimental setup similar to that used by Fang & Zhai [7]. Once the best combination is determined, we carry out the experiments of the second stage, in which we determine default values for the parameters of the proposed model. In the final experimental stage, BM25 and BM25-CTF are compared using their respective default parameter values on several retrieval evaluation measures. The results show that BM25-CTF provides significant improvements and less variance across collections in comparison with the original BM25 model. To put these results in perspective, we survey related work that attempted to improve the BM25 model (as discussed in Section 3), and compare relative improvements in retrieval performance.

Finally, Section 6 provides a comprehensive summary of the model. We also discuss the additional computational cost of the proposed model. This cost arises from the need to keep track of the collection frequencies in addition to the document frequencies already required by BM25, and from the calculation of the additional factors in the novel formula. However, these extra costs are rather negligible given the substantial improvements observed in retrieval performance.

2 The BM25 retrieval formula

The retrieval model Okapi BM25 [24, 25] has the following formula: $\sum_{w \in Q} idf (w) \times \frac{(k_{1} + 1) \times tf (w, d)}{k_{1} \times K (d) + tf (w, d)} \times \frac{(k_{3} + 1) \times tf (w, q)}{k_{3} + tf (w, q)}$ In the above, Q is the set of terms common to a document d and a query q. This expression comprises four parts: the inverse document frequency (IDF) component, the term frequency (TF) component, two saturation functions controlled by parameters k₁ and k₃, and the document length normalization factor K (d). The IDF expression used in BM25 is: $idf (w) = \log (\frac{| D | - df (w) + 0.5}{df (w) + 0.5}).$

Here, df (w) is the document frequency of the term w and |D| is the number of documents in the collection. The TF component is represented by tf (w, d) and tf (w, q), which account for the number of occurrences of a term w in a document d or query q, respectively. The impact of TF values is controlled by saturation functions that prevent linear growth. This control is typically stronger for document terms, by employing small values for k₁ (usually between 1.2 and 2.0), and weaker for query terms, by using large values for k₃ (typically 500 or 1,000).

Finally, the document-length-normalization component is provided by the expression: $K (d) = (1 - b) + b \times \frac{len (d)}{avdl} .$

Here len (d) is the document length and avdl is the average document length for the collection. The parameter b ∈ [0, 1] regulates the intensity with which the normalization applies. The current BM25 formulation is a practical simplification of a two-Poisson generative model [24], which has also proven to be robust with respect to the optimization of parameters k₁, k₃ and b. In addition, as observed by Fang et al. [6], BM25 meets, to a better degree than other retrieval models, an important desideratum in IR, namely obtaining good, consistent performance in large and heterogeneous test collections while fulfilling a set of formalized heuristics.
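As an illustration of how the components of this section fit together, the following Python sketch scores one tokenized document against one tokenized query. It is a minimal sketch rather than the implementation used in the experiments: the function names, the toy statistics, the default b = 0.75 and the use of a base-10 logarithm are assumptions made for the example (the formula itself only says log).

```python
import math
from collections import Counter


def idf(df, n_docs):
    """BM25 IDF factor. The base-10 logarithm is an assumption."""
    return math.log10((n_docs - df + 0.5) / (df + 0.5))


def length_norm(doc_len, avdl, b):
    """Document-length normalization K(d) = (1 - b) + b * len(d) / avdl."""
    return (1.0 - b) + b * doc_len / avdl


def bm25_score(query, doc, df, n_docs, avdl, k1=1.2, k3=1000.0, b=0.75):
    """Score a tokenized document against a tokenized query.

    df maps each term to its document frequency in the collection.
    """
    tf_d = Counter(doc)
    tf_q = Counter(query)
    k_d = length_norm(len(doc), avdl, b)
    score = 0.0
    # Q: the terms shared by the query and the document.
    for w in tf_q.keys() & tf_d.keys():
        doc_part = (k1 + 1.0) * tf_d[w] / (k1 * k_d + tf_d[w])
        query_part = (k3 + 1.0) * tf_q[w] / (k3 + tf_q[w])
        score += idf(df[w], n_docs) * doc_part * query_part
    return score


# Hypothetical toy statistics, not taken from any TREC collection.
toy_df = {"poisson": 10, "model": 400}
toy_doc = ["poisson", "model", "poisson", "noise"]
toy_query = ["poisson", "model"]
toy_score = bm25_score(toy_query, toy_doc, toy_df, n_docs=1000, avdl=4.0)
print(f"BM25 score: {toy_score:.4f}")
```

In this toy setting the document length equals avdl, so K (d) = 1 regardless of b, and the rarer term contributes most of the score through its larger idf (w).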

3 Related work

There have been several attempts to improve BM25 by making changes to its formula, using the TREC collections and the ad hoc task for evaluation. The remainder of this section presents a non-exhaustive but representative review of previous work. See [33] for an empirical comparison of various BM25 variations.

Attempts to improve BM25 that do not make use of relevance (or pseudo-relevance) information can be classified into several categories. First, there are approaches that only modify the BM25 formulation [Mod] and others that integrate additional information [Add]. Regarding the number of parameters, the modified formulations can either increase [Par+], decrease [Par-] or maintain [Par=] the original parameters of BM25. The modified formulae may be motivated theoretically [Theo], empirically [Emp] or both. These motivations can guide the exploration through a subset of the space of possible retrieval functions [Gui] or demarcate restrictions [Res] in that space. In turn, if the search space is restricted, then it can be explored either manually [Man], randomly [Rnd] or exhaustively [Exh]. The approaches surveyed are presented in chronological order and labeled with the authors’ names, the year and a set of the proposed taxonomy tags (in square brackets).

Franz & McCarley (2000), [Mod/Par=/Emp/Gui]: They [8] aimed to exploit word repetitiveness in a factor called document length weighted word document density. This factor was integrated additively into IDF in BM25 and tested experimentally. In TREC-7, considerable improvements in MAP over BM25 were observed using titles (13.00%) and descriptions (17.85%) as queries. Nevertheless, in TREC-8, they observed a loss of -1.65% using titles and an improvement of 7.45% using descriptions.

Rasolofo & Savoy (2003), [Add/Par=/Emp/Gui]: They [21] included in BM25 an additive factor calculated from the distances between the query terms in the document. The approach was tested using short queries in TREC-8, TREC-9, and TREC-10. The improvement was significant only in TREC-8, reaching 2.43% for average precision and 4.41% for P@10. The improvements over all collections were 0.84% for average precision and 4.98% for P@10.

Fang et al. (2004), [Mod/Par=/Emp/Gui]: They [6] proposed a modification of the BM25 IDF component, which was replaced by a simpler formulation inspired by the pivoted normalization retrieval formula [29]. This modification avoids negative values in the IDF component without needing to deal explicitly with high-frequency terms (stop-words). They reported extremely high average precision improvements, in part due to the abnormally low performance observed when long-verbose (lv) queries were used. For instance, in TREC-7, while BM25 reached only 0.08 average precision, the modified version obtained 0.26, achieving a relative improvement of 225%. However, these numbers are not directly comparable with the majority of other studies (including ours) that employ the full set of queries instead of the lv subset.

Fang & Zhai (2005), [Mod/Par-/Theo/Res/Man]: They [7] proposed an axiomatic approach for restricting the space of possible retrieval functions. Through an inductive procedure, several variations of BM25 were proposed, obtaining retrieval formulae that fulfilled the axiomatic restrictions. They determined the best formula (named F2-EXP) by comparing the performance of the candidates on TREC collections using optimal parameters. Then, they analyzed the parameter sensitivity of F2-EXP against BM25, setting k₁ = 1.2 and varying only b. They noticed that, in the top 25-percentile, F2-EXP improved on BM25 by 4.19% in TREC-7 and 2.77% in TREC-8, measuring average precision using long-verbose queries. Although these improvements seemed modest, the parameter robustness was remarkable, with improvements at the bottom 25-percentile of 31.68% and 30.27% in TREC-7 and TREC-8, respectively. Nevertheless, in the AP collection, BM25 consistently outperformed F2-EXP.

Cummings & O’Riordan (2006), [Add/Par-/Gui/Rnd]: They [5] used BM25 as a benchmark instead of a base formulation for modification. Their approach used genetic programming (GP) to evolve a retrieval formula by using average precision as the fitness function. In addition to the information used by BM25, the evolved functions used variables such as collection frequencies, vocabulary size, maximum collection frequency and maximum term frequency. The proposed formulas do not include adjustable parameters. Using the CISI collection for training, the best-evolved formula for global term weighting (IDF-like) achieved improvements ranging from 7.47% to 15.72% in the OSHUMED-88 to OSHUMED-91 collections. However, when evolving a combined local-global weighting approach, they observed losses ranging from -18.22% to -8.28%.

Büttcher & Clarke (2006), [Add/Par=/Emp/Gui]: They [3] proposed an alternative to the BM25 modification of Rasolofo & Savoy, also exploiting term proximity. Improvements in the GOV2 collection reached 4.95% in P@20 and 4.65% in MAP.

Troy & Zhang (2007), [Add/Par=/Emp/Gui]: They [34] proposed to exploit the order in which the terms appear in the document with the intuition that terms occurring at the beginning are more informative than those at the end. They named this factor chronological term rank for which they explored empirically (guided by MAP) different modifications of the BM25 formula using the WSJ2 portion of the TREC collections. For four different test collections, they observed significant improvements (using title queries) of 10.71% in TREC12 (a merge of TREC-1 to TREC-3), 17.66% in TREC-8, 15.80% in TREC45 (a merge of TREC-7 and TREC-8) and 12.94% in Robust04. Similarly, significant improvements for P@10 ranged from 5.76% to 8.35% in all tested collections.

Tao & Zhai (2007), [Add/Par=/Theo-Emp/Res]: They [31] implemented a heuristic that exploited term proximity by integrating this factor into BM25 and experimentally testing its effectiveness on the TREC data. The best improvements reported were 2.91% in TREC-8 and 9.73% in Web2g using MAP.

He et al. (2011), [Add/Par=/Theo-Emp/Gui]: They [11] modified BM25 by combining counts of unigrams and bigrams to capture collocational proximity. They also combined evidence from more distant occurrences using statistical survival analysis, but this factor improved the unigram/bigram approach neither considerably nor consistently. Although trigrams performed comparably to bigrams, the improvement from combining uni/bi/trigrams was not significant. The model was tested using the ad hoc task in large collections, obtaining consistent improvements of 6.46% in WT10G.

Lv & Zhai (2011), [Mod/Par+/Emp/Gui]: They [17] noticed that BM25 penalizes very long documents excessively. Consequently, they introduced an adjustment factor to address this particular situation, while maintaining the overall properties of BM25 for regular-length documents. The adjustment factor consists of adding a constant value δ to the pivoted-normalized document length. Their approach achieved significant relative improvements in MAP of 2.55% and 1.90% on the WT10g and WT2g collections, respectively. However, on TREC-8 they reported only a modest improvement of 0.54%, which was not significant.

Lv & Zhai (2011), [Mod/Par+/Theo/Res/Man]: In a subsequent study [16], they tested a heuristic for lower-bounding TF in BM25, obtaining a maximum relative improvement in MAP of 6.02% for the WT10g collection.

Lv & Zhai (2011), [Mod/Par-/Theo-Emp/Gui]: In another study [15], they managed to remove one of BM25's parameters (k₁), while maintaining and significantly improving its retrieval performance on three out of four datasets (the TREC-8 collection among them). The improvements (using MAP) were 5.53% against BM25 with default parameters and 1.49% against a parameter-tuned BM25.

Goswami (2014), [Mod/Par-/Emp/Res/Exh]: In recent work, Goswami et al. [9] exhaustively explored a heuristically restricted space of possible retrieval functions that used the same information as BM25. They compared BM25 against 69,184 retrieval functions generated by a grammar, of which 5,407 satisfied a set of restrictions. Comparing BM25 with the best function for each collection, the improvements in MAP for TREC-3 to TREC-8 ranged from 7.33% in TREC-6 to 0% in TREC-3. They also tested with large collections achieving 10.87% in WT10g and 6.2% in GOV2.

The approach proposed in this paper adds collection term frequencies to the BM25 formula and keeps the same parameters. The candidate formulations tested were obtained by measuring deviations from randomness. Therefore, our approach could be labeled [Add/Par=/Theo-Emp/Gui]. Table 1 shows a summary of the approaches surveyed in this section, reporting minimum and maximum improvements in average precision or MAP, and in P@n measures.

Table 1
Summary of improvements obtained by the reviewed approaches

Authors and reference	Method’s name	Max. improv. AveP/MAP	Max. improv. P@n	Min. improv. AveP/MAP	Min. improv. P@n	n
Franz & McCarley	pass1 desc. using dd	17.65% (TREC-7)	16.88% (TREC-7)	7.45% (TREC-8)	0.53% (TREC-8)	20
Rasolofo & Savoy	OkapiTP	2.43% (TREC-8)	9.83% (TREC-10)	-2.66% (TREC-10)	-1.02% (TREC-9)	10
Fang & Zhai	F2-EXP	4.19% (TREC-7)	na	-6.77% (FR)	na	na
Cummings & O’Riordan	Evolv. global scheme	15.72% (OSHU90-91)	na	5.92% (OSHU-89)	na	na
Büttcher & Clarke	BM25TP	4.65% (GOV2)	4.95% (GOV2)	4.65% (GOV2)	4.95% (GOV2)	20
Troy & Zhang	CTR	17.66% (TREC-8)	8.35% (TREC-8)	12.94% (Robust04)	5.76% (Robust04)	10
Tao & Zhai	R2+MinDist	12.14% (DOE)	11.17% (AP)	1.49% (FR)	2.24% (TREC-8)	10
He et al.	NC-unordered	10.38% (Blog06)	na	1.38% (ClueWeb B)	na	na
Lv & Zhai-1	BM25L	2.55% (WT10g)	3.35% (WT2g)	0.54% (TREC-8)	-0.44% (TREC-8)	10
Lv & Zhai-2	BM25+	6.02% (WT10g)	3.38% (WT10g)	0.62% (Robust04)	0.46% (WT2g)	10
Lv & Zhai-3	BM25-adpt	5.53% (TREC-8)	na	0.75% (WT2g)	na	na
Goswami et al.	$f_{1}^{M - d}$ ; $f_{1}^{P - d}$	7.33% (TREC-6)	3.35% (TREC-7)	-1.19% (TREC-3)	-3.38% (TREC-6)	10

na: not available.

4 Improving IDF using CTF: Boosted-IDF

The intent of IDF is to capture the degree of information of a term for purposes of retrieval in a collection of documents, based on statistics gathered from the entire collection [30]. The effectiveness of IDF has been demonstrated both empirically and theoretically [4, 22], and its use is widely accepted. Although IDF uses only document frequencies, other term weighting approaches that use CTF [2, 35] have also been shown to be effective. In this section, we analyze various combinations of document frequencies df (w) and collection term frequencies ctf (w) in IDF-like factors that also satisfy some newly proposed IR heuristics involving the two statistics.

One observation that motivates the integration of document and collection frequencies is the fact that most IDF scores in a collection encompass only very few values. This is a consequence of Zipf’s law [36], which states that the vast majority of the words in the vocabulary of a document collection have low collection frequencies, and hence low document frequencies. For instance, on disks 4 and 5 from the TREC collections, the vocabulary size is 1,159,440 terms (without using stemming). Of these terms, 1,129,637 have document frequencies less than or equal to 30. Thus, 97.43% of the vocabulary maps to only 30 possible values of IDF. An IDF factor formulated as a combination of document and collection frequencies could have a wider range and thus provide a more fine-grained mapping.

4.1 Inverse collection term frequency (ICTF)

Disks 4 and 5 of the TREC collections contain |D| = 556,077 documents. From that collection, Table 2 shows some illustrative examples of pairs of frequent words with similar document frequencies (hence, similar IDFs), but with considerable differences in their CTFs (stemming was not performed and capitalization was ignored in these counts). In that table, the terms in the column labeled “w₁” seem to be less informative than their counterparts in the “w₂” column. Consider the following heuristic suggested by that data:

Table 2
Some frequent terms in TREC disks 4 and 5 with same df (w)

w₁	w₂	df	ctf (w₁)	ctf (w₂)
or	financial	∼240 k	1,040,203	348,832
who	london	∼212 k	580,625	298,358
if	countries	∼184 k	469,752	267,949
years	industries	∼152 k	320,219	173,030
under	years	∼151 k	388,373	320,219
so	words	∼145 k	314,128	163,066
could	news	∼138 k	254,405	201,873
should	column	∼135 k	312,355	171,760
do	text	∼121 k	298,720	128,493
her	private	∼45 k	167,453	87,111

H₁: Assume two terms w₁ and w₂, which occur independently of each other in a collection of documents, have similar (and high) document frequencies df (w₁) ≅ df (w₂). If their collection frequencies are substantially different, ctf (w₁) ⪢ ctf (w₂), then the average number of occurrences in a document, which is ctf (w)/df (w), will be much higher for w₁ than for w₂. A term weighting scheme should account for this difference, and the weight applied to the term w₁ should be less than the weight for w₂ in order to compensate for the fact that w₁ tends to appear more times than w₂ in the documents that contain them.

Therefore, CTF could provide an additional clue for distinguishing the informative nature of terms. In our scenario, terms that occur with overall high frequencies in a document collection are per se low-informative terms, more likely to be stop-words. Unfortunately, some “good keywords” that appear in a similar number of documents are indistinguishable from the stop-words using the current IR heuristics [6].

Originally proposed by Kwok [12], the inverse-CTF factor (ICTF) makes use of collection term frequencies ctf (w) to weight the terms in a document collection. In this section, we incorporate ICTF as a complement for IDF implementing the proposed heuristic H₁.

Thus, ICTF can be combined with IDF in a weighting scheme able to distinguish between terms having the same idf (w) score by employing the collection frequency ctf (w). The formula for ICTF is $ictf (w) = \log (\frac{M}{ctf (w)}),$ where M is the total number of terms in the collection. This is analogous to the simplified version of IDF: log(|D|/df (w)). The two weighting factors can be combined in a simple way by multiplying them: $ictf.idf (w) = ictf (w) \times idf (w).$

The co-domain of ictf (w) is the interval [0, log(M)], which, when multiplied by that of idf (w), makes the co-domain of ictf.idf (w) considerably larger than that of idf (w). The values returned by ictf.idf (w) range from 0 to log(M) × log((|D| + 0.5)/1.5). Therefore, ictf.idf (w) provides a finer-grained and wider discrimination of the informativeness of the terms in comparison with idf (w).

For example, let us compare the scores obtained by idf (w) and ictf.idf (w) for a pair of terms from Table 2: w₁ = “who” and w₂ = “london”. Note that idf (w₁) = 0.2097 ≅ idf (w₂) = 0.2057, while ictf.idf (w₁) = 0.5686 < ictf.idf (w₂) = 0.6170 (M = 298,349,366). In this example, ictf.idf (w) reveals the higher informativeness of “london” compared with “who”.
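The comparison above can be reproduced approximately with a short Python sketch. The names are illustrative, the shared document frequency of 212,000 is read off the rounded “∼212 k” entry in Table 2 (the exact df values are not given), and the base-10 logarithm is an assumption, chosen because it is consistent with the scores quoted above.

```python
import math

N_DOCS = 556_077        # |D| for TREC disks 4 and 5, as stated in the text
N_TOKENS = 298_349_366  # M, total number of terms, as stated in the text


def idf(df, n_docs=N_DOCS):
    """BM25 IDF factor (base-10 logarithm assumed)."""
    return math.log10((n_docs - df + 0.5) / (df + 0.5))


def ictf(ctf, n_tokens=N_TOKENS):
    """Inverse collection term frequency (base-10 logarithm assumed)."""
    return math.log10(n_tokens / ctf)


def ictf_idf(df, ctf, n_docs=N_DOCS, n_tokens=N_TOKENS):
    """Product of the ICTF and IDF factors."""
    return ictf(ctf, n_tokens) * idf(df, n_docs)


# Rounded df from Table 2 (an approximation); ctf values are from Table 2.
DF_SHARED = 212_000
score_who = ictf_idf(DF_SHARED, 580_625)
score_london = ictf_idf(DF_SHARED, 298_358)
print(f"who: {score_who:.4f}  london: {score_london:.4f}")
```

Because the rounded df is used for both terms, the two idf (w) values coincide here and the printed scores differ slightly from those quoted in the text; the ordering, with “london” scoring higher than “who”, is what the sketch is meant to show.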

Now, let us analyze the ictf.idf (w) function across the entire spectrum of document frequencies. Figure 1 shows a comparison between the scores obtained by idf (w) and ictf.idf (w) in the same document collection used before. In this figure, all curves are standardized to [0, 1] and the horizontal axis is a logarithmic scale of df (w). It can be seen that the scores obtained with ictf.idf (w) follow a trend similar to that of raising idf (w) to a power. For comparison, idf (w)^1.5, idf (w)^2.0 and idf (w)^2.5 are plotted in the same figure as dotted lines. In addition, the ictf.idf (w) scores of a sample of 5% of the documents in TREC 8-6 are shown as small gray markers. The upper bound of these markers draws a line that lies mostly between the 1.5 and 2.0 powers of idf (w). Singhal [28] tested these powers, obtaining the best improvement using idf (w)^1.5 in TREC 2-4. These results improved the retrieval performance considerably, especially at high-recall levels, with marginal losses at low recall. It is clear that the upper bound of ictf.idf (w) behaves like idf (w)^1.5 and that most of the terms with df (w) > 100 lie between idf (w)^1.5 and idf (w)^2.0.

Fig. 1

ictf.idf (w) versus idf (w) and some of its powers.

Figure 1 also shows a dashed black line with an estimate (proposed by Church and Gale [4]) for ictf.idf (w) using the Poisson model as follows: $idf (w) = - \log (1 - e^{- \frac{ctf (w)}{| D |}}).$ (1)

In Fig. 1, the upper bound of the markers coincides with this estimate, showing that the ictf.idf (w) scores are consistent negative deviations from that model.

From this analysis, it can be observed that the ICTF factor produces two effects when multiplied by IDF. First, the resulting function behaves similarly to raising the idf (w) function to a power, which may improve retrieval performance according to the observations of Singhal. Second, the differences between ctf (w) and df (w) produce negative deviations in the resulting ictf.idf (w) function with respect to the case when df (w) ≅ ctf (w), i.e. no deviation. Although this behavior implements the heuristic H₁, the effect shown in Fig. 1 is stronger for low document frequencies than for the high document frequencies that motivated the heuristic.

4.2 Improving IDF using Poisson IDF

Table 3 shows some examples of pairs of relatively low-frequency terms from TREC’s disks 4 and 5 that have the same document frequency but differ substantially in their collection term frequencies. Unlike in Table 2, the terms in the column labeled w₁ seem to be better keywords than those in the column w₂. Thus, assuming that the TF heuristic is valid (see TFCs in [6]), we propose our second heuristic:

Table 3
Some low-frequent terms in TREC disks 4 and 5 with same df (w)

w₁	w₂	df	ctf (w₁)	ctf (w₂)
farwest	outspokenly	30	228	30
kissimmee	unequalled	50	262	50
barracuda	inwardly	70	303	70
suriname	publicizing	80	230	80
mammography	ostentatious	150	624	150
sediment	deftly	300	820	304
europa	astounding	400	752	423
pbs	rhetorical	500	1170	537
dna	temper	600	2473	642
curry	overtaken	700	1347	718

H₂: Assume two terms w₁ and w₂ that occur independently in an equal and relatively small number of documents in a collection, for example, 100 out of 500,000. As before, df (w₁) = df (w₂). Now, if ctf (w₁) ⪢ ctf (w₂), then the IDF factor of w₁ should be greater than that of w₂, because it is likely that the term frequencies of w₁ within the documents where it occurs, are higher than those of w₂. Consequently, w₁ is a better keyword than w₂ for the documents in which they occur respectively.

As H₂ and the examples shown suggest, the integration of ctf (w) and idf (w) could provide a TF-like effect in the IDF factor. This subsection describes how H₂ can be implemented by measuring deviations of document and collection term frequencies from a random model.

Church and Gale [4] showed that, apart from its original motivation, IDF also captures deviations from a Poisson model. The Poisson model assumes that the occurrences of terms in a collection are evenly distributed across documents. Therefore, when the occurrences of a given term w unexpectedly accumulate in some documents, its document frequency df (w) is less than its expected value under Poisson, $\hat{df} (w)$. They also noted that terms with considerable deviations from Poisson are better keywords than those with small deviations. As they showed, IDF intrinsically reflects deviations from Poisson because a large deviation means a low observed df (w) and hence a large idf (w) score. However, since these deviations are measured only implicitly by IDF, it fails to discriminate between two terms w₁ and w₂ with the same observed document frequency, df (w₁) = df (w₂), but with different deviations from Poisson, $\hat{df} (w_{1}) \neq \hat{df} (w_{2})$. These terms could be differentiated by making these deviations explicit in the formulation of IDF.

Deviations from the Poisson model can be measured explicitly only when both ctf (w) and df (w) are available. Knowing ctf (w), it is possible to obtain an estimator for df (w) and vice versa. The Poisson estimator $\hat{df} (w)$ can be extracted from the expression inside the logarithm in Equation 1: $\hat{df} (w) = | D | \times (1 - e^{- \frac{ctf (w)}{| D |}})$ (2)

Figure 2 shows the relationship between collection term frequencies and document frequencies for the Poisson model (Equation 2), together with observed data points. While at low frequencies both values are almost equal, as ctf (w) grows, df (w) is asymptotically limited by the number of documents in the collection |D|. Inverting Equation 2 provides an estimator for the collection term frequency, $\hat{ctf} (w)$: $\hat{ctf} (w) = - | D | \times \ln (1 - \frac{df (w)}{| D |})$
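For completeness, the inversion can be spelled out step by step (a worked derivation added for illustration, using only the symbols of Equation 2). Replacing the estimate $\hat{df} (w)$ in Equation 2 by the observed df (w), and the observed ctf (w) by the unknown $\hat{ctf} (w)$, gives: $\frac{df (w)}{| D |} = 1 - e^{- \frac{\hat{ctf} (w)}{| D |}} \;\Rightarrow\; e^{- \frac{\hat{ctf} (w)}{| D |}} = 1 - \frac{df (w)}{| D |} \;\Rightarrow\; \hat{ctf} (w) = - | D | \times \ln (1 - \frac{df (w)}{| D |})$

For df (w) ≪ |D|, the expansion $- \ln (1 - x) \approx x + \frac{x^{2}}{2}$ yields $\hat{ctf} (w) \approx df (w) + \frac{df (w)^{2}}{2 | D |}$, which is consistent with both quantities being almost equal at low frequencies.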

Fig.2

ctf (w) versus df (w), estimated df (w) and deviations of real data from the Poisson model.

As mentioned above, under the Poisson assumption, the accumulation of the occurrences of w in some documents leads to the order relationship $df (w) \leq \hat{df} (w)$. Symmetrically, ctf (w) is consistently higher than its estimate, i.e. $\hat{ctf} (w) \leq ctf (w)$. Another less obvious order relationship is $\hat{ctf} (w) \leq \hat{df} (w)$. Experimentally, we found that more than 99.99% of the terms on disks 4 and 5 in the TREC collections satisfy these three relations. The few exceptions that do not fulfill $\hat{ctf} (w) \leq \hat{df} (w)$ are stop-words such as ‘at’, ‘the’, ‘to’, ‘a’, ‘of’, ‘from’, etc. Therefore, based on these relationships, the proposed expression for measuring the deviation from Poisson in combination with idf (w) is: $pidf (w) = \log (\frac{\hat{df} (w)}{\hat{ctf} (w)} + 1)$

The above expression makes the deviations from the Poisson model explicit by relating the observed ctf (w) and df (w) through their respective estimates. In this formula, a log-ratio was used to be consistent with the logarithmic formulation of IDF. Also, when the observed ctf (w) and df (w) deviate from the Poisson model, the corresponding pidf (w) score diverges positively from idf (w) at a logarithmic rate.
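As a hedged illustration of the PIDF factor, the Python sketch below computes the two Poisson estimators and pidf (w) for the first pair of terms in Table 3. The function names are illustrative and the base-10 logarithm in pidf is an assumption (the formula only says log); the estimators use the natural exponential and logarithm, as in Equation 2 and its inverse.

```python
import math

N_DOCS = 556_077  # |D| for TREC disks 4 and 5, as stated in the text


def df_hat(ctf, n_docs=N_DOCS):
    """Expected document frequency under Poisson (Equation 2)."""
    # -expm1(-x) equals 1 - exp(-x), but is more accurate for small x.
    return n_docs * -math.expm1(-ctf / n_docs)


def ctf_hat(df, n_docs=N_DOCS):
    """Expected collection frequency under Poisson; requires df < n_docs."""
    # log1p(-x) equals ln(1 - x), but is more accurate for small x.
    return -n_docs * math.log1p(-df / n_docs)


def pidf(df, ctf, n_docs=N_DOCS):
    """Poisson IDF: log of (df_hat / ctf_hat + 1), base 10 assumed."""
    return math.log10(df_hat(ctf, n_docs) / ctf_hat(df, n_docs) + 1.0)


# First row of Table 3: both terms have df = 30.
pidf_farwest = pidf(df=30, ctf=228)
pidf_outspokenly = pidf(df=30, ctf=30)
print(f"farwest: {pidf_farwest:.4f}  outspokenly: {pidf_outspokenly:.4f}")
```

With these inputs the term whose occurrences concentrate in few documents (“farwest”) receives the larger pidf (w), in line with H₂, while a term with ctf (w) = df (w) stays close to log 2.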

Apparently, in the low-frequency range, if df (w) = ctf (w), then it means that w follows the Poisson model. Our hypothesis is that as df (w) and ctf (w) diverge, the informativeness of w increases, which diverge from Poisson,
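The estimators and the PIDF factor of this subsection can be illustrated with a minimal Python sketch. The function names, the use of the natural logarithm for "log", and the example counts (a collection of 500,000 documents and two terms with the same ctf) are assumptions made only for illustration; they are not taken from the experiments.

```python
import math


def df_hat(ctf, n_docs):
    """Poisson estimate of the document frequency from ctf (Equation 2)."""
    return n_docs * (1.0 - math.exp(-ctf / n_docs))


def ctf_hat(df, n_docs):
    """Inverse of Equation 2: Poisson estimate of ctf from df (needs df < n_docs)."""
    return -n_docs * math.log(1.0 - df / n_docs)


def pidf(ctf, df, n_docs):
    """Log-ratio between both estimates; natural log is an assumption here."""
    return math.log(df_hat(ctf, n_docs) / ctf_hat(df, n_docs) + 1.0)


# Hypothetical counts: both terms occur 1,000 times in the collection.
N_DOCS = 500_000
spread_term = pidf(ctf=1000, df=1000, n_docs=N_DOCS)    # one occurrence per document
clustered_term = pidf(ctf=1000, df=200, n_docs=N_DOCS)  # occurrences concentrated in 200 documents

print(f"pidf, spread term:    {spread_term:.3f}")
print(f"pidf, clustered term: {clustered_term:.3f}")
```

Under these assumptions, the term whose occurrences are concentrated in fewer documents receives the larger pidf value, which is the behavior the hypothesis above calls for.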

4.3 Boosted-IDF

The reader may notice that the proposed heuristics H₁ and H₂, respectively implemented by ICTF and PIDF, oppose each other. Although both seem plausible at the extremes of the frequency spectrum, neither fully applies in the mid-range. Therefore, it seems reasonable that each one applies at its own extreme, and a combination of both applies in the mid-range. The simplest combination method for ICTF, PIDF and IDF is multiplication. Thus, our proposed IDF factor, named Boosted-IDF, is: $bidf (w) = ictf (w) \times pidf (w) \times idf (w)$

5 Improving TF using CTF: Boosted-TF

As in the previous section, the integration of CTF and TF is motivated by the observation that, within a document, the informativeness of terms with the same TF can be distinguished by their CTFs. A new heuristic for information retrieval is formulated in accordance with that observation. Next, Boosted-TF, a new TF factor that implements that heuristic, is proposed by using deviations from randomness.

Table 4 shows examples of pairs of terms that occur the same number of times in a particular document (i.e., same TFs) but differ substantially in their CTFs. This particular document has a length of 10,753 words. Unlike the terms in the “w₁” column, the “w₂” terms provide a good intuition about the topic of the document. In this example, CTF reveals an informative nature in the terms that TF alone is unable to distinguish. Consider the following heuristic:

Table 4
Pairs of terms in document CR93E-7563 (in TREC disks 4 and 5)

w ₁	w ₂	tf (w₁)	tf (w₂)	ctf (w₁)	ctf (w₂)	$\hat{tf} (w_{1})$	$\hat{tf} (w_{2})$	$\frac{tf (w_{1})}{\hat{tf} (w_{1})}$	$\frac{tf (w_{2})}{\hat{tf} (w_{2})}$
policy	bosnian	28	28	158,524	15,893	8.41	1.74	3.329	16.092
issue	un	19	19	128,829	44,728	7.02	3.09	2.707	6.149
committee	yugoslavia	13	13	185,000	13,560	9.65	1.63	1.347	7.975
take	holocaust	12	12	166,612	2,009	8.79	1.09	1.365	11.009
end	atrocities	11	11	168,161	1,360	8.86	1.06	1.242	10.377
time	herzegovina	10	10	379,877	14,079	18.76	1.66	0.533	6.024
times	serbia	9	9	372,016	7,578	18.39	1.35	0.489	6.667
two	sarajevo	8	8	391,394	11,808	19.30	1.55	0.415	5.161
news	genocidal	6	6	201,873	173	10.44	1.01	0.575	5.941
years	airstrikes	3	3	320,219	185	15.97	1.01	0.188	2.970

H₃: Assume two terms w₁ and w₂ that occur independently the same number of times in a particular document in a collection. Now, if ctf (w₁) ⪢ ctf (w₂), then w₁ is a more common term than w₂ across the document collection, and hence less informative in general. Consequently, w₂ is a better keyword for the document than w₁.

This heuristic can be analogously formulated using document frequencies instead of collection term frequencies because, if ctf (w₁) ⪢ ctf (w₂), then it is very likely that df (w₁) ⪢ df (w₂) too.

The methodology for obtaining a new TF factor integrated with CTF is similar to that used for the development of PIDF (Subsection 4.2): the observed tf (w, d) value is compared with a plausible estimate $\hat{tf} (w, d)$ , and the deviation between the two is measured. This estimate cannot be obtained using df (w), since it only considers binary document occurrences, ignoring term frequencies in the documents. Unlike df (w), ctf (w) can be used to estimate the term frequencies by assuming, similarly to the Poisson model, that terms occur independently and uniformly across documents.

Consider the portion of the length of the document due to repetitions of terms. The length of this portion is len (d) - |d′|, where len (d) is the document length, d′ is the set of terms in document d, and |d′| is the vocabulary size of d. Note that len (d) and |d′| must be computed including stop-words. According to our assumption, this portion should be distributed among the terms in the document vocabulary in a manner weighted by CTF. Therefore, the more frequent the terms in the collection, the larger their share of that portion. The expression for this term frequency estimator is: $\hat{tf} (w, d) = 1 + \frac{ctf (w)}{\sum_{w' \in d'} ctf (w')} \times (len (d) - | d' |)$

The initial 1 in this expression reflects the assumption that w occurs at least once in d. Note that this estimate keeps the document length unchanged: the |d′| initial ones plus the redistributed len (d) - |d′| repetitions add up to len (d), so it satisfies the following restriction: $\sum_{w \in d'} tf (w, d) = \sum_{w \in d'} \hat{tf} (w, d) = len (d)$
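As an illustration of this estimator, the following Python sketch computes $\hat{tf} (w, d)$ for a toy document and checks the length restriction. The function name and the toy document are hypothetical; the ctf values of "policy" and "bosnian" are those of Table 4, while the count for "the" is an invented placeholder.

```python
def estimate_tf(doc_tf, ctf):
    """Estimated term frequencies for one document.

    doc_tf: term -> observed tf(w, d), stop-words included.
    ctf:    term -> collection term frequency ctf(w).
    """
    doc_len = sum(doc_tf.values())        # len(d)
    repetitions = doc_len - len(doc_tf)   # len(d) - |d'|
    ctf_total = sum(ctf[w] for w in doc_tf)
    # One guaranteed occurrence plus a CTF-weighted share of the repetitions.
    return {w: 1.0 + ctf[w] / ctf_total * repetitions for w in doc_tf}


# Toy document (hypothetical): 20 tokens over a 3-term vocabulary.
toy_doc = {"policy": 4, "bosnian": 4, "the": 12}
toy_ctf = {"policy": 158_524, "bosnian": 15_893, "the": 5_000_000}  # "the" is a made-up count

estimates = estimate_tf(toy_doc, toy_ctf)
for term, value in estimates.items():
    print(f"{term}: tf = {toy_doc[term]}, estimated tf = {value:.3f}")
print("sum of estimates:", round(sum(estimates.values()), 6))
```

In this sketch the two terms with equal tf receive different estimates, the rarer one ("bosnian") getting the smaller estimate, and the estimates sum to the document length.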

The formulation of Boosted-TF follows from the intuition that terms w whose observed tf (w, d) values are larger than their estimates $\hat{tf} (w, d)$ should be useful keywords for the document d. That is, a term that occurs many times in the document but is rare at the collection level is a good keyword for that document. Conversely, if a term occurs in the document fewer times than its estimate, it could be disregarded as a keyword for that document. Note that this argument is somewhat similar to the motivation of the popular tf.idf term weighting scheme.

Table 4 also shows the $\hat{tf} (w, d)$ estimates for the example terms. Moreover, the ratio $\frac{tf (w, d)}{\hat{tf} (w, d)}$ , which measures the divergence from our model, is considerably higher for the terms in the column headed “w₂” (better keywords) than for those in the “w₁” column. Similar patterns were observed when performing the same analysis on the other six documents from the same collection with more than 10,000 words.

Finally, the proposed expression for Boosted-TF uses the ratio between observed and estimated term frequencies adjusted by the factor C (d), as follows: $btf (w, d) = C (d) \times \frac{tf (w, d)}{\hat{tf} (w, d)}, \quad C (d) = \frac{len (d)}{\sum_{w' \in d'} \frac{tf (w', d)}{\hat{tf} (w', d)}}$

The purpose of C (d) is to maintain the values of btf (w, d) close to the original tf (w, d) by satisfying the constraint $\sum_{w \in d'} btf (w, d) = len (d)$ . Incidentally, this factor could prevent unexpected interactions with the pivoted-document-length-normalization component of BM25.
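A minimal Python sketch of Boosted-TF, under stated assumptions, may help to see how the scaling factor C (d) acts. The function name and the toy document are hypothetical; the ctf values of "policy" and "bosnian" come from Table 4, and the count for "the" is an invented placeholder.

```python
def boosted_tf(doc_tf, ctf):
    """btf(w, d) = C(d) * tf(w, d) / tf_hat(w, d) for every term of one document."""
    doc_len = sum(doc_tf.values())        # len(d), stop-words included
    repetitions = doc_len - len(doc_tf)   # len(d) - |d'|
    ctf_total = sum(ctf[w] for w in doc_tf)

    ratios = {}
    for term, tf in doc_tf.items():
        tf_estimate = 1.0 + ctf[term] / ctf_total * repetitions
        ratios[term] = tf / tf_estimate   # observed over estimated frequency

    # C(d) rescales the ratios so that they sum to the document length.
    scale = doc_len / sum(ratios.values())
    return {term: scale * ratio for term, ratio in ratios.items()}


# Toy document (hypothetical): "policy" and "bosnian" share the same tf.
toy_doc = {"policy": 4, "bosnian": 4, "the": 12}
toy_ctf = {"policy": 158_524, "bosnian": 15_893, "the": 5_000_000}  # "the" is a made-up count

btf = boosted_tf(toy_doc, toy_ctf)
for term, value in btf.items():
    print(f"{term}: tf = {toy_doc[term]}, btf = {value:.3f}")
```

With these toy counts the rarer term "bosnian" ends up with a larger btf than "policy" despite the equal tf, the very frequent term is pushed below its raw tf, and the btf values still add up to len (d).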

6 BM25-CTF model summary

In this section, we present a summary of the BM25-CTF retrieval model. The relevance score of a document d for a query q is: $\sum_{w \in q} bidf (w) \times \frac{(k_{1} + 1) \times btf (w, d)}{k_{1} \times K (d) + btf (w, d)} \times \frac{(k_{3} + 1) \times btf (w, q)}{k_{3} + btf (w, q)}$

Here, w is an index term (i.e. a word), d is a document viewed as a collection of index terms, and K (d) is the same document-length normalization factor used in BM25. The parameters of the model and their proposed default values are k₁ = 1.25, b = 0.90, and k₃ = any value from 100 to 1000 (1000 was used in our experiments). The values for k₁ and b are determined experimentally in Section 7.
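The scoring function can be sketched in Python as follows. This is only an illustrative sketch: the function and argument names are hypothetical, the bidf and btf values are assumed to be precomputed off-line and passed in as dictionaries, and K (d) is assumed to take the usual BM25 form (1 - b) + b × len (d) / avgdl, which is not restated in this section.

```python
def bm25_ctf_score(query_btf, doc_btf, bidf, doc_len, avg_doc_len,
                   k1=1.25, b=0.90, k3=1000.0):
    """Relevance score of one document for one query.

    query_btf: term -> btf(w, q) for the query terms.
    doc_btf:   term -> btf(w, d) for the document terms.
    bidf:      term -> precomputed Boosted-IDF weight.
    """
    # Assumed standard BM25 document-length normalization K(d).
    length_norm = (1.0 - b) + b * doc_len / avg_doc_len

    score = 0.0
    for term, q_btf in query_btf.items():
        d_btf = doc_btf.get(term, 0.0)
        if d_btf == 0.0:
            continue  # terms absent from the document contribute nothing
        doc_part = (k1 + 1.0) * d_btf / (k1 * length_norm + d_btf)
        query_part = (k3 + 1.0) * q_btf / (k3 + q_btf)
        score += bidf.get(term, 0.0) * doc_part * query_part
    return score


# Hypothetical precomputed weights for a two-term document of average length.
example_doc = {"bosnian": 10.0, "policy": 7.0}
example_bidf = {"bosnian": 2.0, "policy": 0.5}
print(bm25_ctf_score({"bosnian": 1.0, "policy": 1.0}, example_doc, example_bidf,
                     doc_len=100, avg_doc_len=100))
```

Since bidf (w) and btf (w, d) are read from precomputed tables, the loop does the same amount of work per query term as a plain BM25 scorer.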

Unlike BM25, BM25-CTF requires keeping track of all collection term frequencies ctf (w) during the indexing stage, which requires the same space as df (w). Computing and storing bidf (w) requires practically the same time and space as idf (w). Similarly, the space used for tf (w, d) can now be used for btf (w, d), with the marginal cost of storing real-valued instead of integer data. Note that btf (w, d) requires an additional off-line pass through the documents to calculate the scaling factor C (d). Finally, if the documents in the collection are relatively short, then ctf (w) ≈ df (w), and so BM25-CTF ≈ BM25. Consequently, the improvement of BM25-CTF over BM25 can only be observed in collections containing relatively long documents.

7 Experimental validation

7.1 Experimental setup

The aim of the experiments was to compare the original BM25 formula with BM25-CTF. As a testbed, the ad hoc task was carried out on TREC data using the same configuration of document collections (disks), queries (topics) and relevance judgments (qrels) as TREC-1 to TREC-8 [10]. To compose the query terms, all fields of the topics were used, i.e. verbose queries. In addition, both documents and queries were pre-processed by tokenization (English tokenizer in NLTK 2.0), stemming (Porter stemmer) and removal of 410 words from the InQuery stop-list. Next, tokens containing only numeric characters were removed, with the exception of the years between 1900 and 2010. Finally, special characters such as underscores, punctuation marks, parentheses and others were replaced with space characters.

For each combination of document collection, set of queries and retrieval function, a list containing the 1,000 documents with the highest scores was retrieved. Next, the program trec_eval v8.1 produced evaluation measures (MAP, P@10 and others) for this sorted list by comparing it with its corresponding qrels. Finally, the statistical significance of the differences between BM25 and the proposed function was assessed using Wilcoxon’s signed-rank test.

7.2 Parameter analysis

The robustness of the BM25 parameters has been widely accepted and demonstrated [32], which means that small changes in them produce only small changes in retrieval performance, and default values are usually a good choice for different collections. On this basis, we compared the optimal k₁ and b of BM25 and BM25-CTF throughout the tested collections (see Table 5). The most important result of this comparison is that all standard deviations for both parameters in both test settings (i.e. TREC-8 to TREC-1 and TREC-8 to TREC-4) are lower for BM25-CTF. Thus, the robustness of BM25 is not only preserved but improved by BM25-CTF, making the usual cross-validation for adjusting model parameters unnecessary.

Table 5
Optimal parameters for BM25 and BM25-CTF using MAP across collections

model	param.	TREC-8	TREC-7	TREC-6	TREC-5	TREC-4	TREC-3	TREC-2	TREC-1	Average TREC-8 to 1 (std. dev.)	Average TREC-8 to 4 (std. dev.)
BM25	k₁	0.80	1.20	0.90	0.70	1.50	1.50	1.25	1.40	1.16 (0.30)	1.02 (0.33)
	b	0.95	0.75	0.85	1.00	0.60	0.75	0.70	0.75	0.79 (0.12)	0.83 (0.16)
BM25-CTF	k₁	1.30	1.10	0.90	1.25	1.50	1.50	1.50	1.50	1.32 (0.21)	1.21 (0.22)
	b	0.85	0.95	0.95	0.95	0.70	0.80	0.75	0.80	0.84 (0.09)	0.88 (0.11)

Figure 3 shows a comparison of the retrieval performance of BM25-CTF (left) and BM25 (right) in the search grid of parameters. The surface represents the average MAP over TREC-8 to TREC-1 using the corresponding parameters in the grid floor.

Fig.3

Average of MAP from TREC-8 to TREC-1 in a parameters search grid for BM25 and BM25-CTF.

Regarding the robustness of the parameters, one can see in Fig. 3 that the surface of BM25-CTF looks a bit “flatter” than that of BM25. This shows that changes around the optimal parameter values produce smaller variations in performance for BM25-CTF than for BM25, which leads us to conclude that the robustness of the proposed model is comparable to (or even better than) that of BM25. Table 6 shows the optimal parameters for both models using MAP and P@10. The resulting optimal parameters for BM25 are very close to the well-known default parameters suggested by Robertson [23], i.e. k₁ = 1.2 and b = 0.75. However, the optimal values of k₁ for BM25 differ between MAP and P@10. Unlike BM25, the optimal parameters for BM25-CTF agree for both measures.

Table 6

Optimal parameters for BM25 and BM25-CTF for TREC-1 to 8

Model	Measure	Value	k ₁	b
BM25	Average MAP	0.2611	1.20	0.80
BM25-CTF	Average MAP	0.2798	1.25	0.90
Relative improvement in average		7.16%	-	-
BM25	Average P@10	0.5210	1.40	0.80
BM25-CTF	Average P@10	0.5248	1.25	0.90
Relative improvement in average		0.73%	-	-

8 Results and discussion

Table 7 compares the retrieval performance of BM25 and BM25-CTF for each document collection. The results using MAP show consistent improvements of the proposed BM25-CTF over BM25 in all collections except TREC-2, in which a marginal loss was observed. It is important to note that in TREC-4 to TREC-8, BM25-CTF obtained significantly better results. Although the improvements of BM25-CTF measured using P@10 were not as significant and consistent as those using MAP, no significant loss was observed.

Table 7
Comparison for each collection using its default parameters

	MAP			P@10
TREC	BM25	BM25-CTF	Improv.	BM25	BM25-CTF	Improv.
1	0.323	0.3305	2.32%	0.5158	0.5248	1.75%
2	0.3404	0.3401	-0.09%	0.4416	0.4576	3.62%
3	0.3287	0.3316	0.88%	0.68	0.674	-0.88%
4	0.2028	0.2204*	8.68%	0.582	0.582	0.00%
5	0.2141	0.2373*	10.84%	0.656	0.654	-0.30%
6	0.2282	0.2611*	14.42%	0.454	0.458	0.88%
7	0.2146	0.2427*	13.09%	0.414	0.416	0.48%
8	0.2332	0.2744*	17.67%	0.428	0.4520*	5.61%

[*] significantly better using Wilcoxon’s test (p-value < 0.05).
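The averaged MAP figures and relative improvements discussed in this section can be recomputed from the per-collection MAP values of Table 7, as in the short Python sketch below. The variable and function names are illustrative only; the numbers are copied from Table 7.

```python
# MAP per collection, TREC-1 to TREC-8, copied from Table 7.
map_bm25 = [0.3230, 0.3404, 0.3287, 0.2028, 0.2141, 0.2282, 0.2146, 0.2332]
map_bm25_ctf = [0.3305, 0.3401, 0.3316, 0.2204, 0.2373, 0.2611, 0.2427, 0.2744]


def mean(values):
    return sum(values) / len(values)


def relative_improvement(baseline, proposed):
    """Relative improvement of `proposed` over `baseline`, in percent."""
    return 100.0 * (proposed - baseline) / baseline


# Average over all eight collections, and over TREC-4 to TREC-8 only.
avg_all = (mean(map_bm25), mean(map_bm25_ctf))
avg_recent = (mean(map_bm25[3:]), mean(map_bm25_ctf[3:]))

improv_all = relative_improvement(*avg_all)
improv_recent = relative_improvement(*avg_recent)
improv_trec8 = relative_improvement(map_bm25[-1], map_bm25_ctf[-1])

print(f"TREC-8 to 1: {avg_all[0]:.4f} vs {avg_all[1]:.4f} ({improv_all:.2f}%)")
print(f"TREC-8 to 4: {avg_recent[0]:.4f} vs {avg_recent[1]:.4f} ({improv_recent:.2f}%)")
print(f"TREC-8 only: {improv_trec8:.2f}%")
```

Note that the improvement of the averages is computed from the averaged MAP values, not as the average of the per-collection percentages.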

Table 8 shows a comparison between BM25 and BM25-CTF, each used with its corresponding default parameter values, for various evaluation measures. BM25-CTF obtained improvements in all measures, particularly in MAP and GMAP. Regarding statistical significance, for each measure we counted the number of collections in which BM25-CTF was significantly better (labeled B*) and significantly worse (labeled W-) than BM25. Over the 7 performance measures considered, BM25-CTF was significantly better 20 times and worse only 2 times out of the 56 comparisons (8 collections × 7 evaluation measures). In addition, the results and improvements averaged from TREC-8 to TREC-4 were much better than the averages across all TRECs. This result shows that BM25-CTF performs better with shorter and less structured queries, such as those in the more recent TRECs [10], which are more common in practical applications.

Table 8

Results for other performance measures

			Average from TREC-8 to 1			Average from TREC-8 to 4
Measure	B*	W-	BM25	BM25-CTF	improv.	BM25	BM25-CTF	improv.
MAP	5	0	0.2606	0.2798	7.34%	0.2186	0.2472	13.08%
GMAP	na	na	0.1667	0.1791	7.40%	0.1183	0.1339	13.13%
Average Precision	na	na	0.2821	0.3005	6.53%	0.2423	0.2699	11.38%
R-precision	2	0	0.3088	0.3241	4.94%	0.2640	0.2876	8.96%
bpref	3	0	0.2921	0.3076	5.32%	0.2396	0.2629	9.72%
Reciprocal Rank	1	1	0.7248	0.7278	0.42%	0.6775	0.6990	3.17%
P@10	1	0	0.5158	0.5248	1.75%	0.4416	0.4576	3.62%
P@100	3	1	0.3081	0.3147	2.13%	0.2104	0.2255	7.17%
P@500	5	0	0.1494	0.1554	4.00%	0.0864	0.0948	9.74%

[B*] number of collections where BM25-CTF was significantly better than BM25 (p-value < 0.05). [W-] number of collections where it was significantly worse.

Although our experimental setup differs from many of the works surveyed in Section 3, it is worth highlighting that the average relative improvement in MAP from TREC-8 to TREC-4 (13.08%) is larger than many of the improvements reported in these works. Considering magnitude and consistency, we believe that our results are comparable with those obtained by Troy and Zhang [34]. It is important to note that the improvements obtained by BM25-CTF are consistently better than any of the previously published improvements when direct comparison is possible.

9 Conclusion

We described an approach to improve the BM25 model by including collection term frequencies in its formulation. The proposed model was obtained by measuring deviations from randomness and implementing three newly proposed heuristics. Consistent and significant improvements of the proposed model over the classic BM25 showed that the three heuristics and the proposed formulation are effective. Since collection term frequencies are readily available in most retrieval systems, the proposed approach does not increase the computational complexity of BM25. In our experiments on TREC data, the new model performs consistently better than BM25 and exhibits less variance across the collections. Furthermore, the results indicate that the new model is robust with respect to the parameters k₁ and b, for which new empirically determined default values are recommended.

Acknowledgments

The fourth author acknowledges the support of the Mexican Government via SNI, CONACYT, and the Instituto Politécnico Nacional, SIP grants 20172008 and 20172044.

References

1. Aizawa, An information-theoretic perspective of tf-idf measures, Inf Process Manag 39(1) (2003), 45–65.

2. Amati and Van Rijsbergen C.J., Probabilistic models of information retrieval based on measuring the divergence from randomness, ACM Trans Inf Syst 20(4) (2002), 357–389.

3. Büttcher and Clarke C.L.A., Efficiency vs. effectiveness in terabyte-scale information retrieval, In Proceedings of TREC 2005 (2005).

4. Church K.W. and Gale W.A., IDF: A measure of deviations from Poisson, Natural Language Processing Using Very Large Corpora, Springer (1999), 283–295.

5. Cummins and O’Riordan, Evolving local and global weighting schemes in information retrieval, Information Retrieval 9(3) (2006), 311–330.

6. Fang, Tao and Zhai C.X., A formal study of information retrieval heuristics, In Proceedings of the 27th ACM SIGIR (2004), 49–56.

7. Fang and Zhai C.X., An exploration of axiomatic approaches to information retrieval, In Proceedings of the 28th ACM SIGIR (2005), 480–487.

8. Franz and McCarley J.S., Word document density and relevance scoring, In Proceedings of the 23rd ACM SIGIR (2000), 345–347.

9. Goswami, Moura, Gaussier, Amini and Maes, Exploring the space of IR functions, Advances in Information Retrieval, Springer (2014), 372–384.

10. Harman, TREC: Experiments and Evaluation in Information Retrieval, Chapter The TREC Test Collections, MIT Press (2005), 21–52.

11. , Huang J.X. and Zhou, Modeling term proximity for probabilistic information retrieval models, Information Sciences 181(14) (2011), 3017–3031.

12. Kwok K.L., Experiments with a component theory of probabilistic information retrieval based on single terms as document components, ACM Trans Inf Syst 8(4) (1990), 363–386.

13. Kwok K.L., A new method of weighting query terms for adhoc retrieval, In Proceedings of the 19th ACM SIGIR (1996), 187–195.

14. Lee, IDF revisited: A simple new derivation within the Robertson-Spärck Jones probabilistic model, In Proceedings of the 30th ACM SIGIR (2007), 751–752.

15. and Zhai C.X., Adaptive term frequency normalization for BM25, In Proceedings of the 20th ACM CIKM (2011), 1985–1988.

16. and Zhai C.X., Lower-bounding term frequency normalization, In Proceedings of the 20th ACM CIKM (2011), 7–16.

17. and Zhai C.X., When documents are very long, BM25 fails!, In Proceedings of the 34th ACM SIGIR (2011), 1103–1104.

18. Pirkola and Järvelin, Employing the resolution power of search keys, Journal of the American Society for Information Science and Technology 52(7) (2001), 575–583.

19. Pirkola, Leppänen and Järvelin, The RATF formula (Kwok’s formula): Exploiting average term frequency in cross-language retrieval, Information Research 7(2) (2002).

20. Ponte J.M. and Croft W.B., A language modeling approach to information retrieval, In Proceedings of the 21st ACM SIGIR (1998), 275–281.

21. Rasolofo and Savoy, Term proximity scoring for keyword-based retrieval systems, In Proceedings of the 25th ECIR (2003), 207–218.

22. Robertson, Understanding inverse document frequency: On theoretical arguments for IDF, Journal of Documentation 60(5) (2004), 503–520.

23. Robertson, TREC: Experiments and Evaluation in Information Retrieval, Chapter How Okapi Came to TREC, MIT Press (2005), 287–300.

24. Robertson and Walker, Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval, In Proceedings of the 17th ACM SIGIR (1994), 232–241.

25. Robertson, Walker, Beaulieu M.M., Gatford and Payne, Okapi at TREC-4, In Proceedings of the 4th TREC (1996), 73–96.

26. Robertson, Walker, Jones, Hancock-Beaulieu M.M. and Gatford, Okapi at TREC-3, In Proceedings of the 3rd TREC (1994), 109–126.

27. Salton, Wong and Yang C.S., A vector space model for automatic indexing, Commun ACM 18(11) (1975), 613–620.

28. Singhal, Term Weighting Revisited, PhD thesis, Cornell University, 1997.

29. Singhal, Modern information retrieval: A brief overview, IEEE Data Engineering Bulletin 24(4) (2001), 35–43.

30. Spärck Jones, IDF term weighting and IR research lessons, Journal of Documentation 60(5) (2004), 521–523.

31. Tao and Zhai C.X., An exploration of proximity measures in information retrieval, In Proceedings of the 30th ACM SIGIR (2007), 295–302.

32. Taylor, Zaragoza, Craswell, Robertson and Burges, Optimization methods for ranking functions with multiple parameters, In Proceedings of the 15th ACM CIKM, 2006.

33. Trotman, Puurula and Burgess, Improvements to BM25 and language models examined, In Proceedings of ACM ADCS ’14 (2014), 58–65.

34. Troy A.D. and Zhang G.Q., Enhancing relevance scoring with chronological term rank, In Proceedings of the 30th ACM SIGIR (2007), 599–606.

35. Zhai C.X. and Lafferty, A study of smoothing methods for language models applied to ad hoc information retrieval, In Proceedings of the 24th ACM SIGIR (2001), 334–342.

36. Zipf G.K., Human Behaviour and the Principle of Least-Effort, Addison-Wesley, 1949.
