Abstract
Novice users are unable to express their information needs properly, due to this it is difficult to retrieve all the desired relevant documents from the test collection. The problem of word mismatch is fundamental to information retrieval. Query expansion is a technique in which additional terms are added to retrieve relevant documents. In this article, we have expanded the query using pseudo-relevance feedback and Haar wavelet transform. The performance of the proposed technique is evaluated on FIRE 2011 ad hoc English test Collection and Robust dataset. The mean average precision of the proposed model on the FIRE dataset and Robust Track dataset is 0.3334 and 0.2724, respectively.
1. Introduction
In information retrieval (IR), the user poses a query to search engines and it retrieves documents that are relevant to the user query. The problem of word mismatch is fundamental in IR. The words of query might not appear in a document despite it being relevant to the query. In such cases, a non-retrieved document might contain relevant information. Query Expansion is a technique to overcome the problem of word mismatch. In this technique, after retrieving an initial set of documents the original query is expanded by appending additional words that are obtained by applying some or other technique on top k retrieved documents. It reformulates the original query to improve retrieval performance. The expansion words may be synonyms of words, semantically related words (like antonyms, hyponyms, etc.), or various morphological forms of words. Global and local methods are two major classes of query expansion. Pseudo-relevance feedback also known as blind relevance feedback is one of the local methods. Local methods adjust the query relative to the document that initially appears to match the query [1]. It retrieves an initial set of relevant documents, then assumes that top ‘k’ rank documents are relevant, and from these top ‘k’ documents expansion terms are automatically selected. In Global methods, expansion terms are selected from large corpora. Global methods use semantically similar terms to the original query terms. It uses thesaurus-based query expansion. Global methods are broadly classified into four categories: (a) linguistic-based, (b) corpus-based, (c) search log-based and (d) web-based.
Vector space [2,3] and probabilistic [4] methods are based on the hypothesis that the document is relevant to query if it contains more occurrences of the query terms in the document. Generally, these are similarity-based methods. In vector space, query and documents are represented as a vector in which each term represents a dimension in that space. Some IR methods are based on term proximity and are known as proximity-based methods. Clarke and Cormack [5] proposed a hypothesis that the document is more relevant to the query if query terms are found in smaller proximity to each other. In this positional information is used to achieve higher precision.
Spectral-based document retrieval is a mix of vector and proximity model. It finds relevant documents by considering the query term occurrence pattern in the documents. The documents that follow similar positional patterns are considered more relevant than those documents in which query term does not follow a similar pattern. Spectral-based retrieval compares query terms in spectral domain rather than the spatial domain. Wavelet transform is used for spectral-based document retrieval. Wavelet transform is a mathematical tool which allows us to break the given signal into a wavelet of different scale and position. It allows us to analyse the signal at different frequencies and its positions.
The proposed work is motivated by the work of Park et al. [6]. They proposed document retrieval using discrete wavelet transform. Alnofaie et al. [7] used wavelet transform for document ranking and then query is expanded using Kullback–Leibler (KL) divergence. KL divergence is applied on top ‘k’ retrieved documents.
Our proposed query expansion method uses pseudo-relevance feedback and Haar wavelet transform for query expansion. The spectral concept is applied on documents that are obtained after the initial round of retrieval for selecting expansion terms. It analyzes query term frequency and its position in the initial retrieved documents. The partition (component) of a document, in which query terms are close to each other is selected as expansion region and from it, all terms are selected for query expansion. The novelty in the proposed method is that we have applied Haar wavelet transform for query expansion instead of document retrieval and ranking [6]. The architecture of proposed model is shown in Figure 1. The proposed method is also compared with the existing query expansion methods. The article is organised as follows: Section 2 discusses related work. Sections 3 and 4 discuss multi-resolution and wavelet transform. Section 5 discusses query expansion and discrete Haar wavelet and the proposed algorithm is also presented in this section. Section 6 presents experimental results. Section 7 analyses the result. Section 8 presents the conclusion and future work.

Process of query expansion using Haar wavelet transform.
2. Related work
A number of document ranking and expansion models are proposed over the years to improve the performance of the retrieval system. Query Expansion is a technique to improve the performance of the retrieval system by appending few terms in the initial query [8]. Salton and Buckley [9] proposed relevance feedback using vector space model. Rocchio Algorithm is one of the oldest algorithms based on the vector space model [10]. Singhal et al. [11] showed that better results are obtained by using only documents that are close to the query. Koenemann and Belkin [12] did user study for the effectiveness of relevance feedback. Riezler et al. [13] used statistical Natural Language Processing (NLP) approach for Relevance Feedback. Qiu and Frei [14] and Schütze [15] used automatic thesaurus generation for query expansion. Miller [16] proposed thesaurus (WordNet)-based query expansion technique. It uses synonyms words of the query words for expansion. The approach faces a query drifting problem. To avoid query drifting, Leung et al. [17] proposed a context-aware personalised query suggestion clustering-based technique. In this technique, instead of suggesting a similar query to every user, they suggested clustered user click-through data to anticipate user intention. Session-based [18] query term suggestion is another important query expansion technique that asserts that every search query on the same session is related to one or another in a way or two. Ganguly et al. [19] proposed query expansion technique based on sentence similarity. They used sentences instead of adding terms, and they concluded that their method outperforms the previous state-of-the-art methods. Latiri-Cherif et al. [20] used fuzzy-based query expansion technique to improve the performance of the retrieval system. Bhogal et al. [21] proposed ontology heredity concept to find the expansion terms by calculating the similarity between ontology and query words. Previous studies [22–24] proposed latent semantic indexing, probabilistic latent semantic indexing and latent Dirichlet allocation method. They tried to reduce the query mismatch problem by finding semantically related terms for query expansion. ALMasri et al. [25] make a comparison among deep learning–based technique, pseudo-relevance feedback technique and mutual information–based technique. Xu and Croft [26] proposed query expansion using global and local document analysis and showed that both global and local analysis technique is more effective than simple local feedback. Zingla et al. [27] proposed hybrid query expansion technique for text and microblog IR. El Mahdaouy et al. [28] proposed word-embedding-based pseudo-relevance feedback-based query expansion technique for Arbic IR.
Wavelet transform plays a very important role in text analysis and text retrieval [29]. Miller et al. [30] used wavelet transform in text document at a different portion of the document, containing the main topic related to the query for ranking the document. The literature reveals that the performance of the IR system degrades due to word mismatch and query drift problems.
3. Term signal of document
A wavelet function is defined by
where
where
4. Multi-resolution and wavelet transform
Fourier transform [32] and family of unitary transform [33] provide information about the frequency component that exists in the signal. The transformation is useful for a stationary signal where frequency exists as a whole. In-text retrieval Fourier transform gives the frequency of query terms for the whole document. It does not provide frequency information at different positions of the document. Discrete Fourier transform of term signal
where
Short-time Fourier transform (STFT) [34] is another transform which transforms term signal into frequency domain. The STFT of term signal f(t) with frequency
where g[t−v] is the window function centred at v.
The resolution must be set in order to extract certain time-frequency information from a signal in STFT. By using the STFT, we could tell where in time the impulses occurred, but we would only be able to notice high-frequency changes. Any slow changes to the signal are not noticeable in the frequency analysis due to windowing.
Multi-resolution analysis (MRA) is required to examine a signal across different time-frequency resolution scales. To achieve this, wavelet transform [35] is applied.
A wavelet is a small wave of finite length. A wavelet is scaled and translated with parameters b and a, respectively
Scaling factor keeps the norm equal to one. The wavelet transform of f(t)
where
where set of

Enclosing scaling function
From Figure 2, the following holds
also
Wavelet transform is applied to the word signal, and it gives the position of the word in the desired resolution.
A wavelet must satisfy the following properties
(1) It has zero mean, that is
(2) Unit norm, that is
(3)It has vanishing moment, that is,
(4) The Fourier transform of
For a continuous wavelet, scaling function can be set at any scale while in case of discrete wavelet scaling is allowed only dyadic scale, that is, for a discrete wavelet if scaling function
5. Query expansion and discrete Haar wavelet transform
The family of N Haar functions
for any values of
Determining value of k, p and q for Haar transform.
Now Haar function can be defined as
when
when
where p specifies the magnitude and width (or scale) of the shape and q specifies the position of the shape (shift). The Haar transform is real and orthogonal, that is
The value of t is computed as
Haar transform for word signal gives the location of the word at different wavelet components. In the proposed method, initially, the user issues a query, and in response to it, the system returns an initial set of ‘k’ documents. The proposed method does an automatic local analysis using Haar wavelet on these top ‘k’ retrieved documents. The Haar wavelet helps in automatically searching the expansion region in the top k retrieved documents. The expansion terms are then selected from the spectral domain rather than the spatial domain. Finally, the expansion terms are appended in the initial query for the final round of retrieval. Figure 1 describes the Haar wavelet-based query expansion method. The proposed algorithm is shown in algorithm 1.
Wavelet Query Expansion HWQE(Q[],D[]).
For example, if the term signal of query term ‘cosmic’ in a document d is computed as [3,1,0,0,2,1,1,1] where 3, 1, 0, 0, 2, 1, 1, 1 is the occurrences of term ‘cosmic’ in the first, second, third, fourth, fifth, sixth, seventh and eighth component of document d, respectively, then spectral vector for query term ‘cosmic’ is computed as
6. Experimental result
The experiment is performed on the FIRE 2011 ad hoc English test collection and Robust Track dataset. The size of test collection is 1.1 GB and it contains 392,577 documents. The size of the Robust dataset is 2.1 GB and it contains 528,155 documents. The test collection is indexed using Terrier3.5 search engine [38]. We have used the InL2c1.0 model to index the test collection. The Stopwords are removed and PorterStemmer is used to stem the words. In this experiment, we have evaluated 50 queries, and query expansion is done using the top ‘k’ (k = 30) documents. The sample query of the ad hoc test collection is shown in Figure 3. The mean average precision (MAP) comparison of the proposed method with the other methods is shown in Table 2.

Example of topic file.
Comparison of mean average precision of proposed method with other methods.
MAP: mean average precision; CS: chi-square divergence; KLD: Kullback–Leibler divergence.
The performance of the original query (Title only), BM25 model, query expansion using word-embedding-based semantic model, query expansion using Bo1 model [39], query expansion using Bo2 model, query expansion using chi-square divergence (CS) model, query expansion using Kullback–Leibler divergence (KLD) model and query expansion with Wavelet are shown in Table 3. Comparative improvement of proposed model with other models is shown in Table 4. The performance of proposed model with other models for long query is shown in Table 5. Figure 4 shows the performance of the original query with respect to the Wavelet-based query expansion technique. The Wavelet technique performs well on some of the queries but could not perform well on other queries. Out of 50 queries, the Wavelet technique performs well on 24 queries and 19 queries with respect to the original query and Bo1 model, respectively. For this experiment, the MAP value is 0.3334. An improvement of 0.9691%, 4.97% and 11.91% is observed with respect to the Bo1 model, word-embedding-based query expansion model [40] and original query, respectively. A Z statistics is defined as
Comparison of proposed model with other models.
KL: Kullback–Leibler.
Comparative improvement of proposed model with other models.
MAP: mean average precision; KL: Kullback–Leibler.
Bold values indicate the best result.
Comparison of proposed model with other models for long query.
KL: Kullback–Leibler.

Performance of Bo1-based query expansion versus original query.
The provided sample mean is
Comparison of proposed model with other models on Robust Track dataset.
KL: Kullback–Leibler.
7. Discussion
It is clear from Figures 4–7 that the proposed model performs well. The MAP has an improvement of 0.9691% in comparison to the Bo1 model, 4.97% improvement in comparison to the word-embedding-based query expansion model and 11.91% improvement in comparison to the original query. Query-by-query analysis reveals that out of 50 queries proposed model performs well on 24 queries and 19 queries in comparison to the original query and Bo1 model, respectively. The proposed model performs well on some queries but not on all queries. It is observed that the Bo1 model also not performs well on all queries with respect to the original query. Bo1 model performs well on 37 queries with respect to the original query. It is observed that proposed method outperforms the Bo1 model with respect to MAP, R Precision, P@5, P@10, P@20 and P@50. The proposed model also retrieves 117, 33 and 15 more relevant documents with respect to the original query, word-embedding-based query expansion model and Bo1 model, respectively. The proposed model performs well on query number Q127, Q131, Q132, Q133, Q135, Q136, Q139, Q140, Q144, Q146, Q151, Q153, Q156, Q157, Q160, Q169, Q171, Q173 and Q174 with respect to Bo1 model because query terms of these queries are close to each other in the initially retrieved top 30 documents. On other queries, the Bo1 model performs well, because the Bo1 model selects expansion terms according to their term distribution in the document. Bo1 model uses Divergence from Randomness (DFR)-based term weighting model to extract the most informative terms. So if terms are properly distributed in the documents, there is a higher probability to select good terms. It is clear that whenever the query terms are close to each other in the document, then the wavelet-based query expansion method retrieves good expansion terms from the components; otherwise, expansion terms may drift the query. Overall performance of the proposed method is better than both original query and Bo1 model query expansion. The performance of the proposed model with respect to other models on the Robust Track dataset is presented in Table 6. The proposed model performs better on all parameters with respect to the original query and chi-square-based query expansion model. The proposed model performs better with respect to other models on P@5, P@10 and P@20 parameters. Sample query from topic file with expansion terms using Haar wavelet is shown in Table 7.

Performance of wavelet-based query expansion versus original query.

Performance of query expansion using wavelet versus Bo1 model.

Performance of QE using wavelet versus QE using Bo1 model versus original query.
Sample query from topic file and expansion terms using Haar wavelet.
8. Conclusion
In this article, a query expansion method using wavelet transform is proposed. Proximity and term frequency are considered for selecting expansion terms from the initially retrieved top 30 documents. The region where terms are close to each other are considered as expansion regions and terms belonging to this region are used as expansion terms. The proposed method performs better in comparison to the Bo1 model on FIRE 2011 ad hoc English test collection. The experimental result shows that there is an improvement of 0.9691%, 4.97% and 11.91% in comparison to the Bo1 model, word-embedding-based query expansion model and original query, respectively. The proposed model has also been compared with other models on the Robust dataset. The MAP value of the proposed model on the Robust Track dataset is 0.2724. The proposed model has an improvement in MAP, R precision, P@5, P@10, P@20 and P@50 parameters with respect to the original query and chi-square-based query expansion model. The proposed model also has an improvement on P@5, P@10, P@20 and P@50 parameters with respect to the Bo1 model and KL divergence-based query expansion model, respectively. The proposed model also performs better on P@5, P@10 and P@20 parameters with respect to the Bo2-based query expansion model. The proposed method also faces query drifting problem like other query expansion methods. In the future, we will try to further improve the proposed method using machine learning techniques.
Footnotes
Acknowledgements
One of the authors is pursuing doctoral programme from Department of Mathematics and Computer Applications, Maulana Azad National Institute of Technology, Bhopal, India, as full time research scholar. He expresses his sincere gratitude towards the institute for giving opportunity and required facility to pursue his doctoral programme from the institute. The authors have used Terrier search engine for performing their experiments and we are thankful for that. The authors are thankful to FIRE for providing the dataset. The authors are also thankful to TREC NIST for providing the Robust Track dataset for performing the experiment.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship and/or publication of this article.
