Improving personalised query reformulation with embeddings

Abstract

As a mechanism to guide users towards a better representation of their information needs, the query reformulation method generates new queries based on users’ historical queries. To preserve the original search intent, query reformulations should be context-aware and should attempt to meet users’ personal information needs. The mainstream method aims to generate candidate queries first, according to their past frequencies, and then score (re-rank) these candidates based on the semantic consistency of terms, dependency among latent semantic topics and user preferences. We exploit embeddings (i.e. term, user and topic embeddings) to use contextual information and individual preferences more effectively to improve personalised query reformulation. Our work involves two major tasks. In the first task, candidate queries are generated from an original query by substituting or adding one term, and the contextual similarities between the terms are calculated based on the term embeddings and augmented with user personalisation. In the second task, the candidate queries generated in the first task are evaluated and scored (re-ranked) according to the consistency of the semantic meaning of the candidate query and the user preferences based on a graphical model with the term, user and topic embeddings. Experiments show that our proposed model yields significant improvements compared with the current state-of-the-art methods.

Keywords

Embeddings personalisation query addition query reformulation query substitution

1. Introduction

Web search queries are typically short, ill-defined representations of more complex information needs [1]. Consequently, they can lead to poor retrieval performance. To alleviate this problem, modern search engines provide some services that assist users to find their desired content quickly by presenting a list of related queries. One such service is query auto-completion, which helps users to formulate their queries while typing in the search box [2]. Query auto-completion is restricted to using queries containing the same prefix with the original query entered by users to locate the possible user intentions. Another type of service is provided in the search result page after one or more queries have already been issued to the search engine. This is generated using two techniques. One is query recommendation (suggestion), which seeks to generate candidate queries by locating historical queries in a query log according to its relational proximity to the input query. The query generated is usually one already present in the query log. The other is query reformulation, which seeks to modify the part of the original query mainly by (1) adding, (2) deleting or (3) replacing terms in the query [3]. Unlike suggestion, reformulation offers candidates diverse expressions (new queries that were previously unseen in the query log) to describe a wider variety of users’ potential information needs. In addition, it can return relevant candidate queries.

Song et al. [3] and Bing et al. [4] observed that the terms addition and substitution are performed in most cases by users while reformulating queries, and such users generally only modify one of the terms in their queries. These findings explain why most previous work on query reformulation only modifies one term [3 –6]. Therefore, we address query reformulation and focus on one-term additions and substitutions in this study. Sullivan [7] showed that over 80% of users preferred to receive personalised search results and that different users with different backgrounds may have different personalised preferences, even if they issue the same query. Therefore, it may be more beneficial to provide candidate query lists in light of individual user’s interests rather than a generalised candidate query list. Given this, we further validate the benefits of providing users with candidate query reformulations based on their interests.

To date, popular solutions [5] for generating personalised query reformulations generally rely on query logs and typically involve two tasks: (1) generating candidate queries – it is assumed that two terms with a similar context (co-occurring with a similar set of terms in queries) can be substituted with each other and the terms that co-occur frequently with words in the query have a higher probability of being added to the query. A context-based translation model is employed to generate candidate queries that maintain semantic relations with the input query; (2) scoring (re-ranking) candidate queries – typically, the latent Dirichlet allocation (LDA) model estimates the latent semantic topic for each term, after which a graphical model is applied to evaluate the quality of the candidates generated. The graphical model is used to estimate consistency in the semantic meaning of a candidate query by considering both the semantic dependency of query terms and dependency among latent topics of terms. User personalisation is incorporated into the graphical model. These methods, though quite successful, suffer from three major shortcomings. First, the context-based candidate generation method involves a global analysis over the entire collection. It is expensive and time-consuming to use the translation model to capture the contextual similarity between two terms. Therefore, only the first encapsulating term on the left side (1st left) and the first one on the right side (1st right) of each given term were taken into account. Second, only the users’ individual preferences were considered while scoring the candidate queries. However, personalisation is also important for candidate generation as users usually formulate a distinct set of queries. Third, the graphical model is restricted to modelling dependency between latent topics and terms by capturing the term context in the historical queries, but without considering other evidence, such as the contexts of latent topics.

Word embedding, also known as word representation, has become increasingly popular in encoding words as vectors, and this is attributed to an algorithm called Word2vec proposed by Mikolov et al. [8] and to the availability of the related software package.¹ These word embeddings (embedded vector representations) learned through word embedding techniques have been proven to accurately capture meaningful semantic and contextual word information. The distance between vectors serves as a proxy for their semantic or contextual similarity. While word embeddings have been used to represent query terms and queries successfully [9,10], the straightforward method of using embeddings to personalise query reformulation has been rarely explored. Motivated by the fact that word embeddings are capable of providing a convenient means of identifying a set of words that have contexts similar to those of a given word, we use this property to resolve the shortcomings in the previous work [4,6]. Along the lines of past research [4,5], our work involves two subtasks: candidate query generation and scoring; the latter is completed using the graphical model proposed by Bing and Lam [5]. This article aims to investigate the utilisation of word embeddings while generating candidate queries according to user preferences and the contextual similarity between terms. We make use of word embeddings to improve the effectiveness of candidate query scoring by providing more evidence that captures the semantic consistency of terms, the dependency among latent topics and user preferences for the graphical model. Compared with previous works [4,6], the major contributions of our study are as follows:

While generating query substitution or addition candidates, context information for each term is derived using term embedding that applies the sliding window-based context around the terms. More flexible contexts are allowed; making it is easier to allow richer contexts, rather than relying solely on the 1st left and the 1st right contextual terms. This may lead to better modelling of contextual similarity between terms.

In addition, we investigate the personalisation capability by constructing user vectors based on term embeddings while generating candidate queries. This ensures that candidate queries can be generated based on each user’s individual preference.

We propose a new topic embedding approach to obtain the context word distribution for each latent topic. It is useful to consider richer contexts in modelling the semantic dependency of terms and the dependency among latent topics while scoring candidate queries.

We conduct extensive experiments on a real-world query log and perform a thorough analysis to verify the effectiveness of our proposed model.

The rest of this article is organised as follows. In the ‘Related work’ section, we review the related work on query log-based query suggestion and word embedding for information retrieval. The sections ‘Models for personalised query reformulation’ and ‘Embeddings’ present a detailed description of the proposed approaches, including candidate query generation and scoring, and embeddings, respectively. The experimental setup and results are introduced in detail in the ‘Experimental setup’ and ‘Experimental results’ sections, respectively. The ‘Conclusion’ section concludes the article and offers possible directions for future research.

2. Related work

In this section, we discuss work related to different aspects of our study, focusing on

Query log-based query suggestion;

Query log-based query reformulation;

Word embedding for information retrieval.

2.1. Query log-based query suggestion

As a result of having a considerable commercial value, extensive research on query suggestion has been performed over the past few years, and the search engine query log, known as ‘wisdom of crowds’, has proven to be a key asset in most of the literature. Most query log-based approaches estimate the relevance of candidate items by exploiting the observable evidence in a query log, including shared terms [11], common user sessions [11,12] and commonly clicked results [13]. Based on the evidence, some methods, such as the association rule [14] and random walk [13,15,16], have been adopted successfully to generate relevant candidate query suggestions. However, such strategies are often limited to generating relevance-oriented query suggestions. To overcome this problem, a few researchers have attempted other techniques, such as learning-to-rank, LDA and deep learning, to mine more unobservable features that are hidden in a query log to produce higher quality queries. For example, Ozertem et al. [17] applied a learning-to-rank approach and trained rankers on features derived from queries that share sessions and clicks to find useful and relevant queries. Saeedeh and Fabian [18] constructed virtual documents from a query log by combining all unique queries derived from all successful sessions that share the same final query and then used the LDA model to infer topic distributions for virtual documents. The topics inferred were subsequently employed to retrieve the virtual documents that were thematically similar to the original query, and the virtual document’s corresponding title is returned as a suggestion. Sordoni et al. [19] introduced a novel hierarchical recurrent encoder–decoder architecture to interpret sequences of previous queries of any length and produced context-aware suggestions. Dehghani et al. [20] attempted to employ the sequence-to-sequence model with a query-aware attention mechanism to encode the structure of the session while acquiring session-based query suggestions. Jiang and Wang [21] presented a novel approach called the Reformulation Inference Network (RIN) for context-aware query suggestions by learning to represent query reformulations and modelling the way in which users reformulate queries over search sessions.

A few researchers assumed that users’ interests are recorded in the query log and attempted to infer users’ preferences from the query log to provide personalised query suggestions. For instance, Mei et al. [13] proposed a query suggestion algorithm for ranking candidate queries using hitting time by random walking on a large bipartite query-click graph and adopted personalisation with back-off of the second and the third bytes of IP addresses to the proposed algorithm. Jiang et al. [22] designed a novel query suggestion framework known as the personalised query suggestion with diversity awareness (PQS-DA) to effectively integrate diversification and personalisation. They first proposed a multi-bipartite representation to use the query log to its full potential and developed a regularisation framework to determine the most relevant candidate suggestion. The cross-bipartite hitting time model was then leveraged to determine the remaining query candidates that were not only relevant to the initial query but also different from each other. Finally, the user profiling model constructed (beta distribution) based on previously used words and clicked URL was used to generate personalised candidate suggestions according to each user’s interests. Chen et al. [23] proposed a model for query suggestion with personalisation and diversification, and they exploited users’ short-term search context in their current session and their long-term search history to identify users’ search interests. Chen et al. [24] proposed an attention-based hierarchical neural query suggestion model to realise personalised query suggestions by combining a hierarchical structure with a session-level neural network and a user-level network to demonstrate the short- and long-term search behaviours of a user while the attention mechanism captures the user’s preference.

2.2. Query log-based query reformulation

Inspired by the intuition that a good query suggestion system should reproduce users’ natural reformulation patterns [20], a few researchers have examined query reformulation, which considers the special type of reformulations that users should perform to produce satisfactory queries. A more recent query reformulation technique primarily exploited query logs, and almost all of them only considered changing one term from the original queries to generate candidates. For example, Wang and Zhai [6] calculated the contextual similarity of terms in a search history to perform query addition and substitution, and a context-based translation model was exploited to generate and score candidate queries. Song et al. [3] built a term-level query graph based on data sourced from query logs and proposed two models to address topic- and term-level query suggestions (one of the three techniques, namely term deletion, expansion or modification, was leveraged). Their second (term-level) model followed the framework of supervised learning-to-rank, and term modifications were cast as documents to treat each query reformulation as a training instance. Features were extracted for each (query, document) pair from ODP, Wikipedia and so on.

By exploiting the method used by Wang and Zhai [6] to generate substitution candidate queries, Bing and Lam [5] introduced a graphical model that detects semantic dependencies between terms and topics by exploiting the latent topic space inferred from a query log. This work has been extended further by Bing et al., [4] who implemented both query addition and substitution and proposed social tagging data (i.e. Delicious bookmark) and query log-based candidate pattern generation. While generating additional candidates based on query logs, they also considered whether the query providing the term context attracted a click or was located at the end of a session, and then the smoothed probability method was used to obtain candidate terms. For query substitution, they leveraged a method similar to that by Wang and Zhai [6] to obtain substitution candidates. Then, they applied a scoring method proposed by Bing and Lam [5] but exploited different methods to set the parameters. They utilised the n-gram and skip-bigram approaches to capture the term context in the historical query for modelling the dependency among topics and terms, respectively. Although capable of achieving a reasonable performance, this study only considered individual preferences, when scoring candidate queries, and the term context but ignored other indicators, such as the context of latent topics while scoring candidate queries.

2.3. Word embedding for information retrieval

Word embedding techniques have been effective in learning low-dimensional vector space representations of an input word and capturing semantic relationships among words. Word embedding has been extended to the representations of sentences and paragraphs [24], relational entities [25,26], descriptive texts of images [27,28], entities [29,30] and so on. It is computationally more efficient than many other competitive techniques [8,31,32] and thus, many IR systems have recently attained enhanced performance with less human engineering using such techniques. For example, Zuccon et al. [33] and Ganguly et al. [34] focused on exploiting word embeddings to enhance the effectiveness of information retrieval, and the transformation probabilities between words were derived by calculating the distance between the vector embeddings for words. Zamani and Croft [35] explored the use of word embedding to improve the accuracy of query language models in the ad hoc retrieval task. Vu et al. [36] presented an embedding approach to learn user profiles, and each user profile was represented by two projection matrices and one user embedding. The two matrices were intended to detect the user interest-dependent aspects from the users’ input queries and the corresponding relevant documents while the user embeddings aimed to identify the relationship between the queries and documents. Subsequently, user profiles were used directly for search personalisation.

A few researchers have exploited word embedding (Word2vec) in query-related research. For example, Grbovic et al. [37] proposed a query rewriting method based on a new query embedding algorithm in which the vector embeddings of a query were learned from its content (content2vec), context (context2vec) or both context and content (context-content2vec) within a search session, and the final results showed that context-content2vec had achieved the best performance. Zamani and Croft [10] estimated embedding vectors of a query from pseudo-relevant documents, and the estimated query embeddings were applied to query expansion and classification. Roy et al. [38] presented a framework for automatic query expansion by leveraging Word2vec to learn the semantic and contextual relationships between terms, and the terms related to a query were acquired by exploiting the K-nearest neighbour approach to expand the original query. Kuzi et al. [39] exploited Word2vec term embedding trained from an entire document corpus to select terms that were semantically similar to the entire original query or the terms contained in the query. They leveraged the terms selected to expand the initial query using a unigram language model framework. Akbiket et al. [40] proposed a method for producing contextual string embeddings using the neural character language model. They captured syntactic–semantic word features and disambiguated words in a context, achieving the best performance in the classic natural language processing (NLP) sequence labelling tasks.

However, to the best of our knowledge, our work is the first to exploit word embedding in the task of personalised query reformulation. Distinct from previous work, our article pioneers the application of the term, user and topic embeddings to personalise query reformulation; the embeddings are used not only for better model-term similarity but also for personalisation, semantic dependency of terms and topic dependency.

3. Models for personalised query reformulation

The main framework of this article is depicted in Figure 1. Our work comprises two major sub-tasks: candidate query generation and scoring. These two sub-tasks are performed with the assistance of pre-learned embeddings (i.e. term, user and topic embeddings). In this section, we first detail our proposed models for the two sub-tasks and then specify the embeddings on which they rely. The main notations used in this article are listed in Table 1.

Figure 1.

The framework of our proposed approach.

Table 1.

Main notations used in this article.

Symbol	Description
$Q$	A query consists of a sequence of terms $w_{1}, \dots, w_{i - 1}, w_{i}, w_{i + 1}, \dots, w_{n}$ and is issued by a search engine user.
$Q$	A query collection that comprises $N$ unique queries $q_{1}, q_{2}, \dots, q_{N}$ .
$q'$	A candidate reformulation for the query $q$ .
$w_{i}$	The ith term in a query $q$ .
$L_{i} (k)$	Set of terms that occurs at the ith position away from $k$ on its left side in any $q \in Q$ .
$R_{i} (k)$	Set of terms that occurs at the ith position away from $k$ on its right side in any $q \in Q$ .
$c (a, C (k))$	The frequencies of that term $a$ occurs in a particular contextual set $(L_{i} (k) or R_{i} (k))$ of the word $k .$
$T$	The vocabulary contains the top 100,000 terms in the collection of pseudo-documents constructed in the training set.
$V$	The set of unique terms in the pseudo-documents constructed in the training set.
$u$	A particular user.
$Z$	The set of possible latent topics $(z^{1}, z^{2}, \dots z^{i} \dots)$ learned from the pseudo-documents constructed in the training set.

3.1. Candidate query generation

The outputs of this step are candidate queries formulated by substituting or adding one term in the original query. The model proposed by Wang and Zhai [6] is used with the difference that the contextual similarity between terms is calculated by exploiting embeddings and is augmented by user personalisation.

3.1.1. Query substitution

Given a query $q = w_{1}, \dots, w_{i - 1} w_{i} w_{i + 1}, \dots, w_{n}$ issued by a user $u$ , the task of query substitution is carried out to modify the query by replacing one term $w_{i}$ in $q$ with the term $k$ and thus, produce a new query $q' = w_{1}, \dots, w_{i - 1} k w_{i + 1}, \dots, w_{n}$ . The new term $k$ should have a meaning that is similar to that of the term $w_{i}$ and the query $q$ . It should meet with the user $u'$ s personalised requirement. Therefore, the probability of term $k$ substituting the ith position term $w_{i}$ given a context $w_{1}, \dots, w_{i - 1}_w_{i + 1} \dots, w_{n}$ for a particular $u$ can be defined as $P (w_{i} \to k | w_{1}, \dots, w_{i - 1}_w_{i + 1,} \dots, w_{n}, u)$ , and this probability is represented in short-hand $t (w_{i} \to k | q, u)$ . Then, the probability $t (w_{i} \to k | q, u)$ can be derived as

\begin{matrix} t (w_{i} \to k | q, u) = P (k | w_{i} : w_{1}, \dots, w_{i - 1}_w_{i + 1}, \dots, w_{n}, u) \\ \propto P (w_{i} : w_{1}, \dots, {w_{i - 1}}_{w_{i + 1}}, \dots, w_{n}, u | k) * P (k) \\ \propto t (w_{i} | k) P (k) P (w_{i} : w_{1}, \dots, w_{i - 1}_w_{i + 1}, \dots, w_{n}, u | k) \\ \propto t (k | w_{i}) P (w_{1}, \dots, w_{i - 1}_w_{i + 1}, \dots, w_{n}, u | k) \end{matrix}

(1)

We assume that $w_{1}, \dots, w_{i - 1}_w_{i + 1} \dots, w_{n}$ and $u$ are independent, and the following formulas are further obtained

\begin{matrix} t (w_{i} \to k | q, u) \propto t (k | w_{i}) P (w_{1}, \dots, {w_{i - 1}}_{w_{i + 1}}, \dots, w_{n}, u | k) \\ \propto t (k | w_{i}) P (w_{1}, \dots, w_{i - 1}_w_{i + 1}, \dots, w_{n} | k) \\ \propto t (k | w_{i}) P (w_{1}, \dots, w_{i - 1}_w_{i + 1}, \dots, w_{n} | k) P (k | u) / P (k) \end{matrix}

(2)

where $P (k)$ is the prior probability of the occurrence of the term $k$ in the vocabulary $T$ , which is the origin of the candidate terms . $t (k | w_{i})$ is the probability of ‘translating’ the term $w_{i}$ to the term k, and $P (k | u)$ represents the probability that the user $u$ prefers the term $k$ . We adopted term and user embeddings to represent terms and users as embedded vectors in our case, so $t (k | w_{i})$ and $P (k | u)$ are defined based on vector similarity as follows

t (k | w_{i}) = \frac{\exp (\cos (k, w_{i}))}{\sum_{w \in T} \exp (\cos (w, w_{i}))}

(3)

where $\cos (k, w_{i})$ is the similarity between the vector representation of the term $k$ and that of the term $w_{i}$ and $T$ is the vocabulary set. Analogously, $P (k | u)$ is denoted as

P (k | u) = \frac{\exp (\cos (k, u))}{\sum_{w \in T} \exp (\cos (w, u))}

(4)

Here, $\cos (k, u)$ represents the similarity between the vector of the user $u$ and that of the term $k$ and is also calculated using cosine similarity. $P (w_{1}, \dots, w_{i - 1}_w_{i + 1}, \dots, w_{n} | k)$ can be considered the probability of the new term $k$ fitting into the context of the term $w_{i}$ and is computed as

P (w_{1}, \dots, w_{i - 1}_w_{i + 1}, \dots, w_{n} | k) \propto {\tilde{P}}_{L_{2}} (w_{i - 2} | k) {\tilde{P}}_{L_{2}} (w_{i - 1} | k) {\tilde{P}}_{R_{1}} (w_{i + 1} | k) {\tilde{P}}_{R_{2}} (w_{i + 2} | k)

(5)

Owing to space limitations in this article, readers are required to refer to the work by Wang and Zhai [6] for details on the derivation. Where ${\tilde{P}}_{L_{i}} (a | k)$ or ${\tilde{P}}_{R_{i}} (a | k)$ is the probability of the term $a$ occurring in the context $L_{i} (k)$ (or $R_{i} (k)$ ) of the term $k \cdot L_{i} (k)$ and $R_{i} (k)$ is depicted in Table 1. ${\tilde{P}}_{L_{i}} (a | k)$ or ${\tilde{P}}_{L_{i}} (a | k)$ is calculated using Dirichlet prior smoothing

{\tilde{P}}_{C} (a | k) = \frac{c (a, C) + \hat{u} P (a | θ_{B})}{\sum_{i} c (i, C (w)) + \hat{u}}

(6)

where $C$ is the particular context of the term $k$ and can refer to $L_{i} (k)$ or $R_{i} (k)$ . $c (a, C (k))$ is demonstrated in Table 1. $P (a | θ_{B})$ is a predefined reference model representing the probability of term $a$ being observed in the entire training collection and $\hat{u}$ is a Dirichlet prior parameter (set to 3000 in our experiments). For the query $q = w_{1}, \dots, w_{i - 1} w_{i} w_{i + 1}, \dots, w_{n}$ , there are $n$ positions in which we can substitute one term. For each position, we first rank and obtain the candidate terms of position $i$ by calculating similarities between terms in the vocabulary $T$ and the term $w_{i}$ by applying equation (2). We then obtain the top 100 ranked candidate terms.

As most word embeddings have a common shortcoming wherein each word is represented by a single vector, they subsequently ignore polysemy [41]. If we only consider the query context generated by word embedding in the preliminary candidate term generation process, the noise problem may surface when addressing ambiguous terms. For example, the underlying possible subtopics of the query Flash may refer to Adobe Flash player, a 1990 American superhero series, or the Israeli politician Flash Gordon. Thus, the term ‘flash’ has widely diverse contexts in historical queries. When we try to explore the contexts for the term ‘flash’, which is restricted to terms related to flash player, considerable noise is involved in its candidate list, such as ‘boiler’ and ‘tank’. Furthermore, a few typing errors, such as ‘flahs’ and ‘falsh’ also cause the degradation of performance while handling typo-prone terms, such as ‘flash’. Similar to previous work [5,6], we rely on user sessions in our search log and the normalised mutual information (NMI) threshold [6] is employed to filter out the possible noise. Next, we iterate over all positions of this query, and queries for all positions are merged to generate the final list of candidate queries.

3.1.2. Query addition

Given a query $q = w_{1}, \dots, w_{i - 1}, w_{i,} w_{i + 1,} \dots, w_{n}$ issued by the user $u$ , the task of query addition in this article is to formulate the query by adding one new term $k$ to $q$ before a position $i$ in the form q’ $= w_{1}, \dots, w_{i - 1}, w_{i,} w_{i + 1,} \dots, w_{n}$ . The new term $k$ and the query $q$ should have similar meanings and also meet the user u’s personalised need. We formalise this as a probability problem, and $P (+ k | q, u)$ is used to denote the probability of such an addition pattern $P (+ k | w_{1}, \dots, w_{i - 1}, w_{i,} w_{i + 1,} \dots, w_{n}, u)$ for an individual user $u$ and can be derived as

\begin{matrix} P (+ k | q, u) = P (+ k | w_{1}, \dots, w_{i - 1}, w_{i,} w_{i + 1,} \dots, w_{n}, u) \\ \propto P (w_{1}, \dots, w_{i - 1}, w_{i,} w_{i + 1,} \dots, w_{n}, u | k) * P (k) \end{matrix}

(7)

where $w_{1}, \dots, w_{i - 1}_w_{i + 1} \dots, w_{n}$ and $u$ are assumed as independent, we can obtain the following formulas

\begin{matrix} P (+ k | q, u) \propto P (w_{1}, \dots, w_{i - 1}, w_{i,} w_{i + 1,} \dots, w_{n} | k) * P (u | k) * P (k) \\ \propto P (w_{1}, \dots, w_{i - 1}, w_{i,} w_{i + 1,} \dots, w_{n} | k) * P (k | u) \end{matrix}

(8)

$where P (w_{1}, \dots, w_{i - 1}, w_{i,} w_{i + 1,} \dots, w_{n} | k) and P (k | u)$ identically can be computed as the computation in equations (5) and (4), respectively. For query $q = w_{1} w_{2} \dots w_{i - 1} w_{i} w_{i + 1} \dots w_{n}$ , there are $n + 1$ positions into which we can add a term. For each position, we first rank and obtain candidate terms from the vocabulary $T$ . Then, the top 50 ranked candidate terms are maintained. We iterate over all positions of this query, and new queries for all positions are merged to form a final list of candidate queries.

3.2. Candidate query scoring

The rationale for our candidate-scoring approach is based on the assumption that an individual query shares the unique search intention of a user, and thus the topics in the same query should be prone to remaining consistent from the semantic perspective. Consequently, an ideal candidate query list is the one that formulates queries for which the semantic meanings consistently remain at the top of the ranking list. However, since candidate queries generated in the first sub-task purely depend on the contextual similarity between terms in the historical queries, it is almost impossible to capture and maintain consistency in the semantic meaning within a new query. For example, as shown in Table 2, the query ‘artisan chocolate flavor’ is an ideal output for the original query ‘artisan chocolate bread’ from the point of view of semantic consistency; however, it is ranked eighth. In contrast, the semantic meaning of ‘artisian casserole bread’ seems inconsistent, though it is ranked first.

Table 2.

Top 15 results given by substituting one term in the original query ‘artisan chocolate bread’.

Rank	Candidate query	Rank	Candidate query	Rank	Candidate query
1	artisian casserole bread	2	martineau chocolate bread	3	artisan chocolate cheese
4	artisian chocolate seasonings	5	shining chocolate bread	6	radison chocolate bread
7	residenza chocolate bread	8	artisan chocolate flavor	9	mercure chocolate bread
10	artisian chocolate biscuits	11	artisan chocolate cooked	12	artisian toppings bread
13	tarragon chocolate bread	14	artisian chocolate recipe	15	nickolodian chocolate bread

A score function that can model consistency in semantic meaning in the candidate query is necessary to deal with this problem. Subsequently, all candidate queries of the given original query $q$ can be fed into one scoring method to evaluate quality. Ultimately, the top-ranked candidates are identified as the final output in the original query $q$ based on their scores. As discussed earlier, the graphical model proposed by Bing and Lam [5] is used in our candidate scoring, but we exploit different methods to set the parameters for the model. Next, we discuss the candidate query scoring method and then examine how to learn latent topics from the query log and finally, model the parameter design.

3.2.1. Query scoring model

The design of the graphical model used in our scoring work is shown in Figure 2. While the filled nodes denote the observable query terms, the empty ones represent the unobservable latent topics. Let a candidate query $q^{'}$ comprise a sequence of terms represented as $q' : {w_{1}}^{'}, \dots, {w_{r - 1}}^{'}, {w_{r}}^{'}, {w_{r + 1}}^{'}, \dots, {w_{n}}^{'}$ ( $or w_{1 : n}^{'} for short$ ), where the subscript denotes the position of the term in query $q'$ . We write $w_{r}^{'}$ for a term in the vocabulary set $T$ . The latent topic of $w_{r}^{'}$ is denoted by $z_{r}$ , which represents an admissible topic in $Z$ . The corresponding latent topic sequence for the term sequence $w_{1 : n}^{'}$ is denoted as $z_{1}, \dots, z_{r - 1}, z_{r}, z_{r + 1}, \dots, z_{n}$ $(z_{1 : n}$ for short). The detection of latent topics will be explored later.

Figure 2.

The graphical model for scoring query.

The final candidate query score is arrived at by calculating the marginal joint probability of the term sequence $P (w_{1 : n}^{'})$ , and the latent topic sequence ( $z_{1 : n}$ ) of the query is used as hidden evidence to measure semantic consistency. Let $QS (q')$ denote the final score of candidate query $q' : w_{1 : n}^{'}$ ; all latent semantic topic sequences are marginalised to compute the overall score of the candidate query $q'$ . Particularly, $QS (q')$ is estimated as

Q S (q^{'}) = P (w_{1 : n}^{'}) = \sum_{z_{i} \in Z, 1 \leq i \leq n} P (w_{1 : n}^{'}, z_{1 : n})

(9)

where $w_{1 : n}^{'}$ and $z_{1 : n}$ represent the term and topic sequences in a candidate query, respectively. $P (w_{1 : n}^{'}, z_{1 : n})$ is the joint probability of the topic and term sequences and can be computed using the chain rule

P (w_{1 : n}^{'}, z_{1 : n}) = \prod_{r = 1}^{n} P (w_{r}^{'} | z_{r} = z^{i}) P (z_{1} = z^{m}) \prod_{r = 2}^{n} P (z_{r} = z^{i} | z_{r - 1} = z^{j})

(10)

where $P (w_{r}^{'} | z_{r} = z^{i})$ is the emission probability of a topic $z^{i}$ (unobservable) at position $r$ to a term $w_{r}^{'}$ (observable). $P (z_{1} = z^{m})$ is the probability of attaining a particular topic $z^{m}$ in the first position of the latent topic sequence. This is estimated as $P (z_{1} = z^{m}) = 1 / | Z |$ ( $| Z |$ is the total number of latent topics). $P (z_{r} = z^{i} | z_{r - 1} = z^{j})$ encapsulates the relationship between the two topics $z^{i}$ at position $r$ and $z^{j}$ at position $r - 1$ in the topic sequence. To investigate the potential contribution of personalisation, a topic-based user profile is incorporated in the scoring model. Let $π^{u} = {π_{1}}^{u}, {π_{2}}^{u}, \dots, {π_{| Z |}}^{u}$ demonstrate the profile of a particular user $u$ ?where ${π_{i}}^{u} = P (z^{i} | u)$ is the probability that user $u$ has a preference for topic $z^{i}$ . $\sum_{i} π_{i}^{u}$ is set to 1. To generate candidate queries based on a user’s preference, equation (10) can be updated as

P (w_{1 : n}^{'}, z_{1 : n}) = Π_{r = 1}^{n} P (w_{r}^{'} | z_{r} = z^{i}) π_{z_{1}}^{u} Π_{r = 2}^{n} P (z_{r} = z^{i} | z_{r - 1} = z^{j})

(11)

To compute the probability of $P (w_{1 : n}^{'}, z_{1 : n})$ in equation (11) efficiently, a dynamic programming technique adopted for hidden Markov models [42], known as the forward algorithm, is used, and the forward variable $α_{r} (i)$ is defined as

α_{r} (i) \overset{Δ}{=} P (w_{1 : n}^{'}, z_{r} = z^{j})

(12)

where $α_{r} (i)$ is an intermediate score of the partial query $w_{1}^{'}, \dots, w_{r}^{'}$ given the topic $z^{j}$ at position $r$ . When $r = 1$ , we set $α_{r} (i) = π_{i}^{u} P (w_{r}^{'} | z^{i})$ . The recursive calculation of $α$ is expressed as

α_{r} (i) = [\sum_{z^{j} \in Z} α_{r - 1} (j) P (z_{r} = z^{i} | z_{r - 1} = z^{j})] P (w_{r}^{'} | z_{r} = z^{i})

(13)

where $P (z_{r} = z^{i} | z_{r - 1} = z^{j})$ and $P (w_{r}^{'} | z_{r - 1} = z^{j})$ are independent of the position $r$ and are equal to $P (z^{i} | z^{j})$ and $P (w_{r}^{'} | z^{i})$ , respectively. Finally, the score of the query $q'$ , denoted by $QS (q')$ ?can be computed by adding all possible $z^{i}$ of $α_{n} (i)$

QS (q^{'}) = P (w_{1 : n}^{'}) = \sum_{z^{i} \in Z} α_{n} (i)

(14)

In certain circumstances, some candidate queries for an input query may be generated by substituting one term while others can be generated by adding a new term. Thus, the candidate queries returned for the same input query probably have varying lengths. To render the candidate queries with distinct lengths comparable, we consider the normalisation

NQS (q') = QS (q') / | q' |

(15)

where $| q' |$ represents the number of terms in the query $q'$ and $QS (q')$ is the final score of query $q'$ calculated in equation (14).

3.2.2. Latent topic detection

Latent topics are critical factors in the candidate query scoring function discussed above and in the topic embedding that will be discussed later. We first constructed a personal pseudo-document for individual users by aggregating all the queries issued by them. Then, we exploited the pseudo-document of each user to represent their interest based on the previous work [5,23], which have proved that users’ interests can be inferred from their historical queries. Next, we treated each user’s profile (pseudo-document) as a single document unit to detect latent topics. The widely used and standard LDA algorithm [43] was leveraged to implement the latent semantic topic analysis on the collection of pseudo-documents. The Gibbs LDA² package was used, and the collapsed Gibbs sampling [44] was performed to iteratively assign latent topics for each query term. In this manner, given a sequence of words T = {t₁, …, t_M}, after the LDA converged, each term $t_{i}$ was discriminated into a specific topic $z^{i}$ .

3.2.3. Model parameter design

To calculate the candidate query score, we require three types of model parameters, namely $P (z^{i} | u)$ , $P (z^{i} | z^{j})$ and $P (w_{i}^{'} | z^{j})$ . For modelling these parameters, only the contexts of terms were considered in the previous work [4,5]. We assumed that the contexts of latent topics can provide effective evidence for modelling the consistency of semantic meaning of a query, and not only the contexts of query terms but also those of latent topics were applied while modelling these three parameters. Contextual information on query terms and latent topics was represented in the form of embedded vectors. Term, user and topic vectors were involved in these parameters. They were obtained with term, user and topic embeddings, respectively. We discuss how to learn them later. Based on the embeddings, the three parameters $P (z^{i} | u)$ , $P (z^{i} | z^{j})$ and $P (w_{i}^{'} | z^{i})$ were calculated based on vector similarities

P (z^{i} | z^{j}) = \frac{\exp (\cos (z^{j}, z^{i}))}{\sum_{z \in Z} \exp (\cos (z, z^{j}))}

(16)

P (w_{r}^{'} | z^{i}) = \frac{\exp (\cos (w_{r}^{'}, z^{i}))}{\sum_{w \in T} \exp (\cos (w, z^{i}))}

(17)

P (z^{i} | u) = \frac{\exp (\cos (z^{i}, u))}{\sum_{z \in Z} \exp (\cos (z, u))}

(18)

where $\cos (z^{i}, z^{j})$ , $\cos (w_{r}^{'}, z^{i})$ and $\cos ({z^{i}}_{i}, u)$ represent the similarity between two vectors, and all of them are calculated based on cosine similarity.

4. Embeddings

Three types of embeddings (i.e. term, user and topic embeddings) are associated with our proposed model. The main structure used to derive them is demonstrated in Figure 1. In this section, we offer a detailed description of how to learn these embeddings.

4.1. Term embedding

This aims to learn low-dimensional vectors of query terms from search logs by employing the notation of a query as ‘sentence’ and terms within the query as ‘words’. It learns query term representations using a state-of-the-art word embedding model [8], called the skip-gram model, which aims to predict multiple surrounding (contextual) words for a given word in a sliding window. The objective of this embedding model is to maximise the log-likelihood over the entire set $Q$ of queries in the training data

L = \sum_{q \in Q} \sum_{w_{m} \in q} \sum_{- b < i < b, i \neq 0} l o g P (w_{m + i} | w_{m})

(19)

The probability $P (w_{m + i} | w_{m})$ of observing a neighbouring term $w_{m + i}$ given the current term $w_{m}$ is estimated using softmax

P (w_{m + i} | w_{m}) = \frac{\exp (v_{w_{m}}^{T} v_{w_{m + i}}^{'})}{\sum_{w = 1}^{V} \exp (v_{w_{m}}^{T} v_{w}^{'})}

(20)

where $v_{w}$ and $v_{w}^{'}$ are the input and output vector representations of term $w$ , respectively, $b$ is defined as the length of the context for query term sequences and $V$ is the number of unique terms in the pseudo-documents in the training set as depicted in Table 1. Term embeddings were applied in the computation of equations (3), (4) and (17) in our work.

4.2. User embedding

Assuming that users’ interests are revealed by their search histories, we obtain the vector of a user $u$ by aggregating overall term embeddings for each term they issued by, and a standard method is to take the mean of the individual term vectors. However, not all terms in each profile are equally important when representing a user’s preference. Some common terms do not contribute as much to the representation of a user’s interests as rarer terms do. Generally, Term frequency-Inverted document frequency $(tf - idf)$ is often used to implement this notion. As a result, the vector representation of each user’s interest is weighted by $tf - idf$ , and the vector of the user $u (v_{u})$ is defined as follows

v_{u} = \sum_{e \in M} \frac{count (e)}{\sum_{m \in M} count (m)} * \log \frac{| U |}{1 + {u \in U : e \in u}} * v_{e}

(21)

where $M$ is the set of terms issued by user $u$ , and $v_{e}$ is the vector representation of the term $e$ obtained by the term embeddings. $count (e)$ is the total number of times the term $e$ was issued by the user $u$ in the past and $| U |$ denotes the total number of unique users. ${u \in U : e \in u}$ represents the number of users issuing the term $e$ . The user embeddings were leveraged in the calculation of equations (4) and (18).

4.3. Topic embedding

This aims to learn the vector representation for each latent topic and is trained based on obtaining the specific topic for each term using topic analysis on users’ pseudo-documents. The basic idea of topic embedding is to consider each topic to be a pseudo term. Terms appearing in the context window $d$ of this topic have been assigned based on the original skip-gram of Word2vec. The objective of topic embedding is to maximise the following log probability of training data

L_{1} = \sum_{q \in Q} \sum_{w_{m \in q}} \sum_{- d \leq i \leq d, i \neq 0} \log P (w_{m + i} | w_{m} = z_{j})

(22)

Probability $P (w_{m + i} | w_{m} = z_{j})$ of observing a neighbouring term $w_{m + i}$ given the current topic $z_{j}$ is defined using softmax

P (w_{m + i} | w_{m} = z_{j}) = \frac{\exp (v_{z_{j}}^{T} v_{w_{m + i}}^{'})}{\sum_{w = 1}^{V} \exp (v_{z_{j}}^{T} v_{w}^{'})}

(23)

where $v_{z_{j}}^{T}$ is the input vector representation of the corresponding latent topic $z_{j}$ of the term $w_{m}$ , and $v_{w}^{'}$ is the output vector representation of the term $w$ . Therefore, the vector of a topic can denote the shared semantics of terms and the context term distribution under this topic. Crucially, this embedding is used to estimate equations (16) and (18).

5. Experimental setup

In this section, we discuss the experimental setup that supports the evaluation of our proposed frameworks. First, we introduce our datasets and then give a detailed description of our evaluation metrics and baselines. Finally, the additional experimental settings and parameters are presented. The research questions that our experiments aim to answer are as follows:

RQ1: Is the overall performance of our method better than those of the state-of-the-art baselines?

RQ2: Is our candidate generation method better than that of the baselines?

RQ3: Does our scoring method outperform that of the baselines?

RQ4: How does the query frequency impact the results?

RQ5: How does the query length impact the results?

RQ6: Does the number of latent topics affect the overall performance of our method?

5.1. Dataset

Our experiments were executed on the America Online (AOL) search log, which is the largest publicly available query log and, in recent studies, has been considered a credible experimental dataset [4,20,24]. This log dataset spans 3 months from 1 March 2006 to 31 May 2006. As shown in Table 3, each record contains the following fields: (1) UserID: an anonymous user ID; (2) Query: the actual query terms; (3) Item-rank: the rank of the clicked result and (4) Clicked URL: the result the user clicked. We divide the entire dataset into two parts based on time. Queries in the first 2/3 data (queries submitted before 1 May 2006) were used as our training set, while the last 1/3 data (the last month of the data) were used as our testing set. All non-alphanumeric characters were removed from the queries. To reduce the noise further, we removed a large number of navigational queries (containing URL substrings, such as .net, http, .org, .edu and .com, for which query suggestions may not be as useful) and queries starting with special characters (such as &, $ and #) from the dataset. We defined the end of a session using a 10-min idle-time window [45] and deleted the queries that were repeated in each session. Finally, sessions with less than two queries or no clicked URL were also removed because we intended to identify those in the subsequent queries if the first query submitted by a user could not meet his or her information needs in the current session. While constructing user profiles we only considered the users in both the training and testing sets only. We deleted user profiles with fewer than 50 unique queries. Finally, we obtained 35,368 user profiles, and the term and topic embeddings were trained on this profile dataset.

Table 3.

Examples of AOL query log datasets.

UserID	Query	Query time	Item rank	Clicked URL
2722	military rental benefits	2006-03-10 09:32:38	4	http://www.valoans.com
2722	military rental benefits	2006-03-10 09:33:37	4	http://benefits.military.com
217	military car rental benefits	2006-03-10 09:33:37	8	http://www.avis.com
217	deers id cards	2006-03-10 10:32:02	1
217	deers id cards army	2006-03-10 10:32:58	1	http://www.stewart.army.mil

5.2. Evaluation and baselines

It is well known that the effectiveness of query suggestion methods can be evaluated mainly through human evaluation [20,23] or session information [4,5,45]. Human evaluation is a resource-intensive process that requires labour, time, evaluator expertise and funding. Often, therefore, only a small number of query reformulation results can be evaluated, and the results of human evaluation are often influenced by the evaluators’ subjectivity. Conversely, session-based evaluation is implemented using session information from the query log that records the query reformulating behaviours performed by users to acquire satisfactory results. Automatic evaluations are conducted by investigating these behaviours. Compared with human evaluation, session-based evaluation can measure more query reformulation results automatically and its evaluation results are more objective. Consequently, session-based evaluation is applied in our work and the ground truth for a query reformulation is defined as follows. First, we utilised the same method as Bing et al. [4] to collect satisfactory and unsatisfactory queries. At the end of a search session, a query that led to at least one URL being clicked was selected as a satisfactory query. The query previous to the satisfactory one in the same search session was designated an unsatisfactory query. Second, we adopted unsatisfactory and satisfactory queries as inputs and the ground truth of our experiment, respectively. Given the unsatisfactory query as an input, the corresponding satisfactory query generated by a query reformulation method was accepted as the output. We selected sessions generated by users who produced sessions in both the training and testing set to evaluate our method, and 15,336 one-term query addition sessions and 57,556 one-term query substitution sessions were obtained in the testing set. From the statistical analysis, we found that approximately 70% of the satisfactory queries were not found in the training set because of the dataset limit, implying that no query suggestion method successfully recommended correct queries for these cases.

We found that only about 1.3% of satisfactory queries were generated by adding or substituting a stop word, and only approximately of 1.2% users added or substituted the stop words in original queries. These data suggest that stop words were not very useful for generating satisfactory queries, and they were removed. We also found that approximately 0.7% of the satisfactory queries were generated by substituting or adding words with their lengths denoted by characters, and this gave us a reason to rule out such words. We also found that the probability of new terms co-occurring with the terms of initial queries in the same query was 37%, and that the probability of a new term co-occurring with the terms in the initial queries in the same session was 79%, which proves once again that it is essential to exploit session information to filter noise while generating term substitution candidates. There are 2,848,149 unique queries in all and 633,372 distinct terms in the collection of pseudo-documents.

To evaluate the effectiveness of query reformulation, we followed the state-of-the-art query reformulation method [4] and used the metric Recall@K to measure the accuracy of each work. This metric signifies that if the true answer is returned among the top-K candidates, the reformulation of an input query is recognised as successful and is equal to the ratio of the successful cases of all adopted testing cases. A higher value of Recall@K indicates better performance. Actually, some metrics, such as MRR@K and nDCG@K, can also be used to evaluate the effectiveness of query reformulation. Like the work by Bing et al. [4], we simply intend to present the rate of successful cases and thus only Recall@K was leveraged in our work. Since previous studies on query reformulation were based on the assumption that users are generally provided with a list of top $10$ (at most) candidates by a Web search engine, it was common in many studies [23,45] to have set the K to 10 while using the evaluation metric $Recall @ K$ . However, to analyse the performance of the models more thoroughly, more or fewer candidates (for instance, $cutoff K = 5, 15, 20, 25 and 30$ ) are finally returned in our work.

To compare our approach with those of others, we reimplemented two comparative baselines, namely those taken from Wang and Zhai [6] and Bing et al. [4] with the same experimental setup. They are denoted as ‘CTA’ and ‘TALP’, respectively. Detailed descriptions of both baselines are provided in the work section. Bing et al. [4] used the query log- and social tag-based methods to generate candidate queries. Owing to the unavailability of the social tagging data resource and to ensure that the experimental setup (i.e. dataset) of the comparative method was consistent with that of our new method, we only implemented the query log-based method, and Model 3 scoring was reproduced with window size x = 3 for skip-gram because of its superior performance.

5.3. Settings and parameters

While generating query candidates by either term substitution or addition, we constructed the vocabulary set T by selecting the top 100,000 terms in the training set. We then set the filtering NMI threshold [6] to 0.001 to filter the noise while generating substitution candidates. We learned topic models by exploiting LDA in the training set, and the number of topics while training the LDA model was set to 80. We conducted additional experiments and present a detailed discussion of the results generated when the topic number changed.

There are several parameters in the Word2vec algorithm, such as vector dimensionality, window size, word-sampling threshold, whether to employ hierarchical softmax and the number of negative examples [8]. As the vector dimensionality and the sliding window size in most Word2vec algorithm-based work (irrespective of being related to the query), respectively, are set to 300 and 5 by default, we employed these parameter settings in both the term and topic embedding. Both embedding models were optimised using stochastic gradient ascent. Negative sampling was used to accelerate the training, and 10 random samples in each vector update were leveraged. Given that some very frequent words, such as ‘google’, ‘yahoo’, ‘facebook’, may cause hindrances, such as shrinking the embedding space, we deal with frequent terms using the subsampling approach [8]: each term $w_{i}$ in the training set is discarded with the probability computed by the equation $P (w_{i}) = 1 - \sqrt{τ / f (w_{i})}$ , where $f (w_{i})$ is the frequency of word $w_{i}$ in the training set and τ is a user-set parameter, with a typical value of approximately 10⁻⁵.

6. Experimental results

We first implement single and hybrid operations to generate final candidate queries for each of the aforementioned methods. We then provide an analysis of the effect of query frequency and length on the overall performance of query reformulation. Finally, we examine how the number of topics affects our model.

6.1. Performance comparison of single and hybrid operations

In this experiment, we aim to answer RQ1 and design two types of experiments according to these operations, namely single and hybrid operations, to generate candidate queries. The results are shown in Tables 4 –6. The best performer in each column is highlighted. The statistical significance of pairwise differences in our method versus CTA and our method versus TALP are detected using a two-tailed t-test (p < 0.05). We use the symbols ‘*’ and ‘Δ’ in the superscript in these tables to indicate statistically significant differences for CTA and TALP, respectively.

Table 4.

Comparing different methods for query substitution.

Approaches	Metrics
	Recall@5	Recall@10	Recall@15	Recall@20	Recall@25	Recall@30
CTA	0.050	0.091	0.115	0.130	0.150	0.165
TALP	0.071	0.120	0.135	0.156	0.183	0.188
Our method	0.089^Δ*	0.138^Δ*	0.157^Δ*	0.171^Δ*	0.192^*	0.200^*

‘CTA’ and ‘TALP’ represent the methods proposed by Wang & Zhai [6] and Bing et al. [4], respectively. The symbols “*” and “#” in the superscript in the table indicates statistical significance (two-tailed t-test, p < 0.05) with respect to the CTA and TALP, respectively.

Table 5.

Comparing different methods for query addition.

Approaches	Metrics
	Recall@5	Recall@10	Recall@15	Recall@20	Recall@25	Recall@30
CTA	0.030	0.054	0.070	0.089	0.090	0.109
TALP	0.050	0.068	0.082	0.114	0.124	0.131
Our method	0.072^Δ*	0.120^Δ*	0.143^Δ*	0.151^Δ*	0.163^Δ*	0.171^Δ*

Table 6.

Comparing different methods for hybrid operations.

Approaches	Metrics
	Recall@5	Recall@10	Recall@15	Recall@20	Recall@25	Recall@30
CTA	0.025	0.042	0.052	0.060	0.082	0.093
TALP	0.059	0.063	0.074	0.089	0.118	0.124
Our method	0.082^Δ*	0.096^Δ*	0.105^Δ*	0.123^Δ*	0.139^Δ*	0.148^Δ*

First, we compare the performance of the single operation of our method against that of state-of-the-art methods. It is assumed that we have already known the particular type of operation (i.e. term substation or addition) that should be performed by users to obtain satisfactory queries, and we discuss the overall performance of query substitution and addition on one term, separately. That is, all 57,556 one-term query substitution sessions are used as test cases to assess the performance of query substitution, and 15,336 one-term query addition sessions are employed for query addition. We compare the performance of our method with CTA and TALP, while the overall results from different methods in query substitution and addition are shown in Tables 4 and 5, respectively. There are some differences between the baseline scores obtained in this article and those published in others studies [4,6], and the scores calculated in this article are slightly lower than those found in previous work. This may be mainly because we chose different test queries as the test queries were not made public, and we ruled out a large volume of words comprising only one character. Besides, from the experimental results, we also found that our model achieved considerably more effective results than the baselines for single reformulation operations (i.e. query substitution and addition).

The overall performance increases when more query reformulation candidates are returned for re-ranking. A larger value of $K$ simply increases the probability of the gold-standard truth being included in the query reformulation list. These models report competitive scores when K = 30 in both query substitution and addition. The improvements in performance realised by our method over the baseline decrease further as K increases in the query substitution. For instance, at cutoff K = 20, our method shows 31.5% and 8.7% improvement over the baseline CTA and TALP, respectively, but 28% and 5% at the cutoff K = 25. The improvements achieved by our method against the TALP are not significant when K = 25 or 30. The main reason for this is that we use session information to filter out noise candidate queries while generating query substitution candidates. This may result in some testing queries to have fewer than 25 candidate queries. However, the superiority of our models over the baseline remains significant in query addition even when more candidates are returned.

Considering that it is likely for a user to perform both addition and substitution operations to reformulate a query, it is necessary to explore the effectiveness of our method while generating hybrid query reformulation candidates through addition or substitution. Thus, we relax the assumption that we know the operation type that should be carried to obtain a satisfactory query in this experiment. We also restrict the operations for one new term along the lines of the above experiments. Thus, the testing cases all involve substitution and additional testing sessions mentioned above, and we obtain 13,669 sessions in all. For a given testing query, the substitution and additional candidate queries are generated separately, and all candidate queries are re-ranked based on scores calculated by the scoring method to achieve the final candidate list. The results of different methods for hybrid operations are presented in Table 6. While allowing for hybrid operations, the absolute performance figures decrease when compared with single operations primarily because the number of possible operations increases. This, in turn, challenges the scoring methods. Despite this reduction, our method continues to significantly outperform CTA and TALP on all metrics, when $K$ varies from 5 to 30.

Tables 7 and 8 contain the top 10 final candidate queries generated by the hybrid operations of different methods for the input queries ‘enterprise rental car’ and ‘mexican recipes’, respectively. Table 7 demonstrates the candidates generated in a setting where the original query was issued by a user who used to submit queries related to ‘car’. Users’ historical queries can represent their preferences. Thus, we can infer that the car-related candidates in Table 7 can be expected to be satisfactory. The candidates produced in Table 8 for a user who searched for information on ‘recipes’ and recipe-related queries can be determined as ideal candidates. From these two tables, we find that more closely related queries can be returned by our method. The candidate queries generated by our method tend to correspond more accurately with users’ interests. Thus, we conclude that our method performs the best among the given methods.

Table 7.

The top 10 final outputs generated by hybrid operations of different competing methods for the query ‘enterprise rental car’ submitted by a user who often searches information related to ‘car’.

CTA	TALP	Our method
1. enterprise rental photos	1. enterprise de car	1. enterprise de car
2. enterprise rental free	2. enterprise washington	2. enterprise shortline rental car
3. enterprise rental virginia	3. enterprise ohio car	3. budget rental car
4. enterprise rental pictures	4. enterprise bank rental car	4. rebeccasgarden rental car
5. verbal rental car	5. enterprise rental park	5. enterprise rental restaurant
6. times rental car	6. enterprise nj car	6. blues-hollywood rental car
7. enterprise express rental car	7. enterprise rental college	7. verbal rental car
8. enterprise product rental car	8. enterprise hotel car	8. weekly enterprise rental car
9. enterprise product rental car	9. 2005 enterprise rental car	9. weekly enterprise rental car
10. timed rental car	10. enterprise school car	10. enterprise rental park

‘CTA’ and ‘TALP’ represent the methods proposed by Wang & Zhai [6] and Bing et al. [4], respectively.

Table 8.

The top 10 final outputs generated by hybrid operations of different competing methods for the query ‘mexican recipes’ submitted by a user who often searches information related to ‘recipes’.

CTA	TALP	Our method
1. mexican mccullough	1. mexican holiday	1. jamba recipes
2. mexican goat	2. mexican time	2. florida recipes
3. mexican restaurant	3. jamba recipes	3. mexican goat recipes
4. mexican holiday	4. California recipes	4. mexican eldered recipes
5. mexican food	5. disney mexican recipes	5. mexican restaurant
6. mexican city	6. mexican goat recipes	6. mexican holiday
7. mexican goat recipes	7. mexican restaurant	7. california recipes
8. jamba recipes	8. corkscrew recipes	8. green recipes
9. mexican time	9. nostra recipes	9. disney mexican recipes
10. disney mexican recipes	10. mexican eldered recipes	10. mexican time

‘CTA’ and ‘TALP’ represent the methods proposed by Wang & Zhai [6] and Bing et al. [4], respectively.

6.2. Comparing different methods for the generation of candidate and scoring queries

To answer RQ2 and RQ3, we designed six experiments to verify whether our candidate query generation or scoring methods were better than those of the current baselines by combining the proposed method with those of the baselines. The results are shown in Table 9, with Recall@K scores for different values of K. The symbol ‘+’, with candidate query generation method to its left and the candidate query ranking method to its right represents a combination of different candidate generation and scoring methods. For example, ‘CTA + our’ means that the CTA candidate query generation and our candidate query ranking methods were simultaneously used to produce candidate queries in an experiment. As mentioned in the related work section, the work of CTA only involved the task of candidate query generation, and the issue of re-ranking candidate queries was not addressed. Therefore, there are no experiments that combine our generation method with the CTA scoring method. The experiment involving CTA without any combination in this table implies that the final candidate queries are generated and scored only by the CTA generation method.

Table 9.

Comparison of different combinations of candidate generation and scoring methods in generating hybrid query reformulations.

Approaches	Metrics
	Recall@5	Recall@10	Recall@15	Recall@20	Recall@25	Recall@30
CTA	0.025	0.042	0.052	0.060	0.082	0.093
CTA + our	0.042	0.054	0.066	0.085	0.096	0.112
TALP + TALP	0.059	0.063	0.074	0.089	0.118	0.124
TALP + our	0.054	0.063	0.071	0.095	0.113	0.117
Our + TALP	0.050	0.056	0.067	0.090	0.098	0.116
Our + our	0.082	0.096	0.105	0.123	0.139	0.148

‘CTA’ and ‘TALP’ represent the methods proposed by Wang & Zhai [6] and Bing et al. [4], respectively.

We consistently benefit from using our scoring approach as opposed to the TALP approach, regardless of the generation approach used. The use of our approach for generation instead of the CTA or TALP produces consistent improvements. Both findings are consistent across all combinations and values of K. There are some possible reasons for this. First, the contextual information of each query term is derived by term embeddings that consider richer contextual information. Term embeddings provide better semantic similarity measures than the translation model based on co-occurrence for generating query reformulations. Second, taking the users’ personalisation into account in the generation of candidate queries can help filter out some unrelated terms. Our scoring model considers the context of both a term and its corresponding topic and can provide more valuable contextual evidence for calculating the consistency of the semantic meaning of a candidate query.

6.3. Performances of queries with different frequencies

Finally, we focus on the impact of query frequency on the overall query reformulation performance to answer RQ4. Queries submitted to Web search engines are known to follow a heavy-tailed Zipf distribution [46]. A large number of queries are issued infrequently. Thus, we distinguish among ‘long-tail queries’ (those that occur less than twice or those that are only issued once by a user), ‘highly frequent’ queries (those that are issued more than a hundred times) and ‘common queries’ (all others). The proportion of these three types of testing queries is presented in Table 10. The performances of query substitution and addition for each frequency group in terms of Recall@K are shown in Tables 11 and 12. Query frequency has an impact on performance, with better performance achieved with highly frequent queries both in the query substitution and addition. These results may arise from queries with lower frequencies having less context information thus rendering the completion of the query reformulation task more challenging. We found that the gap between highly frequent and common queries is larger than that between common and long-tail queries in most cases. For example, at cutoff K = 20 in query substitution, highly frequent queries show a 48.1% improvement over the performance of common queries, and common queries show 23% improvements over long-tail queries. At cutoff K = 20 in query addition, highly frequent queries reported a 60% improvement against the performance of common queries, and common queries showed an improvement of 6% over the long-tail queries. Therefore, we infer that queries with higher frequency are more likely to be advantageous in our method.

Table 10.

Proportion (%) of long-tail, common and highly frequent queries in our testing set.

Approaches	Long-tail queries	Common queries	Highly frequent queries
Proportion	47%	47%	6%

Table 11.

Recall@K performance for queries with different frequencies and substitution operation with one term.

Different types of queries	Metrics
	Recall@5	Recall@10	Recall@15	Recall@20	Recall@25	Recall@30
Long-tail queries	0.059	0.092	0.115	0.126	0.152	0.165
Common queries	0.086	0.124	0.132	0.156	0.174	0.178
Highly frequent queries	0.122	0.198	0.224	0.231	0.25	0.257

Table 12.

Recall@K performance for queries with different frequencies, addition operation with one term.

Different types of queries	Metrics
	Recall@5	Recall@10	Recall@15	Recall@20	Recall@25	Recall@30
Long-tail queries	0.042	0.08	0.102	0.12	0.125	0.133
Common queries	0.058	0.102	0.123	0.128	0.140	0.150
Highly frequent queries	0.116	0.178	0.204	0.205	0.224	0.230

6.4. Performance of queries of different length

To answer RQ5, we closely examined the results across different term lengths. The proportion (%) of testing queries with different lengths is shown in Table 13. Most queries contain two or three terms. We implemented query substitution and addition experiments to test the effect of query length on the performances, and the results are seen in Figures 3 and 4. The horizontal axis represents the specific value of $K$ , and the vertical axis indicates the corresponding $Recall @ K$ value. We found that when more terms were contained in the query, a higher Recall@K was achieved until the length was 5 for query substitution and 4 for query addition. This phenomenon may have taken place because of two reasons: first, there are more terms in a longer query to be substituted or added, and the likelihood of generating true answers becomes greater; second, longer queries may provide more contextual information for locating the correct candidate terms. There is a decrease when the query length is 6 for query substitution and 5 for query addition. The performance of queries containing (≥) seven terms is inferior to that of queries having two terms. It is thought that the main reason for these results is that longer input queries risk returning noisy candidates and generate them more abundantly. We also found, as seen in Figures 3 and 4, that the gap between the largest and smallest values of Recall@K was widening when K became larger.

Table 13.

Proportion (%) of queries with different lengths in our testing set.

Queries	Query length: 2	Query length: 3	Query length: 4	Query length: 5	Query length: 6	Query length: ≥7
Proportion	40%	27%	16%	8%	4%	5%

Figure 3.

Recall@K performance for queries with different lengths when substitution operation with one term being performed.

Figure 4.

Recall@K performance of queries with different lengths when addition operation with one term being performed.

6.5. Performance of topic number

The number of latent topics is a critical parameter in our framework and is exploited in latent topic detection. In this experiment, we aimed to answer RQ6 and changed the number of topics from 10 to 100 with step intervals of 10 to discuss the impact of the topic number on the overall performances. For each topic number, the substitution and additional experiments on one term were implemented, and the performance results were demonstrated in Figures 5 and 6, respectively. The X-axis shows the parameter of the topic number, while the Y-axis shows the value Recall@K. As the number of topics increases, the Recall@K values exhibit a dramatic rise until K = 80 for both query substitution and addition; a decline of Recall@K value followed when $K$ was greater than 80. This implies that the best performance was achieved when the topic number was set at 80 in our method. However, this result is different from the findings reported in a previous study [4], which demonstrated that their framework was not sensitive to the number of topics. A possible reason for this difference may be that we learn the contextual information for each latent topic. The impact of the topic number on the overall performance of our method lies in its effect on learning the useful context for each latent topic.

Figure 5.

Comparison of performance of query substitution with one term when the topic number changing.

Figure 6.

Comparison of performance of query addition with one term when the topic number changing.

7. Conclusion

In this article, we demonstrated how embeddings (i.e. term, user and topic embeddings) contribute to the two major tasks of personalised query reformulation, namely candidate query generation and scoring. To track the task of candidate query generation, term and user embeddings were used to generate candidate queries according to the contextual similarity among terms and user preferences. The candidate queries generated in the first task were evaluated and scored (re-ranked) for consistency in the second task. The factors considered were the semantic meaning of a candidate query and user preference based on a graphical model with the term, user and topic embeddings. Our experiments demonstrate that our model consistently and significantly improved over the state-of-the-art method results and that the performance of our method was affected by the query frequency, length and latent topic number. The findings in this study are useful for enhancing the search experience of users by helping them formulate accurate queries for their search tasks. We are also confident that the approaches explored in our study are general enough to be applied to other query-related studies, such as query expansion or rewriting.

Several ideas are worth investigating further to improve our solution. Single-word-level operation for query reformulation was discussed in our article. One future direction is to enhance our single-word level to more complex operations for query reformulation. The topic-based interest profile is derived from historical queries. However, some users may not have abundant historical queries. One possible approach for overcoming this problem may be an investigation of other types of personal information, such as web browsing history and browser bookmarks. In addition, we also plan to evaluate the effectiveness of reformulation method from the perspective of meeting the information need of the user by measuring the web search quality of candidate queries.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The author would like to thank the editor and reviewers for their useful comments and suggestions, which were very helpful in improving the quality of the paper. This research is supported by the Open Project of Xihua University under grant no.XSCG 2020-005 and Postdoctoral Science fund of China under grant no. 2017M620871.

ORCID iD

Xiaojuan Zhang

Notes

References

Jansen

Spink

Bateman

et al. Real-life information retrieval: a study of user queries on the web. ACM SIGIR Forum 1998; 32(1): 5–17.

Cai

Rijke

. A survey of query auto completion in information retrieval. Found Trend Inform Retriev 2016; 10(4): 273–363.

Song

Zhou

LW.

Query suggestion by constructing term-transition graphs. In: Proceedings of the fifth international conference on web search and web data, Seattle, WA, 8–12 February 2012, pp. 353–362. New York: ACM.

Bing

Lam

Wong

et al. Web query reformulation via joint modeling of latent topic dependency and term context. ACM Trans Inform Syst 2015; 33(2): 1–38.

Bing

Lam

. Investigation of web query refinement via topic analysis and learning with personalization, http://www.cs.cmu.edu/~lbing/pub/SIGIR-QRU11-bing.pdf (accessed 24 April 2019).

Wang

Zhai

. Mining term association patterns from search logs for effective query reformulation. In: Proceedings of the 17th ACM conference on information and knowledge management, Napa Valley, CA, 26–30 October 2008, pp. 479–488. New York: ACM.

Sullivan

. The older you are, the more you want personalized search, https://searchenginewatch.com/sew/news/2064258/the-older-you-are-the-more-you-want-personalized-search (2004, accessed 24 April 2019).

Mikolov

Sutskever

Chen

et al. Distributed representations of words and phrases and their compositionality. In: Proceedings of the 26th international conference on neural information processing systems, Lake Tahoe, NV, 5–10 December 2013, pp. 3111–31119. New York: ACM.

Kuzi

Shtok

Kurland

. Query expansion using word embeddings. In: Proceedings of the 25th ACM international on conference on information and knowledge management, Indianapolis, IN, pp. 1929–1932. New York: ACM.

10.

Zamani

Croft

. Estimating embedding vectors for queries. In: Proceedings of the 2016 ACM international conference on the theory of information retrieval, Newark, DE, 12–16 September 2016, pp. 123–132. New York: ACM.

11.

Jones

Rey

Madani

et al. Generating query substitutions. In: Proceedings of the 15th international conference on World Wide Web, Edinburgh, 22–26 May 2006, pp. 387–396. New York: ACM.

12.

Jiang

Liao

et al. Web query recommendation via sequential query prediction. In: Proceedings of the 25th international conference on data engineering, Shanghai, China, 29 March–2 April 2009, pp. 1443–1454. New York: IEEE.

13.

Mei

Zhou

Church

Query suggestion using hitting time. In: Proceedings of the 17th ACM conference on information and knowledge, Napa Valley, CA, 26–30 October 2008, pp. 469–478. New York: ACM.

14.

Fonseca

Golgher

Moura

et al. Using association rules to discover search engines related queries. In: Proceedings of the IEEE/LEOS 3rd international conference on numerical simulation of semiconductor optoelectronic devices, Santiago, Chile, 12 November 2003, pp. 66–71. New York: IEEE.

15.

Boldi

Bonchi

Castillo

et al. The query- flow graph: model and applications. In: Proceedings of the 17th ACM conference on information and knowledge management, Napa Valley, CA, 26–30 October 2008, pp. 609–618. New York: ACM.

16.

Bonchi

Perego

Silvestri

et al. Efficient query recommendations in the long tail via center-piece subgraphs. In: Proceedings of the 35th international ACM SIGIR conference on research and development in information retrieval, Portland, OR, 12–16 August 2012, pp. 345–354. New York: ACM.

17.

Ozertem

Chapelle

Donmez

et al. Learning to suggest: a machine learning framework for ranking query suggestions. In: Proceedings of the 35th international ACM SIGIR conference on research and development in information retrieval, Portland, OR, 12–16 August 2012, pp. 25–34. New York: ACM.

18.

Saeedeh

Fabian

. Generating query suggestions by exploiting latent semantics in query logs. J Inform Sci 2016; 42(4): 437–448.

19.

Sordoni

Bengio

Vahabi

et al. A hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In: Proceedings of the 24th ACM international on conference on information and knowledge management, Melbourne, VIC, Australia, 19–23 October 2015, pp. 553–562. New York: ACM.

20.

Dehghani

Rothe

Alfonseca

et al. Learning to attend, copy, and generate for session-based query suggestion. In: Proceedings of the 26th ACM international conference on information and knowledge management, Singapore, 6–10 November 2017, pp. 1747–1756. New York: ACM.

21.

Jiang

Wang

. RIN: reformulation Inference Network for context-aware query suggestion. In: Proceedings of the 27th international on conference on information & knowledge management, Truin, 22–26 October 2018, pp. 197–206. New York: ACM.

22.

Jiang

Leung

Vosecky

et al. Personalized query suggestion with diversity awareness. In: Proceedings IEEE 30th international conference on data engineering, Chicago, IL, 31 March–4 April 2014, pp. 400–411. New York: IEEE.

23.

Chen

Cai

Chen

et al. Personalized query suggestion diversification in information retrieval. Front Comput Sci 2020; 14(3): 1–13.

24.

Chen

Cai

Chen

et al. Attention-based hierarchical neural query suggestion. In: Proceedings of the 41th international ACM SIGIR conference on research and development in information retrieval, Ann Arbor, MI, 8–12 July 2018, pp. 1093–1096. New York: ACM.

25.

Bordes

Usunier

Garcia-Duran

et al. Translating embeddings for modeling multi-relational data. In: Proceedings of international conference on neural information processing systems, Lake Tahoe, NV, 5–10 December 2013, pp. 2787–2795. New York: ACM.

26.

Walid

Wlodek

Hongxia

. Beyond word embeddings: learning entity and concept representations from large scale knowledge bases. Inform Retriev J 2018; 22: 525–542.

27.

Kiros

Salakhutdinov

Zemel

RS.

Multimodal neural language models. In: Proceedings of the 31st international conference on international conference on machine learning, Beijing, China, 21–26 June 2014, pp. 596–603. New York: ACM.

28.

Zhou

Sun

et al. Collaborative index embedding for image retrieval. IEEE T Pattern Anal Mach Intel 2018; 40(5): 1154–1166.

29.

Gysel

Rijke

Worring

Unsupervised, efficient and semantic expertise retrieval. In: Proceedings of 25th International World Wide Web Conference, Montreal, QC, Canada, 11–15 April 2016, pp. 1069–1079. New York: ACM.

30.

Casillas

Ezeiza

Goenaga

et al. Measuring the effect of different types of unsupervised word representations on medical named entity recognition. Int J Med Inform 2019; 129(2019): 100–106.

31.

Jeffrey

Richard

Christopher

. Glove: global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing, Doha, Qatar, 25–29 October 2014, pp. 1532–1543. Stroudsburg: ACL.

32.

Baroni

Dinu

Kruszewski

. Don’t count, predict! a systematic comparison of context-counting vs. context-predicting Semantic Vectors. In: Proceedings of 52nd annual meeting of the association for computational linguistics, Beijing, China, 26-31 July 2015, pp. 238–247. New York: ACM.

33.

Zuccon

Koopman

Bruza

et al. Integrating and evaluating neural word embeddings in information retrieval. In: Proceedings of the 20th Australasian document computing symposium, Caulfield, VIC, Australia, 5–7 December 2016, pp. 1–8. New York: ACM.

34.

Ganguly

Roy

Mitra

et al. Word embedding based generalized language model for information retrieval. In: Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval, Santiago, Chile, 9–13 August 2015, pp. 795–798. New York: ACM.

35.

Zamani

Croft

WB.

Embedding-based query language models. In: Proceedings of the 2016 ACM international conference on the theory of information retrieval, Newark, DE, 12–16 September 2016, pp. 147–156. New York: ACM.

36.

Nguyen

Johnson

et al. Search personalization with embeddings. In: Proceedings of the 39th European conference on information retrieval, Aberdeen, 8–13 April 2017, pp. 598–604. Berlin: Springer.

37.

Grbovic

Djuric

Radosavljevic

. Context-and content-aware embeddings for query rewriting in sponsored search. In: Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval, Santiago, Chile, 9–13 August 2015, pp. 383–392. New York: ACM.

38.

Roy

Paul

Mitra

et al. Using word embeddings for automatic query expansion. Neu-IR’16, http://pdfs.semanticscholar.org/686e/a4056bd23c2a05baa324722a3194dac8eef8.pdf (2016, accessed 13 April 2019).

39.

Kuzi

Shtok

Kurland

Query expansion using word embeddings. In: Proceedings of the 25th ACM international on conference on information and knowledge management, Indianapolis, IN, 24–28 October 2016, pp. 1929–1932. New York: ACM.

40.

Akbik

Blythe

Vollgraf

. Contextual string embeddings for sequence labeling. In: Proceedings of the 27th international conference on computational linguistics, Santa Fe, NM, 20–26 August 2018, pp. 1638–1649. New York: ACL.

41.

Zhang

Zheng

. Different contexts lead to different word embeddings. COLING 2016, http://www.nlpr.ia.ac.Cn/cip/ZhangPublications/hu-coling-2016.pdf(2016, accessed 24 April 2019).

42.

Rabiner

. A tutorial on hidden Markov models and selected applications in speech recognition. In: Proceedings of the IEEE, http://www.cs.cmu.edu/~cga/behavior/rabiner1.pdf (1989, accessed 27 April 2019).

43.

Blei

Jordan

. Latent Dirichlet allocation. J Mach Learn Res 2012; 3: 993–1022.

44.

Griffiths

Steyvers

. Finding scientific topics. In: Proceedings of the national academy of sciences, Irvine, CA, 9–11 May 2003, pp. 5228–5235. Irvin, CA: National Academies of Sciences and Engineering.

45.

Cai

Reinanda

Rjike

. Diversifying query auto-completion. ACM T Inform Syst 2016; 34(4): 1–33.

46.

Silverstein

Henzinger

Marais

et al. Analysis of a very large altavista query log. SRC technical note 1998-014, 26 October 1998, https://www.hpl.hp.com/techreports/Compaq-DEC/SRC-TN-1998-014.pdf