A method of query expansion based on topic models and user profile for search in folksonomy

Abstract

With the development of the web 2.0 communities, information retrieval has been widely applied based on the collaborative tagging system. However, a user issues a query that is often a brief query with only one or two keywords, which leads to a series of problems like inaccurate query words, information overload and information disorientation. The query expansion addresses this issue by reformulating each search query with additional words. By analyzing the limitation of existing query expansion methods in folksonomy, this paper proposes a novel query expansion method, based on user profile and topic model, for search in folksonomy. In detail, topic model is constructed by variational antoencoder with Word2Vec firstly. Then, query expansion is conducted by user profile and topic model. Finally, the proposed method is evaluated by a real dataset. Evaluation results show that the proposed method outperforms the baseline methods.

Keywords

Query expansion user profile topic model Word2Vec

1 Introduction

The current information is vast and keeps on increasing explosively. Information retrieval has to face the problem of how to obtain valuable information from the mountains of data effectively. With the maturity of related technology, information retrieval is a major area of research in various areas [1]. However, a user issues a query that is often a brief query with only one or two keywords, which leads to a series of problems like inaccurate query words, information overload and information disorientation [2, 3]. To solve the problems, query expansion approach is proposed. New query is constructed by query expansion (QE) that can provide effectively expansion words. Query expansion is one of the research in the field of information retrieval.

The automatic approaches of query expansion have played a major role in the improvement of efficiency for information retrieval system [4]. Automatic query expansion techniques focus on linguistic, corpus specific, query specific, search log-based and web data-based techniques [5]. In automatic QE, many papers are based on analysis of document corpora (referred as the statistical approach) or knowledge structures (referred as the semantic approach) or hybrid approach [4].

At the same time, more and more social resource sites support tagging mechanism (referred as folksonomy) that provides low interference source of information to construct user profile in recent years [6]. For example, Last.fm allows users to annotate music. Users can annotate and upload their favorite photographs and videos in Flicker. These collaborative tagging systems allow users to annotate resources freely according to their interests. In other words, collaborative tagging systems organize and share tags for all users, besides the tags are more accurate information for construct user profile [7].

However, there is not a hybrid QE approach with user profile for search in folksonomy. A hybrid QE approach combines two or more procedures for obtain an advantage greater than that from applying a single QE approach. And in folksonomy, user profile is constructed by tags accurately to reflect personalized preferences. As a result, it is urgent to propose a hybrid approach with user profile.

The main contributions of this paper are as follows:

This paper reveals and discusses the limitations of main current relevant works on query expansion. A method of query expansion based on topic models and user profile for search in folksonomy is proposed. Topic model is generated by variational autoencoder and Word2Vec. The reasoning and constructing processes of topic model are presented in details. User profile is composed of user tags. For a preliminary evaluation, the proposed method is compared with the state-of-the-art methods in experiments. The experiment results show that the proposed method is more effective.

The rest of the paper is organized as follows. The related research works are reviewed firstly. Then, the method of query expansion based on topic models is described in details. Next, experiments are conducted to compare our proposed method with baseline methods. Finally, this paper is concluded. In order to describe topic model clearly, the notations are summarized in Table 1.

Table 1
Notations description

Notation Description

K The number of topics in corpus

α The prior parameters of document-topic

θ The distribution of document-topic

Z The vector of words-topic

X Input documents

μ _t The mean parameters for inputting documents

σ _t The covariance parameters for inputting documents

$\tilde{X}$ Generate documents

The mean parameters for generating documents

The covariance parameters for generating documents

γ Variational parameter

Φ Variational parameter

Notation	Description
K	The number of topics in corpus
α	The prior parameters of document-topic
θ	The distribution of document-topic
Z	The vector of words-topic
X	Input documents
μ _t	The mean parameters for inputting documents
σ _t	The covariance parameters for inputting documents
$\tilde{X}$	Generate documents
	The mean parameters for generating documents
	The covariance parameters for generating documents
γ	Variational parameter
Φ	Variational parameter

2 Related works

In this section, some representative works on statistical approaches, semantic approaches and hybrid approaches of automatic query expansion are reviewed respectively. Furthermore, query expansion in folksonomy is reviewed.

2.1 Statistical approaches

Statistical query expansion approaches focus on the documents information. The basic idea is to apply statistical approaches to mine word-query association and word-word association. Jin et al. [8] propose a query expansion method that utilizes the tag co-occurrence information to select the most appropriate expansion words. However, Peat et al. [9] analyse the limitations of word co-occurrence data for query expansion in document retrieval systems. Meanwhile, topic model is used in query expansion. Sankhavara et al. [10] report the results on query expansion using topic model. Furthermore, in order to indicate that the statistical method has more performance than others, some different versions of QE have been developed and compared under the categories of lexical, statistical and content-based methods in [11]. Statistical methods and semantic methods have been compared in [12] by using a social media data repository. In addition, Rocchio algorithm [13] is classical method that gets the optimal solution by constructing the prototype vector and is widely used in query expansion.

Statistical query expansion approaches have highlighted mine word-query association, without considering the relevance of the user’s query or intent.

2.2 Semantic approaches

The main drawback of the statistical approaches is it may have more relevance to document collection rather than to the query of user. To address this issue, semantic approaches use semantic knowledge structures to extract meaningful word relationships according to the original search query. Nasir et al. [14] introduce a new method that it employs different types of external information to automate and improve the relevance feed-back process by selecting relevant words from the documents that best match the semantics of the query. Azad et al. [15] propose a novel technique named Wikipedia WordNet based QE technique for query expansion that combines Wikipedia and WordNet data sources to improve retrieval effectiveness. Huang et al. [16] propose a novel QE algorithm with the semantics of change sequences. It not only captures which changes occurred from GitHub commits but also understands why changes occurred with Deep Belief Net-work. Zaiem et al. [17] first built a query expansion training set by using sentence embedding based keyword extraction, and then assess the ability of the Sequence to Sequence neural networks to capture expanding relations in the space of words embeddings. Fang et al. [18] propose a query expansion method that provides an innovative combination of semantic information and the conventional sequential dependence model. Specifically, synonyms are obtained automatically through the word embeddings which are trained on the domain-specific corpus by selecting an appropriate language model. Word2Vec is proposed by Google in 2013, which improves the shortcomings of the previous word vector training models and has attracted extensive attention from industry and academia. In the training process of Word2Vec, the deep learning technique is used to map words into high-dimensional vectors for extracting the feature of words. The semantic similarity can be calculated by Word2Vec, e.g., cat and kitten are close together in the semantic space of word vectors. Zhou et al. [19] uses Word2Vec to train the label data. The labels are mapped to the multi-dimensional semantic space, and the optimal extended words are obtained by computing the vector similarity between query words and other words. Although the Word2Vec has been widely used in query expansion, there are few researches on the combination of topic model.

Semantic approaches have reported improved results in search. However, it can lead to the phenomena of topic drift, which may occur due to addition of some irrelevant concepts to the knowledge base during expansion of the query. When expansion words are more related to individual query words than to the whole query, resulting in irrelevant retrieval of results [20].

2.3 Hybrid approaches

A hybrid query expansion approach combines two or more models in order to obtain the further advantage over a single approach. Hybrid methods achieved the best results on the experimental benchmarks [12]. Esposito et al. [21] propose a hybrid query expansion approach based on lexical resources and word embeddings for question answering systems. Singh et al. [22] capture the limitation of pseudo relevance feedback based query expansion and propose a hybrid method to improve the performance of pseudo relevance feedback based query expansion by combining corpus based word co-occurrence information and semantic information of word. Singh [23] proposes a model for combining multiple expansion words selection methods by using a variety of ranks combining approaches. Then, semantic filtering used to filter out semantically irrelevant words obtained after combining multiple words selection methods. The genetic algorithm used to make an optimal combination of query words and candidate expansion words obtained by applying ranks combination and semantic filtering approach.

Although hybrid query expansion approaches obtain an advantage greater than that from applying a single approach, user profile is not introduced, especially in folksonomy.

2.4 Query expansion in folksonomy

At present, the collaborative tagging system has become popular. However, there are few researches on query expansion in folksonomy.

Kin et al. [24] explore the possibility of social annotation as a new resource for extracting useful expansion words. In the first phase, they propose a word-dependency method to choose the most likely expansion words. In the second phase, they develop a machine learning method for word ranking, which is learnt from the statistics of the candidate expansion words, using ListNet. Bouadjenek et al. [25] propose a new approach to query expansion through the combination of semantic and social information to reduce some of the limitations of the personalization approaches. The similarity between query words and candidate extension words is calculated by the jaccard similarity, and the weight of candidate extension words is calculated in the user personalized preference model.

Although there are methods to implement query word extensions in social tagging systems, there is not a hybrid approach with user profile to construct query expansion in folksonomy.

This paper uses a novel hybrid query expansion method, based on topic model. In detail, topic model is constructed by variational antoencoder with Word2Vec firstly. User profile is consisted of user tags from folksonomy. Then, query expansion is conducted by user profile and topic model.

3 Methods

In this section, a hybrid query expansion approach with user profile is described in detail. First topic model and user profile are constructed, respectively. In particular, topic model is generated by variational autoencoder and Word2Vec, and user profile is composed of user tags. Then, query expansion is implemented by user profile and topic model.

3.1 Topic model

Topic model is a research of natural language processing and information retrieval in recent years [26]. The most classic topic model is Latent Dirichlet Allocation (LDA) that is a generative probabilistic model of a text corpus. The idea of LDA is that documents are represented as random mixtures over latent topics, and each topic is characterized by a distribution over words [27]. A common special case is the symmetric Dirichlet distribution, where all of the elements making up the parameter vector α have the same value. The generative process of LDA is as follows:

For each topic k ∈ { 1, … K } :

Generate β_k = ∼ Dirichlet (η)

For each document (d):

Generate θ ∼ Dirichlet (α).

For each of the N words w_d,n:

Generate a topic z_n ∼ Multinomial (π (θ)).

Draw w_d,n ∼ Multinomial (π (β_{z
_n})).

where β is the topic-word distribution for a corpus, θ is topic distribution for documents, z is the vector of topic-words and N is the number of words in document.

According to the generation process of LDA [27], the marginal likelihood function of generate documents $\tilde{X}$ is variational target, which is shown in Equation (1). $logp (\tilde{X} | z, θ, α) = \log \int_{θ} p (\tilde{X} | z; θ) p (z, θ | α) d θ$ (1)

However, all variables are independent in the process of constructing topic model by variational autoencoder in this paper. Approximate distribution function q uses the another two variational parameters γ and Φ to approximate θ and z. The approximate function can be written as q (θ, z|Φ, γ), the derivation of the variational objective function is shown in Equation (2) according to the Jensen’s inequality.

$\begin{matrix} logp (\tilde{X} | z, θ, α) \\ = \log \int_{θ} p (\tilde{X} | z; θ) p (z, θ | α) d θ \\ = \log \int_{θ} q (θ, z | Φ, γ) \frac{p (\tilde{X} | z; θ) p (z, θ | α)}{q (θ, z | Φ, γ)} d θ \\ ⩾ \int_{θ} q (θ, z | Φ, γ) logp (\tilde{X} | z; θ) \\ + \int_{θ} q (θ, z | Φ, γ) \log \frac{p (z, θ | α)}{q (θ, z | Φ, γ)} d θ \\ = \int_{θ} q (θ, z | Φ, γ) logp (\tilde{X} | z; θ) \\ - \int_{θ} q (θ, z | Φ, γ) \log \frac{q (θ, z | Φ, γ)}{p (z, θ | α)} d θ \\ = E_{q (θ, z | Φ, γ)} [logp (\tilde{X} | z; θ)] \\ - KL [q (θ, z | Φ, γ) | | p (z, θ | α)] \\ ≜ L (q, θ, z; Φ, γ) \end{matrix}$ (2) where $L (q, θ, z; Φ, γ)$ is the variational lower bound of variational target. The first part is the expectation of reconstruction error and the second part is the Kullback–Leibler divergence (KL-divergence) of the approximate posterior distribution from the true posterior distribution. In order to find the best q (θ, z|Φ, γ) which should be able to approximate the true posterior distribution p (z, θ|α), the most common approximate calculation of variational target uses the KL-divergence between q (θ, z|Φ, γ) and p (z, θ|α). That is, the aim is to get Minimized KL-divergence between q (θ, z|Φ, γ) and p (z, θ|α) that is argminKL [q (θ, z|Φ, γ) ||p (z, θ|α)].

Topic model is constructed by using variational autoencoder that is composed of encoder and decoder. Encoder is to generate a variational approximate posterior distribution q (θ, z|Φ, γ) of the document-topic. Decoder estimates the optimal generation probability $p (\tilde{X} | z; θ)$ with q (θ, z|Φ, γ) known. The network structure of variational autoencoder of topic model is shown in Fig. 1. The model assumes q (θ, z|Φ, γ) obeys diagonalization of the covariance matrices of the Gaussians distribution. The encoder trains the potential mean and variance for the vector of topic-words z. In order to generate documents $\tilde{X}$ that are as close to the input documents X, decoder estimates the better probabilistic model $p (\tilde{X} | θ, z)$ according to the known q (θ|X). The generation process of θ needs to use backpropagation to calculate the gradient of the error function. Generally, an expected cost is minimized by using gradient descent. The generation process of θ adopts the re-parametrization trick that is used to calculate the gradient. Graphical model of the re-parametrization trick is shown in Fig. 2.

Fig. 1

Network structure of variational autoencoder of topic model.

Fig. 2

Graphical model of the re-parametrization trick.

In Fig. 2, z ∼ q (θ|ɛ, σ) can be reparameterized as a Gaussian function and a random variable $ɛ \sim N (0, I)$ so that ɛ is a constant, then θ = λ (μ + σ * ɛ), where λ (·) is the softmax activation function. Therefore, μ and σ only involve linear operation that can be optimized by stochastic gradient descent algorithm.

It is difficult to re-parameterize the Dirichlet distribution. This paper uses Gaussian distribution to instead of Dirichlet distribution, that is $p (Z, θ | α) = N (θ | μ^{'}, σ^{'})$ and $q (θ, Z | Φ, γ) = N (θ | μ, σ)$ . Therefore, for the second part of variational target, the KL-divergence between q (θ, z|Φ, γ) and p (z, θ|α) can be derived as follows:

$\begin{matrix} KL [q (θ, Z | Φ, γ) | | p (Z, θ | α)] \\ = \frac{1}{2} E_{q_{φ}} [- log | σ^{'} | - {(x - μ^{'})}^{T} σ^{' - 1} (x - μ^{'}) \\ + log | σ | + {(x - μ)}^{T} σ^{- 1} (x - μ)] \\ = \frac{1}{2} \log \frac{| σ |}{| σ^{'} |} + \frac{1}{2} E_{q_{φ}} [- {(x - μ^{'})}^{T} σ^{' - 1} (x - μ^{'}) \\ + {(x - μ)}^{T} σ^{- 1} (x - μ)] \\ = \frac{1}{2} \log \frac{| σ |}{| σ^{'} |} + \frac{1}{2} E_{q_{φ}} {- tr [σ^{' - 1} (x - μ^{'}) {(x - μ^{'})}^{T}] \\ + tr [σ^{- 1} (x - μ) {(x - μ)}^{T}]} \\ = \frac{1}{2} \log \frac{| σ |}{| σ^{'} |} + \frac{1}{2} E_{q_{φ}} {- tr [σ^{' - 1} (x - μ^{'}) {(x - μ^{'})}^{T}]} \\ + \frac{1}{2} E_{q_{φ}} {tr [σ^{- 1} (x - μ) {(x - μ)}^{T}]} \\ = \frac{1}{2} \log \frac{| σ |}{| σ^{'} |} - \frac{1}{2} tr {E_{q_{φ}} [σ^{' - 1} (x - μ^{'}) {(x - μ^{'})}^{T}]} \\ + \frac{1}{2} tr {E_{q_{φ}} [σ^{- 1} (x - μ) {(x - μ)}^{T}]} \\ = \frac{1}{2} \log \frac{| σ |}{| σ^{'} |} - \frac{1}{2} tr {σ^{' - 1} E_{q_{φ}} [(x - μ^{'}) {(x - μ^{'})}^{T}]} \\ + \frac{1}{2} tr {E_{q_{φ}} [σ^{- 1} ({xx}^{T} - μ x^{T} - x μ^{T} + μ μ^{T})]} \\ = \frac{1}{2} \log \frac{| σ |}{| σ^{'} |} - \frac{1}{2} tr {σ^{' - 1} σ^{'}} \\ + \frac{1}{2} tr {σ^{- 1} E_{q_{φ}} ({xx}^{T} - μ x^{T} - x μ^{T} + μ μ^{T})} \\ = \frac{1}{2} \log \frac{| σ |}{| σ^{'} |} - \frac{1}{2} K + \frac{1}{2} tr \\ {σ^{- 1} (σ^{'} + μ^{'} μ^{' T} - μ μ^{' T} - μ^{'} μ^{T} + μ μ^{T})} \\ = \frac{1}{2} {\log \frac{| σ |}{| σ^{'} |} - K + tr (σ^{- 1} σ^{'}) \\ + tr {σ^{- 1} (μ^{'} μ^{' T} - μ μ^{' T} - μ^{'} μ^{T} + μ μ^{T})}} \\ = \frac{1}{2} {\log \frac{| σ |}{| σ^{'} |} - K + tr (σ^{- 1} σ^{'}) \\ + tr {σ^{- 1} μ' μ'^{T} - σ^{- 1} μ μ'^{T} - σ^{- 1} μ' μ^{T} + σ^{- 1} μ μ^{T}}} \\ = \frac{1}{2} {\log \frac{| σ |}{| σ^{'} |} - K + tr (σ^{- 1} σ') \\ + tr {μ'^{T} σ^{- 1} μ^{'} - 2 μ'^{T} σ^{- 1} μ + μ^{T} σ^{- 1} μ}} \\ = \frac{1}{2} {\log \frac{| σ |}{| σ' |} - K + tr (σ^{- 1} σ') \\ + {(μ - μ')}^{T} σ^{- 1} (μ - μ')} \end{matrix}$ (3)

For p (Z, θ|α), it uses the Laplace approximation with the property that the covariance matrix becomes diagonal for large K (number of topics). This approximation to the Dirichlet prior p (Z, θ|α) is results in the distribution over the softmax variables as a multivariate normal with the mean μ′ and covariance matrix σ′ that are represented in Equations (4) and (5) respectively according to literatures [28] and [29]. $μ_{k}^{'} = \log α_{k} - \frac{1}{K} \sum_{i = 1}^{K} \log α_{i}$ (4) $σ_{kk}^{'} = \frac{1}{α_{k}} (1 - \frac{2}{K}) + \frac{1}{K^{2}} \sum_{i = 1}^{K} \frac{1}{α_{i}}$ (5)

As loss function, the final variational target is shown in Equation (6).

$\begin{matrix} L (Θ) & = E_{Q (θ | μ, σ)} [logp (\tilde{X} | Z; θ)] \\ - KL [q (θ, Z | Φ, γ) | | p (Z, θ | α)] \\ = E_{Q (θ | μ, σ)} [logp (\tilde{X} | Z) p (Z | θ) p (θ | α)] \\ - KL [q (θ, Z | Φ, γ) | | p (Z, θ | α)] \\ = E_{Q (θ | μ, σ)} [logp (\tilde{X} | Z) p (Z | θ) λ (μ + σ * ɛ)] \\ - \frac{1}{2} {\log \frac{| σ' |}{| σ |} - K + tr (σ'^{- 1} σ) \\ + {(μ^{'} - μ)}^{T} σ'^{- 1} (μ^{'} - μ)} \end{matrix}$ (6)

The topic model is considered from a statistical point of view and often ignores the semantic relationship between words. In this paper, Word2Vec is used to enhance the semantic of the topic model. This paper adopts Skip-gram model to predict the contexts of a given target-word. The specific generation method can be referred to the literatures [30]. A word similarity matrix is obtained by calculating the cosine similarity between word vectors.

Some semantically relevant words can be used to instead the words that are irrelevant to the topics. During the training process of the topic model, this paper uses non-parametric sampling method. The relevance of word w_j and topic k is denoted by Correlation (w_j, z = k), which is shown in equation (7).

$\begin{matrix} Correlation (w_{j}, z = k) \\ = \frac{p (z = k) p (w_{j} | z = k)}{\sum_{i = 1}^{K} p (z = i) p (w_{j} | z = i)} \\ * \frac{p (w_{j} | z = k)}{\sum_{j = 1}^{N} p (z = i) p (w_{j} | z = k)} \end{matrix}$ (7) where $\frac{p (z = k) p (w_{j} | z = k)}{\sum_{i = 1}^{K} p (z = i) p (w_{j} | z = i)}$ is the probability of word w_j in topic k, K is the number of topics, $\frac{p (w_{j} | z = k)}{\sum_{j = 1}^{N} p (z = i) p (w_{j} | z = k)}$ is the proportion of word w_j in topic k, N is the number of words in topic k. The higher value of Correlation (w_j, z = k) indicates the more relevance of word w_j and topic k.

This paper chooses the word w_j that is the largest proportion of the topic k, then the word w_i is sampled by equation (8).

$\begin{matrix} Semantic (w_{i}) = (Correlation (w_{j}, z = k) \\ - Correlation (w_{i}, z = k)) * sim (w_{i}, w_{j}) \end{matrix}$ (8) where sim (w_i, w_j) is from the word similarity matrix.

It should be noted that the Skip-gram model is not introduced for each iteration until topic model becomes stable. If the Skip-gram model is introduced at the beginning, topic model may not converge.

3.2 Query expansion

The optimal extension word is chosen by the topic similar of words in topic model. In the topic model, if the KL distance between the topic distributions of two words is smaller, the two words are considered to have a higher similarity. At the same time, query expansion combines with user profile that is considered as a technique to measure the degree a user is interested in a feature (tag). Based on the above illustrations, the following assumptions is proposed.

Assumption 1. For two words Topicword₁ and Topicword₂, in topic model, higher similarity indicates the more relevance.

Assumption 2. For a query vector $\vec{q}$ , in topic model, if the word Topicword_t has more relevance with most words of query vector $\vec{q}$ , then Topicword_t has more relevance with query vector $\vec{q}$ .

Assumption 3. For a user profile User_i, in topic model, if the word Topicword_t has more relevance with most words of user profile User_i, then Topicword_t has more relevance with user profile User_i.

Assumption 4. For a user profile User_i and a query vector $\vec{q}$ , in topic model, if the word Topicword_t has the more relevance with user profile User_i and query vector $\vec{q}$ at the same time, then Topicword_t is a candidate query extension word.

Based on assumptions 1-4, the process of query extension is described as follows.

A user issues a query that is pre-processed by performing tokenization, processing lemmatization and deleting stop words. The query vector $\vec{q}$ is generated as $\vec{q} = (q_{1} : w_{1}, \dots, q_{j} : w_{j}, \dots, q_{m} : w_{m})$ .

where m is the total number of words in query, q_j is a word and w_j is the weight of the word q_j, w_j is calculated by (9). $w_{j} = \frac{| q_{j} |}{| \vec{q} |}$ (9) where |q_j| is the number of word q_j, $| \vec{q} |$ is the total number of words in $\vec{q}$ .

According to the query vector $\vec{q}$ and topic model, the relevance between word Topic_t and query vector $\vec{q}$ is calculated by the distribution of topic similarity. $δ_{t} = \sum_{j = 1}^{m} (w_{j} * sim ({Topic}_{t}, q_{j}))$ (10) where Topic_t is the tth word of topic model, δ_t is the relevance between the tth word and query vector $\vec{q}$ . After descending order, the first P topic words are chosen as the candidate query extension words, denoted by Extension_P.

For each word in Extension, the relevance between the pth word and user profile User_i is denoted by φ_ip, which is calculated by equation (11). $φ_{ip} = \sum_{k = 1}^{n} (v_{i, k} * sim ({Extension}_{P}, t_{i, k}))$ (11) where t_i,k is the kth tag that is annotated by user i, v_i,k is the weight of tag t_i,k and n is the total number of tags that user ihas annotated. $v_{i, k} = \frac{N_{i, k}}{N_{i}}$ (12) where N_i is the total number of tags that is annotated by user i, N_i,k is the number of tag t_i,k that is annotated by user i. The interests of user can be reflected by the personalized weight [31].

Based on the two results of equation (11) and equation (12), the final value of the pth candidate extension word is calculated by equation (13). ${Rank}_{p} = γ * δ_{p} + (1 - γ) * φ_{ip}$ (13) where γ is parameter, γ ∈ [0, 1].

The candidate word with the highest value is selected as the final extension word. A query is issued for searching the extension word and query word together. This query extension method can alleviate the impact of query word mismatch and improve the search quality.

In addition, according to the content of research literatures [25, 26], the number of query extension words in this paper is decided as follows: If the query contains the number of non-repeating words m⩽3, then the number of extension words is m. If the query contains the number of non-repeating words m>3, then the number of extension words is 3.

4 Experimental results and analysis

This section conducts experiments to verify the effectiveness of the proposed method in a large real dataset. This section describes dataset and metrics used. Performance improvement of the method is compared with some query expansion methods.

4.1 Experiment environment

This section uses TensorFlow 1.4.0 with CPU support only and Python 3.6 for the proposed method implementation and the experiments are run on a single processor machine equipped with an AMD FX-Series CPU FX-8350 @4.0GHz, the processor having eight physical computing cores, 24 GB DDR3 RAM memory @1600MHz. The machine runs on Windows7 64 bits.

4.2 Dataset

Although some social resource sites support tagging mechanism, it is difficult to find more public and real datasets to evaluate the proposed model. The paper uses a benchmark dataset to evaluate the proposed method: The MovieLens-20M datasets (https://grouplens.org/datasets/movielens/) collected by the GroupLens Web site. The MovieLens-20M dataset has 465,564 tags, annotated by 138,493 users on 27,278 movies. Every movie is annotated by 17.06 tags averagely. In order to avoid the influence of tag data sparsity, more than 20 tags of movies and users are chosen for constructing topic model and user profile, respectively.

The MovieLens-20M dataset is randomly divided into training and test sets. For the dataset, 80% of the dataset is chosen in the training set, while the rest of the dataset is retained in the test set.

4.3 Evaluation metrics

In order to determine the optimal number of topics for the topic model, the Normalized Point-wise Mutual Information (NPMI) is used to evaluate the qualitative of topics. NPMI is close to the ordinary understanding of people [32]. NPMI (k) can express the coherence of topic k, and NPMI (k) is shown in Equation (14). $NPMI (k) = \sum_{i = 1}^{M - 1} \sum_{j = i + 1}^{M} \frac{\log \frac{p (w_{i}, w_{j})}{p (w_{i}) p (w_{j})}}{- logp (w_{i}, w_{j})}$ (14) where M is the number of topic words in topic k. w_i and w_j are the ith topic word and the j th topic word respectively. p (w_i), p (w_j) and p (w_i, w_j) are the probability of the ith topic word, the probability of the jth topic word and the co-occurrence probability of the ith and thejth topic word in topic k. The optimal number of topics depends on the average topic coherence. The higher value the average of NPMI is, the better the topic model is.

The evaluation of query expansion is a professional human judge that classifies the generated query expansion words into five types: excellent, good, fair, bad, and not sure [33]. This paper gets 500 query expansion words.

Two metrics are used for the evaluation of search. The first one is MRR (Mean reciprocal rank) that measure ranked query results. The reciprocal rank of a query result is computed while the first correct relevant resource is retrieved. The value of MRR is the average of the reciprocal ranks of results for a set of queries, as defined by Equation (15). $MRR = \frac{1}{Q} \sum_{i - 1}^{Q} \frac{1}{{Rank}_{i}}$ (15) where Q is the times of queries, Rank_i is the rank position of the first correct relevant resource for the i-th query. MRR emphasizes the significance of the first correct relevant resource at the top of the result list. Users get the relevant resource faster when MRR is higher.

Imp is a widely adopted metric to measure how the proposed method is better than baseline methods, as defined by Equation (16). $Imp (q_{i}) = \frac{1}{r_{p}} - \frac{1}{r_{b}}$ (16)

A user issues a query (q_i), r_p is the ranking of the relevant resource in the result list by the proposed method, while r_b is the same by a baseline method. Thus the average of imp shows the overall improvement in the test data, as defined by Equation (17). $Imp = \frac{\sum_{i = 1}^{Q} Imp (q_{i})}{Q}$ (17) where Q is the times of queries. The higher value the Imp is, the greater improvement of the ranking result by the proposed method there appears.

4.4 Baseline Methods

To evaluate the efficiency of the proposed approach, some baseline methods are chosen to compare with the proposed method. The first baseline method uses Word2Vec to train the label data directly [19], denoted by Baseline1. The labels are mapped to the multi-dimensional semantic space, and the optimal extended words are obtained by computing the vector similarity between query words and other words. The second one constructs the topic model by the variational auto-encoder, and the optimal extended words are obtained by computing the topic similarity between query words and other words, denoted by Baseline2. The next one uses user profile to expand the query based on Baseline2, denoted by Baseline3. This paper constructs topic model that is based on Word2Vec, denoted by Baseline4. A query expansion approach is proposed based on topic models, Word2Vec and user profile, denoted by PQE. In addition, in order to show that query extension can improve search quality, a personalized search method is compared with query extension methods, denoted by Original-Q [31].

4.5 Comparison with Baseline Methods

First, the key of topic model is to determine the optimal number of topics. The NPMI is used to calculate the average topic relevance under different number of topics from 40 to 100. The result is shown in Fig. 3.

Fig. 3

Average topic relevance under different number of topics.

In Fig. 3, the number of topics begins to increase from 40, the average topic relevance gradually increases to the highest value 0.149 when the number of topics is 70. The average topic relevance decreases slowly when the number of topics is from 70 to 90. And then, the average topic relevance decreases rapidly to 0.1037, The average topic relevance is high and stable when the number of topics is from 70 to 90 that is appropriate for the experimental dataset. In order to obtain the optimal topic model, 70 is the optimal number of topics in this paper.

To determine the optimal parameter γ in Equation (13), 500 extension words are evaluated by the method of professional human judge. The larger number of extension words has less error of subjective judgment. The range of γ is from 0.1 to 0.9, and the step size is set to 0.2. The number of extension words in five criteria of excellent, good, fair, bad, and not sure under different parameter values is shown in Table 2.

Table 2

Five criteria under different parameter values

	γ=0.1	γ=0.3	γ=0.5	γ=0.7	γ=0.9
Excellent	146	153	158	163	154
Good	194	192	193	196	191
Fair	45	44	43	39	42
Bad	53	49	45	41	40
Not Sure	62	62	63	61	71

The more excellent and good extension words show that the parameter γ is better for query extension. In Table 2, as the parameter γ increases, the number of excellent and good extension words grows. When γ =0.7, the numbers of excellent and good extension words reach the highest values at 163 and 196 respectively. The result indicates that the similarity between extension words and query vector is more important, and the role of user profile cannot be ignored.

At the same time, the less fair and bad extension words show that the parameter γ is better for query extension. In Table 2, the number of fair extension words reaches the least value when γ =0.7. The number of bad extension words reaches the least value at 41 and 40 when γ =0.7 and γ =0.9. The number of extension words in not sure is a general reference that has the least influence on the parameter selection.

According to the experimental results of the training set, this paper chooses the optimal parameter value γ=0.7.

Then, a series of experiments are conducted on test dataset. All methods on MRR metric are shown in Fig. 4.

Fig. 4

Comparison of the proposed method and baseline methods on MRR.

According to Fig. 4, PQE has the highest MRR value at 0.2344, while the other methods are at 0.1844, 0.2112, 0.1869, 0.2233 and 0.2211, respectively. PQE outperforms the baseline methods. Furthermore, on imp metric, PQE outperforms the method of Original-Q by 5%, the method of baseline1 by 2.32%, the method of baseline2 by 4.75%, the method of baseline3 by 1.11% and the method of baseline4 by 1.33%.

All query extension methods outperform Original-Q, proving that query extension can improve search quality. Query extension with Word2Vec outperforms topic model by comparing baseline1 and baseline2. The result of query extension using topic model with Word2Vec (baseline4) is little difference from using topic model with user profile (baseline3), and they outperform the methods of query extension that only uses Word2Vec (baseline1) or topic model (baseline2). On the whole, the best one is PQE that is constructed by topic model, Word2Vec and user profile.

5 Conclusions

This paper proposes a query expansion approach based on topic model and user profile for search in details. Topic model is constructed by variational antoencoder with Word2Vec, and query expansion is conducted by user profile and topic model. Finally, the PQE approach is evaluated by a real dataset. Experimental results demonstrate that the proposed method outperforms the baseline methods. The accuracy of search using the proposed method is over 1% higher than baseline methods. The largest advantage of this work is in solving the problem that is not a hybrid approach with user profile to construct query expansion in folksonomy.

The main limitation of the proposed method is the process of constructing user profile without the rating score. The future research will focus on the following three directions. In order to improve efficiency, the research will explore the parallel variational autoencoder to construct topic model. This research will construct the more accurate user profile with the rating tags and scores to improve the efficiency of query expansion. This research will also apply PQE approach in recommender systems, text classification, and other fields.

Footnotes

Acknowledgments

This research was funded by the Scientific Research Project of Hebei Education Department of China (Grant No. QN2020198).

References

Mabrouk

, Rady;

, Badr

and Khalifa

M.E.

, A survey on information retrieval systems’ modeling using word dependencies and word weighting, In Proc ICICIS Cairo, Egypt, 2017, pp. 321–328.

Teevan

, Ramage

and Morris

M.R.

, #TwitterSearch: a comparison of microblog search and web search, in Proc. WSDM, Hong Kong, China, 2011, pp. 35–44.

Spink

, Wolfram

, Jansen

M.B.

and Saracevic

, Searching the web: The public and their queries, J Assoc Inf Sci Technol 52(3) (2001), 226–234.

Raza

M.A.

, Mokhtar

and Ahmad

, A survey of statistical approaches for query expansion, Knowl Inf Syst 61(1) (2019), 1–25.

Carpineto

and Romano

, A survey of automatic query expansion in information retrieval, ACM Comput Surv 44(1) (2012), 1–50.

Gou

, Han

, Zhu

, Yang

and Duan

, Personalized search by a multi-type and multi-level user profile in folksonomy, Arab J Sci Eng 43 (2018), 7563–7572.

Vallet

, Cantador

and Jose

J.M.

, Personalizing web search with folksonomy-based user and document profiles, in Proc. ECIR 2010, Milton Keynes, UK, 2010, pp. 420–431.

Jin

, Lin

and Su

, Query expansion based on folksonomy tag co-occurrence analysis, in Proc GrC, Nanchang, China, 2009, pp. 17–19.

Peat

H.J.

and Willett

, The limitations of word co-occurrence data for query expansion in document retrieval systems,pp, J Am Soc Inf Sci 42(5) (1991), 378–383.

10.

Sankhavara

and Majumder

, Team DA IICT at clinical decision support track in TREC 2016: topic modeling for query expansion, in Proc. TREC, Gaithersburg, Maryland, USA, 2016.

11.

Natsev

A.P.

, Haubold

, Teši'c

, Xie

and Yan

, Semantic concept-based query expansion and reranking for multimedia retrieval, In Proc. MM. New York, NY, USA, 2007, pp. 991–1000.

12.

Mahdabi

and Popescu-Belis

, Comparing two strategies for query expansion in a news monitoring system, In Proc. NLDB, Salford, UK, 2016, pp. 267–275.

13.

Rocchio

J.J.

, Relevance feedback in information retrieval, In the SMART retrieval system: Experiments in automatic document processing, 1971, pp. 313–323.

14.

Nasir

J.A.

, Varlamis

and Ishfaq

, A knowledge-based semantic framework for query expansion, Inf Process Manag 56(5) (2019), 1605–1617.

15.

Azad

H.K.

and Deepak

, A new approach for query expansion using Wikipedia and WordNet, Inf Sci 492 (2019), 147–163.

16.

Huang

, Yang

and Cheng

, Deep learning the semantics of change sequences for query expansion, Softw Pract Exp 49(11) (2019), 1600–1617.

17.

Zaiem

and Sadat

, Sequence to sequence learning for query expansion, In Proc AAAI, Honolulu, Hawaii, USA, 2019, pp. 10075–10076.

18.

Fang

, Zhang

and Yin

, Semantic sequential query expansion for biomedical article search, IEEE Access 6 (2018), 45448–45457.

19.

Zhou

, Wu

, Zhao

, Lawless

and Liu

, Query expansion with enriched user profiles for personalized search utilizing folksonomy data, IEEE Trans Knowl Data Eng 29(7) (2017), 1536–1548.

20.

Mahler

, Holistic query expansion using graphical models, New Dir Quest Answ, (2004), pp. 203–227.

21.

Esposito

, Damiano

, Minutolo

, Pietro

G.D.

and Fujita

, Hybrid query expansion using lexical resources and word embeddings for sentence retrieval in question answering, Inf Sci 514 (2020), 88–105.

22.

Singh

and Sharan

, Co-occurrence and semantic similarity based hybrid approach for improving automatic query expansion in information retrieval, In Proc. ICDCIT, Bhubaneswar, India, 2015, pp. 415–418.

23.

Singh

, Ranks aggregation and semantic genetic approach based hybrid model for query expansion, Int J Comput Intell Syst 10(1) (2017), pp. 34–55.

24.

Lin

, Lin

, Jin

and Ye

, Social annotation in query expansion: a machine learning approach, In Proc SIGIR, Beijing, China, 2011, pp. 405–414.

25.

Bouadjenek

M.R.

, Hacid

, Bouzeghoub

and Daigremont

, Personalized social query expansion using social bookmarking systems, In Proc SIGIR, Beijing, China, 2011, pp. 1113–1114.

26.

Figuerola

C.G.

, Marco

F.J.G.

and Pinto

, Mapping the evolution of library and information science -using topic modeling on LISA, Scientometrics 112(3) (2017), 1507–1535.

27.

Blei

D.M.

, Ng

A.Y.

and Jordan

M.I.

, Latent Dirchlet Allocation, J Mach Learn Res 3 (2003), 993–1022.

28.

Srivastava

and Sutton

, Autoencoding variational inference for topic models, In Proc ICLR, Toulon, France, 2017.

29.

Hennig

, Stern

, Herbrich

and Graepel

, Kernel topic models, In Proc AISTATS, La Palma, Canary Islands, Spain, 2012, pp. 511–519.

30.

Fernández-Reyes

F.C.

, Valadez

J.H.

and Montes-y-Gómez

, A prospect-guided global query expansion strategy using word embeddings, Inf Process Manag 54(1) (2018), 1–13.

31.

Cai

and Li

, Personalized search by tag-based user profile and resource profile in collaborative tagging systems, in Proc CIKM, Toronto, Canada, 2010, pp. 969–978.

32.

Lau

J.H.

, Newman

and Baldwin

, Machine reading tea leaves: automatically evaluating topic coherence and topic model quality, in Proc EACL, Gothenburg, Sweden, 2014, pp. 530–539.

33.

Lee

, Gao

and Zhang

, Rare query expansion through generative adversarial networks in search advertising, In Proc KDD, London, UK, (2018), pp. 500–508.