Abstract
Query intents describe user information needs for searching on the web. How to capture the query intents is a crucial research topic in information retrieval. Search engine users always employ insufficient or unclear words as queries, thus making query intents ambiguous and uncertain to be interpreted by search engines. Query intent classification can deal with the problem by clarifying user queries and interpreting information needs for improving user satisfaction. Two main challenges have been addressed to classify query intents: one is how to effectively represent short and ambiguous queries; the other is how to generate a set of appropriate categories for matching diverse queries. In the paper, we propose a hybrid deep neural network model for query intent classification to meet the challenges. Our model adopts two state-of-the-art neural network models to comprehensively represent queries as feature vectors. We then employ query logs to automatically generate intermediate categories for fine-grained query intent clarification. Experimental results show that our method can outperform other baseline models, and effectively improve the performance in query intent classification.
Keywords
Introduction
With the increasing amount of information on the Internet, search engines become indispensable tools in our daily life. Information retrieval (IR), as the core technologies of search engines, seeks to comprehend user information needs and provide relevant documents for given queries. However, submitted queries for search engines may ambiguously describe the real user intents, thus hurting retrieval performance. For example, the intents for searching the query ‘apple’ may be associated with the fruit apple or associated with electronic products produced by the Apple Company. To better interpret queries, it becomes necessary to distinguish different query intents before retrieving and ranking candidate websites or documents.
To tackle the problem, query intent classification has become a crucial research issue. Query intent classification aims to map queries with predefined query categories so as to eliminate the ambiguity and uncertainty of original queries [1]. Although query intent classification is essentially a text classification problem, it has certain unique characteristics, which makes this task more challenging. Specifically, the average length of queries, compared to documents, tends to be very short. Therefore, existing methods for document classification can hardly be directly used for query intent classification. Moreover, there is a lack of labeled data for training the classifiers, because it is difficult to annotate the real query intents without enough contexts. To this end, it remains two research issues on query intent classification: effective representations of queries and automatic label strategy on query intents.
Existing research on query intent classification has addressed these two issues and carried out by directly using supervised machine learning methods [2–7]. Most studies have transformed the problem to a document classification problem through pseudo-relevance feedback. Pseudo-relevance feedback assumes that top-ranked documents from initial retrieval are highly correlated with the original query. The top-ranked documents can therefore be treated as feedback to interpret user queries for mining the real query intents. The shortcoming for these studies lies in their incapability to cope with constantly changing categorizations of query intents due to the diverse information needs of users.
To capture the constantly changing intents, recent studies have proposed to adopt two-stage classification methods to improve the performance [8, 9]. In these studies, supervised methods were used to train classifiers for mapping queries to intermediate categories, and feedback documents were used to map queries in the intermediate categories with the target categories. These two-stage methods can deal with constantly changing categories of query intents by introducing intermediate categories, thus improving the classification accuracy. These studies motivate us to capture query intents in two stages for accurately understanding information needs of users. However, current studies still suffer from the difficulty in automatically generating the intermediate categories to reduce manual labor that is time and cost expensive.
To this end, we propose a hybrid deep neural network model for query intent classification. Our model leverages the state-of-the-art recursive and recurrent neural network models to implicitly construct a two-stage query intent classifier for query representations. To reduce manual costs of labeling, we introduce implicit relevance feedback based on query logs to automatically capture real query intents of users, which can also deal with the constantly changing categorizations of query intents. Experimental results show that our model outperforms baseline models in query intent classifications.
We summarize the contributions of our work as follows.
1) We propose a novel two-stage framework for query intent classification based on hybrid deep neural network models and query representations. In the first stage, we represent queries using word embedding-based features and syntactic features, and then encode the query representations using the recurrent and recursive neural networks, respectively, for mapping queries to intermediate categories. In the second stage, we concatenate the learned query representations and intermediate classification results as inputs, and use a multi-layer perceptron for classification in target categories.
2) We propose to generate intermediate categories of query intents based on implicit relevance feedback from query logs, which largely reduces manual labor in annotating query intents for intermediate classification. Our method adopts Open Directory Project (ODP) to automatically match the queries to their corresponding clicked web pages from query logs, and treats the categories of web pages from ODP as the intermediate categories of queries for effective classification of query intents.
3) We conduct experiments on a retrieval dataset, and experimental results show that our method outperforms the baseline methods in query intent classification. In the experiments, we use AOL query logs and ODP to generate intermediate categories, and classify queries based on two proposed hybrid neural networks. We then use TREC queries for the final stage classification based on Wikipedia. Experimental results demonstrate the effectiveness of the proposed method.
The rest of the paper is organized as follows. Section 2 introduces the related research work. Section 3 details the general framework of the proposed method. Section 4 introduces the query term feature extraction, and Section 5 provides details on the classification models. Section 6 reports the experimental results with in-depth analysis and discussions. Section 7 concludes this paper and provides future directions.
Related work
Query classification, as one of the most crucial issues in information retrieval, has attracted much attention in recent years [10–13]. Early research has attempted to enrich the feature representations of queries [8, 15]. For example, Shen et al. [8] won the 2005 KDD Cup for query enrichment by training document classifiers based on the search results. They extracted document features based on the titles, snippets and pages of documents for document classification, and then mapped the document taxonomy to the target taxonomy of queries. The quality of extracted features affects classification performance. Since queries usually contain only a few key words, how to generate effective features remains a great challenge.
Some studies have exploited external resources, such as the query logs, for generating effective query features. For example, Ashkan et al. [16] used clickthrough logs, query specific information and search results to capture commercial query intents. Radlinski et al. [17] identified the popular meanings of web queries using search logs and user click behaviors for more complete and user-oriented intents. Law et al. [18] designed a new online game to collect data on user query intents for reducing manual labor in annotating the intents for supervised training. Jiang et al. [19] mined query intents from query logs from a multi-dimensional perspective involving the URL, session and terms of queries. Results-oriented and topic-oriented frameworks were also proposed and investigated in their study. Sakar et al. [12] accurately predicted the online shopping intents of visitors using multilayer perceptron and LSTM recurrent neural networks. Yao et al. [20] developed a convolutional neural network-based ranker by integrating query-specific preferences with learning-to-rank methods. Singh et al. [11] presented a novel query expansion method based on fuzzy logic to select effective expansion terms from feedback documents. Ding et al. [21] propose an efficient query processing optimization approach based on extreme learning machine in ComMapReduce framework. Puigcerver et al. [10] tackled the problem of out-of-vocabulary query keywords using word similarity. These studies have implied that external resources are valuable in identifying query intents of users. Meanwhile, limited labeled data in query intent mining have posed another big challenge in improving the performance. Various methods have been proposed to meet the challenge. For instance, Li et al. [22] proposed two semi-supervised methods to augment labeled queries for query classification, in which clickthrough data were treated as graphs to enrich query features. Ayadi et al. [23] defined a list of generic and specific medical query features and exploited them in an association rule mining technique to discover correlations between query features and image retrieval models. Kim et al. [24] applied the matrix sketching algorithm to improve the scalability of query intent classification with limited labeled data. Abid et al. [13] provided a comprehensive review on search results diversification techniques, and pointed out user query intents are important for detecting subtopics of resultant documents. These studies motivate us to reduce manual labor in intent annotation using feedback from query logs.
In addition, query intent classification has also been addressed in different tasks. For example, Barua et al. [25] proposed to use suggested queries from search engines as external knowledge sources for named entity classification. Jiang et al. [26] addressed the session-based query auto-completion in terms of intent classification. They adopted learning to rank models to predict user intents and an ensemble model to combine the proposed features and conventional approaches. Khudabukhsh et al. [27] designed query based triggers to learn query classification models to detect self-harm intents in modern search systems. Ali et al. [28] highlighted the retrieval effectiveness of search engines in terms of the precision and recall based on case studies. Shi et al. [29] proposed using a deep long-short-term-memory based feature mapping to learn feature representations in query classification. Fariha et al. [30] developed a system for semantic similarity-aware query intent discovery. Cai et al. [31] proposed a CNN-LSTM model to predict user intents from online health communities. Glater et al. [32] proposed a novel framework for learning semantic query annotations suitable to the target intents of individual queries, and demonstrated the effectiveness of their approach for various target intents in different lengths and difficulty levels.
In this study, we propose a two-stage neural network model to further enhance the performance of query intent classification. Our model represents queries with abundant features and automatically matches query intents to target categories based on query logs through implicit feedback.
General framework
In this section, we introduce the general framework of the proposed model. Our framework classifies query intents in two stages for accurately mining user intents. The goal of the first stage is to classify the original queries into intermediate categories. The intermediate categorization is automatically generated to augment latent useful information in user queries. We exploit query logs as external resources in this stage for more accurate classification. We then generate comprehensive representations of queries, and integrate the representations with the intermediate categories in the second stage. The goal of the second stage is to map the queries from intermediate categories to target categories. We illustrate the general framework of our method in Fig. 1.
General framework of the proposed method.
In the figure, we first transform the classification task in the first stage as a sequence classification problem. We take a given query as a sequence of query terms. Given the entire sequence of all the query terms, we compute the probabilistic distribution over intermediate categories for the first-stage classification. We then integrate the intermediate categories with query representations in the second stage to obtain the probabilistic distribution over target categories. In the next sections, we will introduce query term features, classification models and the labeling strategy on query intents, respectively.
We transform each original query into a feature representation as the inputs of our framework. Since queries are submitted by users with consecutive terms, we treat query terms as a sequence, and extract features at query term level instead of the entire query. We extract three types of features, including word representation-based features, term weighting-based features and syntax-based features.
Word representation-based features are generated based on the neural network language models. We use the skip-gram model by word2vec [33] in our experiments, and train the model using the retrieval dataset. These features capture the information of terms based on their distribution in large corpus, encoding abundant semantics of terms in the context. Based on the word representations, each query term is represented as a feature vector. Distances between two vectors can reflect the similarity of two terms.
Term weighting-based features consider the importance of different terms in query context. We adopt the term re-weighting method [8] to learn the term weights. We then normalize the term weights as term features.
Syntax-based features are based on the part-of-speech (POS) information and the syntactic dependency trees of queries. For the POS based features, we represent the POS of query terms as integers, and treat them as features. For the syntactic dependency tree based features, we use the depth of each query term node in the tree as one feature. The tree-based feature can model the dependency relationship of different query terms for discriminating the usefulness of terms. We use the NLTK tool to obtain the POS and dependency trees of queries in our implementation.
We list all the defined features on query terms in Table 1 for easily understanding our query representations. We represent each query as a feature vector based on the defined term features for further processing. We believe that information on query intents have been comprehensively encoded in the feature vectors for further classification.
Feature Definitions on Query Terms
Feature Definitions on Query Terms
In this section, we introduce our hybrid deep neural network model in detail. The basic structure of our model has been illustrated in Fig. 1. In our framework, we adopt recurrent neural network and recursive neural network as the classification models in the first stage, respectively. We model the sequence of query terms as probabilistic distributions over intermediate categories. We generate the intermediate categories based on implicit relevance feedback from query logs. In the second stage, we employ the multiple layer perceptron to obtain the query intents in the target categories based on Wikipedia.
Intermediate categorization oriented mapping
The first stage of our model is designed to map the original queries to intermediate categories. The inputs of this stage are representations of query terms, and the outputs are the intermediate categories of queries. We adopt neural network models to map the inputs to the outputs for accurate classification. Next, we introduce our query intent labeling strategy and neural network models, respectively.
Query intent labeling based on query logs
We generate intermediate categories of query intents based on implicit relevance feedback from query logs. Query logs have been widely used to optimize the search results without active feedback from users. Query logs contain abundant search information, such as user queries and corresponding clicks, which can reflect the query intents.
To capture query intents from query logs, we regard clicked URLs of given queries as implicit relevance feedback. Implicit relevance feedback assumes the clicked URLs are relevant to queries, and can be used to interpret the queries. Based on this idea, we use the clicked URLs to build a query-click graph. A toy example of the query-click graph is illustrated in Fig. 2. The clicked URLs are assumed to be high correlated with the query intents. Therefore, we can use the categories of URLs to indicate the categories of query intents in our study.

Construction of Query-click Graphs.
Like query intent categories, websites with a certain URL can be divided into different categories based on their contents, topics and other features. These categories for websites map a given URL to a category indicating characteristics of the URL. Therefore, we can use the categories of the clicked URLs to implicitly determine the category of query intents.
Primary and counts of secondary categorizations in ODP
The matching results in query logs
Formally, for all the queries in query logs, we obtain a set of clicked URLs together with the corresponding clicks for each URL. Based on the ODP-based categorization, we match all queries with intermediate categories, and count the number of times of each matched secondary category. Finally, we use maximum likelihood estimation to convert the vectors of matching categories into probabilistic distribution over query intents as follows.
To classify query representations into the intermediate categories, we adopt the recurrent neural network model and the recursive neural network model. These two networks have been proved effective in various text mining tasks. We introduce the structures of them below.
The recurrent neural network is designed to deal with sequence classification problems [2–7, 34]. We use the network to traverse all the query terms based on their sequence in the query, and makes predictions based on historical information from query terms along the sequence. Formally, the model is defined as follows.
We provide an example of the recurrent neural network in Fig. 3. In the figure, we remove the superscript ’IH’ for simplicity. We input the query ’where buy an apple’ in the example. Our model computes the outputs of hidden layer as Example of the recurrent neural network.
The recursive neural network can also be used to tackle the problem of sequence classification [34]. The main difference between recursive neural network and recurrent neural network lies in that recursive neural network traverses all the query terms based on any given sequence, in which all the query terms are taken as leaf nodes of a recursive tree. Therefore, recursive neural network is more flexible in dealing with the input terms, but may ignore sequential information of the original query compared to recurrent neural network. We learn the recursive neural network based on syntactic dependence binary tree of each query, which captures the syntactic relationships among query terms. Long queries with complex syntactic structures may particularly be addressed by the recursive neural network to capture the dependence relationship among terms. We define the structure of the used recursive neural network as follows.
We provide an example of the recursive neural network in Fig. 4. In the figure, we also remove the superscript ‘IH’ for simplicity. We input the query ‘where buy an apple’ in the example. Our model computes the outputs of ‘where buy’ and ’an apple’ as Example of the Recursive neural network.
We use a Softmax layer as the outputs of both models, which outputs probabilistic distributions on intermediate categories. The probability of a query Q to i
th
class can be computed as follows.
In the second stage of our model, we map the intermediate categories of query intents to the target categories. The target categories are determined based on Wikipedia [35], which includes 14 primary categories. To make the our model more robust, we take both the outputs of the first stage and the query representations as inputs of the second stage. We employ a multi-layer perceptron as the classifier in this stage, which contains one hidden layer and one Softmax layer. In our preliminary experiments, we have examined the effectiveness of the proposed network with more hidden layers. However, it brings more time cost on model training, but yields no significant improvement on the classification performance. Therefore, we use only one hidden layer in the perceptron. The inputs of the perceptron are the concatenation of the original query representation and the distribution on intermediate categories. The dimensionality of the inputs equals to the sum of the dimensionality of the query representation and the number of intermediate categories. We illustrate the mapping from intermediate categories to the target categories in Fig. 5. The hidden layer and the Softmax layer can be formalized as follows.
Softmax model for classification.
Overall, our framework adopts the recurrent neural network and the recursive neural network, respectively, for query intent mining. We name these two models as HDNN-Recurrent and HDNN-Recursive, respectively.
Experimental settings
In this section, we conduct sufficient experiments to evaluate our method. We first introduce the dataset used in two stages, respectively. The first stage uses AOL query logs and ODP to generate intermediate categories on query intents. We use 4 million queries as the training set, 0.1 million queries as the testing set, and remaining queries as the validation set. The dimensionality of outputs for this stage is set to be 219, which equals to the number of secondary categories of ODP.
For the second stage, we use the target categories based on Wikipedia [35], which includes 14 primary categories. We use 500 queries from ad-hoc track of Text REtrieval Conference (TREC), and label the queries manually by three experts. The final category of each query is determined by voting. The dimensionality of target categories is 14. We perform five-fold cross validations to achieve the average performance of all the queries. These amount of data used in these two stages are sufficient to evaluate the proposed method in this work. Since the annotations on query intents cost much manual labor, we will explore effective ways for automatic intent annotation in future, which would facilitate the model comparisons on larger datasets.
We use the skip-gram model [36] to obtain word representations based on Wikipedia datasets in 100 dimension. We use Keras framework with Theano as backend to implement the proposed deep learning models. At the model training, we adopt cross entropy as the cost function, and AdaDelta as the optimizer with early stop for dynamic adjustment of the learning rate. We set the initial learning rate as 1.0. The regularization parameter λ is empirically set to be 0.001. The size of batch is 10000, and the number of epoches is 500. Based on the difference of our two models in the first stage, we denote our two models as HDNN-Recurrent and HDNN-Recursive, respectively.
Baseline models and evaluation metrics
We choose four state-of-the-art models as baselines in our experiments, and adopt Precision, Recall, F1 and Accuracy as evaluation metrics to evaluate the performance of classification. We introduce the baseline models as follows.
(1) BridgeQC is a classic two-stage classification method for query intent classification [9]. In its first stage, we use ODP to map query to an intermediate category, and in the second stage, it adopts exact matching and SVM classifier to obtain the target query intents. A variation of the method has won the query intent classification task of KDDCUP [37].
(2) Vec-SVM takes the average of all the representations of query terms as the query representation, and use support vector machines (SVM) to train a classifier mapping queries to target categories directly. We take this method as a baseline to examine the effectiveness of two stage classification for query intent classification. (3) WVec-SVM is the weighted version of Vec-SVM, which takes weighted average of all the representations of query terms as the query representation. In our experiments, term weights are obtained based on the methods proposed in [8].
(4) HDNN-SVM is a modified version of our method using SVM as the classifier in the second stage. We use this method to examine the effectiveness of neural network based model in our method.
The baseline methods are implemented under the same framework as the proposed method for fair comparison. The support vector machines are trained with a linear kernel using l2 penalty and a square hinge loss with C = 1, because relatively good performance can be achieved under this settings on the validation set. Since the proposed framework is easy to implement, other state-of-the-art models can also be compared with the proposed models in similar experimental settings. We will also explore other optimization strategies in our framework to further improve the classification performance.
Results and analysis
We report the experimental results in Table 4. As mentioned above, we categorize query intents based on Wikipedia. We perform five-fold cross validations for the Vec-SVM and WVec-SVM methods, and perform two-stage classifications for other methods using ODP-based intermediate categories with five-fold cross validations in the second stage.
Classification Performance Comparisons
Classification Performance Comparisons
From the table, we observe that WVec-SVM outperforms Vec-SVM as one-stage classification methods, which indicates the usefulness of term weighting in query intent classification. Two-stage classification methods achieve better performance than one-stage methods. This finding implies that query representations purely based on summations of representations of query terms gain limited capability to interpret the query intents. The second stage contributes much to overcoming the lack in training instances, thus improving classification performance. We estimate the computational cost based on the number of floating point operations (FLOPs) used to train a model by multiplying the training time, the number of GPUs used, and an estimate of the sustained single-precision floating-point capacity of each GPU, which follows the way used in [38]. The comparison shows that the computational cost of our model is comparable to other baseline models.
Among all the two-stage methods, HDNN-Recurrent achieves the best performance, and gains 13.4% and 9.0% improvements over BridgeQC. HDNN-Recurrent achieves better performance than HDNN-Recursive. This indicates that recurrent neural network in the first stage is more useful to represent the queries. Sequential modeling on query terms yields more improvement in query representations than using the syntactic dependence trees.
To further analyze the effectiveness of our method, we plot the ROC curves of the compared models in Fig. 6, and report the classification performance in the first stage classification in Table 5. Figure 6 plots true positives versus false positives retained by setting different cutoffs. The curves show that the proposed method can yield better performance compared with other baseline models. HDNN-Recurrent outperforms other models under different false positive rates. The trends of performance of these models are consistent with that reported in Table 4. The results indicate that the proposed models can improve the classification accuracy of query intents, and the recurrent network based model is more effective.
ROC curves for the compared models.
Classification Performance for the First Stage
Table 5 shows that the proposed methods outperform BridgeQC in terms of all the evaluation metrics. This eventually affects the overall classification performance. The results also imply that intent labeling based implicit relevance feedback is effective in building intermediate categories on query intents.
In the paper, we propose a hybrid deep neural network model for query intent classifications. The model contains two stages: the first stage adopts ODP to generate intermediate category, and employs recurrent and recursive neural networks for query representations; the second stage adopts Wikipedia-based category as target category, and employs a multilayer perceptron to train the classifiers. Experimental results show that our method outperforms state-of-the-art baseline models and improves the performance of query intent classification. Experimental results demonstrate that ODP-based query intent labels based on query logs improve the performance of classification, and hybrid deep neural networks are effective in representing queries. We will carry out our future work to further optimize our method using relevance feedback to improve the classification effectiveness. We will also explore effective ways for automatic query intent annotation, which would facilitate further model comparisons on larger datasets and model optimization on other domain-specific tasks.
Copyright
Authors submitting a manuscript do so in the understanding that they have read and agreed to the terms of the IOS Press Author Copyright Agreement posted in the ‘Authors Corner’ on www.iospress.nl.
Footnotes
Acknowledgments
This work is partially supported by grant from the Ministry of Education Humanities and Social Science Project (No.19YJCZH199), the Natural Science Foundation of China (No. 61632011, 61572102, 61602078, 61572098), China Postdoctoral Science Foundation (No. 2018M641691), the Fundamental Research Funds for the Central Universities (No. DUT18ZD102), the National Key Research Development Program of China (No. 2016YFB1001103).
