Abstract
Keyword query on XML data has attracted many researchers’ attention. The existing keyword query methods on XML data are mainly based on the LCA (lowest common ancestor) semantics and its variants (SLCA, ELCA, et al.). These semantics are mainly focused on finding the results of “AND” semantics among keywords which makes the query results incomplete. The structure query language can return more meaningful and comprehensive answers, but it is difficult for a user without the knowledge of the structure and schema of an XML document to propose a structure query statement. In the reality, there exists plenty of uncertainty and ambiguity, and how to search the useful information on fuzzy XML data becomes an important research issue. In this paper, we introduce the structure query language into the keyword query in fuzzy XML data to get more comprehensive query results. First, we propose the concepts of object tree, the minimum object tree and the nearest object tree and propose a semantics of matching object trees for keyword query to capture the user’s query intention. Then, we give our query method AO-Twig to combine the structure query language with keyword query to obtain the Top-K query results with the highest scores. Finally, experimental results on both real datasets and synthetic datasets show that the proposed method AO-Twig performs well for finding Top-K results of keyword queries over fuzzy XML data.
Introduction
In the real world application, there exists a plenty of uncertain and vague information, and how to manage the uncertainty and vague information becomes a research hotspot. XML (Extensible Markup Language) is rapidly emerging and has been the de facto standard for representing and exchanging data on the Web, and many researchers devote their efforts on the management of uncertainty XML data. And the fuzzy XML data management has attracted many researchers’ attention. Keyword query is a user-friendly query method. Users can directly obtain the corresponding query results only by proposing one keyword or several keywords, without understanding or mastering the complex structure query languages (such as XQuery, Xpath) and the document’s schema. And how to search the useful and meaningful information on fuzzy XML data by the way of keyword query is an issue worthy of study.
Recently, the existing keyword query methods on Crisp XML are mainly based on the LCA semantics or its variants (SLCA [25], ELCA [26], VLCA [10]). Among these semantics, An LCA of a set of keywords is a lowest node in the tree that is the common ancestor of nodes with these keywords. An SLCA of a set keywords is a lowest node whose subtree is the “smallest tree” containing all keywords. An ELCA is a lowest common ancestor of nodes containing keywords with no child node be LCA node or ancestor node of LCA on the path from ELCA node to the nodes containing keywords. An VLCA is a lowest common ancestor of nodes containing keywords if its subtree contains at least one match to every keyword, and it does not have a descendant whose subtree contains at least one match to every keyword. These semantics are focus on finding the results of “AND” semantics among keywords.
There are some papers about the study of taking the “structure information” of XML into account in the keyword query to return the fragments of XML document as query results [13, 20, 21]. In [13], Li et al. first construct structure queries for the keyword queries based on the source schemas and evaluate the generated structured queries in a sequence, and proposed an XML keyword search approach, “XBridge”. Cohen et al. [20] designed and implemented a search engine “XSEarch” to return the semantically related document fragments for keyword queries, and they develop a syntax which is suitable for a naive user and facilitates a fine-granularity search.
Many researchers devote their efforts on the study of the uncertainty XML data, and mainly on the aspects of models, representations and query methods of uncertainty XML data. The main research works are on the studying of probabilistic XML data, incomplete XML data and fuzzy XML data. On the research achievements of models and representations of uncertainty XML. Nierman and Jagadish [2] proposed a probabilistic XML model, ProTDB, with the probabilistic types IND and MUX and this model allows convenient intermixing of probabilistic and non-probabilistic data. Hung et al. [8] introduced a probabilistic cyclic that could support arbitrary distributions over the relationship between an object with its children and arbitrary distributions over the object’s value. They develop an extension of the relational algebra and propose efficient algorithms for answering queries by using this algebra. In [19], a model and representation system for the incomplete information is proposed. In [27], Ma and Yan proposed the fuzzy XML model for the representation of fuzzy information in XML and identified multiple granularity of data fuzziness. Gaurav and Alhajj [1] defined a mechanism to represent fuzzy data along with crisp data in XML by introducing fuzziness using both possibility theory and similarity relation, and explored the fuzzy relational databases and algorithms to map the classical relational database into classical XML document.
Based on these uncertainty models, many algorithms have been proposed to find the information from the uncertainty XML [3, 4, 5, 11, 12, 14, 18, 23, 24, 28]. On the study of keyword query, there are some published works about keyword querying on the probabilistic XML data. Zhou et al. [18] studied the ELCA semantics on probabilistic XML and proposed an algorithm PrELCA to compute ELCA probabilities without generating possible worlds. In [5], Zhang et al. proposed a probabilistic XML model PrXML
As we know, the structure query language can get more accurate and meaningful query answers (including results of AND semantics and results of OR semantics), but it needs users to master the structure query language and knows the documents’ schemas. For non-professor users, it is hard to build structure query statement for a keyword query. So, in this paper, we focus on the problem of keyword querying over fuzzy XML data by introducing the structure query language (XPath) into the keyword query and the users can get the more relevant results when proposing the keyword queries. Firstly, we introduce the concepts of “Object tree”, “minimum object tree” and “nearest object tree”, then we propose a semantics of matching object tree for the keyword query, we find the matching object trees
We summarize the contributions of this paper as follows:
We introduce the concepts of “object tree”, “the minimum object tree” and “the nearest object tree”, and define a semantics of matching object tree for keyword querying on fuzzy XML data. We give our method AO-Twig for keyword querying on fuzzy XML data by combining the structure query language with keyword query. We propose a score mechanism by taking the tf*idf document relevance and the keyword- possibilities of the query results into consideration. We conduct experiments to evaluate and compare the performances of our method AO-Twig with other keyword query approaches.
The rest of paper is organized as follows: we first introduce the model and representation of the fuzzy data in XML in Section 2. In Section 3, we introduce the concepts of “object tree”, the “minimum object tree” and the “nearest object tree”, propose the semantics of matching object trees for keyword query and then propose our AO-Twig method for keyword queries on fuzzy XML data. In Section 4, we propose a scoring mechanism to score the answers based on the tf*idf and the keyword-possibilities of the result subtrees, give the indexes structures and rank them with the threshold algorithm. The experimental results are reported in Section 5. In Section 6, we provide a conclusion of the paper.
A XML document can be represented as a directed and ordered tree
A fuzzy XML tree structure example.
There are two kinds of fuzziness in fuzzy XML: one is the fuzziness in elements and we use membership degrees associated with such elements, the other is the fuzziness in the values of attributes and we use possibility distribution to represent such values. There are two kinds of possibility distribution, which are disjunctive possibility distribution or conjunctive possibility distribution [27]. And a fuzzy XML document can also be represented as an ordered and directed tree, there are two types of nodes in the fuzzy XML tree, the crisp nodes
Figure 2 is a fragment of fuzzy XML document. Considering line 2, <Val Poss
A fragment of fuzzy XML document.
In order to facilitate our study and help readers better understand. We simplify the fuzzy XML tree into a simplified structure [32], which can be shown in Fig. 3. In this simplified structure, we use nodes of elliptical shape to denote the ordinary nodes and the nodes of rectangular shape to denote the fuzzy nodes. And the nodes with name FV
In this section, we give our structure-based approach of the keyword query by combining keyword query processing and structure queries on fuzzy XML data. Firstly, we introduce the concept of “object tree”, the “minimum object tree”, and the “nearest object tree”. Then we give the semantics of matching object trees for the keyword queries, and propose our method AO-Twig to combine the structure query language with keyword query to obtain the query results.
A simplified structure of a fuzzy XML tree.
In reality, objects are applied to model real-world entities or abstract concepts. The objects have properties that may be attributes of the object itself or relationships known as associations between the object and one or more other objects [29]. The object has two characteristics, which are (1) an object has attributes, and the values of attributes, and (2) an object has a correlation with other objects. There exists two kinds of objects, crisp object and fuzzy object. The object is fuzzy because of a lack of information, and an object is a fuzzy object when there exists at least one attribute whose value is a fuzzy set. A fuzzy XML document is represented as a tree structure
Here, for an XML tree
The minimum object trees.
Figure 4 shows the examples of minimum object trees in the fuzzy XML tree. In the tree structures of Fig. 4,
If the node If the node
We give some interpretations for case (2), for a set of keywords
When keyword When keyword When keyword When keyword
We give some explanation about the smallest object tree. For two object trees
For a fuzzy XML document, we apply the method in [30] to pre-process the fuzzy XML documents with the object identification. And before the users input their keywords, all the fuzzy XML documents have been pre-processed with the object identification operations.
For a fuzzy XML document, after the operation of object identification on it, we classify the nodes in fuzzy XML into six categories:
The object (or object tree) node The element node The attribute node The connect node The value node The fuzzy node
[h] AO-Twig[1] A keyword query the Top-K query results with highest scores
In this section, we propose our method for combining the keyword query processing and structure queries on fuzzy XML data. Given a fuzzy XML tree
Given a fuzzy XML tree, there are three kinds of fuzzy structures:
Fuzzy nodes only appear in a single path (that is, a root to leaf path). Fuzzy node appear among the branches. Fuzzy nodes appear in the complex structure.
As shown in Fig. 5, the tree structure of A is the example of type-1, B is the example of type-2, and C is the example of type-3. Combine the above three types, we give the definition of the whole possibility of a fuzzy XML subtree as follows.
Whole possibilities in different fuzzy structures.
As the structure query language can convey complex and precise semantic meanings sufficiently to obtain a more meaningful query results, and combine the keyword query with the structure query language becomes a worth-study problem. But for the common users, it is difficult to propose the structure query statement because of lacking the professional knowledge about the structure query language and document’s schema, and we propose a semantics of matching object tree for keyword query to capture the query intention of the users. When inputting the keyword query, we find matching object trees
We give some details about our method AO-Twig for the keyword query inputted. When there is only one keyword
[h] Function fuzzy tree distance (
For the Algorithm 2, we give some interpretations. And we start to introduce it by giving the following definitions:
if <
if < <
Here, path(
From the above definition, if <
[h] Function assignment[1] The distance metrics A[
<1> Assign
<2>
<3>
<4> Draw straight lines at the rows and columns which are not be assigned
The
Then, we give the more details about the Algorithm 2, we analyze the two comparable object trees by considering the children nodes in the object trees, and recursively compute a fuzzy tree distance for each couple of subtrees rooted at children node with the same label. After all distances have been calculated. The algorithm assign each node to another node with the same label to minimize the overall cost. The assignment problem can be solved using a variation of the Hungarian Algorithm [9] (seen in Algorithm 3). And this task is performed by a call to function Assignment. Given a matrix of distances, the function Assignment returns a set of assignments containing couples of indices of assigned nodes. A set of all children of node
In this section, we give the concrete implementation of our method on the aspects of scoring mechanism, index construction and rank mechanism. Next, we start from introducing the scoring mechanism.
The scoring mechanism
In order to obtain the results which are the most relevant to the keyword queries, we propose a scoring mechanism to distinguish and rank the query results. And for the query results obtained by LTwig, we propose a scoring mechanism by considering the tf*idf statistical method and the keyword-possibilities of the answers.
We model each result subtree obtained from LTwig as a document that includes keywords in the subtree. And for each keyword, we use the tf*idf to score the relevance of the subtree to a keyword. Given a fuzzy XML document
We give some explanation about the above formula, s is a constant (usually set to 0.2), and the relevance degree of the result subtree Tr to the set of keywords
As the fuzzy XML contains fuzzy information including the membership degrees of the elements and the possibility distributions of the values of the attributes. For a result subtree, it may contains fuzzy information in the tree. Considering the impacts of keyword nodes which are contained in the result subtrees, we give the definition of the keyword-possibility of the result subtree.
Taking the keyword-possibility of the result subtrees into consideration, we use the following formula to score the result trees:
For a fuzzy XML document
The index We give some explanations about the The object tree node index After pre-processing the fuzzy XML documents with the object identification operation. The codes (begin( The minimum object tree index The The index For a result subtree The index The score index Given a set of keywords
There are many threshold-based techniques for ranking the Top-K answers, such as Fagin’s Algorithm [15] and Threshold Algorithm [16, 17]. In this paper, we adopt the TA algorithm to obtain the Top-K answers with the highest scores. The algorithm first retrieve each keyword list
[h] TA algorithm[1] Do sorted access in parallel to each keyword list
Experiments
In this section, we first present the experimental results on the performance of our proposed method in comparison with the existing traditional keyword query methods. We analyze the query results on the metrics of precision, recall, ROC curve and
Experimental setup and dataset
All the experiments are performed on a laptop with 2.13 GHz Intel core i3 with 2 GB memory on Windows 7 system. And we use a real dataset DBLP [7] and a synthetic dataset XMark [22] for testing our method.
The keyword query examples on different datasets
The keyword query examples on different datasets
The Top-K precision on different datasets.
The Top-K recall on different datasets.
We generate five datasets
We evaluate our method AO-Twig with the traditional XML keyword methods XRank and SLCA on three standard metrics precision, recall and
First we evaluate the precision of our method with the traditional keyword methods XRank and SLCA. We use the tf* idf document relevance score method to score the query results obtained by SLCA, and rank the results. We run keyword queries
We run keyword queries
We run keyword queries
The ROC curve.
The
The
The Top-K precision and Top-K recall when varying the values of threshold U.
Next, we evaluate the performances of AO-Twig with ROstack [33] on the metrics of precision and recall when setting different values of threshold U. As the ROstack algorithm does not have a ranking mechanism, we use the TA algorithm (Algorithm 4 in Section 4.3) to rank the results of keyword queries obtained with ROstack and get the Top-K results with their possibilities (scores). We random choose 10 groups of keyword queries which consisted of 1
In this paper, we study the problem of keyword querying on fuzzy XML data. We combine the structure query language into the keyword query over fuzzy XML to get more meaningful and comprehensive results. We introduce the concept of “object tree”, “the minimum object tree”, “the nearest object tree”, and propose a semantics of matching object tree for keyword query. We propose the method AO-Twig, and when we get the matching object trees of the keyword query, we find the approximate matching object trees of matching object trees by setting a threshold U, and input the structure query statements which are built with the approximate matching object trees into LTwig to obtain more query results. The users can make the value of threshold U higher to obtain more approximate matching object trees and more query statements can be putted into LTwig to get more meaningful query results. We propose a score mechanism based on the tf*idf and the keyword-possibilities of the result subtrees, and rank the results with the threshold algorithm for the Top-K results with highest scores. Experimental results show that our method achieves high search result quality on both synthetic and real datasets for keyword queries over fuzzy XML data, and outperforms the traditional XML keyword query approaches significantly.
As the large-scale data appears in all trades and professions commonly, considering the characteristics of large-scale data, how to find the useful and accurate information existed in the large-scale data with a fast way becomes much important. So in the future, we will devote our efforts on finding the effective method for getting the results of keyword queries over large-scale fuzzy XML data speedily.
Footnotes
Acknowledgment
The work is supported by the National Natural Science Foundation of China (grant no. 61772269).
