A structure-based approach of keyword querying for fuzzy XML data

Abstract

Keyword query on XML data has attracted considerable research attention. Existing keyword query methods on XML data are mainly based on the LCA (lowest common ancestor) semantics and its variants (SLCA, ELCA, etc.). These semantics focus on finding results under “AND” semantics among keywords, which makes the query results incomplete. A structure query language can return more meaningful and comprehensive answers, but it is difficult for a user who does not know the structure and schema of an XML document to formulate a structure query statement. Real-world data contain a great deal of uncertainty and ambiguity, and how to search for useful information in fuzzy XML data has become an important research issue. In this paper, we introduce the structure query language into keyword querying over fuzzy XML data to obtain more comprehensive query results. First, we propose the concepts of the object tree, the minimum object tree and the nearest object tree, and propose a semantics of matching object trees for keyword queries to capture the user’s query intention. Then, we give our query method AO-Twig, which combines the structure query language with keyword query to obtain the Top-K query results with the highest scores. Finally, experimental results on both real and synthetic datasets show that the proposed method AO-Twig performs well in finding the Top-K results of keyword queries over fuzzy XML data.

Keywords

Fuzzy XML, keyword query, object-oriented, structure query, Top-K, approximate

1. Introduction

Real-world applications contain a great deal of uncertain and vague information, and how to manage such information has become a research hotspot. XML (Extensible Markup Language) has rapidly emerged as the de facto standard for representing and exchanging data on the Web, and many researchers have devoted their efforts to the management of uncertain XML data; fuzzy XML data management in particular has attracted much attention. Keyword query is a user-friendly query method: users can obtain query results directly by supplying one or several keywords, without understanding or mastering complex structure query languages (such as XQuery and XPath) or the document’s schema. How to search for useful and meaningful information in fuzzy XML data by means of keyword queries is therefore an issue worthy of study.

Existing keyword query methods on crisp XML are mainly based on the LCA semantics or its variants (SLCA [25], ELCA [26], VLCA [10]). Among these semantics, an LCA of a set of keywords is a lowest node in the tree that is a common ancestor of nodes containing these keywords. An SLCA of a set of keywords is a lowest node whose subtree is the “smallest tree” containing all keywords. An ELCA is a lowest common ancestor of nodes containing the keywords such that, on the paths from the ELCA node to those keyword nodes, no child node is itself an LCA node or an ancestor of an LCA node. A VLCA is a lowest common ancestor of nodes containing the keywords whose subtree contains at least one match to every keyword and which has no descendant whose subtree contains at least one match to every keyword. All of these semantics focus on finding results under “AND” semantics among keywords.
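To make the SLCA semantics concrete, the following is a minimal sketch, not the algorithm of [25]. It assumes that every node is identified by a Dewey label (a tuple of child positions from the root) and that the matches of each keyword are given as a list of such labels; the function names `lca`, `is_ancestor` and `slca` are illustrative only. The sketch enumerates all keyword combinations, so it is meant for small examples rather than efficient evaluation.

```python
from itertools import product

def lca(*deweys):
    """LCA of nodes given as Dewey labels: their longest common prefix."""
    prefix = []
    for parts in zip(*deweys):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return tuple(prefix)

def is_ancestor(a, b):
    """True if the node labelled a is a proper ancestor of the node labelled b."""
    return len(a) < len(b) and b[:len(a)] == a

def slca(keyword_lists):
    """Brute-force SLCA: LCAs of all keyword combinations that have
    no other such LCA as a proper descendant."""
    candidates = {lca(*combo) for combo in product(*keyword_lists)}
    return sorted(c for c in candidates
                  if not any(is_ancestor(c, d) for d in candidates))

# Keyword 1 matches nodes 0.0.0 and 0.1.0; keyword 2 matches nodes 0.0.1 and 0.2.
matches = [[(0, 0, 0), (0, 1, 0)], [(0, 0, 1), (0, 2)]]
result = slca(matches)  # node 0.0 covers both keywords; the root 0 is discarded
```

Here the root also contains both keywords, but it is not returned because its descendant 0.0 already does.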

Some studies take the “structure information” of XML into account in keyword querying and return fragments of the XML document as query results [13, 20, 21]. In [13], Li et al. first construct structured queries for the keyword queries based on the source schemas, evaluate the generated structured queries in sequence, and propose an XML keyword search approach, “XBridge”. Cohen et al. [20] designed and implemented a search engine, “XSEarch”, that returns semantically related document fragments for keyword queries, and they developed a syntax that is suitable for a naive user and facilitates fine-granularity search.

Many researchers have devoted their efforts to the study of uncertain XML data, mainly with respect to models, representations and query methods. The main research works concern probabilistic XML data, incomplete XML data and fuzzy XML data. Regarding models and representations of uncertain XML, Nierman and Jagadish [2] proposed a probabilistic XML model, ProTDB, with the probabilistic types IND and MUX; this model allows convenient intermixing of probabilistic and non-probabilistic data. Hung et al. [8] introduced a probabilistic cyclic model that supports arbitrary distributions over the relationship between an object and its children as well as arbitrary distributions over the object’s value. They developed an extension of the relational algebra and proposed efficient algorithms for answering queries using this algebra. In [19], a model and representation system for incomplete information is proposed. In [27], Ma and Yan proposed a fuzzy XML model for the representation of fuzzy information in XML and identified multiple granularities of data fuzziness. Gaurav and Alhajj [1] defined a mechanism to represent fuzzy data along with crisp data in XML by introducing fuzziness using both possibility theory and similarity relations, and explored fuzzy relational databases and algorithms to map a classical relational database into a classical XML document.

Based on these uncertainty models, many algorithms have been proposed to find information in uncertain XML [3, 4, 5, 11, 12, 14, 18, 23, 24, 28]. Regarding keyword query, several works address keyword querying on probabilistic XML data. Zhou et al. [18] studied the ELCA semantics on probabilistic XML and proposed an algorithm, PrELCA, to compute ELCA probabilities without generating possible worlds. In [5], Zhang et al. proposed a probabilistic XML model PrXML ${}^{\{\text{exp},\text{ind},\text{mux}\}}$ to represent the relationships among XML sibling nodes, and proposed an algorithm that uses kdptabs (keyword distribution probability tables of subtrees) to find SLCA nodes and compute their probabilities. Li et al. [14] investigated probabilistic threshold keyword queries over XML, and proposed two efficient algorithms: a Baseline algorithm and a PI index-based algorithm. There are also several works on keyword querying over fuzzy XML data [31, 32, 33]. These efforts mainly focus on query methods based on the SLCA semantics and on object-oriented keyword query methods. In [32], methods of keyword querying for results under AND semantics are developed: the fuzzy SLCA semantics is defined, and two efficient algorithms, Fstack and Mindex scan, are proposed for finding the $k$ SLCA results with the highest possibilities. In [33], an object-oriented keyword query method is developed: the semantics of object-oriented keyword queries is defined, and the algorithm ROstack is proposed to obtain the root nodes of the matching result object trees. ROstack can return results under both AND semantics and OR semantics, namely the “minimum result object trees” containing all keywords and the “result object trees” containing some of the keywords.
For a keyword query over a fuzzy XML document, the previously proposed object-oriented query method [33] can return query results at the object level (e.g. matching result object trees) with high precision and high recall, but it does not take query relaxation into consideration. For the same keyword query and the same fuzzy XML document, the number of matching result object trees obtained under the object-oriented keyword query semantics [33] is fixed, and users can only find information, together with its possibilities, within these object trees. Given a structure query statement, structure query algorithms can return much more accurate and meaningful answers, which can include results of AND-logic (AND semantics), results of OR-logic (OR semantics) and so on. Query relaxation is common in structure queries; it obtains more query results by relaxing the values in the query conditions or by modifying the query statement. Considering the characteristics of fuzzy XML data, we aim to find more possible results, together with their possibilities, for keyword queries over fuzzy XML and to return the more relevant results with high scores to users. In this paper, we therefore introduce the object-oriented concept, the idea of query relaxation and the structure query into the processing of keyword queries over fuzzy XML data, and study a method that combines keyword query and structure query to obtain more relevant and more possible results for users.

As noted above, a structure query language can produce more accurate and meaningful query answers (including results under AND semantics and OR semantics), but it requires users to master the language and to know the documents’ schemas. For non-expert users, it is hard to build a structure query statement for a keyword query. In this paper, we therefore focus on keyword querying over fuzzy XML data by introducing the structure query language (XPath) into the keyword query, so that users can obtain more relevant results when they issue keyword queries. First, we introduce the concepts of “object tree”, “minimum object tree” and “nearest object tree”. Then we propose a semantics of matching object trees for the keyword query: we find the matching object trees $O_{M}$ of the keyword query, and set a threshold U to find the approximate matching object trees $O_{AM}$ in the fuzzy XML tree that satisfy fuzzy tree distance ( $O_{M}$ , $O_{AM}$ ) $<$ U. Finally, we construct structure query statements according to the approximate matching object trees and pass these statements to the structure query algorithm to obtain the keyword query results.

We summarize the contributions of this paper as follows:

• We introduce the concepts of “object tree”, “the minimum object tree” and “the nearest object tree”, and define a semantics of matching object trees for keyword querying on fuzzy XML data. We give our method AO-Twig for keyword querying on fuzzy XML data, which combines the structure query language with keyword query.
• We propose a scoring mechanism that takes both the tf*idf document relevance and the keyword-possibilities of the query results into consideration.
• We conduct experiments to evaluate the performance of our method AO-Twig and compare it with other keyword query approaches.

The rest of the paper is organized as follows. In Section 2, we introduce the model and representation of fuzzy data in XML. In Section 3, we introduce the concepts of “object tree”, “minimum object tree” and “nearest object tree”, propose the semantics of matching object trees for keyword queries, and then propose our AO-Twig method for keyword queries on fuzzy XML data. In Section 4, we propose a scoring mechanism that scores the answers based on the tf*idf relevance and the keyword-possibilities of the result subtrees, give the index structures, and rank the answers with the threshold algorithm. The experimental results are reported in Section 5. Section 6 concludes the paper.

2. Fuzzy XML data model

An XML document can be represented as a directed and ordered tree $G=(V,E)$ , where $V$ is the set of nodes and $E$ is the set of edges. Given two nodes $v_{i}$ and $v_{j}$ , we use $E(v_{i},v_{j})$ to represent a directed edge from node $v_{i}$ to node $v_{j}$ . Each $v\in V$ has a unique label, denoted as label ( $v$ ). For a given tree $t$ , $V(t)$ denotes the set of all the nodes in $t$ , $E(t)$ denotes the set of all the edges in $t$ , and $E(t)\subseteq V(t)\times V(t)$ . For two nodes $v_{1}$ , $v_{2}\in V(t)$ , if $E(v_{1},v_{2})$ is the edge between $v_{1}$ and $v_{2}$ and $E(v_{1},v_{2})\in E(t)$ , then $v_{1}$ is the parent of $v_{2}$ . If there is a path from node $v_{1}$ to node $v_{2}$ and $v_{1}$ is not the parent of $v_{2}$ , then $v_{1}$ is an ancestor of $v_{2}$ . If two trees $t_{s}$ and $t$ satisfy $V(t_{s})\subseteq V(t)$ and $E(t_{s})\subseteq E(t)$ , then $t_{s}$ is a subtree of $t$ .

Figure 1.

A fuzzy XML tree structure example.

There are two kinds of fuzziness in fuzzy XML: one is the fuzziness in elements, for which we use membership degrees associated with such elements; the other is the fuzziness in the values of attributes, for which we use possibility distributions to represent such values. A possibility distribution is either disjunctive or conjunctive [27]. A fuzzy XML document can also be represented as an ordered and directed tree with two types of nodes, the crisp nodes $N_{C}$ and the fuzzy nodes $N_{F}$ . The crisp nodes are the ordinary nodes (with a tag and a value), and the fuzzy nodes (“Dist” or “Val” nodes) describe the fuzzy information over subsets of their children. In the fuzzy XML tree structure, we introduce a possibility attribute, denoted “Poss”, which takes a value between 0 and 1 and is applied together with the fuzzy construct “Val” to specify the possibility of a given element. The node “Val” associates a possibility with an element node, that is, the possibility that the element “exists” in a collection. The “Dist” node denotes the type of a possibility distribution, which is disjunctive or conjunctive. The edge set $E(t)$ contains four types of edges, $E_{CC}$ , $E_{CF}$ , $E_{FC}$ and $E_{FF}$ , where $V_{C}(t)$ and $V_{F}(t)$ denote the crisp and fuzzy nodes of $t$ : $E_{CC}$ is the set of edges from nodes in $V_{C}(t)$ to nodes in $V_{C}(t)$ , $E_{CF}$ is the set of edges from nodes in $V_{C}(t)$ to nodes in $V_{F}(t)$ , $E_{FC}$ is the set of edges from nodes in $V_{F}(t)$ to nodes in $V_{C}(t)$ , and $E_{FF}$ is the set of edges from nodes in $V_{F}(t)$ to nodes in $V_{F}(t)$ . These edge types differ only in the types of the nodes at their two ends. A simple example of a fuzzy XML tree structure is shown in Fig. 1.

Figure 2 is a fragment of a fuzzy XML document. In line 2, <Val Poss $=$ “0.9”> denotes that the possibility of the department’s name being “Computer Science and Technology” is 0.9. For a crisp element, we omit its membership expression, <Val Poss $=$ “1.0”> and </Val>. In order to express the possibility distributions of attribute values, a fuzzy construct “Dist” is introduced into the model. A “Dist” element has multiple “Val” elements as children, and each “Val” element is associated with a possibility for the value of the attribute. The Dist element indicates the type of the possibility distribution of the values, which is either disjunctive or conjunctive. Lines 5–18 in Fig. 2 describe a Dist construct that expresses two possible pieces of information about William Smith: the possibility that William Smith is an associate professor with a salary of 6000 is 0.8, and the possibility that William Smith is a professor with a salary of 8000 is 0.6. Although the possibility distribution in lines 5–18 is over leaf nodes in the ancestor-descendant chain, possibility distributions over non-leaf nodes are also allowed.
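As an illustration of the “Val” and “Dist” constructs, the sketch below parses a small hand-written fragment with Python’s standard `xml.etree.ElementTree` module. The fragment is not Figure 2: it is a hypothetical reconstruction from the values mentioned in the text, and the tag names `department`, `dname`, `employee`, `ename`, `position`, `salary` as well as the `type` attribute on `Dist` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment (assumed tag names; values taken from the text above).
FRAGMENT = """
<department>
  <Val Poss="0.9"><dname>Computer Science and Technology</dname></Val>
  <employee>
    <ename>William Smith</ename>
    <Dist type="disjunctive">
      <Val Poss="0.8"><position>associate professor</position><salary>6000</salary></Val>
      <Val Poss="0.6"><position>professor</position><salary>8000</salary></Val>
    </Dist>
  </employee>
</department>
"""

def possibilities(xml_text):
    """Return (possibility, {tag: text}) for every Val element, in document order."""
    root = ET.fromstring(xml_text)
    out = []
    for val in root.iter("Val"):
        # A missing Poss attribute is treated as a crisp element (possibility 1.0).
        poss = float(val.get("Poss", "1.0"))
        out.append((poss, {child.tag: child.text for child in val}))
    return out

vals = possibilities(FRAGMENT)
```

Each entry of `vals` pairs a possibility with the element content it qualifies, which is the information that the simplified tree of Fig. 3 attaches to edges.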

Figure 2.

A fragment of fuzzy XML document.

To facilitate our study and help readers understand it, we simplify the fuzzy XML tree into the simplified structure of [32], which is shown in Fig. 3. In this simplified structure, elliptical nodes denote the ordinary nodes and rectangular nodes denote the fuzzy nodes; the nodes named FV ${}_{i}$ are the fuzzy nodes. Two connected fuzzy nodes Dist-Val in a Dist construct are combined into one fuzzy node in the simplified structure. The real number from (0, 1] attached to the edge between two nodes indicates the possibility that the child node appears under the parent node, given that the parent node exists. Unweighted edges have 1 as the default membership degree between the two connected nodes, which means that the possibility of the child node under its parent is 1.

3. The structure-based approach of keyword querying for fuzzy XML data

In this section, we give our structure-based approach of the keyword query by combining keyword query processing and structure queries on fuzzy XML data. Firstly, we introduce the concept of “object tree”, the “minimum object tree”, and the “nearest object tree”. Then we give the semantics of matching object trees for the keyword queries, and propose our method AO-Twig to combine the structure query language with keyword query to obtain the query results.

Figure 3.

A simplified structure of a fuzzy XML tree.

3.1 The object tree

In reality, objects are used to model real-world entities or abstract concepts. Objects have properties, which may be attributes of the object itself or relationships, known as associations, between the object and one or more other objects [29]. An object thus has two characteristics: (1) it has attributes and attribute values, and (2) it is correlated with other objects. There are two kinds of objects, crisp objects and fuzzy objects. An object is fuzzy because of a lack of information; specifically, an object is a fuzzy object when at least one of its attributes has a fuzzy set as its value. A fuzzy XML document is represented as a tree structure $T$ , and $T$ can be regarded as a fuzzy object $O_{T}$ . $T$ contains many objects, both fuzzy and crisp, denoted by ${\{}O_{1},O_{2},...,O_{n}{\}}$ . Obviously, ${\{}O_{1},O_{2},...,O_{n}{\}}\subseteq O_{T}$ , and ${\{}O_{1},O_{2},...,O_{n}{\}}$ exist in the form of subtrees ${\{}t_{1},t_{2},...,t_{n}{\}}$ . As $O_{T}$ can be represented by $G=(V,E)$ , an object $O_{i}$ can also be represented by $G_{i}=(V_{i},E_{i})$ , where $V_{i}$ is the set of nodes in $O_{i}$ and $E_{i}$ is the set of edges in $O_{i}$ that connect the nodes in $V_{i}$ . In the remainder of this section, we first recall the definitions of the “object tree” and the “minimum object tree” [33], where an object tree can be regarded as an object in the fuzzy XML tree, and then propose the definition of the “nearest object tree”.

Definition 1 (the object tree) Given an XML tree $t$ , suppose that its root node is $r$ . If at least one of the children of $r$ is an attribute node, then $t$ is regarded as an “object tree”, denoted as $O_{r}$ , and its root node $r$ is called the “object node” of $O_{r}$ .

That is, for an XML tree $t$ with root node $r$ , if there exists at least one attribute node $A_{i}$ that is a child of $r$ , then $O_{r}$ is regarded as an object tree. When the values of all attributes of an object tree $O_{r}$ are crisp, $O_{r}$ is regarded as a crisp object tree. When $O_{r}$ has at least one attribute whose value is a fuzzy set, $O_{r}$ is regarded as a fuzzy object tree.

Definition 2 (the minimum object tree) For an object tree $O_{i}$ , if the attribute nodes in its tree structure appear only among the children of the root node $r(O_{i})$ , then $O_{i}$ is regarded as a “minimum object tree”, denoted as $O_{\min}$ .

Figure 4.

The minimum object trees.

Figure 4 shows examples of minimum object trees in the fuzzy XML tree. In the tree structures of Fig. 4, $Z_{i}$ represents the attribute nodes, (A) represents a minimum crisp object tree $O_{a}$ and (B) represents a minimum fuzzy object tree $O_{b}$ . Note that in a minimum object tree the attribute nodes exist only among the children of the root node (that is, the relationship between an attribute node and the root node is parent-child), and no other descendant of the root node is an attribute node.
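Definitions 1 and 2 can be read as two simple checks on a tree. The following is a minimal sketch under the assumption that each node already carries a `kind` tag (for example "attribute", "element" or "value") assigned by a prior object identification step; the `Node` class and both function names are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    kind: str = "element"            # assumed tags: "object", "attribute", "element", "value"
    children: list = field(default_factory=list)

def is_object_tree(root):
    """Definition 1: at least one child of the root is an attribute node."""
    return any(child.kind == "attribute" for child in root.children)

def is_minimum_object_tree(root):
    """Definition 2: attribute nodes occur only among the children of the root."""
    if not is_object_tree(root):
        return False
    # Visit every descendant below the child level and reject attribute nodes.
    stack = [g for child in root.children for g in child.children]
    while stack:
        node = stack.pop()
        if node.kind == "attribute":
            return False
        stack.extend(node.children)
    return True

employee = Node("employee", "object", [
    Node("ename", "attribute", [Node("William Smith", "value")]),
    Node("salary", "attribute", [Node("6000", "value")]),
])
department = Node("department", "object", [Node("dname", "attribute"), employee])
```

In this example `employee` is a minimum object tree, whereas `department` is an object tree but not a minimum one, because the attribute nodes of `employee` lie two levels below its root.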

Definition 3 (the nearest object tree)

Case 1: Given a set of keywords ${\{}k_{1},k_{2},...,k_{n}{\}}$ , the nearest object tree $O_{\textit{near}}$ whose nodes contain the keywords ${\{}k_{1},k_{2},...,k_{n}{\}}$ is determined in one of two ways:

(1)

If the node $s=$ SLCA $(k_{1},k_{2},...,k_{n})$ is an object node, then the object tree $O_{s}$ which is a subtree rooted at $s$ containing keywords ${\{}k_{1},k_{2},...,k_{n}{\}}$ is the nearest object tree.

(2)

If the node $s=$ SLCA $(k_{1},k_{2},...,k_{n})$ is not an object node, then find the node $s^{\prime}=\textit{parent}(s)$ (or $\textit{ancestor}(s)$ ), and the object tree $O_{s^{\prime}}$ containing keywords ${\{}k_{1},k_{2},...,k_{n}{\}}$ is the nearest object tree.

Case (2) works as follows. For a set of keywords ${\{}k_{1},k_{2},...,k_{n}{\}}$ , if $s=$ SLCA $(k_{1},k_{2},...,k_{n})$ is not an object node, we look at its parent node $\textit{parent}(s)$ . If $\textit{parent}(s)$ is an object node, the object tree $O_{\textit{parent}(s)}$ rooted at $\textit{parent}(s)$ is the nearest object tree. If $\textit{parent}(s)$ is not an object node either, we find the nearest ancestor node $\textit{ancestor}(s)$ that is an object node, and the object tree $O_{\textit{ancestor}(s)}$ containing keywords ${\{}k_{1},k_{2},...,k_{n}{\}}$ is the nearest object tree.

Case 2: When only one keyword $k$ is input, the nearest object tree that contains keyword $k$ is determined as follows:

(1)

When keyword $k$ occurs in an attribute node, the nearest object tree is the smallest object tree that contains the attribute node with keyword $k$ .

(2)

When keyword $k$ occurs in an object node, the nearest object tree is the smallest of the object trees whose root nodes contain keyword $k$ .

(3)

When keyword $k$ occurs in an element node, the nearest object tree is the smallest object tree that contains the element node with keyword $k$ .

(4)

When keyword $k$ occurs in a connect node, the nearest object tree is the smallest object tree whose root node is a descendant of the connect node and is nearest to the connect node.

The smallest object tree is defined by node count. Suppose two object trees $O_{1}$ and $O_{2}$ contain $m_{1}$ and $m_{2}$ nodes, respectively. If $m_{1}<m_{2}$ , then the object tree $O_{1}$ is regarded as smaller than $O_{2}$ .
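Case 1 of Definition 3 amounts to walking upward from the SLCA node until an object node is reached, and the notion of “smallest” is a comparison of node counts. The sketch below illustrates both under stated assumptions: nodes are identified by plain strings, `parent` is a dictionary from each node to its parent, and `is_object_node` is a predicate supplied by the object identification step. All names are hypothetical.

```python
def nearest_object_node(node, parent, is_object_node):
    """Return `node` if it is an object node, otherwise its nearest
    ancestor that is one (None if no such ancestor exists)."""
    while node is not None and not is_object_node(node):
        node = parent.get(node)   # the root has no entry, so the walk ends with None
    return node

def smallest_object_tree(node_counts):
    """Pick the object tree (keyed by its root) with the fewest nodes."""
    return min(node_counts, key=node_counts.get)

# Toy tree: department -> employee -> FV1 -> salary
parent = {"salary": "FV1", "FV1": "employee", "employee": "department"}
object_nodes = {"employee", "department"}

root_of_nearest = nearest_object_node("salary", parent, object_nodes.__contains__)
smallest = smallest_object_tree({"employee": 5, "department": 12})
```

With these inputs the walk from `salary` stops at `employee`, which is also the smaller of the two object trees.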

We apply the method in [30] to pre-process the fuzzy XML documents with object identification. All fuzzy XML documents are pre-processed with the object identification operations before users input their keywords.

For a fuzzy XML document, after the operation of object identification on it, we classify the nodes in fuzzy XML into six categories:

(1)

The object (or object tree) node $N_{O}$ , and we use its root node $r(O)$ to denote the object tree $O$ .

(2)

The element node $N_{E}$ , which can be the child node of the attribute node or the child node of the object node.

(3)

The attribute node $N_{A}$ , which always has value nodes as its children nodes.

(4)

The connect node $N_{C}$ , which is a node for connecting the element nodes of the same category.

(5)

The value node $N_{V}$ , which is always a leaf node containing text or values.

(6)

The fuzzy node $N_{F}$ , which refers to the “Dist” node or the “Val” node.

Algorithm 1: AO-Twig
Input: A keyword query
Output: The Top-K query results with the highest scores $(T_{r_{1}},\lambda_{1}),(T_{r_{2}},\lambda_{2}),...,(T_{r_{k}},\lambda_{k})$
1: if only one keyword $k$ is input then
2:     if $k\in N_{V}$ then
3:         find the minimum object trees $O_{\min}$ that contain keyword $k$ as the matching object trees $O_{M}$ ;
4:     else
5:         find the nearest object trees $O_{\textit{near}}$ of keyword $k$ as the matching object trees $O_{M}$ ;
6:     end if
7: else (more than one keyword ${\{}k_{1},k_{2},...,k_{n}{\}}$ is input)
8:     find the nearest object tree $O_{\textit{near}}$ that contains keywords ${\{}k_{1},k_{2},...,k_{n}{\}}$ as the matching object tree $O_{M}$ ;
9: end if
10: return matching object trees $O_{M}$ ;
11: call function fuzzy tree distance ( $O_{1}$ , $O_{2}$ ), set a threshold U, and obtain the approximate matching object trees $O_{AM}$ ;
12: construct structure query statements $q$ according to $O_{AM}$ , and put $q$ into LTwig to get the query results;
13: score the query results and rank them;
14: return the Top-K results with their scores $(T_{r_{1}},\lambda_{1}),(T_{r_{2}},\lambda_{2}),...,(T_{r_{k}},\lambda_{k})$ .

3.2 AO-Twig

In this section, we propose our method for combining keyword query processing and structure queries on fuzzy XML data. Given a fuzzy XML tree $T$ , we first give the method for computing the whole possibility of $T$ . We then give our semantics of matching object trees for keyword queries and propose our query method AO-Twig, which combines the structure query language with keyword query.

Given a fuzzy XML tree, there are three kinds of fuzzy structures:

(type-1)
Fuzzy nodes only appear in a single path (that is, a root to leaf path).
(type-2)
Fuzzy nodes appear among the branches.
(type-3)
Fuzzy nodes appear in the complex structure.

As shown in Fig. 5, tree structure A is an example of type-1, B is an example of type-2, and C is an example of type-3. Combining the above three types, we define the whole possibility of a fuzzy XML subtree as follows.

Figure 5.
Whole possibilities in different fuzzy structures.

Definition 4 (the whole possibility of a fuzzy XML subtree) Given a fuzzy subtree $T_{S}$ with root node $r(T_{S})$ , let $\omega_{1},\omega_{2},...,\omega_{n}$ be the membership degrees on the paths from $r(T_{S})$ to the leaf nodes of $T_{S}$ . Then the whole possibility of the tree $T_{S}$ is $P_{w}(T_{S})=\omega_{1}\times\omega_{2}\times...\times\omega_{n}$ .
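As a small worked example of Definition 4, the sketch below multiplies the membership degrees of all edges of a subtree. It assumes one particular reading of the definition, namely that each edge on a root-to-leaf path contributes its degree exactly once even when several leaves share it, and it represents a tree as a hypothetical pair `(label, [(degree, subtree), ...])`. Unweighted edges are given the default degree 1.

```python
def whole_possibility(tree):
    """Product of the membership degrees of all edges in the subtree.

    tree = (label, [(degree, subtree), ...]); a leaf has an empty child list.
    Assumption: every edge is counted once, regardless of how many leaves lie below it.
    """
    _label, children = tree
    p = 1.0
    for degree, subtree in children:
        p *= degree * whole_possibility(subtree)
    return p

# employee --0.8--> position ; employee --1.0--> contact --0.5--> phone
subtree = ("employee", [
    (0.8, ("position", [])),
    (1.0, ("contact", [(0.5, ("phone", []))])),
])
p_w = whole_possibility(subtree)   # 0.8 * 1.0 * 0.5
```

A subtree without fuzzy edges has whole possibility 1 under this reading, which matches the default degree of unweighted edges.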

Because a structure query language can convey complex and precise semantics and thus obtain more meaningful query results, combining keyword query with the structure query language is a problem worth studying. For common users, however, it is difficult to formulate a structure query statement, because they lack professional knowledge of the structure query language and of the document’s schema. We therefore propose a semantics of matching object trees for keyword queries to capture the query intention of the users. When a keyword query is input, we find the matching object trees $O_{M}$ of the keyword query, and then find the approximate matching object trees $O_{AM}$ in the fuzzy XML document that satisfy fuzzy tree distance ( $O_{M}$ , $O_{AM}$ ) $<$ U, where U is a threshold given by users. After obtaining $O_{AM}$ , structure query statements can be constructed from these approximate matching object trees: a query statement is built from the root node of the object tree, the nodes containing keywords in the object tree, the attribute nodes whose values contain keywords in the object tree, and the path information among these nodes. We pass these query statements to the structure query algorithm (LTwig [11]) to match the twig queries and obtain the query results. After scoring the query results, we rank them with the threshold algorithm and obtain the Top-K query results with the highest scores. The basic framework of our method is shown in Algorithm 1; we call this method AO-Twig.

We now give some details of how AO-Twig handles the input keyword query. When only one keyword $k$ is input and it occurs in a value node, we take the minimum object trees that contain keyword $k$ in a value node as the matching object trees. If keyword $k$ occurs in an attribute node, element node, connect node or object node, we return the nearest object trees according to Definition 3 as the matching object trees. When more than one keyword ${\{}k_{1},k_{2},...,k_{n}{\}}$ is input, we compute SLCA ${\{}k_{1},k_{2},...,k_{n}{\}}$ [25] and return the nearest object tree as the matching object tree according to Definition 3. After obtaining the matching object trees $O_{M}$ of the keywords, we call the function fuzzy tree distance ( $O_{1}$ , $O_{2}$ ) with a threshold U to get the approximate matching object trees $O_{AM}$ ; that is, for two object trees $O_{1}\in O_{M}$ and $O_{2}\in O_{AM}$ , the distance between them satisfies Dist ( $O_{1}$ , $O_{2}$ ) $<$ U. The function fuzzy tree distance ( $O_{1}$ , $O_{2}$ ) is given in Algorithm 2. We improve and extend the structure-aware XML distance algorithm in [6] to compute the distance between two object trees, and propose a string comparison function that considers the possibility of nodes. After obtaining the approximate matching object trees $O_{AM}$ , we pass the query statements constructed from $O_{AM}$ to the structure query algorithm LTwig [11] to get the query result subtrees. LTwig is a holistic algorithm that efficiently evaluates twig queries over fuzzy XML and returns the query answer subtrees whose whole possibilities are no less than the threshold $u_{T}$ (in AO-Twig, $u_{T}$ is set to 0).
When users are not satisfied with the query results and want more meaningful results, they can raise the value of U to get more approximate matching object trees and thus pass more structure query statements to LTwig to obtain more query results.
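The construction of a query statement from an object tree can be pictured with a small string-building sketch. This is an assumption-laden illustration, not the construction used by AO-Twig or the input format of LTwig: it assumes that a statement is an XPath expression rooted at the object node, with one predicate per branch, and that a keyword is checked with the standard XPath 1.0 function `contains`. The function name `build_twig_query` and the example labels are hypothetical.

```python
def build_twig_query(root_label, branches):
    """Build an XPath twig expression for one object tree.

    branches: list of (relative_path, keyword), where relative_path is a tuple of
    labels below the object node and keyword is a string or None.
    Note: keywords are inserted verbatim, so this sketch does not escape quotes.
    """
    predicates = []
    for path, keyword in branches:
        step = "/".join(path)
        if keyword is not None:
            step += '[contains(., "%s")]' % keyword
        predicates.append("[%s]" % step)
    return "//" + root_label + "".join(predicates)

query = build_twig_query("employee", [(("ename",), "Smith"), (("position",), None)])
# query == '//employee[ename[contains(., "Smith")]][position]'
```

Each approximate matching object tree would yield one such statement, so raising the threshold U increases the number of statements that are evaluated.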

Algorithm 2: Function fuzzy tree distance ( $O_{1}$ , $O_{2}$ )
1: if there are two object trees $O_{1}$ , $O_{2}$ with root nodes $r_{1}$ and $r_{2}$ then
2:     xmldist : $=$ $\infty$ ;
3:     for all $l$ in labels(children( $r_{1}$ ) $\cup$ children( $r_{2}$ )) do
4:         for all $v_{i}\in$ children $_{l}(r_{1})$ do
5:             for all $w_{j}\in$ children $_{l}(r_{2})$ do
6:                 $D_{l}$ [ $i$ , $j$ ] : $=$ dist( $O(v_{i})$ , $O(w_{j})$ );
7:             end for
8:         end for
9:         assignment $_{l}$ : $=$ Assignment( $D_{l}$ []);
10:        for all < $h$ , $k$ > $\in$ assignment $_{l}$ do
11:            if xmldist $=$ $\infty$ then xmldist : $=$ 0; end if
12:            xmldist : $=$ xmldist $+$ $D_{l}$ [ $h$ , $k$ ];
13:        end for
14:    end for
15: end if
16: return xmldist.

To explain Algorithm 2, we start with the following definitions:

Definition 5 (overlay) For two object trees $O_{1}$ and $O_{2}$ , an $\textit{overlay}(O_{1},O_{2})$ is a non-empty set of pairs of nodes from $O_{1}$ and $O_{2}$ . For all $v_{i}$ , $v^{\prime}_{i}\in O_{i}$ and all $n_{i}\in O_{i}\setminus$ leaves( $O_{i}$ ), $i=$ 1, 2, the overlay ( $O_{1}$ , $O_{2}$ ) has the following properties:

(1)
if < $v_{1}$ , $v_{2}$ >, < $v^{\prime}_{1}$ , $v^{\prime}_{2}$ > $\in$ overlay ( $O_{1}$ , $O_{2}$ ), then $v_{1}=v^{\prime}_{1}$ if and only if $v_{2}=v^{\prime}_{2}$
(2)
if < $v_{1}$ , $v_{2}$ > $\in$ overlay ( $O_{1}$ , $O_{2}$ ), then path( $v_{1}$ ) $=$ path( $v_{2}$ )
(3)
< $n_{1}$ , $n_{2}$ > $\in$ overlay ( $O_{1}$ , $O_{2}$ ) if $\exists v_{1}$ , $v_{2}$ such that $n_{1}=$ parent( $v_{1}$ ) $\wedge n_{2}=$ parent( $v_{2}$ ) $\wedge$ < $v_{1}$ , $v_{2}$ > $\in$ overlay ( $O_{1}$ , $O_{2}$ ).

Here, path( $v_{i}$ ) denotes the sequence of node labels label(root)…label( $v_{i}$ ) encountered when traversing $O_{i}$ from the root $r_{i}$ to node $v_{i}$ .

From the above definition, if < $v_{1}$ , $v_{2}$ > $\in$ overlay ( $O_{1}$ , $O_{2}$ ), we say that $v_{1}$ and $v_{2}$ are matched. An overlay matches nodes of $O_{1}$ to nodes of $O_{2}$ one to one, and two nodes, whether leaves or not, are matched only if they have the same path from the root. Two non-leaf nodes can be matched if they are ancestors of two matched leaves. A node that cannot be matched with any other node is a deleted node, and all descendants of a deleted node are also deleted nodes. Consequently, an overlay of two trees exists only if there exist two leaves $l_{1}\in O_{1}$ and $l_{2}\in O_{2}$ with the same path from the root. We say that two trees are comparable if they have at least one overlay.

Definition 6 (the existence possibility of a node) Given a fuzzy XML tree $T$ , suppose its root node is $r$ , for a node $v$ in the tree $T$ , if the membership degrees on the path from node $r$ to node $v$ are $\alpha_{1}$ , $\alpha_{2}$ , … $\alpha_{n}$ , then the existence possibility of node $v$ , $P_{E}(v)=\alpha_{1}\times\alpha_{2}\times...\times\alpha_{n}$ .

Definition 7 (cost of a match) Let sdist ( $s_{1}$ , $s_{2}$ ) be a string comparison function. The cost of match for two nodes $v$ , $w$ is:

$\displaystyle u_{v,w}=\begin{cases}\textit{sdist}(\textit{text}(v),\textit{text}(w))&\textit{if }v,w\textit{ are leaves}\\ 0&\textit{otherwise}\end{cases}$ (1)

Algorithm 3: Function Assignment
Input: the distance matrix A[ $i$ , $j$ ] of two subtrees $t_{1}$ , $t_{2}$ rooted at nodes with the same label
Output: the assignment plan A ${}^{\prime}$ [ $i$ , $j$ ]
1: Find the minimum element $a_{i}$ $(0<i\leqslant n)$ in each row;
2: For each element $a$ in row $i$ , compute $a-a_{i}$ to obtain a new matrix;
3: Find the minimum element $b_{j}$ $(0<j\leqslant n)$ in each column of the new matrix and, for each element $b$ in column $j$ , compute $b-b_{j}$ to obtain the matrix A ${}^{-}$ [ $i$ , $j$ ];
4: If every row and every column of A ${}^{-}$ [ $i$ , $j$ ] contains a 0 element, count the independent 0 elements, i.e., 0 elements of which no two lie in the same row or column;
5: If there are $n$ independent 0 elements, set these elements to 1 and all other elements to 0 to obtain the assignment plan, let A ${}^{\prime}$ [ $i$ , $j$ ] $=$ A ${}^{-}$ [ $i$ , $j$ ], and return A ${}^{\prime}$ [ $i$ , $j$ ];
6: else cover all 0 elements with a minimum set of straight lines as follows:
  <1> Assign $\times$ to each row $a_{m}$ that has no independent 0 element;
  <2> For each row $a_{m}$ assigned $\times$ , assign $\times$ to every column $b_{n}$ that has a 0 element in that row;
  <3> For each column $b_{n}$ assigned $\times$ , assign $\times$ to the row $a_{k}$ that has an independent 0 element in that column; repeat <2> and <3> until no further row or column can be assigned $\times$ ;
  <4> Draw straight lines through the rows that are not assigned $\times$ and the columns that are assigned $\times$ ; these lines form the minimum set of straight lines covering all 0 elements;
7: Find the minimum element $a^{\prime}$ among the elements $a_{i}^{\prime}$ that are not covered by the straight lines and compute $a_{i}^{\prime}-a^{\prime}$ for each of them; for each element $c_{i}$ at an intersection of two straight lines, compute $c_{i}+a^{\prime}$ ; this yields the new matrix $A_{2}$ [ $i$ , $j$ ];
8: goto 4 and continue with $A_{2}$ [ $i$ , $j$ ];
9: return the assignment plan A ${}^{\prime}$ [ $i$ , $j$ ].
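
The assignment step can be illustrated with a short Python sketch. It does not follow the line-covering procedure above step by step: it uses the shortest-augmenting-path formulation of the Hungarian method, which solves the same minimum-cost assignment problem and also accepts rectangular matrices with no more rows than columns. The function name `assignment` and the list-of-lists matrix layout are assumptions made for illustration only.

```python
def assignment(cost):
    """Minimum-cost assignment for an n x m cost matrix with n <= m.

    Returns a sorted list of (row, column) index pairs, one per row.
    Illustrative sketch (assumed interface), not the paper's implementation.
    """
    n = len(cost)
    m = len(cost[0]) if n else 0
    if n > m:
        raise ValueError("expects no more rows than columns; transpose first")
    INF = float("inf")
    u = [0.0] * (n + 1)   # row potentials (1-based)
    v = [0.0] * (m + 1)   # column potentials (1-based)
    p = [0] * (m + 1)     # p[j] = row currently matched to column j, 0 if free
    way = [0] * (m + 1)   # predecessor column on the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0 = p[j0]
            delta = INF
            j1 = 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j] = cur
                        way[j] = j0
                    if minv[j] < delta:
                        delta = minv[j]
                        j1 = j
            # Shift the potentials so that at least one new reduced cost is 0.
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the matching along the augmenting path back to column 0.
        while True:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
            if j0 == 0:
                break
    return sorted((p[j] - 1, j - 1) for j in range(1, m + 1) if p[j] != 0)
```

For instance, `assignment([[4, 1, 3], [2, 0, 5], [3, 2, 2]])` pairs each row with a distinct column so that the sum of the selected entries is minimal.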

Here $\textit{sdist}(s_{1},s_{2})$ is a string comparison function, and $P_{E}(v)$ , $P_{E}(w)$ are the existence possibilities of the two compared nodes. For a node $v$ with no fuzzy nodes on the path from the root node of the document to it, $P_{E}(v)=$ 1. The cost of an overlay is defined as Cost $(\Gamma_{\textit{overlay}})=P_{E}(v)\times P_{E}(w)\times\sum\limits_{v,w\in\textit{overlay}}u_{v,w}$ . For two object trees $O_{1}$ and $O_{2}$ , the overlay ( $O_{1}$ , $O_{2}$ ) is optimal if it is complete and there is no other complete overlay $(O_{1}$ , $O_{2})^{\prime}$ such that $\Gamma_{\textit{overlay}(O_{1},O_{2})^{\prime}}<\Gamma_{\textit{overlay}(O_{1},O_{2})}$ . The fuzzy tree distance of object trees $O_{1}$ and $O_{2}$ is defined as the cost of an optimal overlay of $O_{1}$ and $O_{2}$ .

We now describe Algorithm 2 in more detail. The algorithm analyzes two comparable object trees by considering the children of their roots, and recursively computes a fuzzy tree distance for each pair of subtrees rooted at children with the same label. After all distances have been calculated, the algorithm assigns each node to another node with the same label so as to minimize the overall cost. This assignment problem can be solved using a variation of the Hungarian Algorithm [9] (see Algorithm 3), and the task is performed by a call to the function Assignment. Given a matrix of distances, the function Assignment returns a set of assignments, each a pair of indices of assigned nodes. The set of all children of node $v$ having label $l$ is denoted by $\textit{children}_{l}(v)$ . The array $D_{l}$ stores the results of the distance calculations for the children having label $l$ . The distance is initially set to $\infty$ , and is reset to 0 only if there is at least one assignment of children of the root.
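
The recursion described above can be sketched in Python as follows. This is a minimal illustration under several assumptions that are not taken from the paper: the `Node` class, the use of `difflib` similarity as the string distance `sdist`, the leaf base case (two leaves are compared with `sdist`, following Definition 7), and a brute-force `best_assignment` that stands in for Function Assignment and is only practical for small fan-out. The existence possibilities $P_{E}$ are not applied here.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from itertools import permutations
from math import inf


@dataclass
class Node:
    label: str
    text: str = ""
    children: list = field(default_factory=list)


def sdist(a, b):
    """Hypothetical string distance in [0, 1]; any string metric could be used."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def best_assignment(D):
    """Brute-force stand-in for Function Assignment (small matrices only)."""
    rows, cols = len(D), len(D[0])
    if rows <= cols:
        candidates = (list(enumerate(perm))
                      for perm in permutations(range(cols), rows))
    else:
        candidates = ([(h, k) for k, h in enumerate(perm)]
                      for perm in permutations(range(rows), cols))

    def key(pairs):
        costs = [D[h][k] for h, k in pairs]
        # Prefer fewer incomparable pairs, then the smaller finite total.
        return (sum(c == inf for c in costs), sum(c for c in costs if c != inf))

    return min(candidates, key=key)


def fuzzy_tree_distance(v, w):
    """Distance between the subtrees rooted at v and w (same root label assumed)."""
    if not v.children and not w.children:
        return sdist(v.text, w.text)  # assumed base case: cost of a leaf match
    xmldist = inf
    labels = {c.label for c in v.children} | {c.label for c in w.children}
    for label in sorted(labels):
        vs = [c for c in v.children if c.label == label]
        ws = [c for c in w.children if c.label == label]
        if not vs or not ws:
            continue  # no counterpart with this label: these are deleted nodes
        D = [[fuzzy_tree_distance(a, b) for b in ws] for a in vs]
        for h, k in best_assignment(D):
            if D[h][k] == inf:
                continue  # incomparable subtrees stay unmatched
            if xmldist == inf:
                xmldist = 0.0
            xmldist += D[h][k]
    return xmldist
```

Two trees that differ only in the order of equally labelled children get distance 0, and two trees whose roots share no child label get distance $\infty$ , which corresponds to trees that are not comparable.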

4. The realization of AO-Twig for keyword querying on fuzzy XML data

In this section, we give the concrete implementation of our method in terms of the scoring mechanism, the index construction and the rank mechanism. We start with the scoring mechanism.

4.1 The scoring mechanism

In order to obtain the results that are most relevant to the keyword queries, we propose a scoring mechanism to distinguish and rank the query results obtained by LTwig. The mechanism considers both the tf*idf statistical method and the keyword-possibilities of the answers.

We model each result subtree obtained from LTwig as a document consisting of the keywords in the subtree, and use tf*idf to score the relevance of the subtree to each keyword. Given a fuzzy XML document $D$ , assume that there are $n$ nodes and $m$ keywords in $D$ . For a result subtree $T_{r}\in D$ and a keyword $k_{i}$ (1 $\leqslant i\leqslant m$ ) contained in $T_{r}$ , tf ( $k_{i}$ , $T_{r}$ ) is the term frequency of $k_{i}$ in $T_{r}$ , that is, the number of occurrences of $k_{i}$ in $T_{r}$ . idf ( $k_{i}$ ) is the inverse document frequency of $k_{i}$ , with idf $(k_{i})=\frac{n+1}{N_{k_{i}}+1}$ , where $N_{k_{i}}$ is the number of nodes that contain keyword $k_{i}$ . ntl ( $T_{r}$ ) is the normalized term length of $T_{r}$ , with ntl $(T_{r})=\frac{|T_{r}|}{|T_{r(\max)}|}$ , where $|T_{r}|$ denotes the number of terms in $T_{r}$ and $|T_{r(\max)}|$ denotes the number of terms in the node with the maximal number of terms. Given a set of keywords ${\{}k_{1},k_{2},...,k_{n}{\}}$ , the relevance degree of the result subtree $T_{r}$ to keyword $k_{i}$ is calculated by the following formula:

$\displaystyle\textit{Score}(T_{r},k_{i})=\frac{\ln(1+\textit{tf}(k_{i},T_{r})\times\ln(\textit{idf}(k_{i})))}{(1-s)+s\times\textit{ntl}(T_{r})}$ (2)

In the above formula, $s$ is a constant (usually set to 0.2). The relevance degree of the result subtree $T_{r}$ to the set of keywords $\{k_{1},k_{2},...,k_{n}\}$ is defined as follows:

$\displaystyle\textit{Score}(T_{r},k)=\sum\limits_{i=1}^{n}\textit{Score}(T_{r},k_{i})$ (3)
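
As a small illustration, Eqs. (2) and (3) can be evaluated as in the following Python sketch. It implements Eq. (2) as printed, with the logarithm of the idf nested inside the outer logarithm. The function names and the way the statistics are passed in are assumptions for illustration, and the counts in the usage note are made up.

```python
from math import log


def idf(n_nodes, n_nodes_with_keyword):
    """idf(k_i) = (n + 1) / (N_ki + 1), as defined in the text."""
    return (n_nodes + 1) / (n_nodes_with_keyword + 1)


def score_keyword(tf, idf_value, ntl, s=0.2):
    """Eq. (2): ln(1 + tf * ln(idf)) / ((1 - s) + s * ntl)."""
    return log(1.0 + tf * log(idf_value)) / ((1.0 - s) + s * ntl)


def score(tfs, idf_values, ntl, s=0.2):
    """Eq. (3): sum of the per-keyword scores of one result subtree."""
    return sum(score_keyword(tf, i, ntl, s) for tf, i in zip(tfs, idf_values))
```

For example, with hypothetical counts of 99 nodes in the document and 9 nodes containing the keyword, `idf(99, 9)` is 10, and a keyword that does not occur in the subtree (tf equal to 0) contributes a score of 0.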

Fuzzy XML contains fuzzy information, including the membership degrees of the elements and the possibility distributions of the attribute values, so a result subtree may contain fuzzy information. To account for the keyword nodes contained in the result subtrees, we define the keyword-possibility of a result subtree.

Definition 8 (the keyword-possibility of the result subtree) Given a fuzzy XML tree $T$ and a query result subtree $T_{r}$ , let their root nodes be $r(T)$ and $r(T_{r})$ , respectively. Assuming that the membership degrees on the path from $r(T)$ to $r(T_{r})$ are $\varphi_{1}$ , $\varphi_{2}$ , …, $\varphi_{i}$ , and the membership degrees on the paths from $r(T_{r})$ to the keyword nodes are $\beta_{1}$ , $\beta_{2}$ , …, $\beta_{j}$ , the keyword-possibility $P_{k}(T_{r})$ of the result subtree is calculated with the following formula:

$\displaystyle P_{k}(T_{r})=\varphi_{1}\times\varphi_{2}\times\cdots\times\varphi_{i}\times\beta_{1}\times\beta_{2}\times\cdots\times\beta_{j}$ (4)

Taking the keyword-possibility of the result subtrees into consideration, we use the following formula to score the result trees:

$\displaystyle\textit{Score}_{p}(T_{r},k)=P_{k}(T_{r})\times\textit{Score}(T_{r},k)=P_{k}(T_{r})\times\sum\limits_{i=1}^{n}\textit{Score}(T_{r},k_{i})$ (5)
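
Eqs. (4) and (5) amount to a product of membership degrees followed by a weighting of the summed keyword scores. A minimal Python sketch, assuming the membership degrees and the per-keyword scores are already available as plain lists (the function names are hypothetical):

```python
from math import prod


def keyword_possibility(phis, betas):
    """Eq. (4): product of the degrees above the subtree root (phis)
    and the degrees between the subtree root and the keyword nodes (betas).
    An empty list means no fuzzy node on that path and contributes 1."""
    return prod(phis, start=1.0) * prod(betas, start=1.0)


def score_p(phis, betas, keyword_scores):
    """Eq. (5): keyword-possibility times the summed per-keyword scores."""
    return keyword_possibility(phis, betas) * sum(keyword_scores)
```

With one degree of 0.5 above the subtree root and degrees 0.5 and 0.5 inside the subtree, the keyword-possibility is 0.125, so the tf*idf score of that subtree is scaled down to one eighth.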

4.2 Index construction

For a fuzzy XML document $D$ , we build five indexes to serve our query methods, shown as follows:

(1) The index $L_{P}$ (Docid, begin( $v$ ): end( $v$ ), LevelNum, Fuzzy, FuzzySequence)

$L_{P}$ is the index of the positions of the nodes in the fuzzy XML document, where (a) “Docid” is the identifier of a fuzzy XML document, which distinguishes different documents. (b) We visit the nodes of the fuzzy XML tree in preorder; “begin( $v$ )” is the order number generated when $v$ is first visited, before any of its descendants, and “end( $v$ )” is the order number generated when $v$ is visited again after all of its descendants. If begin( $v_{1}$ ) $<$ begin( $v_{2}$ ) and end( $v_{2}$ ) $<$ end( $v_{1}$ ), then $v_{1}$ is an ancestor node of $v_{2}$ . (c) “LevelNum” is the nesting depth of the element in the fuzzy XML document. The “LevelNum” of the root node is set to 1, and the “LevelNum” increases by 1 with each following level. (d) “Fuzzy” is a Boolean value that expresses whether a node is a fuzzy node or not. If “Fuzzy” is equal to 1, the node is a fuzzy one, and if “Fuzzy” is equal to 0, the node is a crisp one. (e) “FuzzySequence” is an ordered set that stores the information of the fuzzy nodes (including their names and memberships) from the root to the current node. If the path from the root to the current node contains no fuzzy nodes, it can be null.
(2) The object tree node index $L_{O}$

After the fuzzy XML documents are pre-processed with the object identification operation, the codes (begin( $r$ ), end( $r$ )) of the root nodes of the object trees are recorded in the index $L_{O}$ .

(3) The minimum object tree index $L_{O_{\min}}$

$L_{O_{\min}}$ stores the entry of each minimum object tree $O_{\min}$ , including its root node and the crisp nodes in $O_{\min}$ .
(4) The index $L_{\textit{keyword}}{\{}r(T_{r}),\theta_{i}{\}}$

For a result subtree $T_{r}$ with root node $r(T_{r})$ , this index stores information about the nodes of the subtree that contain keywords. $\theta_{i}$ denotes the membership degrees on the paths from $r(T_{r})$ to the nodes containing keywords, listed from the leftmost node to the rightmost node. For example, assume that there are three nodes $v_{1}$ , $v_{2}$ and $v_{3}$ containing keywords from left to right in the result subtree $T_{r}$ , and that the membership degrees on the paths from $r(T_{r})$ to $v_{1}$ , $v_{2}$ and $v_{3}$ are (0.8, 0.7, 0.6), (0.5) and (0.6, 0.7, 0.9), respectively. Then the entry stored in $L_{\textit{keyword}}{\{}r(T_{r}),\theta_{i}{\}}$ is {begin( $r(T_{r})$ ), end( $r(T_{r})$ ), (0.8, 0.7, 0.6), (0.5), (0.6, 0.7, 0.9)}.

The indexes $L_{P}$ and $L_{\textit{keyword}}{\{}r(T_{r}),\theta_{i}{\}}$ are used for computing the whole possibility and the keyword-possibility of the result subtrees. The membership degrees $\varphi_{i}$ on the path from the root node $r$ of the document to the root node $r(T_{r})$ of a result subtree are obtained from the index $L_{P}$ , the membership degrees $\beta_{i}$ on the paths from $r(T_{r})$ to the keyword nodes in the result subtree are obtained from the index $L_{\textit{keyword}}{\{}r(T_{r}),\theta_{i}{\}}$ , and the keyword-possibility is then computed by Eq. (4).
(5) The score index $M_{\textit{score}}{\{}M_{i}(r(T_{r}),\psi_{i}){\}}$

Consider a set of keywords ${\{}k_{1},k_{2},...,k_{n}{\}}$ and a result subtree $T_{r}$ . For an entry $M_{1}(r(T_{r}),\psi_{1})$ stored in the index $M_{\textit{score}}$ , $r(T_{r})$ is the root node of the result subtree $T_{r}$ , and $\psi_{1}$ is the score Score ${}_{p}$ ( $T_{r}$ , $k_{1})$ , the relevance degree of the result subtree $T_{r}$ to keyword $k_{1}$ , where $\textit{Score}_{p}(T_{r},k_{1})=P_{k}(T_{r})\times\textit{Score}(T_{r},k_{1})$ . Similarly, $M_{i}(r(T_{r}),\psi_{i})$ stores the root node of the subtree $T_{r}$ and the score $\textit{Score}_{p}(T_{r},k_{i})$ , the relevance degree of the result subtree $T_{r}$ to keyword $k_{i}$ .
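
The (begin, end, LevelNum) part of the position index can be illustrated with a short Python sketch that numbers the nodes during a depth-first traversal and then tests the ancestor relationship by interval containment. The nested-tuple tree representation, the function names and the assumption that node names are unique are all made for illustration only. The Fuzzy and FuzzySequence fields are omitted.

```python
from itertools import count


def encode(node, level=1, counter=None, out=None):
    """Assign (begin, end, level) codes to every node of a tree.

    A node is a pair (name, [children]); names are assumed to be unique.
    """
    counter = counter if counter is not None else count(1)
    out = out if out is not None else {}
    name, children = node
    begin = next(counter)            # number taken before visiting descendants
    for child in children:
        encode(child, level + 1, counter, out)
    end = next(counter)              # number taken after visiting descendants
    out[name] = (begin, end, level)
    return out


def is_ancestor(a, b):
    """True if the node coded a strictly contains the node coded b."""
    return a[0] < b[0] and b[1] < a[1]
```

For the tree `("dblp", [("paper", [("title", []), ("author", [])]), ("book", [])])`, the interval of `paper` contains the interval of `author`, while the interval of `book` does not contain that of `title`.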

4.3 Rank mechanism

There are many threshold-based techniques for ranking the Top-K answers, such as Fagin’s Algorithm [15] and the Threshold Algorithm (TA) [16, 17]. In this paper, we adopt the TA algorithm to obtain the Top-K answers with the highest scores. The algorithm first retrieves each keyword list ${\{}M_{i}(r(T_{r}),\psi_{i}){\}}$ in the index $M_{\textit{score}}$ , then computes $\textit{ScoreComb}(\psi_{1},\psi_{2},\ldots,\psi_{n})=\psi_{1}+\psi_{2}+\ldots+\psi_{n}$ , and identifies the Top-K results with the highest scores. The TA algorithm is shown as follows.

Algorithm 4: TA algorithm
1: Do sorted access in parallel to each keyword list ${\{}M_{i}(r(T_{r}),\psi_{i}){\}}$ . As each object $T_{r}$ is seen under sorted access in one keyword list, do random accesses to the remaining keyword lists and apply the ScoreComb function to find the final score of object $T_{r}$ . If $\textit{Score}_{p}(T_{r},k)$ is one of the Top-K scores seen so far, keep object $T_{r}$ along with its score.
2: Define the threshold value as ScoreComb ( $\psi_{1}$ , $\psi_{2},\ldots,\psi_{n}$ ), where $\psi_{i}$ is the last score seen under sorted access in the $i$ -th list. The threshold represents the highest possible score of any object that has not been seen so far in any keyword list.
3: If the current Top-K objects seen so far have scores greater than or equal to the threshold, return those objects. Otherwise, return to step 1.
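
The three steps can be sketched in Python as follows. This is a simplified illustration rather than the implementation used in the experiments: each keyword list is assumed to be a dictionary from a subtree identifier to its score $\psi_{i}$ , a subtree missing from a list is assumed to score 0 for that keyword, and sorted access advances one position in every list per round.

```python
import heapq


def threshold_algorithm(lists, k):
    """Return the k objects with the highest summed scores as (id, score) pairs.

    lists: one dict per keyword, mapping object id -> score (assumed layout).
    """
    sorted_lists = [sorted(l.items(), key=lambda kv: -kv[1]) for l in lists]
    seen = {}                                   # object id -> combined score
    max_depth = max((len(s) for s in sorted_lists), default=0)
    for depth in range(max_depth):
        last = []                               # last score seen in each list
        for s in sorted_lists:
            if depth < len(s):
                obj, psi = s[depth]             # sorted access
                last.append(psi)
                if obj not in seen:             # random access to the other lists
                    seen[obj] = sum(l.get(obj, 0.0) for l in lists)
            else:
                last.append(0.0)                # exhausted list: unseen objects score 0
        threshold = sum(last)                   # ScoreComb of the last scores seen
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(top) >= k and top[-1][1] >= threshold:
            return top                          # no unseen object can do better
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
```

The early exit is what distinguishes TA from scoring every object: once the k-th best combined score reaches the threshold, the remaining entries of the lists need not be read.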

5. Experiments

In this section, we first present experimental results on the performance of our proposed method in comparison with the existing traditional keyword query methods, analyzing the query results on the metrics of precision, recall, ROC curve and $F$ -measure. Then, we compare AO-Twig with ROstack, a previously proposed method for processing keyword queries over fuzzy XML data, on the metrics of precision and recall.

5.1 Experimental setup and dataset

All the experiments are performed on a laptop with a 2.13 GHz Intel Core i3 processor and 2 GB of memory, running Windows 7. We use a real dataset, DBLP [7], and a synthetic dataset, XMark [22], to test our method.

Table 1
The keyword query examples on different datasets

ID	Keywords	ID	Keywords
$D_{1}$	Information, retrieval	$D_{4}$	XML, twig, query
$D_{2}$	Fuzzy, model	$D_{5}$	XML, rank
$D_{3}$	Relational, database, keyword, query
$X_{1}$	America, item	$X_{4}$	Buyer, open_auction, credit card
$X_{2}$	Buyer, ship, region, phone	$X_{5}$	United States, close_auction
$X_{3}$	Asia, item, person 30

Figure 6. The Top-K precision on different datasets.

Figure 7. The Top-K recall on different datasets.

We generate five datasets $\textit{DB}_{1}$ , $\textit{DB}_{2}$ , $\textit{DB}_{3}$ , $\textit{DB}_{4}$ , $\textit{DB}_{5}$ from DBLP, with sizes of 60 M, 80 M, 100 M, 120 M and 140 M, respectively. For XMark, we generate five datasets $\textit{XM}_{1}$ , $\textit{XM}_{2}$ , $\textit{XM}_{3}$ , $\textit{XM}_{4}$ , $\textit{XM}_{5}$ with sizes of 20 M, 40 M, 60 M, 80 M and 100 M, respectively. For each crisp dataset, we use the random fuzzy information generation method [28] to transform the crisp XML documents into fuzzy XML documents, keeping the ratio of fuzzy nodes between 10% and 20%. After the transformation, the newly generated fuzzy XML datasets for $\textit{DB}_{1}$ , $\textit{DB}_{2}$ , $\textit{DB}_{3}$ , $\textit{DB}_{4}$ , $\textit{DB}_{5}$ are denoted as $\textit{FDB}_{1}$ , $\textit{FDB}_{2}$ , $\textit{FDB}_{3}$ , $\textit{FDB}_{4}$ , $\textit{FDB}_{5}$ , with sizes of 75 M, 90 M, 113 M, 135 M and 155 M, respectively. The fuzzy XML datasets for $\textit{XM}_{1}$ , $\textit{XM}_{2}$ , $\textit{XM}_{3}$ , $\textit{XM}_{4}$ , $\textit{XM}_{5}$ are $\textit{FXM}_{1}$ , $\textit{FXM}_{2}$ , $\textit{FXM}_{3}$ , $\textit{FXM}_{4}$ , $\textit{FXM}_{5}$ , with sizes of 30 M, 48 M, 75 M, 92 M and 118 M, respectively. Before the experiments, we pre-process the fuzzy documents with the object identification operation, using the method in [30]. The keyword query examples on the different datasets are shown in Table 1.

5.2 Evaluation of query quality

We compare our method AO-Twig with the traditional XML keyword query methods XRank and SLCA on three standard metrics: precision, recall and $F$ -measure. For a keyword query $q$ , a structure query statement $q_{(L)}$ is built by an expert according to the structure and schema of the document and evaluated with LTwig. The answer of $q_{(L)}$ obtained with LTwig, denoted as $A_{q(L)}$ , is regarded as the accurate answer. The answer of the keyword query $q$ obtained with a keyword query method, denoted as $A_{q}$ , is regarded as the approximate answer. Let $|A_{q}\cap A_{q(L)}|$ denote the number of nodes that are in both $A_{q}$ and $A_{q(L)}$ , $|A_{q}|$ the number of nodes in $A_{q}$ and, similarly, $|A_{q(L)}|$ the number of nodes in $A_{q(L)}$ . Then the precision is $P_{\textit{precision}}=\frac{|A_{q}\cap A_{q(L)}|}{|A_{q}|}$ and the recall is $P_{\textit{recall}}=\frac{|A_{q}\cap A_{q(L)}|}{|A_{q(L)}|}$ . The $F$ -measure is computed by the following formula: $F=2\times\frac{P_{\textit{precision}}\times P_{\textit{recall}}}{P_{\textit{precision}}+P_{\textit{recall}}}$ $(P_{\textit{precision}}+P_{\textit{recall}}\neq 0)$ .
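
As a small worked illustration of these three metrics, the following Python sketch computes them from two collections of node identifiers. The function name and the convention of returning 0 when a denominator is 0 are assumptions made for illustration.

```python
def precision_recall_f(approximate, accurate):
    """Precision, recall and F-measure of an approximate answer A_q
    against the accurate answer A_q(L), both given as node identifiers."""
    approximate, accurate = set(approximate), set(accurate)
    hits = len(approximate & accurate)          # |A_q ∩ A_q(L)|
    precision = hits / len(approximate) if approximate else 0.0
    recall = hits / len(accurate) if accurate else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

For example, if a keyword query returns the nodes {1, 2, 3, 4} and the structure query returns {2, 3, 4, 5}, three nodes are shared, so precision, recall and $F$ -measure are all 0.75.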

First, we compare the precision of our method with that of the traditional keyword query methods XRank and SLCA. We use the tf*idf document relevance score to score and rank the query results obtained by SLCA. We run keyword queries $D_{1}-D_{5}$ over datasets $\textit{FDB}_{1}$ and $\textit{FDB}_{3}$ , and $X_{1}-X_{5}$ over datasets $\textit{FXM}_{2}$ and $\textit{FXM}_{3}$ , with each query method, and take the Top-K results with the highest scores. The precision of the Top-K (K $=$ 20) results is shown in Fig. 6. AO-Twig achieves a significantly higher precision than the traditional XML keyword query methods: on average, its precision exceeds that of XRank by 35% and that of SLCA by 45%. The traditional keyword query methods cannot distinguish fuzzy nodes from crisp nodes, and cannot capture the fuzzy information carried by the membership degrees of elements and the possibility distributions of attribute values, which lowers the precision of their query results. AO-Twig combines the keyword query with the structure query language, so it can effectively obtain useful query results from fuzzy XML data with a high precision.

We run keyword queries $D_{1}-D_{5}$ over the $\textit{FDB}_{1}$ and $\textit{FDB}_{3}$ datasets, and keyword queries $X_{1}-X_{5}$ over the $\textit{FXM}_{2}$ and $\textit{FXM}_{3}$ datasets. The Top-K (K $=$ 20) recall of the results is shown in Fig. 7. The experimental results show that AO-Twig obtains comprehensive query results with a high recall.

We run keyword queries $D_{1}-D_{5}$ over the $\textit{FDB}_{5}$ dataset, and get the Top-K (K $=$ 100) results with the highest scores through the method AO-Twig. We divide the 100 results into 10 groups in descending order of score, each group consisting of 10 results. The first group has the top-10 results with the highest scores, the second group has the 11th to the 20th results, and so on. We then analyse the ROC (Receiver Operating Characteristic) curve of the results. The curve has two variables, TPR and FPR. TPR is the true positive rate, computed with the formula TPR $=$ TP/(TP $+$ FN). FPR is the false positive rate, computed with the formula FPR $=$ FP/(FP $+$ TN). In these formulas, TP is the number of true positives, i.e., results that are positive and are predicted as positive. FN is the number of false negatives, i.e., results that are positive but are predicted as negative. FP is the number of false positives, i.e., results that are negative but are predicted as positive. TN is the number of true negatives, i.e., results that are negative and are predicted as negative. We obtain the Top-K (K $=$ 100) results of the structure queries built for the keyword queries as the accurate results, and divide them into 10 groups in descending order of score. Each group has 10 results, and the first group has the Top-10 results with the highest scores. Comparing the approximate results with the accurate results group by group, we obtain 10 coordinate points (FPR, TPR). The ROC curve is shown in Fig. 8. The experimental results show that the AO-Twig method performs well in finding the Top-K results.
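
The two rates behind each ROC point follow directly from the four counts. A minimal Python sketch, in which the function name and the convention of returning 0 for an empty denominator are assumptions, and the counts in the usage note are made up for illustration:

```python
def roc_point(tp, fn, fp, tn):
    """Return the (FPR, TPR) coordinate for one group of results."""
    tpr = tp / (tp + fn) if tp + fn else 0.0   # TPR = TP / (TP + FN)
    fpr = fp / (fp + tn) if fp + tn else 0.0   # FPR = FP / (FP + TN)
    return fpr, tpr
```

With hypothetical counts TP = 8, FN = 2, FP = 1 and TN = 9, the point is (0.1, 0.8). Points close to the upper-left corner of the ROC plane indicate a good ranking.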

Figure 8. The ROC curve.

The $F$ -measure of the three keyword query methods on fuzzy XML data is shown in Table 2. For example, the $F$ -measure of XRank on the FXM dataset is 62%, the $F$ -measure of SLCA on the FDB dataset is 60%, and the $F$ -measure of AO-Twig on the FDB dataset is 97%. AO-Twig achieves the best $F$ -measure on both the FDB and the FXM datasets.

Table 2

The $F$ -measure (%)

Dataset	XRank	SLCA	AO-Twig
FDB	70	60	97
FXM	62	53	92

Figure 9. The Top-K precision and Top-K recall when varying the values of threshold U.

Next, we compare the performance of AO-Twig with that of ROstack [33] on the metrics of precision and recall under different values of the threshold U. As the ROstack algorithm does not have a ranking mechanism, we use the TA algorithm (Algorithm 4 in Section 4.3) to rank the results of the keyword queries obtained with ROstack and get the Top-K results with their possibilities (scores). We randomly choose 10 groups of keyword queries consisting of 1 $\sim$ 4 keywords, and use AO-Twig and ROstack, respectively, to obtain the Top-K results of these 10 keyword queries over the dataset $\textit{FDB}_{5}$ . The results obtained with AO-Twig and ROstack are the approximate results. We transform each keyword query into the corresponding structure query, and take the results of the built structure queries as the accurate results. The precision and recall of the results are computed with the formulas in Section 5.2.

With K $=$ 25 and the threshold U varying (U $=$ 0.2, 0.4, 0.6, 0.8, 1), the average Top-K precision and the average Top-K recall of AO-Twig and ROstack are shown in Fig. 9. The average precision of ROstack for the Top-K results with the highest scores is 87% on the dataset $\textit{FDB}_{5}$ , while the average precision of AO-Twig varies with the value of U. The average recall of ROstack for the Top-K results with the highest scores is 91%, and the average recall of AO-Twig also varies with the value of U. We can conclude that, for AO-Twig, both the average precision and the average recall increase as the value of the threshold U increases. When the value of U is low (U $\leqslant$ 0.4), the average precision of AO-Twig is lower than that of ROstack: with a low U, few approximate matching object trees are obtained in AO-Twig, few structure queries are constructed, and therefore fewer accurate answers are obtained. The average recall of AO-Twig also decreases as the value of U decreases, since fewer relevant answers are returned, and it is lower than the average recall of ROstack when the value of U is low (U $\leqslant$ 0.6).

6. Conclusion

In this paper, we study the problem of keyword querying on fuzzy XML data. We introduce the structure query language into the keyword query over fuzzy XML to get more meaningful and comprehensive results. We introduce the concepts of “object tree”, “the minimum object tree” and “the nearest object tree”, and propose a semantics of matching object trees for keyword query. We propose the method AO-Twig: once the matching object trees of a keyword query are obtained, we find their approximate matching object trees by setting a threshold U, and input the structure query statements built from the approximate matching object trees into LTwig to obtain more query results. Users can raise the value of the threshold U to obtain more approximate matching object trees, so that more query statements can be put into LTwig to get more meaningful query results. We propose a scoring mechanism based on tf*idf and the keyword-possibilities of the result subtrees, and rank the results with the threshold algorithm to obtain the Top-K results with the highest scores. Experimental results show that our method achieves high search result quality on both synthetic and real datasets for keyword queries over fuzzy XML data, and significantly outperforms the traditional XML keyword query approaches.

Large-scale data now appears in all trades and professions, and finding the useful and accurate information in such data quickly has become very important. In the future, we will therefore devote our efforts to finding an effective method for quickly obtaining the results of keyword queries over large-scale fuzzy XML data.

Acknowledgment

The work is supported by the National Natural Science Foundation of China (grant no. 61772269).

References

1. Gaurav and Alhajj, Incorporating fuzziness in XML and mapping fuzzy relational data into fuzzy XML, in: Proceedings of the 2006 ACM Symposium on Applied Computing, 2006, pp. 456–460.

2. Nierman and Jagadish H.V., ProTDB: Probabilistic data in XML, in: Proceedings of the 28th International Conference on Very Large Data Bases, 2002, pp. 646–657.

3. Kimelfeld and Sagiv, Combining incompleteness and ranking in tree queries, in: Proceedings of the 11th International Conference on Database Theory, 2007, pp. 329–343.

4. Kimelfeld and Sagiv, Matching twigs in probabilistic XML, in: Proceedings of the 33rd International Conference on Very Large Data Bases, 2007, pp. 27–38.

5. Zhang C.J., Chang, Sha C.F., Wang X.L. and Zhou A.Y., Keywords Filtering over Probabilistic XML Data, Web Technologies and Applications 7235 (2012), 183–194.

6. Milano, Scannapieco and Catarci, Structure-aware XML object Identification, IEEE Data Engineering Bulletin 29(2) (2006), 67–74.

7. DBLP, http://dblp.uni-trier.de/xml/.

8. Hung, Getoor and Subrahmanian V.S., PXML: a probabilistic semistructured data model and algebra, in: Proceedings of 19th International Conference on Data Engineering, 2003, pp. 467–478.

9. Bourgeois and Lassale J.C., An extension of Munkres Algorithm for the Assignment problem to rectangular, Communications of the ACM 14(12) (1971), 802–804.

10. G.L., Feng J.H., Wang J.Y. and Zhou L.Z., Effective Keyword Search for Valuable LCAs over XML Documents, in: Proceedings of CIKM, 2007, pp. 31–40.

11. Liu Z.M. and Ma R.Z., Efficient processing of twig query with compound predicates in fuzzy XML, Fuzzy Sets and Systems 229 (2013), 33–53.

12. Liu Z.M. and Qv Q.L., Dynamically Querying possibilistic XML Data, Information Sciences 261 (2014), 70–88.

13. J.X., Liu C.F., Zhou and Ning, Processing XML Keyword Search by Constructing Effective Structured Queries, Advances in Data and Web Management 5446 (2009), 88–99.

14. J.X., Liu C.F., Zhou and Xu, Quasi-SLCA Based Keyword Query Processing Over Probabilistic XML Data, IEEE Transactions on Knowledge & Data Engineering 26(4) (2014), 957–969.

15. Fagin, Combining fuzzy information from multiple systems, in: Proceedings of the 15th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, 1996, pp. 216–226.

16. Fagin, Fuzzy queries in multimedia database systems, in: Proceedings of the 17th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, 1998, pp. 1–10.

17. Fagin, Optimal aggregation algorithms for middleware, in: Proceedings of the 20th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, 2001, pp. 28–37.

18. Zhou, Liu C.F., J.X. and Xu, ELCA evaluation for keyword search on probabilistic XML data, World Wide Web 16(2) (2013), 171–193.

19. Abiteboul, Segoufin and Vianu, Representing and querying XML with incomplete information, ACM Transactions on Database Systems (TODS) 31(1) (2006), 150–161.

20. Cohen, Mamou, Kanza and Sagiv, XSEarch: A semantic search engine for XML, in: Proceedings of the 29th International Conference on Very Large Data Bases, 2003, pp. 45–56.

21. Yang W.D. and Shi, Schema-aware keyword search over XML streams, in: Proceedings of 7th International Conference on Computer and Information Technology, 2007, pp. 29–34.

22. XMark, http://www.xml-benchmark.org/.

23. Kanza, Nutt and Sagiv, Querying incomplete information in semistructured data, Journal of Computer and System Sciences 64(3) (2002), 655–693.

24. Y.W., Wang G.R., Xin J.C., Zhang E.D. and Qiu Z.L., Holistically twig matching in probabilistic XML, in: Proceedings of 25th International Conference on Data Engineering, 2008, pp. 1649–1656.

25. and Papakonstantinou, Efficient keyword search for smallest LCAs in XML databases, in: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, 2005, pp. 527–538.

26. and Papakonstantinou, Efficient LCA based keyword search in XML data, in: Proceedings of 11th International Conference on Extending Database Technology: Advances in Database Technology, 2008, pp. 535–546.

27. Z.M. and Yan, Fuzzy XML data modeling with the UML and relational data models, Data & Knowledge Engineering 63(3) (2007), 972–996.

28. Z.M., Liu and Yan, Matching twigs in fuzzy XML, Information Sciences 181(1) (2011), 184–200.

29. Z.M., Zhang W.J. and Ma W.Y., Extending object-oriented databases for fuzzy information modeling, Information Systems 29(5) (2004), 421–435.

30. Liu Z.Y. and Chen, Identifying Meaningful Return Information for XML Keyword Search, in: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, 2007, pp. 329–340.

31. and Ma Z.M., Object-stack: An object-oriented approach for top-k keyword querying over fuzzy XML, Information Systems Frontiers 19(3) (2017), 669–697.

32. Z.M. and Yan, An approach of top-k keyword querying for fuzzy XML, Computing, 2018, doi: 10.1007/s00607-017-0577-2.

33. An Object-Oriented Approach of Keyword Querying over Fuzzy XML, CIT, Journal of Computing and Information Technology 24(3) (2016), 293–309.
