Abstract
With the increasing popularity of XML for data representations, there is a lot of interest in keyword query on XML. Many algorithms have been proposed for XML keyword queries. But the existing approaches fall short in their abilities to analyze the logical relationship between keywords of spatiotemporal data. To overcome this limitation, in this paper, we firstly propose the concept of query time series (QTS) according to the data revision degree. For the logical relationship of keywords in QTS, we study the intra-coupling logic relationship and the inter-coupling logic relationship separately. Then a calculation method of keyword similarity is proposed and the best parameter in the method is found through experiment. Finally, we compare this method with others. Experimental results show that our method is superior to previous approaches.
Introduction
The Extensible Markup Language (XML) has evolved to be a paradigm for data exchange over the network since its foundation in 1998. XML is perceived as an adaptable hierarchical model that is appropriate to communicate a large amount of data without a rigid structure [1]. Hence, the ability to acquire knowledge from XML documents for decision support is certainly optimistic and it has been dominant format used in the web [4]. Besides, XML’s self-describing property enables XML to represent data without losing its semantics information [4].
Keyword query on XML document has received wide attention. The query semantics and algorithms of keyword queries on XML documents have been extensively studied in the literature [3, 19]. The keyword search semantics on XML documents are mainly focus on the Lowest common ancestor (LCA) based semantics, including the LCA semantics and its variants [7, 19] to improve the search quality. In [26], Xu and Papakonstantinou propose the SLCA semantics for keyword query processing on XML documents, and therefore two algorithms are presented. Guo et al. [9] introduce an ELCA (Exclusive lowest common ancestor) semantics to the keyword queries on XML documents, and an efficient algorithm named Indexed Stack for keyword queries on XML documents with the ELCA semantics is presented in [27]. The Valuable Lowest Common Ancestor (VLCA) semantics is proposed in [15] to answer keyword queries effectively, which not only improves the accuracy of LCAs by eliminating redundant LCAs that should not contribute to the answer, but also retrieves the false negatives filtered out wrongly by SLCA. Sun et al. [21] propose a Multiway SLCA (MSLCA) semantics to support the keyword search both in AND and OR Boolean operators, and two algorithms named basic multiway-SLCA (BMS) and incremental multiway-SLCA (IMS) are presented. Literature [17] proposes a structure-based approach of keyword querying for XML data, which combines the structure query language into the keyword query over XML to get more meaningful and comprehensive results. Literature [12] uses indexes to improve the performance of the query engines for XML data.
There have been many achievements in terms of capturing the keyword relationship based on XML document. The Generalized Vector Space Model (GVSM) [25] and the Context Vector Model (CVM-VSM) [2] use the co-occurrence information between the terms in the document to capture each other mutual relations. Latent semantic indexing (LSI) proposed in [8] uses the projection feature step to estimate the similarity between documents, which captures the semantic information of the keywords in the original document. Compared with traditional bag-of-words (BOW), these models have made some improvements. However, these models only consider the co-occurrence frequency of words, ignoring the implicit relationship between words. Hristidis V [11] proves that the coupling relationship is very effective for capturing implicit relationship in machine learning and data mining tasks (such as clustering and document analysis). Cost and Salzberg [7] propose MVDM based on labels, while Wilson and Martinez [24] study heterogeneous distances for instance-based learning. The measures in their study are only designed for supervised approaches. Wang and Tao [22] have proposed a novel coupled object similarity metric called Coupled Object Similarity (COS), which involves both attribute value frequency distribution (intra-coupling) and feature dependency aggregation(inter-coupling) in measuring attribute value similarity for unsupervised learning of nominal data. Wang and Dong [23] propose a coupled attribute similarity for objects (CASO) measure based on the coupled attribute similarity for values (CASV), by considering both the intra-coupled and inter-coupled attribute value similarities (IaASV and IeASV), which globally capture the attribute value frequency distribution and feature dependency aggregation.
The contributions of this paper are the following: We propose the concept of QTS and data revision degree. QTS segmentation algorithm is proposed based on the data revision degree to analyze the coupling relationship of keywords. We propose a calculation method of keyword coupled similarity, combining intra-coupling and inter-coupling relationship.
The rest of the paper is organized as follows. Section 2 devises a spatiotemporal data model based on XML. The framework is presented in Section 3. Section 4 proposes the QTS segmentation algorithm. The keyword coupled similarity is proposed based on time series in Section 5. Experimental evaluation is given in Section 6 and Section 7 concludes the paper.
Spatiotemporal data model based on XML
According to the characteristics of spatiotemporal data, the concept of spatiotemporal data model based on XML is introduced in this section. The Model is based on literature [5].
OID is the changing history of a spatiotemporal object; ATTR is the attributes of a spatiotemporal object; P is the position of a spatiotemporal object; T is the time of a spatiotemporal object; M is the motion of a spatiotemporal object.
type is the type of change in spatiotemporal objects; pre is the predecessor in a changing process; suc is the successor in a changing process; t
change
is the time of a changing process.
ATTR = (genAt, spaAt, t
valid
) where genAt is the general attribute of a spatiotemporal object; spaAt is the spatial attribute of a spatiotemporal object; t
valid
is the valid time for spaAt of a spatiotemporal object.
ATTR = ({name, color} , {size} , t1 - t0) denotes some attributes of the cloud A within the time t1 - t0, where name and color are general attributes, size is the spatial attribute. OID = (split, {B} , {C, D} , t2 - t1) reflects the changing history of the cloud A where B is the predecessor, C and D are the successors and the change happened in time interval t2 - t1. For cloud A, the position information P is represented as [(90°E,120°W),30 m] and T records the time at position P. M records the direction and value when the cloud A changes.

Schematic diagram of space-time objects.
In order to analyze the relationship between keywords and return accurate query results, this section proposes a two-step processing approach to address this problem. The framework of keyword query and result sort of spatiotemporal data is shown in Fig. 2.

The framework of keyword query and result sort of spatiotemporal data.
The first step is to calculate the coupling relationship between different keywords during the offline time. Firstly, an XML document tree is constructed using keywords extracted from query history. Then use the correlation analysis method to obtain the coupling relationship between different keywords. Finally, select typical queries from the query history based on the coupling relationship to form a query set.
The second step is to process user’s query and return results during the online time. The spatiotemporal data in traditional databases is represented by XML document tree in advance to facilitate the processing of queries. A query sentence is divided into many keywords. Then the semantic similarity between the user query and the typical query is calculated according to the keyword coupling relationship. Finally, the first k relevant results are presented to the user.
In this section, we firstly propose the QTS to describe a series of query requests by user. Then we put forward the concept of data revision degree. Finally, a time series segmentation algorithm is proposed based on data revision degree. The work in this section provides a theoretical foundation for analyzing the coupling relationship of keywords.
Query time series (QTS)
A query process includes many query operations (named subqueries). For explaining the changes of subqueries intuitively, we propose the following definitions.
In this example, the time series is T = [t1, t2, t3]. q1, q2, and q3 have the same query object temperature, and the query range is more and more accurate (from 200 m2 to 120 m2). The keyword QTS of this query process is expressed as follows:
Data revision degree
During the query process, the spatiotemporal data will change with time, which affects the query results. In order to measure the change in spatiotemporal data, we define the revision of data update.
℘ = (1 - ρ) × φ + ρ × φ denotes the data revision degree, where φ represents the variation in the attributes, φ indicates the variation in the entity movement and ρ is a weight between 0 and 1.
The process of data revision degree calculation is shown in Fig. 3. For time interval [tn-2, t
n
], the current state of the object changes with the update of the position, and the latest state of the object is defined as the state at the end of the last time interval. Based on this theory, time changes in space points and regions t → [tn-2, t
n
] is replaced with t → t
n
. For spatial changes, we use minimum covering circle (MCC) in Fig. 3 to explain. There are two types of spatial changes: entity movement and attribute changes. For the first type, the position change is the position change of the MCC center approximately, which is measured by Euclidean distance

Data revision degree calculation.
Combining the two aspects, the data revision degree is defined as
In actual query process, the query series is updated dynamically, which measured by the degree of data revision. The algorithm is proposed to segment the updated QTS. The pseudo code of the algorithm is described as follow
For each query Q, initialize the parameters firstly (line 1–3). Then a mark w indicates the degree of data revision between Q and the previous query (line 4–8). The QTS are segmented when w exceed the revision threshold ℓ (line 12–15). Figure 4 shows the process of the segmentation algorithm. For the different revision of the query history, the QTS are re-segmentation, which affects the coupling relationship between keywords in the QTS. The coupling relationship between keywords is discussed in the next section.

The process of the segmentation algorithm.
In this section, we firstly propose the intra coupling similarity for the keywords in the same query. Then we put forward the inter coupling similarity for the keywords in different queries. Finally, combining intra coupling and inter coupling, the calculation method of keyword coupling similarity is proposed.
Intra-coupling of spatiotemporal keywords within a query
The user’s query activity is recorded in the time series, and the user’s query intent can be clarified by analyzing the similarity between keywords in the time series. We use the Jac coefficient to evaluate the co-occurrence frequency of two keywords in the same time series. The formula is described as follows
Keywords p and q are intra-related if they are Sibling relationship (SR) in query history XML document V
j
. The intra-similarity (IaS) between p and q is quantified as follow

Intra-coupling XML tree.

intra-coupling and inter-coupling.
Intra-coupling relationship can only reflect the similarity of keywords in the same query, for the similarity of keywords in different queries, we propose the concept of inter-coupling. Some keywords are co-related with each other even though they are not in the same query. As in shown in Fig. 7, keyword1 and keyword4 belong to Query1t0 and Query2t0 respectively and keyword3 belongs to both Query1t0 and Query2t0. keyword1 and keyword4 are connected keyword3. This connection is defined as inter-coupling. The strength of the connection(inter-similarity) is calculated by the following formula

Inter-coupling XML tree.
The coupling relationship between keywords needs to consider intra coupling and inter coupling comprehensively. The keyword coupling similarity is defined as follow
Experiment
Experimental setup
The experiments are carried out on a computer running Windows10 with Intel Core i5-1.9 GHz CPU, 8 GB RAM. We implement all the algorithms in (Python 3.7).
The experiments use two different datasets. The first dataset NHDB is a large database of hotel information in the United States, which includes the location, business hours, reviews, and nature of the hotel. We use these attributes to simulate spatiotemporal datasets. The 3000 records are generated as historical queries. The second dataset is DBLP, which is a data set designed in XML format. We selected attributes such as author, paper name, writing, and publisher to construct an XML document tree, and 2568 records are generated.
To present evaluations of our approach, we performed two groups of experiments. In the first group of experiments, the impact of the value of α on keyword coupled similarity is tested. In the second group of experiments, we compare KQ algorithm with V-COS algorithm and random approach in terms of accuracy rate and recall rate to evaluate the effectiveness.
Experimental results
The influence of parameters on keyword similarity
This group of experiment aims at testing the impact of the weight α on the similarity of keywords in terms of Precision (formula 1) and Recall (formula 2). We first randomly select 10 keywords from NHDB and DBLP. For each keyword, set the weight α from 0 to 1 in increments of 0.1. Under each weight, the keyword coupling algorithm gets top-10 related keywords. Then mix these keywords to form a set containing 110 keywords. Next, we calculate the frequency of each different keyword in the set.
Finally, sort the results in descending order and select top-10 as the relevant keywords.
Figure 10 shows the results under NHDB and DBLP datasets. In order to simplify the research process, we set the number of retrieved keywords and keywords marked as relevant to be 10, so that the precision rate and the recall rate are equal. For the NHDB and DBLP databases, the precision and recall are 0.87 and 0.82 when α = 0.4 and α = 0.5 respectively, which represents the highest performance of our algorithm. In addition, the max precision and recall of NHDB are higher than DBLP, indicating that we can get better performance when we treat time and space keywords as the spatiotemporal attributes of the query XML tree instead of ordinary attributes.

Recall and Precision under different weights.
This group of experiment aims to evaluate the effectiveness of the similarity algorithm of spatiotemporal data. The precision and are recall used to evaluate the effectiveness. Firstly, 10 users randomly select query Q i from the NHDB query history. For each selected query, a set K i containing 30 queries is generated from the query history. The queries in the set are generated by three algorithms: data revision degree (KQ algorithm), cosine similarity based on traditional VSM (V-COS) and random (as a benchmark experiment). Then each user marks the top 10 queries that they think are semantically relevant in K i . Finally, the matching degree between the query marked by the user and the query returned by each algorithm is calculated.
The experimental results are shown in Figs. 8 9. For algorithm in this paper, the average precision and recall are 0.798 and 0.782 respectively. For the V-COS algorithm, the precision and recall averages are 0.609 and 0.652 respectively. For random queries, the rates are 0.211 and 0.191 respectively. Our algorithm is higher than the V-COS algorithm and the benchmark algorithm in terms of precision and recall. V-COS algorithm takes the local overlapping information between queries into account, which makes its recall rate relatively high. Our algorithm can adapt to the changes of time attribute and space attribute, and divide the time series in time through the segmentation algorithm when querying.

The precision of three algorithms.

The recall of three algorithms.
In this paper, we study the problem of keyword querying on XML data. We firstly introduce the spatiotemporal data model based on XML to support our research. Then we propose a QTS segmentation algorithm based on data revision degree. For analyzing the relationship between keywords in the QTS, we propose the keyword coupled similarity. Experimental results confirm the efficiency of our approach.
Keyword query is a hot research direction currently. Considering the impact of the uncertainty of spatiotemporal data on query results, the keyword query method will be extended to the query process of uncertain data in the future.
Footnotes
Acknowledgment
The work was supported by the National Natural Science Foundation of China (61402087), the Natural Science Foundation of Hebei Province (F2019501030), the Natural Science Foundation of Liaoning Province (2019-MS-130), the Key Project of Scientific Research Funds in Colleges and Universities of Hebei Education Department (ZD2020402), the Fundamental Research Funds for the Central Universities (N2023019), and in part by the Program for 333 Talents in Hebei Province (A202001066).
