A parametric approximation algorithm for spatial group keyword queries

Abstract

With the application of big data, various queries arise for information retrieval. Spatial group keyword queries aim to find a set of spatial objects that cover the query keywords and minimize a goal function such as the total distance between the objects and the query point. This problem is widely found in database applications and is known to be NP-hard. Efficient algorithms for solving this problem can only provide approximate solutions, and most of these algorithms achieve a fixed approximation ratio (the upper bound of the ratio of an approximate goal value to the optimal goal value). Thus, to obtain a self-adjusting algorithm, we propose an approximation algorithm that achieves a parametric approximation ratio. The algorithm makes a trade-off between the approximation ratio and time consumption, enabling users to assign an arbitrary query accuracy. Additionally, it runs in an on-the-fly manner, making it scalable to large-scale applications. The efficiency and scalability of the algorithm were further validated using benchmark datasets.

Keywords

Information retrieval, spatial group keyword queries, parametric approximation, database

1. Introduction

With the rapid development of the mobile Internet, artificial intelligence, and other such technologies, the number of spatial objects with textual descriptions will increase enormously. Such objects are named spatial keyword objects, e.g., a restaurant with keywords “Wi-Fi”, “Chinese food” and “afternoon tea” or a shopping mall with keywords “free parking lot” and “big sale”. Spatial keyword objects are frequently used in spatial data analysis [1, 2] to retrieve useful information. Furthermore, these objects can be processed and queried for particular occasions.

In combination with business scenario requirements, applications in social sensing, intelligent logistics, and other fields have given rise to various queries. The nearest neighbor query [3] is a classical query for a spatial database that returns the single object nearest to the query point. When textual descriptions are also considered, we obtain the top-$k$ spatial keyword query [4], where the returned top $k$ objects are ordered by a goal function that considers distance and keywords simultaneously. The diameter-aware extreme group query [5] was proposed to find a set of objects, rather than a single object, that meets a bound. This query requires the objects in the set to be as close as possible to each other. Combining the characteristics of the above queries, the spatial group keyword query (SGK query) [6] returns a set of objects that together cover the query keywords and minimize a goal function. Much subsequent work has proposed methods that answer these queries more efficiently.

It was reported in [7] that, along with storms and flooding, Hurricane Maria, which was the strongest hurricane to ever hit Puerto Rico, left the island severely damaged and caused thousands of fatalities. Relief operations following this hurricane required the use of medical supplies and lifesaving equipment obtained from aid agencies around the world. A top-$k$ spatial keyword query can return the top $k$ aid agencies ordered by the distance between the aid agency and the location of the catastrophe. Usually, a single aid agency cannot cover the full requisite demands. This suggests using the SGK query to return a set of aid agencies that together satisfy the full requisite demands while minimizing the total distance between the aid agencies and the location of the catastrophe. The SGK query with different goal functions or query inputs was continuously studied in [8]. Further, it was extended to the top-$k$ SGK query [9], and a unified goal function for the SGK query was systematically proposed in [10]. An extended SQL (structured query language) using relational algebra was introduced in [11], enabling the SGK query to be expressed directly in SQL.

The NP-hardness of the SGK query problem has been proved in [6], where it was obtained by a reduction from the weighted set cover (WSC) problem. Therefore, exact algorithms such as Sum-Exact [12] are inefficient and not scalable to massive real data. For efficiency, the Sum-Appro approximation algorithm was also presented in [12] with an approximation ratio of $\sum_{i=1}^{n}1/i$ , where $n$ is the number of query keywords. However, the approximation ratios of the previous approximation algorithms are fixed given an initial $n$ . To obtain approximation algorithms with higher query accuracy, the approximation ratios should be as close as possible to 1.

A novel approach reported in [13] studied the trade-offs between the approximation ratio and time consumption, which is suitable for problems that are solvable in polynomial time but have poor approximation ratios. This approach led us to a parametric approximation algorithm for the WSC problem in exponential time, given in [14]. This algorithm has redundant loops related to the size of the data, resulting in a high runtime. Several other parametric approximation algorithms for the WSC problem were designed in [15]. However, all of these algorithms focus on the unweighted set cover problem and cannot be extended directly to the query problem.

Based on these previous works, we optimize the methods from [14, 13] to present a parametric approximation algorithm for the SGK query that we call Para-Appro. This algorithm manages to balance three crucial factors for the query: a) finding the set of objects with the minimal goal value to satisfy the query keywords, b) determining the parameter for scalability to a large number of instances, and c) processing the data efficiently.

The proposed Para-Appro algorithm attempts to meet the approximation ratio of $1+\ln\varepsilon$ for a given parameter $\varepsilon\geqslant 1$ . When a user assigns a query accuracy $\alpha$ (i.e., the expected approximation ratio), Para-Appro runs in an on-the-fly manner and algorithmically determines a practical $\varepsilon$ to meet the query accuracy $\alpha$ as much as possible. The aforementioned algorithms are compared in Table 1, where $m$ is the number of relevant objects and $A=\lfloor n-n/\varepsilon\rfloor+2^{n/\varepsilon}$ . Since $\varepsilon$ can be as large as $n$ , the polynomial terms stay with the exponential term.

Table 1
Algorithm comparison

Algorithms	Complexity	Approximation ratio
Sum-Exact	$\mathcal{O}(2^{n}m\log m)$	N.A.
Sum-Appro	$\mathcal{O}(nm\log m)$	$\sum_{i=1}^{n}1/i$
Para-Appro	$\mathcal{O}(Am\log m)$	$1+\ln\varepsilon$
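
To make the trade-off in Table 1 concrete, the following minimal sketch evaluates the factor $A$ and the two approximation ratios for a sample number of keywords. Only the formulas come from the text; the function names and the sample values of $n$ and $\varepsilon$ are illustrative assumptions.

```python
import math


def harmonic(n: int) -> float:
    """Return H_n = sum_{i=1}^{n} 1/i, the approximation ratio of Sum-Appro."""
    return sum(1.0 / i for i in range(1, n + 1))


def para_appro_factor(n: int, eps: float) -> float:
    """Return A = floor(n - n/eps) + 2^(n/eps) from the complexity bound."""
    return math.floor(n - n / eps) + 2.0 ** (n / eps)


def para_appro_ratio(eps: float) -> float:
    """Return the parametric approximation ratio 1 + ln(eps)."""
    return 1.0 + math.log(eps)


if __name__ == "__main__":
    n = 20  # illustrative number of query keywords
    print(f"Sum-Appro ratio H_{n} = {harmonic(n):.3f}")
    for eps in (1.0, 2.0, 4.0, float(n)):
        print(f"eps={eps:5.1f}  A={para_appro_factor(n, eps):12.1f}  "
              f"ratio={para_appro_ratio(eps):.3f}")
```

With $\varepsilon=1$ the factor reduces to $2^{n}$ (the exact search), while $\varepsilon=n$ leaves only a linear greedy term plus a constant.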

Finally, the contributions of this paper are summarized as follows.

• A parametric approximation algorithm for the SGK query is proposed that is scalable to large instances by setting the parameter.

• The notion of the RIR-tree with the necessary operations is introduced for efficient storage and retrieval of spatial keyword objects.

• Experiments on real datasets indicate that, compared to the existing approximation algorithms, our algorithm improves the accuracy of the SGK query.

1.1 Organization

Section 2 formalizes the SGK query problem. Section 3 introduces the RIR-tree with the necessary operations. Section 4 describes the parametric approximation algorithm, which is evaluated experimentally in Section 5. Section 6 reviews the related work, and Section 7 concludes the paper.

2. Preliminaries

This section formalizes the spatial group keyword query, which uses a group of objects instead of a single one to cover the query keywords and minimize the goal function. Let $D$ be a database of objects $\{o_{1},o_{2},\ldots,o_{m}\}$ . Each $o$ in $D$ is a 3-tuple $(o.id,o.l,o.\psi)$ , where $o.id$ is the unique identifier of object $o$ , $o.l$ is the location in planar coordinates (typically a region bounded by a rectangle given by its lower left and upper right corner coordinates), and $o.\psi=\{t_{1},t_{2},\ldots,t_{v}\}$ is the associated keyword set. A query point $q$ has the form $(q.l,q.\psi)$ , where $q.l$ is the query location and $q.\psi$ is a set of query keywords. For an arbitrary object $o$ and query point $q$ , $\textit{Dist}(o,q)$ is the Euclidean distance between $q.l$ and the nearest vertex or edge of the rectangle of $o$ . Specifically, we define the cost of object $o$ , $\textit{Cost}(o,q)$ , to be $\textit{Dist}(o,q)$ . Let $S$ be a subset of $D$ whose objects together cover the query keywords. Analogous to [12], the cost of subset $S$ is measured by the goal function $\textit{Cost}(S,q)$ , the aggregate distance defined as

$\displaystyle\textit{Cost}(S,q)=\sum_{o\in S}\textit{Dist}(o,q).$ (1)
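
As a small illustration of $\textit{Dist}$ and of Eq. (1), the sketch below computes the distance from a query location to an axis-aligned rectangle and sums it over a set of objects. The tuple layout of a rectangle and the convention that a query inside a rectangle has distance 0 are assumptions made here, not details given in the text.

```python
import math
from typing import Iterable, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def dist(rect: Rect, q: Point) -> float:
    """Euclidean distance from q to the nearest vertex or edge of rect.

    Returns 0.0 when q lies inside the rectangle (an assumption).
    """
    x_min, y_min, x_max, y_max = rect
    dx = max(x_min - q[0], 0.0, q[0] - x_max)
    dy = max(y_min - q[1], 0.0, q[1] - y_max)
    return math.hypot(dx, dy)


def cost(rects: Iterable[Rect], q: Point) -> float:
    """Aggregate distance of Eq. (1): the sum of Dist(o, q) over a set S."""
    return sum(dist(r, q) for r in rects)
```

A point object is the degenerate rectangle whose two corners coincide, so the same function covers both cases.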

Now we are ready to give a formal definition of spatial group keyword query.

Definition 1.

Given a database $D$ and a query point $q$ , a spatial group keyword query, abbreviated as SGK query, returns a subset $S$ of $D$ such that $q.\psi\subseteq\bigcup_{o\in S}o.\psi$ and $\textit{Cost}(S,q)$ is minimized.

We use an example to illustrate the SGK query.

Example 1.

This example illustrates a simple situation in which the objects are points rather than regions. In Fig. 1, we use the eight objects listed in Table 2 and a query point $q$ , whose coordinates are (7, 7) and whose keyword set is $\{t_{1},t_{2},\cdots,t_{7}\}$ , to demonstrate this query. The rectangles are explained below. The SGK query aims at finding a subset $S$ of these eight objects that satisfies the query keyword set while minimizing $\textit{Cost}(S,q)$ .

Figure 1.

Objects and their bounding rectangles.

Table 2

Object information

Object	$o_{1}$	$o_{2}$	$o_{3}$	$o_{4}$	$o_{5}$	$o_{6}$	$o_{7}$	$o_{8}$
Coordinates	$(3,8)$	$(10,3)$	$(12,9)$	$(9,13)$	$(2,12)$	$(0,10)$	$(1,0)$	$(14,1)$
Keywords	$\{t_{1}\}$	$\{t_{1},t_{4}\}$	$\{t_{2},t_{4}\}$	$\{t_{3},t_{4},t_{10}\}$	$\{t_{1},t_{9}\}$	$\{t_{5},t_{8}\}$	$\{t_{5},t_{7}\}$	$\{t_{6},t_{7},t_{8}\}$

Let DTIME $(\tau)$ be the class of problems that are solvable by a deterministic Turing machine in time $\tau$ . Since the SGK query problem can be reduced from the weighted set cover problem, we can utilize the inapproximability result of the set cover problem from [13] and state:

Theorem 1.

The SGK query problem cannot be approximated within a factor of $1+\delta$ with $\delta>0$ in polynomial time, unless NP is contained in DTIME $(m^{\textit{poly}\log m})$ .

This theorem gives a strict lower bound on the approximability of the SGK query problem: in theory, no polynomial-time algorithm can approximate it within an arbitrarily small factor unless the stated complexity assumption fails.

3. Index structure: The RIR-tree

3.1 Properties of the RIR-tree

The authors in [16] integrate the R-tree and textual descriptions into the IR-tree, where each keyword in the textual descriptions has a different frequency of occurrence. We modify their index structure using a map, named the reduced inverted-file (RI), that stores only the distinct keywords of each node and thereby saves memory. The RI of a node contains two elements, key and value. The key is a distinct keyword $t$ that appears in the textual descriptions of its entries. The value is a list of the child nodes or entries whose textual descriptions contain $t$ . Our index structure is called the reduced inverted-file R-tree (RIR-tree).

In an RIR-tree, a leaf node consists of a few entries $E$ with the structure $(e.id,e.\textit{rectangle},e.ri)$ , where

• $e.id$ refers to an object $o$ in database $D$ ,
• $e.\textit{rectangle}$ is the bounding rectangle of $o$ with lower left corner and upper right corner coordinates,
• $e.ri$ is the identifier of the RI that stores the textual descriptions of $o$ .

A non-leaf node consists of a few entries $N=(cn.id,cn.\textit{children},cn.\textit{father},cn.\textit{rectangle},cn.ri)$ , where

• $cn.id$ is the identifier of the node,
• $cn.\textit{children}$ contains the children of $cn$ , and $|cn.\textit{children}|$ is the number of children,
• $cn.\textit{father}$ is the father node of $cn$ ,
• $cn.\textit{rectangle}$ is the minimum bounding rectangle (MBR) of all the rectangles in the children of $cn$ ,
• $cn.ri$ is the identifier of the RI that stores the union of the textual descriptions of the children of $cn$ .
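
The RI described above can be pictured as a map from each distinct keyword to the children that contain it. The following sketch builds such a map for a leaf node and for its parent; the dictionary representation, the function name `build_ri`, and the node names `Na` and `Nb` are hypothetical choices made for illustration, while the keyword sets of `E1` and `E3` follow Table 2.

```python
from typing import Dict, List, Set


def build_ri(children: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Build a reduced inverted-file: keyword -> list of children containing it.

    `children` maps a child identifier to its set of distinct keywords
    (for a non-leaf child, the keys of that child's own RI).
    """
    ri: Dict[str, List[str]] = {}
    for child_id in sorted(children):
        for keyword in children[child_id]:
            ri.setdefault(keyword, []).append(child_id)
    return ri


# Leaf level: keyword sets of the objects behind entries E1 and E3.
ri_leaf = build_ri({"E1": {"t1"}, "E3": {"t2", "t4"}})

# Parent level: each child contributes only its distinct keywords.
ri_parent = build_ri({
    "Na": set(ri_leaf),                    # hypothetical node holding E1, E3
    "Nb": {"t1", "t5", "t7", "t8", "t9"},  # hypothetical sibling node
})
```

Because a parent stores each keyword once per child rather than once per object, the map stays small even when many objects below a child share a keyword.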

As in an R-tree, two important properties are always preserved in an RIR-tree. First, let $M$ be the maximum number of entries that will fit in a single node. Each node contains between $\lceil M/2\rceil$ (this lower bound can be adjusted for better performance) and $M$ entries unless it is the root node. The root node has at least two children unless it is a leaf node. Second, all leaf nodes appear on the same level so that an RIR-tree is a balanced tree.

3.2 Data insertion

We now describe the Insertion algorithm used for inserting data into an RIR-tree, where the constant MAXE stands for the initial maximum rectangle enlargement caused by adding any entry to a node. For an entry $E$ and an RIR-tree $\mathcal{T}$ , ChooseLeaf chooses the leaf node for $E$ that requires the minimum rectangle enlargement after $E$ is added. SplitNode uses two standard operations from the R-tree, Split and Adjust. The former splits a node if its number of entries plus one exceeds the maximum number of entries MAXR, and the latter adjusts the tree upwards after a node is split. UpdateRI updates the RIs of all relevant nodes to store distinct keywords.

Algorithm: Insertion ( $E$ , $\mathcal{T}$ )
Input: an entry $E$ referring to one data object and an RIR-tree $\mathcal{T}$ into which $E$ is to be inserted
Output: the RIR-tree $\mathcal{T}$ after insertion
create node structures $R$ , $L$ , $L_{1}$ ;
$R\leftarrow\mathcal{T}.\textit{root}$ ;
$s\leftarrow 0$ , $\textit{min}\leftarrow\textit{MAXE}$ ;
while $R$ is non-leaf do   (ChooseLeaf)
    for each child node $CR$ of $R$ do
        calculate the enlargement $s$ of $CR.\textit{rectangle}$ after $E$ is added;
        if $s < \textit{min}$ then
            $\textit{min}\leftarrow s$ ;
            $L\leftarrow CR$ ;
    $R\leftarrow L$ ;
if $|L.\textit{children}|+1>\textit{MAXR}$ then   (SplitNode)
    $\{L,L_{1}\}\leftarrow L.\textit{Split}()$ ;
    $\textit{Adjust}(L)$ , $\textit{Adjust}(L_{1})$ ;
add $E$ into $L.\textit{children}$ ;
while $L\neq\mathcal{T}.\textit{root}$ do   (UpdateRI)
    for each $\textit{key}\in E.ri$ do
        if $\textit{key}\notin L.ri$ then
            add key to $L.ri$ ;
        add distinct nodes to the value of key;
    $L\leftarrow L.\textit{father}$ ;
return $\mathcal{T}$ .
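
The ChooseLeaf step relies on the rectangle enlargement caused by a new entry. A minimal sketch of that computation for one tree level is given below; the rectangle layout, the function names, and the tie-breaking rule (smaller area first) are assumptions made here rather than details stated in the text.

```python
from typing import Dict, Tuple

Rect = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def area(r: Rect) -> float:
    """Area of an axis-aligned rectangle."""
    return (r[2] - r[0]) * (r[3] - r[1])


def union(a: Rect, b: Rect) -> Rect:
    """Minimum bounding rectangle of two rectangles."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def enlargement(node_rect: Rect, entry_rect: Rect) -> float:
    """Area by which node_rect must grow in order to enclose entry_rect."""
    return area(union(node_rect, entry_rect)) - area(node_rect)


def choose_child(children: Dict[str, Rect], entry_rect: Rect) -> str:
    """Return the id of the child whose rectangle needs the least enlargement.

    Ties are broken by the smaller current area (an assumption).
    """
    return min(
        children,
        key=lambda cid: (enlargement(children[cid], entry_rect),
                         area(children[cid])),
    )
```

Applying `choose_child` repeatedly from the root down to a leaf mirrors the loop labelled ChooseLeaf.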

We review the process of splitting a node $L$ in [17] for a better understanding of the algorithm presented in this paper. a) We calculate the wasteful area $w$ of each pair of entries $E_{i},E_{j}$ of $L$ if they were placed in the same new node, i.e., $w=\textit{mbr}(E_{i},E_{j})-\textit{area}(E_{i})-\textit{area}(E_{j})$ , where $\textit{mbr}(\cdot)$ calculates the area of the MBR containing $E_{i},E_{j}$ and $\textit{area}(\cdot)$ calculates the area of the rectangle of each entry. b) We pick the two entries with the largest $w$ to be the seeds of two new nodes $N_{1},N_{2}$ . c) We pick as the next entry $E_{k}$ to be inserted the one that causes the largest difference $d$ in area enlargement, i.e., $d=|\textit{mbr}(N_{1},E_{k})-\textit{mbr}(N_{2},E_{k})|$ , when added to $N_{1}$ and $N_{2}$ respectively. d) We add $E_{k}$ to the node with the smaller enlargement and repeat from c).
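
Steps a) and b) of the split can be sketched as a search over all pairs of entries. The code below is an illustrative reading of those two steps only; the rectangle layout and the function names are assumptions, and steps c) and d) are not covered.

```python
from itertools import combinations
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def area(r: Rect) -> float:
    """Area of an axis-aligned rectangle."""
    return (r[2] - r[0]) * (r[3] - r[1])


def mbr_area(a: Rect, b: Rect) -> float:
    """Area of the minimum bounding rectangle containing a and b."""
    return area((min(a[0], b[0]), min(a[1], b[1]),
                 max(a[2], b[2]), max(a[3], b[3])))


def wasteful_area(a: Rect, b: Rect) -> float:
    """w = mbr(a, b) - area(a) - area(b), as in step a)."""
    return mbr_area(a, b) - area(a) - area(b)


def pick_seeds(rects: List[Rect]) -> Tuple[int, int]:
    """Return the indices of the pair with the largest wasteful area (step b)."""
    return max(combinations(range(len(rects)), 2),
               key=lambda pair: wasteful_area(rects[pair[0]], rects[pair[1]]))
```

For point objects, as in the running example, both areas are zero and $w$ reduces to the area of the MBR of the two points.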

Example 2.

Let $M=3$ . Figure 2 shows the process of building an RIR-tree of 8 objects described in Table 2, where entry $E_{i}$ refers to object $o_{i}$ . SplitNode is triggered when adding the fourth and sixth entries $E_{4}$ and $E_{6}$ , as shown in Fig. 2a and c, which results in Fig. 2b and d respectively. Figure 2e shows the whole index structure.

The MBR for each node is illustrated in Fig. 1. Table 3 lists the updated RIs of the different nodes in Fig. 2e. The optimal result of Example 1 returned by the SGK query should be $o_{1},o_{3},o_{4},o_{6}$ and $o_{8}$ with the minimal cost 32.67.
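
The stated optimum can be checked independently of the index by enumerating all subsets of the eight objects in Table 2. The sketch below is a plain brute-force search and is not the Sum-Exact algorithm of [12]; the dictionary layout and the function names are illustrative assumptions.

```python
import math
from itertools import combinations

Q_LOC = (7.0, 7.0)
Q_KEYWORDS = {f"t{i}" for i in range(1, 8)}

# Object id -> (coordinates, keyword set), copied from Table 2.
OBJECTS = {
    "o1": ((3, 8), {"t1"}),
    "o2": ((10, 3), {"t1", "t4"}),
    "o3": ((12, 9), {"t2", "t4"}),
    "o4": ((9, 13), {"t3", "t4", "t10"}),
    "o5": ((2, 12), {"t1", "t9"}),
    "o6": ((0, 10), {"t5", "t8"}),
    "o7": ((1, 0), {"t5", "t7"}),
    "o8": ((14, 1), {"t6", "t7", "t8"}),
}


def cost(ids):
    """Aggregate distance of Eq. (1) for point objects."""
    return sum(math.dist(OBJECTS[i][0], Q_LOC) for i in ids)


def covers(ids):
    """True if the keyword sets of the chosen objects contain all query keywords."""
    covered = set().union(*(OBJECTS[i][1] for i in ids))
    return Q_KEYWORDS <= covered


def brute_force():
    """Return the cheapest covering subset by exhaustive enumeration."""
    best = None
    for size in range(1, len(OBJECTS) + 1):
        for ids in combinations(sorted(OBJECTS), size):
            if covers(ids) and (best is None or cost(ids) < cost(best)):
                best = ids
    return best


if __name__ == "__main__":
    best = brute_force()
    print(best, round(cost(best), 2))
```

The enumeration visits $2^{8}-1$ subsets, which is feasible only because the example is tiny.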

Table 3
Reduced inverted-files

Keyword	RI ${}_{1}$	RI ${}_{2}$	RI ${}_{3}$	RI ${}_{4}$
$t_{1}$	$\langle N_{2},N_{3},N_{4}\rangle$	$\langle E_{1}\rangle$	$\langle E_{2}\rangle$	$\langle E_{5}\rangle$
$t_{2}$	$\langle N_{2}\rangle$	$\langle E_{3}\rangle$	-	-
$t_{3}$	$\langle N_{3}\rangle$	-	$\langle E_{4}\rangle$	-
$t_{4}$	$\langle N_{2},N_{3}\rangle$	$\langle E_{3}\rangle$	$\langle E_{2},E_{4}\rangle$	-
$t_{5}$	$\langle N_{4}\rangle$	-	-	$\langle E_{6},E_{7}\rangle$
$t_{6}$	$\langle N_{3}\rangle$	-	$\langle E_{8}\rangle$	-
$t_{7}$	$\langle N_{3},N_{4}\rangle$	-	$\langle E_{8}\rangle$	$\langle E_{7}\rangle$
$t_{8}$	$\langle N_{3},N_{4}\rangle$	-	$\langle E_{8}\rangle$	$\langle E_{6}\rangle$
$t_{9}$	$\langle N_{4}\rangle$	-	-	$\langle E_{5}\rangle$
$t_{10}$	$\langle N_{3}\rangle$	-	$\langle E_{4}\rangle$	-

Figure 2.

RIR-tree index structure.

4. Parametric approximation algorithm

4.1 Algorithm framework

A parametric approximation algorithm for the SGK query is presented in this section. We begin with the greedy method, iteratively adding to the result the object that contains the most query keywords and is closest to the query point. When the number of unsatisfied query keywords decreases to the threshold value, we invoke Sum-Exact (whose input and output are stated below) as a subprocess, with its input adjusted from an RIR-tree to a min-priority queue. We omit the body of Sum-Exact for conciseness; interested readers can refer to [12]. Finally, we return the desired result set for a given query accuracy.

Algorithm: Sum-Exact ( $q$ , $U$ ) [12]
Input: a query point $q$ in form $(q.l,q.\psi)$ , a queue $U$ maintaining the nodes from an RIR-tree $\mathcal{T}$ with the objects in database $D$ inserted
Output: a subset $S$ of database $D$ such that $q.\psi\subseteq\bigcup_{o\in S}o.\psi$ and $\textit{Cost}(S,q)$ is minimized

For a query with query point $q$ , we introduce a relative cost $\textit{RCost}(\cdot,q)$ to numerically measure an object or a node that contains the most query keywords and is closest to the query point. The relative cost of an object $o$ is given by

$\displaystyle\textit{RCost}(o,q)=\frac{\textit{Dist}(o,q)}{|o.\psi\cap q.\psi|},$ (2)

when $o.\psi\cap q.\psi$ is not empty. The relative cost of a node $N$ is given by

$\displaystyle\textit{RCost}(N,q)=\frac{\textit{minDist}(N,q)}{|N.\psi\cap q.\psi|},$ (3)

where $\textit{minDist}(N,q)$ is the minimum distance between $q$ and the objects in $N$ , and $N.\psi$ denotes the union of the keywords of all objects in $N$ .
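
For point objects, the relative cost of Eq. (2) is a one-line computation. The sketch below evaluates it against the keywords that are still uncovered; returning infinity when an object covers none of them is a convention chosen here, since Eq. (2) is undefined in that case, and the function name is hypothetical.

```python
import math


def rcost(location, keywords, q_location, remaining):
    """Relative cost: distance to the query divided by the covered keyword count.

    `remaining` is the set of query keywords that still need to be covered.
    Returns math.inf if the object covers none of them (a convention).
    """
    hit = len(set(keywords) & set(remaining))
    if hit == 0:
        return math.inf
    return math.dist(location, q_location) / hit
```

With the data of Table 2 and the query at (7, 7), object $o_{2}$ covers two of the seven query keywords at distance 5, giving a relative cost of 2.5.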

Previous work has calculated the distance between a query point and the nearest vertex or edge of the rectangle to compute the distance $\textit{minDist}(N,q)$ ; however, this is inaccurate since the nearest rectangle does not necessarily contain the nearest object. Therefore, the algorithm described in [18] must be used to accurately return $\textit{minDist}(N,q)$ . This process generates a list to store all nodes in a single rectangle ordered by $\textit{minRecDist}(N,q)$ , where $\textit{minRecDist}(N,q)$ is the minimal distance between $q$ and the rectangle of $N$ . Analogously, $\textit{maxRecDist}(N,q)$ is defined as the maximal distance between $q$ and the rectangle of $N$ . Then, we iteratively prune the list with $\textit{maxRecDist}(N,q)$ to obtain $\textit{minDist}(N,q)$ .
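
The algorithm of [18] is not reproduced here. The sketch below only illustrates the two rectangle distances and one plausible form of the pruning step: a rectangle whose $\textit{minRecDist}$ exceeds the smallest $\textit{maxRecDist}$ among the candidates cannot contain the nearest object, under the assumption made here that every rectangle holds at least one object. All names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def min_rec_dist(rect: Rect, q: Point) -> float:
    """Smallest distance between q and any point of the rectangle."""
    dx = max(rect[0] - q[0], 0.0, q[0] - rect[2])
    dy = max(rect[1] - q[1], 0.0, q[1] - rect[3])
    return math.hypot(dx, dy)


def max_rec_dist(rect: Rect, q: Point) -> float:
    """Largest distance between q and any point of the rectangle (a corner)."""
    dx = max(abs(q[0] - rect[0]), abs(q[0] - rect[2]))
    dy = max(abs(q[1] - rect[1]), abs(q[1] - rect[3]))
    return math.hypot(dx, dy)


def prune_candidates(rects: Dict[str, Rect], q: Point) -> List[str]:
    """Return the ids, ordered by minRecDist, that may hold the nearest object."""
    bound = min(max_rec_dist(r, q) for r in rects.values())
    ordered = sorted(rects, key=lambda k: min_rec_dist(rects[k], q))
    return [k for k in ordered if min_rec_dist(rects[k], q) <= bound]
```

The surviving candidates would then be expanded further until only objects remain, at which point the smallest object distance is $\textit{minDist}(N,q)$ .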

The algorithm adopts the best-first strategy to traverse the RIR-tree and a min-priority queue to store the intermediate results, where the best-first strategy finds an object with the minimal relative cost. The keyvalue of each element in the queue is its relative cost, which must be adjusted each time we pick an object greedily. However, when Sum-Exact is invoked, the keyvalue of an object in the queue is $\textit{Dist}(o,q)$ and the keyvalue of a node in the queue is $\textit{minDist}(N,q)$ . Therefore, we need to re-insert the entries into the queue with these keyvalues prior to the invocation.
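
The keyvalue adjustment after a greedy pick amounts to rescaling each relative cost by the ratio of keyword counts before and after the pick. A minimal sketch using Python's `heapq` is shown below; the tuple layout of a queue entry and the function name `rekey` are assumptions made for illustration.

```python
import heapq


def rekey(queue, preceding, current):
    """Rescale keyvalues after a greedy pick and rebuild the min-heap.

    `queue` is a list of (keyvalue, entry_id, keywords) tuples, where
    keyvalue is the relative cost with respect to `preceding`. Entries
    that cover no keyword of `current` are dropped.
    """
    updated = []
    for keyvalue, entry_id, keywords in queue:
        now = len(keywords & current)
        if now == 0:
            continue  # the entry is no longer useful for the query
        before = len(keywords & preceding)
        updated.append((keyvalue * before / now, entry_id, keywords))
    heapq.heapify(updated)
    return updated
```

For instance, once $t_{1}$ and $t_{4}$ are covered, an entry with keywords $\{t_{2},t_{4}\}$ and keyvalue $\sqrt{29}/2$ is rescaled to $\sqrt{29}$ , since it now contributes only one keyword.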

We set $n=|q.\psi|$ to be the number of query keywords and $m$ to be the number of relevant objects. Set currentSet stores the keyword set of the current partial query, and set precedingSet stores the keyword set of the preceding partial query. For arbitrary $\varepsilon\geqslant 1$ , Para-Appro finds a result for the SGK query with an approximation ratio of $1+\ln\varepsilon$ .

Algorithm: Para-Appro ( $\varepsilon$ , $q$ , $\mathcal{T}$ )
Input: an arbitrary number $\varepsilon\geqslant 1$ , a query point $q$ in form $(q.l,q.\psi)$ , an RIR-tree $\mathcal{T}$ containing objects in database $D$
Output: a subset $S$ of database $D$ such that $q.\psi\subseteq\bigcup_{o\in S}o.\psi$ and the aggregate distance $\textit{Cost}(S,q)$ from the real optimum satisfies Theorem 2
1: $\textit{dist}\leftarrow 0$ ;
2: $S_{A}\leftarrow\emptyset$ , $S_{E}\leftarrow\emptyset$ , $S\leftarrow\emptyset$ ;
3: $\textit{currentSet}\leftarrow q.\psi$ , $\textit{precedingSet}\leftarrow\emptyset$ ;
4: create min-priority queue $U$ ;
5: $U$ .Enqueue ( $\mathcal{T}.\textit{root}$ , 0);
6: while $U\neq\emptyset$ do
7:     $p\leftarrow U.\textit{Dequeue}()$ ;
8:     if $p$ is an object then
9:         if $|\textit{currentSet}|>n/\varepsilon$ then
10:             $S_{A}\leftarrow S_{A}\cup\{p\}$ ;
11:             $\textit{precedingSet}\leftarrow\textit{currentSet}$ ;
12:             $\textit{currentSet}\leftarrow\textit{currentSet}\setminus p.\psi$ ;
13:             for each entry $u\in U$ do
14:                 if $u.\psi\cap\textit{currentSet}\neq\emptyset$ then $u.\textit{keyvalue}\leftarrow\frac{u.\textit{keyvalue}\cdot|u.\psi\cap\textit{precedingSet}|}{|u.\psi\cap\textit{currentSet}|}$ ;
15:                 else $U\leftarrow U\setminus\{u\}$ ;
16:             re-insert entries into the priority queue $U$ using new keyvalue;
17:         else
18:             $U.\textit{Enqueue}(p,p.\textit{keyvalue})$ ;
19:             for each entry $u^{\prime}\in U$ do
20:                 if $u^{\prime}$ is non-leaf then $u^{\prime}.\textit{keyvalue}\leftarrow\textit{minDist}(u^{\prime},q)$ ;
21:                 else $u^{\prime}.\textit{keyvalue}\leftarrow\textit{Dist}(u^{\prime},q)$ ;
22:             re-insert entries into the priority queue $U$ using new keyvalue;
23:             $q^{\prime}.\psi\leftarrow\textit{currentSet}$ , $q^{\prime}.l\leftarrow q.l$ ;
24:             $S_{E}\leftarrow$ Sum-Exact $(q^{\prime},U)$ ;
25:             break;
26:     else
27:         for each child node $cp$ of $p$ do
28:             if $\textit{currentSet}\cap cp.\psi\neq\emptyset$ then
29:                 if $p$ is non-leaf then $\textit{dist}\leftarrow\textit{minDist}(cp,q)$ ;
30:                 else $\textit{dist}\leftarrow\textit{Dist}(cp,q)$ ;
31:                 $U.\textit{Enqueue}(cp,\frac{\textit{dist}}{|cp.\psi\cap\textit{currentSet}|})$ ;
32: return $S\leftarrow S_{A}\cup S_{E}$ .
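
The two phases of Para-Appro can be illustrated without the index. The sketch below is a simplified, index-free reading of the idea: it scans a flat dictionary of point objects instead of traversing an RIR-tree with a priority queue, and it replaces Sum-Exact by a brute-force search. The function names, the data layout, and the tie-breaking by object id are assumptions made here; the object data follow Table 2.

```python
import math
from itertools import combinations

Q_LOC = (7.0, 7.0)
Q_KEYWORDS = {f"t{i}" for i in range(1, 8)}
OBJECTS = {  # id -> (coordinates, keyword set), from Table 2
    "o1": ((3, 8), {"t1"}), "o2": ((10, 3), {"t1", "t4"}),
    "o3": ((12, 9), {"t2", "t4"}), "o4": ((9, 13), {"t3", "t4", "t10"}),
    "o5": ((2, 12), {"t1", "t9"}), "o6": ((0, 10), {"t5", "t8"}),
    "o7": ((1, 0), {"t5", "t7"}), "o8": ((14, 1), {"t6", "t7", "t8"}),
}


def total_cost(objects, q_loc, ids):
    """Aggregate distance of Eq. (1) for point objects."""
    return sum(math.dist(objects[o][0], q_loc) for o in ids)


def exact_cover(objects, q_loc, keywords):
    """Brute-force stand-in for Sum-Exact: cheapest subset covering `keywords`."""
    if not keywords:
        return []
    ids = sorted(o for o in objects if objects[o][1] & keywords)
    best, best_cost = None, math.inf
    for size in range(1, len(ids) + 1):
        for subset in combinations(ids, size):
            covered = set().union(*(objects[o][1] for o in subset))
            c = total_cost(objects, q_loc, subset)
            if keywords <= covered and c < best_cost:
                best, best_cost = list(subset), c
    return best


def para_appro(objects, q_loc, q_keywords, eps):
    """Greedy picks until at most n/eps keywords remain, then an exact search."""
    n = len(q_keywords)
    current = set(q_keywords)
    greedy = []
    while len(current) > n / eps:
        # Relative cost with respect to the keywords that are still uncovered.
        candidates = [
            (math.dist(loc, q_loc) / len(kw & current), oid)
            for oid, (loc, kw) in objects.items()
            if oid not in greedy and kw & current
        ]
        if not candidates:
            raise ValueError("query keywords cannot be covered")
        _, pick = min(candidates)
        greedy.append(pick)
        current -= objects[pick][1]
    rest = exact_cover(objects, q_loc, current)
    if rest is None:
        raise ValueError("query keywords cannot be covered")
    return greedy + rest
```

With `eps=1.0` the greedy loop is skipped and the sketch degenerates to the exact search; larger values shift more keywords to the greedy phase. Because ties between equal relative costs are broken arbitrarily here, intermediate picks need not match a trace obtained with a different tie-breaking rule.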

4.2 Approximation ratio and complexity

Let $t_{1},\ldots,t_{n}$ be the keyword sequence selected by Para-Appro, and $\textit{Price}(t)$ the relative cost of the object that covers the keyword $t$ for the first time. The following lemma gives a bound for greedily picked keywords.

Lemma 1.

For each $k\in\{1,\ldots,n\}$ , $\textit{Price}(t_{k})\leqslant\mathcal{R}/(n-k+1)$ , where $\mathcal{R}$ is the cost of the optimal result of a query.

Proof.

In each iteration, the leftover objects of the optimal result, denoted as $C$ , can cover the remaining keyword set kSet with a $\textit{Cost}(C,q)$ at most $\mathcal{R}$ .

Among $C$ , there must exist an object $o$ such that the relative cost of $o$ is not greater than $\mathcal{R}/|\textit{kSet}|$ , where $|\textit{kSet}|$ is the number of the remaining keywords. We prove this point by contradiction. Suppose that all objects in $C$ have the relative cost greater than $\mathcal{R}/|\textit{kSet}|$ . Then, we have

$\displaystyle\frac{\textit{Cost}(o,q)}{|o.\psi\cap q.\psi|}>\frac{\mathcal{R}}{|\textit{kSet}|}\quad\text{for each }o\in C\Longleftrightarrow\textit{Cost}(o,q)>\frac{\mathcal{R}}{|\textit{kSet}|}\cdot|o.\psi\cap q.\psi|\quad\text{for each }o\in C\Longrightarrow\sum_{o\in C}\textit{Cost}(o,q)>\frac{\mathcal{R}}{|\textit{kSet}|}\cdot\sum_{o\in C}|o.\psi\cap q.\psi|\Longrightarrow\sum_{o\in C}\textit{Cost}(o,q)>\mathcal{R},$ (4)

where the last implication follows from the fact that the keywords of all objects in $C$ cover kSet, so that $\sum_{o\in C}|o.\psi\cap q.\psi|\geqslant|\textit{kSet}|$ . However, $\sum_{o\in C}\textit{Cost}(o,q)\leqslant\mathcal{R}$ since $C$ is a subset of the optimal result, leading to a contradiction. Therefore, there must exist an object $o$ in $C$ such that the relative cost of $o$ is not greater than $\mathcal{R}/|\textit{kSet}|$ .

In the iteration in which the $k$ th keyword $t_{k}$ is covered, we have $|\textit{kSet}|\geqslant n-k+1$ . Since the greedy part of Para-Appro picks the object with the minimal relative cost, $t_{k}$ is first covered by an object whose relative cost is not greater than $\mathcal{R}/|\textit{kSet}|$ . Finally, we conclude $\textit{Price}(t_{k})\leqslant\mathcal{R}/|\textit{kSet}|\leqslant\mathcal{R}/(n-k+1)$ . ∎

Theorem 2.

For any rational $\varepsilon\geqslant 1$ , the Para-Appro algorithm runs in $\mathcal{O}((\lfloor n-n/\varepsilon\rfloor+2^{n/\varepsilon})m\log m)$ time and polynomial space with an approximation ratio of $1+\ln\varepsilon$ .

Proof.

We obtain the optimal result $S_{E}$ for the remaining keywords, whose cost is denoted as $\mathcal{R}_{E}$ , when Para-Appro uses the Sum-Exact part. Clearly, $\mathcal{R}_{E}\leqslant\mathcal{R}$ . For the first $\lfloor n-n/\varepsilon\rfloor$ keywords, which have been satisfied by $S_{A}$ , we can bound the cost of $S_{A}$ based on Lemma 1:

$\displaystyle\textit{Cost}(S_{A},q)\leqslant\sum_{k=1}^{\lfloor n-n/\varepsilon\rfloor}\frac{\mathcal{R}}{n-k+1}=\sum_{k=1}^{n-\lceil n/\varepsilon\rceil}\frac{\mathcal{R}}{n-k+1}=(H_{n}-H_{\lceil n/\varepsilon\rceil})\cdot\mathcal{R}\leqslant(\ln n-\ln\lceil n/\varepsilon\rceil)\cdot\mathcal{R}\leqslant\ln\varepsilon\cdot\mathcal{R}.$ (5)

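Here $H_{j}=\sum_{i=1}^{j}1/i$ denotes the $j$ th harmonic number. Since $S=S_{A}\cup S_{E}$ , we have $\textit{Cost}(S,q)\leqslant\textit{Cost}(S_{A},q)+\mathcal{R}_{E}\leqslant(1+\ln\varepsilon)\cdot\mathcal{R}$ , which implies that the approximation ratio of Para-Appro is $1+\ln\varepsilon$ .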
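
The passage from harmonic numbers to logarithms in (5) rests on a standard comparison of a sum with an integral. A short derivation, writing $k=\lceil n/\varepsilon\rceil$ as a shorthand introduced here (so that $1\leqslant k\leqslant n$ because $\varepsilon\geqslant 1$ ), might read:

```latex
\begin{aligned}
H_{n}-H_{k}
  &= \sum_{i=k+1}^{n}\frac{1}{i}
   \leqslant \sum_{i=k+1}^{n}\int_{i-1}^{i}\frac{\mathrm{d}x}{x}
   = \int_{k}^{n}\frac{\mathrm{d}x}{x}
   = \ln n-\ln k, \\
\ln n-\ln k
  &\leqslant \ln n-\ln\frac{n}{\varepsilon}
   = \ln\varepsilon
  \qquad\text{since } k=\lceil n/\varepsilon\rceil\geqslant n/\varepsilon .
\end{aligned}
```

The first inequality holds because $1/i\leqslant 1/x$ for every $x\in[i-1,i]$ .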

If the algorithm pops a node from queue $U$ that is not an object, its child nodes with keyvalue should be inserted into the queue (lines 27–31). All nodes are inserted into the queue in the worst case, with the cost $\mathcal{O}(m\log m)$ . If a node refers to an object, we need to re-insert it into the queue when satisfying the first $\lfloor n-n/\varepsilon\rfloor$ query keywords (lines 13–16) since each time an object is added to the result set, the keyvalue of each element in the queue must be adjusted. At most $\lfloor n-n/\varepsilon\rfloor$ objects can be in the result with each only satisfying one keyword. Therefore, the complexity is $\mathcal{O}((\lfloor n-n/\varepsilon\rfloor)m\log m)$ . We run the Sum-Exact algorithm for the remaining uncovered keywords (lines 18–25), containing the re-insertion of the queue and the process of invoking Sum-Exact. The cost of this part is at most $\mathcal{O}(2^{n/\varepsilon}m\log m)$ . Thus, the total worst-case complexity is $\mathcal{O}((\lfloor n-n/\varepsilon\rfloor+2^{n/\varepsilon})m\log m)$ .

The space consumption will not exceed $nm$ since, in the worst case, all objects with their keywords are inserted (lines 27–31). The same is true for the subprocess Sum-Exact. Therefore, Para-Appro runs in polynomial space. ∎

4.3 Parameter determination

The algorithm can be adapted to various objectives by setting $\varepsilon$ to different values. When a user is dissatisfied with the query accuracy of the existing approximation algorithms, he or she can assign the desired query accuracy $\alpha$ , and the algorithm determines $\varepsilon$ based on the following two conditions:

• $1+\ln\varepsilon\leqslant\alpha$ implies $\varepsilon\leqslant e^{\alpha-1}$ ,
• $1+\ln\varepsilon\leqslant\sum_{i=1}^{n}1/i$ implies $\varepsilon < n\cdot e^{\frac{1}{2n}+\gamma-1}$ , where $\gamma$ is the Euler-Mascheroni constant.

The first condition holds since the approximation ratio of Para-Appro should be no greater than $\alpha$ , as required by the user. If the algorithm is about to run out of memory for large datasets, Para-Appro will determine a practical $\varepsilon$ based on the second condition in order to still obtain more accurate results than those obtained by existing approximation algorithms. As the amount of data grows, Para-Appro gradually increases $\varepsilon$ , lowering the query accuracy in exchange for scalability to large datasets. Thus, Para-Appro can always determine $\varepsilon$ in an on-the-fly manner.
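
The two conditions translate directly into bounds on $\varepsilon$ . The sketch below computes both bounds; combining them by taking the smaller value, clamped to at least 1, is an assumption made here for illustration and is not a rule stated in the text. The function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def eps_for_accuracy(alpha: float) -> float:
    """Largest eps with 1 + ln(eps) <= alpha (first condition)."""
    if alpha < 1.0:
        raise ValueError("an approximation ratio cannot be below 1")
    return math.exp(alpha - 1.0)


def eps_cap(n: int) -> float:
    """Upper limit n * exp(1/(2n) + gamma - 1) from the second condition."""
    return n * math.exp(1.0 / (2 * n) + EULER_GAMMA - 1.0)


def choose_eps(alpha: float, n: int) -> float:
    """Pick an eps respecting both bounds, never below 1 (an assumption)."""
    return max(1.0, min(eps_for_accuracy(alpha), eps_cap(n)))
```

For $n=7$ the cap is slightly below 5, so any requested accuracy looser than that cap is tightened to it, while $\alpha=1$ yields $\varepsilon=1$ and hence the exact search.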

Example 3.

In Example 2, Sum-Appro will return $o_{2},o_{3},o_{4},o_{7}$ and $o_{8}$ as the result with cost 35.15. Para-Appro will return $o_{2},o_{3},o_{4},o_{6}$ and $o_{8}$ as the result with cost 33.55 when $\varepsilon=1.5$ or $o_{1},o_{3},o_{4},o_{6}$ and $o_{8}$ as the result with cost 32.67 when $\varepsilon=1.1$ . We illustrate the execution of the steps of Para-Appro from the dequeuing root node to the final result when $\varepsilon=1.5$ .

(1) Dequeue $N_{1}$ , enqueue $N_{2},N_{4},N_{3}$ ; Queue $\{(N_{3},{\textstyle\frac{2}{5}}),(N_{2},{\textstyle\frac{1}{2}}),(N_{4},{\textstyle\frac{5}{3}})\}$ .
(2) Dequeue $N_{3}$ , enqueue $E_{2},E_{4},E_{8}$ ; Queue $\{(N_{2},{\textstyle\frac{1}{2}}),(N_{4},{\textstyle\frac{5}{3}}),(E_{2},{\textstyle\frac{5}{2}}),(E_{4},\sqrt{10}),(E_{8},{\textstyle\frac{\sqrt{85}}{2}})\}$ .
(3) Dequeue $N_{2}$ , enqueue $E_{1},E_{3}$ ; Queue $\{(N_{4},{\textstyle\frac{5}{3}}),(E_{2},{\textstyle\frac{5}{2}}),(E_{3},{\textstyle\frac{\sqrt{29}}{2}}),(E_{4},\sqrt{10}),(E_{1},\sqrt{17}),(E_{8},{\textstyle\frac{\sqrt{85}}{2}})\}$ .
(4) Dequeue $N_{4}$ , enqueue $E_{5},E_{6},E_{7}$ ; Queue $\{(E_{2},{\textstyle\frac{5}{2}}),(E_{3},{\textstyle\frac{\sqrt{29}}{2}}),(E_{4},\sqrt{10}),(E_{1},\sqrt{17}),(E_{5},5\sqrt{2}),(E_{7},{\textstyle\frac{\sqrt{85}}{2}}),(E_{8},{\textstyle\frac{\sqrt{85}}{2}}),(E_{6},\sqrt{58})\}$ .
(5) $|\textit{currentSet}|=7$ , $S_{A}=\{E_{2}\}$ , $\textit{currentSet}=\{t_{2},t_{3},t_{5},t_{6},t_{7}\}$ ; Queue $\{(E_{3},\sqrt{29}),(E_{4},2\sqrt{10}),(E_{1},\sqrt{17}),(E_{5},5\sqrt{2}),(E_{7},{\textstyle\frac{\sqrt{85}}{2}}),(E_{8},{\textstyle\frac{\sqrt{85}}{2}}),(E_{6},\sqrt{58})\}$ .
(6) $|\textit{currentSet}|=5$ , $S_{A}=\{E_{2},E_{3}\}$ , $\textit{currentSet}=\{t_{3},t_{5},t_{6},t_{7}\}$ ; Queue $\{(E_{3},\sqrt{29}),(E_{4},2\sqrt{10}),(E_{1},\sqrt{17}),(E_{5},5\sqrt{2}),(E_{7},{\textstyle\frac{\sqrt{85}}{2}}),(E_{8},{\textstyle\frac{\sqrt{85}}{2}}),(E_{6},\sqrt{58})\}$ .
(7) $|\textit{currentSet}|=4$ , $S_{A}=\{E_{2},E_{3}\}$ , $S_{E}=\{E_{4},E_{6},E_{8}\}$ , $S=\{E_{2},E_{3},E_{4},E_{6},E_{8}\}$ .

Therefore, Para-Appro is effective in solving the SGK query problem, and obtaining a higher accuracy of the result requires a lower $\varepsilon$ .

5. Experiments

The experiments studied the performance of our algorithm compared with Sum-Exact and Sum-Appro. All of the algorithms were implemented in Java and executed in a Windows 10 System on an Intel Core i5-4570 CPU @3.2GHz with 8GB RAM. Our datasets and index are all memory resident.

5.1 Experimental setup

We use three real datasets adopted in [8, 12], named Hotel, Web and Nation. The properties of the datasets are listed in Table 4. Missing values in the data are set to null. Dataset Hotel (www.allstays.com) contains the hotels in the U.S., where each hotel is a spatial object with a set of descriptive words (e.g., “restaurant” and “airport”) and coordinates taken from Google Maps. Hotel is small, which makes it convenient for evaluating the performance of our algorithm.

Table 4

Properties of datasets

Datasets	Hotel	Web	Nation
Number of objects	20790	31123	1162547
Number of keywords	80845	15374851	12430491
Number of distinct keywords	602	185643	1328533

Dataset Web, offered by [8], is created from two real datasets, WEBSPAM-UK2007 and TigerCensusBlock. The former contains a large number of web descriptions denoted as numerals. The latter corresponds to a set of census blocks in Iowa, Kansas, Missouri and Nebraska. Web consists of the spatial objects in TigerCensusBlock, where each object is randomly associated with a web description from WEBSPAM-UK2007. This dataset consumes a large amount of memory since the RIs are built over a large number of keywords.

Dataset Nation (geonames.usgs.gov) is downloaded from the U.S. Board on Geographic Names. Each spatial object has latitude and longitude coordinates and a set of attributes, for example, a feature class such as “airport” or the state it belongs to such as “Washington”. The scalability of Para-Appro is validated on the large dataset Nation, as shown in the next subsection.

5.1.1 Query generation

We generate 50 queries for each trial to reduce noise. For a query with $n$ keywords, we randomly select $n$ objects in the database and then select a keyword from each object as a query keyword. The query location is the center point of these $n$ objects. Similar results are obtained even when a query is widely dispersed, and the average over these 50 queries is more convincing than any single query. By default, $\varepsilon$ is set to an integer that satisfies the second condition in Subsection 4.3.
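
The query generation procedure can be sketched as follows. Interpreting the "center point" as the centroid of the chosen locations is an assumption made here, as are the data layout, the function names, and the use of a seeded generator for reproducibility.

```python
import random
from typing import Dict, List, Set, Tuple

Obj = Tuple[Tuple[float, float], Set[str]]  # (location, non-empty keyword set)


def generate_query(objects: Dict[str, Obj], n: int, rng: random.Random):
    """Return (location, keywords) for one random query built from n objects."""
    chosen = rng.sample(sorted(objects), n)
    # One keyword per chosen object; identical picks collapse, so the query
    # may end up with fewer than n distinct keywords.
    keywords = {rng.choice(sorted(objects[o][1])) for o in chosen}
    cx = sum(objects[o][0][0] for o in chosen) / n
    cy = sum(objects[o][0][1] for o in chosen) / n
    return (cx, cy), keywords


def generate_trial(objects: Dict[str, Obj], n: int,
                   trials: int = 50, seed: int = 0) -> List:
    """Generate the queries of one trial (50 by default, as in the text)."""
    rng = random.Random(seed)
    return [generate_query(objects, n, rng) for _ in range(trials)]
```

Drawing the keywords from actual objects guarantees that every generated query can be covered by the database.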

5.2 Experimental results

5.2.1 Results on Hotel

Figure 3 shows that the runtime of Para-Appro is close to that of Sum-Appro, while its approximation ratio is lower. In both subfigures, the $x$-axis is the number of query keywords. The $y$-axis is the runtime in Fig. 3a and the approximation ratio in Fig. 3b. We observe that the runtime of Sum-Exact grows exponentially, which is why approximation algorithms are more widely applied. Furthermore, Para-Appro outperforms Sum-Appro in query accuracy, as shown in Fig. 3b.

Figure 3.

Results on Hotel.

5.2.2 Results on Web

Although the number of objects in Web is of the same order of magnitude as in Hotel, the runtime on Web is clearly longer, as shown in Fig. 4. This is because Web contains a very large number of keywords. On the one hand, the RIs occupy a large amount of memory. On the other hand, it takes more time to search and prune the RIR-tree to find the query keywords. Nevertheless, compared to Sum-Appro, Para-Appro still obtains more accurate results in a reasonable time.

Figure 4.

Results on Web.

5.2.3 Results on Nation

As the number of objects reaches the millions, the runtime of Sum-Exact grows exponentially, as shown in Fig. 5, and the advantage of approximation algorithms becomes clearer. Since the number of keywords associated with a single object in Nation is approximately 10, while this number can be in the hundreds for Web, approximation algorithms on Web take more time when querying the same number of keywords. Similar results for the approximation ratio appear in Fig. 5b.
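The per-object keyword counts quoted above can be checked against Table 4 with a short calculation. The following Python sketch divides the number of keywords by the number of objects for each dataset; the variable names are ours, and the figures are copied from Table 4.

```python
# Figures copied from Table 4: (number of objects, number of keywords).
table4 = {
    "Hotel": (20790, 80845),
    "Web": (31123, 15374851),
    "Nation": (1162547, 12430491),
}

# Average number of keywords attached to one object in each dataset.
avg_keywords = {name: kws / objs for name, (objs, kws) in table4.items()}

for name, avg in avg_keywords.items():
    print(f"{name}: {avg:.1f} keywords per object")
```

The averages come out at roughly 3.9 for Hotel, 494 for Web and 10.7 for Nation, which is consistent with the contrast drawn between Web and Nation.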

Figure 5.

Results on Nation.

5.2.4 Time consumption and query accuracy

Figure 6 presents the relation between the tolerable time consumption $t$ (the $x$-axis) and the query accuracy, reflected by the approximation ratio (the $y$-axis). This experiment searches 20 keywords among 10000 objects on Nation. If the tolerable runtime is 40 ms, the best approximation ratio among 50 trials is 3.6. As $t$ grows, Para-Appro determines $\varepsilon$ algorithmically for higher query accuracy and ultimately returns the exact result, i.e., the approximation ratio approaches 1.

Figure 6.

Relation between time consumption and query accuracy.

5.2.5 Scalability

To verify the scalability to a large number of instances, the first experiment searches 600 keywords on Nation while the number of objects $m$ (the $x$-axis in Fig. 7) grows from $10^{3}$ to $10^{6}$. In this setting, Sum-Appro can only meet a fixed approximation ratio of 6.97, shown as the gray dashed line in Fig. 7b. If a user requests an approximation ratio of 3, Sum-Appro is unsatisfactory. By contrast, Para-Appro meets the user's requirement when $m<5\cdot 10^{5}$. Beyond that, it algorithmically determines $\varepsilon$ by dynamically monitoring the runtime to avoid running out of memory. The approximation ratio increases to 4 when $m=5\cdot 10^{5}$ and to 5 when $m=10^{6}$, shown as the black solid line in Fig. 7b. Thus, Para-Appro scales to a large $m$ at the cost of reduced query accuracy. Note that its accuracy is still better than that of Sum-Appro for a large number of instances.

Figure 7.

Results on scalability with different numbers of objects.

Figure 8.

Results on scalability with different numbers of query keywords.

The second experiment searches different numbers of keywords among 500000 objects on Nation, where the number of query keywords $n$ (the $x$-axis in Fig. 8) grows from 100 to 600. If a user requests an approximation ratio of 3, Sum-Appro is unsatisfactory across the various numbers of query keywords, as shown by the gray dashed line in Fig. 8b. By contrast, Para-Appro meets the user's requirement when $n\leqslant 500$. Although the accuracy is reduced to avoid running out of memory when $n=600$, Para-Appro still outperforms Sum-Appro with a feasible runtime.

6. Related work

In processing the SGK query, data structures are necessary for retrieving massive numbers of spatial keyword objects. Most location-aware queries adopt the R-tree [17] as the data structure; the R-tree extends the B-tree [19] to multidimensional space and uses rectangles to store the location information of each node. To handle textual descriptions efficiently, the information retrieval R-tree (IR${}^{2}$-tree) [4] stores all keywords in a node as a binary string. Binary strings store keywords easily but make it difficult to retrieve the objects that contain the corresponding keywords. The inverted-file R-tree (IR-tree) [16] uses inverted files to store the relationship between the keywords and the objects. It also stores the weight of each keyword, which is unnecessary for our Boolean keyword matching. The SI-index [20] compresses two-dimensional coordinates into a one-dimensional distance, saving memory but losing the shape information. The CD-tree [21] adapts to dynamic databases but does not include textual descriptions of the data. In [22], 10 kinds of data structures were compared, and the data structures with inverted files were verified to outperform the others in keyword processing. Among these data structures, we choose to apply a modified IR-tree to our problem.
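To make the role of inverted files concrete, the following Python sketch builds a plain keyword-to-objects mapping and uses it for a Boolean coverage test. It only illustrates the general idea of an inverted file; it is not the IR-tree or the RIR-tree, it ignores spatial information and keyword weights, and the function names and the dictionary-based representation are assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_file(objects):
    """Map each keyword to the set of object ids that contain it.

    objects: dict mapping object id -> set of keywords.
    """
    inverted = defaultdict(set)
    for oid, keywords in objects.items():
        for kw in keywords:
            inverted[kw].add(oid)
    return inverted

def covers(inverted, group, query_keywords):
    """Return True if every query keyword appears in at least one object of group."""
    group = set(group)
    # An empty intersection is falsy, so a missing keyword fails the test.
    return all(inverted.get(kw, set()) & group for kw in query_keywords)
```

For example, with objects `{1: {"wifi", "parking"}, 2: {"tea"}, 3: {"wifi"}}`, the group `[1, 2]` covers the query `["wifi", "tea"]`, whereas the group `[3]` does not.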

Previous work in [6, 12] indicates that the SGK query can be reduced from the set cover problem [23], which is a notable combinatorial optimization problem with applications in a variety of areas. For solving the set cover problem, either approximation algorithms with greedy methods or exact algorithms with exhaustive methods are usually considered. By contrast, parametric methods are relatively rare. Despite the lack of practical and effective applications, parametric methods for the set cover problem were introduced in [14, 15] to provide a new perspective on solving this problem. An inapproximability result for the set cover problem was proved in [24], so parametric methods for this problem cannot obtain a solution in polynomial time. Thus, in this work, we accomplished a practical application that generalizes parametric methods from the set cover problem to the SGK query.
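For readers unfamiliar with the greedy approach mentioned above, the following Python sketch implements the classical greedy heuristic for weighted set cover, which repeatedly picks the subset with the lowest cost per newly covered element. It is the textbook method, not Para-Appro or Sum-Appro, and the function name and input representation are assumptions made for the example. In the SGK setting one may think of the universe as the query keywords, each subset as the keyword set of an object, and the cost as the distance from the object to the query point.

```python
def greedy_set_cover(universe, subsets, cost):
    """Classical greedy heuristic for weighted set cover.

    universe: iterable of elements to cover.
    subsets:  dict mapping subset name -> set of elements.
    cost:     dict mapping subset name -> positive cost.
    Returns the list of chosen subset names in selection order.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the subset with the smallest cost per newly covered element.
        best = min(
            (s for s in subsets if subsets[s] & uncovered),
            key=lambda s: cost[s] / len(subsets[s] & uncovered),
            default=None,
        )
        if best is None:
            raise ValueError("the universe cannot be covered by the given subsets")
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen
```

The approximation ratio of this heuristic is fixed by the instance size and cannot be tuned by the user, which is the limitation that parametric methods aim to remove.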

7. Conclusion

In this paper, we proposed an approximation algorithm, Para-Appro, for the SGK query with a parametric approximation ratio. Although the algorithm is not a large advance over existing approximation algorithms, because it improves query accuracy at the cost of time consumption, it offers an approach for designing new approximation algorithms. Para-Appro was verified to run in an on-the-fly manner on benchmark datasets and is potentially scalable to a large number of instances.

In the future, we will consider more efficient data structures for retrieving spatial keyword objects and will apply Para-Appro to variants of the SGK query, in which the goal function is replaced with other measures, such as the maximum distance between the resulting objects.

Acknowledgments

The authors thank Cheng Long of Nanyang Technological University for offering the real datasets. The work is supported by the National Natural Science Foundation of China (No. 11871221) and the Fundamental Research Funds for the Central Universities.

References

1. Yoo J.S. and Bow, A framework for generating condensed co-location sets from spatial databases, Intelligent Data Analysis 23(2) (2019), 333–355.

2. Feng, Cong, Jensen C.S. and Guo, Finding attribute-aware similar regions for data analysis, Proceedings of the VLDB Endowment 12(11) (2019), 1414–1426.

3. Paulet, Bertino and Varadharajan, Practical approximate k nearest neighbor queries with location and query privacy, IEEE Transactions on Knowledge and Data Engineering 28(6) (2016), 1546–1559.

4. Felipe I.D., Hristidis and Rishe, Keyword search on spatial databases, in: IEEE 24th International Conference on Data Engineering, 2008, pp. 656–665.

5. Zhang, Chee Y.M., Mondal, Tung A.K.H. and Kitsuregawa, Keyword search in spatial databases: Towards searching by document, in: IEEE 25th International Conference on Data Engineering, 2009, pp. 688–699.

6. Cao, Cong, Jensen C.S. and Ooi B.C., Collective spatial keyword querying, in: SIGMOD’11, 2011, pp. 373–384.

7. Sterling, McKirdy, Santiago and Valencia, Hurricane Maria: Puerto Rico reels, Turks and Caicos hit, CNN, 2017.

8. Long, Wong R.C.-W., Wang and Fu A.W.-C., Collective spatial keyword queries: A distance owner-driven approach, in: SIGMOD’13, 2013, pp. 689–700.

9. Skovsgaard and Jensen C.S., Finding top-k relevant groups of spatial web objects, The VLDB Journal 24(4) (2015), 537–555.

10. Chan H.K.-H., Long and Wong R.C.-W., On generalizing collective spatial keyword queries, IEEE Transactions on Knowledge and Data Engineering 30(9) (2018), 1712–1726.

11. Mahmood A.R., Aref W.G., Aly A.M. and Tang, Atlas: On the expression of spatial-keyword group queries using extended relational constructs, in: Proceedings of the 24th ACM SIGSPATIAL/GIS, 45, 2016, pp. 1–10.

12. Cao, Cong, Guo, Jensen C.S. and Ooi B.C., Efficient processing of spatial group keyword queries, ACM Transactions on Database Systems 40(2) (2015), 1–48.

13. Bonnet É., Lampis and Paschos V.T., Time-approximation trade-offs for inapproximable problems, Journal of Computer and System Sciences 92 (2018), 171–180.

14. Cygan, Kowalik and Wykurz, Exponential-time approximation of weighted set cover, Information Processing Letters 109(16) (2009), 957–961.

15. Bourgeois, Escoffier and Paschos V.T., Efficient approximation of min set cover by moderately exponential algorithms, Theoretical Computer Science 410(21–23) (2009), 2184–2195.

16. Cong, Jensen C.S. and Wu, Efficient retrieval of the top-k most relevant spatial web objects, Proceedings of the VLDB Endowment 2(1) (2009), 337–348.

17. Guttman, R-trees: A dynamic index structure for spatial searching, in: SIGMOD’84, 1984, pp. 47–57.

18. Roussopoulos, Kelley and Vincent, Nearest neighbor queries, in: SIGMOD’95, 1995, pp. 71–79.

19. Bayer and McCreight E.M., Organization and maintenance of large ordered indexes, in: ACM SIGFIDET Workshop on Data Description and Access, 1970, pp. 107–141.

20. Tao and Sheng, Fast nearest neighbor search with keywords, IEEE Transactions on Knowledge and Data Engineering 26(4) (2014), 878–888.

21. Wan, Liu and Wu, Cd-tree: A clustering-based dynamic indexing and retrieval approach, Intelligent Data Analysis 21(2) (2017), 243–261.

22. Almaslukh and Magdy, Evaluating spatial-keyword queries on streaming data, in: Proceedings of the 26th ACM SIGSPATIAL/GIS, 2018, pp. 209–218.

23. Vazirani V.V., Approximation Algorithms, Springer-Verlag, 2003.

24. Nelson, A note on set cover inapproximability independent of universe size, Electronic Colloquium on Computational Complexity 14(105) (2007), 1–4.
