Cost evaluation and improvement of TB-tree insertion algorithm

Abstract

This paper proposes the cost evaluation model for the existing TB-tree insertion algorithm called BASEINSERT. Factors that may affect its performance are analyzed in a detailed manner. While we search the latest node, the number of nodes being accessed is very prominent. Given this property, the paper comes up with an improved algorithm, named IMPRINSERT by adding the cache structures of root-latest-node path and right-most path, which reduce the number of nodes being accessed while searching for the latest node and right-most path close to zero. Under the bulk data setting, algorithm that inserts the data segment by segment is fairly inefficient, therefore this paper introduces a new insertion algorithm, called BULKINSERT and its cost evaluation model is proposed as well. Verified by a large number of synthesized datasets, it is proved that the cost models are reliable, both IMPRINSERT and BULKINSERT are effective and demonstrate nearly linear scalability.

Keywords

TB-tree insertion algorithm cost evaluation bulk insertion moving objects databases

1. Introduction

As wireless communication technology and location technology improved rapidly, together with the popularity of related equipment, a wide range of applications need to track and manage the locations of moving objects, to provide query and analysis service. Some core applications in this context are logistics management, fleet management, traffic analysis and management, analysis of movements of people (customers) in mobile computing, geographic information system, environmental studies, etc. [1]. As time goes by, the moving objects will generate a considerable amount of data given the possible change of their time and locations. Therefore, it needs the support of moving objects databases (MODs) and this gives rise to more researches and applications of MODs. Currently, some major work includes developing new index technologies and novel query types and algorithms. The existing index technologies for moving objects database fall into two categories: present and future-oriented query and historical trajectory query. As for the former one, the query adopts TRP-tree [2], its variant TPR*-tree [3], HVTPR-tree [4], VCI-tree [5], SV model [6], STRIPE index [7] and so on, among which, TRP-tree and its variant TPR*-tree are based on R*-tree [8] and they track current locations of moving objects, VBRs and other parameters. These data are used to infer their future locations, thus, making them the most widely adopted query methods. As for the latter one, it mainly uses TB-tree [9] and its variant TB*-tree [10], SETI [11], multi-level multiple indexes [12], RPTI [13], PA-Tree [14], Cloud-Based Trajectory Index [15, 16], DTI $+$ S [17], etc. TB-tree evolves from R-tree [18], but each node only keeps the data of one moving object. It demonstrates great excellence in terms of trajectory preservation and trajectory-based query, thus it is widely used. There are also abundant researches about MODs-based query, including NN query [19, 20, 21, 22, 23, 24], RNN query [24, 25, 26], range query and closing events query [27, 28, 29, 30, 31, 32].

All the above mentioned research findings contributed greatly to the theory and application of MODs. Built on these methods, this paper endeavors to further improve the TB-tree insertion algorithm. Though literature [9] came up with the structure of TB-tree and realized the basic insertion algorithm (insert the data segment by segment, hence called BASEINSERT), there are still three problems: (1) the cost evaluation of the insertion algorithm was overlooked, which leads to efficiency uncertainty in actual applications. It may not help the system to choose the efficient execution plans, nor can it identify weak links of the algorithm, making it difficult to improve; (2) when search for latest node, there are too many disk accesses; (3) another drawback is lack of bulk data-oriented insertion algorithm, leaving it fairly inefficient under bulk data setting. This makes it highly necessary to develop a new insertion algorithm to cater to bulk data.

The major contributions of this paper are summarized as follows:

(4) (1)
Propose cost evaluation model for BASEINSERT.
(2)
Propose improved insertion algorithm (hence, called IMPRINSERT) and its cost evaluation model.
(3)
Present insertion algorithm for bulk data (hence, called BULKINSERT) and its cost evaluation model.
(4)
Based on many experiments of synthesized dataset, reliability of the proposed models, effectiveness and efficiency of the improved and new algorithm are verified.

Related work would be mentioned in Section 2 of this paper. Section 3 presents the IMPRINSERT algorithm and its cost evaluation model. Section 4 introduces the BULKINSERT algorithm and its cost evaluation model. Section 5 provides the experiment results and analyses and Section 6 concludes the paper with directions for future work.
2. Relate work

Section 2 starts by introducing TB-tree structure and BASEINSERT algorithm, and then proposes the cost evaluation model for BASEINSERT algorithm and analyzes its weakness. At the end, some possible solutions are pointed out.

2.1 TB-tree structure

Within the existing spatio-temporal data storage structures, the R-tree family is the most popular index structure. As one of the variants of R-tree, TB-tree embodies the general properties of R-tree, using MBR (Minimum bounding rectangle) to approximately describe the spatio-temporal range of included line segments, the MBR of upper-layer node covers all the MBRs of lower-layer sub-nodes and the pointers which point to the lower-layer sub-nodes are contained in upper-layer node. However, TB-tree is specially designed for trajectory preservation of moving objects. When inserting a new line segment of trajectory of certain object, it will search leaf node containing the previous line segment of the trajectory (hence, called latest node) and use it as target position. In addition, one leaf node only stores the trajectory line segments from one object and the leaf nodes storing the same object are connected to each other by pointers. The extending sequence of leaf nodes go from the left to the right. As shown in Fig. 1, c1, c3, c7, c8, c10, c12 are the leaf nodes that store the same object.

Such design takes some spatial discrimination away from the TB-tree, however, it brings sound trajectory preservation which makes the trajectory line segments within and between the leaf nodes well-organized. Thus, the following algorithms do not need to do sorting and it is much easier for the execution of pruning algorithms. Therefore, the performance of trajectories query (in particular, combination query) has been greatly improved [9].

Figure 1.

Structure of TB-tree [9].

Figure 2.

BASEINSERT algorithm of TB-tree [9].

2.2 BASEINSERT algorithm

As mentioned above, the structure of TB-tree has advantages in trajectories query, but the performance of its insertion algorithm has not been paid enough attention and no cost evaluation model of it is provided. This subsection will introduce the existing BASEINSERT algorithm, for the preparation of subsequent construction of its cost evaluation model, performance analysis and improvement. Specifically, BASEINSERT algorithm mentioned in literature [9] is shown in Fig. 2. $N$ is the root node of the tree, $N^{\prime}$ is latest node and $E$ is newly added line segment.

When executing the insertion of new line segment of an object, to achieve trajectory preservation, the new line segment must be inserted into latest node or be stored in the newly split leaf node which has two-way linkage with the latest node. Therefore, the search for latest node is a must process (please refer to line 1). Given the feedback of the operation, the following operations should be handled in three different ways. The first condition: if latest node has space, then the new line segment can be directly inserted in latest node (line 4); the second condition: if there is no extra capacity in latest node, then create a new leaf node (line 6) and store in it the newly added line segment. The new leaf node and latest node are pointing to each other with two pointers, in this way, trajectory preservation can be realized; the third condition: if no latest node is found, then create a new leaf node (line 8) and it is used as the first node of the object and stored in the newly added line segment. When handling the second and third conditions, algorithm must first find the path starting from root node to the penultimate internal node on the rightmost side of the TB-tree (hence, called structure of right-most path), and insert the new leaf node in the right-most path. If the right-most path is full, then split a new node within the penultimate layer to serve as a node for right-most path and then insert in it the newly added leaf node. If the parent node of newly split node in the right-most path is also fully occupied, then the split operation can propagate upwards until a new root node is generated.

2.3 Cost model

Apparently, the disk access cost of BASEINSERT algorithm will be affected by the height of the tree, node fanout, whether there is cache or not, the variety of data source (number of moving objects), velocity and direction of moving objects, the arrival consequences of all moving objects and uniformity. Therefore, it is hard to directly conduct the total cost evaluation. This paper will estimate the disk access cost of inserting single line segment.

Let $h$ be the height of the tree, MO the number of moving objects, FANOUT the fanout value of node, as described in Subsection 2.2, each moving object has a latest node (except the first line segment), thus we can eliminate the small probability event of having no latest node. The probability for latest node with extra storage capacity $P1=1-\frac{1}{\text{FANOUT}}$ , and the probability for fully occupied latest node $P2=\frac{1}{\text{FANOUT}}$ . Under such hypothesis, the average node accesses with inserted new line segment can be expressed as follows:

$\displaystyle\text{NAseg}=P1\ast(\text{NAsearch}+\text{NAwrite}+\text{NAupdate% })+P2\ast(\text{NAsearch}+\text{NAtopright}+\text{NAwrite}+\text{NAupdate}),$ (1)

where NAwrite is the node access for writing newly added line segment to latest node with extra storage capacity or newly split leaf node,

$\displaystyle\text{NAwrite}=1.$ (2)

When newly added line segment is written to leaf node, it may lead to expanding of the MBRs of the nodes in upper layers and this could cause the update of these nodes. When there is no statistical data, we can assume that when we insert a new line segment, half of the MBRs of the nodes between insertion point and root node will change, therefore,

$\displaystyle\text{NAupdate}=\frac{(h-2)}{2}.$ (3)

NAtopright is the node accesses during the search for right-most path when latest node is full,

$\displaystyle\text{NAtopright}=h-2.$ (4)

NAsearch is the node accesses during the search for latest node. Any TB-tree node with its MBR overlapping with the MBR of newly added line segment will be traversed and included in NAsearch. Next, we will check the nodes being queried and accessed in the leaf layer. The temporal dimension will be checked first. We assume that all data of moving objects has uniform time interval and their sequence is stable, then in the leaf layer, the number of nodes overlapping with newly added line segment is around 0.5MO. Next, we look into spatial dimension. The probability of spatial overlapping for each leaf node with the MBR of newly added line segment is:

$\displaystyle P=(nx+nx^{\prime})\ast(ny+ny^{\prime}),$ (5)

where $n x$ and $n y$ are respectively the length and width of spatial rectangle projected by MBR of leaf node, $nx^{\prime}$ and $ny^{\prime}$ are those of spacial rectangle projected by MBR of the newly added line segment. All the four variables are unitized, and in specific application, they can be obtained through statistics. Equation (5) is derived from literature [33, 34], specifically, the overlapping probability between the querying rectangle (with the length and width being respectively $nx^{\prime}$ and $ny^{\prime}$ ) and the queried rectangle (with the length and width being respectively $n x$ and $n y$ ) is equal to the probability that a random point falls in an enlarged rectangle (with length and width being respectively $(nx+nx^{\prime})$ and $(ny+ny^{\prime})$ ) in the unit space.

Therefore, the number of leaf nodes which has spatio-temporal overlapping with MBR of newly added line segment is around:

$\displaystyle\text{NAsearchleaf}=0.5\text{MO}\ast P.$ (6)

Since the access of leaf nodes account for a major part of total node access, we can assume NAsearch is equivalent to NAsearchleaf, i.e.

$\displaystyle\text{NAsearch}=0.5\text{MO}\ast P.$ (7)

The above mentioned node accesses is the estimated cost for single line segment insertion inBASEINSERT algorithm. The cost of writing each newly added line segment in the leaf node, NAwrite is 1. This part of cost is only affected by data volume rather than the height of tree. The update of upper-layer nodes and the search for right-most path are related to the height of tree. When inserting data, as the height of tree grows, cost of these two parts increases accordingly. NAsearch and MO have direct relationship, and when MO is relatively large (e.g. MO is over 1000), NAsearch would be considerable, thus having a big impact on the performance of the algorithm. Therefore, reducing or avoiding NAsearch is the major way to improve performance of BASEINSERT.

3. IMPRINSERT algorithm

Based on the BASEINSERT algorithm and its cost component analysis mentioned in the previous section, this section proposes targeted improvement, including newly added cache structures and improved IMPRINSERT insertion algorithm. Finally, the cost analysis of IMPRINSERT is conducted in a detailed manner.

3.1 Data structure

3.1.1 Structure of root-latest-node path

The cache structure of root-latest-node path is shown in Table 1, where id is the unique identifier for moving object, Node address chain is consisted of all the address information of the nodes on the path from root node to the latest node of an object, and content of leaf node includes all the information of the latest node. During the insertion of moving object, frequent query and access of latest node are performed. Caching the structure of root-latest-node path can reduce invalid disk access substantially. The reason for storing the content of leaf node in root-latest-node path is to verify whether leaf node is fully occupied and this can get rid of the disk access of leaf node.

Table 1
Cache structure of root-latest-node path

Object id	Node address chain	Content of latest leaf node
id1	rootaddr, nodeaddr2, …, nodeaddrn	information of latest node for id1
id2	…	…
…	…	…

Node address chain can be used to pinpoint the path from root node to the latest node of an object. When a newly added line segment is inserted into latest node and needs bottom-up MBR update, the program would read and update contents of nodes sequentially according to the information on the node address chain. If there is no node address chain that can offer information, then, all the nodes in the tree need to add a pointer pointing to its parent node (similar to the practice of [10]), which will lead to the following disadvantages: each node occupying more space, the storage space of the total tree growing accordingly and its operation getting more complex. The node address chain only saves addresses of nodes and does not include their contents, thus avoiding taking up too much memory storage when there are large number of moving objects need to kept in the structure.

The maintenance rules of the cache structure of root-latest-node path are as follows:

(1) (1)

Before a line segment of object is inserted to TB-tree, it needs to track the related record in the structure with the id information of the object. If there is a matching record, then read the content in latest node and decide whether it can be inserted directly:

(a) (a)

If the node still has storage capacity, then perform the insertion and update.

(b)

If there is no extra storage capacity, then split the node, find the right-most path and insert into the newly split node. At this moment, the correlated root-latest-node path needs to be updated.

(2)

If there is no matching records with the id information of the object in the root-latest-node path, then execute the similar query, insertion and update operations to BASEINSERT and create a root-latest-node path record dynamically.

(3)

If new root node is generated during the data insertion, then all the node address chains in the root-latest-node path should add the address information of the new root node.

3.1.2 Structure of right-most path

As described in Subsection 2.2, during the data insertion process, if latest node is full or there is no latest node, BASEINSERT algorithm needs to split new node and search right-most path. Frequent searching for right-most path and accessing contents of its nodes would cost some resources (refer to Eqs (1) and (4)). To overcome such deficiency, the cache structure of right-most path comes into being. It includes addresses and content information of all nodes on the right-most path. When performing the insertion requires obtain these information, it can be accessed through the cache, thus, avoiding the cost of querying and accessing it through disk. Structure of right-most path is created automatically when the first node is generated in the tree. When right-most path is inserted with new leaf node or it splits new nodes, the structure also update simultaneously so as to ensure it remains the most updated status.

Figure 3.

IMPRINSERT algorithm.

3.1.3 Discussion of space complexity

As mentioned above, there is only one right-most path structure for a TB-tree, and the main memory space consumption can be expressed as $h\ast\text{SIZEnode}$ , where SIZEnode is the space size of TB-tree node, $h$ is the height of TB-tree, $h=\left\lceil\frac{\lg(n_{\rm leaf})}{\lg(\text{FANOUT})}+1\right\rceil$ , $n_{\text{leaf}}$ is the leaf node number of TB-tree, $n_{\text{leaf}}=\sum\nolimits_{\text{id}=1}^{\text{MO}}{\left\lceil{\frac{% \textit{sum}(\text{id})}{\text{FANOUT}}}\right\rceil}$ , and $\textit{sum}(\text{id})$ represents the number of line segments for the moving object identified by id.

As for the root-latest-node path structure, it has a number same to that of moving objects. Each root-latest-node path structure contains the information of one leaf node, several node addresses and one object id. So the main memory space consumption is $\text{MO}\ast(\text{SIZEnode}+h\ast\text{SIZEadr}+\text{SIZEid})$ , where SIZEadr and SIZEid are respectively the space sizes to keep a node address and an object id, and generally, they vary from 4 to 8 bytes in different computer systems. When MO is very large, main memory space consumption may be considerable, which can be solved by introducing Least Recently Used structure and reducing the consumption to minimum level.

3.2 IMPRINSERT algorithm

The pseudo code of IMPRINSERT based on the newly added data structures is shown in Fig. 3, when executing the new line segment insertion for a moving object, algorithm will first get the information about the latest node and address chain from the structure root-latest-node path with its id information (refer to line 2). If the latest node is found and it is not full, then insertion will be performed, and contents of the related upper layer nodes will be adjusted according to the address chain (refer to lines 3 and 4). If the latest node is nowhere to be found or it is full, then split leaf node and store in a new line segment (refer to line 6). Then insert new leaf node through structure of right-most path (refer to line 7). If the nodes on right-most path need to be split, then timely update of the structure of right-most path is required (refer to line 10). Since the improved TB-tree algorithm makes full use of two newly added cache structures, it avoids the complicated search for latest node or top-down search of right-most path from root node. In this way, it has reduced the cost substantially.

Figure 4.

Relationship between bucket and segment list.

3.3 Cost model

Compared with BASEINSERT, the improved IMPRINSERT algorithm saves the access cost in the following two ways:

(1) (1)
Because latest node can be located through the cache structure of root-latest-node path, the cost of NAsearch is eliminated. However, when latest node is not full, the execution of new line segment leads to bottom-up MBR update, which requires reading the related nodes on the path from latest node to root node. This means extra $\frac{(h-2)}{2}+1$ internal nodes will be accessed (marked as NAextra). If latest node is full or there is no latest node, we use the right-most path structure to insert new leaf node. This does not require any extra reading from any node.
(2)
Given the existence of right-most path cache, NAtopright can be eliminated. Above all, the cost of executing line segment insertion by IMPRINSERT algorithm is as follows:

$\displaystyle\text{NAseg}=P1\ast(\text{NAwrite}+\text{NAupdate}+\text{NAextra}% )+P2\ast(\text{NAwrite}+\text{NAupdate}).$ (8)

4. BULKINSERT algorithm

When data of moving objects have been collected and stored in a data file, if we read the data file consequently and execute algorithms mentioned in Section 2 or Section 3, the task can be fulfilled, yet, the cost of disk access is still not optimized. In this regard, this section explores bulk insertion algorithm of TB-tree (named BULKINSERT). Its fundamental principle is as follows: according to the arriving time sequence of data of moving objects, contents of data nodes will be organized well in main memory, and then based on the structure of TB-tree, nodes are written directly to disk so that the total access cost is minimized.

4.1 Data structure

BULKINSERT algorithm will use three basic data structures, including the point, line segment and bucket and the details are described as follows:

$\displaystyle\text{StruPoint}=[x,y,t],$ $\displaystyle\text{StruSeg}=[\text{id},p1,p2],$ $\displaystyle\text{StruBucket}=[\text{BucketId},\text{SegAdr},\text{SegNum},t1% ,\text{prev},\text{next}].$

The point structure StruPoint includes data coordinate $(x,y)$ and time $t$ . The line segment structure StruSeg includes id of moving object and the two ends of the segment, $p1$ and $p2$ . Bucket structure StruBucket is used to organize the data in the list of line segment, align them with leaf node of TB-tree and finally write the data to disk. The relationship between bucket and line segment is shown as Fig. 4. SegAdr is used to indicate the location of the first line segment of the bucket in the line segment list. SegNum represents the number of line segments in the bucket. $t1$ is the ending time for first line segment of the bucket, while prev and next indicate the specific storage location for latest node and the following leaf node.

Figure 5.

BULKINSERT algorithm.

4.2 Description of algorithm

A detailed description of BULKINSERT algorithm is shown in Fig. 5. The original data file includes the data of all line segments of moving objects. The data is stored according to the original sequences (the ending time sequence of line segment). To ensure each leaf node of TB-tree only stores sorted data of one moving object, the algorithm does sorting for Seglist based on id of moving object and the starting time of line segment. Then, based on the value of fanout, Seglist comes up with a bucket list named Bucketlist (refer to lines 3 and 4). At this moment, Bucketlist is only a split of Seglist, the items on Bucketlist still maintain their order based on id of moving object and the starting time of line segment.

But TB-tree sorts the leaf nodes from left to right based on the ending time sequence of the first line segment. Therefore, in line 5, algorithm does the required sorting for Bucketlist. In line 6, algorithm confirms values of pre and next pointers for each record in Bucketlist and ensures the previous leaf node and next leaf node of same moving object can point to each other. In line 7, algorithm turns line segment information of each bucket in Bucketlist into total MBR of corresponding leaf node. Hence, content of leaf node is completely constructed and written to leaf node data file sequentially. Internal node at upper layer must be constructed according to the information (including MBR, node address) of the items provided by nodes at lower layer. Starting from leaf node layer, each node at each layer provides its address and MBR (comprised of MBR of each item in the node) to parent node. This information will be an item in the parent node. Each internal node is constructed based on the information provided by lower layer and then stored in the location assigned by sequence.

4.3 Cost model

In Fig. 5, the discussion about time complexity of BULKINSERT algorithm can be discussed in three parts, including the treatment of line segments (steps 1–4), the treatment of buckets (steps 5–7) and the construction of data of internal nodes (step 8). The treatment of line segments includes sorting and cyclic treatment for Seglist, their time complexities are $O(n\ast\lg n)$ (average complexity of quick sort) and $O(n)$ respectively, then the overall complexity will be $O(n\ast\lg n)$ , where $n$ is the total number of line segments, obviously, $n=\sum\nolimits_{id=1}^{\text{MO}}{\textit{sum}(id)}$ . The treatment of buckets also includes sorting and cyclic treatment of Bucketlist. The overall time complexity is $O(n_{\text{bucket}}\ast\lg(n_{\text{bucket}}))$ , where $n_{\text{bucket}}$ is the number of buckets in Bucketlist, $n_{\text{bucket}}=\sum\nolimits_{id=1}^{\text{MO}}{\left\lceil{\frac{\textit{% sum}(\text{id})}{\text{FANOUT}}}\right\rceil}$ , which has the same expression and value with $n_{\text{leaf}}$ . The construction and write-in of internal node data is conducted bottom-up from leaf node layer. The overall time complexity is $O(n_{\text{bucket}})$ . Since each node only executes necessary write-in once, when excluding the access of original data file, the disk access cost of algorithm is the sum of all nodes in TB-tree: $\sum\nolimits_{la=1}^{h}{\left\lceil{\frac{{\rm n}_{{\rm\text{bucket}}}}{\text% {FANOUT}^{la-1}}}\right\rceil}$ .

Figure 6.

Total node access VS MO.

5. Experimental evaluation

All the insertion algorithms discussed in this paper construct the same TB-tree, including same basic TB-tree structure and same contents in nodes, and there is no difference in terms of query performance. Therefore, this paper mainly compares the cost of three insertion algorithms. To be more specific, the paper compares their indices of node access number and time of CPU, among which, node access number is the most stable factor and it is free from the impact of different machine configurations and demonstrates good comparability. However, CPU time will be affected by data node access number and it is vulnerable to configurations of different machines, therefore, the value of it just serves as a reference.

5.1 Experimental configuration

The three algorithms are coded in C language and the algorithms have been running 100 times under different configurations. The average values are selected for evaluation. The machine used in experiment has the following configuration: Intel Core i5 2410 M 2.30 GH processor, 4 GB memory, 500 GB disk and Ubuntu12.03 OS. In line with literature [9], this paper also adopts the GSTD data generator [35] to generate several synthesized spatio-temporal datasets. These synthesized spatio-temporal datasets are comprised of 100, 200, 400, 800 and 1600 moving objects respectively. Each moving object is sampled for about 5,000 times. The data points are distributed in unit square and initial location of all moving objects conforms to Gaussian distribution and later, their movement follows uniform distribution. The speed of moving objects conforms to Gaussian distribution as well. The mean is 2 (moving distance per each unit of time), the maximum is 14.7, the minimum is 0 and variance is 0.4. The size of index page for TB-tree is fixed at 4,096 bytes.

Figure 7.

datar VS MO.

Figure 8.

dataw VS MO.

5.2 Experiment results

Figure 6 shows the changes of total node accesses of the three algorithms compared to the variance of number of moving objects. As the number of moving objects grows, the total node accesses of three algorithms increase nearly linearly. NODE(BASEINSERT) (node access of algorithm is denoted as Node(algorithm)) is constantly larger than NODE(IMPRINSERT) and far larger than NODE(BULKINSERT). The reason behind is that BULKINSERT only write nodes when necessary, i.e. for each node, there will be only one write-in. According to Eqs (1) and (8), both BASEINSERT and IMPRINSERT have great loads of write-in, especially in the leaf node layer, where each insertion of line segment requires one write-in operation. In addition, BASEINSERT requires large number of accesses when searching for target node. The relationship that NODE(BASEINSERT) is larger than NODE(IMPRINSERT) can be explained by the truth that datar(IMPRINSERT) (datar of algorithm is denoted as datar(algorithm)) is 0 (datar is the total times of leaf nodes being read from disk, and it is mainly comprised of NAsearch. Refer to Fig. 7 for more details), while datar(BASEINSERT) is relatively large.

Figure 7 shows the changes of datar when performing three algorithms. datar(BASEINSERT) increases quickly with the increase of moving object number. Because the increase of moving object number will lead to a direct increase of nodes which have time ranges overlapping with newly added line segment. When the average spatial sizes of newly added line segments and TB-tree nodes remain constant, as described in Eq. (6), the qualified number of nodes being accesses will go up. Since IMPRINSERT has a cache for latest node of moving object in the TB-tree, therefore the corresponding datar remains 0. As mentioned before, BULKINSERT only perform necessary node write-in operations, therefore, datar(BULKINSERT) is constantly 0.

Figure 9.

CPU Time VS MO.

Figure 8 demonstrates the variances of dataws (dataw is the number of caused disk accesses while algorithm writing leaf nodes to disk) when performing the three algorithms. Apparently, dataws of the three algorithms increase linearly as the number of moving objects grows. dataw(BASEINSERT) (dataw of algorithm is denoted as dataw(algorithm)) and dataw(IMPRINSERT) are generally equivalent and they both equal the number of line segments being inserted. This means with each insertion of line segment, leaf node would be written once, which is highly consistent with the evaluations of Eqs (1) and (8). And dataw(BULKINSERT) is the same as the number of leaf node in TB-tree, which matches with the description in Section 4.3 “each node only executes necessary write-in once”.

Figure 9 shows the CPU time variances of executing three algorithms. Node access (disk access) is the major part of time consumed when performing the algorithms. We can see that variance trends of CPU time of the three algorithms are almost the same to those of total node accesses (refer to Fig. 7), and their size relationships are also in line with those of total node accesses shown in Fig. 7.

6. Conclusions and future work

Compared with query performance of TB-tree, its insertion performance is often overlooked. This paper analyzes factors that affect the performance by establishing cost evaluation model for theBASEINSERT insertion algorithm of TB-tree. It is found that large and frequent node search and write-in operations are the major costs. IMPRINSERT proposed by this paper saves a lot of cost of searching but leaf node write-in still existed. BULKINSERT solves the insertion of bulk data and the cost is far smaller than IMPRINSERT and BASEINSERT. In general, the total cost of the three algorithms increase nearly linearly with the increase of moving object number. The cost models of the three algorithms in this paper investigate efficiency and influence factors from theory, they provide the basis for the evaluation, optimization, selection of execution plan of insertion algorithms. They not only make up for the deficiencies of the previous research work, but also can be used as an important reference for future work. As for future researches in this field, we will try to do as follows: (1) explore the cache strategy of nodes at upper layer so as to reduce node access volume when performing insertion; (2) when handling bulk data insertion, we can try to optimize the organization of nodes based on their spatio-temporal ranges rather than natural sequences, so that the total node density will be reduced but the query performance will be improved; (3) under the context of cloud computing, we can have TB-tree partitioned and adopt parallel algorithms to improve insertion efficiency and query performance.

Footnotes

Acknowledgments

This work is partially supported by the National Nature Science Foundation of China (No. 61202144) and Natural Science Foundation of Fujian Province of China (2015J01214). The authors also gratefully acknowledge the helpful comments and suggestions of the reviewers, which have improved the presentation.

References

Güting

R.H.

and Schneider

, Moving objects databases, Morgan Kaufmann publishers, 2005.

Jensen

C.S.

Leutenegger

S.T.

and Lopez

M.A.

, Indexing the positions of continuously moving objects, ACM SIGMOD International Conference on Management of Data ACM (2000), 331–342.

Tao

Y.F.

Papadias

and Sun

, The TPR*-tree: An optimized spatio-temporal access method for predictive queries, International Conference on Very Large Data Bases VLDB Endowment (2003), 790–801.

Liao

Tang

G.F.

Jing

et al., Hybrid indexing of moving objects based on velocity distribution, Chinese Journal of Computers 30 (2007), 661–671.

Prabhakar

Xia

Kalashnikov

D.V.

et al., Query indexing and velocity constrained indexing: Scalable techniques for continuous queries on moving objects, IEEE Transactions on Computers 51 (2002), 1124–1140.

Chon

H.D.

Agrawal

and Abbadi

A.E.

, Storage and retrieval of moving objects, Mobile Data Management (2001), 173–184.

Patel

J.M.

Chen

and Chakka

V.P.

, STRIPES: An efficient index for predicted trajectories, ACM SIGMOD International Conference on Management of Data (2004), 635–646.

Beckmann

Kriegel

H.P.

Schneider

et al., The R*-tree: An efficient and robust access method for points and rectangles, ACM Sigmod Record 19 (1990), 322–331.

Pfoser

Jensen

C.S.

and Theodoridis

, Novel approaches in the indexing of moving object trajectories, Proceedings of International Conference on Very Large Data Bases (2000), 395–406.

10.

Jae

L.E.

Ryu

K.H.

and Nam

K.W.

, Indexing for efficient managing current and past trajectory of moving object, Advanced Web Technologies and Applications, Springer Berlin Heidelberg (2004), 782–787.

11.

Chakka

V.P.

Everspaugh

A.C.

and Patel

J.M.

, Indexing large trajectory data sets with SETI, Ann Arbor, 2003.

12.

Cha

C.I.

Kim

S.W.

Won

J.I.

et al., Efficient indexing in trajectory databases, International Journal of Database Theory and Application 1 (2008), 21–28.

13.

Antoine

Ramamohanarao

Shao

et al., Recursive partitioning method for trajectory indexing, Australasian Database Conference DBLP (2010), 37–46.

14.

and Ravishankar

C.V.

, PA-tree: A parametric indexing scheme for spatio-temporal trajectories, Advances in Spatial and Temporal Databases (2005), 254–272.

15.

Chu

S.T.

Yeh

C.C.

and Huang

C.L.

, A cloud-based trajectory index scheme, 2009 IEEE International Conference on e-Business Engineering, IEEE Computer Society (2009), 602–607.

16.

Zhang

and Yuan

, The GPS trajectory data research based on the intelligent traffic big data analysis platform, Journal of Computational Methods in Sciences & Engineering 17 (2017), 1–8.

17.

Lange

Dürr

and Rothermel

, Scalable processing of trajectory-based queries in space-partitioned moving objects databases, Proceedings of the 16th international conference on Advances in geographic information systems, ACM (2008), 31.

18.

Guttman

, R-trees: A dynamic index structure for spatial searching, ACM Sigmod Record 14 (1984), 47–57.

19.

Frentzos

Gratsias

Pelekis

et al., Algorithms for nearest neighbor search on moving object trajectories, Geoinformatica 11 (2007), 159–193.

20.

Gao

Y.J.

Chen

G.C.

et al., Efficient k-nearest neighbor search algorithms for historical moving object trajectories, Journal of Computer Science and Technology 22 (2007), 232–244.

21.

Güting

R.H.

Behr

and Xu

, Efficient k-nearest neighbor search on moving object trajectories, The VLDB Journal 19 (2010), 687–714.

22.

Trajcevski

Tamassia

Ding

et al., Continuous probabilistic nearest-neighbor queries for uncertain trajectories, Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, ACM (2009), 874–885.

23.

Ding

and Boykin

, A framework for distributed nearest neighbor classification using Hadoop, Journal of Computational Methods in Sciences & Engineering 17 (2016), 1–9.

24.

Benetis

Jensen

C.S.

Karciauskas

et al., Nearest neighbor and reverse nearest neighbor queries for moving objects, Proceedings of Database Engineering and Applications Symposium, IEEE (2002), 44–53.

25.

Cheema

M.A.

Lin

Wang

et al., Probabilistic reverse nearest neighbor queries on uncertain data, IEEE Transactions on Knowledge and Data Engineering 22 (2010), 550–564.

26.

Guohui

Yanhong

Jianjun

et al., Continuous reverse k-nearest neighbor monitoring on moving objects in road networks, Information Systems 35 (2010), 860–883.

27.

Ishikawa

Iijima

and Yu

J.X.

, Spatial range querying for Gaussian-based imprecise query objects, Proceedings of 25th International Conference on Data Engineering, IEEE (2009), 676–687.

28.

Zheng

Trajcevski

Zhou

et al., Probabilistic range queries for uncertain trajectories on road networks, Proceedings of the 14th International Conference on Extending Database Technology, ACM (2011), 283–294.

29.

Chen

Tang

et al., Partition-based range query for uncertain trajectories in road networks, Geo Informatica 19 (2015), 61–84.

30.

Wang

Yao

et al., Probabilistic range query over uncertain moving objects in constrained two-dimensional space, IEEE Transactions on Knowledge & Data Engineering 27 (2015), 866–879.

31.

Patroumpas

Papamichalis

and Sellis

, Probabilistic range monitoring of streaming uncertain positions in geosocial networks, Proceedings of the 24th International Conference on Scientific and Statistical Database Management, Springer-Verlag (2012), 20–37.

32.

X.B.

and Shao

Z.P.

, Spatial temporal index-based historic closing event query for moving objects, Advances in Information Sciences & Service Sciences 5 (2013), 497–509.

33.

Kamel

and Faloutsos

, On packing R-trees, Proceedings of the Second International Conference on Information and Knowledge Management, ACM (1993), 490–499.

34.

Theodoridis

Stefanakis

and Sellis

, Efficient cost models for spatial queries using R-trees, IEEE Transactions on Knowledge & Data Engineering 12 (2000), 19–32.

35.

Theodoridis

Silva

J.R.O.

and Nascimento

M.A.

, On the generation of spatiotemporal datasets, International Symposium on Advances in Spatial Databases, Springer-Verlag (1999), 47–164.

Cost evaluation and improvement of TB-tree insertion algorithm

Abstract

Keywords

1. Introduction

2.1 TB-tree structure

2.3 Cost model

3.1 Data structure

3.1.1 Structure of root-latest-node path

Table 1 Cache structure of root-latest-node path

3.2 IMPRINSERT algorithm

4.1 Data structure

4.3 Cost model

5.1 Experimental configuration

Footnotes

Acknowledgments

References

Table 1
Cache structure of root-latest-node path