Abstract
This paper proposes the cost evaluation model for the existing TB-tree insertion algorithm called BASEINSERT. Factors that may affect its performance are analyzed in a detailed manner. While we search the latest node, the number of nodes being accessed is very prominent. Given this property, the paper comes up with an improved algorithm, named IMPRINSERT by adding the cache structures of root-latest-node path and right-most path, which reduce the number of nodes being accessed while searching for the latest node and right-most path close to zero. Under the bulk data setting, algorithm that inserts the data segment by segment is fairly inefficient, therefore this paper introduces a new insertion algorithm, called BULKINSERT and its cost evaluation model is proposed as well. Verified by a large number of synthesized datasets, it is proved that the cost models are reliable, both IMPRINSERT and BULKINSERT are effective and demonstrate nearly linear scalability.
Introduction
As wireless communication technology and location technology improved rapidly, together with the popularity of related equipment, a wide range of applications need to track and manage the locations of moving objects, to provide query and analysis service. Some core applications in this context are logistics management, fleet management, traffic analysis and management, analysis of movements of people (customers) in mobile computing, geographic information system, environmental studies, etc. [1]. As time goes by, the moving objects will generate a considerable amount of data given the possible change of their time and locations. Therefore, it needs the support of moving objects databases (MODs) and this gives rise to more researches and applications of MODs. Currently, some major work includes developing new index technologies and novel query types and algorithms. The existing index technologies for moving objects database fall into two categories: present and future-oriented query and historical trajectory query. As for the former one, the query adopts TRP-tree [2], its variant TPR*-tree [3], HVTPR-tree [4], VCI-tree [5], SV model [6], STRIPE index [7] and so on, among which, TRP-tree and its variant TPR*-tree are based on R*-tree [8] and they track current locations of moving objects, VBRs and other parameters. These data are used to infer their future locations, thus, making them the most widely adopted query methods. As for the latter one, it mainly uses TB-tree [9] and its variant TB*-tree [10], SETI [11], multi-level multiple indexes [12], RPTI [13], PA-Tree [14], Cloud-Based Trajectory Index [15, 16], DTI
All the above mentioned research findings contributed greatly to the theory and application of MODs. Built on these methods, this paper endeavors to further improve the TB-tree insertion algorithm. Though literature [9] came up with the structure of TB-tree and realized the basic insertion algorithm (insert the data segment by segment, hence called BASEINSERT), there are still three problems: (1) the cost evaluation of the insertion algorithm was overlooked, which leads to efficiency uncertainty in actual applications. It may not help the system to choose the efficient execution plans, nor can it identify weak links of the algorithm, making it difficult to improve; (2) when search for latest node, there are too many disk accesses; (3) another drawback is lack of bulk data-oriented insertion algorithm, leaving it fairly inefficient under bulk data setting. This makes it highly necessary to develop a new insertion algorithm to cater to bulk data.
The major contributions of this paper are summarized as follows:
Propose cost evaluation model for BASEINSERT. Propose improved insertion algorithm (hence, called IMPRINSERT) and its cost evaluation model. Present insertion algorithm for bulk data (hence, called BULKINSERT) and its cost evaluation model. Based on many experiments of synthesized dataset, reliability of the proposed models, effectiveness and efficiency of the improved and new algorithm are verified.
Related work would be mentioned in Section 2 of this paper. Section 3 presents the IMPRINSERT algorithm and its cost evaluation model. Section 4 introduces the BULKINSERT algorithm and its cost evaluation model. Section 5 provides the experiment results and analyses and Section 6 concludes the paper with directions for future work.
Section 2 starts by introducing TB-tree structure and BASEINSERT algorithm, and then proposes the cost evaluation model for BASEINSERT algorithm and analyzes its weakness. At the end, some possible solutions are pointed out.
TB-tree structure
Within the existing spatio-temporal data storage structures, the R-tree family is the most popular index structure. As one of the variants of R-tree, TB-tree embodies the general properties of R-tree, using MBR (Minimum bounding rectangle) to approximately describe the spatio-temporal range of included line segments, the MBR of upper-layer node covers all the MBRs of lower-layer sub-nodes and the pointers which point to the lower-layer sub-nodes are contained in upper-layer node. However, TB-tree is specially designed for trajectory preservation of moving objects. When inserting a new line segment of trajectory of certain object, it will search leaf node containing the previous line segment of the trajectory (hence, called latest node) and use it as target position. In addition, one leaf node only stores the trajectory line segments from one object and the leaf nodes storing the same object are connected to each other by pointers. The extending sequence of leaf nodes go from the left to the right. As shown in Fig. 1, c1, c3, c7, c8, c10, c12 are the leaf nodes that store the same object.
Such design takes some spatial discrimination away from the TB-tree, however, it brings sound trajectory preservation which makes the trajectory line segments within and between the leaf nodes well-organized. Thus, the following algorithms do not need to do sorting and it is much easier for the execution of pruning algorithms. Therefore, the performance of trajectories query (in particular, combination query) has been greatly improved [9].
Structure of TB-tree [9].
BASEINSERT algorithm of TB-tree [9].
As mentioned above, the structure of TB-tree has advantages in trajectories query, but the performance of its insertion algorithm has not been paid enough attention and no cost evaluation model of it is provided. This subsection will introduce the existing BASEINSERT algorithm, for the preparation of subsequent construction of its cost evaluation model, performance analysis and improvement. Specifically, BASEINSERT algorithm mentioned in literature [9] is shown in Fig. 2.
When executing the insertion of new line segment of an object, to achieve trajectory preservation, the new line segment must be inserted into latest node or be stored in the newly split leaf node which has two-way linkage with the latest node. Therefore, the search for latest node is a must process (please refer to line 1). Given the feedback of the operation, the following operations should be handled in three different ways. The first condition: if latest node has space, then the new line segment can be directly inserted in latest node (line 4); the second condition: if there is no extra capacity in latest node, then create a new leaf node (line 6) and store in it the newly added line segment. The new leaf node and latest node are pointing to each other with two pointers, in this way, trajectory preservation can be realized; the third condition: if no latest node is found, then create a new leaf node (line 8) and it is used as the first node of the object and stored in the newly added line segment. When handling the second and third conditions, algorithm must first find the path starting from root node to the penultimate internal node on the rightmost side of the TB-tree (hence, called structure of right-most path), and insert the new leaf node in the right-most path. If the right-most path is full, then split a new node within the penultimate layer to serve as a node for right-most path and then insert in it the newly added leaf node. If the parent node of newly split node in the right-most path is also fully occupied, then the split operation can propagate upwards until a new root node is generated.
Cost model
Apparently, the disk access cost of BASEINSERT algorithm will be affected by the height of the tree, node fanout, whether there is cache or not, the variety of data source (number of moving objects), velocity and direction of moving objects, the arrival consequences of all moving objects and uniformity. Therefore, it is hard to directly conduct the total cost evaluation. This paper will estimate the disk access cost of inserting single line segment.
Let
where NAwrite is the node access for writing newly added line segment to latest node with extra storage capacity or newly split leaf node,
When newly added line segment is written to leaf node, it may lead to expanding of the MBRs of the nodes in upper layers and this could cause the update of these nodes. When there is no statistical data, we can assume that when we insert a new line segment, half of the MBRs of the nodes between insertion point and root node will change, therefore,
NAtopright is the node accesses during the search for right-most path when latest node is full,
NAsearch is the node accesses during the search for latest node. Any TB-tree node with its MBR overlapping with the MBR of newly added line segment will be traversed and included in NAsearch. Next, we will check the nodes being queried and accessed in the leaf layer. The temporal dimension will be checked first. We assume that all data of moving objects has uniform time interval and their sequence is stable, then in the leaf layer, the number of nodes overlapping with newly added line segment is around 0.5MO. Next, we look into spatial dimension. The probability of spatial overlapping for each leaf node with the MBR of newly added line segment is:
where
Therefore, the number of leaf nodes which has spatio-temporal overlapping with MBR of newly added line segment is around:
Since the access of leaf nodes account for a major part of total node access, we can assume NAsearch is equivalent to NAsearchleaf, i.e.
The above mentioned node accesses is the estimated cost for single line segment insertion inBASEINSERT algorithm. The cost of writing each newly added line segment in the leaf node, NAwrite is 1. This part of cost is only affected by data volume rather than the height of tree. The update of upper-layer nodes and the search for right-most path are related to the height of tree. When inserting data, as the height of tree grows, cost of these two parts increases accordingly. NAsearch and MO have direct relationship, and when MO is relatively large (e.g. MO is over 1000), NAsearch would be considerable, thus having a big impact on the performance of the algorithm. Therefore, reducing or avoiding NAsearch is the major way to improve performance of BASEINSERT.
Based on the BASEINSERT algorithm and its cost component analysis mentioned in the previous section, this section proposes targeted improvement, including newly added cache structures and improved IMPRINSERT insertion algorithm. Finally, the cost analysis of IMPRINSERT is conducted in a detailed manner.
Data structure
Structure of root-latest-node path
The cache structure of root-latest-node path is shown in Table 1, where id is the unique identifier for moving object, Node address chain is consisted of all the address information of the nodes on the path from root node to the latest node of an object, and content of leaf node includes all the information of the latest node. During the insertion of moving object, frequent query and access of latest node are performed. Caching the structure of root-latest-node path can reduce invalid disk access substantially. The reason for storing the content of leaf node in root-latest-node path is to verify whether leaf node is fully occupied and this can get rid of the disk access of leaf node.
Cache structure of root-latest-node path
Cache structure of root-latest-node path
Node address chain can be used to pinpoint the path from root node to the latest node of an object. When a newly added line segment is inserted into latest node and needs bottom-up MBR update, the program would read and update contents of nodes sequentially according to the information on the node address chain. If there is no node address chain that can offer information, then, all the nodes in the tree need to add a pointer pointing to its parent node (similar to the practice of [10]), which will lead to the following disadvantages: each node occupying more space, the storage space of the total tree growing accordingly and its operation getting more complex. The node address chain only saves addresses of nodes and does not include their contents, thus avoiding taking up too much memory storage when there are large number of moving objects need to kept in the structure.
The maintenance rules of the cache structure of root-latest-node path are as follows:
Before a line segment of object is inserted to TB-tree, it needs to track the related record in the structure with the id information of the object. If there is a matching record, then read the content in latest node and decide whether it can be inserted directly:
If the node still has storage capacity, then perform the insertion and update. If there is no extra storage capacity, then split the node, find the right-most path and insert into the newly split node. At this moment, the correlated root-latest-node path needs to be updated. If there is no matching records with the id information of the object in the root-latest-node path, then execute the similar query, insertion and update operations to BASEINSERT and create a root-latest-node path record dynamically. If new root node is generated during the data insertion, then all the node address chains in the root-latest-node path should add the address information of the new root node.
As described in Subsection 2.2, during the data insertion process, if latest node is full or there is no latest node, BASEINSERT algorithm needs to split new node and search right-most path. Frequent searching for right-most path and accessing contents of its nodes would cost some resources (refer to Eqs (1) and (4)). To overcome such deficiency, the cache structure of right-most path comes into being. It includes addresses and content information of all nodes on the right-most path. When performing the insertion requires obtain these information, it can be accessed through the cache, thus, avoiding the cost of querying and accessing it through disk. Structure of right-most path is created automatically when the first node is generated in the tree. When right-most path is inserted with new leaf node or it splits new nodes, the structure also update simultaneously so as to ensure it remains the most updated status.
IMPRINSERT algorithm.
As mentioned above, there is only one right-most path structure for a TB-tree, and the main memory space consumption can be expressed as
As for the root-latest-node path structure, it has a number same to that of moving objects. Each root-latest-node path structure contains the information of one leaf node, several node addresses and one object id. So the main memory space consumption is
IMPRINSERT algorithm
The pseudo code of IMPRINSERT based on the newly added data structures is shown in Fig. 3, when executing the new line segment insertion for a moving object, algorithm will first get the information about the latest node and address chain from the structure root-latest-node path with its id information (refer to line 2). If the latest node is found and it is not full, then insertion will be performed, and contents of the related upper layer nodes will be adjusted according to the address chain (refer to lines 3 and 4). If the latest node is nowhere to be found or it is full, then split leaf node and store in a new line segment (refer to line 6). Then insert new leaf node through structure of right-most path (refer to line 7). If the nodes on right-most path need to be split, then timely update of the structure of right-most path is required (refer to line 10). Since the improved TB-tree algorithm makes full use of two newly added cache structures, it avoids the complicated search for latest node or top-down search of right-most path from root node. In this way, it has reduced the cost substantially.
Relationship between bucket and segment list.
Compared with BASEINSERT, the improved IMPRINSERT algorithm saves the access cost in the following two ways:
Because latest node can be located through the cache structure of root-latest-node path, the cost of NAsearch is eliminated. However, when latest node is not full, the execution of new line segment leads to bottom-up MBR update, which requires reading the related nodes on the path from latest node to root node. This means extra Given the existence of right-most path cache, NAtopright can be eliminated. Above all, the cost of executing line segment insertion by IMPRINSERT algorithm is as follows:
When data of moving objects have been collected and stored in a data file, if we read the data file consequently and execute algorithms mentioned in Section 2 or Section 3, the task can be fulfilled, yet, the cost of disk access is still not optimized. In this regard, this section explores bulk insertion algorithm of TB-tree (named BULKINSERT). Its fundamental principle is as follows: according to the arriving time sequence of data of moving objects, contents of data nodes will be organized well in main memory, and then based on the structure of TB-tree, nodes are written directly to disk so that the total access cost is minimized.
Data structure
BULKINSERT algorithm will use three basic data structures, including the point, line segment and bucket and the details are described as follows:
The point structure StruPoint includes data coordinate
BULKINSERT algorithm.
A detailed description of BULKINSERT algorithm is shown in Fig. 5. The original data file includes the data of all line segments of moving objects. The data is stored according to the original sequences (the ending time sequence of line segment). To ensure each leaf node of TB-tree only stores sorted data of one moving object, the algorithm does sorting for Seglist based on id of moving object and the starting time of line segment. Then, based on the value of fanout, Seglist comes up with a bucket list named Bucketlist (refer to lines 3 and 4). At this moment, Bucketlist is only a split of Seglist, the items on Bucketlist still maintain their order based on id of moving object and the starting time of line segment.
But TB-tree sorts the leaf nodes from left to right based on the ending time sequence of the first line segment. Therefore, in line 5, algorithm does the required sorting for Bucketlist. In line 6, algorithm confirms values of pre and next pointers for each record in Bucketlist and ensures the previous leaf node and next leaf node of same moving object can point to each other. In line 7, algorithm turns line segment information of each bucket in Bucketlist into total MBR of corresponding leaf node. Hence, content of leaf node is completely constructed and written to leaf node data file sequentially. Internal node at upper layer must be constructed according to the information (including MBR, node address) of the items provided by nodes at lower layer. Starting from leaf node layer, each node at each layer provides its address and MBR (comprised of MBR of each item in the node) to parent node. This information will be an item in the parent node. Each internal node is constructed based on the information provided by lower layer and then stored in the location assigned by sequence.
Cost model
In Fig. 5, the discussion about time complexity of BULKINSERT algorithm can be discussed in three parts, including the treatment of line segments (steps 1–4), the treatment of buckets (steps 5–7) and the construction of data of internal nodes (step 8). The treatment of line segments includes sorting and cyclic treatment for Seglist, their time complexities are
Total node access VS MO.
All the insertion algorithms discussed in this paper construct the same TB-tree, including same basic TB-tree structure and same contents in nodes, and there is no difference in terms of query performance. Therefore, this paper mainly compares the cost of three insertion algorithms. To be more specific, the paper compares their indices of node access number and time of CPU, among which, node access number is the most stable factor and it is free from the impact of different machine configurations and demonstrates good comparability. However, CPU time will be affected by data node access number and it is vulnerable to configurations of different machines, therefore, the value of it just serves as a reference.
Experimental configuration
The three algorithms are coded in C language and the algorithms have been running 100 times under different configurations. The average values are selected for evaluation. The machine used in experiment has the following configuration: Intel Core i5 2410 M 2.30 GH processor, 4 GB memory, 500 GB disk and Ubuntu12.03 OS. In line with literature [9], this paper also adopts the GSTD data generator [35] to generate several synthesized spatio-temporal datasets. These synthesized spatio-temporal datasets are comprised of 100, 200, 400, 800 and 1600 moving objects respectively. Each moving object is sampled for about 5,000 times. The data points are distributed in unit square and initial location of all moving objects conforms to Gaussian distribution and later, their movement follows uniform distribution. The speed of moving objects conforms to Gaussian distribution as well. The mean is 2 (moving distance per each unit of time), the maximum is 14.7, the minimum is 0 and variance is 0.4. The size of index page for TB-tree is fixed at 4,096 bytes.
datar VS MO.
dataw VS MO.
Figure 6 shows the changes of total node accesses of the three algorithms compared to the variance of number of moving objects. As the number of moving objects grows, the total node accesses of three algorithms increase nearly linearly. NODE(BASEINSERT) (node access of algorithm is denoted as Node(algorithm)) is constantly larger than NODE(IMPRINSERT) and far larger than NODE(BULKINSERT). The reason behind is that BULKINSERT only write nodes when necessary, i.e. for each node, there will be only one write-in. According to Eqs (1) and (8), both BASEINSERT and IMPRINSERT have great loads of write-in, especially in the leaf node layer, where each insertion of line segment requires one write-in operation. In addition, BASEINSERT requires large number of accesses when searching for target node. The relationship that NODE(BASEINSERT) is larger than NODE(IMPRINSERT) can be explained by the truth that datar(IMPRINSERT) (datar of algorithm is denoted as datar(algorithm)) is 0 (datar is the total times of leaf nodes being read from disk, and it is mainly comprised of NAsearch. Refer to Fig. 7 for more details), while datar(BASEINSERT) is relatively large.
Figure 7 shows the changes of datar when performing three algorithms. datar(BASEINSERT) increases quickly with the increase of moving object number. Because the increase of moving object number will lead to a direct increase of nodes which have time ranges overlapping with newly added line segment. When the average spatial sizes of newly added line segments and TB-tree nodes remain constant, as described in Eq. (6), the qualified number of nodes being accesses will go up. Since IMPRINSERT has a cache for latest node of moving object in the TB-tree, therefore the corresponding datar remains 0. As mentioned before, BULKINSERT only perform necessary node write-in operations, therefore, datar(BULKINSERT) is constantly 0.
CPU Time VS MO.
Figure 8 demonstrates the variances of dataws (dataw is the number of caused disk accesses while algorithm writing leaf nodes to disk) when performing the three algorithms. Apparently, dataws of the three algorithms increase linearly as the number of moving objects grows. dataw(BASEINSERT) (dataw of algorithm is denoted as dataw(algorithm)) and dataw(IMPRINSERT) are generally equivalent and they both equal the number of line segments being inserted. This means with each insertion of line segment, leaf node would be written once, which is highly consistent with the evaluations of Eqs (1) and (8). And dataw(BULKINSERT) is the same as the number of leaf node in TB-tree, which matches with the description in Section 4.3 “each node only executes necessary write-in once”.
Figure 9 shows the CPU time variances of executing three algorithms. Node access (disk access) is the major part of time consumed when performing the algorithms. We can see that variance trends of CPU time of the three algorithms are almost the same to those of total node accesses (refer to Fig. 7), and their size relationships are also in line with those of total node accesses shown in Fig. 7.
Compared with query performance of TB-tree, its insertion performance is often overlooked. This paper analyzes factors that affect the performance by establishing cost evaluation model for theBASEINSERT insertion algorithm of TB-tree. It is found that large and frequent node search and write-in operations are the major costs. IMPRINSERT proposed by this paper saves a lot of cost of searching but leaf node write-in still existed. BULKINSERT solves the insertion of bulk data and the cost is far smaller than IMPRINSERT and BASEINSERT. In general, the total cost of the three algorithms increase nearly linearly with the increase of moving object number. The cost models of the three algorithms in this paper investigate efficiency and influence factors from theory, they provide the basis for the evaluation, optimization, selection of execution plan of insertion algorithms. They not only make up for the deficiencies of the previous research work, but also can be used as an important reference for future work. As for future researches in this field, we will try to do as follows: (1) explore the cache strategy of nodes at upper layer so as to reduce node access volume when performing insertion; (2) when handling bulk data insertion, we can try to optimize the organization of nodes based on their spatio-temporal ranges rather than natural sequences, so that the total node density will be reduced but the query performance will be improved; (3) under the context of cloud computing, we can have TB-tree partitioned and adopt parallel algorithms to improve insertion efficiency and query performance.
Footnotes
Acknowledgments
This work is partially supported by the National Nature Science Foundation of China (No. 61202144) and Natural Science Foundation of Fujian Province of China (2015J01214). The authors also gratefully acknowledge the helpful comments and suggestions of the reviewers, which have improved the presentation.
