Abstract
Environment monitoring is one of the typical application scenarios of the wireless sensor networks. As an energy limited system, most of the energy consumption is for the data transmission. As a well-known principle, the difference among the physical parameters of adjacent nodes is approximate a constant. Eliminating these data to be transmitted will lead to remarkable energy saving. A correlative pattern based data aggregation mechanism following this principle is proposed in this paper, which is named the Correlative Pattern based Data Aggregation (CPDA). CPDA mines the correlations of every adjacent nodes pair, and generates a correlation graph of the network, then builds an aggregation routing tree for each connected component of correlation graph based on the shortest path methodology. Following the CPDA algorithm, a node’s sensed data will be suppressed when the data and the children’s match the restriction that is defined by CPDA. When the aggregated data arrive at the Sink node, all the data can be recovered. The recovery error will be limited within a specified small error threshold based on the reversed mechanism. The simulations based on the data set of Berkeley lab show that CPDA has excellent performance in aggregation degree and average error. Further more, a real established temperature sensing experiment also gives the same conclusion.
Introduction
In order to make sure the reliability and accuracy of monitoring data, the monitoring area must be covered inside the communication range of the sensor nodes [15]. Thus a large number of sensor nodes should be deployed as the wireless sensor networks (WSNs). Because of the gradient feature of physical phenomenon in a certain space, the differences of the monitoring data among adjacent sensors are approximate constants. And this is the reason why a mass of spatial redundant data generated [12, 18]. Obviously, the transmission of those redundant data will consume the limited energy of sensor nodes, cause the communication interference of networks and the wasting of limited communication bandwidth [1, 19]. As shown in Fig. 1, where node A/B/C/D/E monitor the environment temperature and T * is the monitored temperature value of each node. Due to the location gap between node A/C and B/D is approximated the same, the differences of monitored temperature values between node A/C and B/D are also very close. Due to the gaps of temperatures of adjacent nodes are constants in this scenario and the nodes will transmit spatial redundant data.
Because the monitored data will be transmitted to sink node via a certain multiple-hop routing path. The correlative data will meet at a certain intermediate node consequently. It is possible to aggregate the spatial correlative data to reduce the amount of communication just within these intermediate nodes. Currently, data aggregation researches mainly focus on how to deploy nodes or route data to improve the encountering probability of the correlative data. And, when the redundant data encounters, the present method is just simply merging the corresponding data into a single packet and is rare about how to identify, analyze and aggregate the correlative data. The lack of correlation decision and the limitation of aggregate method makes these studies difficult to be applied in real environment.
Based on the analyzing of the correlative environments, a method to mine the correlative pattern between nodes is designed in order to solve the above problems. In this paper, the method is called CPDA mechanism. Based on the correlative pattern, CPDA can not only suppress correlative data at intermediate nodes, but also recover the original data within a deviation threshold at the Sink node. Compared with the existing algorithms, CPDA improves the aggregation degree and the data accuracy.
The remainders of the paper are organized as follows. In Section 2, we give a brief overview of the related works. In Section 3, we propose the algorithm that is the object of our research work. We present the experiments and results of CPDA in Section 4. and conclude this paper in Section 5.
Related works
The main purpose of the data aggregation is to reduce the quantity of data transmission by suppressing the correlative information. Therefore, how to identify and suppress the correlative data are the exact problems.
For purpose of reducing the data transmission, literatures [2, 17] use simple aggregation methods (such as MAX, MIN, AVG, SUM) for aggregating, and does not consider the correlations. Even though a higher aggregation performance is achieved by adopting such methods, the accuracy of recovered data is badly poor. For those applications that high data accuracy is demanded, they are inappropriate. LEACH [2], a typical representation of these method, divides the network into several clusters, and aggregates data using AVG method at clusters head, thereby reduces the network data traffic.
Exploring the correlation among sensor nodes, literature [6] and [22] use the Cosine Similarity and Euclidean Distance of the data to identify correlative nodes, respectively. And literature [11] uses the space distance between nodes for the same purpose. These methods are designed to simply discover the similarity among data, but as what is well known, the meaning of
Further, Yuan [20, 21] points out that the distance between nodes does not reflect the correlation of nodes, and proposes a data density model, DDCD, to present the correlations. In the DDCD, the data density is adopted to determine the correlations degree of nodes. In literature [13], Wang mines the data correlation according to a pre-defined data criteria. Sujit [4] adopts the temporal prediction information to be the compensation of spatial-depended data, and mines the farther data correlation and data redundancy.
Literatures [3–6, 23] divide the sensor nodes into different clusters by adopting the method similar to those above ones. Then choose one or more nodes in each cluster as the representative to collect and send data and deactivate other nodes (or schedule the nodes in a same cluster to collect and send data in turn). The node’s energy can be saved significantly with these methods, but they will cause some important data loss because a large number of nodes are deactivated. YEAST algorithm [11] is a typical representative of these methodologies, which divides sensor nodes into different units based on the space distance between nodes and only a single node is activated in each unit. As the original objective of some cases, a large number of redundant nodes are deployed to ensure the reliability and accuracy.
From the above, it can be found that current data aggregation methods had not discovered all the spatial correlation between nodes yet, and the quality of recovered data is too bad to meet the application requirements. According to the temperature experiment referred in this paper, it is found that the temperature difference between many nodes is approximately a constant with a given period. So it is possible to pick up a constant value c to fit the difference of the temperature value of every double-node pair. Once the fitting error is less than a given threshold, we can say that the two nodes are correlative and the correlation pattern is c. As an example, considering an intermediate node received the data of two nodes, marked as d 1 , d 2 . If d1–d2 ≈ c, it is possible to send the d 1 to Sink, and use d2 ≈ d1–c to recover the value of d 2 at Sink end. Otherwise send both of d 1 and d 2 . By adopting this method, a large amount of correlative data can be suppressed, and the recovery data accuracy is guaranteed at the same time. This is the basic idea of the CPDA algorithm proposed in this paper.
Algorithm description
In this section, the CPDA Algorithm will be introduced in details. As the prerequisite, a WSN with n nodes is referred, in which the nodes collect data in a limited range without direction, and the data are sent to the Sink periodically in multi-hop way. The symbols and parameters that to be used are shown in the Table 1.
Correlative decision method
Because of the gradient feature in the physical environment, data collected by neighboring nodes are approximately the same, or their difference is a constant. According to this principle, a correlative identify method is proposed as below.
Assuming that (1, 2..., m) and (1, 2, ...m), are the most recent m data of node N1 and N2 respectively. The correlative decision process of N1, N2 works as follows:
1. Calculate the difference between N1 and N2, which is dif = (d1, d2,..., d m ), where ;
2. Calculate the mean value of dif:
3. Calculate the fitting error (standard deviation)
4. If ∣
5. Otherwise, N1 and N2 are not correlative.
By the above procedure, not only the correlative decision of N1 and N2, but also their correlative pattern are discovered. When the intermedia node receives the data from N1 and N2, the correlative error can be evaluated by the intermedia node according to the correlative pattern. The correlative error is the following:
Mining correlative pattern
Based on the method in Section 3.1, CPDA can use the recent m data of all nodes that stored in D to discover the correlation between nodes at the Sink, and save the correlative pattern in the correlative matrix C. For discovering all the correlation for each double nodes pair may cause massive computation, CPDA only mines the correlation of adjacent nodes to reduce the computational complexity. The discovering process is shown as Algorithm 1.
Constructing aggregation tree
As the description in Section 3.1, the redundant data can only be aggregated at the exact node at which the correlation data encounter. To decrease the latency is not the only objective [10, 16]. A well designed routing strategy may cause such encountering as soon
1:
2:
3:
4:
5:
6:
7:
8:
9:
as possible with a very high probability. For this purpose, the correlative matrix C can be regarded as a correlation graph, where the neighbor nodes are highly correlative. And this is the spirit of
Suppose that GetRoot(U) returns the node closest to the Sink in nodes set U (based on adjacency matrix A and Dijkstra algorithm, we can easily get the shortest path of every node). The construction process of Aggregation Tree is shown as Algorithm 2. In details, Algorithm 2 is an instance of classical Dijkstra algorithm.
Data aggregation process
When finishing building the aggregation tree, all the nodes make it clear that which nodes are their child nodes and what is the correlation mode among them. Therefore, the data which matches the aggregation condition could be suppressed when those data reaches the exact node.
However, when a node suppresses some of its children’s data, it may cause an error that within the range of (l, u), and the same to its father node. In some extreme cases, the error would accumulate along the routing path. When the accumulative error is beyond the range of (–e, e), it is hard to recover the final data at the Sink with a low error level. In order to avoid the error accumulation, the accumulative error has to be sent to the parent node with the aggregation data, when the data’s accumulative error of a certain child exceeds the range (–e, e), CPDA will not suppress this
1:
2:
3: U = {1, 2, . . . , n} ;
4: V =∅ ;
5:
6: root = GetRoot (U) ;
7: V = root ;
8:
9: U = U - V ;
10: TempV =∅ ;
11:
12:
13:
14:
15:
16: TempV = TempV∪ j ;
17:
18:
19:
20:
21: V = TempV ;
22:
23:
child’s data. By this way, all the data can be recovered within the error threshold e. For those data that can’t be eliminated, CPDA combines them in to one packet, so each node only send one packet in every collecting cycle. The detail aggregation process is shown as Algorithm 3, it is suitable for all node.
Data recovery process
Following the method described in Section 3.4, the data can be aggregated gradually. And as the endpoint of the transmission, the Sink node would receive several aggregation packets from root nodes. These aggregation packets contain the actual data collected by roots and some other nodes. Based on these actual data and the correlation patterns stored in M, CPDA can recover some data in the way like Section 3.1. And then CPDA uses the recovered data to recover other data, until all data are recovered. In this procedure, CPDA adds the recovered data in to D to update the recent m data. The detail process of data recovery is shown in Algorithm 4.
1: E = {x∣ - e < x < e} ;
2:
3:
4:
5:
6:
7: err = d - v -
8:
9:
10:
11:
12:
13:
14:
15:
16:
17:
1:
2:
3:
4:
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1:
2: d = kv .
3:
4: d = v - c ;
5:
6:
7:
8:
Dynamic update
The correlation is not a long-term factor for natural environment. Thus, the correlation pattern and aggregation tree need to be updated according to the most recent data. After the latest m-cycle data received by the Sink, CPDA re-calculates the correlation pattern of these nodes, and re-constructs the aggregation tree according the new correlation matrix C. Besides, the only nodes should be informed if whose parents, children or correlation patterns changed. Otherwise, CPDA would not send update message, which greatly reduce the amount of update messages. By the above method, CPDA can update the correlation patterns and Aggregation Tree according to the most recent data.
Simulation and result
In this section, experiments based on Matlab are designed to verify all the strategies proposed above. As the references, LEACH (LEACH uses AVG operation to aggregate data) algorithm and YEAST algorithm are both introduced in. The comparisons will be set in terms of aggregation degree (avg. degree), average error of recovered data (avg. error), average latency (avg. latency) and average energy (avg. energy). Where aggregation degree is the ratio of the received data count and the total data quantity, and average energy is the energy consumption of one node in one cycle.
The pattern deployed in simulation is from the 1000-6000th cycles real temperature data in Intel Berkeley Lab Dataset which collects data from 54 sensor nodes. In order to emulate the fast changing environment and limit the computing scale in simulation, we use five continuous data to mine correlation according to multiple simulations. That is to say m is set to 5. The distribution of all nodes is shown in Fig. 2 (the detail location of each node is provided in Intel Berkeley Lab Dataset). Assume that the communication radius of all nodes is 10 meters, and one unit energy will be consumed for sending or receiving one packet. The computing energy is ignored.
Individual simulation of CPDA
In the first experiment, the data from all these 54 sensor nodes are used to verify the CPDA algorithm at some different error threshold e. The results are listed in Table 2. It can be discovered that the aggregation degree and average error are both proportional to the error threshold e, while average energy consumption and average latency are just the opposite. The reason is the bigger error threshold e makes more data matching the formulation err = ∣––c∣ ≤ e, thus more redundant data could be suppressed. And the average error is increasing for the same reason. Moreover, bigger e makes more nodes correlated, by which the connectivity of the correlation graph is increased. And it is possible to find a couple of shorter paths and reduce the latency. At the same time, a shorter routing path and a higher aggregation degree make energy consumption reduce.
Comparison of three algorithms
In order to verify the performance of the CPDA algorithm in different scale networks, a comparison among those three algorithms is drawn according to the data from 1-10th, 1-20th, 1-30th, 1-40th, 1-50th nodes. The error threshold e of CPDA algorithm is set to 0.5, and other parameters are the same as the above.
As shown in Fig. 3, the aggregation degree of CPDA algorithm is no less than 0.8. Obviously, CPDA is better than the other two algorithms. This is because of the full usage of spatial correlation, and continuous data aggregation the data on the routing path in CPDA gradually, while the other two algorithms only aggregate data at cluster head. The aggregation degree of YEAST and LEACH are decided by the number of clusters which are related to the structure of network. For example, the distribution of 1-20th nodes are asymmetry, which increases the number of cluster heads and leads to a lower aggregation degree significantly.
Figure 4 shows the average error of the three algorithms against different node numbers. The average error of CPDA is extremely lower than YEAST and LEACH. This is because CPDA ensures the recovery error of every data less than error threshold e, while YEAST and LEACH only use the head’s data or the average value of all data to represent other nodes.
Figure 5 shows the average latency of these three algorithms. The average latency of CPDA is superior to LEACH, but is worse than YEAST, due to YEAST selects the node closest to the Sink as the cluster head and activates the only node. So the latency of every active node is shortest. However, each node in LEACH should be activated and can be selected as the cluster head. Once the cluster head is far from the Sink, it will make the latency of cluster member bigger. So the average latency of LEACH is not determinate. For CPDA, it selects the node closest to sink as the root and uses the shortest path routing for each connected component, this makes the data transferring path extremely close to the global shortest path. Therefore, the latency of CPDA is lower than LEACH.
Figure 6 shows the average energy consumption of three algorithms. Obviously, CPDA is superior to LEACH, but is worse than YEAST. This is because the YEAST only activates one node in a cluster, by which reduce the energy consumption greatly, while LEACH algorithm keep all nodes activated. At the same time, higher average latency and lower aggregation degree make the energy consumption of LEACH much more than CPDA.
Experiment on real nodes
As a real case, a sensor network in our lab is established, of which the snapshot is shown in Fig. 7. For each node, a temperature sensor and CC2530 controller are embedded, of which the structure is shown in Fig. 8. The node samples the real-time temperature every 1 minute. The scale of this sensor network is about 15 nodes. We implemented the CPDA, YEAST and LEACH algorithms in these nodes based on Zigbee Stack. Due to there is no energy monitoring circuits inside the nodes, the comparison of energy saving can not be achieved directly. Instead, a statistics result of transmitted packet counts of each algorithm is recorded. And the effect on energy saving can be roughly made according to the packet count. Also, the aggregation data and real data are recorded to calculate of the recovery error.
And for the pre-set parameter, the error threshold of CPDA is 0.5 because a 0.5 centigrade precision is enough for the indoors environment. The hops that each packet have be transferred are also recorded to indicate the average latency.
The experiment is online for 10 days. The results are summarized in Fig. 9. The CPDA’s aggregation degree is 0.893 that is higher than the above simulation result. This value is tightly associate with the monitored area. And because our monitored area is smaller than the Intel Berkeley Lab, which makes the node’s correlation more stronger and result in a higher aggregation degree. The average error of CPDA is about 0.148 that is remarkably lower than the YEAST and LEACH algorithms. And the energy consumption and average latency of three algorithm are quite satisfactory. Figure 10 is a snapshot in 15 minutes of these 10 days of three nodes, where Node A is the father node and Node B and C are the children nodes. It can be found that in case of the three sensed temperature value are very close, the aggression count is 3 that means the two duplicated data are suppressed. Meanwhile, the aggression count is 2 once one of the sensed temperature value goes far. Obviously, following the CPDA methodology the redundancy of sensed data is eliminated.
Base on the above comparison and analysis, CPDA offers a higher aggregation degree and keep a lower average error than LEACH and YEAST. But in terms of energy consumption, CPDA is in general.
Conclusion
We propose a correlative pattern based data aggregation mechanism to suppress correlative or redundant data in wireless sensor networks in this paper. By discovering the correlations between neighboring nodes, CPDA eliminates correlative data and recovers the original data within a specific error threshold. Thus the performance and energy consumption are extraordinary for the wireless network. The simulation based on the real temperature data show the aggregation degree of CPDA is quite high, while the average error is lower, compared with LEACH and YEAST algorithms.
CPDA algorithm is designed only to eliminate spatial correlation currently. As the future direction, not only the spatial correlation but also the temporal correlation will be considered, by which better performance can be achieved.
Footnotes
Acknowledgments
This work is partially supported by Program for New Century Excellent Talents in University (NCET-12-0164); National Natural Science Foundation of China (61370094); Natural Science Foundation of Hunan (13JJ1014).
