A dual synchronization prediction-based data aggregation model for an event monitoring IoT network

Abstract

Abrupt changes in sensor measurements, which indicate the occurrence of an event, are a major concern in many monitoring applications of IoT networks. Prediction-based data aggregation in wireless sensor networks plays a significant role in detecting such events. This paper introduces a prediction-based aggregation model with sensor selection, named the grey prediction model and Kalman filter-based data aggregation model with rank-based mutual information (GMKFDA-MI). It has a dual synchronization mechanism that aggregates the data and selects the nodes based on prediction and cumulative error thresholds. Furthermore, the nodes are clustered after deployment using K-medoids clustering together with the salp swarm optimization algorithm to obtain an optimized aggregator position with respect to the base station. Efficient clustering promises energy efficiency and better connectivity. The experiments are carried out on real-time datasets from air pollution monitoring applications, and the results of the proposed method are compared with similar state-of-the-art techniques. The proposed method promises high prediction accuracy, low energy consumption, and enhanced network throughput. Energy savings of 10 to 30% are recorded for the proposed model when compared with other similar approaches, and the proposed method achieves 97.8% accuracy. The method works efficiently in applications such as event reporting, target detection, and event monitoring.

Keywords

IoT; wireless sensor networks; data aggregation; grey model; Kalman filter; mutual information; K-medoids clustering; salp swarm optimization

1 Introduction

In this era of the Internet of Things (IoT), data plays a crucial role. The devices and systems used in IoT technologies generate and collect data before sending them to a server or the cloud. Applications such as tracking human or animal activities, trespassing, and social interactions depend on the data provided by these devices. Handling this data with great precision is a challenging task. To accomplish this, an IoT framework that intelligently extracts useful knowledge from the data is required. Methods that reduce redundancies and anomalies in the data are a good solution to this concern. Employing these methods makes the system more reliable and enables the data to be handled with greater efficiency and less complexity. The data needed for applications such as surveillance, tracking of human activities, and monitoring is sensed using various sensors or actuators. These sensor devices must be connected in a network, which is where the wireless sensor network (WSN) comes into the picture. A wireless sensor network is an interconnected system of homogeneous or heterogeneous low-cost sensor nodes capable of self-configuration, sensing, and data processing. The ad-hoc nature of this network enables it to self-organize according to the needs of the application. Many event monitoring applications require a large number of data transmissions from each sensor node to a central controller, where the data is further processed. This procedure consumes a large amount of energy and reduces the lifetime of the network. Data aggregation is a suitable solution to this challenge.

Aggregation aims to reduce energy consumption by processing the raw data either at the sensing node or at intermediate nodes, also called aggregator nodes. There are many ways of aggregating the data collected by sensor nodes. Some methods use conventional aggregation functions such as min, max, sum, and average [1–3]. Using mobile nodes to collect data from various nodes and aggregate them is another technique used in monitoring applications, but its drawback is the fast energy depletion of the nodes due to their mobility. An effective way of achieving aggregation is to employ an optimized clustering approach to optimize the position of the cluster heads and then to use spatio-temporal correlation methods or data estimation methods to forecast the data and remove the redundancies in it. This conditioned data is then sent to a central controller or base station, where it is analyzed for applications such as forest fire detection, identification of polluted air in a specific area, habitat monitoring, and surveillance.

One of the key tasks in event detection or reporting applications is to locate the regions where the event occurs. To fulfill this task, the approach must find the nodes whose measurements are abruptly much higher or much lower than the normal measurements of the surrounding parameters. To achieve this, a prediction-based approach can be adopted that transmits the data only when the predicted sequence shows some abruptness in the measurement. The data is generally spatially and temporally correlated, which the prediction process can exploit. Processing predicted data with high redundancy consumes a greater amount of energy. Many prediction-based and correlation-based methods are used to reduce these redundancies and to achieve a long network lifetime [4–8]. The prediction-based approach helps the user to detect periodic variations in monitoring sensors, which helps to control the potential risk in that region. Some model-driven techniques for monitoring applications, such as data mining, query routing, and data collection protocols, provide high accuracy and network lifetime but are more complex, which makes them infeasible for such applications. This paper uses a dual synchronization mechanism to predict the data as well as to select the sensors according to their predicted values using rank-based mutual information. The proposed method uses the merits of both the grey model and the Kalman filter to estimate the data and enhances the accuracy through two tunable parameters, the prediction error threshold and the cumulative error threshold. These parameters are set by the user to decide the abruptness in the data, which is predicted from the recent past values. The abruptness in the data is the basis for selecting the sensor nodes, and only the selected nodes forward their data.

In this paper, these nodes are called pretentious nodes; they are associated with the occurrence of events such as the detection of polluted areas, COVID-19 containment zones, and high-traffic paths. The experiments in this paper use a real-time air pollution dataset. The steps involved in developing the proposed prediction model in a sensor system for monitoring an event are shown in Fig. 1 and are as follows:

Data is imported from each sensor database based on time series.

Conditioning of data by removing redundancies, extracting useful features, and aggregating the data.

Developing a precise prediction model for data prediction using particle filter, Kalman filter, or artificial neural network and validating it for various datasets.

Incorporating the designed model with the main wireless environment for obtaining desired results.

Fig. 1

Stages in developing the proposed prediction-based model for IoT network.

The main contributions of this paper are:

An optimized lightweight clustering called K-medoids clustering is implemented to reduce the size of training datasets by the means of clustering.

The prediction model which is a combination of both Kalman filter and grey model is used to predict the time-based data series with a secular perspective, enhancing the accuracy and precision for the predicted value.

The rank-based mutual information approach is used in synchronization with the prediction-based data aggregation model (GMKFDA) to select the sensors with high cumulative and prediction errors. The cumulative error threshold is the deciding factor for the sensor selection.

The remainder of the paper is structured as follows. Section 2 reviews the relevant literature with some findings and states the advantage of the proposed method over existing methods. Section 3 gives an overview of the proposed model along with the application for which it is used, as well as the network and energy consumption models utilized in the proposed work. Section 4 details the techniques used in the proposed model one by one, from the deployment of sensors to the selection of nodes: the clustering mechanism, the SSA algorithm, the grey prediction model and Kalman filter-based data aggregation, and sensor selection based on prediction using mutual information. Section 5 explains the working of the proposed model with an algorithm and model framework. Section 6 presents the results, discusses the performance of the proposed work for various performance metrics, and gives a complexity analysis of the proposed algorithm. Section 7 concludes the paper with possible future directions in this domain.

2 Relevant works

Considerable research has been done on data aggregation using model-driven approaches to address challenges such as anomaly detection, data collection, data transmission, and event reporting. These approaches have proved very efficient in enhancing the QoS parameters of WSNs. The size of the network is one of the challenges facing application-specific IoT networks, and clustering the nodes is an efficient way to overcome it. Many clustering algorithms are used for WSNs. The low energy adaptive clustering hierarchy (LEACH) is a well-known clustering protocol for WSNs. The LEACH protocol and its variants such as MODLEACH, LEACH-C, LEACH-M, and many more [9, 10] have been proposed to improve the quality of service in a sensor network. In [10], the authors gave an introduction to various LEACH variants and their significance. Similar to the LEACH protocol, some routing protocols have also been developed in recent years for various WSN applications, such as the one proposed in [11], called the hybrid heterogeneous routing scheme (HHRS), which divides the deployment area into regions to track the presence of animals. Other protocols used for hierarchical WSNs, such as DEEC [12–14] and TEEN [15], also aim at efficiency in terms of energy, throughput, etc., in WSN and IoT networks. To enhance the quality of clustering, genetic algorithms are also used in integration with clustering [16]. Optimized clustering helps in achieving the desired results with very low energy consumption. Some algorithms, such as ant colony optimization (ACO) and swarm intelligence algorithms like particle swarm optimization, have given very good results. An optimized approach for cluster head selection using the krill herd algorithm is presented in [17], and the results are evaluated and compared with LEACH and genetic optimization. An energy-efficient multipath routing scheme for clustered WSNs using ACO is proposed in [18] and is compared with three other similar protocols on the grounds of average energy (in %), network lifetime, energy consumption, and standard deviation, i.e., the variation in energy levels across all nodes.

To extend the domain of applications in IoT, model-driven approaches such as data aggregation models and data prediction models have been suggested. In [7], the authors used a prediction-based data collection model built on the autoregressive integrated moving average (ARIMA). The cluster heads share this model with their cluster member nodes to collect the data based on a threshold. If the deviation between the predicted and actual data exceeds the threshold, the difference is transmitted and is further compressed by the PCA method. The work in [4] obtains abstract feature values from the nodes and predicts the data from them using a neural network to reduce the redundancies. The data correlation for predicting the data is done using the MNMF model based on a bidirectional long short-term memory (LSTM) network. Wei et al. [6] implemented a prediction-based data aggregation approach for a granary monitoring application. It uses the grey model and Kalman filter to predict and aggregate the data (GMKFDA). The aggregation is done based on error threshold values. This method has been validated against the traditional Kalman filter and grey prediction model techniques.

The authors in [19] used the Kalman filter along with salp swarm optimization to analyze data packets, thereby reducing the redundancy in the data. The model uses ELM to detect intrusion in the data packets extracted from nodes in a WSN. The Kalman filter and its extended version, the extended Kalman filter, are used efficiently in estimating target locations and tracking objects [20]. Applications such as deep seedling detection [21], time-of-flight measurements in ultrasonic systems [22], system monitoring [23], traffic forecasting [24], and data streaming from cognitive networks [25] have been implemented with Kalman filter-based models. Similarly, grey prediction models give good prediction accuracy and success rates [8]. Integrating the grey model with other approaches such as ELM, artificial neural networks, and the Kalman filter gives an efficient model for data fusion [26], forecasting applications [27], etc. Apart from these, other prediction models for routing and aggregation are also used. In [5], the authors built a fuzzy system-based trust prediction model (FTPR) that predicts node behavior in advance so that the security, network lifetime, and packet delivery ratio (PDR) of the system can be enhanced and the end-to-end delay decreased. In [28], the authors use the value of information of sensors to rank them (VoISRAM). The method uses QoS, energy, spatio-temporal accuracy, and other parameters as the value-of-information attributes. Another concern of wireless networks is scalability. Large-scale WSN and IoT networks, which are spread over a large area for applications such as monitoring and faulty node detection, have to be managed for efficient and reliable use. These networks generally contain mobile nodes that cover an appreciable area to collect the data. In [29, 30], the authors focused on different ways to collect the data in a large-scale WSN (DCLS-WSN).

In some applications, such as event monitoring and target tracking, specific sensors have to be selected. The selection is based on performance metrics such as mutual information and Fisher information. Sensor selection methods used for target tracking applications include mutual information-based selection [31–34], Fisher information-based sensor selection [35], and Kalman filter-based sensor selection [36]. These approaches guarantee better tracking accuracy by imposing essential constraints on the system.

Some query-based prediction models use auto-regressive approaches for aggregating data like PAQ [37]. These methods use local sensor data measurements to build a prediction model and to detect outliers. Sensor selection using Greedy algorithms [38, 39] and exhaustive search methods promise more accuracy with low root mean square error values. Some work has also been done in predicting the future energy values of the sensors using Markov and Auto-regressive (AR) models [40] to ensure high prediction accuracy.

Our paper proposes a novel method in which data aggregation is done from selected sensors by observing their predicted values using a prediction-based approach. Here we are using a grey prediction model with a Kalman filter-based data aggregation model. The selection is done using the mutual information technique which is synchronized with the prediction technique. The advantage of the proposed method over the methods mentioned above is its robustness for different scenarios in WSN. Also, it can be used for more than one application like object tracking, monitoring, and event reporting. This paper uses the proposed model for event reporting applications.

3 Proposed model

This section gives the overall framework of ranking-based sensor selection using a prediction model for data aggregation in IoT. Figure 2 shows an overview of the proposed scheme. The proposed method has two important parts: clustering of the randomly deployed nodes using a salp swarm optimization-based k-medoids clustering scheme to obtain optimized aggregator nodes, and a prediction-based approach integrated with rank-based mutual information to select nodes for data aggregation. Large WSN datasets are used to evaluate the performance of the proposed method. Since the scheme is designed mainly for event reporting, much of the sensor population is in clusters. The salp swarm algorithm is used in conjunction with k-medoids clustering to make this clustering efficient. The resulting cluster heads have optimized positions with respect to the base station to guarantee minimal delay in data transmission.

Fig. 2

Overview of the proposed method.

Once the nodes are clustered at the optimized positions, they start sensing the surrounding parameters. These parameters decide the occurrence of an event in that area. To reduce the computational complexity, prediction-based data aggregation using both the grey prediction model and the Kalman estimation model is employed, which also enhances the prediction accuracy of the system. Since only a few nodes in a cluster are responsible for an event, a mutual information-based sensor selection method is integrated with the prediction model so that prediction and prediction-based sensor selection work in synchronization. The aggregated data from the selected sensors is delivered to the base station, which forwards it to the cloud-enabled IoT devices through an edge network. The edge network reduces the latency of data delivery by storing the most needed data in the vicinity of the IoT devices. It minimizes the size of the data to be transmitted and shortens the path of data delivery to the IoT devices, thereby mitigating congestion and transmission cost.

3.1 Network model

The WSN model used in this paper consists of ‘S’ sensor nodes placed randomly and clustered using the K-medoids algorithm. The cluster centers (medoids) act as aggregator nodes and are assumed to have higher energy than the sensor nodes. The aggregators are assumed to have the computational ability to aggregate the values according to the proposed algorithm. Within its cluster, each aggregator node is assumed to occupy the position at minimum distance from the base station (BS). All aggregators send their data to the aggregator that is nearest to the BS. The BS is considered to have no energy losses and a self-replenishable energy source. The following assumptions are made in the proposed work:

All nodes are static and they transmit their data to their respective cluster’s aggregator node either through a single hop or multiple hops.

The sensor nodes have a limited energy source which is non-replenishable.

All the aggregators send their data periodically to the sink node or other aggregator nodes.

Both source and sink nodes have equal lifetime and obey the same prediction-based approach i.e., they work in synchronization to execute the proposed model.

3.2 Energy consumption model

WSNs are energy-constrained networks, so the energy has to be managed according to the requirements of the application. This paper mainly focuses on the energy of the selected sensor nodes and aggregator nodes. The total associated energy cost is given by: ${En}_{T} = {En}_{TX} + {En}_{RX} + {En}_{1} + {En}_{S}$ (1)

En_TX is the energy required for transmitting N packets over distance D, given by Equation (2). En_RX is the energy required for receiving N packets over distance D, expressed in Equation (3). En_1 is the energy consumed when the network is in an idle state. En_S is the energy required for sensing.

${En}_{TX} (N : D) = \begin{cases} {En}_{el} * N + {En}_{fs} * N * D^{2}, & \text{if } D < D_{0} \\ {En}_{el} * N + {En}_{pw} * N * D^{4}, & \text{if } D \geq D_{0} \end{cases}$ (2)

${En}_{RX} (N : D) = {En}_{e} N$ (3)

En_el is the electronic energy given by Equation (5). En_e represents the energy consumption for per-bit transmission. The transmitted energy is represented by two models, i.e., the free space model and the multipath fading model. The free space model is used when the distance (D) is less than the threshold (D₀), and the multipath fading model is used when the distance is greater than or equal to the threshold. The threshold (D₀) is given by: $D_{0} = \sqrt{\frac{{En}_{fs}}{{En}_{pw}}}$ (4)

En_fs and En_pw represent the energies used by the amplifier. ${En}_{el} = {En}_{TX} + {En}_{ag}$ (5)

En_ag is the energy used for data aggregation at the aggregator nodes in WSN.
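As an illustration, Equations (2) to (4) can be sketched in a few lines of Python. The numeric radio parameters below are assumed placeholder values that are not taken from this paper, N is treated as a number of bits, and the function names are hypothetical.

```python
import math

# Assumed radio parameters for illustration only (not from the paper).
EN_EL = 50e-9        # electronics energy, J per bit
EN_FS = 10e-12       # free-space amplifier energy, J per bit per m^2
EN_PW = 0.0013e-12   # multipath amplifier energy, J per bit per m^4


def threshold_distance(en_fs=EN_FS, en_pw=EN_PW):
    """Threshold distance D0 of Equation (4)."""
    return math.sqrt(en_fs / en_pw)


def tx_energy(n, dist, en_el=EN_EL, en_fs=EN_FS, en_pw=EN_PW):
    """Transmission energy of Equation (2) for n bits over distance dist."""
    if dist < threshold_distance(en_fs, en_pw):
        # Free space model: amplifier cost grows with the square of the distance.
        return en_el * n + en_fs * n * dist ** 2
    # Multipath fading model: amplifier cost grows with the fourth power.
    return en_el * n + en_pw * n * dist ** 4


def rx_energy(n, en_e=EN_EL):
    """Reception energy of Equation (3); it does not depend on the distance."""
    return en_e * n
```

With these assumed values the threshold distance is about 87.7 m, so a 50 m link falls under the free space model and a 100 m link under the multipath fading model.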

4 Techniques used in the proposed model

4.1 Optimized K-medoids based clustering using Salp-swarm algorithm (KMed-SSA)

There are many methods of clustering: some use the spatiotemporal correlation of the data values, some use the Euclidean distance between the points, and some use the density of data points as the basis of clustering. In this article, we use Mahalanobis distance-based k-medoids clustering, which produces efficient cluster centers. The Mahalanobis distance is given by: $d_{i} (s_{i}, {CH}_{1}) = \sqrt{\sum_{i = 1}^{X} \frac{{(s_{i} - {CH}_{1})}^{2}}{st}}$ (6) where st is the standard deviation, s_i is a set that consists of data points and CH₁ is the cluster center.
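A minimal Python sketch of the distance in Equation (6), together with the nearest-center assignment it is used for, could look as follows. The function names are hypothetical, and the per-dimension `scale` argument is an assumption: Equation (6) divides by the standard deviation st, whereas the textbook diagonal Mahalanobis form divides by the variance, so the caller chooses which value to pass.

```python
import math


def scaled_distance(point, center, scale):
    """Equation (6): sqrt of the sum over dimensions of (s - CH)^2 / scale."""
    return math.sqrt(sum((s - c) ** 2 / w
                         for s, c, w in zip(point, center, scale)))


def assign_to_nearest(points, centers, scale):
    """Return, for every point, the index of its closest cluster center."""
    return [min(range(len(centers)),
                key=lambda j: scaled_distance(p, centers[j], scale))
            for p in points]
```

For example, `assign_to_nearest([[0, 1], [9, 9]], [[0, 0], [10, 10]], [1.0, 1.0])` assigns the first point to center 0 and the second to center 1.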

To make the clustering more efficient in terms of energy consumption and network lifetime, we introduce salp swarm optimization along with the k-medoids clustering technique. This hybrid approach optimizes the position of the aggregator (collector) nodes, which results in fast transmission of data to the base station and thereby reduces the network delay. Mirjalili et al. introduced the salp swarm algorithm (SSA) in 2017, a swarm intelligence technique in the same family as particle swarm optimization that is inspired by the swarming behavior of salps [41].

Definition 1

The best aggregator node in each cluster is the one that can efficiently send the data from the selected sensors of its cluster to the next aggregator connected to the aggregator node nearest to the BS, so that unwanted energy wastage is prevented and the network lifetime is enhanced.

4.1.1 Fitness function

The fitness function computed by the BS aims to place the highest-energy cluster head in the cluster nearest to it and to place all aggregators optimally in their corresponding clusters, such that each occupies the position nearest to the base station within the communication range of its cluster. The fitness function given by Equation (7) should be minimized. $F_{fit} = \gamma \cdot f_{a} + (1 - \gamma) f_{b}$ (7) where $f_{a} = \sum_{j = 1}^{k} ∥ {CH}_{j} - BS ∥$ is the sum of the distances between the aggregator node of each cluster and the base station (BS), and $f_{b} = \frac{\sum_{i = 1,\, i \in s_{i}}^{X} En (i)}{\sum_{j = 1,\, {CH}_{j} \in k}^{k} En ({CH}_{j})}$, where En (i) and En (CH_j) are the energies of each node and each aggregator node, respectively.

En (i) is the energy of each node which must lie between En (min) and En (max) i.e., En (min) ≤ En (i) ≤ En (max) and En (CH_j) is the energy of each aggregator node and it is assumed that En (CH_j) ≈ En (max). The value of γ should lie between 0 and 1 i.e., 0< γ<1.
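The fitness of Equation (7) is simple enough to sketch directly. In the Python fragment below the function name, the argument names, and the default γ = 0.5 are illustrative assumptions; the Euclidean norm is used for the distance term.

```python
import math


def fitness(aggregators, base_station, node_energy, aggregator_energy, gamma=0.5):
    """Equation (7): gamma * f_a + (1 - gamma) * f_b, to be minimized."""
    # f_a: total distance between the aggregators and the base station.
    f_a = sum(math.dist(ch, base_station) for ch in aggregators)
    # f_b: total node energy divided by total aggregator energy.
    f_b = sum(node_energy) / sum(aggregator_energy)
    return gamma * f_a + (1.0 - gamma) * f_b
```

A smaller value means the aggregators are closer to the BS and hold more energy relative to the ordinary nodes, which is the placement the fitness function is meant to favor.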

4.1.2 Steps to implement KMed-SSA

Initializing the data points consisting set of points s_i.

Evaluating the mean value and obtaining threshold.

Assigning the mean value to CH₁ for the first cluster.

Using Equation (6), calculating the distance of each point from the cluster center.

The difference in distance between the points within a cluster is assigned Di_i.

Comparing each Di_i and replacing the higher value with a lower one and selecting the lower value Di_i as the centroid of that cluster.

Execute steps 2 to 5 iteratively until the optimal center is found.

If the value of the center is out of the threshold, then construct a new cluster with a new center value.

Repeat the algorithm till all data points are grouped into individual clusters.

Initializing the salps by taking all centroid points CH_j (j = 1, 2, …, k) from the k-medoids algorithm, where k is the number of clusters.

Calculating fitness of each aggregator node (CH_j) from Equation (7).

Updating c₁ from the equation: $c_{1} = 2 e^{- {(\frac{4 l}{L})}^{2}}$ (8) where l is the current iteration and L is the maximum number of iterations.

For each aggregator node, if j = 1, then updating the position of the leading salp, i.e., the aggregator nearest to the BS, by using the following equation: ${CH}_{j}^{'} = \begin{cases} L_{j} + c_{1} (({ul}_{j} - {ll}_{j}) c_{2} + {ll}_{j}), & c_{3} \geq 0 \\ L_{j} - c_{1} (({ul}_{j} - {ll}_{j}) c_{2} + {ll}_{j}), & c_{3} < 0 \end{cases}$ (9) where ${CH}_{j}^{'}$ is the position of the first salp and L_j is the position of the base station.

Otherwise updating the position for all follower salps using the below equation: ${CH}_{j}^{i} = \frac{1}{2} ({CH}_{j}^{i} + {CH}_{j}^{i - 1})$ (10) where i ≥ 2 and ${CH}_{j}^{i}$ is the position of the i^th follower salp in the j^th dimension.

Changing the aggregators based on upper and lower limits.

Repeat the steps until the search for the BS completes and all the aggregators are optimally placed near the BS.
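The salp swarm part of the steps above (Equations (8) to (10)) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `ssa_place` is hypothetical, the random coefficient c₃ is drawn from [-0.5, 0.5) so that both branches of Equation (9) occur, positions are clipped to the limits after every move, and the best position seen so far is tracked with a user-supplied cost such as Equation (7).

```python
import math
import random


def ssa_place(positions, base_station, lower, upper, cost, iterations=50, seed=0):
    """Move candidate aggregator positions with a basic salp swarm loop."""
    rng = random.Random(seed)
    salps = [list(p) for p in positions]
    best = list(min(salps, key=cost))
    for l in range(1, iterations + 1):
        # Equation (8): c1 shrinks from about 2 to nearly 0 over the run.
        c1 = 2.0 * math.exp(-((4.0 * l / iterations) ** 2))
        for i, salp in enumerate(salps):
            for j in range(len(lower)):
                if i == 0:
                    # Leader: Equation (9), a move around the base station L_j.
                    c2 = rng.random()
                    c3 = rng.random() - 0.5  # assumed range, see lead-in
                    step = c1 * ((upper[j] - lower[j]) * c2 + lower[j])
                    if c3 >= 0:
                        salp[j] = base_station[j] + step
                    else:
                        salp[j] = base_station[j] - step
                else:
                    # Follower: Equation (10), average with the salp ahead.
                    salp[j] = 0.5 * (salp[j] + salps[i - 1][j])
                # Keep the position inside the lower and upper limits.
                salp[j] = min(max(salp[j], lower[j]), upper[j])
        candidate = min(salps, key=cost)
        if cost(candidate) < cost(best):
            best = list(candidate)
    return best
```

Because c₁ decays quickly, early iterations explore the whole deployment area while late iterations place the leader almost exactly at L_j, and the followers are pulled along the chain behind it.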

4.2 The grey prediction model and Kalman filter-based data aggregation with rank-based mutual information (GMKFDA-MI)

The data sensed by each node in a network follows a scheme that decides how the sensed data is collected by the aggregator node. Three such schemes are governed by the application layer of each node. In [6], a brief description of the three schemes namely Push, Pull, and the integration of both Push and Pull is given. In this paper, we are employing the GMKFDA method which works on a prediction-based aggregation of data and uses the Push scheme. The Push scheme focuses on prediction-based mutual support between the sensor node and aggregator node. Once the data is sensed the sensor node sends it straight away to the sink without taking link quality, channel characteristics, transceiver parameters, etc., into account.

4.2.1 Grey model

The grey model is a prediction model used to estimate or forecast a data sequence; it requires only a few data points or sensor measurements with uncertainty. The grey model is known for its effective prediction of data that include secular trends. In this paper, the first-order grey model GM(1,1) is used, whose prediction sequence is given by: ${\hat{a}}_{g}^{(0)} (n) = {\hat{a}}_{g}^{(0)} (1), n = 1$ (11)

${\hat{a}}_{g}^{(0)} (n) = {\hat{a}}_{g}^{(1)} (n) - {\hat{a}}_{g}^{(1)} (n - 1) = ({\hat{a}}_{g}^{(0)} (1) - \frac{\hat{u}}{\hat{m}}) (1 - e^{\hat{m}}) e^{- \hat{m} (n - 1)}, n = 2, 3, \dots$ (12)

And the prediction error is given by: ${er}_{g} = | {\hat{a}}_{g}^{(0)} (n + 1) - a^{(0)} (n + 1) |$ (13) where $a^{(0)} (n + 1)$ is the actual sensed value.
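A compact GM(1,1) sketch in plain Python may help to make Equations (11) and (12) concrete. It assumes the usual GM(1,1) fitting procedure, in which the development coefficient m and the grey input u are estimated by least squares on the accumulated sequence; the function names are hypothetical and the first sensed value is used as the starting point of the sequence.

```python
import math


def gm11_fit(series):
    """Estimate the GM(1,1) parameters (m, u) by least squares."""
    acc = [series[0]]
    for value in series[1:]:
        acc.append(acc[-1] + value)            # accumulated sequence a^(1)
    # Background values: means of consecutive accumulated values.
    z = [0.5 * (acc[k] + acc[k - 1]) for k in range(1, len(acc))]
    y = series[1:]
    mean_z, mean_y = sum(z) / len(z), sum(y) / len(y)
    sxx = sum((zi - mean_z) ** 2 for zi in z)
    sxy = sum((zi - mean_z) * (yi - mean_y) for zi, yi in zip(z, y))
    slope = sxy / sxx                           # y = slope * z + u, slope = -m
    return -slope, mean_y - slope * mean_z


def gm11_predict(series, n):
    """Predicted value for the 1-based period n, Equations (11) and (12)."""
    if n == 1:
        return series[0]
    m, u = gm11_fit(series)
    if abs(m) < 1e-12:                          # flat series: no trend to model
        return u
    return (series[0] - u / m) * (1.0 - math.exp(m)) * math.exp(-m * (n - 1))
```

Calling `gm11_predict(readings, len(readings) + 1)` on a list of at least three recent readings gives the one-step-ahead prediction, and its absolute difference from the next actual reading is the error of Equation (13).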

4.2.2 Kalman filter

The Kalman filter works in two phases: the prediction phase and the update phase. Kalman-based prediction removes the randomness in the predicted data. In the prediction phase, the estimate from the previous period is used to produce an estimate for the current time. The outputs of this phase are the prediction model and covariance model given by the following equations: ${\hat{a}}_{k}^{(0)} (n + 1 | n) = A (n) {\hat{a}}_{k}^{(0)} (n | n) + B (n) U (n)$ (14) $cov (n + 1 | n) = A (n) cov (n | n) A {(n)}^{T} + Q (n)$ (15)

The update phase refines the prediction model to produce a more precise model. The prediction and covariance model for the update phase is given by equations:

${\hat{a}}_{k}^{(0)} (n + 1 | n + 1) = {\hat{a}}_{g}^{(0)} (n) + Kg (n + 1) (Z (n) - H (n + 1) {\hat{a}}_{g}^{(0)} (n))$ (16)

$cov (n + 1 | n + 1) = (I - Kg (n + 1) H (n + 1)) cov (n + 1 | n)$ (17) where

$Kg (n + 1) = cov (n + 1 | n) H {(n + 1)}^{T} {[H (n + 1) cov (n + 1 | n) H {(n + 1)}^{T} + R (n)]}^{- 1}$ (18)

For simplicity, let A (n) = 1, B (n) = Q (n) = 0, R (n) = H (n) = I
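Under the simplification above, the filter reduces to a scalar recursion. The sketch below is one possible reading of Equations (14) to (18) for a single sensor value with B(n) = 0; the function name is hypothetical, and it uses the Kalman prediction itself as the prior, so feeding in the grey-model prediction instead, as Equation (16) is written, would replace the `pred` value.

```python
def kalman_step(estimate, cov, measurement, a=1.0, q=0.0, h=1.0, r=1.0):
    """One scalar predict/update cycle; returns (new_estimate, new_cov)."""
    # Prediction phase, Equations (14) and (15) with B(n) = 0.
    pred = a * estimate
    pred_cov = a * cov * a + q
    # Kalman gain, Equation (18).
    gain = pred_cov * h / (h * pred_cov * h + r)
    # Update phase, Equations (16) and (17).
    new_estimate = pred + gain * (measurement - h * pred)
    new_cov = (1.0 - gain * h) * pred_cov
    return new_estimate, new_cov
```

Starting from an estimate of 0 with covariance 1, a measurement of 10 yields a gain of 0.5, so the estimate moves halfway to the measurement and the covariance halves; repeated measurements keep shrinking the covariance because Q(n) = 0.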

4.2.3 The grey prediction model and Kalman filter-based data aggregation

The GMKFDA combines both the attributes of the grey model as well as the Kalman filter model to enhance the system accuracy of the prediction process. The reason behind using GMKFDA is that this technique can remove the deficiencies which are faced by the individual grey model and the Kalman model.

This prediction model works on weights. Optimal weights which are the result of the reduction in the sum of error squares in prediction are assigned to the nodes.

The generalized approach of GMKFDA is discussed here.

Let Y = (y₁, y₂, …, y_p)^T be the weight vector for the p prediction models, with $\sum_{i = 1}^{p} y_{i} = 1$ (19)

Let a(n) be the sequence of actual data sensed for recent n periods and ${\hat{a}}_{j} (n)$ is the predicted sequence of data of p models, where n = 1,2, . . . ,t and j = 1,2, . . . ,p.

The combined technique (grey model and Kalman filter) gives its prediction sequence as: $\hat{a} (n) = f^{- 1} [{(\sum_{j = 1}^{p} y_{j} {(g ({\hat{a}}_{j} (n)))}^{P})}^{1 / P}]$ (20) where P ≠ 0, and g and f are continuous, differentiable functions. The above equation is derived by setting: $\frac{\partial J (n)}{\partial \hat{a} (n)} = 0$ (21) where J(n) is the sum of the squares of the prediction error sequence [6].

To simplify the model, let $f (\hat{a} (n)) = \hat{a} (n)$, $g ({\hat{a}}_{j} (n)) = {\hat{a}}_{j} (n)$, and P = 1.

Thus, the GMKFDA model will become a weighted average of meta prediction models. The prediction data sequence for i^th approach is given by: ${\hat{a}}_{i} = ({\hat{a}}_{i 1}, {\hat{a}}_{i 2} \dots, {\hat{a}}_{it}); i = 1, 2, \dots, p$ (22)

Let er_ij be the error in prediction of j^th datum for i^th approach.

er_i is the error vector given by:

${er}_{i} = ({er}_{i 1}, {er}_{i 2}, \dots, {er}_{it}) = (a_{1} - {\hat{a}}_{i 1}, a_{2} - {\hat{a}}_{i 2}, \dots, a_{t} - {\hat{a}}_{it})$ (23)

${er}_{j} = a_{j} - \sum_{i = 1}^{p} y_{i} {\hat{a}}_{ij}$ (24) where er_j is the combined prediction error for the j^th datum and Y is the weight vector. The sum of squares of the combined prediction error is given by: $J (n) = \sum_{j = 1}^{t} {(a_{j} - \sum_{i = 1}^{p} y_{i} {\hat{a}}_{ij})}^{2}$ (25)

Thus, the optimal weight vector is obtained by minimizing J(n) with the least-squares method under the constraint of Equation (19). The optimal weight vector is: $Y = \frac{A^{- 1} M}{M^{T} A^{- 1} M}$ (26) where M = (1, 1, …, 1)^T

and $A = \begin{pmatrix} \sum_{i = 1}^{t} {er}_{1 i}^{2} & \sum_{i = 1}^{t} {er}_{1 i} {er}_{2 i} & \dots & \sum_{i = 1}^{t} {er}_{1 i} {er}_{pi} \\ \sum_{i = 1}^{t} {er}_{2 i} {er}_{1 i} & \sum_{i = 1}^{t} {er}_{2 i}^{2} & \dots & \sum_{i = 1}^{t} {er}_{2 i} {er}_{pi} \\ \vdots & \vdots & \ddots & \vdots \\ \sum_{i = 1}^{t} {er}_{pi} {er}_{1 i} & \sum_{i = 1}^{t} {er}_{pi} {er}_{2 i} & \dots & \sum_{i = 1}^{t} {er}_{pi}^{2} \end{pmatrix}$ (27) Every sensor node computes Y and sends it to the corresponding cluster aggregator node. After a certain number of periods, the nodes re-compute the weight vector and send it again to the aggregator to synchronize the prediction parameters.
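The weight computation can be sketched in plain Python as follows. The sketch builds the error matrix A from the per-model error sequences, solves A w = M for a vector of ones M, and normalizes w so that the weights sum to one, which is the constrained least-squares solution. The function names are hypothetical, and the small linear solver is included only to keep the example self-contained; it fails with a division by zero when A is singular, for example when two models produce identical errors.

```python
def error_matrix(errors):
    """A[a][b] = sum over periods of er_a * er_b for every pair of models."""
    p = len(errors)
    return [[sum(x * y for x, y in zip(errors[a], errors[b])) for b in range(p)]
            for a in range(p)]


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with pivoting."""
    size = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(size):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [aug[i][size] / aug[i][i] for i in range(size)]


def optimal_weights(errors):
    """Weights proportional to A^-1 M, normalized to sum to one."""
    w = solve(error_matrix(errors), [1.0] * len(errors))
    total = sum(w)
    return [wi / total for wi in w]


def combine(predictions, weights):
    """Weighted average of the individual model predictions (P = 1 case)."""
    return sum(y * a for y, a in zip(weights, predictions))
```

With two models whose error sequences are uncorrelated, the model with the smaller squared error receives the larger weight, as expected from a weighted average of the grey model and Kalman filter predictions.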

4.3 Criteria for sensor selection

In this paper, sensor selection is done using the rank-based mutual information (MI) technique. Information-based sensor management tends to work on uncertainty in estimating the event occurrence. For this purpose, the entropy which represents randomness is a key factor to estimate the region state through sensor measurements. The sensor selection problem for applications including event detection can be solved by using the relationship between mutual information and entropy.

In the case of successful predictions, the PVQ and predicted values from the ADQ of the sensor give high mutual information with the PVQ of aggregators. Our objective is to select those sensors which give minimum or zero mutual information of their predicted values with the PVQ of the aggregator of that cluster. For this purpose, we have to check the below equation for all sensors in a cluster. Find a sensor ‘i’ that satisfies the following condition: $i = arg max_{i \in s_{i}} E ({PVQ}_{aggregator} (i) | {ADQ}_{sensor} (i))$ (28)

In other words, $i = \arg\min_{i \in s_{i}} I ({PVQ}_{aggregator} (i), {ADQ}_{sensor} (i))$ (29) where mutual information (I) is given by:

$I ({PVQ}_{aggregator} (i), {ADQ}_{sensor} (i)) = \iint \mathrm{pdf} (\hat{a} (n + 1), {\hat{a}}^{'} (n + 1)) \ln \frac{\mathrm{pdf} (\hat{a} (n + 1), {\hat{a}}^{'} (n + 1))}{\mathrm{pdf} (\hat{a} (n + 1)) \, \mathrm{pdf} ({\hat{a}}^{'} (n + 1))} \, d \hat{a} (n + 1) \, d {\hat{a}}^{'} (n + 1)$ (30) where E is the entropy and I is the mutual information between two observations.
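A minimal sketch of the selection criterion follows. It replaces the continuous densities of Equation (30) with a histogram estimate, which is an assumption of this example rather than the estimator used in the paper; the names `mutual_information`, `rank_by_mi`, the bin count, and the two toy sensors are hypothetical.

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Histogram estimate of I(X; Y) in nats (an approximation of Eq. (30))."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                 # joint probabilities
    px = pxy.sum(axis=1, keepdims=True)       # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)       # marginal of y, shape (1, bins)
    nz = pxy > 0                              # skip empty cells (0 * log 0 = 0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def rank_by_mi(pvq_aggregator, adq_sensor, bins=8):
    """Order sensors from lowest to highest MI; the first entries are the candidates."""
    mi = {node: mutual_information(pvq_aggregator[node], adq_sensor[node], bins)
          for node in pvq_aggregator}
    order = sorted(mi, key=mi.get)
    return order, mi

# Toy data: n1 follows the aggregator's predictions, n2 alternates abruptly
t = np.linspace(0.0, 1.0, 64)
pvq = {"n1": t, "n2": t}
adq = {"n1": t.copy(), "n2": np.tile([0.0, 1.0], 32)}
order, mi = rank_by_mi(pvq, adq)
print(order, mi)
```

The sensor whose readings are unrelated to the aggregator's predictions yields the lowest mutual information and therefore appears first in the ordering.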

5 Working of the proposed data aggregation scheme

The sensor nodes are deployed randomly, covering the whole area of interest. The communication range is considered to be fixed and the same for each node. Each node is connected to its immediate next node or to the cluster head. It is assumed that all the aggregators are connected and that only the cluster head in the vicinity of the base station (BS) aggregates the data and transmits it to the base station. Figure 3 depicts the prediction and sensor selection process occurring in a clustered WSN with a BS. The clusters are formed using the K-medoids technique, and the medoid of each cluster is chosen as its cluster head or aggregator. To optimize the positions of the aggregators, Salp swarm optimization (SSA) is employed in integration with the K-medoids clustering technique. Figure 4 shows the flow chart of the KMed-SSA clustering.

Fig. 3

Sensor selection mechanism using prediction-based data aggregation and mutual information in a cluster.

Fig. 4

Flow chart for KMed-SSA.

Event tracking and monitoring need an estimate of the region state, which depends on the sensor measurements. For this purpose, the concept of predictive analytics is used in this paper. Each sensor periodically produces a sensed datum. Based on the actual sensed values, either past or current, the sensed data are predicted at each cluster’s aggregator node, which is also responsible for selecting sensors. The same prediction algorithm has to be employed on both sensor and aggregator nodes. The sensor node is assumed to possess sufficient energy for data storage. All aggregators broadcast their cumulative and prediction error thresholds, φ and α respectively, to their cluster members as per the application requirements. Both φ and α are tunable parameters. Each sensor node maintains two data queues of equal length, namely the actual data queue (ADQ) and the predicted value queue (PVQ). The ADQ stores the series of actual data and controls the cumulative error. The PVQ contains predicted data values. Once the ADQ and PVQ are formed, the aggregator node also constructs the corresponding PVQ for each node, i.e.,

PVQ_sensor (j) = PVQ_aggregator (j) for sensor j. Each node senses data and stores the first ‘s’ sensed data into PVQ and ADQ of sensors and sends them to the PVQ of aggregators.

Let a_i be the data item in the queue. Initially, ADQ (j) = PVQ_sensor (j) = PVQ_aggregator (j) = {a₁, a₂, …, a_s} for a random sensor j. Let $a_{s + 1}$, $a_{s + 1}^{\prime}$, and $a_{s + 1}^{\prime\prime}$ be the actual sensed data, the predicted value using ADQ, and the predicted value from PVQ_sensor (j), respectively. Because PVQ_aggregator (j) mirrors PVQ_sensor (j), the aggregator node also obtains $a_{s + 1}^{\prime\prime}$ from its PVQ_aggregator (j) queue. If $|a_{s + 1}^{\prime\prime} - a_{s + 1}| < \alpha$ and $|a_{s + 1}^{\prime\prime} - a_{s + 1}^{\prime}| < \varphi$, then the prediction error and the cumulative error are said to be within the thresholds. In the proposed method, the nodes which show $|a_{s + 1}^{\prime\prime} - a_{s + 1}| > \alpha$, irrespective of the cumulative error status, are said to be the pretentious nodes. We aim to separate these nodes from the normal nodes and focus only on pretentious nodes for further analysis of their data at the base station.

When the prediction and cumulative errors are within their thresholds, the prediction is said to be successful and the node need not send its data to the aggregator. The aggregator takes $a_{s + 1}^{\prime\prime}$ as a_s+1.

So, the updated queue will be:

$ADQ (j) = \{a_{2}, a_{3}, \dots, a_{s + 1}^{\prime\prime}\}$

${PVQ}_{sensor} (j) = \{a_{2}, a_{3}, \dots, a_{s + 1}^{\prime\prime}\}$

${PVQ}_{aggregator} (j) = \{a_{2}, a_{3}, \dots, a_{s + 1}^{\prime\prime}\}$
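The threshold check and queue update described above can be sketched as follows. This is only an illustration: `predict` is a placeholder (a window mean) standing in for the GMKFDA predictor, and the handling of a failed prediction (storing the transmitted sensed value in all queues) is an assumption of this sketch, since the text above specifies only the successful case.

```python
from collections import deque

def predict(queue):
    """Placeholder predictor (window mean); stands in for GMKFDA."""
    return sum(queue) / len(queue)

def step(adq, pvq_sensor, pvq_aggregator, sensed, alpha, phi):
    """One period for sensor j; returns True when the prediction succeeds."""
    a1 = predict(adq)           # a'  : prediction from the actual data queue
    a2 = predict(pvq_sensor)    # a'' : prediction from the predicted value queue
    ok = abs(a2 - sensed) < alpha and abs(a2 - a1) < phi
    # Success: nothing is transmitted and a'' replaces the sensed value.
    # Failure (assumption): the sensed value is sent and stored instead.
    value = a2 if ok else sensed
    for q in (adq, pvq_sensor, pvq_aggregator):
        q.append(value)         # maxlen drops the oldest item automatically
    return ok

s = 4                                        # hypothetical queue length
adq = deque([10.0] * s, maxlen=s)
pvq_s = deque([10.0] * s, maxlen=s)
pvq_a = deque([10.0] * s, maxlen=s)

first = step(adq, pvq_s, pvq_a, sensed=10.3, alpha=1.0, phi=1.0)   # small deviation
second = step(adq, pvq_s, pvq_a, sensed=15.0, alpha=1.0, phi=1.0)  # abrupt change
print(first, second, list(pvq_a))
```

Because the sensor and the aggregator apply the same update to their predicted value queues, the two queues stay identical without any extra synchronization message in the successful case.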

Reliable data delivery based on the acknowledgment (ACK) signal from the receiver is considered. Mutual information is used to solve the sensor selection problem. The sensors which sense abrupt measurements have more entropy and contribute a very low value of mutual information. These nodes, called pretentious nodes, are assigned high ranks, and the aggregators collect the data predicted by them. There may be more than one such node, so all of them are given higher ranks than the normal nodes. The ranks are assigned iteratively for every node. The rank-based MI algorithm works in synchronization with the hybrid predictive model GMKFDA. Also, the next-period prediction queue of each sensor is synchronized with that of its aggregator node; hence the name dual synchronization method is used for the proposed scheme. Figure 5 shows the flow schematic for the sensor selection at the aggregator node using GMKFDA with the rank-based MI method. The algorithm for GMKFDA with rank-based sensor selection using MI is given below.

Fig. 5

Proposed model for cluster-based GMKFDA-MI technique.

Algorithm for GMKFDA with rank-based MI
Input:
$\hat{a} (i)$: current prediction data sequence, $\hat{a} (i) = ({\hat{a}}_{i - n + 1}, {\hat{a}}_{i - n + 2}, \dots, {\hat{a}}_{i}); i \geq n$
Y: the current optimal weight vector
a_i+1: the sensed data of the (i + 1)^th period
b: number of successful predictions
u: period count for re-computing weight vectors
v: threshold for ‘b’. If b ≥ v, the sensor should send the current sensed data to the aggregator
curr: age of the current weight vector
α: threshold of the prediction error
φ: threshold of the cumulative error
Output:
S: vector containing selected sensors with errors in prediction. Initially S = ∅
Begin GMKFDA ($\hat{a} (i), Y, a_{i + 1}, b, u, v, curr, \alpha, \varphi$):
{
    Execute the Grey prediction model-based data aggregation to get the predicted data ${\hat{a}}_{g}$ and its error er_g;
    Execute the Kalman filter estimation-based data aggregation to get the predicted data ${\hat{a}}_{k}$ and its error er_k;
    if curr < u-1 {
        curr = curr + 1;
        Execute the GMKFDA algorithm to obtain the predicted data $\hat{a}$, the prediction error p_er and the cumulative error c_er;
    }
    for each cluster sensor node s_i {
        if (p_er > α && c_er > φ) || (p_er < α && c_er > φ) {
            b = 0;
            Compute the mutual information between PVQ_aggregator (i) and ADQ_sensor (i) for each sensor in the cluster with its corresponding aggregator node, as shown in Equation (30).
            Compare the MI values of all sensors to find the sensor with the minimum MI value. //synchronizing the calculation of the predicted queue with the calculation of MI for each sensor.
            Assign ranks to all selected nodes so that a lower mutual information value receives a higher rank.
            Add the high-rank sensors to vector ‘S’.
            Send the ${\hat{a}}^{'} (i + 1)$ data from the selected sensor vector to the aggregator.
            return S;
        }
        else if (p_er < α && c_er < φ) and b < v-1 {
            b = b + 1;
            ${\hat{a}}_{i + 1} = \hat{a}$; //synchronizing the prediction data sequence for the next period with the aggregator node.
            $\hat{a} (i + 1) = ({\hat{a}}_{i - n + 2}, {\hat{a}}_{i - n + 3}, \dots, {\hat{a}}_{i + 1})$; //refreshing the prediction sequence for future predictions.
        }
        else {
            Calculate the new optimal weight vector Y_new;
            Forward the updated weight vector to the aggregator node.
            Y = Y_new;
            curr = 0;
        }
    }
}

6 Results and discussions

In this paper, we validate our algorithm on two air pollution datasets: one from the UCI repository, called the “Air quality dataset” and published in 2016, and the other from an experiment carried out for 1 day on the busy roads near hospital premises. The subsection below describes the example application scenario.

6.1 Real-world example

A random deployment is considered in which the nodes are equipped to sense various gases in order to monitor pollutants in the air. Figure 6 shows a scenario for the application of air pollution monitoring in a real-time environment near hospital premises. The sensing metrics, i.e., G₁, G₂, …, G_m, evolve over time; for example, G₁ = CO concentration, G₂ = SO₂ concentration, etc. We have designed a model that works with every node producing data. Repeated readings are taken as one, and the model updates the aggregator node about the readings, so communication happens only when the readings vary. The classification of pretentious nodes producing abrupt and unusual readings is done at the aggregator node using the rank-based mechanism proposed in the model. It is necessary to monitor and track the regions where the content of pollutants like carbon monoxide, benzene, carbon dioxide, nitrogen oxide, sulphur dioxide, etc., is too high. Various gas sensors mounted on a node are used to collect the data in various regions. The simulation parameters for the scenario are given in Table 1.

Fig. 6

Randomly deployed clustered sensor network employing the proposed methodology for air pollutant monitoring application.

Table 1

Scenario parameters

Parameter	Value
Deployment field (m²)	100×100
Location of BS	(46, 58)
Node energy	1 mJ
Number of nodes including aggregator nodes	100
Number of clusters using KMed-SSA	4
Energy dissipation idle mode (E_elec)	50 nJ/bit
Free space model (E_fs)	10 pJ/bit/m²
Multi-path model (E_mp)	0.0013 pJ/bit/m⁴
Energy for data aggregation E_DA	5 nJ/bit/signal
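
The values in Table 1 match the parameters of the first-order radio energy model commonly used with LEACH-style protocols. Assuming that model applies here (the section does not state the formula explicitly), the per-packet energy could be sketched as below; the packet size and distance are hypothetical.

```python
import math

# Parameters from Table 1, converted to joules
E_ELEC = 50e-9        # J/bit, electronics energy
E_FS = 10e-12         # J/bit/m^2, free-space amplifier
E_MP = 0.0013e-12     # J/bit/m^4, multi-path amplifier
E_DA = 5e-9           # J/bit/signal, data aggregation

# Crossover distance between the two amplifier models (assumed first-order model)
D0 = math.sqrt(E_FS / E_MP)

def tx_energy(bits, distance):
    """Energy to transmit `bits` over `distance` metres."""
    if distance < D0:
        return bits * (E_ELEC + E_FS * distance ** 2)
    return bits * (E_ELEC + E_MP * distance ** 4)

def rx_energy(bits):
    """Energy to receive `bits`."""
    return bits * E_ELEC

packet_bits = 4000                         # hypothetical packet size
e_short = tx_energy(packet_bits, 30.0)     # hypothetical 30 m hop
print(D0, e_short, rx_energy(packet_bits))
```

Under this model, every transmission that a successful prediction avoids saves the sender `tx_energy` and the aggregator `rx_energy` for that packet.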

The nodes are clustered, and each cluster aggregator has a significant transmission and communication range, which keeps it connected to the other cluster aggregators or the BS. Each aggregator receives the data from all the member nodes of its cluster. After performing the prediction and sensor selection tasks, the aggregator nodes send the aggregated data values to the base station, which is connected to other edge devices or users via the cloud. The details of the datasets are given in Table 2.

Table 2

The description of the two air quality monitoring datasets used in the experimentation

Data set	Characteristic	Number of instances	Number of attributes
Data set I (Air quality dataset from UCI repository)	Time-series	9358	15
Data set II (real-world application)	Time-series	10,000	9

The ranks are assigned to each sensor according to the criteria given in the GMKFDA-MI algorithm. Table 4 shows an example in which ranking for the nodes is done based on their predicted and actual measurements. The dataset after the prediction protocol is classified using some well-known classifiers like Logistic regression (LR), Random Forest (RF), and AdaBoost as shown in Table 5, and the performance metrics like accuracy, precision, recall, and F1score are compared with other existing methods. These three classifiers perform better than other classifiers like SVM, Decision trees, etc. when applied to our datasets, so the performance metrics of the three classifiers are taken to validate with other existing models.

Table 4

Example assigning a rank to the nodes by applying mutual information on the prediction algorithm

Nodes in a cluster	Node 1	Node 2	Node 3	Node 4
Mutual information-based ranking of nodes
Actual sensor readings	CO	6.9	0.7	16.8	1.3
	C₆H₆	58.1	10.0	38.5	8.4
	CO	17.1	1.4	7.0	1.5
	C₆H₆	5.0	9.4	7.7	7.6
Predicted value of sensed parameter	CO	14.5	1.2	12.9	0.9
	C₆H₆	42.1	9.7	38.1	7.1
Ranks assigned to nodes		1	2	1	2
Nodes 1 & 3 are pretentious nodes

Table 5

Performance evaluation of proposed method and other existing methods for both datasets

Model	Accuracy (in %)	Precision	Recall	F1score
KF based SS model using LR (Dataset-I)	86.7	0.86	0.80	0.85
FI based SS model using DT (Dataset-I)	78.9	0.77	0.79	0.78
KF based SS model using LR (Dataset-II)	91.2	0.83	0.83	0.83
FI based SS model using DT (Dataset-II)	87.9	0.90	0.89	0.89
Proposed method using LR (DataSet-I)	93.2	0.87	0.92	0.89
Proposed method using RF (DataSet-I)	89.4	0.86	0.86	0.86
Proposed method using Adaboost (DataSet-I)	90.2	0.88	0.88	0.88
Proposed method using LR (DataSet-II)	96.4	0.96	0.90	0.93
Proposed method using RF (DataSet-II)	97.8	0.99	0.83	0.90
Proposed method using Adaboost (DataSet-II)	96.2	0.98	0.88	0.93

Abbreviations: KF, Kalman filter; SS, sensor selection; FI, Fisher Information; LR, Logistic regression; DT, Decision trees; RF, Random Forest. The bold values signify the best values achieved by the proposed model over other existing models.

6.2 Performance measures

The performance measures taken to validate the proposed method with other existing models are defined below:

6.2.1 Cumulative distribution function (CDF)

The CDF of a real-valued random variable ‘R’ gives the probability that R takes a value less than or equal to ‘r’, where ‘r’ is the real number at which the function is evaluated. Mathematically, it is given by: $F_{R} (r) = P (R \leq r)$ where F_R (r) is the distribution function of R and P is the probability that R takes a value less than or equal to r.

The CDF of a continuous random variable is given by: $F_{R} (r) = \int_{- \infty}^{r} f_{R} (t) dt$ where f_R (t) is the probability density function of R.

6.2.2 Mean square error (MSE)

MSE evaluates the quality of an estimator. It is the average of the squared errors, where each error is the difference between the actual and the predicted value. It is non-negative, and in practice its value is greater than zero. A lower MSE indicates a more accurate predictor model. $MSE = \frac{1}{n} \sum_{i = 1}^{n} (V_{i} - {\hat{V}}_{i})^{2}$ where n is the number of data points, ‘V’ is the observed value and $\hat{V}$ is the predicted value.
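
As a small illustration of the two measures defined so far, the sketch below computes the MSE and the empirical CDF of the absolute prediction error at a threshold r. The function names and the data values are hypothetical.

```python
import numpy as np

def mse(observed, predicted):
    """Mean square error between observed and predicted values."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean((observed - predicted) ** 2))

def error_cdf(observed, predicted, r):
    """Empirical P(|error| <= r): the fraction of absolute errors not exceeding r."""
    err = np.abs(np.asarray(observed, dtype=float) - np.asarray(predicted, dtype=float))
    return float(np.mean(err <= r))

# Hypothetical observed and predicted readings
V = [1.0, 2.0, 3.0, 4.0]
V_hat = [1.5, 2.0, 2.0, 6.0]
print(mse(V, V_hat))              # 1.3125
print(error_cdf(V, V_hat, 1.0))   # 0.75
```

Evaluating `error_cdf` at r equal to the prediction error threshold gives the share of predictions that fall within that threshold.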

6.2.3 End-to-end delay

It is the time that a packet takes to travel from the source node to the sink node. Generally, its unit is seconds. The end-to-end delay must be as low as possible to ensure low network latency. $\text{End-to-end delay} = \sum_{i = 1}^{pc} (B_{RXi} - S_{TXi})$ where pc is the number of packets, B_RXi is the time at which the sink node or base station receives the i^th packet, and S_TXi is the time at which the sensor transmits it.

6.2.4 Throughput

It is defined as the average rate of successful packet delivery in a network. A high value of this metric is desirable. It is measured in bits/sec. $\text{Throughput at sink node} = \frac{\sum_{i = 2}^{N} 8 \times P_{i}}{(L_{t} - F_{t})}$ where N is the total number of nodes except the sink node, P_i is the total number of bytes delivered from the i^th node, L_t denotes the time at which the last packet is received at the sink node, and F_t is the time at which the first packet is delivered at the sink node.

6.2.5 Packet delivery ratio (PDR)

It is the ratio of the number of data packets successfully delivered to the total number of packets transmitted. A high PDR indicates good system throughput. $PDR = \frac{\text{Number of packets received successfully}}{\text{Total number of packets transmitted}}$
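
The three packet-level measures of Sections 6.2.3 to 6.2.5 can be computed as in the sketch below. It takes the observation window for throughput as the time between the first and the last packet received at the sink, which is an assumption of this example; the function names and the sample numbers are hypothetical.

```python
def end_to_end_delay(sent_times, recv_times):
    """Sum over packets of (receive time at sink - transmit time at sensor), in seconds."""
    return sum(rx - tx for tx, rx in zip(sent_times, recv_times))

def throughput_bps(bytes_per_node, first_rx, last_rx):
    """Bits delivered to the sink per second over the window [first_rx, last_rx]."""
    return 8 * sum(bytes_per_node) / (last_rx - first_rx)

def packet_delivery_ratio(received, transmitted):
    """Fraction of transmitted packets that were received successfully."""
    return received / transmitted

# Hypothetical trace: three packets and three source nodes
delay = end_to_end_delay([0.0, 1.0, 2.0], [0.5, 1.5, 2.25])
tput = throughput_bps([100, 150, 250], first_rx=0.5, last_rx=2.5)
ratio = packet_delivery_ratio(45, 50)
print(delay, tput, ratio)   # 1.25 2000.0 0.9
```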

6.2.6 Energy consumption

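The total energy consumed by every sensor node in the transmission, reception, idle, and sleep states is termed the energy consumption of the network. It is expressed in millijoules. $\text{Total energy consumption} = \sum_{i = 1}^{s} ({TX}_{i} + {RX}_{i} + {ID}_{i})$ where s is the number of sensor nodes and TX_i, RX_i, and ID_i are the energies that node i spends in transmission, reception, and idle mode, respectively.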

6.2.7 Accuracy

It gives the proportion of correct classifications out of all classifications. A higher accuracy indicates a more accurate model. It is expressed as a percentage. A perfect model has 100% accuracy, which is not a practical scenario. $Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$

Where TP symbolizes true positive value, TN represents a true negative value, FP expresses false positive number, FN indicates a false negative value.

6.2.8 Precision/ positive predictive value

Precision answers how many of the positive classifications are actually correct. A classification with no false positives has a precision of 1. $Precision = \frac{\text{true positive}}{\text{true positive} + \text{false positive}}$

6.2.9 Recall/ sensitivity

Recall gives the proportion of actual positive data items that are correctly identified. A model with no false negatives has a recall of 1. $Recall = \frac{\text{true positive}}{\text{true positive} + \text{false negative}}$

6.2.10 F-score

It combines recall and precision and is given by the harmonic mean of the two values. It is a measure of the accuracy in the testing phase. $\text{F1 score} = \frac{2 \times (Precision \times Recall)}{(Precision + Recall)}$
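
The four classification measures can be derived from confusion-matrix counts as sketched below. The counts are hypothetical; the final check recomputes the F1 score from the precision and recall reported in Table 5 for the proposed method using LR on dataset II.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall, f1_score(precision, recall)

# Hypothetical counts
acc, prec, rec, f1 = classification_metrics(tp=90, tn=80, fp=10, fn=20)
print(acc, prec, rec, f1)

# Precision 0.96 and recall 0.90 (Table 5, proposed method using LR, dataset II)
table_f1 = round(f1_score(0.96, 0.90), 2)
print(table_f1)   # 0.93
```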

In the above context, a true positive indicates that the predicted value is actually correct and is also classified as correct; a true negative specifies that the predicted value is incorrect and is classified as incorrect; a false positive denotes that the predicted value is incorrect but is classified as a correct prediction; and a false negative signifies that the predicted value is correct but is classified as an incorrect prediction. In Table 3, it can be observed that KMed-SSA clustering gives zero standard deviation, i.e., the best fitness value, average fitness value, and worst fitness value are the same. Also, the time taken by the KMed-SSA algorithm is much lower than that of the other two algorithms.

Table 3
Performance evaluation of algorithms used for optimized clustering evaluated on dataset I

Algorithm	BFV	AFV	WFV	Time (in sec)
K-Means PSO	93.05	94.71	95.52	15.6
KMed-SSA	98.75	98.75	98.75	5.4
PSO	99.58	99.87	100.75	15

Abbreviations: PSO, particle swarm optimization; KMED-SSA, k-medoids with salp swarm algorithm; BFV, best fitness value; AFV, average fitness value; WFV, worst fitness value.

6.3 Performance analysis and discussions

The prediction-based data aggregation used in the paper is independent of structure and topology, so random sensor measurements for carbon monoxide (CO) are taken for 500 data items. In the proposed method, the prediction and cumulative error thresholds (α and φ respectively) are application-specific and depend on the user's requirements on data accuracy. For the CO measurements, the tolerated error thresholds are relatively small values. In our experiment we take α = 1 and, for simplicity, let α = φ. In Fig. 7(a), the Kalman filter-based data aggregation model performs better than the grey model-based aggregation. The CDF curve shows that the communication energy savings of the Kalman filter model are higher than those of the grey model for a prediction error threshold equal to 1 (α = 1). The proposed model has a much higher CDF value for prediction errors than the other two methods. Although it requires more computation, it saves more energy and thus outperforms the other two models. Figure 7(b) depicts the number of successful predictions against the prediction error threshold. It can be inferred that, because PVQ is used instead of ADQ in the prediction process, the number of successful predictions is high for the GMKFDA-MI algorithm. The proposed method also consumes less energy by preventing redundant communications and overhead. The auto-regressive model [42] performs worst in this case, whereas the query-based model (PAQ) [37] gives an average performance.

Fig. 7

Comparison of CDFs and number of successful predictions for the proposed model with two other existing models (a) Cumulative distribution functions for prediction errors when α= 1. (b) Evaluation of different prediction success rates for different values of α.

The proposed model is trained and tested on the two datasets described in Table 2. The QoS parameters obtained for both datasets are compared against pre-existing cluster-based models such as DEEC [12–14], LEACH [9, 10], and TEEN [15]; sensor ranking models such as FTPR [5], VoISRAM [28], DCLS-WSN [29, 30], and HHRS [11]; sensor selection models based on predictive analysis such as Kalman filter-based sensor selection (KFSS) [36], Fisher information-based sensor selection (FISS) [35], and Greedy algorithms [38, 39]; and exhaustive search methods such as brute force. Many clustering-based routing analyses show that DEEC performs better than LEACH, and this is also observed here. Figure 8 shows the comparative analysis of the proposed method with the other methods on dataset I. Figure 8(a) compares the proposed method with the other sensor selection methods in terms of mean square error. The proposed method performs closest to the KFSS model, which exhibits high accuracy among the other methods. The proposed method shows a 66.6% decrease in the MSE when compared with the KFSS model. The FISS, Greedy, and exhaustive methods use estimation theory, so the mean square error is calculated for them as well. These methods show a 75% to 80% higher MSE than the proposed model. Figure 8(b) depicts the end-to-end delay of four methods and the proposed method against the percentage of pretentious nodes. The delay of the proposed method is the lowest of all methods. The delay is 45% to 52% higher for the FTPR, VoISRAM, DCLS-WSN, and LEACH approaches than for the proposed prediction model. Because of prediction and sensor selection, data is transmitted only from the selected nodes; thus, the proposed method has very low latency and outperforms other similar data collection and sensor selection methods.

Figure 8(c) shows the throughput analysis for DEEC, LEACH, TEEN, and HHRS. The throughput against the number of transmissions is very similar for DEEC, LEACH, and TEEN because these three protocols are based on clustering. However, HHRS performs better than these three methods. The proposed method shows 18.7% higher throughput than the HHRS method. The HHRS method suffers from high packet loss, which is why its throughput is lower than that of the proposed prediction model. The reduction in throughput for DEEC, LEACH, and TEEN is 21%, 35%, and 57.8% respectively when compared with the proposed method. Figure 8(d) gives the packet delivery ratio for the proposed method along with four other methods: LEACH, DEEC, HHRS, and FTPR. After the proposed method, FTPR shows a higher PDR than LEACH, DEEC, and HHRS. A 63%, 71%, and 87% decrease in PDR is observed for the DEEC, HHRS, and LEACH protocols respectively when compared with the proposed method. The proposed method employs clustering along with a predictive sensor selection scheme, which is not present in LEACH, DEEC, HHRS, and FTPR, and thus it achieves a high PDR. In Figure 8(e), the energy consumption is calculated in joules with respect to the percentage of pretentious nodes. In FTPR, the usage of memory leads to more energy consumption. The proposed method consumes 66% less energy than the FTPR method. The LEACH protocol spends energy on the construction of clusters and consumes 71.4% more energy than the proposed approach. The proposed technique uses 25% to 33% less energy than VoISRAM and KFSS.

Fig. 8

Comparison of performance measures for validating the proposed method against other models for the dataset I. (a) Mean square error vs pretentious node percent (b) End-to-end delay with respect to pretentious node percent (c) Throughput vs number of transmissions (d) Packet delivery ratio vs percentage of pretentious nodes (e) Network energy consumption vs percentage of pretentious nodes.

Figure 9 gives the performance and comparative analysis of the proposed algorithm trained and tested on dataset II for the same performance metrics as discussed for Fig. 8. Sensor selection methods such as KFSS and FISS use only the correlation between the data measured by the sensors and not prediction, so they do not perform as well as the proposed model. The greedy approaches use estimation theory and give a high error, whereas KFSS uses a Kalman filter and therefore performs well in terms of mean square error. In Fig. 9(a), KFSS and FISS show 66% and 75% higher error, respectively, than the proposed method. The proposed method shows an approximately 85% decrease in the MSE in comparison with the exhaustive and greedy methods. The end-to-end delay of LEACH is the highest and is 55% higher than that of the proposed method, as shown in Fig. 9(b). Since all nodes in a cluster are involved in the transmission, the delay is high for the LEACH protocol. The reduction in delay for the proposed method against FTPR, DCLS-WSN, and VoISRAM is 42.8%, 53%, and 54% respectively as the pretentious nodes increase. Figure 9(c) depicts the throughput versus the number of transmissions. The highest throughput is shown by HHRS. The throughput of DEEC, LEACH, and TEEN is 25%, 42.8%, and 81.8% lower, respectively, than that of the proposed method. Figure 9(d) gives the PDR for FTPR, HHRS, DEEC, LEACH, and the proposed model. As the number of pretentious nodes increases, the PDR decreases. The highest PDR is observed for the proposed method, with a value of 82, whereas LEACH has the lowest, with a value of 12. FTPR records a 13.8% lower PDR, and HHRS and DEEC show 64% and 69.5% reductions in PDR, respectively, as compared to the proposed method. In Fig. 9(e), after LEACH, the energy utilization of FTPR is the highest because of the usage of memory in the FTPR protocol. Low energy consumption is shown by the VoISRAM and KFSS methods. The proposed method uses approximately 11% to 15% less energy than the VoISRAM and KFSS methods. Table 6 gives a detailed comparison of our proposed work with other similar existing models.

Fig. 9

Comparison of performance measures for validating the proposed method against other models for dataset II. (a) Mean square error vs pretentious node percent (b) End-to-end delay with respect to pretentious node percent (c) Throughput vs number of transmissions (d) Packet delivery ratio vs percentage of pretentious nodes (e) Network energy consumption vs percentage of pretentious nodes.

Table 6

Comparative study of similar state-of-the-art techniques and proposed model

SNo.	Model	Literature	Techniques used in the model	Advantages	Limitations	Compared with	Applications
1	CoGKDA	[6]	Grey model and Kalman Filter	High prediction accuracy without large training data, eliminate redundant transmissions, lightweight in terms of computational complexity, low communication overhead, 35.21% and 59.85% energy savings for prediction accuracies(ɛ) 0.5 and 1 respectively.	Suitable only for networks with one sink node.	AR(3), similarity based adaptive (SAF) framework, Probabilistic adaptable query (PAQ) system	Environmental monitoring
2	Auto regressive (AR) model	[42]	Query based framework	Time series prediction evaluation, can control overfitting, clusters are well maintained, reduces the communication	Very high RMSE for AR(12) model (RMSE above 150)	AR(1) to AR(5), AR(12) and Moving average(MA(1))	Data prediction application
3	VoISRAM	[28]	Residual energy prediction model	13% boost in network life, energy utilization, delay, spatio-temporal accuracy, low time complexity, Topology management using rank based sensor service	Performs good for static scenarios rather than mobility enabled network.	Resource allocation in heterogeneous (SACHSEN) WSN, Energy aware ranking (EARM)mechanism, context- aware sensor search, selection, and ranking (CASSARAM) model,	Detection applications
4	KFSS	[36]	Maximum likelihood estimation, Fisher information, Kalman filter.	Enhanced tracking performance and accuracy, less computation load, Target detection and tracking challenge, computation time reduces by 49.5%.	Causes more communication overhead as all nodes participate in tracking process	Maximum likelihood Kalman filter based collaborative target tracking approach,	Target tracking
5	FISS	[35]	Fisher information and greedy algorithm	Low mean square error, Sensor scheduling and selection	Suitable only for weak correlation noise and centralized networks, high computational complexity	kalman filter based optimization in multi-step sensor selection strategy	Temperature monitoring, Localization for multipath networks, clock synchronization, distributed estimation
6	Mutual information based sensor selection (MISS)	[32, 34]	Mutual Information	Performs well for network with mobile nodes, high estimation accuracy,	Can be made more efficient using greedy path selection approach	Brute-force, Decoupled, Greedy algorithms with multistep algorithms	Target tracking, data aggregation using clustering
7	Proposed model	–	GMKFDA + MI	High accuracy (97.8%), high throughput, low MSE, low energy consumption (up to 30% energy savings at all prediction accuracies), low end-to-end delay, and low communication overhead	Performs well for multiple sink networks, networks with node mobility and for high number of nodes.	AR, PAQ, Grey model, Kalman filter model, FISS, KFSS, routing approaches like LEACH, TEEN, DEEC etc.	Clustered WSN for IoT event monitoring applications, prediction-based applications, detection.

6.4 Analysis of the computational complexity for the proposed method

The run-time performance of the proposed algorithm, which combines optimized k-medoids clustering, data aggregation using the prediction-based approach, and rank-based sensor selection using mutual information, is evaluated by asymptotic analysis of the individual techniques. The generalized computational complexity of k-medoids clustering is O(n²), and that of the salp swarm optimization algorithm is O(IT*n*d). The input parameters governing the algorithmic execution are the number of iterations (IT), the input weight vector (Y), and the prediction data sequence (a_i). The data aggregation model based on predictive analysis of sensor measurements combines the properties of the grey model (GM (1,1)) and the Kalman filter estimator. The grey model contributes O(n), as deduced from Equation (13). The Kalman filter estimator, which uses Equation (18), has a computational complexity of O(2n³). The rank-based mutual information has a computational complexity of O(n). Since the sensors in a cluster have to be ranked simultaneously in each run, the complexity becomes O(n log n) as the number of clusters increases. Thus, for the proposed method the training complexity reaches O(n³), which is reduced to O(n) for the testing phase.

7 Conclusion

To address the challenges raised in event monitoring and detection applications of IoT networks, a sensor selection system plays a significant role. For this purpose, a dual synchronized sensor selection model based on prediction-based data aggregation is proposed in this paper. The proposed method works per cluster, and every aggregator node contributes to estimating the predicted value of each sensor node's measurements. The prediction-based sensor selection method yields the set of all pretentious nodes, i.e., the nodes that show a high prediction error, which indicates discontinuities in the measurements of that particular sensor. The proposed method achieves an accuracy of 97.8%, reduces network delay by 40% to 50%, and lowers the total energy consumption of the network by approximately 10% to 30% when compared to its immediate rival methods. Future work can consider multi-objective optimization-based routing for event reporting and alternative techniques with lower computational complexity.



Table 3. Performance evaluation of algorithms used for optimized clustering, evaluated on dataset I

Algorithm    | BFV   | AFV   | WFV    | Time (s)
K-Means PSO  | 93.05 | 94.71 | 95.52  | 15.6
KMed-SSA     | 98.75 | 98.75 | 98.75  | 5.4
PSO          | 99.58 | 99.87 | 100.75 | 15