Abstract
Abrupt changes in sensor measurements, which indicate the occurrence of an event, are a major concern in many monitoring applications of IoT networks. Prediction-based data aggregation in wireless sensor networks plays a significant role in detecting such events. This paper introduces a prediction-based aggregation model for sensor selection, the Grey prediction model and Kalman filter-based data aggregation model with rank-based mutual information (GMKFDA-MI). It uses a dual synchronization mechanism to aggregate the data and select the nodes based on prediction and cumulative error thresholds. After deployment, the nodes are clustered using K-medoids clustering together with the Salp swarm optimization algorithm to obtain an optimized aggregator position with respect to the base station. Efficient clustering improves energy efficiency and connectivity. The experiments are carried out on real-time datasets from air pollution monitoring applications, and the results are compared with similar state-of-the-art techniques. The proposed method offers high prediction accuracy, low energy consumption, and improved network throughput. Energy savings of 10 to 30% are recorded for the proposed model relative to similar approaches, and the method achieves 97.8% accuracy. The method is well suited to applications such as event reporting, target detection, and event monitoring.
Introduction
In the era of the Internet of Things (IoT), data plays a crucial role. IoT devices and systems generate and collect data before sending it to a server or the cloud. Applications such as tracking human or animal activities, trespassing, and social interactions depend on the data these devices provide. Handling this data with precision is challenging, and it calls for an intelligent framework that can extract useful knowledge from the data. Methods that reduce redundancies and anomalies in the data are a good solution to this concern: they make the system more reliable and allow the data to be handled more efficiently and with less complexity. The data needed for applications such as surveillance, human activity tracking, and monitoring is sensed using various sensors or actuators. These sensor devices must be networked, which is where the wireless sensor network (WSN) comes into the picture. A WSN is an interconnected system of homogeneous or heterogeneous low-cost sensor nodes capable of self-configuration, sensing, and data processing. Its ad-hoc nature enables it to self-organize according to the needs of the application. Many event monitoring applications require a large number of data transmissions from each sensor node to a central controller, where the data is processed further. This procedure consumes considerable energy and shortens the lifetime of the network. Data aggregation is a suitable solution to this challenge.
Aggregation aims to reduce energy consumption by processing the raw data either at the sensing node or at intermediate nodes, also called aggregator nodes. There are many ways of aggregating the data collected by sensor nodes. Some methods aggregate conventionally, using functions such as min, max, sum, and average [1–3]. Mobile nodes that collect and aggregate data from other nodes are also used in monitoring applications, but this approach drains node energy quickly because of the mobility. A more effective way is to employ an optimized clustering approach to optimize the positions of the cluster heads and then use spatio-temporal correlation or data estimation methods to forecast the data and remove redundancies. The conditioned data is then sent to a central controller or base station, where it is analyzed for applications such as forest fire detection, identification of polluted air in a specific area, habitat monitoring, and surveillance.
A key task in event detection and reporting applications is to locate the regions where an event occurs. To do so, the approach must find the nodes whose measurements are abruptly much higher or much lower than the normal measurements of the surrounding parameters. A prediction-based approach can be adopted so that data is transmitted only when the predicted sequence shows some abruptness in the measurement. Sensor data is generally spatially and temporally correlated, which the prediction process can exploit. Processing data with high redundancy consumes more energy. Many prediction-based and correlation-based methods are used to reduce these redundancies and to extend the lifetime of the network [4–8]. The prediction-based approach helps the user detect periodic variations in the monitored measurements, which helps to control the potential risk in that region. Some model-driven techniques for monitoring applications, such as data mining, query routing, and data collection protocols, provide high accuracy and network lifetime but are complex, which makes them infeasible for such applications. This paper uses a dual synchronization mechanism to predict the data and to select sensors according to their predicted values using rank-based mutual information. The proposed method combines the merits of the grey model and the Kalman filter to estimate the data, and it improves accuracy through two tunable parameters, the prediction error threshold and the cumulative error threshold. The user sets these parameters to decide what counts as abruptness in the data, which is predicted from recent past values. Abruptness in the data is the basis for selecting sensor nodes, and only the selected nodes forward their data.
In this paper, these nodes are called pretentious nodes; they indicate the occurrence of events such as polluted areas, COVID-19 containment zones, and high-traffic paths. The experiments in this paper use a real-time air pollution dataset. The steps involved in developing the proposed prediction model in a sensor system for monitoring an event are shown in Fig. 1:
1. Data is imported from each sensor database as a time series.
2. The data is conditioned by removing redundancies, extracting useful features, and aggregating.
3. A precise prediction model is developed using a particle filter, Kalman filter, or artificial neural network and validated on various datasets.
4. The designed model is incorporated into the main wireless environment to obtain the desired results.

Figure 1. Stages in developing the proposed prediction-based model for IoT network.
The main contributions of this paper are:
1. An optimized lightweight clustering scheme, K-medoids clustering, is implemented to reduce the size of the training datasets.
2. A prediction model combining the Kalman filter and the grey model is used to predict time series data with secular trends, enhancing the accuracy and precision of the predicted values.
3. The rank-based mutual information approach is used in synchronization with the prediction-based data aggregation model (GMKFDA) to select the sensors with high cumulative and prediction errors. The cumulative error threshold is the deciding factor for sensor selection.
The remainder of the paper is structured as follows. Section 2 reviews the relevant literature and highlights the advantage of the proposed method over existing methods. Section 3 gives an overview of the proposed model along with the application for which it is used, and presents the network and energy consumption models. Section 4 details the techniques used in the proposed model, from the deployment of sensors to the selection of nodes: the clustering mechanism, the SSA algorithm, the grey prediction model, Kalman filter-based data aggregation, and prediction-based sensor selection using mutual information. Section 5 explains the working of the proposed model with an algorithm and model framework. Section 6 presents the results, discusses the performance of the proposed work across various metrics, and gives a complexity analysis of the proposed algorithm. Section 7 concludes the paper with possible future directions in this domain.
Research on data aggregation using model-driven approaches has addressed challenges such as anomaly detection, data collection, data transmission, and event reporting, and these approaches have proved efficient in enhancing the QoS parameters of WSNs. Network size is one of the challenges for application-specific IoT networks, and clustering the nodes is an efficient way to overcome it. Many clustering algorithms are used for WSNs. Low-energy adaptive clustering hierarchy (LEACH) is a well-known clustering protocol for WSNs. LEACH and its variants, such as MODLEACH, LEACH-C, and LEACH-M [9, 10], were proposed to improve the quality of service in a sensor network. In [10], the authors introduce various LEACH variants and their significance. Routing protocols similar to LEACH have also been developed in recent years for various WSN applications, such as the hybrid heterogeneous routing scheme (HHRS) proposed in [11], which divides the deployment area into regions to track the presence of animals. Other protocols for hierarchical WSNs, such as DEEC [12–14] and TEEN [15], also aim at efficiency in terms of energy and throughput in WSN and IoT networks. To enhance the quality of clustering, genetic algorithms are also integrated with clustering [16]. Optimized clustering helps achieve the desired results with very low energy consumption. Algorithms such as ant colony optimization (ACO) and swarm intelligence algorithms such as particle swarm optimization have given very good results. An optimized approach for cluster head selection using the krill herd algorithm is presented in [17], and the results are compared with LEACH and genetic optimization.
An energy-efficient multipath routing scheme for clustered WSNs using ACO is proposed in [18] and compared with three similar protocols in terms of average energy (in %), network lifetime, energy consumption, and standard deviation, i.e., the variation in energy levels across all nodes.
To extend the domain of IoT applications, model-driven approaches such as data aggregation models and data prediction models have been suggested. In [7], the authors use a prediction-based data collection model based on the autoregressive integrated moving average (ARIMA). The cluster heads share this model with their cluster member nodes to collect data based on a threshold. If the deviation between predicted and actual data exceeds the threshold, the difference is transmitted after further compression by the PCA method. In [4], abstract feature values are obtained from the nodes, and a neural network predicts the data from them to reduce redundancies. The data correlation used for prediction is computed with the MNMF model, which is based on a bidirectional long short-term memory (LSTM) network. Wei et al. [6] implemented a prediction-based data aggregation approach for a granary monitoring application. It uses the grey model and the Kalman filter to predict and aggregate the data (GMKFDA). The aggregation is based on error threshold values. This method has been validated against the traditional Kalman filter and grey prediction model techniques.
The authors in [19] used a Kalman filter together with salp swarm optimization to analyze data packets and thereby reduce redundancy in the data. The model uses ELM to detect intrusions in the form of data packets extracted from nodes in a WSN. The Kalman filter and its extension, the extended Kalman filter, are used efficiently for estimating target location and tracking objects [20]. Applications such as deep seedling detection [21], time-of-flight measurement in ultrasonic systems [22], system monitoring [23], traffic forecasting [24], and data streaming from cognitive networks [25] have been implemented with Kalman filter-based models. Similarly, grey prediction models give good prediction accuracy and success rates [8]. Integrating the grey model with other approaches such as ELM, artificial neural networks, and the Kalman filter gives an efficient model for data fusion [26] and forecasting applications [27]. Other prediction models for routing and aggregation are also used. In [5], the authors built a fuzzy-system-based trust prediction model (FTPR) that predicts node behavior in advance so that the security, network life, and packet delivery ratio (PDR) of the system can be enhanced and the end-to-end delay decreased. In [28], the authors rank sensors by their value of information (VoISRAM), using QoS, energy, spatio-temporal accuracy, and other parameters as the value-of-information attributes. Another concern in wireless networks is scalability. Large-scale WSN and IoT networks, spread over a large area for applications such as monitoring and faulty node detection, have to be managed for efficient and reliable use. These networks generally contain mobile nodes that cover an appreciable area to collect the data. In [29, 30], the authors focus on different ways of collecting data in a large-scale WSN (DCLS-WSN).
Some applications, such as event monitoring and target tracking, require specific sensors to be selected. The selection is based on performance metrics such as mutual information and Fisher information. Sensor selection methods used for target tracking applications include mutual information [31–34], Fisher information-based sensor selection [35], and Kalman filter-based sensor selection [36]. These approaches improve tracking accuracy by imposing essential constraints on the system.
Some query-based prediction models use auto-regressive approaches for aggregating data like PAQ [37]. These methods use local sensor data measurements to build a prediction model and to detect outliers. Sensor selection using Greedy algorithms [38, 39] and exhaustive search methods promise more accuracy with low root mean square error values. Some work has also been done in predicting the future energy values of the sensors using Markov and Auto-regressive (AR) models [40] to ensure high prediction accuracy.
Our paper proposes a novel method in which data aggregation is done from selected sensors by observing their predicted values using a prediction-based approach. Here we are using a grey prediction model with a Kalman filter-based data aggregation model. The selection is done using the mutual information technique which is synchronized with the prediction technique. The advantage of the proposed method over the methods mentioned above is its robustness for different scenarios in WSN. Also, it can be used for more than one application like object tracking, monitoring, and event reporting. This paper uses the proposed model for event reporting applications.
Proposed model
This section gives the overall framework of ranking-based sensor selection using a prediction model for data aggregation in IoT. Figure 2 shows an overview of the proposed scheme. The proposed method has two main parts: clustering of the randomly deployed nodes using a salp swarm optimization-based k-medoids scheme to obtain optimized aggregator nodes, and a prediction-based approach integrated with rank-based mutual information to select nodes for data aggregation. Large WSN datasets are used to evaluate the performance of the proposed method. Since the scheme is designed mainly for event reporting, most of the sensor population is organized in clusters. The salp swarm algorithm is used in conjunction with k-medoids clustering to make this clustering efficient. The resulting cluster heads have optimized positions with respect to the base station, which keeps the delay in data transmission low.

Figure 2. Overview of the proposed method.
Once the nodes are clustered at the optimized positions, they start sensing the surrounding parameters. These parameters determine whether an event has occurred in that area. To reduce the computational complexity and enhance the prediction accuracy of the system, prediction-based data aggregation using both the grey prediction model and the Kalman estimation model is employed. Since only a few nodes in a cluster are involved in an event, a mutual information-based sensor selection method is integrated with the prediction model so that prediction and prediction-based sensor selection run in synchronization. The aggregated data from the selected sensors is delivered to the base station, which forwards it to the cloud-enabled IoT devices through an edge network. The edge network reduces the latency of data delivery by storing frequently needed data in the vicinity of the IoT devices. It reduces both the size of the data to be transmitted and the length of the delivery path to the IoT devices, thereby mitigating congestion and transmission cost.
A WSN model consisting of ‘S’ sensor nodes, placed randomly and clustered using the K-medoids algorithm, is used in this paper. The cluster centers are taken as aggregator nodes, which have higher energy than the ordinary sensor nodes. The aggregators are assumed to have the computational ability to aggregate values according to the proposed algorithm. Each aggregator is assumed to occupy the position in its cluster that is at minimum distance from the base station (BS). All aggregators send their data to the aggregator nearest to the BS. The BS is considered to have no energy losses and a self-replenishing energy source. The following assumptions are made in the proposed work:
1. All nodes are static, and they transmit their data to their cluster's aggregator node through either a single hop or multiple hops.
2. The sensor nodes have a limited, non-replenishable energy source.
3. All aggregators send their data periodically to the sink node or to other aggregator nodes.
4. Source and sink nodes have equal lifetimes and follow the same prediction-based approach, i.e., they work in synchronization to execute the proposed model.
Energy consumption model
WSNs are energy-constrained networks, so energy has to be managed according to the requirements of the application. This paper mainly focuses on the energy of the selected sensor nodes and aggregator nodes. The total associated energy cost is given by:
EnTX is the energy required for transmitting N packets at distance D given by Equation (2). EnRX is the energy required for receiving N packets at distance D which is expressed in Equation (3). Enn1 is the energy when the network is at an idle state. EnS is the energy required for sensing.
Enel is the electronic energy given by Equation (5). Ene represents the energy consumption for per-bit transmission. The transmitted energy is represented by two models, the free space model and the multipath fading model. The free space model is used when the distance (D) is less than the threshold (D0), and the multipath fading model is used when D is greater than or equal to D0. The threshold (D0) is given by:
Enfs and Enpw represent the energies used by the amplifier.
Enag is the energy used for data aggregation at the aggregator nodes in WSN.
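As a rough illustration of how these terms fit together, the sketch below assumes the widely used first-order radio model: the transmit cost grows with D^2 below the threshold D0 and with D^4 at or above it, and D0 is taken as sqrt(Enfs / Enpw). The constant values, the per-bit accounting, and all function names are illustrative assumptions, not values taken from this paper.

```python
import math

# Illustrative constants (assumed for this sketch, not the paper's settings).
EN_EL = 50e-9        # J/bit, electronics energy (Enel)
EN_FS = 10e-12       # J/bit/m^2, free-space amplifier energy (Enfs)
EN_PW = 0.0013e-12   # J/bit/m^4, multipath amplifier energy (Enpw)
EN_AG = 5e-9         # J/bit, aggregation energy (Enag)


def threshold_distance(en_fs=EN_FS, en_pw=EN_PW):
    """Crossover distance D0 between the free-space and multipath models."""
    return math.sqrt(en_fs / en_pw)


def tx_energy(n_bits, distance):
    """Energy to transmit n_bits over `distance` metres."""
    if distance < threshold_distance():
        return n_bits * (EN_EL + EN_FS * distance ** 2)   # free-space model
    return n_bits * (EN_EL + EN_PW * distance ** 4)       # multipath model


def rx_energy(n_bits):
    """Energy to receive n_bits (electronics only)."""
    return n_bits * EN_EL


def aggregation_energy(n_bits):
    """Energy spent by an aggregator to fuse n_bits."""
    return n_bits * EN_AG


def total_energy(n_bits, distance, idle=0.0, sensing=0.0):
    """Sum of transmit, receive, idle, sensing and aggregation costs."""
    return (tx_energy(n_bits, distance) + rx_energy(n_bits)
            + idle + sensing + aggregation_energy(n_bits))
```

With these assumed constants the crossover distance is about 87.7 m, so a 50 m link is costed with the free-space term and a 100 m link with the multipath term.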
Optimized K-medoids based clustering using Salp-swarm algorithm (KMed-SSA)
There are many methods of clustering: some use the spatiotemporal correlation of the data values, some use the Euclidean distance between points, and some use the density of data points as the basis of clustering. In this article, we use Mahalanobis distance-based k-medoids clustering, which produces efficient cluster centers. The Mahalanobis distance is shown as:
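For reference, the Mahalanobis distance between points x and y under a covariance matrix S is commonly defined as sqrt((x - y)^T S^-1 (x - y)). A minimal sketch for two-dimensional node coordinates follows; the helper names and the use of the sample covariance of the deployed positions are assumptions made for illustration.

```python
import math


def covariance_2d(points):
    """Sample covariance (sxx, sxy, syy) of a list of (x, y) points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return sxx, sxy, syy


def mahalanobis_2d(a, b, cov):
    """Mahalanobis distance between 2-D points a and b for covariance cov."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    dx, dy = a[0] - b[0], a[1] - b[1]
    # Quadratic form with the closed-form inverse of a 2x2 symmetric matrix.
    q = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(max(q, 0.0))
```

With an identity covariance the result reduces to the Euclidean distance, which is a convenient sanity check.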
To make the clustering more efficient in terms of energy consumption and network lifetime, we combine salp swarm optimization with the k-medoids clustering technique. This hybrid approach optimizes the positions of the aggregator (collector) nodes, which speeds up data transmission to the base station and thereby reduces network delay. Mirjalili et al. introduced the salp swarm algorithm (SSA), a swarm intelligence technique inspired by the swarming behavior of salps, in 2017 [41].
Definition 1
The best aggregator node in each cluster is the one that can efficiently send the data from the selected sensors of its cluster to the next aggregator connected to the aggregator node nearest to the BS, preventing unwanted energy wastage in the network and thereby extending the network life.
The fitness function computed by the BS aims to place the highest-energy cluster head in the cluster nearest to it, with all aggregators placed in their clusters so that each occupies the position nearest to the base station within its cluster's communication range. The fitness function given by Equation (7) should be minimized.
En (i) is the energy of each node which must lie between En (min) and En (max) i.e., En (min) ≤ En (i) ≤ En (max) and En (CHj) is the energy of each aggregator node and it is assumed that En (CHj) ≈ En (max). The value of γ should lie between 0 and 1 i.e., 0< γ<1.
The K-medoids phase proceeds as follows:
1. Initialize the data set consisting of the points si.
2. Evaluate the mean value and obtain the threshold.
3. Assign the mean value to CH1 for the first cluster.
4. Using Equation (6), calculate the distance of each point from the cluster center. The difference in distance between the points within a cluster is assigned Dii.
5. Compare each Dii, replace the higher value with the lower one, and select the lower value Dii as the centroid of that cluster.
6. Execute steps 2 to 5 iteratively until the optimal center is found.
7. If the value of the center is outside the threshold, construct a new cluster with a new center value.
8. Repeat the algorithm until all data points are grouped into individual clusters.
The SSA phase then proceeds as follows:
1. Initialize the salps by taking all centroid points CHj (j = 1, 2, …, k) from the k-medoids algorithm, where k is the number of clusters.
2. Calculate the fitness of each aggregator node (CHj) from Equation (7).
3. Update c1 from the equation.
4. For each aggregator node, if j = 1, update the position of the leading salp, i.e., the aggregator nearest to the BS, by using the following equation:
5. Otherwise, update the position of all follower salps using the below equation:
6. Adjust the aggregators based on the upper and lower limits.
7. Repeat the steps until the search for the BS completes and all the aggregators are optimally placed near the BS.
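The salp position updates can be sketched with the update rules from the original SSA formulation: c1 = 2·exp(-(4l/L)^2) for iteration l of L, a leader that moves around the food source, and followers that move to the midpoint between themselves and the salp ahead. Treating the BS position as the food source, the 0.5 switch on the random number c3, and the function name `ssa_step` are assumptions of this sketch rather than details confirmed by the text.

```python
import math
import random


def ssa_step(salps, food, lb, ub, iteration, max_iter, rng=random):
    """One SSA iteration; salps[0] is the leader, the rest are followers."""
    dims = len(food)
    # c1 shrinks over the iterations, moving from exploration to exploitation.
    c1 = 2.0 * math.exp(-((4.0 * iteration / max_iter) ** 2))
    updated = []
    for i, pos in enumerate(salps):
        if i == 0:
            # Leader: step around the food source (assumed here to be the BS).
            new_pos = []
            for j in range(dims):
                c2, c3 = rng.random(), rng.random()
                step = c1 * ((ub[j] - lb[j]) * c2 + lb[j])
                new_pos.append(food[j] + step if c3 >= 0.5 else food[j] - step)
        else:
            # Follower: midpoint between itself and the salp ahead of it.
            new_pos = [(pos[j] + updated[i - 1][j]) / 2.0 for j in range(dims)]
        updated.append(new_pos)
    # Keep every coordinate inside the lower and upper limits.
    return [[min(max(v, lb[j]), ub[j]) for j, v in enumerate(p)] for p in updated]
```

In a full optimizer this step would sit inside a loop that re-evaluates the fitness of each aggregator after every update and keeps the best position found so far.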
The data sensed by each node in a network follows a scheme that decides how the sensed data is collected by the aggregator node. Three such schemes are governed by the application layer of each node. In [6], a brief description of the three schemes namely Push, Pull, and the integration of both Push and Pull is given. In this paper, we are employing the GMKFDA method which works on a prediction-based aggregation of data and uses the Push scheme. The Push scheme focuses on prediction-based mutual support between the sensor node and aggregator node. Once the data is sensed the sensor node sends it straight away to the sink without taking link quality, channel characteristics, transceiver parameters, etc., into account.
Grey model
The grey model is a prediction model used to estimate or forecast a data sequence; it requires only a few data points or sensor measurements, even under uncertainty. The grey model is known for its effective prediction of data that include secular trends. In this paper, the first-order grey model GM(1,1) is used, whose prediction sequence is given by:
And the prediction error is given by:
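A minimal sketch of the textbook GM(1,1) procedure may help here: accumulate the series, fit the development coefficient a and grey input b by least squares on the background values, and difference the fitted accumulated series to forecast. The function names are hypothetical, and the paper's own notation may differ from this standard form.

```python
import math


def gm11_fit(x0):
    """Least-squares estimate of (a, b) in x0(k) + a*z1(k) = b."""
    n = len(x0)
    if n < 4:
        raise ValueError("GM(1,1) needs at least four samples")
    # 1-AGO: accumulated generating sequence.
    x1, total = [], 0.0
    for v in x0:
        total += v
        x1.append(total)
    # Background values: means of consecutive accumulated values.
    z = [0.5 * (x1[k] + x1[k - 1]) for k in range(1, n)]
    y = x0[1:]
    m = n - 1
    sz, sy = sum(z), sum(y)
    szz = sum(v * v for v in z)
    szy = sum(zk * yk for zk, yk in zip(z, y))
    det = m * szz - sz * sz
    a = (sz * sy - m * szy) / det
    b = (szz * sy - sz * szy) / det
    return a, b


def gm11_predict(x0, steps=1):
    """Forecast the next `steps` values of the original series x0."""
    a, b = gm11_fit(x0)
    n = len(x0)
    if abs(a) < 1e-12:
        return [b] * steps          # limiting case: no trend
    def x1_hat(k):
        return (x0[0] - b / a) * math.exp(-a * k) + b / a
    return [x1_hat(k) - x1_hat(k - 1) for k in range(n, n + steps)]
```

The prediction error for a period is then the difference between the sensed value and the corresponding forecast.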
The Kalman filter works in two phases: the prediction phase and the update phase. Kalman-based prediction removes the randomness in the predicted data. In the prediction phase, an estimate for the current time is produced from the data of the previous period. The outputs of this phase are the prediction model and the covariance model, given by the following equations:
The update phase refines the prediction model to produce a more precise model. The prediction and covariance model for the update phase is given by equations:
For simplicity, let A (n) = 1, B (n) = Q (n) = 0, R (n) = H (n) = I
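Under these simplifications the filter reduces to a scalar recursion. The sketch below uses the standard predict and update equations with A = 1, B = Q = 0 and R = H = 1 as defaults; the function name and the initial covariance of 1.0 in the example are assumptions.

```python
def kalman_step(x_est, p_est, z, a=1.0, q=0.0, h=1.0, r=1.0):
    """One predict/update cycle of a scalar Kalman filter.

    x_est, p_est: previous state estimate and its error covariance.
    z: new measurement. Returns the updated (state, covariance).
    """
    # Prediction phase: project the state and covariance forward.
    x_pred = a * x_est
    p_pred = a * p_est * a + q
    # Update phase: blend prediction and measurement via the Kalman gain.
    gain = p_pred * h / (h * p_pred * h + r)
    x_new = x_pred + gain * (z - h * x_pred)
    p_new = (1.0 - gain * h) * p_pred
    return x_new, p_new


# Example: start at 10.0 with covariance 1.0, then see readings 12.0 and 14.0.
x, p = kalman_step(10.0, 1.0, 12.0)
x, p = kalman_step(x, p, 14.0)
```

With Q = 0 the gain shrinks at every step, so the estimate behaves like a running mean of the initial value and the measurements.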
The GMKFDA combines both the attributes of the grey model as well as the Kalman filter model to enhance the system accuracy of the prediction process. The reason behind using GMKFDA is that this technique can remove the deficiencies which are faced by the individual grey model and the Kalman model.
This prediction model works on weights. Optimal weights which are the result of the reduction in the sum of error squares in prediction are assigned to the nodes.
The generalized approach of GMKFDA is discussed here.
Suppose $Y = (y_1, y_2, \ldots, y_p)^T$ is the weight vector for the ‘p’ prediction models,
Let a(n) be the sequence of actual data sensed for recent n periods and
The combined technique (grey model and Kalman filter) gives its prediction sequence as:
To simplify the model, let
Thus, the GMKFDA model will become a weighted average of meta prediction models. The prediction data sequence for ith approach is given by:
Let er ij be the error in prediction of jth datum for ith approach.
er i is the error vector given by:
Thus, the optimal weight vector is obtained by minimizing J(n) by the least-squares method. The optimal weight vector is:
and
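For the two-model case (grey model and Kalman filter), the weights that minimize the sum of squared combined errors have a simple closed form. The sketch below assumes the weights are constrained to sum to one, which is what makes the combination a weighted average; the function names are hypothetical.

```python
def optimal_weights(actual, pred_a, pred_b):
    """Weights (w_a, w_b), w_a + w_b = 1, minimizing the squared combined error."""
    e1 = [a - p for a, p in zip(actual, pred_a)]
    e2 = [a - p for a, p in zip(actual, pred_b)]
    e11 = sum(v * v for v in e1)                 # sum of squared errors, model a
    e22 = sum(v * v for v in e2)                 # sum of squared errors, model b
    e12 = sum(u * v for u, v in zip(e1, e2))     # cross term
    denom = e11 + e22 - 2.0 * e12
    if denom == 0:
        return 0.5, 0.5                          # identical errors: any split works
    w_a = (e22 - e12) / denom
    return w_a, 1.0 - w_a


def combine(pred_a, pred_b, weights):
    """Weighted average of two prediction sequences."""
    w_a, w_b = weights
    return [w_a * a + w_b * b for a, b in zip(pred_a, pred_b)]
```

Without further constraints the resulting weights can fall outside the interval from 0 to 1 when the two error series are strongly correlated.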
In this paper, sensor selection is done using the rank-based mutual information (MI) technique. Information-based sensor management tends to work on uncertainty in estimating the event occurrence. For this purpose, the entropy which represents randomness is a key factor to estimate the region state through sensor measurements. The sensor selection problem for applications including event detection can be solved by using the relationship between mutual information and entropy.
In the case of successful predictions, the PVQ and predicted values from the ADQ of the sensor give high mutual information with the PVQ of aggregators. Our objective is to select those sensors which give minimum or zero mutual information of their predicted values with the PVQ of the aggregator of that cluster. For this purpose, we have to check the below equation for all sensors in a cluster. Find a sensor ‘i’ that satisfies the following condition:
In other words,
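As an illustration of the selection criterion, the sketch below estimates the mutual information between each sensor's queue and the aggregator's PVQ and ranks the sensors so that the lowest value comes first. The plug-in histogram estimator, the fixed bin width, and the function names are assumptions, since the estimator is not specified here.

```python
import math
from collections import Counter


def discretize(values, bin_width=1.0):
    """Map real-valued readings to integer bin indices."""
    return [math.floor(v / bin_width) for v in values]


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


def rank_sensors(sensor_queues, aggregator_queue, bin_width=1.0):
    """Rank 1 goes to the sensor sharing the least information with the aggregator."""
    agg = discretize(aggregator_queue, bin_width)
    scores = {sid: mutual_information(discretize(q, bin_width), agg)
              for sid, q in sensor_queues.items()}
    ordered = sorted(scores, key=lambda sid: scores[sid])
    return {sid: rank for rank, sid in enumerate(ordered, start=1)}, scores
```

Short queues give noisy estimates with this estimator, so the bin width and queue length would need tuning for real sensor data.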
The sensor nodes are deployed randomly, covering the whole area of interest. The communication range is considered to be fixed and the same for each node. Each node is connected to its immediate next node or to the cluster head. It is assumed that all the aggregators are connected and that only the cluster head in the vicinity of the base station aggregates the data and transmits it to the base station. Figure 3 depicts the prediction and sensor selection process in a clustered WSN with a BS. The clusters are formed using the K-medoids technique, and the medoid of each cluster is chosen as its cluster head or aggregator. To optimize the positions of the aggregators, salp swarm optimization (SSA) is integrated with the k-medoids clustering technique. Figure 4 shows the flow chart of the KMed-SSA clustering.

Figure 3. Sensor selection mechanism using prediction-based data aggregation and mutual information in a cluster.

Figure 4. Flow chart for KMed-SSA.
Event tracking and monitoring require an estimate of the region state, which depends on the sensor measurements. For this purpose, predictive analytics is used in this paper. Each sensor periodically produces a sensed value. Based on the actual sensed values, either past or current, the sensed data is predicted at each cluster's aggregator node, which is also responsible for selecting sensors. The same prediction algorithm has to be employed on both sensor and aggregator nodes. The sensor node is assumed to possess sufficient energy for data storage. All aggregators broadcast their cumulative and prediction error thresholds, φ and α respectively, to their cluster members according to the application requirements. Both φ and α are tunable parameters. The sensor node maintains two equal-length data queues, the actual data queue (ADQ) and the predicted value queue (PVQ). The ADQ stores the series of actual data and controls the cumulative error. The PVQ contains the predicted data values. Once the ADQ and PVQ are formed, the aggregator node also constructs the corresponding PVQ for each node, i.e.,
PVQsensor (j) = PVQaggregator (j) for sensor j. Each node senses data and stores the first ‘s’ sensed data into PVQ and ADQ of sensors and sends them to the PVQ of aggregators.
Let ai be the data item in the queue. Initially, ADQ (j) = PVQsensor (j) = PVQaggregator (j) ={ a1, a2, … as } for a random sensor j. Let, as+1,
When the prediction and cumulative errors are within their thresholds, the prediction is said to be successful and the node need not send its data to the aggregator. The aggregator considers
So, the updated queue will be:
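The dual-threshold check described above can be sketched as follows. The definition of the cumulative error as the sum of absolute differences between the ADQ and the candidate PVQ, the fixed-length deques, and the function name `process_reading` are assumptions made for illustration; only the roles of α, φ, ADQ and PVQ come from the text.

```python
from collections import deque


def process_reading(actual, predicted, adq, pvq_sensor, pvq_aggregator, alpha, phi):
    """Handle one sensing period; return True if the sensor must transmit.

    adq, pvq_sensor and pvq_aggregator are deques sharing the same maxlen.
    alpha is the prediction error threshold, phi the cumulative error threshold.
    """
    adq.append(actual)
    # Candidate PVQ if the predicted value were accepted for this period.
    candidate = deque(pvq_sensor, maxlen=pvq_sensor.maxlen)
    candidate.append(predicted)
    prediction_error = abs(actual - predicted)
    cumulative_error = sum(abs(a - p) for a, p in zip(adq, candidate))
    success = prediction_error <= alpha and cumulative_error <= phi
    # On success both sides keep the prediction; otherwise the actual value
    # is sent and both queues are updated with it, keeping them in sync.
    value = predicted if success else actual
    pvq_sensor.append(value)
    pvq_aggregator.append(value)
    return not success
```

Because both queues receive the same value in either branch, the sensor and its aggregator stay synchronized without exchanging data after a successful prediction.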
Reliable data delivery based on the acknowledgment (ACK) signal from the receiver is assumed. Mutual information is used to solve the sensor selection problem. Sensors that sense abrupt measurements have higher entropy and contribute a very low value of mutual information. These nodes, called pretentious nodes, are assigned high ranks, and the aggregators collect the data predicted by them. The rank-based MI algorithm works in synchronization with the hybrid predictive model GMKFDA. In addition, the next-period prediction queue of each sensor is synchronized with that of its aggregator node, hence the name dual synchronization method for the proposed scheme. The ranks are assigned iteratively for every node. There may be more than one pretentious node, and all of them receive higher ranks than normal nodes. Figure 5 shows the flow schematic for sensor selection at the aggregator node using GMKFDA with the rank-based MI method. The algorithm for GMKFDA with rank-based sensor selection using MI is given below.

Figure 5. Proposed model for cluster-based GMKFDA-MI technique.
In this paper, we validate our algorithm on two air pollution datasets: the “Air quality dataset” from the UCI repository, published in 2016, and a dataset from an experiment carried out over one day on busy roads near hospital premises. The subsection below describes the example application scenario.
Real-world example
A random deployment is considered in which the nodes are equipped to sense various gases in order to monitor pollutants in the air. Figure 6 shows a scenario for air pollution monitoring in a real-time environment near hospital premises. The sensing metrics G1, G2, …, Gm evolve over time; for example, G1 = CO concentration, G2 = SO2 concentration, and so on. We have designed a model that works with every node producing data. Repeated readings are taken as one, and the model updates the aggregator node about the readings, so communication happens only when the readings vary. The classification of pretentious nodes, which produce abrupt and unusual readings, is done at the aggregator node using the rank-based mechanism proposed in the model. It is necessary to monitor and track the regions where the content of pollutants such as carbon monoxide, benzene, carbon dioxide, nitrogen oxide, and sulphur dioxide is too high. Various gas sensors mounted on a node are used to collect the data in various regions. The simulation parameters for the scenario are given in Table 1.

Randomly deployed clustered sensor network employing the proposed methodology for air pollutant monitoring application.
Scenario parameters
The nodes are clustered, and each cluster and its aggregator have a significant transmission and communication range, which keeps them connected to other cluster aggregators or to the base station (BS). Each aggregator receives the data from all the member nodes belonging to its cluster. After performing the prediction and sensor selection tasks, the aggregator nodes send the aggregated data values to the base station, which is connected to other edge devices or users via the cloud. The details of the datasets are given in Table 2.
The description of the two air quality monitoring datasets used in the experimentation
The ranks are assigned to each sensor according to the criteria given in the GMKFDA-MI algorithm. Table 4 shows an example in which the nodes are ranked based on their predicted and actual measurements. After the prediction protocol, the dataset is classified using some well-known classifiers, namely Logistic regression (LR), Random Forest (RF), and AdaBoost, as shown in Table 5, and performance metrics such as accuracy, precision, recall, and F1-score are compared with those of other existing methods. These three classifiers perform better than other classifiers such as SVM and Decision trees when applied to our datasets, so their performance metrics are used for validation against other existing models.
Example assigning a rank to the nodes by applying mutual information on the prediction algorithm
Performance evaluation of proposed method and other existing methods for both datasets
Abbreviations: KF, Kalman filter; SS, sensor selection; FI, Fisher Information; LR, Logistic regression; DT, Decision trees; RF, Random Forest. The bold values signify the best values achieved by the proposed model over other existing models.
The performance measures taken to validate the proposed method with other existing models are defined below:
Cumulative distribution function (CDF)
The CDF of a real-valued random variable R, evaluated at a real number r, is the probability that R takes a value less than or equal to r. Mathematically, it is given by:
$$F_R(r) = P(R \le r)$$
The CDF of a continuous random variable with probability density function $f_R$ is given by:
$$F_R(r) = \int_{-\infty}^{r} f_R(t)\,dt$$
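For a finite sample, such as a set of observed prediction errors, the CDF can be estimated empirically as the fraction of samples that do not exceed r. The following minimal sketch illustrates this; the function name and the error values are hypothetical and are not taken from the datasets used in the paper.

```python
def empirical_cdf(samples, r):
    """Fraction of samples that are less than or equal to r."""
    return sum(1 for s in samples if s <= r) / len(samples)


# Hypothetical absolute prediction errors for one sensor.
errors = [0.2, 0.4, 0.9, 1.0, 1.6, 2.3, 0.1, 0.7]

# Share of predictions whose error stays within a threshold of 1.
share_within_threshold = empirical_cdf(errors, 1.0)
```

Evaluated at the prediction error threshold, this quantity is the share of predictions that fall within the tolerated error.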
Mean square error (MSE)
MSE evaluates the quality of an estimator. It is the average of the squared errors, where an error is the difference between a predicted value and the corresponding actual value. MSE is non-negative by definition, and in practice its value is greater than zero. A lower MSE indicates a more accurate predictor model.
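The definition can be written in a few lines of code. The sketch below is illustrative only; the function name and the sample values are assumptions.

```python
def mean_square_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical CO readings and their predictions.
actual_values = [2.0, 2.5, 3.0, 3.5]
predicted_values = [2.0, 2.0, 3.0, 4.5]
mse = mean_square_error(actual_values, predicted_values)  # (0 + 0.25 + 0 + 1.0) / 4
```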
End-to-end delay
It is the time that a packet takes to get transmitted from the source node to the sink node. Generally, its unit is taken as seconds. The end-to-end delay must be as low as possible to ensure low network latency.
Throughput
It is defined as the average successful packet delivery rate in a network. The high value of this metric is desirable. It is measured in bits/sec.
Packet delivery ratio (PDR)
It is given by the ratio between the number of data packets successfully delivered and the total packets transmitted. The high value of PDR indicates good system throughput.
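The three network metrics defined above can be computed from a simple packet log. The sketch below assumes a hypothetical log format in which each packet is a tuple of size in bits, send time, and receive time (None if the packet was lost); neither the format nor the function name comes from the paper.

```python
def network_metrics(packets, duration_s):
    """Compute mean end-to-end delay (s), throughput (bits/s), and PDR.

    packets: list of (size_bits, sent_at, received_at) tuples,
             with received_at set to None for lost packets.
    duration_s: length of the observation window in seconds.
    """
    delivered = [p for p in packets if p[2] is not None]
    pdr = len(delivered) / len(packets)
    throughput = sum(size for size, _, _ in delivered) / duration_s
    mean_delay = (
        sum(rx - tx for _, tx, rx in delivered) / len(delivered)
        if delivered else None
    )
    return {"delay_s": mean_delay, "throughput_bps": throughput, "pdr": pdr}


# Hypothetical log: four packets sent in a 4-second window, one of them lost.
log = [
    (1000, 0.0, 0.5),
    (1000, 1.0, 1.25),
    (1000, 2.0, None),
    (1000, 3.0, 3.25),
]
metrics = network_metrics(log, duration_s=4.0)
```

Only delivered packets contribute to the delay and throughput figures, while the PDR is taken over all transmitted packets.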
Energy consumption
The total energy consumed by every sensor node involved in transmission, reception, idle, and sleep actions is termed the energy consumption of the network. It is measured in millijoules.
Accuracy
It is the proportion of correct classifications out of all classifications. A higher accuracy indicates a more accurate model. It is expressed as a percentage. A perfect model has 100% accuracy, which is not a practical scenario.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP is the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives.
Precision/ positive predictive value
Precision answers the question of how many of the positive classifications are actually correct. A classification with no false positives has a precision of 1.
Recall/ sensitivity
Recall gives the proportion of actual positive data items that are correctly identified. A model with no false negatives has a recall of 1.
F-score
It is the harmonic mean of the recall and precision values, i.e., a weighted average of the two. It is a measure of accuracy in the testing phase.
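The four classification measures follow directly from the confusion-matrix counts. The sketch below uses the standard definitions; the function name and the example counts are hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-score: harmonic mean of precision and recall.
    f_score = (
        2 * precision * recall / (precision + recall)
        if (precision + recall) else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


# Hypothetical counts: 40 TP, 50 TN, 10 FP, 0 FN.
scores = classification_metrics(tp=40, tn=50, fp=10, fn=0)
```

With no false negatives the recall is 1, while the ten false positives reduce the precision to 0.8.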
In the above context, a true positive indicates that the predicted value is actually correct and is also classified as correct; a true negative indicates that the predicted value is incorrect and is classified as incorrect; a false positive indicates that the predicted value is incorrect but is classified as a correct prediction; and a false negative indicates that the predicted value is correct but is classified as an incorrect prediction. In Table 3, it can be observed that KMed-SSA clustering gives zero standard deviation, i.e., the best, average, and worst fitness values are the same. Also, the time taken by the KMed-SSA algorithm is much lower than that of the other two algorithms.
Performance evaluation of algorithms used for optimized clustering evaluated on dataset I
Abbreviations: PSO, particle swarm optimization; KMED-SSA, k-medoids with salp swarm algorithm; BFV, best fitness value; AFV, average fitness value; WFV, worst fitness value.
The prediction-based data aggregation used in the paper is independent of structure and topology, so random sensor measurements of carbon monoxide (CO) are taken for 500 data items. In the proposed method, the prediction and cumulative error thresholds (α and φ, respectively) are application-specific factors that depend on the user's requirements for data accuracy. For the CO measurements, the tolerated error thresholds are relatively small values. In our experiment, we take α = 1 and, for simplicity, let α = φ. In Fig. 7(a), the Kalman filter-based data aggregation model performs better than the grey model-based aggregation. The CDF curve shows that the communication energy savings for the Kalman filter model are higher than those for the grey-based model at a prediction error threshold of 1 (α = 1). The proposed model has a much higher CDF value for prediction errors than the other two methods. Although it requires more computation, it achieves high energy savings and thus outperforms the other two models. Figure 7(b) plots the number of successful predictions against the prediction error threshold. It can be inferred that, owing to the use of PVQ instead of ADQ in the prediction process, the number of successful predictions is high for the GMKFDA-MI algorithm. The proposed method also reduces energy consumption by preventing redundant communications and overhead. The auto-regressive model [42] performs worst in this case, whereas the query-based model (PAQ) [37] gives an average performance.
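The role of the two thresholds can be illustrated with a generic dual-prediction loop, in which the node and its aggregator share the same prediction and the node transmits only when the prediction fails. The sketch below is an assumption-laden illustration, not the GMKFDA model: it uses the last synchronized reading as a stand-in predictor instead of GM(1,1) with a Kalman filter, and it assumes that a transmission is triggered when either the single-step error exceeds α or the error accumulated since the last synchronization exceeds φ. The function name and readings are hypothetical.

```python
def dual_prediction_transmissions(readings, alpha=1.0, phi=1.0):
    """Return the indices of the readings that the node would transmit.

    Stand-in predictor (assumption): both sides predict the last synchronized value.
    A reading is sent when |error| > alpha or the cumulative error exceeds phi.
    """
    shared_value = readings[0]   # first reading is always transmitted
    cumulative_error = 0.0
    sent = [0]
    for k in range(1, len(readings)):
        error = abs(readings[k] - shared_value)
        cumulative_error += error
        if error > alpha or cumulative_error > phi:
            shared_value = readings[k]   # node and aggregator resynchronize
            cumulative_error = 0.0
            sent.append(k)
    return sent


# Hypothetical CO readings with one abrupt jump at index 4.
co_readings = [2.0, 2.1, 2.2, 2.1, 6.0, 6.1, 6.0]
sent_indices = dual_prediction_transmissions(co_readings, alpha=1.0, phi=1.0)
```

In this example only the first reading and the abrupt jump are transmitted; the remaining five readings are suppressed because the aggregator's prediction stays within both thresholds.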

Comparison of CDFs and number of successful predictions for the proposed model with two other existing models (a) Cumulative distribution functions for prediction errors when α= 1. (b) Evaluation of different prediction success rates for different values of α.
The proposed model is trained and tested on the two datasets described in Table 2. The QoS parameters obtained for both datasets are compared against some pre-existing cluster-based models such as DEEC [12–14], LEACH [9, 10], and TEEN [15]; some sensor ranking models such as FTPR [5], VoISRAM [28], DCLS-WSN [29, 30], and HHRS [11]; and some sensor selection models based on predictive analysis, such as Kalman filter-based sensor selection (KFSS) [36], Fisher information-based sensor selection (FISS) [35], Greedy algorithms [38, 39], and exhaustive search methods like Brute force. Many clustering-based routing analyses show that DEEC performs better than LEACH, and this is also observed here. Figure 8 shows the comparative analysis of the proposed method with other methods on dataset I. Figure 8(a) compares the mean square error of the proposed method with that of other sensor selection methods. The proposed method performs very close to the KFSS model, which exhibits high accuracy among the other methods. It is observed that the proposed method shows a 66.6% decrease in MSE when compared with the KFSS model. The FISS, Greedy, and exhaustive methods use estimation theory, so the mean square error is calculated for them. These methods show a 75% to 80% increase in MSE compared with the proposed model. Figure 8(b) depicts the end-to-end delay of four methods and the proposed method against the percentage of pretentious nodes. The delay for the proposed method is the lowest of all methods. The delay increases by 45% to 52% for the FTPR, VoISRAM, DCLS-WSN, and LEACH approaches compared with the novel prediction model. Owing to the use of prediction and sensor selection, data is transmitted only from the selected nodes; thus, the proposed method has very low latency and outperforms other similar data collection and sensor selection methods.
Figure 8(c) shows the throughput analysis for DEEC, LEACH, TEEN, and HHRS. The throughput of DEEC, LEACH, and TEEN as a function of the number of transmissions is very similar because these three protocols are based on clustering. However, HHRS performs better than these three methods. The proposed method shows 18.7% higher throughput than the HHRS method. The HHRS method suffers from high packet loss, which is why its throughput is lower than that of the novel prediction model. The reduction in throughput for DEEC, LEACH, and TEEN is 21%, 35%, and 57.8%, respectively, compared with the throughput of the novel method. Figure 8(d) gives the packet delivery ratio for the proposed method along with four other methods: LEACH, DEEC, HHRS, and FTPR. After the proposed method, FTPR shows a higher PDR than LEACH, DEEC, and HHRS. Decreases in PDR of 63%, 71%, and 87% are observed for the DEEC, HHRS, and LEACH protocols, respectively, compared with the proposed novel method. The proposed method employs clustering along with a predictive sensor selection scheme, which is not present in LEACH, DEEC, HHRS, and FTPR, and thus it achieves a high PDR. In Fig. 8(e), the energy consumption is calculated in joules with respect to the percentage of pretentious nodes. In FTPR, the usage of memory leads to more energy consumption. The proposed method consumes 66% less energy than the FTPR method. The LEACH protocol consumes energy in the construction of clusters and records 71.4% higher energy consumption than the proposed approach. It has been observed that the proposed technique uses 25% to 33% less energy than VoISRAM and KFSS.

Comparison of performance measures for validating the proposed method against other models for the dataset I. (a) Mean square error vs pretentious node percent (b) End-to-end delay with respect to pretentious node percent (c) Throughput vs number of transmissions (d) Packet delivery ratio vs percentage of pretentious nodes (e) Network energy consumption vs percentage of pretentious nodes.
Figure 9 gives the performance and comparative analysis of the proposed algorithm trained and tested on dataset II for the same performance metrics as discussed for Fig. 8. Sensor selection methods such as KFSS and FISS use only the correlation between the data measured by the sensors and not prediction, so they do not perform as well as the proposed model. The greedy approaches use estimation theory and give high error, whereas KFSS uses a Kalman filter and therefore performs well in terms of mean square error. In Fig. 9(a), KFSS and FISS show 66% and 75% higher error, respectively, than the proposed method. The proposed method shows an approximately 85% decrease in MSE in comparison with the exhaustive and greedy methods. The end-to-end delay for LEACH is the highest and is 55% higher than that of the proposed method, as shown in Fig. 9(b). Since all nodes in a cluster are involved in the transmission, the delay is high for the LEACH protocol. The reduction in delay for the proposed method against FTPR, DCLS-WSN, and VoISRAM is 42.8%, 53%, and 54%, respectively, as the number of pretentious nodes increases. Figure 9(c) depicts the throughput versus the number of transmissions. The highest throughput is shown by HHRS. The throughput for DEEC, LEACH, and TEEN is 25%, 42.8%, and 81.8% lower, respectively, than that of the proposed method. Figure 9(d) gives the PDR for FTPR, HHRS, DEEC, LEACH, and the proposed model. As the number of pretentious nodes increases, the PDR decreases. The highest PDR is observed for the proposed method, with a value of 82, whereas LEACH has the lowest, with a value of 12. FTPR records a 13.8% lower PDR, and HHRS and DEEC show 64% and 69.5% reductions in PDR, respectively, compared with the proposed method. In Fig. 9(e), after LEACH, the energy utilization for FTPR is the highest owing to the usage of memory in the FTPR protocol. Low energy consumption is shown by the VoISRAM and KFSS methods.
There is an approximately 11% to 15% reduction in energy usage for the proposed method compared with the VoISRAM and KFSS methods. Table 6 gives a detailed comparison of our proposed work with other similar existing models.

Comparison of performance measures for validating the proposed method against other models for dataset II. (a) Mean square error vs pretentious node percent (b) End-to-end delay with respect to pretentious node percent (c) Throughput vs number of transmissions (d) Packet delivery ratio vs percentage of pretentious nodes (e) Network energy consumption vs percentage of pretentious nodes.
Comparative study of similar state-of-the-art techniques and proposed model
The run-time performance of the proposed algorithm, which combines optimized k-medoids clustering, data aggregation using the prediction-based approach, and rank-based sensor selection using mutual information, is evaluated by asymptotic analysis of the individual techniques. The generalized computational complexity of k-medoids clustering is O(n²), and that of the salp swarm optimization algorithm is O(IT*n*d). The input parameters governing algorithmic execution are the number of iterations (IT), the input weight vector (Y), and the prediction data sequence (ai). The data aggregation model based on predictive analysis of sensor measurements uses the combined properties of the grey model (GM(1,1)) and the Kalman filter estimator. The grey model contributes O(n), as deduced from Equations (13). The Kalman filter estimator, which uses Equations (18), has a computational complexity of O(2n³). The rank-based mutual information has a computational complexity of O(n). Since the sensors in a cluster have to be ranked simultaneously in each run, the complexity becomes O(n log n) as the number of clusters increases. Thus, for the proposed method, the training complexity reaches O(n³), which is reduced to O(n) for the testing phase.
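The overall training bound can be read as the sum of the individual terms, with the cubic Kalman filter term dominating. The worked expression below is a sketch of that reasoning; the side condition on IT and d is an assumption introduced here and is not stated in the analysis above.

```latex
\begin{align*}
T_{\text{train}}(n)
  &= \underbrace{O(n^2)}_{\text{k-medoids}}
   + \underbrace{O(IT \cdot n \cdot d)}_{\text{salp swarm}}
   + \underbrace{O(n)}_{\text{GM(1,1)}}
   + \underbrace{O(2n^3)}_{\text{Kalman filter}}
   + \underbrace{O(n \log n)}_{\text{MI ranking}} \\
  &= O(n^3) \qquad \text{assuming } IT \cdot d = O(n^2).
\end{align*}
```

The constant factor in O(2n³) is dropped in the final bound, since constant multipliers do not affect asymptotic order.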
Conclusion
To address the challenges raised in event monitoring and detection applications of IoT networks, a sensor selection system plays a significant role. For this purpose, a dual synchronized sensor selection model based on prediction-based data aggregation is proposed in this paper. The proposed method works for each cluster, and every aggregator node contributes to estimating the predicted value for each sensor node's measurements. The prediction-based sensor selection method gives the set of all pretentious nodes, i.e., nodes that show high prediction error, which means that there are discontinuities in the measurements of that particular sensor. The proposed method promises high accuracy, evaluated as 97.8%, reduces network delay by 40% to 50%, and reduces the total energy consumption of the network by approximately 10% to 30% when compared with its immediate rival methods. Future work on this approach can consider the application of multi-objective optimization-based routing for event reporting and some alternative techniques that require less computational complexity.
