Abstract
As a result of advancements in Industry 4.0, the amount of data collected within industries is continuously expanding in order to create an innovative environment that maximizes asset usage. Meanwhile, the redundancy rate in cloud storage is increasing, which affects data storage and analysis. To lower the redundancy rate, the proposed system includes a time series-based deduplication technique in which the Adaptive Multi-Pattern Boyer Moore Horspool (AM-BMH) algorithm and a Merkle tree are used to produce time-series data. Another significant challenge is that data placement in a geographically distributed cloud system incurs high transportation costs for data transmission. To overcome this issue, an optimal data placement strategy using the Modified Distribution method is proposed. The proposed Time Series-based Deduplication and Optimal Data Placement Strategy (TDOPS) is found to be effective when compared with existing systems. Parameters such as space reduction, efficient retrieval, data transportation cost, and data transmission time in the cloud environment are taken into account for the evaluation. The proposed scheme saves 98% of storage space, reduces computation overhead by 55%, and improves cloud storage efficacy by 60%.
Introduction
Industrial IoT (IIoT) is a framework that outlines the physical components, operational groups, procedures, network alignment, and data layouts to be used. The Internet of Things (IoT) framework is a collection of components that include receptors, regulations, initiators, virtual assistance, and zones in nature [1]. Industry 4.0 refers to the fourth industrial revolution, which characterizes the growing trend toward automation and data exchange in manufacturing technology, including IoT, IIoT, Cyber-physical systems (CPS), smart factories, cloud computing, and cognitive computing. The IIoT market is predicted to grow at a Compound Annual Growth Rate (CAGR) of 16.7% between 2022 and 2027, and is expected to be worth $263.4 billion by 2027 [2]. IIoT has developed into a significant research area because problems associated with IoT data remain to be resolved, especially in cloud storage environments. Numerous conflicts arise because IoT-based data accumulation in cloud storage involves non-centralized execution with the combined administration of frameworks. Many issues also arise from the processing of huge volumes of unstructured data at various stages such as data depiction, data accumulation, and analysis [3, 4]. IIoT applications such as robotics, supply chain management, and monitoring, as well as sectors such as automotive, manufacturing, retail, and aerospace, are presently developing IIoT with sophisticated characteristics. The IoT sensors embedded in industrial environments generate monitored data sequentially when triggered by an external event. The data produced by the many sensor nodes in an industry must then be grouped, accumulated, examined, and visualized to attain the IoT sensor network's model [5–7]. In manufacturing industries, a massive amount of redundant information is delivered to the information appraisal zone. Allocating ever more space for this redundant data is not a sufficient solution.
It inflates the required repository capacity and reduces the efficiency of the manufacturing system. A data deduplication scheme is therefore needed, in which unused or identical information is detected and removed. The majority of current deduplication solutions [8–20, 22] focus on storage capacity reduction and security. Furthermore, intelligent decision-making models aid in the administration of items like assistants by using a pattern-based decision-making system that is identified by providing and incorporating the whole firm data [5]. Industries use time-series data to forecast the outcomes of various scenarios in the environment [25]. The proposed TDOPS system was developed in response to the need for time-series data in an industrial context. Time series-based deduplication is performed as part of the TDOPS system. In existing systems, redundant data is simply removed during the deduplication procedure so that a single instance is retained [8–20]. In TDOPS, the occurrence count of each observed value is recorded before the duplicates are deleted, so that time-series data can still be obtained. Storing the raw data locally on IIoT equipment is not advisable given the equipment's limited power, and such equipment also has strict restrictions on storage space. Hence, the observed industrial data is stored in a cloud for high scalability and flexibility [7]. The cost of data transmission becomes prohibitive when storing large amounts of IIoT data in geographically distributed cloud data centers [26–32]. As a consequence, an optimal data placement strategy is required to reduce data transportation costs while simultaneously shortening data transmission time.
The contributions of the research work are as follows: 1) A time series-based deduplication technique for IIoT data in the cloud environment is proposed to obtain time-series data. The proposed deduplication technique uses an Adaptive Multi-pattern Boyer Moore Horspool algorithm to find the identical data that exist in the IIoT data. In addition, the deduplicated data is indexed using a Merkle tree for efficient data retrieval. 2) To reduce the data transportation cost during data transmission in the cloud environment, an optimal data placement strategy is proposed. A linear programming model is employed to derive data placement schemes that respect the capacity limits and load balance of the cloud data centers. 3) The optimal data placement strategy is based on the modified distribution method, which effectively addresses capacity constraints, cloud data center load balance, and data transportation costs.
The paper is compiled in the following manner:
Section 1 introduces the current state of IIoT. Section 2 outlines the related research work carried out on IIoT, existing deduplication schemes, time series analysis, and existing data placement schemes. Section 3 presents the system model and architecture. Section 4 provides a security analysis of the proposed system. Section 5 describes the experimental setup and performance evaluation of the proposed system. Finally, Section 6 concludes the manuscript.
Related work
Reducing the storage space allocated for industrial data is one of the main challenges of this era. If intelligent actions could be performed concurrently with the elimination of redundant data, it would be even more advantageous to the industrial environment. Motivated by this, this research work proposes a time series-based deduplication technique that produces time-series data for the benefit of the industrial environment. This section discusses the evolution of IIoT from the Wireless Sensor Network (WSN) and the importance of the deduplication scheme, and also summarizes the various approaches involved in deduplication systems, time series analysis, and data placement strategies. The IoT has had a great impact on the modern world due to the large contribution of WSN. A WSN is an ad hoc wireless network with minimal infrastructure that monitors environmental factors using a large number of wireless sensor nodes. The introduction of Cyber-Physical Systems (CPS), IoT, cloud, and Artificial Intelligence has aided the progress of Industry 4.0. Furthermore, many industrial applications require a decision-making model [6, 7]. The number of Internet users has increased considerably in recent years as a result of the availability of the Internet all over the globe. Many individuals access the Internet through various programs, and the use of social media has become a part of everyday life. Furthermore, as a result of the COVID epidemic, many companies and educational institutions rely on technology to accomplish their work online. The massive rise in the usage of mobile and online applications results in an exponential increase of data around the world. As a result, storage capacity has become the most critical factor for storing data [8]. Providing storage optimization techniques has become a vital requirement for large storage systems such as cloud storage.
Deduplication is a storage optimization technique that avoids accumulating identical replicas of data [9]. The main task to be performed in deduplication is the partitioning of data [11, 18, 22]. Various security measures are considered when deduplication is carried out on protected data [10, 19]. Adversaries may try to steal data while it is decrypted for the deduplication process, a threat that is addressed in different ways by secure storage schemes [16, 17]. The secure storage mechanism for IIoT (SDSSIIoT) and deduplication carried out with 2FBO2 for IIoT data (FaCIIoT) are presented in [20, 21].
Many existing deduplication works consider memory and security as the prime factors, as summarized in Table 1. With these approaches, however, time series analysis cannot be achieved concurrently. Time series analysis and prediction have been widely accepted in a range of sectors in recent decades [25]. To address these challenges, the proposed system introduces time series-based deduplication for IIoT using the AM-BMH algorithm and a Merkle tree. The problem of matching tree patterns on ordered trees is a significant problem in computer science. Various linearizations can be used to represent ordered trees as strings with additional features [24]. As information technology advances and network bandwidth expands, more applications require more computer processing and data storage. At the same time, the amount of data generated by users and sensors grows at an exponential rate. The majority of data is stored in data centers around the world [28, 29]. Transferring large amounts of data between cloud data centers could be prohibitively expensive. Since the processing power of each cloud data center in a cloud platform varies, the computing capacity of each cloud data center varies as well [26, 27]. Data location, transmission time, cost, and bandwidth have all been considered in the majority of data placement studies. These technologies aim to cut down on user access time and, to a degree, data transmission expenses. Remote data services increase the cost of data access while also improving the responsiveness of data flow applications that analyze this type of data [30–32].
Summary of related work
This section describes the overall architecture of the proposed work as well as the implementation of the time series-based deduplication technique and the optimal data placement strategy. The proposed TDOPS architecture, shown in Fig. 1, provides a framework for IIoT data deduplication as well as an optimal path for data placement in a geographically distributed cloud system. IIoT data, the proxy layer, and cloud storage are the three components of the proposed architecture. The IIoT data is routed through a proxy server to cloud storage. To create time-series data, the IIoT data is deduplicated based on time series. The proposed system also includes an optimal data placement strategy for storing the data in cloud data centers at low transportation cost and with minimal transmission time. The specific implementation of the time series-based deduplication and the optimal data placement strategy for IIoT data in the cloud environment is as follows.

Architecture of TDOPS.
Time series analysis (TSA) is a technique for analyzing a set of data points accumulated over time. Rather than capturing data points infrequently or arbitrarily, time series analysis records data points at regular intervals over a predetermined period [25]. In time series analysis, the mean, moving average, root-mean-squared error, and many other parameters are estimated from the occurrence of data points over time. To execute time series-based deduplication, data from industries must first be partitioned according to a time interval. Deduplication is then performed on each partition, with the occurrence of each value counted at each time interval.
Data chunking
The gathered IIoT data must be separated into time period-based data chunks to meet the main goal of time series-based deduplication. IoT data gathered from industries contains noisy and unnecessary data that must be handled before data chunking in order to obtain reliable results. The proposed system uses a chi-square test to remove unnecessary and noisy data during the pre-processing phase. In IIoT data, the chi-square test ($\chi^2$) analyses the difference between the observed and expected values based on the null hypothesis. The expected values ($E_i$) are determined after obtaining the observed sensor values ($o_i$). The null hypothesis is a default hypothesis in which the quantity is measured as zero or null. The expected value for the chi-square calculation is determined from Max($o_i$) and Min($o_i$), which are found in the observed data. With large IIoT data, this chi-square test produces good results.
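For reference, the chi-square statistic used in such a test has the standard textbook form below. It is given here as a reminder rather than as the exact expression used by the proposed system; in particular, the way $E_i$ is derived from Max($o_i$) and Min($o_i$) is not specified in the text and is left open.

```latex
\chi^2 = \sum_{i=1}^{n} \frac{(o_i - E_i)^2}{E_i}
```

A large contribution $(o_i - E_i)^2 / E_i$ from a single reading indicates that the reading deviates strongly from what the null hypothesis predicts, which is the usual basis for flagging it as noise.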
The preprocessed data is divided into many chunks, each of which can be used to perform time series-based deduplication. The data is first divided into several chunks, known as super-blocks, based on the time interval, as depicted in Fig. 2. Each super-block is made up of several sub-blocks. The proposed system divides the super-block into multiple sub-blocks by assigning boundary values. A unique id is issued to both the super-block and the sub-block so that the information can be traced efficiently. The presence of redundant values in the data set causes the size of the sub-blocks to fluctuate. When the proportion of redundant values in the collected data is high, the values in the sub-blocks are specified with transitive closure; i.e., the size of a sub-block grows or shrinks with the presence of duplicate values in the IIoT data.
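The two-level chunking described above can be sketched as follows. This is a minimal illustration, assuming integer timestamps, a fixed `interval` per super-block, and a fixed `sub_block_size`; the paper derives sub-block boundaries from boundary values and lets their size vary with redundancy, so the fixed size is a simplification. The function name, parameters, and id format are hypothetical.

```python
from collections import defaultdict


def chunk_by_time(readings, interval, sub_block_size):
    """Group (timestamp, value) readings into super-blocks, then sub-blocks.

    Assumption: a super-block covers `interval` time units and each
    sub-block holds at most `sub_block_size` readings.
    """
    super_blocks = defaultdict(list)
    for timestamp, value in readings:
        # The super-block id is the index of the time window.
        super_blocks[timestamp // interval].append((timestamp, value))

    chunks = {}
    for sb_id in sorted(super_blocks):
        rows = sorted(super_blocks[sb_id])
        # Each sub-block gets a unique id of the form "<super>.<sub>".
        chunks[sb_id] = {
            f"{sb_id}.{k}": rows[start:start + sub_block_size]
            for k, start in enumerate(range(0, len(rows), sub_block_size))
        }
    return chunks
```

Giving every super-block and sub-block its own id mirrors the traceability requirement stated in the text.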

Workflow of Time series-based deduplication.
Deduplication is a storage optimization technique that removes multiple copies of data that are no longer needed. With this method, storage utilization can be improved [11]. Most existing deduplication solutions focus on reducing the amount of memory consumed by redundant data, and a few authors have studied how to achieve security during deduplication. Because the majority of IoT data is utilized for time series forecasting and prediction modeling, current technology requires a computational repository to execute the time series analysis and decision-making activities while data is stored in the cloud. This paper proposes a time series-based deduplication for IIoT data in the cloud environment to address this issue. The AM-BMH algorithm is used in the proposed system to provide time series-based deduplication. The AM-BMH method employs both good and bad character shift tables, along with a Multi-pattern table. To achieve the main goal of time series-based deduplication for IIoT data, a Multi-pattern table and a similarity index table are added to the Boyer Moore Horspool (BMH) approach. The AM-BMH technique finds the identical data that exists in the dataset with the help of the Multi-pattern table.
The multi-pattern feature of the AM-BMH algorithm assists in the identification of all related strings in the collected data. AM-BMH looks for matching values in the collected data. If two values are identical, the duplicate value is not kept; instead, the occurrence count, which is maintained in the similarity index table, is updated. The similarity index table assigned to each sub-block records the number of instances of each observed value in the collected data. The similarity index values of all the sub-blocks are then aggregated in the super-blocks. Finally, the time series data is obtained as a result of time period-based chunking and the AM-BMH algorithm. The system does not support audio and video data.
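The idea of replacing duplicates with an occurrence count can be sketched with the plain Boyer-Moore-Horspool search, applied once per pattern. This is not the paper's AM-BMH: the adaptive shift tables and the Multi-pattern table are not reproduced, and the names `horspool_count` and `similarity_index` are hypothetical. The sketch assumes a sub-block is a Python list of hashable sensor values.

```python
def horspool_count(text, pattern):
    """Count (possibly overlapping) occurrences of `pattern` in `text`.

    Works on any pair of sequences of the same type (two strings, two lists).
    """
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return 0
    # Bad-character table: distance from the last occurrence of each symbol
    # (excluding the final position of the pattern) to the end of the pattern.
    shift = {sym: m - 1 - i for i, sym in enumerate(pattern[:-1])}
    count, pos = 0, 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            count += 1
        # Shift by the table entry of the symbol aligned with the pattern end.
        pos += shift.get(text[pos + m - 1], m)
    return count


def similarity_index(sub_block):
    """Keep one instance of each value and record how often it occurred."""
    distinct = list(dict.fromkeys(sub_block))  # first-seen order, no duplicates
    return {value: horspool_count(sub_block, [value]) for value in distinct}
```

For single-value patterns the search degenerates to a linear scan; the shift table only pays off for longer patterns, such as a recurring run of readings.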
Indexing deduplication data with Merkle tree
Because the time series data obtained after time series-based deduplication is not in sequence order, it must be reorganized using an indexing technique so that it can be retrieved effectively. For indexing and quick retrieval, the proposed system employs the Merkle tree. The Merkle tree uses dynamic indexing on multiple levels to store a single instance of all data points at each period. It generates a hash value for each piece of data in a leaf node and saves it in the internal (non-leaf) node above it. In addition to the hash value of the data, the proposed system retains in the internal node the number of occurrences of each observed value acquired via time series-based deduplication. The Merkle tree of the proposed system thus preserves unique values together with a count of how many times the data hits the tree structure at a leaf node, resulting in effective indexing. A Merkle hash tree is a tree-based data structure containing hash values. It also simplifies the hash inventory. It is a non-linear data structure in which every leaf node holds data and every non-leaf node contains the hashes of its children. The subtrees of a non-leaf node are denoted ls and rs, where ls is the left subtree containing the left child nodes and rs is the right subtree containing the right child nodes. The non-leaf nodes in the tree are allocated to store the hash value of each data item [20]. The leaf nodes of both the left and right subtrees range from 1 to n-1. For instance, the origin of the tree, i.e., the Merkle root, consists of leaves, and its key varies from 0 to n-1. The mathematical representation of the Merkle tree is given in Eq. (1) as follows:
The bottom-most layer of the Merkle tree contains the leaf nodes with data. The data collected from the industrial environment is accumulated in the leaf nodes of the Merkle tree. When a leaf node is identified, a hash value is generated and stored in the internal node, which acts as a layer above the leaf node. SHA-3 512 is used for hash value generation. The leaf node containing data is secured with 512 addressing modes during the hash value generation. Every observed value is assigned a 512-bit hash value, and the recurrence of data is verified with the hash value. Hashing then proceeds level by level up the tree until the origin node, called the ‘Merkle root’, is reached. Because the hashes are linked in the form of a tree, the root covers the hashes of all the data found in the nodes. It yields a single hash value that can be used to verify all the values existing in the tree.
The Merkle tree keeps only one instance of each value, and the incremented occurrence count is stored in the internal node along with the hash value of the leaf node. After a value is placed in a leaf node, the path and the number of instances of each observed value must be updated on a regular basis, as detailed in Algorithm 1. The proposed system thus consistently maintains the path from the leaf nodes to the root node along with the occurrence factor. Since the construction of the Merkle tree begins from the bottom-most layer of leaf nodes, the number of duplicates is also recorded, alongside the hash value of the data, as a separate field in the internal nodes. When a lookup starts from the root node, the number of occurrences of the data over time is returned together with the route to the data node position. Because redundant data hits the tree many times, the paths established to find the nodes are accessed frequently, which helps in improving the search mechanism.
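The indexing scheme can be illustrated with a minimal sketch that hashes each unique value with SHA3-512 and keeps an occurrence count next to the leaf hash. It does not reproduce Algorithm 1: the sorted leaf order, the duplication of the last hash on odd levels, and the choice to keep counts outside the hashed payload are assumptions made for illustration, and `build_merkle` is a hypothetical name.

```python
import hashlib
from collections import Counter


def sha3(data: bytes) -> bytes:
    """512-bit SHA-3 digest, as named in the text."""
    return hashlib.sha3_512(data).digest()


def build_merkle(values):
    """Return (root_hash, leaves); leaves maps value -> (leaf_hash, count)."""
    counts = Counter(values)   # occurrence count per observed value
    unique = sorted(counts)    # assumption: leaves are kept in sorted order
    leaves = {v: (sha3(repr(v).encode()), counts[v]) for v in unique}

    level = [leaves[v][0] for v in unique]
    if not level:
        return sha3(b""), leaves
    # Hash pairs of nodes level by level until one hash, the root, remains.
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # assumption: duplicate the last hash
        level = [sha3(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0], leaves


root, leaves = build_merkle([21.5, 21.5, 22.0, 21.5])
print(leaves[21.5][1], root.hex()[:16])
```

Because the root depends on every leaf hash, changing any stored value changes the root, which is the property the tree relies on for verification.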
In this part, an optimal data placement strategy based on the modified distribution approach is proposed, taking into account the capacity constraints, maintenance of load balance, and optimal transportation cost in the geographically distributed cloud.
Optimal data placement strategy in the cloud environment
The optimal data placement strategy is proposed by considering the capacity constraints, the load balance, and the data transportation cost in the cloud environment. Furthermore, the suggested technique employs a linear programming approach to offer an appropriate data placement model for IIoT data in a cloud environment, lowering data transportation costs and decreasing data transmission time for data placement operations in data centers. The cloud storage contains distributed cloud data centers, and each center contains several servers placed in various locations. As shown in Fig. 3, the optimal data placement process is carried out in the geographically distributed cloud environment. The proposed data placement strategy based on the modified distribution method works as follows: 1) Data is transmitted from IIoT data providers to the storage controller, where it is stored in the best possible condition. 2) The cloud storage analyzer balances the data providers’ demand with the cloud data center’s capacity. 3) The nearest cloud data center is chosen for the initial deployment of data depending on the IIoT data provider location. According to the size of the data to be placed in the cloud data centers, the storage controller estimates the least transportation cost for optimal data placement. The data placement strategy with the shortest data transportation time is generated based on the capacity constraint and load balance. 4) Finally, the data placement operation is carried out by the storage controller in the proposed scheme. The major goal of the proposed strategy is to keep the distributed cloud system load-balanced, reduce data transportation costs, and minimize the data transmission time during the data placement process in the cloud.

Data placement strategy in a geographically distributed cloud environment.
IoT data supplied by industries is massive, necessitating a large quantity of storage space. The data is transported to the cloud since the cloud environment provides such space and remote access. As the amount of data to be transferred from industry to cloud grows, the cost of transportation also increases. To solve this issue, the proposed system considers capacity constraints and load balancing difficulties in the data center to provide an effective data placement strategy with minimal data transportation costs. The linear programming model provides a feasible solution for an optimal data placement strategy [26].
Through the adoption of a linear programming model, the data can be transported from a set of data providers (i.e., industries) to a set of cloud data centers subject to the supply and demand of the cloud data centers and data providers, respectively, such that the total cost of transportation is minimized. The linear programming model aims to optimize performance (i.e., cost) while adhering to a set of resource constraints (i.e., cloud data center capacity and load balancing) set by the cloud service providers. Once the initial basic solution is established, the modified distribution approach is employed to determine the optimal transportation cost. This method enables the quick calculation of improvement indices for each idle space in data centers without having to trace all the closed paths. As a result, when compared to conventional approaches, it can typically save a significant amount of time. The notations used for determining the optimal transportation cost are shown in Table 2. The capacities of the cloud data centers are denoted $c_1, c_2, \ldots, c_k$. The servers of each cloud data center are recorded as $s = \{s_1, s_2, \ldots, s_k\}$, and the data set to be placed is $d = \{d_1, d_2, \ldots, d_k\}$. Each data center that exists in the cloud is configured with its remaining capacity (i.e., availability of storage space). The capacity of the data centers is then compared with the data providers’ demand. Using Vogel’s approximation, the initial feasible solution is obtained.
List of notations
With the initial feasible solution, optimal data transportation cost is determined to attain the goal of an optimal data placement strategy. This leads to storing a huge amount of IIoT data with minimum transportation cost in the cloud environment that is provided in Eq. (2). The linear programming model for minimizing data transportation cost is presented as follows:
Where:
It represents the availability of space in the cloud data centers to hold the data provided by industries. The objective function minimizes the total cost of data transportation between the various IIoT data providers and cloud data centers. The constraint in Eq. (3) ensures that the total amount of data transported from each data provider is less than or equal to its supply. The constraint in Eq. (4) ensures that the total amount of data transported to each destination is greater than or equal to its demand. Initially, the resource availability should be checked to decide whether the problem is balanced or unbalanced. If the total supply at the data providers is equal to the total demand at the cloud data centers, the problem is considered a balanced transportation problem, which is represented as Eq. (5).
If the sum of the supplies of all the data providers is not equal to the sum of the demands of all the cloud data center capacity, then the problem is termed as an unbalanced problem. If the availability of storage space in cloud data centers does not meet the industry demand, it will result in an unbalanced transportation problem. In such cases, the availability of storage space should be increased to provide resources according to the demands of the industry. The unbalanced problem can be represented as Eq. (6).
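For orientation, the textbook form of the transportation problem that matches the descriptions of the objective, the two constraints, and the balanced and unbalanced cases is given below. It is a reconstruction under assumptions, not a copy of the paper's Eqs. (2)–(6): $x_{ij}$ (amount of data sent from provider $i$ to data center $j$) and $c_{ij}$ (unit transportation cost) are assumed symbols, while $a_i$, $b_j$, $m$, and $n$ follow the usage in the surrounding text.

```latex
\begin{aligned}
\min\; Z &= \sum_{i=1}^{m} \sum_{j=1}^{n} c_{ij}\, x_{ij}
  && \text{(total transportation cost)} \\
\text{subject to}\quad
\sum_{j=1}^{n} x_{ij} &\le a_i, \quad i = 1, \ldots, m
  && \text{(supply of provider } i\text{)} \\
\sum_{i=1}^{m} x_{ij} &\ge b_j, \quad j = 1, \ldots, n
  && \text{(demand at data center } j\text{)} \\
x_{ij} &\ge 0 \\[4pt]
\text{balanced:}\quad \sum_{i=1}^{m} a_i &= \sum_{j=1}^{n} b_j,
\qquad
\text{unbalanced:}\quad \sum_{i=1}^{m} a_i \ne \sum_{j=1}^{n} b_j
\end{aligned}
```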
As the cloud environment provides abundant storage space, the proposed system assumes that the storage space available for data placement in the cloud data center storage nodes is always greater than or equal to the demand of the customer or industry. With Vogel’s approximation method, the industry demand is identified and a storage node with excess or equal space is allocated. Several iterations are carried out in the same way to fulfill all the demands of the industry by providing appropriate storage space with minimum transportation costs. Finally, with the modified distribution method, the optimal data placement strategy is established based on the capacity limitations and the load balance in the cloud data centers.
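The iterative allocation can be sketched with the standard Vogel's approximation method for a balanced problem. This is an illustrative sketch, not the paper's implementation: the cost matrix, supplies, and demands in the usage lines are a generic example rather than data from the experiments, the tie-breaking rules are assumptions, and a lone remaining cost is treated as its own penalty (one common convention).

```python
def vogel(costs, supply, demand):
    """Initial feasible allocation for a balanced transportation problem."""
    supply, demand = list(supply), list(demand)
    rows, cols = list(range(len(supply))), list(range(len(demand)))
    alloc = [[0] * len(demand) for _ in supply]

    def penalty(values):
        # Difference between the two smallest costs in a row or column.
        s = sorted(values)
        return s[1] - s[0] if len(s) > 1 else s[0]

    while rows and cols:
        row_pen = {i: penalty([costs[i][j] for j in cols]) for i in rows}
        col_pen = {j: penalty([costs[i][j] for i in rows]) for j in cols}
        i_best = max(row_pen, key=row_pen.get)
        j_best = max(col_pen, key=col_pen.get)
        # Pick the line with the largest penalty, then its cheapest cell.
        if row_pen[i_best] >= col_pen[j_best]:
            i = i_best
            j = min(cols, key=lambda c: costs[i][c])
        else:
            j = j_best
            i = min(rows, key=lambda r: costs[r][j])
        qty = min(supply[i], demand[j])
        alloc[i][j] = qty
        supply[i] -= qty
        demand[j] -= qty
        # Drop exhausted providers and satisfied data centers.
        if supply[i] == 0:
            rows.remove(i)
        if demand[j] == 0:
            cols.remove(j)
    return alloc


costs = [[19, 30, 50, 10], [70, 30, 40, 60], [40, 8, 70, 20]]
plan = vogel(costs, supply=[7, 9, 18], demand=[5, 8, 7, 14])
total = sum(costs[i][j] * plan[i][j] for i in range(3) for j in range(4))
print(plan, total)
```

The result is only a starting point: it is feasible and usually close to optimal, and the modified distribution method is then used to test and improve it.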
Through this linear programming method, the minimal cost for data transportation from data providers in the industrial environment to the cloud data center is calculated.
If a balanced state exists between the available storage space and the industrial demand, the penalty is computed as diff(min($a_i$), next_min($a_i$)) according to Eq. (7). Similarly, the transportation cost for each cloud data center is evaluated. Finally, the solution is checked for degeneracy. If the number of allocations in the final solution is less than $m + n - 1$, the solution is degenerate, and a small quantity $\epsilon$ ($\approx 0$) is assigned to a suitable independent position so that the modified distribution method can be applied. Otherwise, the non-degenerate feasible solution is refined directly using the modified distribution method. Finally, $d_{ij}$ is evaluated for each examined cell, which contains neither $\sum a_i$ nor $\sum b_j$.
From the above estimation as defined by Eq. (8), if $d_{ij} > 0$, then according to the theorem of complementary slackness it can be shown that the corresponding solution to the transportation problem is optimal and unique. If $d_{ij} = 0$, the transportation cost is optimal, and an alternative optimal solution exists. If $d_{ij} < 0$, the solution is not optimal, and hence the calculation process is repeated until the optimal state is attained. With the above calculation, the proposed system establishes the optimal data placement strategy for IIoT data in the cloud environment.
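The optimality test can be sketched with the textbook modified distribution (MODI) computation, in which potentials $u_i$ and $v_j$ are solved from the occupied cells ($c_{ij} = u_i + v_j$) and the index of each empty cell is $d_{ij} = c_{ij} - (u_i + v_j)$. That this is the form denoted by Eq. (8) is an assumption. The sketch also assumes a non-degenerate plan with $m + n - 1$ occupied cells, and `modi_indices` is a hypothetical name.

```python
def modi_indices(costs, alloc):
    """Improvement index d_ij for every empty cell of a transportation plan."""
    m, n = len(costs), len(costs[0])
    basic = [(i, j) for i in range(m) for j in range(n) if alloc[i][j] > 0]

    # Fix u_0 = 0 and propagate u_i + v_j = c_ij over the occupied cells.
    u, v = {0: 0}, {}
    changed = True
    while changed:
        changed = False
        for i, j in basic:
            if i in u and j not in v:
                v[j] = costs[i][j] - u[i]
                changed = True
            elif j in v and i not in u:
                u[i] = costs[i][j] - v[j]
                changed = True
    if len(u) < m or len(v) < n:
        raise ValueError("degenerate plan: add an epsilon allocation first")

    return {
        (i, j): costs[i][j] - (u[i] + v[j])
        for i in range(m) for j in range(n) if alloc[i][j] == 0
    }


costs = [[19, 30, 50, 10], [70, 30, 40, 60], [40, 8, 70, 20]]
plan = [[5, 0, 0, 2], [0, 0, 7, 2], [0, 8, 0, 10]]
indices = modi_indices(costs, plan)
print(min(indices.items(), key=lambda kv: kv[1]))
```

A negative index marks a cell where shifting data would lower the total cost, so the plan above is not yet optimal and another iteration is required.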
Each node in the cloud environment is given a unique identity and a shared secret key based on the AES algorithm. To secure data transmission between nodes, a shared session key is generated. When IIoT data is outsourced to the cloud for the deduplication process, it is encrypted with the AES algorithm to prevent data leakage. The node that receives the data packages decrypts each package before processing it. The cloud accumulates a huge volume of data produced in the industrial environment. This data is stored in the form of ciphertext, which cannot be deciphered without the data users’ secret inputs [21]. Additionally, the Merkle tree contains hashes of hashes for the data stored in the leaf nodes, which protects the data against corruption. The real data can be recovered at the end only if the hash chain is followed without a break; a failure indicates that the data being processed has been tampered with.
Experimental setup
This section describes the experimental environment and the test cases. The experimental results of the proposed algorithms are then analyzed.
Experimental environment
Experiments were carried out on an open-source Eucalyptus 4.3.1v, Walrus (W) storage controller, and Hadoop distributed file system (HDFS) to evaluate the proposed system’s performance. Java (JDK1.8.0 02, jre1.8.0 291), Apache Netbeans 12.1, Apache Spark 2.3.4, and Python 3.8 were used to run the experiments. The cluster’s experimental setup consists of a virtual machine cluster built on top of the local server and the remote server. The remote cluster is made up of three physical servers and 18 virtual machines. Each server has the following configuration: CPU Intel Xeon E5 2620 v2 workstation, 2.1 GHz processing speed, 64GB RAM, and 2TB of storage space. The virtual machine cluster created on the local server consists of ten virtual computers, whereas the local cluster consists of three physical machines. Finally, both clusters are deployed in different locations. The geographically distributed cloud architecture is comprised of two infrastructure providers, AWS and Google Cloud, each with four data centers. The RAM capacity offered by AWS and Google Cloud in this experiment is 5GB, and the bandwidth is 25Mbps.
Test cases
The IIoT data for both the data placement algorithm and the deduplication technique was acquired from RWTH Aachen University’s Aachen/Cologne smart factory [26] and Kaggle [27]. The smart energy data obtained from the repository includes timestamps and sensor values for the energy consumption of various processes in the industrial environment. Each dataset contains roughly 132,000 monitored sensor values. An IIoT dataset containing 10,000 data blocks was selected for the experiments. Each data set includes the block number, block size, and data type.
Performance evaluation
The proposed system is compared with the FaCIIoT [20] and SDSSIIoT [21] to evaluate the performance of time series-based deduplication and data placement based on the Modified Distribution algorithm. The parameters considered for comparison are data transmission time and data transmission cost of the geographically distributed cloud system, space reduction ratio, efficient data retrieval, network lifetime, computation time, and the average latency of the deduplication technique.
Space reduction due to deduplication
The space reduction is determined by the deduplication ratio (DR), which is calculated by subtracting the number of sensor values (sv) remaining after the deduplication process from the total number of sensor values considered for deduplication, and dividing the result by that total.
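The ratio defined above can be written as a one-line function. The sketch below follows the verbal definition directly; the function name is hypothetical and the numbers in the usage line are illustrative, not measured results.

```python
def deduplication_ratio(total_values, values_after_dedup):
    """DR = (total - remaining after deduplication) / total."""
    if total_values <= 0:
        raise ValueError("total_values must be positive")
    return (total_values - values_after_dedup) / total_values


# Illustrative only: 130,000 readings of which 13,000 remain after deduplication.
print(f"{deduplication_ratio(130_000, 13_000):.0%}")
```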
The deduplication process carried out with the partition of the super-block and sub-block is depicted in Table 3. One of the primary goals of the suggested system is to improve the efficiency of cloud storage via deduplication.
Performance of proposed work
To measure the performance of the proposed system, a total of 10 data sets were considered, and each node observes approximately 130,000 sensor values. The sensor values are partitioned into blocks based on the period and boundary value, and the time series-based deduplication is then carried out using AM-BMH. In each data set, between 90 and 120 super-blocks were formed based on the period, and each super-block holds approximately 14 sub-blocks. A series of experiments showed that 90% of the sensor data in all blocks were equivalent within each block. Furthermore, deduplication carried out at one super-block with the proposed scheme has a large impact on the others, because a single instance is maintained together with a count of its occurrences at each super-block. The results of the deduplication system and the space reduction percentage, presented in Figs. 4 and 5, respectively, show that 10% of the data in each data set is unique. The remaining values are identified as redundant data, and their occurrence count is recorded to provide the time series data.

Performance of proposed deduplication scheme.

Space reduction percentage.
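The idea of keeping a single instance of each value together with its occurrence count per super-block can be sketched as follows. This is a simplified illustration under stated assumptions: readings are hypothetical `(timestamp, value)` pairs, super-blocks are formed by integer division of the timestamp by the period, and the boundary value, the sub-block level, and the AM-BMH matching step are omitted.

```python
from collections import Counter
from typing import Dict, List, Tuple

Reading = Tuple[int, float]  # (timestamp in seconds, sensor value); assumed layout


def partition_by_period(readings: List[Reading], period: int) -> Dict[int, List[float]]:
    """Group readings into super-blocks keyed by timestamp // period."""
    blocks: Dict[int, List[float]] = {}
    for timestamp, value in readings:
        blocks.setdefault(timestamp // period, []).append(value)
    return blocks


def deduplicate(blocks: Dict[int, List[float]]) -> Dict[int, Counter]:
    """Keep one instance of each value per super-block plus its occurrence count."""
    return {block_id: Counter(values) for block_id, values in blocks.items()}


# Hypothetical readings: six values spread over two 60-second periods.
readings = [(0, 21.5), (10, 21.5), (20, 21.5), (30, 22.0), (60, 21.5), (70, 21.5)]
store = deduplicate(partition_by_period(readings, period=60))
stored = sum(len(counts) for counts in store.values())
print(stored, "unique values stored for", len(readings), "readings")
```

Because the counts are retained, the original number of occurrences of each value within a period can still be recovered from the deduplicated store.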
Data retrieval is the process of obtaining the non-redundant data that results from the deduplication technique. The proposed work provides a simple approach to deduplicating redundant IIoT data, which can then be retrieved from the cloud server in minimal time. The average data retrieval time on the cloud server is compared against FaCIIoT [20] and SDSSIIoT [21]. FaCIIoT requires more scanning time to obtain the results of a query, and its retrieval time increases linearly with the number of feature vectors. SDSSIIoT gave good retrieval results, but it returned many related data items along with the requested data. In the proposed system, the Merkle Hash Tree gave the best retrieval performance and returned exact data, because it maintains a single instance for each data set. The proposed scheme maintains a unique indexing structure with the additional feature of counting the occurrences of each value. Because redundant data repeatedly attempts to acquire space in cloud storage, the paths established between the root and the data recur. Thus, the search proportion gradually decreases, which yields the good results shown in Fig. 6.

Efficient data retrieval.
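To make the role of the Merkle Hash Tree in the retrieval discussion more concrete, the sketch below builds a Merkle root over a list of data blocks. It is a generic construction, not the authors' exact structure: the use of SHA-256 and the rule of duplicating the last node on odd-sized levels are assumptions.

```python
import hashlib
from typing import List


def sha256(data: bytes) -> bytes:
    """Return the raw SHA-256 digest of data."""
    return hashlib.sha256(data).digest()


def merkle_root(blocks: List[bytes]) -> bytes:
    """Compute the Merkle root of a non-empty list of data blocks."""
    if not blocks:
        raise ValueError("at least one block is required")
    level = [sha256(block) for block in blocks]  # leaf hashes
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumption: duplicate the last node on odd levels
        # Hash each adjacent pair to form the parent level.
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


# Identical blocks hash to identical leaves, so a duplicate can be detected
# by comparing digests instead of comparing the block contents.
print(sha256(b"21.5,21.5,22.0") == sha256(b"21.5,21.5,22.0"))  # True
print(merkle_root([b"block-1", b"block-2", b"block-3"]).hex())
```

Since any change to a block changes its leaf hash and therefore the root, the same structure supports both duplicate detection and integrity checks on retrieved data.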
Network Lifetime is the time at which the first node in the cloud environment fails to transmit data packets. Improving the network lifetime is a demanding task. Here, the network lifetime of the proposed system is compared against FaCIIoT [20] and SDSSIIoT [21]. It is evident from the plots that the proposed system attains a higher network lifetime than the previous works. The average network lifetime obtained in the preceding research was lower than the 60 s achieved by the proposed system. In SDSSIIoT, the network lifetime was poor because the authors focused only on the secure storage scheme for the IIoT data.
Although many deduplication systems exist for IoT data, deduplication based on time series analysis is a novel approach. FaCIIoT achieved a good network lifetime, but it failed to sustain it for longer durations. Attaining the maximum network lifetime requires balancing the traffic overhead between the data provider and cloud storage. The proposed system maintains a good network lifetime for a longer duration, as shown in Fig. 7. This is due to the high availability of storage space at the cloud server and to the reduction in network traffic overhead achieved by discarding redundant data before it reaches storage. The cloud continued to receive data from the proxy servers because resources remained available.

Network Lifetime.
Computation time is the time taken to complete the time series-based deduplication technique. One of the difficult challenges addressed by the proposed work is performing the time series analysis simultaneously with the deduplication process in cloud storage. The performance of the proposed work is compared with the existing FaCIIoT [20] and SDSSIIoT [21] schemes. Only ten data set readings were taken into consideration for evaluating this parameter. The average computation time taken by the proposed system to execute the deduplication is 47 minutes, whereas SDSSIIoT takes 88 minutes and FaCIIoT takes 107 minutes, as plotted in Fig. 8. The proposed system therefore needs about 53% of the computation time of SDSSIIoT and about 44% of that of FaCIIoT, since it adopts the less time-consuming AM-BMH algorithm.

Computation time.
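The computation time discussion attributes the speed-up to AM-BMH. The adaptive multi-pattern variant is not specified in this section, so the sketch below shows only the classical single-pattern Boyer-Moore-Horspool search on which it builds; the function name `horspool_search` and the sample inputs are assumptions for illustration.

```python
from typing import Dict, Hashable, List, Sequence


def horspool_search(text: Sequence[Hashable], pattern: Sequence[Hashable]) -> List[int]:
    """Return every start index at which pattern occurs in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Bad-character table: for each symbol in pattern[:-1], the distance from
    # its last occurrence to the end of the pattern. Unknown symbols shift by m.
    shift: Dict[Hashable, int] = {symbol: m - 1 - i for i, symbol in enumerate(pattern[:-1])}
    matches: List[int] = []
    pos = 0
    while pos <= n - m:
        k = m - 1
        while k >= 0 and text[pos + k] == pattern[k]:  # compare right to left
            k -= 1
        if k < 0:
            matches.append(pos)
        # Shift according to the text symbol aligned with the pattern's last position.
        pos += shift.get(text[pos + m - 1], m)
    return matches


print(horspool_search("abracadabra", "abra"))       # [0, 7]
print(horspool_search([1, 2, 1, 2, 1], [1, 2, 1]))  # [0, 2]
```

The search skips up to the full pattern length on a mismatch, which is the property that makes Horspool-style matching attractive for scanning long runs of sensor values for repeated sub-sequences.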
Average Latency is the delay between the request made by the IIoT data provider and the response received from the cloud storage. The proposed work employs AM-BMH and a Merkle tree for time series analysis-based deduplication to reduce redundant data, which leads to low latency. The latency is the time needed to respond to a user request. The mean latency is derived from the sum of the durations needed to fulfill every request made by the user. It is bounded by a minimum and a maximum duration: the minimum is 0, whereas the maximum is the time needed to process an individual user request. The average latency of the proposed system is compared against the previous works FaCIIoT [20] and SDSSIIoT [21] to determine its effectiveness. As presented in Fig. 9, the mean latency of the proposed system for 5000 sensor values is 1.46 ms, which is lower than that of FaCIIoT and SDSSIIoT.

Average Latency.
Data Transmission Time is defined as the time needed to transmit data from the IIoT providers to the cloud data center. Because the proposed TDOPS system employs modified distribution, it can take the capacity of each cloud data center and the load balance into account, which yields an optimal data placement strategy with a lower transmission time than FaCIIoT [20] and SDSSIIoT [21]. The average time taken to transmit the data to the cloud is 1.79 min, which is less than the time obtained by FaCIIoT and SDSSIIoT. As depicted in Fig. 10, the data transmission time is proportional to the data size: when the data size increases, the transmission time also increases.

Data transmission time.

Data transmission cost.
Data Transmission Cost is defined as the cost involved in transmitting data from one location to another, and it varies with the channel. The modified distribution method employed in the proposed system determines the optimal path for transmitting the data from one end to the other by distributing the load equally across the cloud data centers. The proposed TDOPS system is 48% and 23% more efficient than FaCIIoT [20] and SDSSIIoT [21], respectively.
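The optimality test at the core of the modified distribution (MODI) method can be sketched as follows. This is only the textbook test step for a transportation problem, not the authors' full placement algorithm: it assumes a non-degenerate basic feasible allocation is already given, and the cost matrices, allocations, and function names are hypothetical. Rows stand for IIoT data providers and columns for cloud data centers.

```python
from typing import Dict, List, Optional, Set, Tuple

Cell = Tuple[int, int]  # (provider index, data center index)


def total_cost(costs: List[List[float]], allocation: Dict[Cell, float]) -> float:
    """Sum of unit cost times amount shipped over all allocated cells."""
    return sum(costs[i][j] * amount for (i, j), amount in allocation.items())


def modi_opportunity_costs(costs: List[List[float]], basic: Set[Cell]) -> Dict[Cell, float]:
    """Solve u[i] + v[j] = c[i][j] on basic cells, then return
    d[i][j] = c[i][j] - u[i] - v[j] for every non-basic cell."""
    rows, cols = len(costs), len(costs[0])
    u: List[Optional[float]] = [None] * rows
    v: List[Optional[float]] = [None] * cols
    u[0] = 0.0  # one potential is fixed arbitrarily
    changed = True
    while changed:  # propagate potentials through the basic cells
        changed = False
        for i, j in basic:
            if u[i] is not None and v[j] is None:
                v[j] = costs[i][j] - u[i]
                changed = True
            elif u[i] is None and v[j] is not None:
                u[i] = costs[i][j] - v[j]
                changed = True
    if any(p is None for p in u + v):
        raise ValueError("basic cells do not connect all rows and columns")
    return {(i, j): costs[i][j] - u[i] - v[j]
            for i in range(rows) for j in range(cols) if (i, j) not in basic}


# Hypothetical 2 x 2 example: the allocation is optimal when no
# opportunity cost is negative.
costs = [[4, 6], [5, 3]]
allocation = {(0, 0): 25, (0, 1): 5, (1, 1): 20}
d = modi_opportunity_costs(costs, set(allocation))
print(total_cost(costs, allocation), d, all(x >= 0 for x in d.values()))
```

When some opportunity cost is negative, the full method reallocates load along a closed loop through that cell and repeats the test; that iteration is left out of this sketch.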
Conclusion and future works
The amount of data generated by users is currently growing at an exponential rate, and the majority of this data is housed in data centers spread throughout the globe. Transferring significant volumes of data between geographically distributed data centers can be extremely expensive. Hence, the proposed system provides an optimal data placement strategy based on a modified distribution method. Another problem is the large amount of redundant data in the repository server, which needs to be avoided. To address the problems that remain in various existing deduplication schemes, the proposed system provides time series-based deduplication using the AM-BMH algorithm and a Merkle tree. To compute the time-series data, the proposed system performs a data chunking process based on the period and boundary value before the deduplication process. The experimental results show that the proposed system outperforms the previous works in terms of data transmission time and cost, space reduction percentage, search time, network lifetime, computation time, and average latency. There are still many challenges in the implementation of the proposed system. Open research on security needs to be carried out to improve the framework’s efficiency. In the future, we plan to concentrate on other applications, such as IoT healthcare services in edge environments.
