Efficient big data security analysis on HDFS based on combination of clustering and data perturbation algorithm using health care database

Abstract

In this manuscript proposes an efficient big data security analysis on HDFS based on the combination of Improved Deep Fuzzy K-means Clustering (IDFKM) Algorithm and Modified 3D rotation data perturbation algorithm using health care database. To compile a similar group of data, an Improved Deep Fuzzy K-means Clustering (IDFKM) Algorithm is used as partitioning the medical data. After clustering, Modified 3D rotation data perturbation technique is used to satisfy the privacy requirement of the client. Modified 3D rotation Data Perturbation technique perturbs each and every sensitive data of the cluster and all the key parameters values used for clustering have warehoused in the database file sector. The proposed approach is executed by Java program, its efficiency is assessed by Health care database. The metrics under the study of memory usage attains higher accuracy 34.765%, 23.44%, 52.74%, 18.74%, lower execution time 35.23%, 23.76%, 27.86%, 27.76%, higher Efficiency 26.85%, 38.97%, 28.97%, 35.65%. then the proposed method is compared with the existing methods such asSecurity Analysis of SDN Applications for Big Data with spoofing identity, Tampering with data, Repudiation threats, Information disclosure, Denial of service and Elevation of privileges (STRIDE), Big Data Analysis-based Secure Cluster Management for using Ant Colony Optimization (ACA) Optimized Control Plane in Software-Defined Networks, System Architecture for Secure Authentication and Data Sharing in Cloud Enabled Big Data Environment using LemperlZivMarkow Algorithm (LZMA) and Density-based Clustering of Applications with Noise (DBSCAN), Big Data Based Security Analytics using data based security analytics (BDSA) approach for Protecting Virtualized Infrastructures in Cloud Computing respectively.

Keywords

Hadoop distributed file system (HDFS)cloud storage hadoop data security Improved Deep Fuzzy K-means Clustering (IDFKM) Algorithm modified 3D rotation data perturbation algorithm

1 Introduction

“Big Data” implies volume, velocity and type of data, which is complicated to examine the use of conventional typical data processing sites [1]. Now, the sources of data production are rapidly enhanced, viz streaming machines, maximal throughput instruments, sensor networks, telescopes, also creates huge amount of data [2]. Big data plays a significant part in variouscontexts, likeindustry, natural resource organization, healthcare, scientific research, business management, social network, public administration. Moreover, it is characterized by 10V’s Volume: Big data is implicated by big volume. In recent years, the sources of data production are maximized, which contributes to the data diversity, viz audio, video, text, large-scale imageries. Regular data processing methods need to be improved to process large amounts of data [3]. Velocity: The incoming data has been raised. The term velocity signifiesspeed of data production. The social media data explosion isvaried, also caused types of data. At present, people does not focused old post (tweet, status updates etc.), but noticing the latesthot updates [4]. Variety: raised the collection of data types. For eg, certain organizations utilize the given kind of data formats: database, excel, CSV, which is saved in plain text file. Notwithstanding, the data may not be in the expected format and it may cause difficulties in processing. To deal this problem, the organization needs to identify a data storage scheme that can examine various data [5]. Value: It does not be useful to have continual data until it is converted as value. It is important to understand that Big Data does not always mean value [6].

Visualization is a noteworthy key to create big data an integral portion of decision making. Big Data impact in healthcare. Currently, the system of health care is quickly adopting medical data that enlarge the health records size are accessed electronically [7, 8]. A recent study illustrates six utility cases of big data to lessen the patient cost, triage, readmissions. In other studies, big data utilization cases in healthcare is separated as count of categories, viz medical decision support, administration with delivery, user behavior, maintains services.

•Problem Statement/ Research gap

To process the data, the existing method contains Hadoop map reduce structure. The map-reduce structure consumes more time to process the datadue to the rapid improvement of data [9]. The Hadoop Distributed File System (HDFS) utilizes map reduce to analyze the data. After every operation, it takes a backup of all data in a physical server [10]. This is executed because the data is stored on RAM is volatile likened to the data stored on the physical server. This reduces the scalability of the system and takes more time to share and secure data in the Hadoop file, which reduces the accuracy of system and the privacy of the user.

•Motivation

To overcome the above issues this work is motivated. This method will secure the privacy of the user by securely share the medical health care records in the Hadoop system using clustering and data perturbation method with less computation time and reduces the complexity of privacy protection with better accuracy.

•Novelty

The novelty of the proposed Improved Deep Fuzzy K-means Clustering (IDFKM) Algorithm and Modified 3D rotation data perturbation algorithm is used to store and secure the HDFS from big dataset.

This manuscript proposes an efficient big data security analysis on HDFS based on combination of Improved Deep Fuzzy K-means Clustering (IDFKM) Algorithm [11 –13] algorithm and modified 3D rotation data perturbation algorithm [14 –16] using health care database. To compile a similar group of data, an Improved Deep Fuzzy K-means Clustering (IDFKM) Algorithm is used based on its similarity among various clustering methods and the Partition algorithm finds the clusters altogether as an initial partition of data. After clustering, Modified 3D rotation data perturbation technique is used to satisfy the privacy requirement of the client. Modified 3D rotation Data Perturbation technique perturbs each and every sensitive data of the cluster and all key parameters values used for clustering have warehoused in the database file sector.

•Contribution

This manuscript proposes an efficient big data security analysis on HDFS depending on the combination of Improved Deep Fuzzy K-means Clustering (IDFKM) and Modified Three dimensional rotation transformation (M3DRT) algorithm using health care database.

To compile a similar group of data, a clustering algorithm is used based on its similarity among the various clustering methods and Partition algorithm finds the clusters altogether as an initial partition of data [17 –19].

After clustering, by Modified Three dimensional rotation transformation (M3DRT) algorithm technique, to satisfy the privacy requirement of the client.

Modified Three dimensional rotation transformation (M3DRT) algorithm technique perturbs each and every sensitive data of the cluster and all the key parameters values used for clustering have warehoused in the database file sector.

The proposed approach is executed with the help of Java program, its efficiency is assessed by Health care database.

The metrics under the study of memory usage attains precision, recall, accuracy, F-measure, aggregating time, efficiency, execution time. The proposed method is compared with existing methods such as STRIDE [20], SDN-ACA [21], LZMA- DBSCAN [23] and BDSA [25] for securing and storing the data in efficient manner.

The remaining section of this manuscript is designed as follows. Section 2 delineates the related work. Section 3 describes about the proposed data privacy method. Section 4 demonstrates the experimental outcomes with discussion. Finally, section 5 concludes the manuscript.

2 Related works

Several investigation works were already presented in the literature depending on big data security using clustering methods.Certain works are reviewed here,

Ahmad et al., [20] have presented the analysis of security features with STRIDE threat modeling technique. Software Defined Networking (SDN) maximizes the Hadoop presentation by enhancing the usage of bandwidth along network management. Safety attacks in the controller of SDN with switches can co-operate the entire Hadoop system causing loss or manipulation of valuable data. The3 advanced methods are selected, which consider accelerating the data transmit amid the cluster nodes. The aim wasexamine the safety features of SDN utilizations. Every methods requiredenhancementfor gain secure. While other methods were protected by the use of additional security measures, Pythia find the most secure model. Here using the evaluation metrics are Accuracy, Recall, F-measure. The advantages of this model was less expensive, to improve the robustness of its security infrastructure and the limitation was reduces scalability.

Wu et al., [21] have presented Big Data Analysis-based Secure Cluster Management for Optimized Control Plane in Software-Defined Networks. A secure authorization mode was presented for cluster management. Also,the ant colony optimization approach was presented that performs big data analysis withactivation scheme that enhances the control plane. Simulations prove thesuperiority of presentedmethod. The presented method wasenhancing the security. Here using the evaluation metrics are energy consumption, delay, traffic rate. This approach was used to improve the security and efficiency SDN control plane and the limitation was restrained processing power, long control latency, lesser control plane reliability.

Shamsi and Khojaye [22] have introduced a protection in the context of big data, systems, estimating the strengths and weaknesses of its protection strategies. Here, the data-anonymization technique was used. Big data processes were involved to solve computational issues for business intelligence with forecast testing.The advantage of this method was to protect confidential information in big data systems. The disadvantage was lack of computational complexity. The performance metrics was Accuracy, Sensitivity and Specificity.

Narayanan et al., [23] have suggested a big picture of dealing the major complexity of Big data security over Cloud. Where, the new system architecture named Secure Authentication including Data Sharing in Cloud (SADS-Cloud) enabled Big Data Environment. The suggested method involves 3 models (i). Big Data Outsourcing, (ii). Big Data Sharing (iii). Big Data Management. At 1^st model, data owners were recorded to Trusted Center utilizing SHA-3 Hashing approach. At 2^nd models, data users contribute the secured file recovery. At 3^rd model, 3 processes were activated for organizing Big data as follows: Compression, Clustering, Indexing utilizing LemperlZivMarkow approach (LZMA), Density-based Clustering of Applications with Noise (DBSCAN), Fractal Index Tree was utilized to index the files at Cloud database. The suggested method was tested for the given metrics: Information Loss, compression ratio, throughput, encryption with decryption time, information loss, compression ratio, and efficiency. The advantages of this approach were in secure file retrieval and reduce the total energy consumption. The drawback of this approach was data loss.

Bin [24] have presented the “Internet of Things” and “Big Data” turned into a field of firmly related innovation use. The step by step process of adequately discover significant model connections big data at Internet of Things was helpful for directors settling on right choices on the organization’s future advancement patterns. Conventional K-means calculation streamlined create its appropriate requirements of Big Data RFID Internet of Things information. The K-means examination was executed in light of the Hadoop cloud clustering stage. The advantage of this method was improved clustering efficiency and the disadvantage of this method was long control latency and low control plane reliability.

Win et al., [25] have presented a big data based security analytics (BDSA) for identifying thesuperior attacks in virtualized infrastructures. From the guest virtual machines, network logs together with user application logs were saved in HDFS. Extraction of attack properties was enabled with the help of graph-based event correlation along Map Reduce parser depend potential attack paths recognizing. The purpose of attack was acts via 2 stage machine learning called logistic regression for computing conditional probabilities of attack depends on attributes, also belief propagation was employed to scale the belief in existence of attack depending on them. The tests were carried out to assess the presented method through feasible malware and compare with other methods for virtualized infrastructure. The outcomes demonstrate that the presented method was effective to detect attacks with minimal performance overhead. The detection time was the main performance metric in this approach. The main advantages of this model were detecting both botnets and VM malware drawback was occasional latency increase owing to SSH server reset through guest VMs.

Xiong et al., [26] have introduced an online system for active semi-supervised spectral clustering, which choose pair-wise constraints as clustering proceeds depending upon the uncertainty lessening policy. By first-order Taylor expansion, the anticipated uncertainty lessening issue was decayed as gradient including step-scale, determined through utilization of matrix perturbation theory including cluster-assignment entropy. Here, human client pair wise queries were presented only the better candidate sample. Moreover, three distinctive pictures of datasets, lot of regular UCI AI datasets and quality dataset were considered. The advantage of this method was minimizing the uncertainty of the clustering problem and the disadvantage of this method was high computational complexity and limited scalability.

Wang and Singh [27] have suggested the theoretical characteristics of general subspace clustering algorithm called sparse subspace clustering (SSC). The suggested model considered the common fully deterministic mode, where basic subspaces and data points within every subspace were deterministically situated the broad range of dimensionality lessening systems that drop in subspace installing structure. The different private SSC approach was used and set up both security and utility assures the suggested strategy. The advantage of this approach was efficient computation and compressed measurement. The limitation of this approach was strong convexity of dual problem. Here, the performance metrics were Accuracy, Sensitivity and Specificity.

Table 1 shows the Comparison for existing methods. Here the existing methods are compared with the Author Name/Year, Method, Advantages, Disadvantages, and Performance Metrics.

Table 1
Comparison for existing methods

Author Name/Year Method Advantages Disadvantages Performance Metrics

Ahmad /2018 STRIDE threat modeling technique Less expensive, To improve the robustness of its security infrastructure. Reduces scalability. Accuracy, Recall, F-measure,

Jun Wu/2018 Software-defined networks (SDN), Ant Colony Algorithm (ACA) Improved the security and efficiency SDN control plane. Restrained processing power, long control latency,lesser control plane reliability. Energy consumption, delay, traffic rate.

Jawwad A. Shamsi/2018 Data-Anonymization Technique To protect confidential information in big data systems. Lack of computational complexity Accuracy, sensitivity and specificity

Uma Narayanan/2020 Lempel ZivMarkow Algorithm (LZMA) and DBSCAN algorithm In secure file retrieval, to reduce the total energy consumption data loss Information Loss, Compression Ratio, Throughput, Efficiency

Ni Bin/2018 K-means algorithm Improved clustering efficiency Longcontrol latency and low control plane reliability. Data size

Thu Yein Win/2017 BDSA Detecting both botnets and in VM malware. Occasional latency increase owing to SSH server reset throughguest VMs. Detection time

Caiming Xiong/2017 Active Clustering with Model-Based Uncertainty Reduction minimizing the uncertainty of the clustering problem high computational complexity, limited scalability. Jaccard coefficients, V-Measure

Yining Wang/2019 sparse subspace clustering (SSC), Efficient computation, Compressed measurement strong convexity of the dual problem, Accuracy, sensitivity and specificity

Author Name/Year	Method	Advantages	Disadvantages	Performance Metrics
Ahmad /2018	STRIDE threat modeling technique	Less expensive, To improve the robustness of its security infrastructure.	Reduces scalability.	Accuracy, Recall, F-measure,
Jun Wu/2018	Software-defined networks (SDN), Ant Colony Algorithm (ACA)	Improved the security and efficiency SDN control plane.	Restrained processing power, long control latency,lesser control plane reliability.	Energy consumption, delay, traffic rate.
Jawwad A. Shamsi/2018	Data-Anonymization Technique	To protect confidential information in big data systems.	Lack of computational complexity	Accuracy, sensitivity and specificity
Uma Narayanan/2020	Lempel ZivMarkow Algorithm (LZMA) and DBSCAN algorithm	In secure file retrieval, to reduce the total energy consumption	data loss	Information Loss, Compression Ratio, Throughput, Efficiency
Ni Bin/2018	K-means algorithm	Improved clustering efficiency	Longcontrol latency and low control plane reliability.	Data size
Thu Yein Win/2017	BDSA	Detecting both botnets and in VM malware.	Occasional latency increase owing to SSH server reset throughguest VMs.	Detection time
Caiming Xiong/2017	Active Clustering with Model-Based Uncertainty Reduction	minimizing the uncertainty of the clustering problem	high computational complexity, limited scalability.	Jaccard coefficients, V-Measure
Yining Wang/2019	sparse subspace clustering (SSC),	Efficient computation, Compressed measurement	strong convexity of the dual problem,	Accuracy, sensitivity and specificity

3 Proposed method

3.1 Secured information transmission by the proposed approach

Figure 1 portrays the proposed framework. The medicinal services and enormous informational index in the HDFS framework is represented in Fig. 2. The Hadoop Distributed File System (HDFS) denotes crucial information framework of stockpiling employed to many applications. Blocking a lot of information in the business to actualize an appropriated record framework, it provides elite access with information transverse throughout intensely adaptable clusters Hadoops.

Fig. 1

Block diagram for HDFS big data system.

Fig. 2

HDFS design with IDFKM.

In HDFS big data system, split at peripheral aggregates and distributes them different blocks of cluster, due to this enormously permitted expert by parallel organization. Here, the time aggregation method is established by advanced clustering algorithm IDFKM applicable with subdivided blocks. The HDFS is distributed similar in fashion. The single name of the node takes a way in the data is stored in the servers of the cluster known as Data Nodes. The data is placed in data blocks of data nodes is stored in the files of HDFS. These files are separated as one or more segments and then stored in the single data nodes. These segmented files are called blocks. The block is defined as the little amount of data in HDFS that can be read/write. The size of the default block is 64 MB, it is increased as per the need to change the configuration of HDFS.

The privacy of the grouped information has been achieved through M3DRT is employed for clustered information, and creates complexity to be perceived by unauthorized clients. The perturbed information is warehoused in the perturbed database. Here, each datum contains a tremendous proportion of data, along these lines, the calculation and the correspondence overhead has occurred. The database contains diverse sensitive scraps of information. It joins emergency information, drugs, immunization, hypersensitivities and other prosperity information. At any time an approved client/service provider sends a request for the data about a particular patient, while the social insurance application requests the perturbed database for the specific subtleties

Therefore, the high dimensional dataset is isolated into numerous gatherings depending on its identicalness by using the grouping system. The application can get the recovered data from the database gives the expected data to the client. The proposed framework is exceptionally intended to be remarkably deficiency tolerant. The database recreates or duplicates as every bit information of numerous methods as well as conveys duplicates with single data, setting as not less than unique duplicate at alternative server rack to another. Therefore, information about DB is blocked to get elsewhere within a cluster. It guarantees organization can continue though information is recovered. The Partitioned K-means clustering count provides the ideal result over other existing strategies.

3.1.1 Improved Deep Fuzzy K-Means clustering

In this Improved Deep Fuzzy K-Means clustering is used to tackle the problems of the deep learning. several neural network based clustering algorithms have been developed to tackle the data with more complex structures. Auto-encoder is a symmetric neural network that reconstructs data points via multiple layers. Stacked autoencoder (SAE) and structured auto-encoder (StructAE) incorporate graph-based clustering to achieve clear deep representations. However, both of them perform deep feature extraction and clustering in separate phases with high dependence on pre-trained similarity matrix. Accordingly, they become laborious as graph-based clustering models on big datasets. Deep embedding clustering (DEC) is a well-known deep model which performs clustering and network training simultaneously. However, it performs clustering through SGD which results in slow convergence. Here, the Fuzzy clustering algorithm is broadly employed to practical applications. Improved deep K-Means fuzzy variations are the big proportion of common approach for FCM. Figure 2 shows the HDFS Design with IDFKMClustering, which is deployedto train the Neural Network,hence the auto-encoder will map the data to extract the deep feature space. Besides, IDFKMperforms deep features extraction as well as clustering simultaneously. (1) The Adaptive loss functions has beenemployed to upgrade the insensitivity of proposed method for outliers and adaptive weights. A proficientapproach is generated and guaranteesto incorporate a local minimum for dealing the issue. (2) encoder (3) decoder. Then the proposed model does not require anything like graph-based information and the affinity with the centroid matrix is updated via closer format solution than the SGD. Hence, it worksproficiently in big data. The primary goal of an algorithm is grouping data’s correspondence based on its aspects, and then it can be aggregated. The improved k-means original variation is given in Equation (1):

${∥ N ∥}_{2, 1} = \sum_{i = 1}^{d} \sqrt{\sum_{j = 1}^{M} n_{ij}^{2}} = \sum_{i = 1}^{d} {∥ n^{i} ∥}_{2}$ (1) where ∥nⁱ ∥ ₂ denotes the l₂-norm of vector nⁱ

All of uppercase italic boldface letters represent matrices, lowercase italic boldface letters represent vectors.

The uppercase curlicue letters signifies functions, italic letters signifies scalar values. N^t indicates transpose of N . nⁱ indicates i^th row of matrix N, n_j specifies j^th column of N, n_ij implies entry of i^th row with j^th column of N. |N| implicates the absolute value of matrix N, ∥N ∥ _f as Frobenius norm of N . 1 = [1, 1, . . . . . , 1] ^t ∈ ℑ ^d×M, the l_2,1-norm is defined as the Equation (1), ℑ is represented as the data batch.

These proposed algorithms are used in the fuzzification layer with the purpose of big data security.

Several auto-encoder variations are deemed to acquire optimum deep representation of raw data. For eg., stacked auto-encoder (SAE) utilizes the data points likeness as the auto-encoder input, also trained them layer-wise. After training, few clustering model, viz K-Means is carried out to get the final partition. Notwithstanding, the proficiency of SAE isdepending upon the similarity matrix. To tackle the problem of securing and storing the medical data in Hadoop file are used in IDFKM algorithm. Then the loss functions are explained in below steps 1. Adaptive Loss function, 2. Cost Function of IDFKM. The Fuzzy c means algorithm with the clustering function is given as the entropy regularization.

a) Adaptive loss functions δ -norm for finding the problem in securing the big data in Hadoop

In terms of metrics, the l₂- norm is sensitive to huge data outliers along strength to the smaller loss, when l₁-norm is sensitive to the smaller loss along strength to the huge one.This is strong for outliers in spite of smaller or larger losses to create the robust loss. The δ- norm is expressed in Equation (2)

${∥ N ∥}_{δ} = \sum_{i} \frac{(1 + δ) {∥ n^{i} ∥}_{2}^{2}}{{∥ n^{i} ∥}_{2} + δ}$ (2) where, δ have great knowledge, the illustration of robust loss function along various values of δ is depicted in Fig. 2. Here, δ signifies tradeoff parameter that regulates effectiveness of the big data with the security.

The robust loss function containsgivenfeatures:

∥N ∥ _δ signifies non negative with convex that is appropriateto loss function

∥N ∥ _δ has double differentiate, also simply to optimization

When ${∥ n^{i} ∥}_{2} < < δ, {∥ N ∥}_{δ} \to \frac{1 + δ}{δ} {∥ N ∥}_{f}^{2}$

When ∥nⁱ ∥ ₂ >> δ, ∥ N ∥ _δ → (1 + δ) ∥ N ∥ _2,1

When δ → 0, ∥ N ∥ _δ → ∥ N ∥ _2,1

When $δ \to \infty, {∥ N ∥}_{δ} \to {∥ N ∥}_{f}^{2}$

At sum, adaptive loss operation interpolates amid the l₁- and l₂-norm through modifying δ parameter. For solving problem (1), a common robust loss function is suggested as the Equation (3)

$min_{X} f (X) + \sum_{i} \frac{(1 + δ) {∥ g_{i} (X) ∥}_{2}^{2}}{{∥ g_{i} (X) ∥}_{2} + δ}$ (3) where g_i (X) specifies vector output, 2^nd term of problem (3) is the expansion of proposed loss function in problem (2). Here, f (X) specifies smooth function. By utilizing iterative re-weighted model, the problem (3) is handled. Considering the derivative of problem (3) depends on X and sitting it zero. Then the big data problem is solved in the next stage.

b) Cost Function of IDFKM for reducing the data dimensionality error

IDFKM uses a neural network that contains (N + 1) layers to map raw data for non-linear feature space, here N represents even number. First $\frac{N}{2}$ hidden layers aid the encoders that learn a group of compact depiction, and used to the reduction of dimensionality of raw data, and the last $\frac{N}{2}$ hidden layers activate as decoder which reconstruct the input data.

From Fig. 2, suppose G⁽⁰⁾ = X_in = [x₁, x₂, . . . . x_N] ∈ ℑ ^D×N implicates input matrix of 1^st layer along N samples, then every data point g⁽⁰⁾ implicates column of the matrix G⁽⁰⁾ including dimension D. Then the output of encoder is given in Equation (4):

$\begin{matrix} G^{(n)} = encoder output (E^{(n) g_{j}^{(n - 1)}} + y^{(n)}) \\ \in ℑ^{d_{n}} \end{matrix}$ (4) where, $n = 1, 2, . . . \frac{N}{2}$ represented layers of the encoder, E⁽ⁿ⁾ ∈ ℑ ^d_n×d_n-1 implicates weight matrix, y⁽ⁿ⁾ ∈ ℑ ^{d
_n} implicates bias of n^th layer, ℑ^{d
_n} represents G⁽ⁿ⁾ belongs to the d_n dimension feature space, d_e (.) indicates activation function of the layer. Especially, the $\frac{N}{2}$ -th layer $G^{\frac{N}{2}} \in ℑ^{d_{\frac{N}{2}}}$ is shared through the encoder, decoder. The dimensions of layers at the encoder structured as $D ⩾ d_{n - 1} ⩾ d_{n} ⩾ d_{\frac{N}{2}}$ to the intention of dimensionality reduction.With respect to the decoder, the j-th layer output is specified in Equation (5):

$G_{j}^{(n)} = decoder (E^{(n) g_{j}^{(n - 1)}} + y^{(n)}) \in ℑ^{f_{m}}$ (5)

where, $j = \frac{N}{2} + 1, \frac{N}{2} + 2, . . . . . . . N$ M indexes the decoder layers with nonlinear decoder processing function d_d (.) is similar as d_e (.) or various nonlinear mode. At the decoder, the dimensions of the layers structured as $D_{\frac{N}{2}} ⩾ d_{n - 1} ⩾ d_{n} ⩾ d_{M} = D$ to the intention of data reconstruction. Hence, specified sample G⁽⁰⁾ is represented as the input of the first layer 0f IDFKM is given as X_in and it is encoded and the output of the encoder is given as G⁽ⁿ⁾ and it is represented as the X_out and it is reconstruction of the G⁽⁰⁾, then the associated $G^{\frac{N}{2}}$ implies deep representation of X_in. Let data matrix $G^{(0)} = [g_{1}^{(0)}, g_{2}^{(0)}, . . . . ., g_{m}^{(0)}] \in ℑ^{f_{\times M}}$ implies the set of N specified samples, the output matrix of decoder $G^{(N)} = [g_{1}^{N}, g_{2}^{N}, . . . . ., g_{m}^{N}] \in ℑ^{f_{\times M}}$ is the corresponding reconstruction of G⁽⁰⁾ and $G^{(\frac{n}{2})} = [g_{1}^{\frac{n}{2}}, g_{2}^{\frac{n}{2}}, . . . . ., g_{m}^{\frac{n}{2}}] \in ℑ^{f_{\frac{m}{2} \times M}}$ is the low-dimensional deep representation of G⁽⁰⁾.

In case G⁽⁰⁾ = A ∈ ℑ ^f×M as the input of the network, and the other $G_{j}^{(0)}$ denotes the j^th column vector G⁽⁰⁾ is given as $G_{j}^{(0)} = A_{j}$ , then the n^th layer outputspecifies output of decoder, g is represented as the each data point in the column of matrix G⁽⁰⁾ along D dimension.

E⁽ⁿ⁾ and y⁽ⁿ⁾ denotes weight matrix, bias of associated layer respectively. In this the objective of the DFKM algorithm is to reduce the data dimensionality error.

So, IDFKM method is proposed utilizing the objective function of improved fuzzy k-means, RLFwith auto-encoder for securing the big data in Hadoop for minimizing the loss function of IDFKM is expressed in Equations (6)–(9).

$\begin{matrix} E^{(n)}, min_{y^{((n))}}, V, X {∥ G^{(N)} - S ∥}_{D}^{2} \\ + β_{1} \sum_{j = 1}^{M} \sum_{i = 1}^{L} v_{ji} ∥ g_{j}^{(\frac{N}{2})} - x_{i} ∥ \hat{σ} + α v_{ij} log v_{ji} \\ + β_{2} \sum_{n = 1}^{N} {∥ E^{(n)} ∥}_{D}^{2} + {∥ y^{(n)} ∥}_{2}^{2} \\ s . t \sum_{i = 1}^{L} v_{ij} = 1, 0 < v_{ij} < 1 \end{matrix}$ (6) where

${∥ G^{(N)} - S ∥}_{D}^{2} = δ_{1}$ (7)

$β_{1} \sum_{j = 1}^{M} \sum_{i = 1}^{L} v_{ji} ∥ g_{j}^{(\frac{N}{2})} - x_{i} ∥ \hat{σ} + α v_{ij} log v_{ji} = δ_{2}$ (8)

$β_{2} \sum_{n = 1}^{N} {∥ E^{(n)} ∥}_{D}^{2} + {∥ y^{(n)} ∥}_{2}^{2} = δ_{3}$ (9) where β₁ and β₂ are the trade of parameters. X_i ∈ ℑ ^f′ Specifies i - th cluster centroid in lesser dimensional feature space including $f' = f^{(\frac{N}{2})}$ . Then the parameters δ₁, δ₂, δ₃ are used to secure the big data in Hadoop method. δ₁ ensures the minimal reconstruction error. δ₂ implicates cost function of improvedfuzzy k-means with robust loss function. β₁ signifies tradeoff parameter amid the reconstruction and improvedfuzzy k-means. Here, by setting β₁ as higher value then the δ₂ model has inadequate presentation owing to bad reconstruction. δ₃ specifies regularization that is removing the auto-encoders over-fitting along regularization parameter δ₃ prevent the auto-encoder produce a trivial map. In this way, the fuzzy method is secure the big data.

3.1.2 Modified 3D rotation transformation (M3DRT) approach for securing medical data in Hadoop

Data perturbation (DP) is a common data mining strategy for privacy protective. The complexity of DP is how to balance the 2 conflict aspects:(i) privacy secure (ii) data utility. In this mode, the attributes have been separated as 3 groups, and every group of attributes rotates around various axes pair. Here, the rotation angle is chosen, because variance based privacy metric attains higher that creates difficult to reconstruct the original data. The equation of the M3d rotation and data Perturbation for securing medical data along the pair is given in Equations (10)–(12)

$\begin{matrix} T_{AB} & = T_{A} \times T_{B} \\ = (\begin{matrix} cos φ & 0 & - sin φ \\ {sin}^{2} φ & cos φ & sin φ cos φ \\ sin φ cos φ & - sin φ & {cos}^{2} φ \end{matrix}) \end{matrix}$ (10)

$\begin{matrix} T_{BC} & = T_{B} \times T_{C} \\ = (\begin{matrix} {cos}^{2} φ & - sin φ cos φ & - sin φ \\ sin φ & cos φ & 0 \\ sin φ cos φ & - sin φ & cos φ \end{matrix}) \end{matrix}$ (11)

$\begin{matrix} T_{AC} & = T_{A} \times T_{C} \\ = (\begin{matrix} cos φ & - sin φ & 0 \\ sin φ cos φ & {cos}^{2} φ & sin φ \\ {sin}^{2} φ & sin φ cos φ & cos φ \end{matrix}) \end{matrix}$ (12)

Equations 10–12 shows the M3D rotation matrix along pairs between the points remains almost equal.The normalization moderate utilized in Min_Max strategy. Then the meaning of this sentence is Data set is regularized utilizing Min_Max and converting the attribute values as [0, 5] range. Everyacquired triplet has been rotatearound a pair of axis (AB, BZ, XZ) with the aim of increasing the amount of perturbation to elements without affect the spatial distance. The related rotation matrix is computed to every rotation. To various values of Φ, determine the rotated triplet V′ = V × R_q as well as a graph is plottedamid the angle H and difference variation. It provides the range for Φ such that minimal secure threshold need for every attribute is fulfilled by Φ. The variance vs angle graphs for 3triplets rotate with XZ axis and the changes are added in Section 3.1.2 in the revised manuscript and the changes are highlighted in blue color. For the rotation process to secure the medical data, it consists of the 4 steps. Figure 3 shows the flow chart for Modified 3D rotation transformation (M3DRT) approach for securing medical data in Hadoop.The step by step procedure is given below,

Fig. 3

Flow chart for M3DRT.

Step 1: Choose the axes pair to Rotate for determining the medical data pairs in the matrix

Select the axis pair w∈ { AB, BZ, XZ } then determines the rotation matrix for w. The medical data pair equation is given in Equation (13),

$T_{w} = T_{j} \times T_{i} Where j, i \in a, b, c and j \neq i$ (13)

Then the modified three dimensional Rotation Transformation is given as (F, Φ, V), where L is the number of attributes, Φ is represented as the array of security threshold angle, F is represented as the Normalized data matrix.

Step 2: Compile the attributes as triplets for grouping the medical data

Group the attributes into L triplets (X_a, X_b, X_c). Here a ≠ b ≠ c the triplets have been complied serially. After compiling, if remains one attributes, the final attribute on the left is compiled with prior 2 attributes. Whereas, if remains 2 attributes, the final 2 attributes consolidated to the prior attribute. In this way, the medical data is grouped and secured in the Hadoop file.

Here the number of triplets is given as k← ⌈ L/ - 3 ⌉ with every axes pair w∈ { AB, BZ, XZ } and its range is from φ_w.

Step 3: Rotating the triplets around axes pairs for reducing the risk to store and secure the medical data

To different rotation angles, rotating the triplets surrounding axes pairs w at 3D plane. To acquire the perturbed dataset f_w, M3-D Rotation is activated on every triplet C and specified axes pairs w, C′ and C is represented as the variance of the triplets and its equation is given in Equation (14):

$f_{w} : C (X_{a}', X_{b}', X_{c}') = T_{w} \times C (X_{a}, X_{b}, X_{c})$ (14)

The value of rotated datasets implies φ function that represents rotation matrix T_w about w - th axis pair. This step is used to determine the angle of the rotation to avoid the risk for securing the data in Hadoop.

Step 4: Determine the rotation angle Φ to set the threshold range

Rotated dataset consists of given steps:

Step 4.1. To every triplet, original plot variance with perturbed data sets v/s φ graph.

Step 4.2. Get3 equations for every triplet depending upon the constraints: the equations are given in Equation (15)

$\begin{matrix} Variance (X_{a} - {X'}_{a}) ⩾ λ_{1} \\ Variance (X_{b} - {X'}_{b}) ⩾ λ_{2} \\ Variance (X_{c} - {X'}_{c}) ⩾ λ_{3}, \\ Where, λ_{1} > 0, λ_{2} > 0,, λ_{3} > 0, \end{matrix}$ (15) Step 4.3 . Determine the φ range that fulfills the safety threshold of δ₁, δ₂, δ₃. It is called range of security. Then the parameters δ₁, δ₂, δ₃ are used to secure the big data in Hadoop method.

Step 4.4. From the secure range intersection of everytriplets acquired in prior step, select the random actual value β as φ.

Step 5: Perturb the data through rotate at angle Φ for securing medical data by avoiding the complexity of the system

Determine the rotating triplet on angle Φ found in step 4. Then the security range is computed as φ_L (Range), such that variance (X_a - X′_a) ⩾ δ_L1, variance (X_b - X′_b) ⩾ δ_L2, and variance (X_w - X′_w) ⩾ δ_L3, Where δ_L3 ∈ Φ_L, then the range φ_L (Range) is given as φ_w? (Range) ← φ_w (Range) ∩ φ_L (Range), β_w is the rotation angle such that β_w ∈ φ_w. Then the medical data secured equation is given in Equation (16):

$F_{w β} = C' (X_{a}', X_{b}', X_{c}') = T_{w} \times C (X_{a}, X_{b}, X_{c})$ (16) where F_wβ is represented as the data perturbed along w^th axis with angle β.

Step 6: Chosen the data to be released:

To every perturbed data sets F, w∈ { F_AB, F_BZ, F_XZ } measure the variance. The data F_w along greatest value of variance is deemed into final perturbed data F′. Then the final secured data is stored in the Hadoop file and the secured medical data equation is given in Equation (17):

$F' = F_{w β} with maximum C_{w}$ (17)

3.1.2.1 Computational complexity

Let F denotes number of medical data and L denotes number of attributes. The processing time of M3DRT approach is C (F × L).

3.1.2.2 Process of retrieving the data

Figure 3 shows the Flow chart for M3DRT. Data recovery is one of the difficult errands in the data innovation field.. From Fig. 3 the clustered data is given to the M3DRT, then the rotations are selected based on the Data perturbation (DP) process. Then the axes pairs are selected based w∈ { AB, BZ, XZ } then determines the rotation matrix for w. Then the axes pairs rotated based on the Group the attributes into L triplets (X_a, X_b, X_c). Here a ≠ b ≠ c the triplets have been complied serially, then finds the angle, then perturb the data by rotating at an angle and then finally choose the data released. The detailed descriptions of flow chart is given in step by step procedures explained in Section 3.1.2.

In the hash map [28] the clustered original information can be extricated from all the perturbed qualities. Before creating the report of client, the framework checks the entrance control table concerning whether the client approaches the characteristics of their inquiries. Clients can get unique information only for the attributes, generally, just perturbed information is shown up in the report, which is presented in Fig. 4. The Healthcare information proprietor needs to offer authorization for getting the information of every single potential client depending on their prerequisites. Such clients incorporate different medicinal services experts just as various therapeutic insurance agencies or associations. Here, the trait based access control is portrayed. At first, the framework necessitates the portion of the client’s traits get before the social insurance information. The framework assigns rights to that customer based on those properties. On the off chance that client’s ascribes neglects to coordinate predefined values, the client will be denied access to the information. In the meantime, the framework gives consent to such an extent that approved clients can get just certain pieces of the social insurance information: they can’t get to other information zones in the database. To access the database, authorized clients receive a final report, which contains unique information about the attributes they hold. The authorized clients acquire just perturbed information for every single other feature.

Fig. 4

Health care database.

4 Result with discussion

Here, the proposed privacy protection is described and the implementation criterion is used to analyze the algorithm. This method is analyzed by memory usage, accuracy, recall, precision, F-measure, clustering time and time of execution. The JAVA simulations run in PC with Intel Core, 2.50 GHz CPU, 8 GB of RAM, and Hadoop 64Bit 64Bit on Ubuntu 14 64 Bit. Here, HDFSbased on Combined IDFKM-M3DRT technique for privacy preservation is implemented. Table 1 shows an example of health insurance database in hadoop.

4.1 Performance and analysis

The big data security interpretation based on Improved Deep Fuzzy K-means Clustering (IDFKM) Algorithm with Modified Three dimensional rotation transformation (M3DRT) algorithm in Hadoop Distributed File System (HDFS) collected from medical data set for increasing the storing capacity in the process.

4.1.1 Dataset description

Here, the health data is collected from Big Geno-mics Data(BGD) (https://cran.rproject.org/web/packages/BGData/readme/README.html). The user-specific analysis is permitted by dynamic data visualizations, also gain newly experiences, then thedata is generally access via Global Health Data Exchange. This method provides private healthcare big data in Hadoop Distributed File System (HDFS) and is controlled by only by valid user. This method securely keeps the healthcare big data in block chain for attaining accountability, integrity, security. Patients may achieve entire control over the blocks, here their data is saved by using crypto graphic functions. This method achieves following advantages while using HDFS such as low cost, because this software is a open source and stores large amount of data, provides more scalability by adding more number of nodes, so the system grows quickly through less system administrations, This system is highly portable among the Hadoop distributions, that facilitate for reducing vendor specific locking constraints, it has high storage flexibility and stores Unstructured data such as text, images and videos directly into the system, Inherent data protection, self-healing capabilities, the system works on the basis of, when a node does not take action, then the task is redirected to other virtual nodes to ensure the delivered computing is not fail. It signifies hardware independence that is Data with application processing is secured against hardware fault.

The primary focus of M3DRT is to provide security before conveying the data. In the proposed method, the healthcare database contains various types of information about the patient. Figure 4 depicts the healthcare database. Here, Medical history, insurance details, diseases, allergies, medicines, lab tests and lab resultsare considered. With the M3DRT calculation, the alternative required perturbed data is provided. It is additionally conceivable to pick any number of segments for the annoyance procedure. Annoyance procedure can be used for security and sparing can be used for data stream mining.

Table 2 shows the Health insurance details in database. The M3DRT strategy with the extension of Gaussian noise is completely splitting for assessing exceptional motivating force from the perturbed data. The M3DRT is an improvement to turn annoyance by joining additional parts, for example, irregular interpretation perturbation and noise expansion development to the fundamental sort of multiplicative perturbation. The major aim is to change given enlightening accumulation of data into perturbed dataset that satisfies given security essential with least information loss for the proposed data examination task.

Table 2
Health insurance details in database

Market Reform Employer mandate plus Medicaid Employer Choice of Medicare or private Medicare for all Comprehensive public plan

Coverage 20-23 million 6 million Universal Universal Universal

Benefits limited Basic Medicare Medicare Comprehensive

Private Insurance 170 million 200 million 180 million Supplemental None

Cost containment None Managed care All payer Single payer Global budget

Financing Individual Premium Employer premium contribution Employer premium contribution or payroll Payroll tax Income tax Payroll tax Income tax Other

	Market Reform	Employer mandate plus Medicaid	Employer Choice of Medicare or private	Medicare for all	Comprehensive public plan
Coverage	20-23 million	6 million	Universal	Universal	Universal
Benefits	limited	Basic	Medicare	Medicare	Comprehensive
Private Insurance	170 million	200 million	180 million	Supplemental	None
Cost containment	None	Managed care	All payer	Single payer	Global budget
Financing	Individual Premium	Employer premium contribution	Employer premium contribution or payroll	Payroll tax Income tax	Payroll tax Income tax Other

4.1.2 Performance metrics

Precision, Recall, F-Measure, Accuracy, Specificity are utilized for measuring the performance metrics.

Precision

It is defined as the Positive predictive values; it is expressed in Equation (18).

$Precision = \frac{TP}{TP + FP}$ (18) where TP is represented as the true positive, FP is represented as the false positive

F-measure

It determines utilizing the given Equation (19)

$F - measure = 2 \times \frac{recall \times precision}{recall + precision}$ (19)Recall It is defined as the number of records correctly classified to the number of all corrected events. It is given by the Equation (20)

$Re call = \frac{TP}{TP + FN}$ (20)

Accuracy

The accuracy contains the total number of the classification results. The values are determined by the following Equation (21)

$Accuracy = \frac{TP + FN}{FP + FN + TN + TP}$ (21)

Clustering period

Clustering Time is an unsupervised data mining technique for organizing data points into groups based on their similarity.

Security level

Security level is a measure of the strength that a cryptographic primitive.

Efficiency

Efficiency is the ratio of output power to input power, and multiply the result by 100. The values are determined by the following Equation (22)

$Efficiency = \frac{output}{input} \times 100 %$ (22)

4.1.3 Comparison of the performance analysis of various existing methods for securing and storing the data in efficient manner

Here, the performance analysis of proposed IDFKM-M3DRT method is compared with the existing methods such as STRIDE [20], SDN-ACA [21], LZMA- DBSCAN [23] and BDSA [25] for securing and storing the data in efficient manner.

Table 3 shows the security level, attack level and efficiency of the proposed IDFKM-M3DRT and existing methods STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA respectively. The existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 87.4%, 89.6%, 85.2% and 91.6% low security level. But the proposed IDFKM-M3DRT method achieved 97.5% high security level. The existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 10.7%, 12.6%, 15.4% and 11.5% high attack level. But the proposed IDFKM-M3DRT method achieved 5.4% low attack level. The existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 85.2%, 87.6%, 58.1% and 89.2% low efficiency. But the proposed IDFKM-M3DRT method achieved 96.8% high efficiency.

Table 3
Security level, attack level and efficiency of proposed IDFKM-M3DRT and existing methods

Method Security Attack Efficiency

level (%) level (%) (%)

STRIDE 87.4 10.7 85.2

SDN-ACA 89.6 12.6 87.6

LZMA- DBSCAN 85.2 15.4 58.1

BDSA 91.6 11.5 89.2

IDFKM-M3DRT 97.5 5.4 96.8

(proposed)

Method	Security	Attack	Efficiency
STRIDE	87.4	10.7	85.2
SDN-ACA	89.6	12.6	87.6
LZMA- DBSCAN	85.2	15.4	58.1
BDSA	91.6	11.5	89.2
IDFKM-M3DRT	97.5	5.4	96.8
(proposed)

Figure 5 shows the time taken to perturb health data of proposed IDFKM-M3DRT and existing methods STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA respectively. At file size 10 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 4856, 3546, 3659 and 4385 low time taken to perturb health data. But the proposed IDFKM-M3DRT method achieved 5115 high time taken to perturb health data. At file size 20 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 6434, 7622, 6451 and 6878 low time taken to perturb health data. But the proposed IDFKM-M3DRT method achieved 6545 high time taken to perturb health data. At file size 30 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 12475, 14363, 11734 and 17563 low time taken to perturb health data. The proposed IDFKM-M3DRT method achieved 11445 high time taken to perturb health data. At file size 40 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 18364, 20463, 18465 and 19464 low time taken to perturb health data. The proposed IDFKM-M3DRT method achieves 15855 high time taken to perturb health data. At file size 50 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 25373, 25385, 24365 and 21047 low time taken to perturb health data. The proposed IDFKM-M3DRT method achieved 22456 high time taken to perturb health data.

Fig. 5

Time taken to perturb health data proposed IDFKM-M3DRT and existing methods.

Table 4 shows the Time taken to de-perturb health data of proposed IDFKM-M3DRT and existing methods STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA respectively. At file size 10 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 5823, 4895, 5934 and 3745 high time taken to de-perturb health data. But the proposed IDFKM-M3DRT method achieved 5333 low time taken to de-perturb health data. At file size 20 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 7453, 7742, 7324 and 7259 high time taken to de-perturb health data. But the proposed IDFKM-M3DRT method achieved 7124 low time taken to de-perturb health data. At file size 30 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 11484, 13985, 12846 and 11757 high time taken to de-perturb health data. The proposed IDFKM-M3DRT method achieved 10177 low time taken to de-perturb health data. At file size 40 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 18567, 17485, 15485 and 18467 high time taken to de-perturb health data. The proposed IDFKM-M3DRT method achieves 17661 low time taken to de-perturb health data. At file size 50 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 21983, 19465, 18455 and 23893 high time taken to de-perturb health data. The proposed IDFKM-M3DRT method achieved 17340 low time taken to de-perturb health data.

Table 4

Time taken to de-perturb health data proposed IDFKM-3DRT and existing methods

Size of file (mb)	10	20	30	40	50
STRIDE	5823	7453	11484	18567	21983
SDN-ACA	4895	7742	13985	17485	19465
LZMA- DBSCAN	5934	7324	12846	15485	18455
BDSA	3745	7259	11757	18467	23893
IDFKM-M3DRT	5333	7124	10177	17661	17340
(proposed)

Table 5 shows the Memory usage on perturb health data of proposed IDFKM-M3DRT and existing methods STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA respectively. At file size 10 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 4835266, 5978776, 6230968 and 6650936 high Memory usages on perturb health data. But the proposed IDFKM-M3DRT method achieved 4707464 low Memory usages on perturb health data. At file size 20 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 5869371, 4987883, 4956458 and 5245575 high Memory usages on perturb health data. But the proposed IDFKM-M3DRT method achieved 4904680 low Memory usages on perturb health data. At file size 30 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 6328583, 5868947, 5284758 and 5757853 high Memory usages on perturb health data. The proposed IDFKM-M3DRT method achieved 4835266low Memory usages on perturb health data. At file size 40 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 6947372, 6274685, 5968343 and 5969839 high Memory usages on perturb health data. The proposed IDFKM-M3DRT method achieves 4747584low Memory usages on perturbs health data. At file size 50 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 7253634, 6936746, 6395786 and 6763837 high Memory usages on perturb health data. The proposed IDFKM-M3DRT method achieved 4845857 low Memory usages on perturbs health data.

Table 5

Memory usage on perturb health data

Size of file (mb)	10	20	30	40	50
STRIDE	5978776	5869371	6328583	6947372	7253634
SDN-ACA	6230968	4987883	5868947	6274685	6936746
LZMA- DBSCAN	6650936	4956458	5284758	5968343	6395786
BDSA	4966573	5245575	5757853	5969839	6763827
IDFKM-M3DRT (proposed)	4707464	4904680	4835266	4747584	4845857

Table 6 shows the Memory usage on de-perturb health data of proposed IDFKM-M3DRT and existing methods STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA respectively. At file size 10 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 4866748, 4684724, 5284739 and 5837517 high Memory usage on de-perturb health data. But the proposed IDFKM-M3DRT method achieved 4644388 low Memory usage on de-perturb health data. At file size 20 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 5285475, 4938272, 573393 and 5856207 high Memory usage on de-perturb health data. But the proposed IDFKM-M3DRT method achieved 4267659 low Memory usage on de-perturb health data. At file size 30 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 5846387, 5538227, 5943684 and 6382722 high Memory usages on de-perturb health data. The proposed IDFKM-M3DRT method achieved 4267659 low Memory usages on de-perturb health data. At file size 40 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 6385883, 6184739, 6737825 and 6947286 high Memory usage on de-perturb health data. The proposed IDFKM-M3DRT method achieves 3457079 low Memory usage on de-perturb health data. At file size 50 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 7838627, 7943783, 7235636 and 7462629 high Memory usage on de-perturb health data. The proposed IDFKM-M3DRT method achieved 7150936 low Memory usage on de-perturb health data.

Table 6

Memory usage on de-perturb health data

Size of file (mb)	10	20	30	40	50
STRIDE	4866748	5285475	5846387	6385883	7838627
SDN-ACA	4684724	4938272	5538227	6184739	7943783
LZMA- DBSCAN	5284739	5738393	5943684	6737825	7235636
BDSA	5837517	5856207	6382722	6947286	7462629
IDFKM-M3DRT (proposed)	4644388	4267659	4267659	3457079	7150936

Table 7 shows the Precision of proposed IDFKM-M3DRT and existing methods STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA respectively. At file size 10 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 77.58%, 82.45%, 78.46% and 84.76% low Precision. But the proposed IDFKM-M3DRT method achieved 90.45% high Precision. At file size 20 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 83.56%, 86.46%, 81.03% and 86.79% low Precision. But the proposed IDFKM-M3DRT method achieved 92.66% high Precision. At file size 30 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 82.45%, 89.57%, 75.83% and 88.56% low Precision. The proposed IDFKM-M3DRT method achieved 93.43% high Precision. At file size 40 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 86.45%, 90.02%, 79.58% and 87.56% low Precision. The proposed IDFKM-M3DRT method achieves 95.34% high Precision. At file size 50 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 91.46%, 93.85%, 87.68% and 90.17% low Precision. The proposed IDFKM-M3DRT method achieved 96.45% high Precision.

Table 7

Performance of precision

Size of file (mb)	10	20	30	40	50
STRIDE	77.58	83.56	82.45	86.45	91.46
SDN-ACA	82.45	86.46	89.57	90.02	93.85
LZMA- DBSCAN	78.46	81.03	75.83	79.58	87.68
BDSA	84.76	86.79	88.56	87.56	90.17
IDFKM-M3DRT	90.45	92.66	93.43	95.34	96.45
(proposed)

Table 8 shows the confusion matrix of proposed IDFKM-M3DRT and existing methods STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA respectively.

Table 8

Confusion matrix

	TP	FP	FN	TN
STRIDE	21000	5000	4010	29990
SDN-ACA	21100	5100	3290	30510
LZMA- DBSCAN	22400	4050	1540	32010
BDSA	22640	3900	1400	32060
IDFKM-M3DRT	22940	3100	1100	32860
(proposed)

Table 9 shows the recall of proposed IDFKM-M3DRT and existing methods STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA respectively. At file size 10 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 85.25%, 78.98%, 66.92% and 71.92% low recall. But the proposed IDFKM-M3DRT method achieved 93.11% high recall. At file size 20 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 87.46%, 82.58%, 69.20% and 75.28% low recall. But the proposed IDFKM-M3DRT method achieved 94.35% high recall. At file size 30 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 89.26%, 86.37%, 73.95% and 78.23% low recall. The proposed IDFKM-M3DRT method achieved 95.23% high recall. At file size 40 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 91.89%, 89.10%, 77.39% and 85.28% low recall. The proposed IDFKM-M3DRT method achieves 96.83% high recall. At file size 50 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 92.16%, 91.02%, 83.25% and 89.25% low recall. The proposed IDFKM-M3DRT method achieved 97.83% high recall.

Table 9

Performance of recall

Size of file (mb)	10	20	30	40	50
STRIDE	85.25	87.46	89.26	91.89	92.16
SDN-ACA	78.98	82.58	86.37	89.10	91.02
LZMA-DBSCAN	66.92	69.20	73.95	77.39	83.57
BDSA	71.92	75.28	78.23	85.28	89.25
IDFKM-M3DRT	93.11	94.35	95.23	96.83	97.83
(proposed)

Table 10 shows the F-measure of proposed IDFKM-M3DRT and existing methods STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA respectively. At file size 10 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 84.57%, 72.94%, 82.48% and 79.37% low F-measure. But the proposed IDFKM-M3DRT method achieved 92.04% high F-measure. At file size 20 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 86.48%, 75.73%, 85.47% and 81.37% low F-measure. But the proposed IDFKM-M3DRT method achieved 93.14% high F-measure. At file size 30 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 88.94%, 78.13%, 87.03% and 84.28% low F-measure. The proposed IDFKM-M3DRT method achieved 93.88% high F-measure. At file size 40 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 90.81%, 82.36%, 91.37% and 86.37% low F-measure. The proposed IDFKM-M3DRT method achieves 95.14% high F-measure. At file size 50 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 93.26%, 87.27%, 92.46% and 88.19% low F-measure. The proposed IDFKM-M3DRT method achieved 96.85% high F-measure.

Table 10

Performance of F-measure

Size of file (mb)	10	20	30	40	50
STRIDE	84.57	86.48	88.94	90.81	93.26
SDN-ACA	72.94	75.73	78.13	82.36	87.27
LZMA- DBSCAN	82.48	85.47	87.03	91.37	92.46
BDSA	79.37	81.37	84.28	86.37	88.19
IDFKM-M3DRT	92.04	93.14	93.88	95.14	96.85
(proposed)

Table 11 shows the clustering period of proposed IDFKM-M3DRT and existing methods STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA respectively. At file size 10 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 13.46%, 12.57%, 11.25% and 14.37% high clustering period. But the proposed IDFKM-M3DRT method achieved 10.55% low clustering period. At file size 20 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 19.25%, 18.36%, 17.36% and 18.46% high clustering period. But the proposed IDFKM-M3DRT method achieved 16.43% low clustering period. At file size 30 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 24.57%, 22.57%, 24.36% and 25.84% high clustering period. The proposed IDFKM-M3DRT method achieved 20.54% low clustering period. At file size 40 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 28.46%, 27.57%, 28.36% and 28.37% high clustering period. The proposed IDFKM-M3DRT method achieves 26.44% low clustering period. At file size 50 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 36.95%, 37.98%, 36.37% and 36.36% high clustering period. The proposed IDFKM-M3DRT method achieved 35.23% low clustering period.

Table 11

Clustering period

Size of file (mb)	10	20	30	40	50
STRIDE	13.46	19.25	24.57	28.46	36.95
SDN-ACA	12.57	18.36	22.57	27.57	37.98
LZMA- DBSCAN	11.25	17.36	24.36	28.36	36.37
BDSA	14.37	18.46	25.84	28.37	36.36
IDFKM-M3DRT	10.55	16.43	20.54	26.44	35.23
(proposed)

Table 12 shows the accuracy of proposed IDFKM-M3DRT and existing methods STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA respectively. At file size 10 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 73.57%, 67.96%, 73.56% and 74.68% low accuracy. But the proposed IDFKM-M3DRT method achieved 84.77% high accuracy. At file size 20 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 77.12%, 69.01%, 75.46% and 76.27% low accuracy. But the proposed IDFKM-M3DRT method achieved 93.14% high accuracy. At file size 30 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 79.25%, 74.62%, 77.83% and 78.26% low accuracy. The proposed IDFKM-M3DRT method achieved 90.34% high accuracy. At file size 40 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 82.95%, 78.26%, 79.36% and 80.62% low accuracy. The proposed IDFKM-M3DRT method achieves 91.77% high accuracy. At file size 50 mb the existing STRIDE, SDN-ACA, LZMA- DBSCAN and BDSA methods was achieves 85.65%, 81.92%, 82.36% and 87.32% low accuracy. The proposed IDFKM-M3DRT method achieved 93.66% high accuracy. Figure 6 shows the Computational complexity.

Table 12

Performance of accuracy

Size of file (mb)	10	20	30	40	50
STRIDE	73.57	77.12	79.25	82.95	85.65
SDN-ACA	67.96	69.01	74.62	78.26	81.92
LZMA- DBSCAN	73.56	75.46	77.83	79.36	82.36
BDSA	74.68	76.27	78.26	80.62	87.32
IDFKM-M3DRT	84.77	86.88	90.34	91.77	93.66
(proposed)

Fig. 6

Computational complexity.

5 Conclusions

This manuscript proposes the Big Data in Hadoop Distributed File System (HDFS) in the light of combinedIDFKMandM3DRT approach is successfully implemented. The proposed method is implemented in Java programming and its performance is assessed by Health care database and the metrics under the study are memory usage, higher precision 34.75%, 25.86%, 23.75%, 33.78%, higher recall 23.86%, 27.97%, 22.97%, 37.97, 31.867%, lower aggregating time 35.23%, and the proposed method is compared with the existing method such as Security Analysis of SDN Applications for Big Data with spoofing identity, Tampering with data, Repudiation threats, Information disclosure, Denial of service and Elevation of privileges (STRIDE), Big Data Analysis-based Secure Cluster Management for using Ant Colony Optimization (ACA) Optimized Control Plane in Software-Defined Networks, System Architecture for Secure Authentication and Data Sharing in Cloud Enabled Big Data Environment using Lemperl Ziv Markow Algorithm (LZMA) and Density-based Clustering of Applications with Noise (DBSCAN), Big Data Based Security Analytics using data based security analytics (BDSA) approach for Protecting Virtualized Infrastructures in Cloud Computing respectively. The advantages of this method have less computation time as well as reduce the complexity of privacy protection with better accuracy.

The limitation of this work is this paper discusses only about Electronic Health Record (EHR). The other sources of big data in healthcare are not discussed. The other sources of big data in healthcare are Medical Imaging Data, Unstructured Clinical Notes and Genetic Data. This are discussed in future work and combine the various other sources of big data in healthcare such as social media, web searches and mobile devices and developing the knowledge system. Also for security purposes block chain based cryptography hash technique can be used for securing and sharing more pseudonymity details of different users and to reduce the complexity of the access of the authorized company.

References

Lakshmanaprabu

S.K.

, Shankar

, Ilayaraja

, Nasir

A.W.

, Vijayakumar

and Chilamkurti

, Random forest for big data classification in the internet of things using optimal features, International Journal of Machine Learning and Cybernetics 10(10) (2019), 2609–2618.

Baumann

, Mager

, Wetzker

, Thiele

, Zimmerling

and Trimpe

, Wireless Control for Smart Manufacturing: Recent Approaches and Open Challenges, Proceedings of the IEEE 109(4) (2020), 441–467.

Razzak

M.I.

, Imran

and Xu

, Big data analytics for preventive medicine, Neural Computing and Applications 32(9) (2020), 4417–4451.

Berkani

, Decision support based on optimized data mining techniques: Application to mobile telecommunication companies, Concurrency and Computation: Practice and Experience 33(1) (2021), 5833.

Shah

, Engineer

, Bhagat

, Chauhan

and Shah

, Research trends on the usage of machine learning and artificial intelligence in advertising, Augmented Human Research 5(1) (2020), 1–15.

Klarin

, The decade-long crypto currencies and the block chain rollercoaster: Mapping the intellectual structure and charting future directions, Research in International Business and Finance 51 (2020), 101067.

Campello

R.J.

, Kröger

, Sander

and Zimek

, Density-based clustering, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 10(2) (2020), 1343.

Thrun

M.C.

and Ultsch

, Using projection-based clustering to find distance-and density-based clusters in high-dimensional data, Journal of Classification 38(2) (2021), 280–312.

and Hu

, Improved sensor fault detection, diagnosis and estimation for screw chillers using density-based clustering and principal component analysis, Energy and Buildings 173 (2018), 502–515.

10.

Al-Shammari

, Zhou

, Naseriparsaa

and Liu

, An effective density-based clustering and dynamic maintenance framework for evolving medical data streams, International Journal of Medical Informatics 126 (2019), 176–186.

11.

Zhu

and Ma

, An effective partitional clustering algorithm based on new clustering validity index, Applied Soft Computing 71 (2018), 608–621.

12.

Mythili

, Thiyagarajah

, Rajesh

and Shajin

F.H.

, Ideal position and size selection of unified power flow controllers (UPFCs) to upgrade the dynamic stability of systems: an antlionoptimiser and invasive weed optimisation algorithm, HKIE Trans 27(1) (2020), 25–37.

13.

Rajesh

and Shajin

F.A.

, Multi-Objective Hybrid Algorithm for Planning Electrical Distribution System 22(1) (2020), 224–509.

14.

Singh

, Data Leakage Detection Using Cloud Computing, International Journal Of Engineering And Computer Science (2017).

15.

Shajin

FH.

and Rajesh

, Trusted secure geographic routing protocol: outsider attack detection in mobile ad hoc networks by adopting trusted secure geographic routing protocol, International Journal of Pervasive Computing and Communications (2020).

16.

Thota

MK.

, Shajin

FH.

and Rajesh

, Survey on software defect prediction techniques, International Journal of Applied Science and Engineering 17(4) (2020), 331–44.

17.

Pavlov

, Cazin

, Ern

and Roig

, Exploration by shake-the-box technique of the 3D perturbation induced by a bubble rising in a thin-gap cell, Experiments in Fluids 62(1) (2021).

18.

Manogaran

, Thota

, Lopez

, Vijayakumar

, Abbas

and Sundarsekar

, Big Data Knowledge System In Healthcare (In Internet of things and big data technologies for next generation healthcare) (2017).

19.

Kumar

and Rishiwal

, Design of retrievable data perturbation approach and TPA for public cloud data security, Wireless Personal Communications 108(1) (2019), 235–251.

20.

Ahmad

, Jacob

and Khondoker

, Security Analysis of SDN Applications for Big Data, In SDN and NFV Security, 39–55 (2018) Springer, Cham.

21.

, Dong

, Ota

, Li

and Guan

, Big data analysis-based secure cluster management for optimized control plane in software-defined networks, IEEE Transactions on Network and Service Management 15(1) (2018), 27–38.

22.

Shamsi

and Khojaye

, Understanding privacy violations in big data systems, IT Professional 20(3) (2018), 73–81.

23.

Narayanan

, Paul

and Joseph

, A novel system architecture for secure authentication and data sharing in cloud enabled big data environment, Journal of King Saud University-Computer and Information Sciences (2020).

24.

Bin

, Research on Methods and Techniques For Iot Big Data Cluster Analysis, International Conference on Information Systems and Computer Aided Education (ICISCAE), (2018).

25.

Win

T.Y.

, Tianfield

and Mair

, Big data based security analytics for protecting virtualized infrastructures in cloud computing, IEEE Transactions on Big Data 4(1) (2017), 11–25.

26.

Xiong

, Johnson

and Corso

, Active clustering with model-based uncertainty reduction, IEEE Transactions on Pattern Analysis and Machine Intelligence 39(1) (2017), 5–17.

27.

Wang

and Singh

, A theoretical analysis of noisy sparse subspace clustering on dimensionality-reduced data, IEEE Trans Inf Theory 65(2) (2019), 685–706.

28.

Yang

, Li

, Liu

, Li

, Feng

, Wang

, Zheng

, Xu

and Lyu

, Brief introduction of medical database and data mining technology in big data era, Journal of Evidence-Based Medicine 13(1) (2020), 57–69.

Efficient big data security analysis on HDFS based on combination of clustering and data perturbation algorithm using health care database

Abstract

Keywords

1 Introduction

2 Related works

3.1 Secured information transmission by the proposed approach

3.1.2.2 Process of retrieving the data

4.1 Performance and analysis

4.1.1 Dataset description

Precision

F-measure

Accuracy

Clustering period

Security level

Efficiency

Table 3 Security level, attack level and efficiency of proposed IDFKM-M3DRT and existing methods Method Security Attack Efficiency level (%) level (%) (%) STRIDE 87.4 10.7 85.2 SDN-ACA 89.6 12.6 87.6 LZMA- DBSCAN 85.2 15.4 58.1 BDSA 91.6 11.5 89.2 IDFKM-M3DRT 97.5 5.4 96.8 (proposed)

References

Table 3
Security level, attack level and efficiency of proposed IDFKM-M3DRT and existing methods

Method Security Attack Efficiency

level (%) level (%) (%)

STRIDE 87.4 10.7 85.2

SDN-ACA 89.6 12.6 87.6

LZMA- DBSCAN 85.2 15.4 58.1

BDSA 91.6 11.5 89.2

IDFKM-M3DRT 97.5 5.4 96.8

(proposed)