Abstract
Storage consumption is growing rapidly, and consumers are looking for effective ways to save storage space. Deduplication in cloud storage services is an important way to reduce bandwidth and storage space: it omits redundant information and keeps only a single copy of the data. However, computational, privacy and storage issues arise when large numbers of users outsource similar data to cloud storage. To overcome these problems, this research designs an effective Fuzzy-Dedup framework that integrates four steps. The data is first broken down into fixed-size chunks, and each chunk is immediately fingerprinted by a hashing algorithm to ensure data authentication. The fingerprints are then indexed with traditional B-tree indexing, and a similarity function computes the similarity value between documents. Once the similarity values are calculated, a fuzzy inference system with appropriately formulated rules makes the decision that separates duplicate from non-duplicate files, obtaining a better deduplication ratio than existing methods. After duplicate files are detected, the inline deduplication policy checks new data that is ready to be sent to storage against existing data and does not store any redundant data it discovers. The proposed model is implemented in MATLAB and evaluated on several performance metrics. It attains a deduplication ratio of 1.2, memory utilization of 12500 bytes inline and 9550 bytes offline, throughput of 32500 Mb/s inline and 25500 Mb/s offline, and processing time of 0.4494 s inline and 0.1139 s offline. Compared with previous methods such as the Two Thresholds Two Divisors (TTTD) deduplication approach, the proposed design shows higher performance.
Introduction
Cloud computing provides seemingly unlimited “virtualized” resources to users as services across the whole Internet, while hiding platform and implementation details [1]. Today’s cloud service providers offer both highly available storage and massively parallel computing resources at relatively low cost. As cloud computing becomes prevalent, an increasing amount of data is stored in the cloud and shared by users with specified privileges, which define the access rights to the stored data. Although storage costs are relatively low and evolving cloud storage solutions allow more and more data to be stored, the cost of managing, maintaining and processing this large amount of data can be significant [3]. Moreover, other studies have shown that around 75% of digital data is identical [2] and that the repetition rate in backup systems exceeds 90% [4]. In this scenario of high data redundancy, deduplication can be used to effectively reduce data storage space and communication overhead, so that each data file is stored as only one copy on the server. Deduplication is a well-known technique for making data management scalable in cloud computing and has received more and more attention in recent years [5]. Data deduplication is a special data compression technique for removing redundant copies of recurring data in storage. It improves memory utilization and can be applied when transferring information from a local disk to a third-party server to avoid data loss. It also reduces the number of bytes that have to be stored, which makes memory consumption more effective. Instead of keeping multiple copies of data with the same content, it removes duplicate data by keeping only one original copy and referring the other duplicates to that copy. That is to say, de-duplication technology can significantly help a cloud server reduce its storage and network burden [6].
Dropbox, Google Drive, Mozy and other cloud storage service providers have adopted de-duplication technologies to reduce the storage resource consumption of cloud storage managers, which further reduces the cost of managing and maintaining big data. It also reduces the cloud storage space and communication bandwidth required.
Data deduplication is performed at three levels: file level, block level and byte level. Block level is also termed chunk level. At each level, the files, blocks or bytes are compared and hashed to detect repetition. Chunk-level deduplication normally has four main steps: chunking, fingerprinting, index lookup and writing. In the chunking stage, the data is separated into non-overlapping data blocks (chunks). The block size may be fixed or variable depending on the chunking method used: fixed-size chunking (FSC) produces fixed data blocks, whereas content-defined chunking is the common method for variable-size blocks. In the fingerprinting stage, a cryptographic hash function computes a fingerprint for every chunk produced by the chunking phase. In the index lookup stage, a lookup table holding the fingerprint of each stored data chunk is maintained, and a lookup operation is performed for each newly generated fingerprint to determine whether the chunk is unique. If the fingerprint is not found in the lookup table, the data chunk is unique [7, 8]; the new fingerprint is inserted into the table for further reference, and the unique chunk is written to the data store on the storage space. If the fingerprint is found in the index, the chunk is an identical copy of an existing chunk, so only a reference to the existing data needs to be stored.
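The four chunk-level steps described above can be sketched in a few lines. This is a minimal illustration only, assuming fixed-size chunking, SHA-256 as the fingerprint function and an in-memory dictionary as the lookup table; the function names `fixed_size_chunks`, `deduplicate` and `rebuild` are hypothetical and do not come from the paper.

```python
import hashlib


def fixed_size_chunks(data: bytes, size: int):
    """Chunking: split data into non-overlapping blocks (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def deduplicate(data: bytes, size: int = 4):
    index = {}    # lookup table: fingerprint -> position in the chunk store
    store = []    # unique chunks that are actually written
    recipe = []   # references needed to rebuild the original data
    for chunk in fixed_size_chunks(data, size):
        fp = hashlib.sha256(chunk).hexdigest()  # fingerprinting
        if fp not in index:                     # index lookup
            index[fp] = len(store)
            store.append(chunk)                 # writing: only unique chunks
        recipe.append(index[fp])                # duplicates become references
    return store, recipe


def rebuild(store, recipe):
    """Reassemble the original data from the stored chunks and references."""
    return b"".join(store[i] for i in recipe)


original = b"ABCDABCDEFGHABCD"
store, recipe = deduplicate(original)
```

Here the 16-byte input contains the block `ABCD` three times, so only two unique chunks are written and the remaining occurrences are kept as references.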
Data deduplication has been successfully used in various application scenarios, such as backup systems, virtual machine storage, primary storage and WAN replication. Although it offers many benefits, it creates security and privacy concerns because users’ sensitive data is vulnerable to internal and external attacks. Because cloud users distrust cloud service providers, data holders tend to encrypt their data before outsourcing it in order to protect its confidentiality [9]. Data encryption is an important method of keeping private data safe, but it restricts data re-usability. It makes de-duplication much more difficult because the same data encrypted with different keys generates different ciphertexts [10, 11]. Conventional encryption provides data confidentiality but is not compatible with data deduplication: it requires multiple users to encrypt data with their personal keys, so identical copies of data from different users generate different ciphertexts, making deduplication impossible [12]. Therefore, developing new solutions that reconcile de-duplication with client-side encryption and effectively de-duplicate encrypted data is an important research direction in cloud computing.
The major contributions of this work are:
1. A framework with a dynamic size chunking (DSC) technique is designed that creates a fingerprint using a hashing algorithm to ensure security.
2. B-tree indexing is used when calculating the similarity between documents, which makes the system more efficient.
3. A fuzzy inference system with appropriate rules makes the decision at the final stage of the proposed design.
4. Non-duplicate content is isolated and stored in the database, with security provided by the Triple Data Encryption Standard, which makes the proposed design better suited to cloud applications.
5. A unique data reduction mechanism for disk-based storage systems is provided, which not only saves energy, storage space and cooling in data centres, but also significantly diminishes human error, administrative time and operational complexity.
The manuscript is organized as follows. Section 2 details the related works. Section 3 elaborates the problem statement and motivation of the research. Section 4 presents the proposed methodology. The dataset description and experimental results are elaborated in the subsequent section. The conclusion is provided in the final section.
Related works
Recently, extensive research has been conducted on de-duplication technologies. Some of the recent research works are reviewed below.
Yuan et al. [13] introduced a blockchain-based deduplication model that enhances security using public auditing. A homomorphic encryption standard was used to meet the security demand. In general, security-oriented blockchains include smart contracts that support an auditing scheme without depending on a third-party auditor. The authors designed the security model together with the deduplication process, which enables deduplication over encrypted information and in turn reduces the storage overhead and cost of the cloud service provider (CSP). The presented model automatically avoids a malicious CSP. However, homomorphic encryption may fall short where a high level of security is required. Likewise, Li et al. [14] addressed the reliable key management problem and developed a blockchain-based deduplication approach. They presented a baseline model in which every consumer holds an independent master key that is used to encrypt the convergent keys before the information is outsourced to cloud storage. With a large number of users, the baseline key management approach generates an enormous number of keys, and users are required to protect their master keys.
In this framework, users do not need to manage the keys on their own; instead, the convergent keys are securely distributed across several servers. Yang et al. [15] designed a model for secure client-side deduplication over encrypted data using zero knowledge. The scheme can detect misbehaviour on the client side with high probability. In addition, they developed a proxy re-encryption algorithm on top of a key distribution technique. The approach ensures that the server learns nothing about the encryption key while acting as a proxy that distributes the encryption key of a specific file. Only authorized clients that own the file can share it using the generated secret key, without establishing secure channels among themselves. It was proved that the client secret key cannot be recovered by another client or the server during the key distribution phase. Jayapandian and Md Zubair Rahman [16] focused on storage and privacy issues on the server side. They distributed the content based on a checksum storage model and presented a technique named interactive Message-Locked Encryption with Convergent Encryption (iMLEwCE), in which the information is first secured and the ciphertext is then encrypted again. This mechanism helps to reduce the storage space. Encryption keys are derived consistently from the chunk data, so identical chunks produce the same ciphertext, and an attacker cannot deduce the key from an encrypted chunk. In this manner the message is protected from the third-party cloud server.
Wu et al. [17] presented a deduplication-assisted primary storage system for a cloud-of-clouds setting. The system omits repeated data blocks in the cloud computing infrastructure and then distributes the information among independent cloud service providers according to the characteristics of the data. The technique stores data blocks on multiple cloud platforms by integrating data replication and erasure code mechanisms: data blocks with a high reference count are stored using replication, and the other data blocks are stored using erasure codes. Zhang et al. [18] analysed electronic health records derived from actual health records, where different patients may create duplicate electronic health records, which reduces system efficiency. Cross-patient duplicate health records may be created in large numbers and need to be updated only when doctors in the same department refer to the same data. To address this, they proposed the HealthDep mechanism, which secures duplicated health records stored on a cloud-assisted platform and allows the cloud server to deduplicate electronic health records efficiently. In the same manner, Rao et al. [19] analysed a Genetic Programming model for managing duplicated information, which segments the content at two or more cut points to make the duplicate detection process flexible. However, the presented approach was computationally inefficient, trust evaluation was not performed, and its performance was not up to the mark.
Li et al. [20] designed a data availability model for deduplication-based storage systems. It depends on the reference chunk count and on the frequency with which the information is accessed from the cloud. Although it adds some repeated data while chunking, it assures data integrity and availability and reduces the storage overhead. Li et al. [21] addressed the security challenge in deduplicated storage. They introduced a convergent encryption standard to provide confidentiality for sensitive information while supporting deduplication. The search key was generated in the index generation phase of deduplication. The user checked the uploaded data for integrity and verified it again after downloading, and fuzzy keyword search was used for keyword search. Sun et al. [22] proposed online replica deduplication for deleting redundant replicas. The method was applied for prediction in order to reduce duplicate files, and fuzzy clustering was applied before the replica files were deleted. Panda et al. [23] presented key characteristics of blockchain technology such as fault tolerance, immutability, openness and traceability. Their work used key generation and management for mutual authentication together with a one-way hash chain technique that provides a set of private and public key pairs for IoT devices. Panda et al. [24] presented a lightweight mutual authentication protocol based on Diffie-Hellman for IoT-enabled smart systems; its security attributes were validated with the widely accepted BAN logic and a thorough security analysis against different attacks. Panda et al. [25] presented a decentralised framework based on blockchain technology for managing smart devices, which attempts to provide legitimacy to every user. Anand et al. [26] developed a cloud-based secure watermarking system for smart healthcare applications based on IWT-Schur-RSVD and a fuzzy inference framework. For sharing medical records in the cloud, a secure watermarking approach based on integer wavelet transform (IWT)-Schur-Randomized Singular Value Decomposition (RSVD) is designed. To achieve dual watermarking by hiding the system MAC address in the logo picture, discrete wavelet transform (DWT) based embedding is used to assure a high level of authentication. To reduce or eliminate channel noise, the system address is first encoded using a turbo code.
Most of the traditional approaches to record deduplication discussed above rely on many choices for setting parameters, and these choices are not always optimal. Existing techniques require considerable processing time to calculate the similarity between attributes, so finding a replica takes longer. Likewise, similarity checking among documents fails to determine the boundary values that differentiate duplicate from non-duplicate files, which introduces errors and degrades system performance. Therefore, an efficient de-duplication method is developed in the proposed methodology to overcome these issues.
Proposed methodology - Fuzzy-Dedup: A secure deduplication model
Users around the world store their valuable data in the cloud because of benefits such as cost savings, scalability and accessibility. As data production rates rise, providing efficient storage can be a daunting task for providers. Cloud storage providers use different methods to improve storage capacity, and one of the most popular techniques in recent years is deduplication. Data deduplication is primarily used to remove redundant data that is unnecessary and occupies space. In this procedure, the data must be analysed to recognize duplicate data and remove it from the datasets. It is a unique data reduction mechanism extensively used in disk-based storage systems, which not only saves energy, storage space and cooling in data centres, but also significantly diminishes human error, administrative time and operational complexity. The proposed methodology has been developed with these objectives, and its outline is displayed in Fig. 1 below.

Fig. 1. Overall architecture of the proposed methodology.
Initially, a chunking step separates the files into different chunks, and for each individual chunk a unique hash signature, called a fingerprint, is generated by means of the HMAC-SHA-256 algorithm. The generated fingerprints are stored in an index, and the corresponding similarity values are computed by the proposed Fuzzy-Dedup system to identify duplicate files that have the same fingerprint value. The proposed model detects duplicate and non-duplicate files and secures them using a security protocol. If a file is repeated on the storage server, only its fingerprint is recorded on the cloud platform. The proposed method stores the files in encrypted form to ensure authentication: an encryption standard protects the privacy of sensitive data when it is uploaded to third-party cloud storage. The following steps describe the proposed Fuzzy-Dedup framework in detail.
The proposed Dynamic size chunking (DSC) model imitates the working principle of a clustering technique, using the size and type of the documents contained in the input dataset. DSC is used in the initial stage of the deduplication strategy to partition a large set of data into a particular number of groups. The approach segregates the given dataset into two or more clusters with the help of Euclidean distance: based on the distance measure, $n$ disjoint clusters are formed from the dataset. The first step of the DSC approach evaluates $n$ centroids, and the second step assigns each data point to the cluster with the nearest centroid. Euclidean distance is the most appropriate method for determining the nearest centroid. In the partition, each cluster is defined by its centroid and its member objects. Assume that the dataset carries a set of documents $d = \{d_1, d_2, \ldots, d_n\}$ whose sizes are given in GB. Based on the dataset size, a set of chunks is generated. The maximum chunk size in an input dataset is taken as 32 GB and is denoted $p$, and the dataset must be clustered into a total of $k$ clusters. The centre of a cluster is denoted $c_k$. The following steps define the flow of the proposed DSC approach:
1. Initialize $k$ clusters and their centres.
2. For every document in the dataset, compute the Euclidean distance $d$ between each cluster centre and the document size. It is mathematically calculated by expression 1.
3. Based on the calculated distance $d$, assign every document size to the nearest centre.
4. Re-evaluate the position of each centre after all documents have been assigned. It is mathematically represented by the expression shown in Equation 2.
5. Repeat the process until all the files are chunked and a minimum error value is attained.
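The expressions referenced in steps 2 and 4 do not appear in this copy. Assuming DSC follows the standard k-means formulation that the steps describe, they would take the form below. The symbols $x_i$ (the feature vector of document $d_i$, for example its size), $m$ (the number of features) and $S_k$ (the set of documents currently assigned to cluster $k$) are introduced here for illustration and are not taken from the original.

\[
d(x_i, c_k) = \sqrt{\sum_{j=1}^{m} \left( x_{ij} - c_{kj} \right)^2}
\]

\[
c_k = \frac{1}{|S_k|} \sum_{x_i \in S_k} x_i
\]

When only the document size is used, $m = 1$ and the distance reduces to $|x_i - c_k|$.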
Based on the above steps, the original files are chunked into different clusters. The dataset size and type are assigned to an individual cluster based on the Euclidean distance calculation. Once the chunking of files is complete, the proposed DSC re-evaluates the position of the centroid of every cluster. A new distance is then computed from each centroid to each data point using the Euclidean distance, and each point is allocated to the cluster with the lowest Euclidean distance.
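The clustering loop described above can be sketched as a one-dimensional k-means over document sizes. This is only an illustrative sketch under stated assumptions: documents are represented by their size alone, the initial centres are spread over the sorted sizes, and the function name `dsc_cluster` is hypothetical. The paper's own initialization and stopping criterion are not given in this copy.

```python
def dsc_cluster(sizes, k, max_iter=100):
    """Group document sizes into k clusters with a k-means style loop."""
    ordered = sorted(sizes)
    # Assumed initialization: pick k centres spread across the sorted sizes.
    step = max(k - 1, 1)
    centres = [ordered[(i * (len(ordered) - 1)) // step] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for s in sizes:
            # In one dimension the Euclidean distance is the absolute difference.
            nearest = min(range(k), key=lambda j: abs(s - centres[j]))
            clusters[nearest].append(s)
        # Re-evaluate each centre as the mean of its members (keep it if empty).
        new_centres = [
            sum(c) / len(c) if c else centres[j] for j, c in enumerate(clusters)
        ]
        if new_centres == centres:  # assignments have stabilised
            break
        centres = new_centres
    return centres, clusters


centres, clusters = dsc_cluster([1, 2, 3, 100, 101, 102], k=2)
```

With these six sizes and two clusters, the small and the large documents end up in separate groups.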
The data authentication scheme used here is HMAC-SHA-256, which creates secure, collision-resistant hash values of the data chunks. To generate the fingerprint, the hash generator uses HMAC-SHA-256, which produces a 256-bit signature for each chunk. It works on the concept of a digital signature, a secure method that ensures data authentication, non-repudiation and data integrity. Hashing can therefore be used instead of a full digital signature process for long data or messages. In the proposed methodology, the information or message is passed through a one-way (cryptographic) hash function, SHA-256, to allocate fingerprints. Hashing creates a compressed signature of the data in the form of a hash value or message digest, which is smaller than the message. Identical files produce the same hash fingerprint, and any change made to the message produces a different hash result. The representation of HMAC-SHA-256 is defined as follows:
The above stated equation uses the following parameters:
H = Cryptographic hash function (SHA-256),
K = Secret key,
m = Message,
|| = Concatenation,
⊕ = Exclusive OR,
opad = Outer padding,
ipad = Inner padding
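The expression itself is missing from this copy. For reference, the standard HMAC construction (RFC 2104) written with the symbols listed above is given below; $K'$, the key padded or hashed to the block size of $H$, is an extra symbol that does not appear in the original list.

\[
\mathrm{HMAC}(K, m) = H\Big( (K' \oplus \mathit{opad}) \,\|\, H\big( (K' \oplus \mathit{ipad}) \,\|\, m \big) \Big)
\]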
Based on the above expression, hash values are generated for each file in the chunk, which strengthens the authentication of each file. In the proposed model, the client uses this fingerprint as proof of ownership of the entire file.
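As a minimal sketch of this fingerprinting step, Python's standard `hmac` and `hashlib` modules can compute an HMAC-SHA-256 digest for a chunk. The key value and the helper name `fingerprint` are assumptions made for illustration; the paper does not say how the secret key is chosen or shared.

```python
import hashlib
import hmac


def fingerprint(chunk: bytes, key: bytes) -> str:
    """Return the HMAC-SHA-256 fingerprint of a chunk as 64 hex characters."""
    return hmac.new(key, chunk, hashlib.sha256).hexdigest()


key = b"demo-secret-key"  # hypothetical key, for illustration only
fp_a = fingerprint(b"same content", key)
fp_b = fingerprint(b"same content", key)
fp_c = fingerprint(b"same content!", key)

# Identical chunks give identical fingerprints; any change gives a different one.
same = hmac.compare_digest(fp_a, fp_b)
changed = not hmac.compare_digest(fp_a, fp_c)
```

Note that two users obtain the same fingerprint for the same content only if they use the same key, which is a design point any deployment of this scheme would have to settle.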
To maintain the fingerprints in a structured manner, indexing is performed with a tree-based approach, which helps to check the uniqueness of data chunks when files are stored and deleted. In the proposed research, the B-tree approach is used for indexing; it has attracted considerable attention because of its compatibility. The tree is a well-organized structure in which smaller elements are placed on the left side of the root and larger elements on the right side of the root node. Updating and searching take $\Theta(\log n)$ time. This enhances deduplication performance by providing fast update and search operations and helps to avoid large sorting functions. The main advantage of B-tree indexing over other indexing mechanisms is that it reduces the comparison count and the time complexity. The approach suits larger databases very well because it reduces computational overhead and search time. The index is effective because it stores hashes instead of complete files.
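The role of the ordered index can be illustrated with a small stand-in. The sketch below is not a B-tree: it keeps fingerprints in a sorted Python list and uses binary search from the standard `bisect` module, which gives the same logarithmic lookup cost but linear-time insertion. The class name `FingerprintIndex` is hypothetical.

```python
import bisect


class FingerprintIndex:
    """Ordered fingerprint index (sorted list used as a stand-in for a B-tree)."""

    def __init__(self):
        self._keys = []  # fingerprints kept in sorted order

    def contains(self, fp: str) -> bool:
        # Binary search: O(log n) comparisons, as in a B-tree lookup.
        i = bisect.bisect_left(self._keys, fp)
        return i < len(self._keys) and self._keys[i] == fp

    def insert(self, fp: str) -> bool:
        """Insert fp if absent; return True when the chunk is unique (new)."""
        i = bisect.bisect_left(self._keys, fp)
        if i < len(self._keys) and self._keys[i] == fp:
            return False  # duplicate: only a reference needs to be stored
        self._keys.insert(i, fp)  # O(n) here; a real B-tree keeps this O(log n)
        return True


index = FingerprintIndex()
first = index.insert("a3f1")
second = index.insert("a3f1")
```

A production index would use a real B-tree or an on-disk database index so that insertions also stay logarithmic.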
Fuzzy-Dedup: A deduplication model
Duplicate files in the dataset are detected by the proposed Fuzzy-Dedup using a cosine-similarity-based fuzzy inference system. Before the input is passed to the fuzzy system, the cosine similarity function is evaluated to measure the level of similarity between two files. Similarity values can only be calculated for two input files contained in a single chunk. The values vary gradually from 0 to 1, so a similarity value is rarely exactly 0 or 1, and the inline deduplication process has difficulty deciding whether files are duplicates. This kind of uncertainty can be modelled accurately by fuzzy logic. As mentioned earlier, the crisp value is calculated by the cosine similarity function between two documents.
In this Equation (4), Where
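Equation (4) and its symbol definitions are incomplete in this copy. As an illustration, the usual cosine similarity between two term-frequency vectors $A$ and $B$ is $A \cdot B / (\lVert A \rVert \, \lVert B \rVert)$, and the sketch below computes it for two text documents. Representing documents as whitespace-separated word counts is an assumption made here; the paper does not state how documents are vectorized.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between the word-count vectors of two documents."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Dot product over the words that both documents share.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


identical = cosine_similarity("cloud storage dedup", "cloud storage dedup")
disjoint = cosine_similarity("cloud storage", "fuzzy rules")
partial = cosine_similarity("cloud storage dedup", "cloud storage index")
```

Identical documents score close to 1, documents with no shared words score 0, and partially overlapping documents fall in between, which is the crisp input the fuzzy system receives.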
Figure 2 shows the fuzzy inference system, in which the similarity measurement is the input value. The input is fuzzified and the rules are constructed. After fuzzification, defuzzification produces the output used for duplicate detection. The system thus obtains the output corresponding to the given input. A total of 5 rules are framed. The membership function values and the fuzzy rules are framed from expert knowledge based on the fuzzy inference concept. The step-by-step procedure followed by the proposed methodology is given below.

Fig. 2. Fuzzy inference system.
The Lowest membership function of the input cosine similarity is mathematically expressed by the following notation.
The membership function for the term Lowest is defined with the help of Equation 5. Based on the similarity value obtained, the ranges are checked first, and then the conditions are determined.
The membership function for the term Low over the similarity range of the input parameter is formulated as follows.
The ranges are determined by the fuzzy inference logic. They define the criteria and the resulting outcome values for the Low membership function.
For the Medium membership function, the cosine similarity input parameter is given by the following notation.
Equation 8 represents the membership function obtained for the outcome High with regard to the input similarity parameter.
Similarly, for the Highest membership function, the cosine similarity input parameter ranges from 0.81 to 1 and is mathematically represented as follows.
Based on the ranges defined, the membership functions are generated with the help of fuzzy rules. With the fuzzy system, a similarity value no longer leaves the decision between duplicate and non-duplicate file ambiguous. The above membership function values and the fuzzy rules are framed from the researchers' assumptions based on the fuzzy inference concept. The number of rules is determined by the membership functions of the input parameter and is shown in the figure below.
Figure 3 shows the fuzzy system, in which the similarity value is the input value. Based on the degree of membership of each value, the crisp values are transformed into linguistic variables. The membership function Lowest corresponds to the first region, Low to the second region, Medium to the third region, High to the fourth region, Highest to the fifth region, and very large to region 6. A region refers to the linguistic variable to which values are mapped according to their degree of membership.

Fig. 3. Membership function of the input parameter.
Rule 1. If
Rule 2. If
Rule 3. If
Rule 4. If
Rule 5. If
After fuzzification, defuzzification produces the output that classifies a file as duplicate or non-duplicate. As data is received by a disk system, the software determines whether duplicate blocks already exist, based on files, hashes or bytes, before the data is written to the target system. Table 1 lists the rules generated using the proposed Fuzzy-Dedup algorithm.
Table 1. Rules generated using Fuzzy-Dedup
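The membership equations and rule bodies are not reproduced in this copy, so the following is only a sketch of how a five-term fuzzy decision on the similarity value could work. Everything specific in it is an assumption: the triangular membership shapes, the peak positions, the rule consequents (High and Highest mean duplicate) and the 0.5 decision threshold are placeholders, and the only range stated in the text is 0.81 to 1 for Highest.

```python
# Assumed peaks of five triangular membership functions on [0, 1].
TERMS = {"lowest": 0.0, "low": 0.25, "medium": 0.5, "high": 0.75, "highest": 1.0}
# Assumed rule consequents: 1.0 means duplicate, 0.0 means non-duplicate.
RULES = {"lowest": 0.0, "low": 0.0, "medium": 0.0, "high": 1.0, "highest": 1.0}
HALF_WIDTH = 0.25  # assumed half-width of each triangle


def memberships(sim: float) -> dict:
    """Fuzzification: degree of membership of sim in each linguistic term."""
    sim = min(1.0, max(0.0, sim))  # clamp to the valid similarity range
    return {
        term: max(0.0, 1.0 - abs(sim - peak) / HALF_WIDTH)
        for term, peak in TERMS.items()
    }


def duplicate_score(sim: float) -> float:
    """Defuzzification by weighted average of the rule consequents."""
    mu = memberships(sim)
    total = sum(mu.values())  # positive for any clamped input
    return sum(mu[t] * RULES[t] for t in mu) / total


def is_duplicate(sim: float, threshold: float = 0.5) -> bool:
    return duplicate_score(sim) >= threshold


decisions = {s: is_duplicate(s) for s in (0.3, 0.6, 0.9, 1.0)}
```

With these placeholder settings, a similarity of 0.6 belongs mostly to Medium and is treated as non-duplicate, while 0.9 lies between High and Highest and is treated as duplicate.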
After the duplicate and non-duplicate files are detected using the Fuzzy-Dedup technique, the proposed system decides when the deduplication process should take place. For that purpose, both offline and inline deduplication are applied in this research methodology to enhance system performance. The choice is made according to the criteria given in cases 1 and 2 below.
Deduplication is carried out according to the above cases. Finally, the non-redundant files obtained are secured and then stored in the cloud server. When a file is outsourced to a third-party server, users may want their data encrypted in order to preserve security. Hence the defuzzified non-redundant output files are stored in cloud storage (e.g. Google Drive, Dropbox) so that they can be stored and accessed from anywhere on the Web. Before the input files are stored in the cloud server, they are secured with the Triple Data Encryption Standard (TDES). TDES is a symmetric-key block cipher in which the identical key is used for encrypting and decrypting data in fixed groups of bits called blocks. TDES extends the key size by applying DES three times with a key bundle of three different keys, giving a combined key size of 168 bits (3 × 56). It operates in Encrypt-Decrypt-Encrypt mode, in which the plaintext is encrypted using K1, decrypted with K2 and encrypted again using K3. The encryption procedure in TDES, with each key of 56 bits excluding parity bits, is as follows:
That is, DES encrypts the plaintext with K1, then decrypts with K2 and then encrypts with K3. Decryption is the reverse: the ciphertext is decrypted with K3, encrypted with K2 and then decrypted with K1, and is denoted as follows:
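The two expressions are missing from this copy. Written out from the description above, with $P$ the plaintext block, $C$ the ciphertext block, and $E_K$ and $D_K$ denoting DES encryption and decryption under key $K$ (notation introduced here for illustration), they are:

\[
C = E_{K_3}\big( D_{K_2}\big( E_{K_1}(P) \big) \big)
\]

\[
P = D_{K_1}\big( E_{K_2}\big( D_{K_3}(C) \big) \big)
\]

Substituting the first expression into the second shows that each inner operation cancels its counterpart in turn, so decryption recovers $P$.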
Each triple encryption encrypts one 64-bit block of data at a time. In both encryption and decryption, the middle operation is the reverse of the first and last, which increases the strength of the algorithm. Hence a user who knows the proxy fingerprint of the file can prove ownership to the storage server and is then able to download the whole file. Finally, the secured files are stored with the cloud service provider for further access. This deduplication process reduces wasted storage, as only unique files are stored on the server and duplicate files are removed.
The pseudocode of the proposed methodology is presented in Algorithm 1 below.
The whole process is thus able to differentiate duplicate from non-duplicate files and to store the latter securely in the cloud server. The proposed methodology supports storage in both offline and inline modes, so the approach is able to obtain efficient outcomes in the deduplication process.
This section presents the experimental outcomes of the proposed methodology. The proposed research begins with the chunking phase, in which the set of documents is partitioned into several chunks. After chunking, a fingerprint is created for each file to serve as the signature of that input file, and the indexing phase is carried out. Similarity values are then computed using the cosine similarity function. Based on the similarity values obtained, the decision is made with the help of the fuzzy logic system, which generates a set of rules for categorizing duplicate and non-duplicate files. The detected non-duplicate files are secured and stored in the cloud server through both offline and inline deduplication. The whole implementation is carried out in MATLAB on an Intel(R) Core(TM) i5-3570S CPU running at 3.10 GHz. To evaluate the proposed methodology, the files contained in the authors' personal computer and laptop are used as the dataset to test and validate the proposed technique. The dataset contains a set of documents, and the maximum chunk size in an input dataset is taken as 32 GB. The experimental outcomes in this section validate the feasibility and effectiveness of the proposed methodology.
Table 2 shows that the proposed design performs well in both inline and offline modes. The proposed Fuzzy-Dedup method outperforms the previous methods in throughput, overall processing time and memory usage.
Comparison of Performance analysis results
The parameters taken for comparison are throughput, overall running time, memory utilization, deduplication ratio, encryption time and decryption time. The existing techniques compared are Rabin content-defined chunking (Rabin CDC), Fast Dedup and the Two Thresholds Two Divisors (TTTD) deduplication approach. Table 3 lists the parameters estimated for the proposed and existing techniques.
Parameters Estimated for proposed and existing techniques
The deduplication ratio is defined as the ratio of the data size before deduplication to the data size after deduplication. Fig. 4 shows the deduplication ratio measured for the proposed and existing methods, with the techniques on the X axis and the deduplication ratio on the Y axis. The proposed method achieves a deduplication ratio of 1.2 inline and 0.8 offline, which is higher than that of Rabin CDC (0.9 inline and 0.79 offline), Fast Dedup (0.8 and 0.6) and TTTD (0.7 and 0.45). The deduplication ratio of the proposed Fuzzy-Dedup method therefore remains the highest in both the inline and offline deduplication processes. Since a higher deduplication ratio indicates better system performance, the proposed methodology outperforms Fast Dedup, TTTD and Rabin CDC on this measure.
Throughput is defined as the number of information units a system can handle in a given period of time. In the deduplication process, it is the amount of data deduplicated per unit of time. It can be expressed in bits per second (bps), megabits per second (Mbps) or gigabits per second (Gbps). Here, throughput is measured in Mb/s for the proposed and previous methods and is shown in Fig. 5.
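The two metrics defined above can be written down directly. The sketch below follows only the textual definitions; the function names and the sample byte counts are illustrative assumptions, and the normalization behind the reported values is not stated in the text, so the numbers used here are not the paper's measurements.

```python
def dedup_ratio(bytes_before: int, bytes_after: int) -> float:
    """Data size before deduplication divided by data size after it."""
    if bytes_after <= 0:
        raise ValueError("bytes_after must be positive")
    return bytes_before / bytes_after


def throughput_mbps(bytes_processed: int, seconds: float) -> float:
    """Amount of data deduplicated per second, in megabits per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return bytes_processed * 8 / 1_000_000 / seconds


# Illustrative values only: 1200 bytes reduced to 1000 bytes,
# and 1,000,000 bytes processed in 2 seconds.
example_ratio = dedup_ratio(1200, 1000)
example_throughput = throughput_mbps(1_000_000, 2.0)
```

With these sample inputs the ratio is 1.2 and the throughput is 4 Mb/s.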

De-duplication ratio obtained for offline and inline de-duplication process.

Outcome of throughput measure for proposed and existing methods.
The throughput of the proposed model is compared with that of the existing techniques Rabin CDC, Fast Dedup and TTTD in Fig. 5, with the techniques on the X axis and throughput on the Y axis. The proposed method achieves a throughput of 32500 Mb/s inline and 25500 Mb/s offline, which is higher than that of Rabin CDC (25050 Mb/s inline and 18000 Mb/s offline), Fast Dedup (23000 Mb/s and 17500 Mb/s) and TTTD (20000 Mb/s and 15000 Mb/s). The proposed Fuzzy-Dedup model thus exhibits the highest throughput in both the inline and offline phases and is more efficient than the existing techniques in this respect. The next step is to calculate the time the deduplication technique takes to produce its output. In other words, the computation or deduplication time is the time taken for the entire deduplication process. It can be measured in seconds or milliseconds.
Figure 6 illustrates the overall processing time of the system for the proposed and existing methods, with the techniques on the X axis and processing time on the Y axis. The deduplication time of the proposed Fuzzy-Dedup method is low, which makes it faster and more efficient than the other existing methods. The overall processing time of the proposed method is 0.4494 seconds inline and 0.1139 seconds offline, which is less than that of Rabin CDC (0.8 seconds inline and 0.21 seconds offline), Fast Dedup (0.9 seconds and 0.42 seconds) and TTTD (1.22 seconds and 0.59 seconds). Finally, the memory space occupied by each method is calculated. Memory utilization is the amount of memory space in the cloud that the stored data occupies under a given deduplication approach, that is, a measure of the data space used by the user.

Overall processing time of the system for proposed and existing methods.
Figure 7 illustrates the memory utilization of the proposed and existing methods, with the techniques on the X axis and memory (bytes) on the Y axis. The memory utilization of the proposed method is 12000 bytes inline and 9500 bytes offline, which is less than that of Rabin CDC (24000 bytes inline and 17500 bytes offline), Fast Dedup (34500 bytes and 29550 bytes) and TTTD (35000 bytes and 30000 bytes). The low storage requirement of the proposed Fuzzy-Dedup model in both offline and inline modes makes the system more efficient. The other existing methods require considerably more storage, which makes them difficult to deploy in a realistic environment. From the overall analysis, it is clear that the proposed method achieves more efficient results than the other existing methods.

Memory utilization attained by proposed and existing methods.
Figure 8 shows the outcome of the proposed TDES-based security algorithm. Data encryption and decryption times are measured for different data sizes and increase in proportion to the data size. The figure shows that the proposed method achieves the lowest encryption and decryption times, given in seconds. The existing techniques compared here are the Data Encryption Standard (DES), the Advanced Encryption Standard (AES) and homomorphic encryption. From the overall analysis, the proposed method outperforms the existing techniques.
The proposed design was also compared with the previous methods on the deduplication metrics and exhibited better performance. In deduplication ratio, the proposed Fuzzy-Dedup method is 20% better than Rabin CDC, while Fast Dedup and TTTD perform 20–30% worse than Fuzzy-Dedup in both inline and offline modes. The overall processing-time figures for the proposed Fuzzy-Dedup method are given as 40% for inline and 10% for offline; compared with the previous methods, it is 40% better than Rabin CDC, which takes 0.8 seconds for the whole process, while Fast Dedup takes 0.91 seconds and TTTD takes 1.2 seconds. The memory utilization of Fuzzy-Dedup is 10000 bytes inline and 9000 bytes offline, which is 30% lower than that of the previous methods and satisfies the design objective. The TTTD method occupies more memory space than all the other implemented methods. Thus the proposed Fuzzy-Dedup design gives better overall performance than the previous methods in terms of memory usage and processing time.

Encryption and decryption time obtained by proposed and existing methods.
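The percentage comparisons quoted in this section can be reproduced with simple arithmetic. The sketch below uses the inline processing times reported for Fig. 6; the helper name `percent_reduction` is an assumption introduced here, and the same calculation applies to the memory figures.

```python
def percent_reduction(baseline: float, proposed: float) -> float:
    """Relative reduction of `proposed` with respect to `baseline`, in percent."""
    return (baseline - proposed) / baseline * 100.0


# Inline processing times in seconds, as reported for Fig. 6.
PROPOSED_INLINE_S = 0.4494
BASELINES_INLINE_S = {"Rabin CDC": 0.8, "Fast Dedup": 0.9, "TTTD": 1.22}

reductions = {
    name: percent_reduction(seconds, PROPOSED_INLINE_S)
    for name, seconds in BASELINES_INLINE_S.items()
}

for name, reduction in reductions.items():
    print(f"{name}: {reduction:.1f}% lower inline processing time")
```

On these figures the reduction is roughly 44% against Rabin CDC, 50% against Fast Dedup and 63% against TTTD, which is consistent with the approximately 40% improvement over Rabin CDC stated above.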
The main objective of this research is to reduce redundant data so as to free storage space while keeping deduplication fast. The proposed deduplication module involves chunking, fingerprinting, fingerprint indexing and writing. In chunking, files are partitioned into chunks, and the chunk size is decided by the working principle of the k-means clustering algorithm. A unique identifying value, known as a fingerprint, is computed for each chunk using a suitable hash signature algorithm. These fingerprints are stored in an index. Similarity values are then computed and duplicate records detected with the proposed cosine-similarity-based fuzzy inference system. Based on the calculated similarity value, fuzzy rules are generated to decide whether a particular file is stored or not. In this way the proposed system decides which files are duplicates and which are not. Before the data is stored on the cloud server, it is secured with an encryption standard, and storage follows the inline deduplication process. The proposed methodology achieves better results than the existing techniques in terms of throughput, deduplication ratio, processing time and memory utilization. It provides comparatively high redundancy detection with higher throughput, a higher deduplication ratio, lower computation time and very low hash-value comparison time for big data storage systems. The experimental results show a deduplication ratio of 1.2, memory utilization of 12500 bytes inline and 9550 bytes offline, throughput of 32500 Mb/s inline and 25500 Mb/s offline, and processing time of 0.4494 s inline and 0.1139 s offline. The key generation time of the proposed TDES-based security algorithm is 0.1 seconds for encryption and 0.07 seconds for decryption, which is lower than that of the other methods.
In future work, the storage of the index format and the encryption standard could be improved, which would open a new direction for developing the security analysis of the Fuzzy-Dedup design for cloud applications.
