Grey wolf optimization based clustering of hybrid fingerprint for efficient de-duplication

Abstract

This paper addresses data de-duplication as a means of improving storage optimization. First, it proposes a hybrid fingerprint extracted using the simhash (SH) and Huffman coding (HC) algorithms. Second, the data is clustered using grey wolf optimization (GWO), a recent meta-heuristic technique, to extract the metadata. The extracted metadata is stored in a metadata server, which provides better storage optimization and de-duplication. A Euclidean distance based GWO is adopted because the clustering objective is to minimize the Euclidean distance in the GWO based clustering for de-duplication. The proposed GWO based clustering method is compared with existing methods, namely k-means, k-mode, Euclidean distance based particle swarm optimization and Euclidean distance based genetic algorithm, in terms of accuracy, True Positive Rate (TPR), True Negative Rate (TNR) and performance time, and the significance of the GWO based clustering method is described.

Keywords

De-duplication, simhash algorithm, Huffman coding, grey wolf optimization, accuracy

1. Introduction

The rapid growth of data in data centres leads to the problem of data/file duplication [8]. A recent survey by the International Data Corporation (IDC) concludes that data/files in the digital world are doubling every 18 months and, more particularly, that 75% of stored data are duplicated. Specifically, in backup systems, the duplication rate of data exceeds 90% [1]. As data grows rapidly [4], there is a need to protect and optimize it. Additionally, data loss is a major concern. De-duplication techniques are therefore widely used to enhance data reliability and storage efficiency. Moreover, de-duplication is an effective and efficient technology for backing up data [3]. Data de-duplication [1, 34] is a distinct data compression technique that eliminates redundant copies of repeating data. It may also be described in terms of intelligent (data) compression and single-instance (data) storage [35, 36]. The technique can also be used in network data transfer, where it minimizes the number of bytes sent, and it enhances storage utilization. During de-duplication, the unique chunks of data, or byte patterns, are recognized and stored for analysis.

De-duplication is suitable not only for structured data but also for unstructured data such as key-value stores, distributed file systems and backup/archival systems [9], since most files in cloud and enterprise environments are duplicated. Several de-duplication techniques have been proposed for storage efficiency. DupLESS [2] is a de-duplication technique that addresses the issues of the convergent encryption (CE) scheme and provides better privacy. Moreover, the Content-Defined Chunking (CDC) [2] algorithm is a chunk-level de-duplication approach that solves the boundary-shifting problem. Furthermore, an inline de-duplication model named I-sieve [3] and the Resemblance and Mergence based De-duplication scheme (RMD) [4] increase system performance and provide rapid responses to fingerprint queries.

However, data de-duplication also faces many barriers. Computational resource intensity is one of the major issues. Moreover, data integrity is a challenge, since it depends entirely on the design of the de-duplication system. Another important obstacle is scaling, since the scope of the technique has to be shared among the storage devices. Apart from this, maintaining the quality of the data during processing is also a major pitfall of de-duplication technologies. Many meta-heuristic approaches [37, 38, 39, 40, 41, 42, 43, 44, 45, 46] have been exploited for de-duplication. Further, space efficiency is adversely affected if the infrastructure has multiple disk backup devices with distinct de-duplication. Furthermore, an attacker may gain access to the data by guessing or knowing the hash values that are owned by others. Although numerous techniques and schemes are available for de-duplication, an efficient de-duplication model for storage enhancement is still needed.

The main contribution of this paper is a de-duplication method for improving storage optimization. To this end, a hybrid fingerprint extracted using the simhash (SH) and Huffman coding (HC) algorithms is proposed. Moreover, the data is clustered using the GWO algorithm in order to extract the metadata. The hybrid fingerprint and the use of GWO to extract the metadata constitute the new approach proposed in this paper. The extracted metadata is stored in a metadata server, which provides better storage optimization and de-duplication. A Euclidean distance based GWO is adopted because the clustering objective is to minimize the Euclidean distance in the GWO based clustering for de-duplication.

The rest of the paper is organized as follows. Section 2 reviews the related works. Section 3 presents the proposed de-duplication architecture. Section 4 describes the GWO-based metadata construction. Section 5 describes the simulation results, and Section 6 concludes the paper.

2. Literature review

In 2015, Xing et al. [1] proposed a cluster de-duplication system named AR-dedupe to improve data reliability and storage efficiency. Cluster de-duplication systems commonly face challenges such as high communication overhead for data routing, a data de-duplication rate that decreases as more de-duplication server nodes are added, and the need for load balancing to improve throughput. To speed up the indexing of handprints in the routing server, the authors used an application-aware mechanism. The proposed technique was evaluated on two real datasets, and the results demonstrated that the system is capable of achieving a high data de-duplication rate. In addition, the communication overhead was low and good load balancing was maintained, since a new data routing algorithm was adopted. Further, the application-aware mechanism provided a 30% improvement in performance.

In 2016, Zhang et al. [2] proposed a new CDC algorithm named Asymmetric Extremum (AE) for chunk-level de-duplication. The algorithm addresses the boundary-shifting problem, in which the extreme values cannot be considered as a new extreme value. The algorithm was evaluated on real-world datasets, and the results showed increased chunking throughput. Moreover, compared with the existing CDC algorithms, the proposed algorithm gave a smaller chunk size variance and could find proper chunk boundaries even in low-entropy strings. The AE algorithm also improved on the throughput of state-of-the-art CDC algorithms: the improvement was a factor of 2.3 over the existing algorithm, and the system throughput was enhanced by more than 50%.

In 2015, Wang et al. [3] proposed a novel de-duplication model named I-sieve. The I-sieve architecture, which is based on iSCSI, aims to increase the performance of data sieving in cloud storage systems. In particular, the authors presented a multi-level cache that reduces RAM consumption and optimizes lookup performance through the design of the mapping and index tables. They implemented the I-sieve prototype on the basis of an open source iSCSI target and evaluated the architecture with testing tools and virtual machine images. The results showed excellent de-duplication and foreground performance. Moreover, the I-sieve architecture can co-exist with existing systems, since it supports the iSCSI protocol.

In 2017, Zhang et al. [4] proposed a resemblance and mergence based de-duplication scheme (RMD) that provides faster responses to fingerprint queries. The goal of the scheme was to reduce the query range and to leverage a Bloom filter array. The scheme uses a resemblance algorithm to detect resembling data segments, and the resembling segments are put together in the same bin. At query time, detecting duplicate content therefore requires searching only the corresponding bin, which speeds up the query process significantly. In addition, RMD exploits a frequency-based fingerprint retention policy and uses a mergence strategy to accumulate resembling segments, which improves the de-duplication ratio and query throughput. RMD was evaluated on real-world datasets, and the results showed that it obtains high query performance and outperforms various de-duplication schemes.

In 2015, He et al. [5] proposed a public auditing scheme for cloud storage systems in which data integrity auditing and de-duplication of encrypted data are achieved together in the same framework. The auditor can thus check the integrity of de-duplicated data, and the cloud server can check the ownership of a new owner. In addition, the scheme uses proxy re-encryption to support the de-duplication of encrypted data, and it aggregates the tags from different owners to achieve de-duplication of the data tags. The experimental results showed that the proposed scheme was more efficient, and the scheme was proved to be more secure.

In 2015, Kwon et al. [6] proposed a secure de-duplication scheme with user revocation to solve a problem of the CE scheme. The problem is that, even after a user has lost ownership of the cloud data, the CE based policy still permits that user to decrypt the data. To generate the encryption key, the proposed scheme adopts oblivious privilege-based encryption. The scheme was evaluated on some data sets, and the security analysis proved that it is safe from unauthorized decryption by the cloud server, from brute-force attacks and from revoked users.

Figure 1. Taxonomy of de-duplication scheme.

The properties and limitations identified in the literature review reveal the significance of the adopted data de-duplication methodologies. The AR-dedupe cluster de-duplication system [1] reduces communication overhead while producing a high data de-duplication rate. However, larger files require more capacity. The Asymmetric Extremum (AE) Content-Defined Chunking (CDC) algorithm [2] gives high chunking throughput and is able to eliminate low-entropy strings, but its experimental evaluation is difficult since it needs more tests. Similarly, the I-sieve inline de-duplication model [3] supports the iSCSI protocol and can de-duplicate data in small storage environments, but its computational time is high. Furthermore, RMD [4] achieves high query performance, but problems may occur while buffering fingerprints. The public auditing scheme [5] is more secure and also achieves de-duplication of data tags, but its computation is complicated, and problems may arise in the cloud as the number of data owners increases. Moreover, the scheme in [6] protects against brute-force attacks, but its access policy is complex. Hence, there is a need for an efficient data de-duplication methodology for reliable and better storage enhancement. Figure 1 represents the taxonomy of de-duplication schemes.

Figure 2. Architecture of the proposed de-duplication method for storing files.

3. Proposed de-duplication architecture

Figure 2 shows the architecture of the proposed de-duplication method for storing files. Storing the metadata involves two stages, namely offline storage and online storage. In offline mode, when a webpage or document is provided, it is first separated into chunks. To detect near duplicates and generate fingerprints, the separated chunks are then subjected to simhash and Huffman coding, and the two resulting fingerprints are concatenated. A hash collision remains possible when the concatenated simhash and Huffman codes are used as the fingerprint, but the compressed hash-coded data is uncertain. To extract the metadata, the GWO [27] method, which performs clustering, is employed. The centroids resulting from the clustering are constructed as metadata and saved in the metadata server. The chunks are indexed and stored in the storage disk, while the index of each chunk is stored in the index table.

In online mode, when we attempt to save a new file, it is first separated into chunks. New chunks are formed for the data that do not match and are stored in the storage disk. The newly generated metadata are stored in the metadata server. The index is then formulated and stored in the index table.
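The online storage path can be sketched as follows. This is only an illustrative sketch under several assumptions that the text does not fix: fixed-size chunking, scalar fingerprints, a chunk counted as a match when a stored chunk in the nearest cluster has the same fingerprint and the same bytes, and the hypothetical names `store_file`, `nearest_cluster`, `index_table` and `disk`. The fingerprint function is passed in, so the hybrid fingerprint of Sections 3.1 and 3.2 could be substituted for the toy one used in the usage lines.

```python
def nearest_cluster(fp, centroids):
    """Index of the centroid (metadata entry) closest to a scalar fingerprint."""
    return min(range(len(centroids)), key=lambda c: abs(fp - centroids[c]))


def store_file(data, chunk_size, fingerprint_fn, centroids, index_table, disk):
    """Store `data` chunk by chunk and return the disk positions of its chunks.

    index_table maps (cluster, fingerprint) -> list of positions in `disk`.
    Only chunks without a match are appended to `disk`.
    """
    refs = []
    for start in range(0, len(data), chunk_size):      # assumed fixed-size chunking
        chunk = data[start:start + chunk_size]
        fp = fingerprint_fn(chunk)
        key = (nearest_cluster(fp, centroids), fp)
        match = None
        for pos in index_table.get(key, []):
            if disk[pos] == chunk:                      # guard against collisions
                match = pos
                break
        if match is None:                               # mismatching data: new chunk
            disk.append(chunk)
            match = len(disk) - 1
            index_table.setdefault(key, []).append(match)
        refs.append(match)
    return refs


# Usage with a toy fingerprint (sum of bytes), which is NOT the hybrid fingerprint.
index_table, disk = {}, []
refs = store_file(b"abcdabcdwxyz", 4, sum, [400.0, 480.0], index_table, disk)
restored = b"".join(disk[r] for r in refs)
```

Restricting the lookup to the nearest cluster is what lets the metadata narrow the search; the byte comparison is an extra safeguard added here because fingerprint collisions are possible.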

3.1 Simhash fingerprint

In fingerprint computation, we use the simhash algorithm [32] to address near duplicates. simhash has two vital but conflicting properties that make it well suited to detecting near duplicates: (i) the fingerprint of a chunk $C_{ij}$ is defined as a “hash” of its properties, and (ii) similar chunks have similar hash values. It therefore indicates whether two chunks are near duplicates by comparing their hash values. First, each filtered chunk is transformed into a set of keywords. Each keyword is associated with its weight, i.e., the total number of times the keyword appears in $C_{ij}$ . This weighted keyword set forms a vector of High Dimension (HD), which is transformed into an f-bit fingerprint, where ‘f’ is small compared with the real dimension. Let the input $C_{ij}$ be pre-processed into a series of keywords. We initialize an f-dimensional vector V with each element set to zero. Each keyword is then hashed into an f-bit value. These f bits increase or decrease the corresponding components of V by the weight of the keyword: a component is incremented if the bit is 1 and decremented if the bit is 0. Finally, the signs of the components determine the bits of the final fingerprint of $C_{ij}$ . The process is repeated for each $C_{ij}$ and its fingerprint is extracted. In simhash, the fingerprints of chunks are given by $F_{ij}^{SH}$ . The pseudo code of the simhash algorithm is given in Algorithm 1 and the detailed process is given in Fig. 3.
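The steps above can be sketched in a few lines. This is a minimal sketch rather than the authors' implementation: the choice of MD5 as the per-keyword hash, the default of f = 64 bits, and the helper names `simhash` and `hamming` are assumptions made for illustration.

```python
import hashlib
from collections import Counter


def simhash(keywords, f=64):
    """Return an f-bit simhash fingerprint (as an int) of a keyword list."""
    weights = Counter(keywords)            # weight = number of occurrences
    v = [0] * f                            # f-dimensional vector V, all zeros
    for word, weight in weights.items():
        # Hash the keyword to an f-bit value (MD5 is an illustrative choice,
        # so f must not exceed 128 here).
        digest = hashlib.md5(word.encode("utf-8")).digest()
        h = int.from_bytes(digest, "big") & ((1 << f) - 1)
        for i in range(f):
            if (h >> i) & 1:
                v[i] += weight             # bit is 1: increase the component
            else:
                v[i] -= weight             # bit is 0: decrease the component
    fingerprint = 0
    for i in range(f):
        if v[i] > 0:                       # the sign of each component gives one bit
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


fp_a = simhash("grey wolf optimization clusters the chunk fingerprints".split())
fp_b = simhash("grey wolf optimization clusters the chunk fingerprint".split())
distance = hamming(fp_a, fp_b)             # a small distance suggests near duplicates
```

Because the fingerprint depends only on keyword counts, reordering the keywords of a chunk leaves it unchanged.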

Figure 3. Generation of fingerprints for file chunks using simhash.

3.2 Huffman coding fingerprint

Nowadays an advanced coding method known as simhash coding is used for extracting the fingerprints from chunks. Since the simhash based fingerprint focuses mainly on the indexing of the respective keywords, this paper proposes a fingerprint based on Huffman coding [33] and generates the final fingerprint by combining the simhash and Huffman coding fingerprints. The Huffman coding based fingerprint gives more significance to the frequency of occurrence of the keywords than to their indexing. As Huffman coding reflects the keyword frequencies of the chunks, it supports better de-duplication and chunk separation.

Figure 4. Generation of fingerprint for file chunks using Huffman coding.

Figure 5. Exemplary Huffman coding tree for generating the fingerprint for chunk keywords.

Figure 4 demonstrates the generation of the fingerprint for file chunks using Huffman coding. Figure 4 is explained by the Huffman coding tree in Fig. 5, which shows an exemplary tree for generating the fingerprint for chunk keywords. The tree is constructed from keywords with different frequencies, and weights are allotted to them accordingly: $K_{1}$ carries an associated frequency of 8, $K_{2}$ of 4, $K_{3}$ of 3, $K_{4}$ of 2 and $K_{5}$ of 1.

In a Huffman coding tree, the symbol corresponding to any code can be found by starting from the root and moving down until we reach the leaf that carries the symbol. 0 is assigned to every left branch and 1 to every right branch of the tree. For example, in Fig. 5, we reach the leaf $K_{3}$ by passing through a right branch, a second right branch and then a left branch. Thus $K_{3}$ carries the code 110. In the same way, $K_{1}$ , $K_{2}$ , $K_{4}$ and $K_{5}$ carry the codes 0, 10, 1110 and 1111, so the Huffman coding for the entire set of keywords is 01011011101111. In Huffman coding, the fingerprints of chunks are represented by $F_{ij}^{HC}$ . Thus the overall representation of fingerprints can be denoted by $|{F_{ij}}|=|{F_{ij}^{SH}}|+|{F_{ij}^{HC}}|$ .
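The worked example can be reproduced with a short sketch. The tie-breaking rule used here (the subtree holding the earliest-listed keyword becomes the left child) is an assumption chosen so that the output matches the example tree; other valid Huffman trees give the same code lengths with different bit patterns. The function name `huffman_codes` is hypothetical, and keywords are assumed to be strings.

```python
import heapq


def huffman_codes(freqs):
    """Map each keyword to its Huffman code (left branch = 0, right branch = 1).

    freqs: dict of keyword -> frequency, in the order the keywords are listed.
    """
    order = {k: i for i, k in enumerate(freqs)}
    # Heap entries are (weight, rank, tree); rank is the position of the
    # earliest keyword in the subtree, which is unique per entry.
    heap = [(w, order[k], k) for k, w in freqs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        w1, r1, t1 = heapq.heappop(heap)       # two lightest subtrees
        w2, r2, t2 = heapq.heappop(heap)
        left, right = (t1, t2) if r1 < r2 else (t2, t1)
        heapq.heappush(heap, (w1 + w2, min(r1, r2), (left, right)))
    codes = {}

    def walk(tree, prefix):
        if isinstance(tree, tuple):            # internal node
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                                  # leaf: a keyword
            codes[tree] = prefix

    walk(heap[0][2], "")
    return codes


freqs = {"K1": 8, "K2": 4, "K3": 3, "K4": 2, "K5": 1}
codes = huffman_codes(freqs)
f_hc = "".join(codes[k] for k in freqs)        # Huffman coding fingerprint
# codes["K3"] is "110" and f_hc is "01011011101111", as in the example above.
```

Under these assumptions the hybrid fingerprint is the simhash bit string followed by `f_hc`, so its length is the sum of the two lengths.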

3.3 Objective model

The objective model for clustering the chunks is given in Eq. (1),

$\displaystyle Q_{c}^{\ast}=\arg\min\limits_{Q_{c}}\|{Q_{E}}\|$ (1)

$\displaystyle Q_{E}=\sqrt{\sum\limits_{c=1}^{N_{c}}{\|{f_{ij}}-{Q_{c}}\|}^{2}}$ (2)

where $Q_{E}$ is the clustering error and $Q_{c}^{\ast}$ is the optimal centroid. Each cluster is represented by a centroid, which is given by $Q_{c}$ : $c=1,2,\ldots,N_{c}$ . Each centroid, which itself represents a chunk, also represents the chunks of its cluster. By minimizing the clustering error using Eq. (1), the clustering can be achieved and so the metadata can be constructed.
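As an illustration of Eq. (2), the error of a candidate set of centroids can be computed as below. Two points are assumptions of this sketch, since the equation leaves them implicit: fingerprints are treated as scalars (their decimal form), and each fingerprint contributes its squared distance to the nearest centroid. The name `clustering_error` is hypothetical.

```python
import math


def clustering_error(fingerprints, centroids):
    """Q_E: root of the summed squared distances to the nearest centroid."""
    total = 0.0
    for f in fingerprints:
        total += min((f - q) ** 2 for q in centroids)   # nearest-centroid assignment
    return math.sqrt(total)


# Four fingerprints forming two tight groups, and one centroid per group.
error = clustering_error([1, 2, 10, 11], [1.5, 10.5])   # sqrt(4 * 0.25) = 1.0
```

Minimizing this quantity over the centroids is the search problem of Eq. (1).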

4. GWO-based metadata construction

Metadata, here, is the representative of a group of chunks, presented as a fingerprint. To construct the metadata, the fingerprints of the chunks are clustered to identify the homogeneous and heterogeneous chunks. Homogeneous chunks share common keywords, whereas heterogeneous chunks are highly different from each other. A Euclidean distance-based similarity measure is used in this paper to cluster the chunks. A wide range of clustering algorithms is available, such as k-means and SOMs [9]. Because these suffer from a lack of diversification and from the curse of dimensionality, this paper exploits GWO to cluster [28] the chunks.

To solve the objective model given in Eq. (1), GWO [27] is adopted here. The GWO algorithm mimics the hunting mechanism and leadership hierarchy of grey wolves. Four classes of grey wolves, $\alpha,\beta,\gamma,\delta$ , are used to model the leadership hierarchy. Searching for, encircling, and attacking the prey are the three main steps of hunting that are used to perform the optimization. Figure 6 shows the steps involved in the GWO method [29].

4.1 Initial centroid

In GWO, each grey wolf represents a candidate for the optimal centroid $Q_{c}$ , as represented in Fig. 7, where each element of a wolf is a bounded random integer, i.e., $Q_{cE}\in[{Q^{\min},Q^{\max}}]$ , where $Q^{\min}=\min({F_{ij}^{D}})$ and $Q^{\max}=\max({F_{ij}^{D}})$ , and $F_{ij}^{D}$ is the decimal version of $F_{ij}$ .
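A possible initialization is sketched below. It assumes that fingerprints are available as bit strings, that each wolf holds one integer per cluster, and that the names `init_wolves`, `n_wolves` and `n_clusters` are hypothetical; the exact layout of Fig. 7 is not reproduced here.

```python
import random


def init_wolves(fingerprint_bits, n_wolves, n_clusters, seed=0):
    """Create wolves whose elements are random integers in [Q_min, Q_max]."""
    rng = random.Random(seed)
    decimals = [int(bits, 2) for bits in fingerprint_bits]   # decimal version F^D
    q_min, q_max = min(decimals), max(decimals)
    wolves = [
        [rng.randint(q_min, q_max) for _ in range(n_clusters)]
        for _ in range(n_wolves)
    ]
    return wolves, q_min, q_max


wolves, q_min, q_max = init_wolves(["0011", "1010", "0110"], n_wolves=5, n_clusters=2)
```

Bounding the elements by the smallest and largest fingerprint keeps every initial centroid inside the range of the data.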

Figure 6. Flowchart to describe the steps involved in GWO.

Figure 7. Centroid representation for GWO optimization.

Here the centroid value is iterated until the best wolf is found by minimizing the Euclidean distance based evaluation function.

4.2 Evaluation

In the evaluation, the initialized centroids are evaluated using Eq. (1). Since the objective is to minimize the error function, the wolves are assigned the ranks $\alpha,\beta,\gamma,\delta$ according to their error. The centroid with the minimum error is $\alpha$ , the centroid with the second minimum error is $\beta$ , the one with the third minimum error is $\gamma$ and the remaining wolves are $\delta$ . The relationship between the ranks of the hierarchy can therefore be represented mathematically as $Q_{E}(\alpha)<Q_{E}(\beta)<Q_{E}(\gamma)<Q_{E}(\delta)$ .
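The ranking step amounts to sorting the wolves by their error. The sketch below takes the error function as a parameter so that any evaluation of Eq. (1) can be plugged in; the name `rank_wolves` and the toy error used in the usage lines are assumptions for illustration.

```python
def rank_wolves(wolves, error_fn):
    """Split wolves into alpha, beta, gamma and the remaining delta wolves."""
    ordered = sorted(wolves, key=error_fn)     # ascending error
    alpha, beta, gamma = ordered[0], ordered[1], ordered[2]
    delta = ordered[3:]                        # all remaining wolves
    return alpha, beta, gamma, delta


# Toy example: one-element wolves, error = distance from 0.
population = [[5], [1], [3], [9], [7]]
alpha, beta, gamma, delta = rank_wolves(population, lambda w: abs(w[0]))
```

At least three wolves are required, since the update step uses the three best positions.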

4.3 Centroid update

After initialization, the fitness values of the wolves are calculated and the best wolves $\alpha,\beta,\gamma$ are selected. The positions of the wolves are then updated accordingly, for up to a maximum number of iterations. The iterations are repeated until the minimum $Q_{E}$ is attained. The distances to $\alpha,\beta,\gamma$ are found using Eqs (7)–(9) and are used in Eqs (4)–(6) to obtain three candidate centroids. By averaging these candidates, the updated centroid is obtained using Eq. (3), where $t$ denotes the current iteration.

$\displaystyle{Q_{c}}({t+1})=\frac{{Q_{c1}}+{Q_{c2}}+{Q_{c3}}}{3}$ (3)

$\displaystyle Q_{c1}=Q_{c\alpha}-A_{1}({{D_{\alpha}}}),$ (4)

$\displaystyle Q_{c2}=Q_{c\beta}-{A_{2}}({{D_{\beta}}}),$ (5)

$\displaystyle Q_{c3}=Q_{c\gamma}-A_{3}({{D_{\gamma}}})$ (6)

$\displaystyle{D_{\alpha}}=|{{C_{1}}\cdot{Q_{c\alpha}}-{Q_{c}}}|$ (7)

$\displaystyle{D_{\beta}}=|{{C_{2}}\cdot{Q_{c\beta}}-{Q_{c}}}|$ (8)

$\displaystyle{D_{\gamma}}=|{{C_{3}}\cdot{Q_{c\gamma}}-{Q_{c}}}|$ (9)

$\displaystyle C=2\cdot r_{2}$ (10)

$\displaystyle A={2a}\cdot{r_{1}}-a$ (11)

In Eqs (10) and (11), $A$ and $C$ are coefficient vectors, whereas $r_{1}$ and $r_{2}$ are random vectors in $[0,1]$ . In the standard GWO formulation [27], the parameter $a$ is decreased linearly from 2 to 0 over the iterations. By varying the values of the $A$ and $C$ vectors, many places surrounding the good agents can be reached from the present position. The random vectors $r_{1}$ and $r_{2}$ permit the wolves to arrive at any point in this neighbourhood. The updated centroids from Eq. (3) are subjected to further updating using Eq. (12), where ${Q_{c_{p}}}$ denotes the vector location of the prey.

$\displaystyle{Q_{c}}({t+1})={Q_{c_{p}}}(t)-A\cdot D$ (12)

$\displaystyle D=|{{C\cdot{Q_{c_{p}}}(t)-{Q_{c}(t)}}}|$ (13)
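The centroid update of Eqs (3)–(11) can be put together into a small clustering loop. The sketch below is not the authors' Java implementation; it assumes scalar fingerprints, nearest-centroid assignment in the error, a linear decrease of `a` from 2 towards 0, clamping of centroids to the data range, and explicit tracking of the best wolf seen so far. All function and parameter names are hypothetical.

```python
import math
import random


def clustering_error(fingerprints, centroids):
    """Q_E with each fingerprint assigned to its nearest centroid (assumption)."""
    return math.sqrt(sum(min((f - q) ** 2 for q in centroids) for f in fingerprints))


def gwo_update(q, alpha, beta, gamma, a, rng):
    """Apply Eqs (3)-(11) once to the wolf q, element by element."""
    new_q = []
    for d in range(len(q)):
        candidates = []
        for leader in (alpha, beta, gamma):
            r1, r2 = rng.random(), rng.random()
            A = 2 * a * r1 - a                      # Eq. (11)
            C = 2 * r2                              # Eq. (10)
            D = abs(C * leader[d] - q[d])           # Eqs (7)-(9)
            candidates.append(leader[d] - A * D)    # Eqs (4)-(6)
        new_q.append(sum(candidates) / 3.0)         # Eq. (3)
    return new_q


def gwo_cluster(fingerprints, n_clusters, n_wolves=10, max_iter=50, seed=0):
    """Return (best centroids, best error) found by the sketch; needs n_wolves >= 3."""
    rng = random.Random(seed)
    lo, hi = min(fingerprints), max(fingerprints)
    wolves = [[rng.uniform(lo, hi) for _ in range(n_clusters)] for _ in range(n_wolves)]
    best, best_err = None, math.inf

    def error(w):
        return clustering_error(fingerprints, w)

    for t in range(max_iter + 1):
        ranked = sorted(wolves, key=error)
        if error(ranked[0]) < best_err:             # remember the best wolf so far
            best, best_err = list(ranked[0]), error(ranked[0])
        if t == max_iter:
            break
        a = 2.0 * (1.0 - t / max_iter)              # assumed linear decrease of a
        alpha, beta, gamma = ranked[0], ranked[1], ranked[2]
        wolves = [
            [min(hi, max(lo, x)) for x in gwo_update(w, alpha, beta, gamma, a, rng)]
            for w in wolves
        ]
    return best, best_err


data = [1, 2, 3, 50, 51, 52]
centroids, err = gwo_cluster(data, n_clusters=2, n_wolves=8, max_iter=30, seed=1)
```

When `a` reaches 0 the coefficient `A` vanishes and the update reduces to the plain average of the three leading wolves, which is how the search contracts towards the end of the run.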

4.4 Fetching metadata

After a number of iterations, the $\alpha,\beta,\gamma$ wolves locate the position of the prey. Each solution is updated in terms of distance, and the position that attains the minimum $Q_{E}$ is taken as the best wolf. Hence, by using the GWO algorithm, the minimum $Q_{E}$ can be obtained, and $\alpha$ is the best centroid with minimum error. The centroid is stored as metadata in the metadata server and the clustering is then carried out. The data clustered using the metadata are indexed; the indices are stored in the index library, whereas the indexed data is stored in the storage disk.

Table 1
Comparison of the proposed clustering method with existing methods

Percentage of duplication	k-mode	k-means	ED-PSO	ED-GA	ED-GWO
10	0.352	0.362	0.402	0.425	0.598
20	0.456	0.485	0.489	0.498	0.685
30	0.624	0.634	0.658	0.666	0.695
40	0.657	0.663	0.674	0.684	0.725
50	0.698	0.721	0.722	0.754	0.845

Figure 8. True Positive Rate (TPR) versus % duplication for the proposed and conventional de-duplication methods.

Figure 9. True Negative Rate (TNR) and % duplication between the existing methods and the proposed clustering method.

5. Simulation results

5.1 Procedure

The proposed intelligent de-duplication methodology is developed in Java and compared with traditional methods such as particle swarm optimization (PSO) and genetic algorithm (GA), which are used for clustering. Moreover, the traditional k-means [30] and k-mode [31] clustering algorithms are also used for the comparative study. Henceforth, the clustering methods are referred to as Euclidean distance based GWO (ED-GWO), Euclidean distance based PSO (ED-PSO) and Euclidean distance based GA (ED-GA). The benchmark database, the 20 Newsgroups data set, which can be accessed at http://qwone.com/~jason/20Newsgroups/, is synthesized for duplication by 10%, 20%, 30%, 40% and 50%. The comparisons on identifying and reconstructing from duplication are studied using accuracy, True Positive Rate (TPR) and True Negative Rate (TNR) against the different percentages of duplication, and they are summarized in Table 1.
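For reference, the three measures can be computed from confusion counts using their standard definitions. The sketch assumes that a duplicate chunk is the positive class, which the text does not state explicitly, and the counts in the usage line are made-up illustrative values, not results from the experiments.

```python
def rates(tp, tn, fp, fn):
    """Return (accuracy, TPR, TNR) from confusion-matrix counts."""
    tpr = tp / (tp + fn)                     # duplicates correctly detected
    tnr = tn / (tn + fp)                     # unique chunks correctly kept
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return accuracy, tpr, tnr


# Illustrative counts only: 10 duplicate and 10 unique chunks.
accuracy, tpr, tnr = rates(tp=8, tn=7, fp=3, fn=2)
```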

5.2 De-duplication

The TPR of all the methods increases with the percentage of duplication, as shown in Fig. 8. The proposed method achieves a TPR that is 4.8% better than ED-GA, 0.8% better than ED-PSO, 1.3% better than k-mode and 0.2% better than k-means. Thus the proposed GWO based clustering method provides a better TPR than the other existing methods.

The TNR of all the methods increases with the percentage of duplication, as shown in Fig. 9. The proposed method achieves a TNR that is 1.8% better than ED-GA, 1.7% better than ED-PSO, 0.5% better than k-means and 3.2% better than k-mode. Thus the proposed GWO based clustering method provides a better TNR than the other existing methods.

Table 2
Reduction of computation time in the proposed clustering method when compared with existing methods

Sl.no	Methods	Computational time
1	k-mode	11.154
2	k-means	10.024
3	ED-PSO	12.324
4	ED-GA	13.457
5	ED-GWO	11.312

Figure 10. Restoration accuracy and % duplication among the existing methods and the proposed clustering method.

5.3 Data restoration

The restoration accuracy increases with the percentage of duplication for all the methods, as shown in Fig. 10. The proposed system achieves an accuracy that is 1.3% better than k-mode, 0.8% better than ED-PSO and 4.8% better than ED-GA. In one case the accuracy of k-means is slightly greater than that of ED-GWO, by 0.1%, but in the other cases the proposed method is more accurate than k-means by about 1.6%, 1.4% and 0.5%. Thus the proposed GWO based clustering method achieves better accuracy than the other existing methods overall.

5.3.1 De-duplication vs restoration

The restoration accuracy increases with the compression ratio for all techniques, as shown in Fig. 11. The proposed clustering method achieves an accuracy that is 0.8% better than ED-PSO, 4.7% better than ED-GA and 1.6% better than k-mode. In one case the accuracy of k-means is slightly greater than that of ED-GWO, by 0.3%, but in the other cases the proposed method is more accurate than k-means by 1.3%, 1.4% and 0.6%. Thus the proposed GWO based clustering method provides better accuracy than the other existing methods overall.

Figure 11. Restoration accuracy and compression ratio between the existing methods and the proposed clustering method.

5.3.2 Computing overhead

The computational times of the proposed and existing algorithms are shown in Table 2. The computational time of ED-GWO is lower than that of ED-PSO and ED-GA, and slightly higher than that of k-mode and k-means, by 1% and 9% respectively. As ED-GWO achieves a TNR that is 3.2% higher than that of k-mode and 0.5% higher than that of k-means, the additional computational time of the proposed method can be considered negligible. From the obtained results, it can be concluded that the proposed ED-GWO system achieves better accuracy, TPR and TNR for de-duplication than the other existing methods, and a lower computational time than the other meta-heuristic methods, ED-PSO and ED-GA.

6. Conclusion

With the rapid growth of data, a single-node de-duplication system cannot assure the required performance in data storage and data protection. Hence, cluster-based de-duplication systems have emerged. However, storage node information islands prevent a cluster de-duplication system from improving de-duplication throughput and scalability. To solve this problem, we have introduced a novel GWO based clustering method for de-duplication in this paper. It narrows the performance gaps of conventional methods such as ED-PSO, ED-GA, k-means and k-mode. The proposed ED-GWO based clustering method was compared with these existing methods. From the results it was concluded that the proposed de-duplication system improves on the existing methods in TPR by 1.3%, in TNR by 3.2%, in accuracy by 4.7% based on compression ratio and in accuracy by 4.8% based on percentage of duplication, which enhances de-duplication and storage optimization. In future, a second order mutual information based grey wolf optimization can be adopted as a novel clustering method to obtain de-duplication more effectively.

Authors’ Bios

Jyoti Malhotra did M.E. in Computer Science and Engineering from the SPPU University, Pune in 2005. She is working as an Assistant Professor in Engineering College, Pune. She has 15+ years of teaching experience. She is pursuing her Ph.D. Degree in Computer Science and Engineering; RTM Nagpur University under the guidance of Dr. J. W. Bakal. Her research interest lies in Data Storage patterns, Big Data, Software Testing and Theory of Computation. She has worked in C, Java, PHP, and Linux Programming. She is the life member of Computer Society of India. She has publications in National, International conferences and Journals like IEEE conference, Springer conferences, etc.

Jagdish W. Bakal received M. Tech. (EDT), from Dr. Babasaheb Ambedkar Marathwada University, Aurangabad. Later, He completed his Ph.D. in the field of Computer Engineering from Bharati Vidyapeeth University, Pune. He is presently working as Principal at the S.S. Jondhale College of Engineering, Dombivali (East) Thane, India. In the University of Mumbai, he was on honorary assignment as a chairman, board of studies in Information Technology and Computer Engineering. He is also associated as chairman or member of Govt. committees, University faculty interview committees, for interviews, LIC or various approval works of institutes. He has more than 27 years of academics experience including HOD, Director in earlier Engineering Colleges in India. His research interests are Telecomm Networking, Mobile Computing, Information Security, Sensor Networks and Soft Computing. He has publications in journals, conference proceedings in his credit. He is a Professional member of IEEE. He is also a life member of professional societies such as IETE, ISTE INDIA, CSI INDIA. He has prominently contributed in the governing council of IETE, New Delhi India.

References

Xing

Y.-X.

Xiao

Liu

Sun

and He

W.-H.

, AR-dedupe: An efficient de-duplication approach for cluster de-duplication system, Journal of Shanghai Jiaotong University (Science) 20(1) (2015), 76–81.

Zhang

et al., A fast asymmetric extremum content defined chunking algorithm for data de-duplication in backup storage systems, IEEE Transactions on Computers 66(2) (2017), 199–211.

Wang

Zhao

Zhang

and Guo

, I-sieve: An inline high performance de-duplication system used in cloud storage, in Tsinghua Science and Technology 20(1) (2015), 17–27.

Zhang

Huang

Wang

and Zhou

, Resemblance and mergence based indexing for high performance data de-duplication, Journal of Systems and Software 128 (2017), 11–24.

Huang

Zhou

Shi

Wang

and Dan

, Public auditing for encrypted data with client-side de-duplication in cloud storage, Wuhan University Journal of Natural Sciences 20(4) (2015), 291–298.

Kwon

Hahn

Kim

and Hur

, Secure de-duplication for multimedia data with user revocation in cloud storage, Multimedia Tools and Applications 76(4) (2017), 5889–5903.

and Huang

, A secure cloud storage system supporting privacy-preserving fuzzy de-duplication, Soft Computing 20(4) (2016), 1437–1448.

Quinlan

and Dorward

, Venti: A new approach to archival storage, Proceedings of the FAST ’02 Conference on File and Storage Technologies, January 28-30, 2002, Doubletree Hotel, Monterey, California, USA. USENIX FAST.

Tolic

and Brodnik

, De-duplication in unstructured-data storage systems, Elektrotehniski Vestnik 82(5) (2015), 233–242.

10. Bellare, Keelveedhi and Ristenpart, Message-locked encryption and secure de-duplication, Advances in Cryptology – EUROCRYPT 7881 (2013), 296–312.

11. Xia et al., A comprehensive study of the past, present, and future of data de-duplication, Proceedings of the IEEE 104(9) (2016), 1681–1710.

12. Eshghi and Tang H.K., A framework for analyzing and improving content-based chunking algorithms, Hewlett-Packard Labs Technical Report TR 30 (2005).

13. Xia, Jiang, Feng and Tian, Combining de-duplication and delta compression to achieve low-overhead data reduction on backup datasets, 2014 Data Compression Conference, Snowbird, UT (2014), 203–212.

14. Xia, Jiang, Feng, Tian and Wang, P-De-dupe: Exploiting parallelism in data de-duplication system, 2012 IEEE Seventh International Conference on Networking, Architecture, and Storage, Xiamen, Fujian (2012), 338–347.

15. Bjørner, Blass and Gurevich, Content-dependent chunking for differential compression, the local maximum approach, Journal of Computer and System Sciences 76(3) (2010), 154–203.

16. Anand, Muthukrishnan, Akella and Ramjee, Redundancy in network traffic: Findings and implications, in Proc ACM SIGMETRICS, Seattle, WA, USA (2009).

17. Jin and Du D.H.C., Frequency based chunking for data de-duplication, 2010 IEEE International Symposium on Modelling, Analysis and Simulation of Computer and Telecommunication Systems, Miami Beach, FL (2010), 287–296.

18. Zhou and Wen, Hysteresis re-chunking based metadata harnessing de-duplication of disk images, 2013 42nd International Conference on Parallel Processing, Lyon (2013), 389–398.

19. Bobbarjung D.R., Jagannathan and Dubnicki, Improving duplicate elimination in storage systems, ACM Transactions on Storage (TOS) 2(4) (2006), 424–448.

20. Bonwick, ZFS de-duplication, https://blogs.oracle.com/bonwick/entry/zfsdedup (2009).

21. Liu, Xue and Wang, A novel optimization method to improve de-duplication storage system performance, 2009 15th International Conference on Parallel and Distributed Systems, Shenzhen (2009), 228–235.

22. Nam Y.J. and Du D.H.C., BloomStore: Bloom-Filter based memory-efficient key-value store for indexing of data de-duplication on flash, 2012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST), San Diego, CA (2012), 1–11.

23. Liu, Shi D.H.C. and Wang D.S., ADMAD: Application-driven metadata aware de-duplication archival storage system, 2008 Fifth IEEE International Workshop on Storage Network Architecture and Parallel I/Os, Baltimore, MD (2008), 29–35.

24. Jiang, Xiao, Tian, Liu and Xu, Application-aware local-global source de-duplication for cloud backup services of personal storage, IEEE Transactions on Parallel and Distributed Systems 25(5) (2014), 1155–1165.

25. Y.K., Chen, Lee P.P.C. and Lou, A hybrid cloud approach for secure authorized de-duplication, IEEE Transactions on Parallel and Distributed Systems 26(5) (2015), 1206–1216.

26. Kaczmarczyk and Dubnicki, Reducing fragmentation impact with forward knowledge in backup systems with de-duplication, Proceedings of the 8th International Systems and Storage Conference (SYSTOR’15), IEEE, Haifa, Israel (2015), 201–208.

27. Mirjalili S.M. and Lewis, Grey Wolf optimizer, Advances in Engineering Software 69 (2014), 46–61.

28. Kumar J.P. and Govindharajulu, Near-duplicate web page detection: An efficient approach using clustering, sentence feature and fingerprinting, International Journal of Computational Intelligence Systems 6(1) (2013), 1–13.

29. Kudova, Clustering genetic algorithm, 18th International Workshop on Database and Expert Systems Applications, IEEE Conference Publications, Regensburg, Germany (2007), 138–142.

30. Gan and Kwok-Po Ng, k-means clustering with outlier removal, Pattern Recognition Letters 90 (2017), 8–14.

31. Jiang, Liu and Sui, Initialization of k-modes clustering using outlier detection techniques, Information Sciences 332 (2016), 167–183.

32. Manku G.S., Jain and Sarma A.D., Detecting near duplicates for web crawling, Proceedings of the 16th International Conference on World Wide Web, Banff, Alberta, Canada (2007), 141–150.

33. Huffman coding and Huffman tree, www.csc.lsu.edu/∼kundu/dstr/4-huffman.pdf.

34. Park, Fan, Nam Y.J. and Du D.H.C., A lookahead read cache: Improving read performance for deduplication backup storage, Journal of Computer Science and Technology 32(1) (2017), 26–40.

35. Fan, Yang M.C., Zhang and Du D.H.C., Performance evaluation of host aware shingled magnetic recording (HA-SMR) drives, IEEE Transactions on Computers 66(11) (2017), 1932–1945.

36. Fan, Park, Diehl, Voigt and Du D.H., Hibachi: A cooperative hybrid cache with NVRAM and DRAM for storage arrays, in Proc of IEEE Conference on Mass Storage Systems and Technologies (MSST) (2017).

37. Kumar B.S.S., Manjunath A.S. and Christopher, Improved entropy encoding for high efficient video coding standard, Alexandria Engineering Journal 57(1) (2018), 1–9.

38. Kota P.N. and Gaikwad A.N., Optimized scrambling sequence to reduce PAPR in space frequency block codes based MIMO-OFDM system, Journal of Advanced Research in Dynamical and Control System (2017), 502–525.

39. Bhatnagar and Gupta S.C., Extending the neural model to study the impact of effective area of optical fiber on laser intensity, International Journal of Intelligent Engineering and Systems 10(4) (2017), 274–283.

40. Balaji G.N., Subashini T.S. and Chidambaram, Detection of heart muscle damage from automated analysis of echocardiogram video, IETE Journal of Research 61(3), 236–243.

41. Bramhe S.S., Dalal, Tajne and Marotkar, Glass shaped antenna with defected ground structure for cognitive radio application, International Conference on Computing Communication Control and Automation, Pune (2015), 330–333.

42. Yarrapragada K.S.S.R. and Krishna B.B., Impact of tamanu oil-diesel blend on combustion, performance and emissions of diesel engine and its prediction methodology, Journal of the Brazilian Society of Mechanical Sciences and Engineering, 1–15.

43. Sreedharan N.P.N., Ganesan, Raveendran, Sarala and Dennis, Grey Wolf optimisation-based feature selection and classification for facial emotion recognition, IET Biometrics (2018).

44. Sarkar and Murugan T.S., Cluster head selection for energy efficient and delay-less routing in wireless sensor network, Wireless Networks (2017), 1–18.

45. Wagh A.M. and Todmal S.R., Eyelids, eyelashes detection algorithm and hough transform method for noise removal in iris recognition, International Journal of Computer Applications 112(3) (2015).

46. Iyapparaja and Tiwari, Security policy speculation of user uploaded images on content sharing sites, IOP Conference Series: Materials Science and Engineering 263(4) (2017), 042019.

