Abstract
Near duplicate data not only increase the cost of information processing in big data, but also increase decision time. Therefore, detecting and eliminating nearly identical information is vital to enhance overall business decisions. To identify near-duplicates in large-scale text data, the shingling algorithm has been widely used. This algorithm is based on occurrences of contiguous subsequences of tokens in two or more sets of information, such as in documents. In other words, if there is a slight variation among documents, the overall performance of the algorithm decreases. Therefore, to increase the efficiency and accuracy performances of the shingling algorithm, we propose a hybrid approach that embeds Jaro distance and statistical results of word usage frequency for fixing the ill-defined data. In a real text dataset, the proposed hybrid approach improved the shingling algorithm’s accuracy performance by 27% on average and achieved above 90% common shingles.
1. Introduction
Nowadays, duplicate documents are a common problem in a big database. Advanced duplicate detection techniques are required not only to process a query, but also to filter the redundant (duplicate data) information from a large-scale document database, improving search quality. To address this issue, duplicate document detection techniques have been used to prevent the search results from including the multiple documents having the same or nearly the same content.
Search quality is negatively affected by redundant copies. When two documents contain identical content, they are regarded as duplicates and can be excluded during indexing. However, there might be some documents with small dissimilarities that are not declared as being ‘exact duplicates’ but are identical to such an extent that they can be declared near-duplicates [1]. Detecting near-duplicates is of the utmost important to improve the search quality.
The following are examples of near-duplicate samples seen in documents [2]:
documents containing a few different words – widespread form of near-duplicates;
documents with the same content but different formatting – for instance, the documents may contain the same text, but dissimilar fonts, bold type or italics;
documents with the same content but with typographical errors (mistyped words);
documents with the same content but different file type – for instance, Microsoft Word and PDF;
plagiarized documents and documents with different versions, and documents providing identical information written by the same author being published in more than one domain.
Besides the discussed above, we also encounter other cases where people repeat multiple letters to emphasize emotion and intensity or replace certain words with numbers when using blogs, Twitter or instant messaging. Although rules can be created for the common practices seen on those platforms, it is a challenge to identify all possible scenarios and tackle them with a rule-based approach. Another problem that is a challenge to address is the atomic typos, in which the words are accidently substituted by other valid words. For instance, in the case of ‘form’ and ‘from’, since both of them are valid words, it is difficult to argue that the word is miswritten without understanding the context of the sentence. This type of error usually occurs when the data are obtained from phone conversations or the result of using an automatic spelling correction tool.
Although all the discussed examples are part of the near-duplicate problem, in this work we will particularly address the problems raised with typographical errors. Typographical errors are spelling errors that can be captured simply because they are mistyped or misspelled [3]. As the name suggests, these are invalid strings, properly identified and isolated as incorrect representations of a valid word [4]. Fat fingering plays an important role in this type of misspelling. These errors are made assuming that the writer or typist knows how to spell the word, but may have typed the word hastily, resulting in an error [5]. No matter how the word is misspelled, this results in a decrease in the quality of the document and yields near-identical documents. Therefore, in this paper, we introduce a hybrid approach to improve a widely known near-duplicate detection algorithm, namely the shingling algorithm. In particular, we combine pattern matching technique and statistical results on the frequency of usage of words to improve the performance of the shingling algorithm.
In Section 2, we discuss related work on near-duplicate detection. In Section 3, the hybrid approach will be elaborated. In Sections 4 and 5, the test cases and experimental results will be discussed. Finally, the paper will conclude with the conclusion and discussion section.
2. Literature review
Several techniques have been developed to identify near-duplicate documents [6–10], web page duplicates [11–16], duplicate database records [17, 18] and bibliographic metadata [19]. Brin et al. proposed the COPS (Copy Protection System) to protect important and intellectual property of original digital documents via registration of those documents on the system [6]. As a part of the Stanford Digital Library project, Shiva and Garcia-Molina developed SCAM (Stand Copy Analysis Mechanism) to identify similar documents in an online library [7]. Theobald et al. created a new algorithm for extracting and matching signatures for near-duplicate detection in large web crawls. The authors combined the stop-word antecedents with short chains of adjacent content terms to naturally filter out noisy components of the web pages [16].
Lyon et al. investigated the theoretical background to automate plagiarism detection [8]. They observe that independently written texts have a comparatively low level of matching tri-grams. The Ferret plagiarism system counts the matching tri-grams of a pair of documents [8, 9]. In Xiao et al. [10], the authors offered a new filtering technique by exploiting the ordering information. With this method, they drastically reduced the candidate sizes and hence improved the efficiency of a search.
Shiva and Garcia-Molina [11] proposed two approaches to compute near-duplicates between all web documents simultaneously. Both approaches assume that two documents di and dj can be near-duplicates only when documents di and dj share more than ‘m’ fingerprints, where ‘m’ is a predefined threshold [11]. Das et al. proposed a TDW matrix-based algorithm with three phases: rendering, filtering and verification. In detail, receiving an input web page and a threshold in its first phase, prefix filtering and positional filtering to reduce the size of the record set in the second phase, and returning an optimal set of near-duplicate web pages in the verification (third) phase using the Minimum Weight Overlapping method [15] are the three phases of the algorithm.
I-Match is a technique in which the algorithm maps each individual document into a single hash value using the SHA1 hash algorithm [20]. Two documents are considered near-duplicates if their hash values are exactly the same [20]. One issue with this technique occurs when the retained word bag is small. Since the documents are effectively represented using only a small number of terms, different documents could easily be mistakenly categorized as near-duplicates. Hajishirzi et al. created a framework that represents each document as a real-valued sparse k-gram vector, in which the weights are learned to optimize for a specified similarity function, such as the cosine similarity or the Jaccard coefficient. This method has been tested and shown to be more accurate than the I-Match technique [21].
Broder [12] proposed the shingling algorithm, which keeps a sketch of shingles of each document to compute the resemblance of two documents. Any documents with at least one common shingle are examined and checked to see if they exceed the threshold for resemblance. Broder’s shingling method is implemented in the AltaVista search engine for duplicate document detection [12].
Charikar [22] showed that rounding algorithms for LPs and SDPs used in the context of approximation algorithms can be viewed as locality-sensitive hashing schemes. Simhash is a fingerprint technique that holds the property that the two documents are near-duplicate only if the fingerprints of the documents are identical, since near-duplicates differ only in a small number of bit positions [13, 22]. Since the errors we are focusing on are mistyped letters and the order of the words will not change in this type of problem, this algorithm does not suit well our case because it ignores the order of the tokens. SimFinder is an ad hoc term weighting scheme to measure each term’s discriminative ability [23]. The framework generates multiple fingerprints for each text, and only texts with at least one fingerprint in common are compared with each other to reduce the number of comparisons needed to be conducted [23].
As reflected above and discussed by Kumar and Govindarajulu, most of the existing techniques are looking for the resemblance of the documents at the exact similarity level or use some form of an approximation technique to detect similar documents [24]. If the first approach is used alone, then the misspelled or slightly incorrect representation will decrease the accuracy rating. If the latter approach is used in the whole documents, the computation overhead will be significant. Although the accuracy rating can be high, increased false-negative and false-positive rates may result. Therefore, it is vital to combine these two approaches in a way that the final product not only improves the accuracy rating, but also keeps the computational overhead and false-positive and false-negative rates to acceptable levels.
3. Methodology
In this work we propose first using Jaro distance and word usage frequency as a tiebreaker to correct the misspelled information, and then applying the shingling algorithm to detect near-duplicates. The shingling algorithm is a well-known technique used for detecting near-duplicates. Jaro distance is a spelling checking algorithm that returns the closest match in the dictionary for the misspelled word. A statistical result about the word usage is a method for finding the most frequently used words in a particular language. All of these methods are explained briefly in the following subsections. Before discussing the details of these techniques, first we will discuss how to pre-process unmatched data.
3.1. Pre-processing of unmatched data
As stated earlier, fat fingering or multiple repetitions of a letter are widely seen in unprocessed texts, especially in micro-blogs, short text messages, non-formal emails, etc. Therefore, these can result in lower accuracy rates when two near-identical documents are correlated. Fortunately, these types of problems can be easily and efficiently processed so that the ill-defined data can be modified. Either this modified data can be directly matched with a possible candidate or the distance for a valid word can be brought close to zero to make the decision process easier and accurate. However, in order to make sure the meaning of the word is not changed, one cannot remove all repetitive characters to a single letter. For instance, if the misspelled phrase ‘Gooood Woorkkk’ is reduced to ‘God Work’, then the meaning of the phrase will change. Since only two following characters can be exactly same in the English language, it should be reduced to ‘Good Woorkk’ which will eventually have closer distance to the original information.
3.2. Jaro distance
The Jaro algorithm/distance is a measure of closeness of two strings. Specifically, the Jaro measure is the weighted sum of the percentage of matched characters and number of transformation needing to be conducted on the ill-defined data so that it can be a complete match with a candidate word [25]. Jaro distance has been widely used as a spelling checking algorithm to determine the equivalent word if any word is misspelled. The higher the Jaro distance of two words, the more similar the words are.
The Jaro distance (dj ) is a normalized value between 0 and 1 by given two strings S 1 and S 2. The dj is calculated as follows:
where |Si| is the cardinality of characters in word Si ; m is the number of matching characters; and t is the number of transpositions. It reflects a value based on number of transpositions needed between two adjacent characters to fix the problem with the misspelled word.
For instance assume S1 = ‘civl’, which is a mistyped word. The Jaro score for this mistyped word is calculated based on comparing it with all words (Sn ) in the dictionary. The closest valid word would be ‘civil’ with a Jaro score of 0.93, since |S 1 | = 4, |S 2 | = 5, m = 4, and t = 0. The algorithm suggests that the mistyped word is replaced with this closest valid one.
3.3. The shingling algorithm
The shingling algorithm is a well-known technique introduced by Broder to estimate the degree of similarity among pairs of documents [12]. The algorithm does not depend on any linguistic knowledge other than the ability to tokenize documents into a set of words based on the shingle size [12]. In the shingling, the string is divided into words and all word sequences of adjacent words are extracted. If two documents contain the same set of shingles they are considered identical and if their sets of shingles appreciably overlap, they are accepted as exceedingly identical [12].
In detail, the shingling divides the text document into a set of substrings of length k (called k-shingles). The shingling algorithm is implemented using the sliding window method with the window size of k (shingle size of k) over the input document character by character or word by word, and placing the substrings into a set. For example, the string ‘abcabcac’ can be represented by {‘abc’, ‘bca’, ‘cab’, ‘abc’, ‘bca’, ‘cac’}, if the sliding window is chosen as three shingles. Since the algorithm eliminates the duplicate ones, only four shingles {‘abc’, ‘bca’, ‘cab’, ‘cac’} will be in the final set. Another example for the shingling algorithm is given below:
Given a string S = {a rose is a rose is a rose}
Using the shingling algorithm (with window size set at four words) the string can be tokenized as {(a rose is a), (rose is a rose), (is a rose is), (rose is a rose), (a rose is a), (rose is a rose)}.
Removing the duplicates will yield to the following shingles: {(a,rose,is,a), (rose,is,a,rose), (is,a,rose,is)}. Finally, Jaccard similarity is used for expressing similarity of sets. For sets A and B, it is defined as:
The shingling algorithm result ranges from 0 (no elements in common) to 1 (identical). It is evident that duplicate documents have higher similarity values for the original than those to unrelated documents.
3.4. Hybrid approach
The hybrid approach is a combination of the Jaro distance, statistical results on the frequency of usage of English words, and the shingling algorithms, as shown in Figure 1. Specifically, two relatively similar input files are pre-processed and inserted into the system to check whether they are near-identical documents because of mistyped information. If there are any inconsistencies (mistyped words) between them, then the hybrid approach utilizes the Jaro distance for finding the closest match for the mistyped data with the help of dictionary of English words. In order to eliminate irrelevant results from the suggestion list a minimum threshold of 0.8 is set as Jaro distance. When there is only one candidate that holds the highest score, that candidate is selected as the best possible suggestion and the mistyped word in the text is replaced with this particular candidate. However, if there is more than one candidate that holds the same Jaro score with the mistyped word, then the hybrid approach looks for the frequency of usage of the candidate words from the Corpus of Contemporary American English [26]. Based on the data, the more frequently seen candidate is selected as the best possible replacement for the mistyped word and is replaced as the ill-defined one in the document. After the mistyped words are modified according to the hybrid approach, the shingling algorithm is applied to see the degree of the closeness of the text files. The shingles in this work are selected word by word instead of character by character to greatly lessen the computational overhead. The closeness is calculated by the number of common shingles divided by the total number of shingles in both files. Applying a pattern-matching technique with statistical results, we improved the overall performance of the shingling algorithm as discussed in the next sections. The details of the algorithm are provided in the pseudo code in Figure 2.

Hybrid approach.

Algorithm.
4. Test case and results
The datasets used for testing the algorithm were public text files that were obtained from the internet. First, as a proof of concept, we executed the hybrid approach on a mistyped version of the Abraham Lincoln’s Gettysburg address [27]. As a first step, the mistyped words were corrected using the Jaro distance. Then, the word usage frequency played a role in a few cases to act as a tie breaker when there was more than one candidate for a misspelled word. Finally, the shingling algorithm was applied to the updated data set to evaluate the algorithms’ performance.
As shown in Table 1, before applying the hybrid approach and given a shingle size of 3, the number of common shingles between the two input files was 254 and the number of un-matched shingles was 26, after duplicate shingles were removed. Having applied the hybrid approach to the shingling algorithm, the number of common shingles increased to 263 and the number of un-matched shingles dropped to 8 with the shingle size of 3, once again after duplicate shingles were removed. The difference between the total number of shingles produced before and after using the hybrid approach is due to the removal of duplicates from the final set of shingles.
First test data results.
As a second task, we executed the same test on the gold set of near-duplicate news articles collected by Theobald et al. [16]. The dataset contains 2160 news articles crawled in 2006. These articles were manually clustered into 68 directories, where documents in the same directory have the same content news and are considered near-duplicate. Overall, out of the 2160 articles, 346 of them qualified as near-duplicate documents with misspellings and those were executed through the discussed hybrid approach. As shown in Figure 3, the average percentage of common shingles to all shingles was improved by an average of 27% with different shingle sizes of 3, 4 and 5. Although the average common shingle sizes were increased across the board, the highest percentage of common shingles was always seen when the shingle size was selected as 3.

Percentages of common shingles with different shingle sizes.
The hybrid approach is implemented as a compact Java prototype in only about 500 lines of code. All experiments were performed on a Xeon E5-1607 @ 3.00 GHZ CPU with 16 GB RAM desktop computer. Figure 4 represents the total processing time in seconds of the hybrid approach for different thresholds used in Jaro distance. As shown in Figure 5, although the overall precision ratio is highest when the threshold is selected as 0.9 for Jaro distance, the percentage of ill-defined records tested with this threshold is at the lowest as expected because of high threshold value. While the number of problematic data addressed is increased by lowering the threshold, it also negatively affects the accuracy of the algorithm. Therefore, it can be argued that the optimal threshold value should be 0.8 based on the precision rates, number of problematic data addressed and processing times.

Processing times for different Jaro distance threshold values.

Precision rates and percentage of ill-defined data addressed for different Jaro distance threshold values.
The answer of which shingle size is better suited for detecting the near-duplicates is vital while applying the shingling algorithm. Based on the tests we conducted, our observations are reflected in Table 2. Overall, the shingle size of 3 provides the highest degree of closeness between two near-identical documents and less overhead (total number of shingles) in all test data we experimented with. However, as expected, if all the misspelled words are corrected, then the accuracy percentage is the same among different shingle sizes and the computation overheads are very close to each other. Other shingle sizes, such as >5 or <3 are omitted, since those window sizes yield lower shingle scores or require extensive computation.
Selection of shingle size.
It is also important to note that correctly fixing one ill-defined data with the hybrid approach in most cases yields a substantial decrease in the number of non-matching shingles. The main reason for this is that most of the fixing attempts can provide multiple matching shingles based on the selected window size. Specifically if a mistyped word is fixed when the window size is 4, then this may result in an increase of four common shingles between the two documents, unless the mistyped word is not within the first or last three words of the document.
5. Pattern matching algorithms
The string matching algorithms can be divided into two, namely pattern matching and phonetic matching. Since each language has its own structure, applying a phonetic technique to a generic text data most likely provides low accuracy rates when fixing the ill-defined information. Therefore, in this work, we omitted the phonetic matching algorithms.
To validate that the Jaro distance performs best as the pattern-matching algorithm of this newly proposed hybrid system, we compared its efficiency with other well-known pattern-matching algorithms, such as Levenshtein Edit Distance, n-gram technique, Hamming Distance and Ratcliff/Obershelp. The Levenshtein Edit Distance is based on the minimal number of elementary edit operations needed to transform one string into another. This algorithm accomplishes the task by single character insertion, deletion, or substitution [3]. The character n-gram-based technique coincides with the character n-gram analysis in non-word detection. However, instead of observing certain bi-grams and tri-grams of letters that never or rarely occur, this technique can calculate the likelihood of one character following another and use this information to find possible correct word candidates [28]. For instance, the bi-gram of ‘kitchen’ will be represented as {{ki}, {it}, {tc}, {ch}, {he}, {en}}, while the tri-gram representation would be {{kit}, {itc}, {tch}, {che}, {hen}}. For two strings of the same length, the Hamming distance is the number of positions in which the two strings have different characters [29]. The Ratcliff/Obershelp algorithm computes the similarity of two strings as the doubled number of matching characters divided by the total number of characters in the two strings. Matching characters are those in the longest common subsequence plus, recursively, matching characters in the unmatched region on either side of the longest common subsequence [30].
In order to reduce the overhead, information has to be better organized and should produce relevant results. Two major metrics commonly associated with information retrieval systems are precision and recall [31]. Precision can be defined as the number of relevant documents retrieved by a search divided by the total number of documents retrieved by that search. In a more formal way, the precision can be defined as follows:
Precision measures one aspect of information retrieval overhead for a user performing a particular search. If a search has 90% precision, then 10% of the user’s effort concerns overhead reviewing non-relevant items. Note that the definition of relevant documents is broadened so that the suggestion algorithms provide the correct result as one of the possible candidates for the misspelled word.
Recall is different from precision. Recall can be defined as the number of relevant documents retrieved by a search divided by the total number of existing relevant documents, or in other words recall is the percentage of spelling correction rate. Recall gauges how well a system processing a particular query is able to retrieve the relevant items that the user is interested in seeing.
A measure that combines precision and recall is the harmonic mean of precision and recall, called the F-measure or balanced F-score. The F-score is calculated by:
The five discussed algorithms are tested on the second dataset. As shown in Table 3, on average, Jaro distance achieved the highest F-score, precision and recall rates. In other words, Jaro distance provided highest correction rate with less overhead (least number of irrelevant suggestions in the pool). This is the main rationale why Jaro distance is selected as part of the hybrid approach.
Comparison of string matching algorithms.
6. Conclusion and future work
In addition to the negative effects of the redundant data, there are some positive aspects of this duplication. For instance, redundant data can be useful and may even be necessary for satisfying service-level performance. Also, duplicate data may provide advantages when it comes to accessibility and availability. Furthermore, a variety of different representations of the same data in a data warehouse can help the acquisition of new, unknown information. Therefore, it is worth knowing that the data quality can be improved if the duplicate information is properly merged. Most of the current de-duplication techniques are based on merging completely matching inputs into one dataset. However, the biggest challenge in the identification of duplicates lies in detecting non-exact matching duplicate records – data that appear to be multiple sets of unique information, but are actually duplicate records. These ‘non-exact matching’ duplicates are very difficult to identify. To address this challenge a number of studies have concentrated on nearly identical document matching as discussed in the review of related research. Because of the computation and accuracy shortcomings, we created a system that conducts spelling matching on the misspelled text and then rank the suggestions by frequency of usage. The framework automatically selects the best available candidate for the ill-defined text and then is correlated with the information in the other document to see if near-identical documents are indeed the same ones but with one or more containing an error-prone information. Finally, the shingling algorithm is applied to reflect the closeness of the two documents.
By removing the manual labour on the selection of the candidate word for the misspelled information from the system, one can argue that the accuracy rating of the near-identical matching will be lower because of the missing human sentimental adjustment. However, by eliminating the data administrator’s manual involvement, the following advantages can be achieved:
a baseline or benchmark is set in all cases as needed for external audits or other reference;
performance is improved since post-processing can assume that the input field is formatted the same every time;
multiple datasets can be run through one standardized process;
costs are lowered by limiting erroneous data generated by the data administrator used in marketing;
the amount of labour-intensive work is decreased;
the processing time is improved;
the processes can be understood throughout the case life-cycle;
the technical procedures are easily documented;
integrity is automatically built into the handling of the case; and
different data analyst can work or collaborate on the same case without significant disruption.
Overall, experiments reflected that using the hybrid approach is an encouraging solution for large-scale short text duplicate detection and increased the performance of the shingling algorithm. As future work, the hybrid approach will be used in various domains, such as in healthcare data, to improve the detection of near-duplicates. Moreover, other language options will be added to the frequency of usage of words data to address the near-duplicate problem globally.
Footnotes
Funding
This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors.
