A Fast and Simple Pattern Matching with Hamming Distance on Large Alphabets

Abstract

We present a fast method for calculation of Hamming distance vector based on a simple preprocessing of the target text. For applications on protein sequences, with alphabet of 20 symbols or more, the proposed method is an order of magnitude faster than the brute force approach while much simpler than previously published methods.

1. Introduction

Approximate pattern matching under Hamming distance is a part of a more general problem of approximate pattern matching that is ubiquitous in computational biology applications (Al-Okaily, 2015). Hamming distance is the number of substitutions necessary to transform one string to another of equal length. In case of different lengths, Hamming distance vector (HDV) stores the number of mismatches between a shorter pattern of length m and substrings of the same length of a longer text with length n, at the consecutive \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$( n - m + 1 )$$ \end{document} positions in the text. The problem of calculating HDV is a generalization of a more researched k-mismatch problem, where the output is the list of the positions in the text at which Hamming distance \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\le k$$ \end{document} .

The k-mismatch problem, as well as the problem of matching with wildcards, is trivially solved during the construction of HDV. In addition, HDV is employed to find the position of the best match, regardless of the number of errors, which can be useful in motif scanning and de novo peptide sequencing with mass spectrometry.

The fastest known solution for HDV computation described in Abrahamson (1987) has complexity of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$O ( n \sqrt {mlogm} )$$ \end{document} , and the complexity of the most recent solution for the k-mismatch problem in Al-Okaily (2015) is \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$O ( {m^k} / k! + occ )$$ \end{document} , where occ is the number of reported matches. Existing methods are fairly complicated, often require supra-linear space, and considerable effort to implement. Most of the methods do not factor in the size of the alphabet, the same algorithms are used for DNA and protein sequences. However, in the case of proteins, with the alphabet of 20 amino acids (or more, in the extended notation), it is possible to exploit the sparse distribution of symbols to reduce the necessary work by counting only the positions with matches. We present a simple and fast method for HDV computation that we used for finding the best Hamming distance matches for a peptide in a predefined protein database.

2. Algorithm and Results

Our method is a variant of the principle of counting used in Amir et al. (2004) to solve the k-mismatch problem in \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$O ( n \sqrt {klogk} )$$ \end{document} time. The difference is that we preprocess the text and not the pattern, and we increment the count for a position in text based on all positions of a character in pattern, not only the first one. In the preprocessing phase we store for each alphabet symbol all of its positions in the text in separate lists. Then, using these lists, for every position in the pattern we increment the match count for a corresponding position in the text. The essential pseudo-code is presented in Algorithm 1. Since our method is oriented toward applications on protein sequences, we have used protein and peptide in lieu of text and pattern. MatchVector, HammingDistanceVector, and PositionList are initialized to zeros; peptide [i] and protein [i] denote symbol at position i in peptide and protein, respectively; \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\mid A \mid$$ \end{document} is the size of alphabet A.

If we denote with \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${f_a}$$ \end{document} the frequency of a symbol (amino acid) a in a protein, then the probability of a symbol a is \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${p_a} = {f_a} / n$$ \end{document} . The complexity of the algorithm in the average case is then \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$O ( mn \mathop \sum \limits_{a \in A} p_a^2 )$$ \end{document} , which reduces to \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$O ( mn / \mid A \mid )$$ \end{document} in case of equiprobable symbols. For alphabet of amino acids, and typical peptide lengths, this is a significant improvement on the \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$O ( n \sqrt {mlogm} )$$ \end{document} previous result. In the setting of protein search, our method is a practical and viable solution to k-mismatch and wildcard matching problems, too. The precomputed lists of positions serve as an inverted file index to protein. Obviously, this index is constructed in linear time and space.

As a benchmark, we have compared our method with the \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\Theta ( mn )$$ \end{document} brute force approach on the complete recent UniProt database that consists of, approximately, 200 million amino acids in 550,000 proteins (The UniProt Consortium, 2015). The \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\mathop \sum \limits_{a \in A} p_a^2$$ \end{document} factor that would be 0.05 if all 20 amino acids have the same frequency is measured to be 0.06 for the actual distribution. The time needed for computing the complete HDV for a peptide length of 20 characters is 19 seconds with brute force algorithm and 1 second with our method, on a 3.2 GHz Intel Xeon with 16 GB RAM running on CentOS 6.5. The measured times change in linear manner with the peptide length. These times are for core processing only, the data input requires about half a minute in both cases. In case of our method, the initial construction of the index requires ∼40 seconds. As a result, our method is best suited for batch and iterative HDV computing, where the index is loaded into RAM only once.

Footnotes

Acknowledgment

This work was supported, in part, by the Croatian Science Foundation under the projects IP-11-2013-9623 and IP-11-2013-9070.

Author Disclosure Statement

No competing financial interests exist.

References

Abrahamson

1987. Generalized string matching. SIAM. J. Comput., 16, 1039–1051.

Al-Okaily

2015. Error tree: A tree structure for Hamming & edit distances & wildcards matching. J. Comput. Biol., 22, 1118–1128.

Amir

, Lewenstein

, and Porat

2004. Faster algorithms for string matching with k mismatches. J. Algorithm., 50, 257–275.

The UniProt Consortium. 2015. UniProt: A hub for protein information. Nucleic Acids Res. 43, D204–D212.