Linear trace similarity matching based on improved longest common substring

Abstract

The linear trace indicates the external morphological structure of the contact portion of clamping and cutting tools, which is not easy to be destroyed, has a high occurrence rate and high significant on identification. It is of great significance for prosecutor to determine the nature of the case and determine the tools used in the crime so as to find the criminals. The traditional linear trace analyzing methods include microscopy, manual comparison of characteristics, image recognition and three-dimensional scanning methods. The single-point laser picks up the toolmark detection signal, and the longest common substring is obtained after noise reduction. In addition, the improved dynamic programming algorithm calculates and generates matching results. Finally, the effectiveness of the algorithm is verified by the actual detection data.

Keywords

Linear trace single-point laser longest common substring dynamic programming matching calculation

1 Introduction

During the “Thirteenth Five-Year Plan”, China’s high-speed rail operating mileage will reach 30,000 kilometers which can cover more than 80% of large cities and will continue to renovate and complete current airport building, as well as build 74 civil aviation airports and 850 general airports. There are many optical cables, leaky cables, signal cables, cables for rail transit vehicles and links up grounding wire along the railway. In the meanwhile, there are many navigation lights cables and communication optical cables around various airports. Since the inner conductors of various types of cables are mostly copper which has high value, it has gradually become the object of criminals. In recent years, frequent wire-cutting crimes against high-speed rail and airports have caused huge losses of tens of millions of dollars, and frequently interrupt power supply, signal and communication equipment power supply, which potentially cause more serious accidents [1].

Statistics show that these criminals used clamping and cutting tools including bolt cutters, cable cutters and other pliers to destroy the cables which left toolmarks in the crime scene [2]. The linear trace indicates the toolmark morphology of clamping and cutting tool, which is not easy to be destroyed, has a high occurrence rate and high significant on identification. It is of great significance for prosecutor to determine the nature of the case and determine the tools used in the crime so as to find the criminals [3].

Image recognition and three-dimensional scanning techniques have emerged in recent years which provide some new solutions for non-destructive quantitative testing of linear traces compared with traditional methods such as microscopy and manual features [4 –6].

Although these methods can automatically match the linear traces to some extent, there are still a few existing problems [7, 8]:

Comparison method based on the image has high requirements for the camera device. Inconsistency reflection, the shooting angle and the focus will directly lead to the distortion of the original data, which may weaken the robustness of the analysis result;

Although the three-dimensional scanning robustness can more accurately reflect the detailed features of the linear traces, the detection hardware cost is high, and the file volume is too large, the calculation is geometric increase;

The existing matching method needs to be able to distinguish individual characteristics of the tool itself as well as distinguishing the class characteristics and the sub-class features. The linear traces of such tests including the rolling traces and the bullet-shot marks, are relatively regular and highly measurable. However, the complicated morphology of the clamp-type linear traces is very random which causes low accuracy.

This paper proposes a method in which the single-point laser picks up the toolmark detection signal, and the longest common substring is obtained after noise reduction. In addition, the improved dynamic programming algorithm calculates and generates matching results in order to solve the existing problems in traditional methods. The state matrix generated by the dynamic programming method contains multiple matching results which output the basic longest substring. In addition, all the candidate results can be obtained according to different minimum matching lengths to improve the result when only one calculation is carried out.

2 Longest common substring

Longest Common Substring (LC-Substring) is an algorithm for describing the similarity between two substrings. For a given sample sequence S and a sequence T to be matched, the L -Substring correlation algorithm is used to find the longest common substring between S and T. The longest common substring refers to the longest common part of the sequence S and T that. The longest common substring LC-Substring algorithm is mainly applied to find the exact overlap between sequences.

The algorithms implementing LC-Substring mainly include two types, brute force method and dynamic programming method. The basis of the two types are as follows:

(1) Brute force method

The brute force method is solved by a simple search solution. The brute force method checks whether each substring of S is a substring of T, thereby determining whether it is a common substring of S and T, and selecting the longest common substring. The longest common substrings of S and T can be found after all substrings of S and T have been examined. When the brute force method is used to solve the problem in which S is a string of length n and T is a string of length m, the time complexity is O (2^m+n), and the space complexity is O (n).

The brute force method has the advantages of simple implementation and small space occupation, so it is suitable for problems with small string length which only needs to provide the longest common substring. If the problem needs more outputs, the brute force rule has its limitations.

(2) Dynamic programming design method

The dynamic programming design method is developed for solving the optimization problem, rather than a special algorithm. The constraints and conditions are different for different optimization problems because of the different nature, so the design method of dynamic programming has different problem-solving methods for different problems.

The main idea of dynamic programming is to establish a n * m two-dimensional matrix A according to the length of S and T in the LC-Substring problem. The element of the matrix A (i, j) indicates the matching information at the i position of the S string and the j position of the T string: either matching or not matching or already a continuous match.

LC-Substring dynamic programming transfer equation: $A (i, j) = {\begin{matrix} 0, i = 0 or j = 0 or S (i) \neq T (j) \\ A (i - 1, j - 1) + 1, S (i) = T (j) \end{matrix}$ (1)

The equation mainly focuses on the local optimal matching data of the current position. The algorithm is suitable for the pattern under strict matching. The LC-Substring has a few applications on:

Finding the match between two trace signals using the dynamic programming method.

Restoring path by obtaining enough matching substrings in addition to the longest common substring according to the dynamic programming result.

3 Matching calculation

The first step is to calculate the matching state of the sequence S and the sequence T by using the motion detection method to perform the similarity matching process. The space and time and complexity in this step are both o (m * n). So when the signal data volume is too large, it will be moderately scaled down and will not have a big impact on the result because the overall signal-to-noise ratio is improved, thereby offsetting the effect of a certain decrease in resolution when the original signal detection signal is improved in the process of zooming down.

The dynamic programming method used in the matching calculation is consistent with the dynamic planning steps of the LC-Substring in the execution steps and the calculation is in the order of rows or columns. And the main improvement of the algorithm is the decision rules of the state transition which uses the method of “local minimum difference”. The basic flow of matching calculations is shown in Fig. 1:

Fig. 1

Matching calculations basic flow.

3.1 Variance calculation for dynamic programming optimization

The dynamic equations are used to calculate the local optimal solution of the current position in a certain order (by row or column) after establishing the two-dimensional state matrix. The difference from traditional dynamic programming is the introduction of variance calculation in the process of comparison design. Because the different traces have a Y-axis position in which its deviation is difficult to correct, it is unrealistic to use simple distance difference to measure. While the variance can measure the degree of stability of the similar parts of the two sequences. If the two sequences are all within a certain range of deviation, sequence overlap can be done using the displacement. At the same time, an additional storage matrix is used to calculate the variance due to the particularity of the variance calculation.

The basic formula for variance calculation is: $D = \frac{\sum_{1}^{n} (x_{i} - \bar{x})^{2}}{n}$ (2)

This formula is not suitable for dynamic programming because the matching state of each position is uncertain. The average must be calculated once at each position before the variance calculation. Hence it can well match the characteristics of dynamic programming using another calculation method: $D = E (X^{2}) - E (X)^{2}$ (3)

This method only needs to add the following matrix according to the length of the sequence:

D (0 . . . n, 0 . . . m), Save the variance of the current optimal result

SumEx (0 . . . n, 0 . . . m), Save the sum of the difference of the current optimal result

SumExPower2 (0 . . . n, 0 . . . m), Save the sum of the squares of the difference of the current optimal result

len (0 . . . n, 0 . . . m), Save the length of the current optimal result

Where x is the difference between S and T at the corresponding position

Thus the variance of each position (i, j) can be calculated as follows: $D (i, j) = \frac{SumExPower 2 (i, j)}{len (i, j)} - {(\frac{SumEx (i, j)}{len (i, j)})}^{2}$ (4)

SumEx, SumExPower2, len the three matrices only need to perform a one-time operation according to the transfer equation during state transition to avoid a large number of repeated calculations.

3.2 Steps

The matrix involved in the algorithm has to be defined as the first step. The matrix involved needs to include:

The ones are introduced in previous sections:

D (0 . . . n, 0 . . . m)

SumEx (0 . . . n, 0 . . . m)

SumExPower2 (0 . . . n, 0 . . . m)

len (0 . . . n, 0 . . . m)

and

Way (0 . . . n, 0 . . . m), Save the correlation between the current location and the previous match.

In which len (0 . . . n, 0 . . . m) is the core of the entire dynamic planning. The workflow of the entire algorithm in LC-Substring strict mode is as follows:

With the given sequence S and T, initialize five matrices with empty values D, SumEx, SumExPower2, len, Way in order to guarantee the validity of the data. The data with row number 0 and column number 0 is the initial state data, which must exist. In the meanwhile, a threshold value maxDiffirence needs to be given according to the matching error tolerance expected by the user.

Row (or column) traversal for dynamic programming calculation starts from the first row (or first column), in order to calculate (1,1) row #1 and column #1, (1, 2) row #1 and column #2 of data. Step 3 is performed at each position; the entire matching calculation process finishes when the calculation for all the positions is completed.

Find the variance value D (i - 1, j - 1) of its precursor and calculate the difference from maxDiffirence for the current position (i, j). If D (i - 1, j - 1) is less than maxDiffirence, continue to step 4. Otherwise, the state equation is not changed, and go to step 5.

4. Step 4 is to update the state equation for the current situation when the variance of the current precursor position is less than the tolerance value. If the previous position has already started matching, len (i - 1, j - 1) >0 then the following update is performed: ${\begin{matrix} len (i, j) = len (i - 1, j - 1) + 1 \\ SumEx (i, j) = SumEx (i - 1, j - 1) + S (i) - T (i) \\ SumExPower 2 (i, j) = SumExPower 2 (i - 1, j - 1) + (S (i) - T (i))^{2} \\ Way (i, j) = 1 \end{matrix}$ (5)

The main purpose is to update the current position (i, j), the smallest degree of difference that can be achieved (the maximum similarity), and the lengths len and Way. These results will be used for subsequent backtracking.

However, if the previous position is not matched, len (i - 1, j - 1) =0, then the current position is selected as the starting point for further matching and following updates: ${\begin{matrix} len (i, j) = 1 \\ SumEx (i, j) = S (i) - T (i) \\ SumExPower 2 (i, j) = (S (i) - T (i))^{2} \\ Way (i, j) = - 1 \end{matrix}$ (6)

Whether or not the previous position has already been matched, there will be a corresponding match to each situation, and go to step 5.

5. Step 5 is to update the difference of the current location. After completing the following calculation, return to step 2: $D (i, j) = \frac{SumExPower 2 (i, j)}{len (i, j)} - {(\frac{SumEx (i, j)}{len (i, j)})}^{2}$ (7)

3.3 Generate results

The position of an optimal solution is recorded in the dynamic programming calculation in the traditional longest common subsequence algorithm. The longest common sequence can be obtained by backtracking and traversing afterward. The state matrix saves the local optimal matching result at any position which can be traced back to obtain the longest result, or the Nth long result. The trace signal cannot be used in strict matching compared with the original numeric string of the LCS application, hence the matching is relatively unstable. It is important to obtain matching results of any positions/lengths in addition to the longest match result required in the trace matching.

Results generation includes the following steps:

Match Filtering: Filter out all possible matching information based on specified criteria.

Results analysis: The results are analyzed including matching results correction, connection or topological sorting based on the generated matching information,

The workflow diagram is shown in Fig. 2:

Fig. 2

Orkflow diagram for result generation.

4 Experimental testing

The output of the Nth long partial matching is usually used for the fast matching process. The Nth long result obtained by the algorithm is accurate and continuous in general. The approximated (most strict) longest common part of the two traces can be achieved under such circumstances. The implementation steps are as follows:

The searching procedure for the Nth largest value position (i, j) is based on the size of the matrix len. The Nth largest value position in the matrix len represents the Nth longest match because the len matrix stores the longest continuous matching size of S and T at the corresponding position.

Firstly, the end position is marked (i, j). And the i position of the S string and the j position of the T string are the matched end positions. Then go to step 3.

It outputs and moves according to the state of Way(x, y) each time when it goes back to the position to find (x, y), ${\begin{matrix} end, Way (x, y) = 0 \\ x = x - 1; y = y - 1, Way (x, y) = 1 \\ y = y - 1, Way (x, y) = 2 \\ x = x - 1, Way (x, y) = 3, \end{matrix}$

Step 3 will go through Way (x, y) =1 states len (i, j) -1 times and outputs a corresponding sequence of S and T each time, which is compatible with all modes in the comparison algorithm.

The coordinate position at the end of step 3 is marked as the start position. The previous end position and the current start position are returned as well as outputting the sequence corresponding to S and T, which is the longest common sequence of the two sequences.

4.1 Simulation testing

A total of 8 samples, of which data0 is a sample, data1∼100 is overlapped with data0 at different positions, and the overlap percentage is around 30%. The tolerance used in the test is 1, and the test results are as shown in Fig. 3:

Fig. 3

Simulation data matching testing.

4.2 The actual testing

It is difficult to determine the overlap by naked eyes when different cuts of the actual cuts are under testing. The corresponding positions are very different as well. The actual test is performed by the algorithm, and the test tolerance is 4, which is divided into 100 groups. The 100 groups are all repeated tests of the same sample, and the maximum overlap percentage of each group is 30% or 60%. The matching result is shown in Figures 4 and 5, and the matching accuracy is 85%.

Fig. 4

60% overlap actual testing data matching test.

Fig. 5

30% overlap actual testing data matching test.

The test continues with increasing the difficulty of continuous cut. The different cut ends of the actual cut with large randomness and change. The overlapping percentage is 30% ∼60% while other conditions are remaining the same. The matching result is shown in Figure 6 in which the matching accuracy is 80%.

Fig. 6

45% overlap actual testing data matching test.

5 Conclusion

This paper proposes a similarity matching algorithm for linear traces. The algorithm uses a single-point laser to pick up trace detection signals. The LC-Substring and improved dynamic programming algorithms are used for matching calculation and result generation after noise reduction. The complexity of the algorithm is low which can be directly programmed in Python. The executable can run on a computer with medium configuration. The toolmark detection experts from the Institute of Forensic Science Ministry conducted actual tests and optical comparison to prove that the algorithm can be effectively applied to detect the traceability of the clamping and cutting tool based on the linear trace comparison. It provides a reliable reference for the subsequent identification work. Future works need to improve the computational performance of a large amount of data, complete the cross-regional parallel comparison and traceability source to be able to handle actual case handling environment.

Footnotes

Acknowledgments

This paper was supported by the National Natural Science Foundation of China under Grant 51965030, the informatization construction project of Ministry of Public Security Material Certification Center under Grant SGS2019102901, the key project of technology research program funded by Yunnan Province under Grant 2016RA042, the Innovation Fund for Technology based firms in Yunnan Province under Grant 2017EH028, and the Ministry of public security technology research program under Grant 2016JSYJA03.

References

Wang

, Si

and Xu

, A Study on the Optical Technology for Recognizing Toolmark Trace, Journal of Experimental Mechanics 18(1) (2003), 131–136.

Spotts

, Chumbley

L.S.

, Ekstrand

, et al., Optimization of a statistical algorithm for objective comparison of toolmarks, Journal of Forensic Sciences 60(2) (2015), 303–314.

Cui

, Yang

and Wang

, Determination of the Suspect Tool by Comparing the Impressed and Striated Marks Left on the Cutting Surface, Forensic Science and Technology (nature and science) 40(4) (2015), 272–274.

Lock

A.B.

and Morris

M.D.

, Significance of angle in the statistical comparison of forensic tool marks, Technometrics 55(4) (2013), 548–561.

Carol

, Patrick

McLaughlin

, Kuo

Loretta

, et al., Forensic surface metrology: Tool mark evidence, Scanning 33(5) (2011), 272–278.

Michael

, Techniques for the segmentation of striation patterns, IEEE Trans on Image Processing 3 (2006), 624–631.

Baiker

, Keereweer

, Pieterman

, et al., Quantitative comon of striated toolmarks, Forensic Science International 242 (2014), 186–199.

Chumbley

, Zhang

, Morris

M.D.

, et al., Development of a Mobile Toolmark Characterization/Comparison System, Journal of Forensic Sciences 62 (2017), 83–91.