Abstract
Current authentication code recognition systems suffer from low information integrity and low recognition accuracy when identifying character authentication codes. To address this, this paper proposes a design method for an artificial intelligence recognition system for cracking character authentication codes. A denoising algorithm based on connected domains removes the noise in the character authentication code, and the denoised code is then normalized. The feature extraction module extracts color moments, color correlation diagrams and LBP texture features of the character authentication code. The similarity matching module matches the characters of the character authentication code. In the recognition module, a classification algorithm based on multi-feature SVM classifies the character authentication code and completes its recognition. The experimental results show that the proposed method achieves high information integrity and high recognition accuracy.
Introduction
As the Internet develops, network security is becoming more and more important, and various security technologies have been put forward one after another. Among them, the authentication code is an important way to protect network security [1]. Authentication codes prevent automatic programs from illegally attacking network services and from wantonly propagating spam, which provides a clean and healthy network platform for businesses and users [2]. However, more and more authentication codes are being maliciously cracked for other purposes. Therefore, research on authentication codes and authentication code recognition has become more and more important [3]. Through research on authentication code recognition, people can design simpler and more secure authentication codes. Because many machine learning algorithms are used in authentication code recognition, this research also advances artificial intelligence. When current recognition systems are used to identify character authentication codes, they suffer from low information integrity and low recognition accuracy. Thus, the design method of the authentication code recognition system needs to be studied [4].
Shao et al. proposed a design method for an authentication code recognition system based on the convolutional neural network. A new convolutional network topology was designed, and a character segmentation method based on K-means clustering was proposed. For inseparable authentication codes, the whole authentication code was fed directly to the convolutional model, omitting the character segmentation operation. Preprocessing by affine transformation and water filling, together with binary classification by the SVM algorithm, was introduced to classify the authentication code and complete its recognition. When this method was used to denoise the authentication code, the information integrity of the character authentication code was low [5]. Zhang et al. proposed a design method for an authentication code recognition system based on the fuzzy neural network. The method used the minimum fuzzy degree optimization model to obtain the fuzzy membership function for authentication code recognition. A multi-dimensional fuzzy neural network based on an adaptive neural network fuzzy inference system was applied to realize the decision fusion of multi-sensor information. The authentication code control value was obtained, and the recognition of the authentication code was completed. The accuracy of the recognition result obtained by this method was low [6]. Mao et al. put forward a design method for an authentication code recognition system based on compressed sensing. In this method, the orthogonal matching pursuit algorithm was used for signal reconstruction, with a correlation threshold and a signal recovery threshold set. The characteristic parameters of the reconstructed signal were extracted through a Gammatone filter bank, and the recognition of the authentication code was completed in a Gaussian mixture model. The information integrity of the authentication code obtained by this method was low [7]. Zhang et al. proposed a design method for an authentication code recognition system based on discrete particle swarm optimization. In this method, a morphological localization method was used to extract the region of interest for authentication code recognition. The extracted character features of the authentication code were used as the local features for recognition. Discrete particle swarm optimization (PSO) was applied to obtain the best coverage of the feature points to be identified in standard images. Recognition was then performed according to the corresponding feature similarity. The recognition accuracy of the method was low [8].
To address these problems, a design method of an artificial intelligence recognition system for cracking character authentication codes is proposed. The specific steps are as follows:
Preprocessing of the character authentication code. The character authentication code is denoised by the denoising algorithm based on the connected domain, and the character authentication code after denoising is normalized.
The feature extraction module is used to extract the feature of the character authentication code.
The similarity matching of the character authentication code is completed by the similarity matching module.
In the recognition module, the multi-feature SVM classification algorithm is used to complete the recognition of the character authentication code.
Results. The overall effectiveness of the artificial intelligence recognition system used to crack the character authentication code is verified through information completeness and recognition accuracy.
Conclusions. The full text is summed up, and the next step of the work is put forward.
Material and methods
Denoising of authentication code
The artificial intelligence recognition system used to crack the character authentication code removes noise with a denoising algorithm based on connected domains. A set is connected if any two of its points can be joined by a path that lies within the set. The character authentication code image contains digits and letters. Except for the lowercase letters i and j, each digit or letter forms a single independent connected domain [9]. The number of black pixels in the connected domain of small speckle noise is generally less than that in a character connected domain, while the number of pixels in larger block noise is greater than that in a character connected domain [10]. The width, height and aspect ratio of noise also often differ from those of characters. Using these characteristics of connected domains, the background noise in the character authentication code can be removed.
The connected domains are located with the scanning line seed filling algorithm. Its advantage is that it handles adjacent points without recursion and therefore saves a lot of stack space. The algorithm works as follows: for a given seed point p (x, y), it first fills leftward and rightward from the seed along the horizontal scanning line, covering the section of that line that lies within the given area, and records the range [xLeft, xRight] of this section. Then the sections on the scanning lines directly above and below that are connected to this section and lie within the given area are located and saved in turn. The process is repeated until no points satisfying the conditions remain.
The scanning line seed filling algorithm can be realized in the following four steps:
Initialization. An empty stack for storing seed points is created, and the seed point p (x, y) is pushed onto the stack.
Check whether the stack is empty. If it is empty, the algorithm ends. Otherwise, the element at the top of the stack is popped as the seed point p (x, y) of the current scanning line, where y is the ordinate of the current scanning line.
Starting from the seed point p (x, y), the current scanning line is filled leftward and rightward until the boundary is encountered. The left and right endpoints of the section are marked, with coordinates xLeft and xRight, respectively.
The pixels on the two adjacent scanning lines y - 1 and y + 1 within the interval [xLeft, xRight] are examined, searching from xLeft toward xRight. If there are pixels that are neither boundary pixels nor already filled, the rightmost pixel of each run of such adjacent pixels is pushed onto the stack as a new seed point. The algorithm then returns to the second step.
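The four steps can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the image is assumed to be a list of rows, and the function name `scanline_fill` and the `target` and `label` pixel values are hypothetical choices made for the example.

```python
def scanline_fill(grid, seed, target=1, label=2):
    """Label the 4-connected region of `target` pixels that contains `seed`.

    `grid` is a list of rows and is modified in place; `seed` is (x, y).
    Returns the number of pixels filled. All names here are illustrative.
    """
    height, width = len(grid), len(grid[0])
    stack = [seed]                      # step 1: push the initial seed
    filled = 0
    while stack:                        # step 2: stop when the stack is empty
        x, y = stack.pop()
        if grid[y][x] != target:
            continue                    # already filled through another seed
        # step 3: extend left and right along the current scanning line
        x_left = x
        while x_left > 0 and grid[y][x_left - 1] == target:
            x_left -= 1
        x_right = x
        while x_right < width - 1 and grid[y][x_right + 1] == target:
            x_right += 1
        for i in range(x_left, x_right + 1):
            grid[y][i] = label
        filled += x_right - x_left + 1
        # step 4: scan lines y - 1 and y + 1 from x_left toward x_right and
        # push the rightmost pixel of every unfilled run as a new seed
        for ny in (y - 1, y + 1):
            if not 0 <= ny < height:
                continue
            i = x_left
            while i <= x_right:
                if grid[ny][i] == target:
                    while i < x_right and grid[ny][i + 1] == target:
                        i += 1
                    stack.append((i, ny))
                i += 1
    return filled
```

Calling the function once per unlabeled foreground pixel, with a fresh `label` each time, splits the image into independent connected regions; the returned pixel count can be reused by the pixel-count filter.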
After the processing of the search algorithm based on connected domain, the authentication codes are divided into independent connected regions. Some of these connected regions contain some character information. Some of them only contain noise. According to the different characteristics of character connected domain and noise connected domain, the noise connected domain is removed [11].
The connected domain of a common character is shown in Fig. 1. The width w and the height h of the connected domain of a typical character fall within a certain range. A connected domain whose width w and height h exceed their thresholds can be removed.
A character connected domain.
Let V be a set of all separated connected domains, which can be filtered according to the formula (1).
In the formula, T is the set of connected domain in the target area. S1 and S2 are the set of noise areas. For any x ∈ V, w (x) is the width of x area, h (x) indicates the height of x area, ΔT w and ΔT h are thresholds.
Formula (1) filters out connected domains that are too wide or too high. However, a connected domain with a large width is not necessarily noise: many characters in an authentication code touch each other and form a larger connected domain, which is useful character information and must not be removed [12].
When the width-to-height ratio of a connected domain is less than the lower threshold or greater than the upper threshold, the domain can be judged as block noise and is removed by formula (2).
In the formula, ΔTrmax is the upper limit of the threshold of the character’s width to height ratio, and ΔTrmin is the lower limit of the threshold of the character’s width to height ratio.
The number of pixels in a connected domain can also be used to filter out noise. The number of pixels in the connected domain of point noise is generally less than that in a character area, and the number of pixels in block noise is usually more than that in a character area [13]. A connected domain with too many or too few pixels can therefore be judged as noise, and noise with this characteristic can be filtered out according to formula (3).
Where, ΔTcmax is the upper limit of threshold value of the pixels number in the character connected domain, and ΔTcmin is the lower limit. This method can remove the point noise and block noise of authentication code. Compared with the conventional denoising algorithm, it does not damage the character information, and has a good de-noising effect for the large block noise. A flowchart for using a connected domain to denoise is shown in Fig. 2.
Flow chart of connected domain denoising.
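The three filtering rules can be combined into a single predicate, sketched below. The `Region` class, the function name and the threshold parameter names are hypothetical, and two points are assumptions of this sketch rather than statements from the paper: a region is rejected when either its width or its height exceeds the limit, and the three rules are applied in sequence. In practice the width rule would need an exception for touching characters, as noted above.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Bounding-box size and pixel count of one connected domain."""
    width: int
    height: int
    pixels: int


def is_character(region, max_w, max_h, r_min, r_max, c_min, c_max):
    """Return True when the region passes all three noise filters."""
    # Rule in the spirit of formula (1): reject over-wide or over-high regions.
    if region.width > max_w or region.height > max_h:
        return False
    # Rule in the spirit of formula (2): width-to-height ratio outside its range.
    ratio = region.width / region.height
    if ratio < r_min or ratio > r_max:
        return False
    # Rule in the spirit of formula (3): pixel count outside its range.
    return c_min <= region.pixels <= c_max


def remove_noise(regions, **thresholds):
    """Keep only the regions judged to be characters."""
    return [r for r in regions if is_character(r, **thresholds)]
```

The thresholds would have to be tuned per authentication code style; no values are given in the text, so none are suggested here.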
The character authentication code image is normalized by mapping each pixel position of the normalized image back to the original image, which determines the gray value of each pixel in the normalized image. The mapping proceeds from the bottom left corner of the source image. Let f be the original character authentication code image and g the normalized character authentication code image. An arbitrary point (x0, y0) in the normalized image corresponds to a point (u, v) in f. Depending on the nature of (u, v), the value of g (x0, y0) is given by formula (4).
The mapping formula of pixel position of the normalized character authentication code image and pixel point in the original image is as follows:
Where $r_u$ and $r_v$ are the parameters, and the calculation formula is:
In the formula, w is the width of the original character authentication code image, $w_i$ is the width of the normalized character authentication code image, h is the height of the original character authentication code image, and $h_i$ is the height of the normalized character authentication code image. The point (u, v) in the original image to which the point (x0, y0) in the normalized image is mapped does not necessarily have integer coordinates, so the original image may not be defined at that point. When (u, v) is not an integer point, pixel interpolation is needed. If (u, v) is an integer point, (x0, y0) corresponds to a grid point of the original character authentication code image, so no interpolation is needed and the gray value of (x0, y0) is set directly to the gray value at position (a, b).
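The mapping and interpolation can be illustrated with a small pure-Python sketch. Several things here are assumptions, since the formulas are not reproduced in the text: the scale factors are taken as r_u = w / w_i and r_v = h / h_i, the mapping as u = r_u · x0 and v = r_v · y0, the interpolation as bilinear, and (a, b) as the integer part of (u, v). For simplicity the sketch indexes rows from the top rather than from the bottom left corner.

```python
def resize_bilinear(src, out_w, out_h):
    """Resample a gray image (list of rows) to out_w x out_h pixels."""
    h, w = len(src), len(src[0])
    r_u = w / out_w                      # assumed horizontal scale factor
    r_v = h / out_h                      # assumed vertical scale factor
    out = []
    for y0 in range(out_h):
        v = min(y0 * r_v, h - 1)         # clamp to the last source row
        b = int(v)
        dv = v - b
        b1 = min(b + 1, h - 1)
        row = []
        for x0 in range(out_w):
            u = min(x0 * r_u, w - 1)     # clamp to the last source column
            a = int(u)
            du = u - a
            a1 = min(a + 1, w - 1)
            if du == 0 and dv == 0:
                # (u, v) is a grid point: copy the gray value directly
                value = src[b][a]
            else:
                # otherwise interpolate between the four neighbouring pixels
                top = (1 - du) * src[b][a] + du * src[b][a1]
                bottom = (1 - du) * src[b1][a] + du * src[b1][a1]
                value = (1 - dv) * top + dv * bottom
            row.append(value)
        out.append(row)
    return out
```

Nearest-neighbour interpolation would be a simpler alternative for binary images; bilinear is chosen here only because it makes the non-integer case explicit.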
The character position in the character authentication code is normalized using a centroid-based position normalization. Let f be the original image and g the normalized image, and let (x0, y0) be any point in the normalized image, corresponding to the point (u, v) in f. Depending on the nature of (u, v), the value of g (x0, y0) is determined. In practice, the operation is divided into two steps: the translation is performed first, followed by the rotation. The translation adds Δx and Δv, respectively, to the coordinates of all points in the uv plane image. Its transformation expression is as follows:
The rotation turns the relative coordinates of all points in the uv plane image counterclockwise by the angle θ, and the transformation expression is as follows:
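The two steps can be sketched on a list of foreground pixel coordinates. This is an illustration under assumptions: the translation is taken to move the centroid to the origin, the rotation uses the standard counterclockwise rotation matrix, and all function names are hypothetical. How the angle θ is chosen is not specified in the text, so it is left as a parameter.

```python
import math


def centroid(points):
    """Mean position of a list of (u, v) points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def translate(points, du, dv):
    """Add the offsets (du, dv) to every point."""
    return [(u + du, v + dv) for u, v in points]


def rotate(points, theta):
    """Rotate every point counterclockwise by theta radians about the origin."""
    c, s = math.cos(theta), math.sin(theta)
    return [(u * c - v * s, u * s + v * c) for u, v in points]


def normalize_position(points, theta):
    """Translate the centroid to the origin, then rotate by theta."""
    cu, cv = centroid(points)
    return rotate(translate(points, -cu, -cv), theta)
```

Performing the translation first means the rotation is about the character's own centroid rather than about the image origin.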
After the location and size of the character authentication code image have been normalized, the result is written to a new image and submitted to the feature extraction module for feature extraction.
System design is the most important part of transforming requirements into a software system [14]. The goal of system architecture design is to obtain a robust, easily extensible system framework on which the detailed design and implementation of the system can be carried out.
The design of an artificial intelligence recognition system used to crack the character authentication code complies with the following principles:
Applicability. A suitable system architecture is designed according to the characteristics of the requirements, so that it meets both functional and nonfunctional requirements at the same time [15–17].
Portability. With the rapid development of Internet technology, new technologies and applications keep emerging. The system should be flexible enough to adapt to changes in the environment and in user needs, and should be portable across different authentication code generation technologies.
Reusability. Reuse helps improve product quality, increase productivity and reduce costs. By extracting what is common in the application domain, the reusable parts are formed into independent modules. Reusability and ease of extension are also interrelated to some extent: good layering and modular design improve the reusability of code and components.
The functional structure diagram of the artificial intelligence recognition system used to crack the character authentication code is shown in Fig. 3.
Diagram of system functional structure.
Feature extraction module. Each character authentication code has many different kinds of features, and each feature describes the code from a different angle. Features differ in how well they describe the character authentication code and in the complexity of their data. Therefore, the selected features should describe the character authentication code as accurately as possible, so as to improve search accuracy, while keeping the data dimension as low as possible to improve efficiency. According to this principle, color moments, color correlation diagrams and rotation-invariant LBP textures are selected as the features of character authentication codes. The HSV color space is used, and the flowchart of character authentication code feature extraction is shown in Fig. 4.
Feature extraction flow chart.
Color moments describe each color channel of the character authentication code with three parameters: the first-, second- and third-order moments. In HSV space, the color moments of the three components H, S and V have a total dimension of 9, so they describe the color features of the character authentication code with few dimensions. The flow chart of color moment extraction is shown in Fig. 5.
Color moment extraction.
The theoretical basis of color moments is to view the color distribution in the character authentication code as a probability distribution. In probability theory, the central moments of a probability distribution characterize that distribution, so the moments of each channel in the color space can be used as features of character authentication codes. In general, the first three orders of color moments are sufficient to represent the color information of the character authentication code. The formula for the first-order moment E(R, G, B) of the character authentication code is as follows:
In the formula, N represents the number of character authentication codes and P(R, G, B) represents the information characteristics of the RGB color model. The formula for calculating the second-order moment δ(R, G, B) of the character authentication code is:
The formula for calculating the third-order moment S(R, G, B) of the character authentication code is as follows:
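As an illustration, the three moments of one channel can be computed as below. The sketch assumes the commonly used definitions (mean, standard deviation, and the signed cube root of the third central moment) and takes N as the number of pixel values in the channel; both are assumptions, because the formulas themselves are not reproduced in the text. The function names are hypothetical.

```python
import math


def color_moments(channel):
    """Return (first, second, third) color moments of a flat list of values."""
    n = len(channel)
    mean = sum(channel) / n
    second = sum((p - mean) ** 2 for p in channel) / n
    third = sum((p - mean) ** 3 for p in channel) / n
    std = math.sqrt(second)
    # signed cube root, so that negative skew stays negative
    skew = math.copysign(abs(third) ** (1.0 / 3.0), third)
    return mean, std, skew


def color_moment_vector(h, s, v):
    """Concatenate the moments of three channels into a 9-dimensional vector."""
    return [m for channel in (h, s, v) for m in color_moments(channel)]
```

Applied to the H, S and V channels, this yields the 9-dimensional color moment feature described above.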
The color correlation diagram not only describes color characteristics but also reflects the spatial relations between colors. After H, S and V are quantized into 4, 2 and 1 levels, respectively, the color correlation diagram can be represented by eight-dimensional data. The flow chart of color correlation diagram extraction is shown in Fig. 6.
Extraction of color correlation graph.
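A minimal sketch of this feature follows. It assumes that H, S and V are scaled to [0, 1], that the quantization is uniform, and that the eight-dimensional feature is an autocorrelogram at a single distance d (for each color bin, the fraction of pixels at chessboard distance d from a pixel of that bin that share the bin). The text does not state these details, so they are illustrative choices, as are the function names.

```python
def quantize_hsv(h, s, v):
    """Map an HSV pixel (each component in [0, 1]) to one of 8 bins (4 x 2 x 1)."""
    h_bin = min(int(h * 4), 3)
    s_bin = min(int(s * 2), 1)
    return h_bin * 2 + s_bin          # V has a single level, so it adds no bins


def autocorrelogram(bins, d=1, n_bins=8):
    """Per-bin probability that a pixel at distance d has the same bin."""
    height, width = len(bins), len(bins[0])
    same = [0] * n_bins
    total = [0] * n_bins
    for y in range(height):
        for x in range(width):
            c = bins[y][x]
            for dy in range(-d, d + 1):
                for dx in range(-d, d + 1):
                    if max(abs(dx), abs(dy)) != d:
                        continue      # keep only the ring at distance d
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width:
                        total[c] += 1
                        same[c] += int(bins[ny][nx] == c)
    return [same[c] / total[c] if total[c] else 0.0 for c in range(n_bins)]
```

The input `bins` is the image after every pixel has been passed through `quantize_hsv`.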
The rotation-invariant equivalent LBP texture feature combines the properties of rotation invariance and the equivalent (uniform) LBP. First, the LBP feature of the character authentication code is extracted and transformed into a 36-dimensional rotation-invariant LBP. Then the 36-dimensional data is transformed according to the equivalence rule, which finally gives the rotation-invariant equivalent LBP feature represented by 9-dimensional data. The LBP texture feature extraction flowchart is shown in Fig. 7.
LBP texture feature extraction.
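The sketch below illustrates one way to obtain a 9-dimensional rotation-invariant uniform LBP histogram with 8 neighbours at radius 1. A pattern is uniform when its circular bit string has at most two 0/1 transitions, and a uniform pattern is identified, independently of rotation, by its number of set bits (0 to 8). Counting only these nine uniform patterns and discarding non-uniform ones is an assumption made here to match the stated 9 dimensions; the text does not say how non-uniform patterns are treated. The sketch computes the bin directly rather than going through the intermediate 36-dimensional representation.

```python
# (dy, dx) offsets of the 8 neighbours, listed in circular order
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_riu2_histogram(image):
    """Normalized 9-bin histogram of uniform LBP codes for a gray image."""
    height, width = len(image), len(image[0])
    hist = [0] * 9
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            center = image[y][x]
            bits = [1 if image[y + dy][x + dx] >= center else 0
                    for dy, dx in OFFSETS]
            # count 0/1 transitions around the circle
            transitions = sum(bits[i] != bits[(i + 1) % 8] for i in range(8))
            if transitions <= 2:      # uniform pattern
                hist[sum(bits)] += 1  # bin = number of set bits
    total = sum(hist)
    return [count / total for count in hist] if total else hist
```

Border pixels are skipped because they lack a full neighbourhood.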
The Gabor filter bank is a family of Gabor wavelets obtained by rotation and translation of the Gabor wavelet function. The window function of the Gabor wavelet transform can analyze the local information of signals very well. The commonly used window function is the Gaussian function.
Supposing that f is a given function with f ∈ L²(R), the Gabor transform is defined as:
In the formula, $g_a(t)$ is the Gaussian window function.
The signal is then reconstructed as:
A character authentication code image can be seen as a two-dimensional signal. Each wavelet function in Gabor filter can be used to get the energy of all directions in all frequencies, and the texture information of character authentication code can be obtained from these energy information.
For a character authentication code image I (x, y) with a size of P × Q, its discrete Gabor wavelet transform is:
Where s and t are the filter's scale variables, and m and n are constants.
In the formula, $\sigma_x$ and $\sigma_y$ are wavelet coefficients, and W is the adjustment frequency. The mother wavelet is rotated and translated as follows:
After Gabor filtering of the character authentication code I at different directions and scales, a set of data is obtained that represents the energy of the character authentication code at each direction and scale.
Where, m = 0, 1, ⋯, M - 1, n = 0, 1, ⋯, N - 1.
The texture features of the character authentication code are represented by the mean value $\mu_{mn}$ and the standard deviation $\sigma_{mn}$ of E (m, n).
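For illustration, the mean and standard deviation of the response magnitude for one filter can be computed as in the following sketch. It is a simplified stand-in under stated assumptions: the kernel uses an isotropic Gaussian envelope with a single `sigma` instead of separate σx and σy, the parameters `size`, `sigma`, `frequency` and `theta` are hypothetical, and the filter is applied by direct correlation over the region where the kernel fits entirely inside the image.

```python
import cmath
import math


def gabor_kernel(size, sigma, frequency, theta):
    """Complex Gabor kernel of odd side length `size`, oriented at `theta`."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            xr = x * math.cos(theta) + y * math.sin(theta)
            envelope = math.exp(-(x * x + y * y) / (2 * sigma * sigma))
            row.append(envelope * cmath.exp(2j * math.pi * frequency * xr))
        kernel.append(row)
    return kernel


def gabor_mean_std(image, kernel):
    """Mean and standard deviation of the response magnitude over the image."""
    height, width = len(image), len(image[0])
    half = len(kernel) // 2
    mags = []
    for y in range(half, height - half):
        for x in range(half, width - half):
            acc = 0j
            for ky, krow in enumerate(kernel):
                for kx, k in enumerate(krow):
                    acc += image[y + ky - half][x + kx - half] * k
            mags.append(abs(acc))
    mean = sum(mags) / len(mags)
    std = math.sqrt(sum((m - mean) ** 2 for m in mags) / len(mags))
    return mean, std
```

Repeating this for M scales and N orientations and concatenating the (mean, standard deviation) pairs would give a texture vector of length 2·M·N.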
Similarity matching module. The Euclidean distance is used as the measure of similarity between the character authentication code and the standard in the artificial intelligence recognition system for cracking character authentication codes. Similarity matching computes the Euclidean distance between the eigenvector of the character authentication code being searched and each vector in the eigenvector library. The results are then ordered by distance from small to large: the smaller the distance between two character authentication codes, the more similar they are. The steps of similarity matching are as follows: upload the character authentication code to be searched; calculate its eigenvector; read the eigenvectors in the eigenvector library; calculate the Euclidean distance between the eigenvector of the searched code and each eigenvector in the library; and sort the results from small to large.
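The matching steps reduce to a distance computation followed by a sort, as in this short sketch. The eigenvector library is represented here as a dictionary from a name to a feature vector; that layout and the function name `rank_by_distance` are illustrative assumptions.

```python
import math


def rank_by_distance(query, library):
    """Return (name, distance) pairs sorted from most to least similar.

    `query` is a feature vector; `library` maps names to feature vectors
    of the same length.
    """
    scored = [(name, math.dist(query, vector)) for name, vector in library.items()]
    return sorted(scored, key=lambda item: item[1])
```

For example, `rank_by_distance([0, 0], {"a": [3, 4], "b": [1, 0]})` places `"b"` (distance 1) ahead of `"a"` (distance 5). `math.dist` requires Python 3.8 or later.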
Recognition module. Through the multi-feature hybrid classification algorithm, the character authentication codes are classified according to the feature extraction results, to complete the recognition of character authentication codes. The classification algorithm based on multi-feature SVM is as follows:
Input character authentication code image library I, and get different feature spaces U1, U2 and U3 respectively.
An SVM classifier is trained in each of the feature spaces U1, U2 and U3 to obtain the classifiers M1, M2 and M3.
The classifiers M1, M2 and M3 are used to predict the unlabeled samples in the feature spaces U1, U2 and U3, giving the classification matrices W1, W2 and W3.
The classification matrices W1, W2 and W3 are weighted as follows, with the weights determined by the PSO algorithm, to obtain the final classification matrix W.
In the formula, a1, a2 and a3 are constants. The recognition process of the artificial intelligence recognition system used to crack the character authentication code is shown in Fig. 8.
Authentication code recognition process based on multi-feature fusion.
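The fusion step can be sketched as a weighted sum of the three classification matrices, assuming the final matrix has the form W = a1·W1 + a2·W2 + a3·W3 with one row per sample and one column per class. This form is an assumption, since the formula is not reproduced in the text. The weights are passed in directly; the PSO search that selects them and the SVM training itself are not shown, and the function names are hypothetical.

```python
def fuse(matrices, weights):
    """Weighted element-wise sum of equally shaped classification matrices."""
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [
        [sum(a * m[i][j] for a, m in zip(weights, matrices)) for j in range(cols)]
        for i in range(rows)
    ]


def predict(fused):
    """Pick, for every sample (row), the class (column) with the highest score."""
    return [max(range(len(row)), key=row.__getitem__) for row in fused]
```

With W1 = [[0.9, 0.1]], W2 = [[0.2, 0.8]] and W3 = [[0.4, 0.6]], weights that favour the first classifier select class 0 and weights that favour the second select class 1, which is why the choice of a1, a2 and a3 matters.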
User interface. The user interface is designed with MATLAB GUI on the MATLAB platform. The interface includes an upload button, a search button, the picture being searched and 10 display boxes for search results, as shown in Fig. 9.
User interface.
Results
In order to verify the overall performance of the artificial intelligence recognition system used to crack character authentication codes, the design method of the system needs to be tested. The experimental platform for this test is the Simulink platform. Denoising of the character authentication code image is the preprocessing step of authentication code recognition. Character information is easily damaged during denoising, and the integrity of the character information is of great significance to the recognition of the character authentication code. The proposed design method for the artificial intelligence recognition system (Method 1), the design method for the character authentication code recognition system based on the convolutional neural network (Method 2) and the design method for the character authentication code recognition system based on the fuzzy neural network (Method 3) are tested, and the information integrity of the character authentication code after denoising by the three methods is compared. The test results are shown in Fig. 10.
Denoising results of three different methods.
Comparing Fig. 10(a) and (b), we can see that the information integrity of the character authentication code denoised by the artificial intelligence recognition system is higher. Comparing Fig. 10(a) with (c) and (d) shows that the information integrity is low after the character authentication code is denoised by the design methods based on the convolutional neural network and the fuzzy neural network.
The proposed design method for the artificial intelligence recognition system (Method 1), the design method for the character authentication code recognition system based on the convolutional neural network (Method 2) and the design method for the character authentication code recognition system based on the fuzzy neural network (Method 3) are tested, and the recognition accuracy of the three methods on character authentication codes is compared. The test results are shown in Fig. 11.
Test results of three different methods.
From Fig. 11(a), we can see that the recognition result curve obtained by the AI recognition system used to crack the character authentication code fits the actual result curve very closely. Fig. 11(b) and (c) show that when the other two methods are used to recognize the character authentication code, the fit between the recognition result curve and the actual result curve is relatively poor. Comparing the results of the three methods, we can see that the recognition accuracy of the AI recognition system for cracking character authentication codes is higher than that of the other two methods.
Conclusions
Authentication codes are used to prevent sites from being attacked by malicious programs and to protect the security of websites. Research on authentication code cracking technology is of great significance for discovering design defects in authentication codes and improving the security of the Internet. At present, authentication code recognition systems suffer from low information integrity and low recognition accuracy. Thus, this paper proposes a design method of an artificial intelligence recognition system for cracking character authentication codes. The first step is to preprocess the character authentication code. The second step is to extract the features of the character authentication code through the feature extraction module. The third step is to match the character authentication code with the similarity matching module. The fourth step is to complete the recognition of the character authentication code through the recognition module.
In future work, the following aspects can be studied.
A more effective denoising algorithm. The proposed method uses the connected domain denoising algorithm, but in the process of cracking authentication codes it was found that noise connected to the characters of the authentication code is very difficult to remove, so the algorithm needs further improvement. There are also many kinds of Internet authentication codes, and the methods for removing their background noise differ, so a more comprehensive denoising algorithm needs to be studied.
Character segmentation. Segmentation of the character authentication code is easily affected by character width, character distortion and skew, so effective character segmentation methods need further study.
Character recognition. Character recognition algorithms need further study.
