Abstract
In order to solve the clustering problem of unknown binary protocols, an improved k-means unknown binary protocol clustering method is proposed, which determines the initial clustering center and improves the clustering distance. Firstly, the k value is determined and the clustering center is extracted by using DCBP (Determine the initial clustering center of binary Protocol) algorithm and the change rate of error square, and then the data are clustered by improving the k-means algorithm of distance function. The unknown binary protocol bit stream is divided into different subsets of binary protocols. By improving the k-means algorithm, the Pearson distance improves the accuracy of binary protocol clustering from 96% to 98.9%. The DCBP algorithm helps us to determine the k value accurately. The k value determined in this paper is 5, and the clustering accuracy is 98.9%. The clustering accuracy is 80% when k is 4 and 92.2% when k is 6. And the operation speed of the improved k-means algorithm is better than that of the AGNES algorithm. The algorithm is better adapted to the clustering of unknown binary protocols, and improves the accuracy of clustering and the speed of operation.
Introduction
Protocol analysis is a main research area in protocol reverse engineering [1], while protocol clustering is an important link in protocol analysis. The existing research on protocol identification is usually based on the protocol type, port, length, direction and special fields contained in the bit stream, based on pattern matching [2], machine learning [3] and so on. With the emergence of more and more unknown protocols, binary protocol has been widely used because of its high transmission performance and good security. Compared with the text protocol, one of the more important features of the binary protocol in the reverse processing of the protocol is the fixed format and multi-position alignment [4]. At present, there are few studies on the classification of unknown binary protocols, and most of them are unknown protocol classification for bit stream. Unknown protocols do not distinguish their hierarchies and default to undisclosed link layer protocols, which are bitstream data composed of completely unknown protocols. Unknown protocols [5] do not distinguish their hierarchies and default to undisclosed link layer protocols, which are bitstream data composed of completely unknown protocols. Although the object of study is essentially the same, but there are still great differences, unknown binary protocol has its fixed characteristics. Therefore, studying the clustering of unknown binary protocols is the basis for improving the efficiency of network protocol analysis and analyzing unknown protocol formats.
In order to solve the clustering problem of unknown binary protocols, an improved k-means algorithm is designed in this paper. The specific method is to design a DCBP algorithm to automatically determine the k value and the initial clustering center, and to replace the Euclidean distance with the Pearson distance in order to solve the low clustering accuracy. By improving the k-means algorithm, the Pearson distance improves the accuracy of binary protocol clustering from 96% to 98.9%. The DCBP algorithm helps us to determine the k value accurately. The k value determined in this paper is 5, and the clustering accuracy is 98.9%. The clustering accuracy is 80% when k is 4 and 92.2% when k is 6. And the time complexity of the improved k-means is less than AGNES, so the operation speed is better than the AGNES algorithm.
The work flow diagram of the improved k-means algorithm is shown in Fig. 1:

Work flow chart of improved k-means algorithm.
It is mainly divided into three stages: preprocessing, determining k value and clustering center, and K-means clustering using Pearson distance. First of all, the data is preprocessed, and according to the characteristics of the binary protocol, 4bit is selected as the processing unit; when processing the data, the shortest data is used as the basis for cutting; each unit is taken as a feature to get a n×m two-dimensional matrix. Then the DCBP algorithm and the sum of squares of errors are used to extract the k value and the clustering center. Finally, the K-means algorithm with improved distance function is used to cluster the data, and the unknown binary protocol is divided into a subset of binary protocols.
Binary protocol
The application protocols included in the contrast stream are parsed, among which one class of protocols is called text protocol [6], such as HTTP protocol and SMTP protocol. There are a large number of characters in the text protocol that are convenient for people to understand and read, such as numbers, letters, percentage signs, carriage returns, spaces and so on. At the same time, in order to facilitate the further analysis of the text protocol, some special characters are usually used for segmentation, such as spaces, line breaks and so on. A lot of redundancy is added to the protocol design, which reduces the efficiency of transmission, but also makes the use of the protocol more flexible. An example of the HTTP protocol is shown in Fig. 2.

Sample HTTP protocol after parsing.
As can be seen from Fig. 2, the HTTP protocol uses spaces to separate data from keywords, such as spaces between the keywords “Server:” and “nginx’’, and “ r n” between information. The delimiters used by different text protocols vary slightly, but are generally consistent. On the one hand, it is convenient for people to read, on the other hand, it also determines the way the server parses the text protocol, that is, the regular expression matching is used to parse the text protocol.
Similarly, in the protocol after the parsing of the bitstream, there is another protocol, which we call binary protocol [7], such as TCP protocol, UDP protocol. The concept of binary protocol is relative to the text protocol, not to say that the smallest data unit in the binary protocol is at the bit level. As can be seen from some early RFC (Request For Comments) definitions of protocol specifications, binary protocols can also be byte protocols, but even if parsed, there are no readable characters in the protocol structure field. The fundamental difference between text protocol and binary protocol is not in the encoding of binary numbers, but in whether the protocol is data structure-oriented or text-string-oriented. The binary protocol studied in this paper does not limit the type of user data carried by the protocol. The TCP protocol format is shown in Fig. 3.

Sample diagram of TCP protocol format.
As can be seen from Fig. 3, compared to the HTTP protocol, the TCP protocol has a fixed format; there is no spacer between the fields; the length of the field is fixed; and the format of the protocol (excluding user data) is fixed. The various characteristics of the above binary protocol determine that the resolution of the binary protocol is not to use wildcards for segmentation and parsing, but to parse the fields directly according to the position and fixed length. The fixed format makes the running status of the protocol can be represented by numbers, although it is not convenient for people to read, but it increases the efficiency of data transmission and parsing. Compared with the text protocol, the binary protocol is generally shorter, which indicates the application prospect of the binary protocol in the era of the Internet of things.
The essence of clustering is to divide a given data set into several clusters according to the similarity between sample data, so that the similarity of data objects in the same cluster after clustering is as high as possible, The similarity between different clusters is as low as possible. The types of clustering methods mainly include prototype clustering, hierarchical clustering and density clustering [8]. The prototype clustering algorithm assumes that the clustering structure can be characterized by a set of prototypes, initializing the prototype, and then iteratively updating the prototype. Different prototype representations and different solving methods will produce different algorithms. The typical examples are K-means [9], learning vector quantization and Gaussian mixture clustering. This kind of algorithm is characterized by its low implementation complexity, but its disadvantage is that it needs to determine the number of classes k in advance.Hierarchical clustering algorithm attempt to divide the data set at different levels, so as to form a tree-shaped clustering structure. Its typical representative is AGNES [10] (Agglomerative NESting), CURE(Clustering Using Representatives) and other algorithms, when the data set is large, the structure and scale of the tree will become very complex. Density clustering assumes that the density structure can be determined by the compactness of the sample distribution. The connectivity between samples is investigated from the point of view of sample density, and the clustering is continuously extended based on the connectable samples to obtain the final clustering results. It typically represents DASCAN (Density-Based Spatial Clustering of Applications with Noise), DENCLUE (DENsity-based CLUstEring) and other [11] algorithms.
Prototype clustering is also known as “prototype-based clustering”. This kind of algorithm assumes that the clustering structure can be described by a set of prototypes, which refers to the representative points in the sample space. This clustering method is very common in real tasks. The algorithm initializes the prototype and then updates the prototype iteratively. Figure 4 shows the flow chart of the k-means algorithm [12] in prototype clustering.

Flow chart of k-means algorithm in prototype clustering.
The binary protocol format is fixed and multi-position aligned.Compared with other protocols, the complexity of binary protocol is lower, and although k-means algorithm in prototype clustering has some limitations such as difficulty in selecting clustering parameter k, it is widely used in the clustering of large data sets because of its low implementation complexity. In the process of clustering, the Pearson correlation distance can better reflect the similarity between binary protocol bit streams. Therefore, in the process of clustering unknown binary protocols, the improved k-means algorithm will be used for clustering, and the clustering distance can be based on Pearson correlation distance.
The Pearson correlation coefficient [13] is:
Because all distances are required to be positive when calculating the sum of squares of errors, And the value of the Pearson correlation is between-1 and 1, So the Pearson distance is d p = 1–d, the range of the Pearson distance is between 0 and 2.
The Euclidean distance [14] is derived from the distance between two points x1 and x2 in N-dimensional Euclidean space. The formula is:
We take the real protocol data frame as an example to illustrate the effectiveness of Pearson distance. As shown in Table 1, there are three preprocessed protocol frames;
Real data frames
The Euclidean distance, Manhattan distance and Pearson distance calculated by the program are shown in Table 2:
Results of three distances
The range of Pearson distance is between 0 and 2, the range of Euclidean distance is 0 to + ∞, and the range of Manhattan distance is 0 to + ∞. When the two protocol data are calculated at the same time, the values are 0, 0, 0, respectively. There is no difference among the three distances, which is in line with the actual situation. The values calculated when the two protocol data are the same protocol are 0.159763, 17.748239, and 39.000000. When the two protocol data are different types of protocols, the calculated values are 0.943311, 35.425979, 161.000000 and 0.990106, 38.961519, 172.000000. From the data, we can see that although the increase and decrease trends of the three methods of distance calculation are the same, the degree of differentiation is very different, The distinguishing degree of Pearson distance is
Distribution range of three distances for the same protocol
It can be seen from the table that the distance distribution of Euclidean distance and Manhattan distance between the same protocol is very scattered, and it will be difficult to determine the clustering parameters by using these two distances in binary protocol clustering. The Pearson distance can better reflect the similarity of protocol bit flow. Therefore, the Pearson distance is used as the clustering distance in the k-means algorithm.
Implementation scheme
In order to realize the efficient clustering of unknown binary protocols, an unknown binary protocol clustering method based on improved k-means algorithm is proposed. Because the experimental data of unknown binary protocols are difficult to obtain, the clustering effect is difficult to evaluate. Therefore, the known binary protocol is used instead of the unknown protocol. And the research object is the bitstream which has been initially segmented and the location of the protocol header is determined (hereinafter referred to as the unknown binary protocol bitstream). First of all, the data is preprocessed, according to the characteristics of the binary protocol, 4bit is selected as the processing unit; the shortest data is used as the basis for cutting the data; each unit is used as a feature to obtain an n×m two-dimensional matrix. Then the k value and clustering center are extracted by using the self-designed DCBP algorithm and the error square sum method, and then the improved distance function k-means algorithm is used to cluster the data, and the unknown binary protocol is divided into binary protocol subset. In the improved algorithm, the PEARSON correlation distance is used as the measure to improve the clustering accuracy.
Data preprocessing
First, the redundant data in the protocol header is removed, and then processed in 4bit units. For example, 111111110011 is converted to ff3 and 15153, and f, f, 3 and 15, 15, 3 are processing units. The input data frame is constructed into a two-dimensional matrix of n×m. N is the number of rows of the input data frame, and m is the first m processing units of the intercepted data frame [15]:
A matrix of m×n is constructed. Where n is the total number of protocols, because the head of the binary protocol is relatively short, too much data will increase unnecessary time overhead, and the number of protocols has little impact on the clustering results, so each protocol takes 150. Because the length of the protocol is unknown, in order to better retain the effective information of the protocol, and remove the content of the data part. The M value is determined by the minimum m value. First, the shortest bit stream length of each protocol is determined, and then the length of the shortest protocol in all protocols is taken as the m value. The reasonable values of m an dna re given in the experiment.
K value and determination of clustering center
On the basis of data preprocessing, an algorithm is designed to determine the constraint conditions by statistical character frequency, screen out frequent characters [16], and then determine K self value and clustering center by fixed position frequent pattern.
At the same time, the traditional K-means algorithm is used to cluster the bit stream in the experimental data set. The clustering parameter k is taken from 2 to 9, and the clustering results are output and the sum of error squares under the corresponding k value is output.
In SSE(sum of the squared errors) formula, Ci is the i cluster, p is the sample point in Ci, mi is the centroid of Ci, SSE is the clustering error of all samples, which represents the clustering effect.
According to sse’s formula, with the increase of cluster number k, the sample partition will be more precise, and the aggregation degree of each cluster will gradually increase, then the sum of error square SSE will naturally become smaller, and when k is less than the real cluster number, As the increase of k will greatly increase the degree of aggregation of each cluster, the decrease and increase will be very large, and when k reaches the real number of clusters, the return of the degree of aggregation will be reduced rapidly, so the amplitude will be greatly reduced. Then it tends to be flat as the k value continues to increase, and at this time the corresponding k value is the true clustering number of the data. In order to better determine the k value, the slope change rate of the I point is the slope value of the point minus the slope value of the previous point, and the slope change value is normalized in order to facilitate the comparison.
The k value with the maximum slope variation is taken as the K SSE value under the condition of sum of square error.
Because of the shortest data packet length of 144 bits in the experiment, 4bit is used as a processing unit, that is, 36 features. In order to prevent the phenomenon that the clustering granularity is too fine, the average k value determined by the two methods is rounded down, and the experimental k value is obtained, so as to improve the overall clustering accuracy.
Using the real protocol data to determine, through the calculation found that the PEARSON correlation distance can better reflect the similarity between binary protocols, so as to improve the clustering accuracy. Because the range of PEARSON correlation coefficient is [–1, 1], in k-means clustering, the distance must be positive, and the PEARSON correlation distance is 1-d. It better reflects the difference between data frames.
Part of the code for the improved K-means algorithm is as follows:
Experiments
The experimental data set is obtained from the real network environment. P1, P2, P3, P4 and P5 are used to represent the subset of four unknown binary protocol bitstreams: ARP, DNS, ICMP, TCP and SMB,. Without losing its generality, it is assumed that all protocols have been initially segmented, and the datagrams begin with the corresponding protocol header and contain the data portion. Take the shortest data message length 144bit.
Determination and Analysis of parameters m and n
By using different numbers of protocols and calculating the accuracy of clustering recognition under the corresponding number of protocols, it is shown that the method of determining parameters m and n is effective.
As shown in Fig. 5, the number of protocol entries ranges from 400 to 750, with an interval of 50. The accuracy of protocol clustering ranges from 0.96 to 0.98, and there is no obvious fluctuation, so it is considered that the number of protocols has little influence on the improved algorithm.

Accuracy of different n values.
As shown in Fig. 6, the protocol length ranges from 40bit to 200bit, and some protocols are not long enough to fill in zeros. Interval 20bit. The accuracy of protocol clustering ranges from 0.75 to 0.98. the overall trend is that it increases with the increase of protocol length, and then remains basically stable. When the protocol length exceeds the shortest protocol length, the accuracy of protocol clustering begins to decrease. This is because the protocol data is too short to affect the integrity of the information, and zero compensation interferes with the protocol data. Therefore, it is reasonable for m value to take the shortest protocol length.

Accuracy of different m values.
The k value in k-means algorithm is adjusted, and the clustering number is set to 2 to 9. Through different k values, the corresponding SSE values are recorded respectively. The results are shown in the figure below.
By calculating the change rate of the slope of sse value, it is found that the change rate of slope is the highest when k is 5, but it also changes greatly when k is 8, so the best cluster number for this data set is 5. At the same time, k 8 is also used as a comparative item to carry out the next experiment.
First, the frequency of each character in each column in the matrix is calculated, and then the maximum frequency character is selected as the frequent character of each column. As shown in Fig. 8, the left figure shows the frequency of the frequent characters for each column, and the right figure shows the frequency of the frequent characters for each column. Through observation, it can be found that the occurrence frequency of character 0 is the largest, and the frequency of frequent characters in each column is generally larger. The problem of frequency range should be considered when determining parameters.

Variation of sum of square error sse with k value.

Maximum character distribution of each column frequency.
Set the parameters pmin = 0.25 and fmin = 0.1, (the set principle will be analyzed in the following experiment), calculate the k value as shown in the figure above, the longitudinal coordinate is the k value, and the transverse coordinate is the number of iterations to the set c. A point represents the number of times a collection is merged in a traverse. The calculated k value is 5.
Finally, the average value of k obtained by the two methods is calculated and rounded down. The final k value is 5.
Set the range of fmin to [0.1, 0.9], the step size is 0.1, and pmin is fixed to 0.3. When the total number of protocols is 750 and 1000 respectively, the corresponding k values are calculated as shown in the figure above, according to the principle fmin of including all the information as far as possible but omitting the data part at the same time.
Set the value of fmin to 0.2, where the range of pmin is [0.1, 0.9] and the step size is 0.1. When the total number of protocols is 750 and 1000 respectively, the corresponding k values are calculated as shown in the figure above. According to the principle of maximum feature difference in this range, pmin takes the median value 0.2 when the frequency of k value is the maximum.
Then pmin = 0.2 is substituted into the algorithm for determining pmin = 0.2, and the result is as shown in the figure above, fmin = 0.1.
Similarly, fmin = 0.1 is replaced into pmin algorithm, and the result is shown in the figure above, pmin = 0.25. Then continue the loop until the result remains the same. Finally, we get pmin = 0.25, fmin = 0.1.
Clustering results and analysis of improved distance K-means algorithm
On the basis of determining the k value, the K-means algorithm is used to cluster the bit stream in the experimental data set. When the clustering parameter k is 5, a more accurate clustering result is obtained. In order to verify that it is reasonable to take 5 in the previous step, and then take the clustering parameter k is 8, the clustering results are shown in Table 4.
clustering results of k = 5 and k = 8
clustering results of k = 5 and k = 8
The data in the table represent the number of messages contained in each cluster after clustering.
In Fig. 10, the x axis is the sample number and the y axis is the cluster category number, reflecting the distribution of protocol categories with the number. In the process of data preprocessing, the known binary protocol is used as an unknown binary protocol, which does not affect the final result. The five protocols are arranged in order for statistical convenience. Through the chart, we can see that the clustering effect is good, all kinds of data can be distinguished effectively, but in the third category, there are still 8 protocols are divided into the fourth category. When k≥8, we can see that most of the three new categories are distributed in p1, p4, p5 protocols. By studying the clustering results of specific protocols, it is found that when the k value is large, the classification granularity is too fine. The protocol messages with different internal formats are grouped into different categories.

Clustering result diagram.

variation of k value with fmin at n = 750.
Recognition rate and false identification rate of all kinds of protocols when K = 5
Under the same k value, the clustering results of the two methods are compared by the accuracy Acc, the recognition rate TP and the false recognition rate FP.
The time complexity of K-Means clustering algorithm and improved K-Means clustering algorithm is that the complexity of O (nkt), AGNES algorithm is O (N2). When n≥750 and k ∼ 5, the k-means clustering time is 1.2 s, the accuracy of 96% Ages algorithm is 15 s, the accuracy is 98%, and the improved k-means clustering time is 0.9 s, and the accuracy is 98.9%.
Taking the five kinds of protocols P1, P2, P3, P4 and P5 as transverse coordinates, the line graphs of TP and FP values under two different algorithms are drawn.
Through comparison, it is found that the operation speed of k-means algorithm is faster than that of AGNES algorithm, but the clustering result is poor. Theoretically, the operation speed of k-means is better than that of AGNES; because the time complexity of AGNES; is less than that of AGNES. However, in the process of clustering, k-means uses Euclidean distance and AGNES uses PEARSON correlation distance, which is more in line with the characteristics of binary protocol, so the accuracy is not as good as AGNES.
Clustering accuracy of different algorithms when K takes different values
Different k values are set according to the same set of data, and the accuracy of the three clustering methods is calculated respectively.
As can be seen from Fig. 16 and Table 6, the clustering accuracy of the improved K-Means clustering algorithm is similar to that of the AGNES algorithm, and is superior to the traditional k-means algorithm. At the same time, when the k value is the real number of categories, the clustering accuracy is the highest, accurate determination of k value is the key to clustering. However, when k is 7, the accuracy of the traditional k-means algorithm is higher than that of the improved k-means algorithm.
comparison of clustering results between k-means, AGNES and improved k-means
comparison of clustering results between k-means, AGNES and improved k-means
comparison of clustering results with different values of k
The left figure of Fig. 13 is the result of the traditional k-means algorithm, and the right figure is the result of the improved k-means. The traditional algorithm only classifies the fourth class of protocols into three categories, and the actual classification effect is not good. Therefore, it can be obtained that the accuracy degradation is due to the fact that the improved algorithm is more in line with the characteristics of the binary protocol, which leads to the phenomenon that the clustering granularity is too fine when the k value is greater than the real value.

Variation of k value with fmin when n = 750.

Variation of k value with fmin after iteration.

Variation of k value with pmin after iteration.

Distribution map of clustering results at k = 5 and k = 8.

Comparison of clustering effects of three algorithms.

comparison of clustering results of three algorithms.

k = 7 clustering result distribution map.
Generally speaking, in the process of clustering unknown binary protocols, the clustering accuracy of the improved k-means algorithm is better than that of the traditional k-means algorithm, and the time complexity is lower than that of AGNES clustering, so it is suitable for clustering a large number of unknown binary protocol data. Through the improved k-means clustering with different k values, it can be found that to some extent, k values reflect the granularity of clustering. The higher the k value is, the finer the clustering granularity is.
In order to solve the clustering problem of unknown binary protocols under the condition of lack of prior knowledge, an improved k-means algorithm is proposed to determine the initial clustering center and improve the clustering distance. On the basis of data preprocessing, the k value and clustering center are extracted by DCBP algorithm and sse method, and then the k-means algorithm of improved distance function is used to cluster the data, and the unknown binary protocol is divided into binary protocol subset. The algorithm is more suitable for the clustering of unknown binary protocols and improves the accuracy of clustering and the speed of operation.
Based on the clustering of unknown binary protocols, the proposed method can calculate the clustering results quickly. The selection of clustering k value has a great influence on the granularity of clustering. How to obtain the clustering results of the required granularity more quickly and accurately will be a new research direction.
