Abstract
This paper studies the recognition and mining of encrypted network behavior in large-scale network data and proposes a fast online recognition method that combines the correlation coefficient with the k-nearest neighbor (KNN) algorithm. Taking encrypted Twitter traffic as the research object, a large number of encrypted Twitter network behaviors, including sending messages and sending pictures, are analyzed. Statistical characteristics that express each encrypted network behavior are extracted, and a sample library of encrypted network behaviors based on the correlation coefficient is established. Interactive network data are then collected in real time, and the correlation coefficient between the collected data and the sample library is calculated. To overcome noise interference from similar data traffic, the data packets that pass the similarity filtering are classified as true or false behavior by the KNN algorithm, so that encrypted network behavior is identified automatically against a preset correlation coefficient threshold in a big data environment. Compared with the traditional correlation coefficient method, the recognition efficiency of the proposed method is greatly improved, with a correct recognition rate of about 94%. On this basis, combined with network vulnerability analysis, web crawling and virtual identity mining, comprehensive mining of encrypted network behavior is realized in a big data environment.
Keywords
Introduction
According to the statistical report of the China Internet Network Information Center (CNNIC), the Internet has penetrated all aspects of people’s lives, making chatting, making friends, shopping and correspondence more convenient for Internet users. However, unsafe factors also hide behind the data, such as Internet phone fraud and terrorist speech, which harm people’s property security and social stability. At the same time, to protect users’ personal privacy and secure their information, encrypted transmission has become the mainstream of network information transmission. Therefore, the classification of network traffic is of great significance for regulating network applications, purifying the network environment and protecting the privacy of Internet users, and it is one of the hot issues in network security research [1].
Encryption network behavior [2] refers to network activity whose keys and transmitted content are encrypted by encryption algorithms during transmission, which hides the various actions of Internet users, such as logging in and sending messages, pictures or videos. Encryption is mainly designed to prevent eavesdropping and tampering, but it has brought problems and challenges to national public safety supervision, especially as applications with encrypted network behavior such as Twitter, Facebook, Gmail, Skype and QQ become more prevalent, which makes it even more difficult to analyze criminals’ behavior on the network. Therefore, accurate real-time identification of encryption network behavior in a big data environment is a major problem to be solved in public security supervision. Solving it can also provide basic data for mining and analysis in public safety network monitoring. This paper takes Twitter, an internationally representative and popular encrypted network tool, as the study object. Twitter involves encrypted network behaviors such as logging in and sending messages, pictures and videos. Taking a criminal’s illegal speech as an example, a crawler can obtain the Twitter content, and the platform can classify it to determine whether the content is illegal. If the speech is a violation, a smaller range of encryption network behaviors can be delineated according to the publication time of the speech on the Internet and the capture time of the encrypted network behavior information at the regional network egress. At the same time, deep mining and analysis of the target’s other virtual identities can give the identified location, time, network behavior and corresponding network content. Fig. 1 shows the encryption network behavior mining model in a big data environment.

Fig. 1. The mining model of encryption network behavior in big data environment.
At present, encrypted network traffic often adopts dynamic Internet Protocol (IP) addresses, dynamic ports and fake ports. Traditional traffic identification methods include recognition based on deep packet inspection [3, 4] and recognition based on deep flow inspection [5]. For encrypted traffic, these traditional methods face a series of problems. Deep packet inspection depends on matching the message header or payload content, but the payload of encrypted traffic is ciphertext and contains no matchable feature field, so traditional deep packet inspection algorithms are difficult to apply directly. Although some commonly used encryption protocols have fixed communication ports and can be identified by port information, the wide application of random ports and private protocols has seriously reduced the accuracy of this kind of detection. Moreover, recognition based on payload information and the encryption implementation pattern is difficult to match and is easily affected by version upgrades. Deep flow inspection depends on a single data stream or on multiple data streams belonging to the same service [6, 7]. For encrypted traffic, deep flow inspection of a single stream alone is often insufficient. For statistics over multiple flows of the same service, the literature [8] proposed an identification method for encrypted traffic: a machine learning method based on a packet header feature set and a traffic statistics feature set. It does not depend on the IP address, port or payload information during recognition, so it can effectively identify encrypted network traffic. That work tested the encrypted traffic of SSH (Secure Shell) and Skype and reported high identification precision, but its high computational cost and complexity cannot meet the requirements of online real-time identification.
Therefore, the literature [9] proposes a classification method that is mainly used to distinguish Hypertext Transfer Protocol (HTTP) traffic from non-HTTP traffic on port 80. The method first judges flows by the payload sizes of the first four packets of the initial flow and eliminates the least likely data streams. The remaining data packets are then classified again by comparing the statistics of their signature codes with a previously established waterfall decision tree. The method works in real time, can be used in an online identification system, and can be combined with a firewall system for traffic monitoring, filtering and related work. Its shortcoming is that it classifies and identifies only port 80 traffic and cannot identify traffic on other ports, and the rules for building the decision tree also need further improvement.
In summary, building on the author’s previous research [10], this paper proposes an automatic recognition method for encryption network behavior based on the correlation coefficient and KNN. The method is first trained on specific network behaviors to construct the sample library and determine the reference classification threshold. The data acquired in real time are then statistically compared with the samples to calculate the correlation coefficient. Next, to overcome noise interference from data flows of similar behaviors, the KNN algorithm is used to verify the authenticity of the data packets that pass the similarity filtering. Test results and an actual project application show that the method achieves fast and automatic identification of encrypted network behavior in a big data environment and addresses the large development workload and low online recognition efficiency caused by frequent upgrades of encryption protocols, which opens up a new way for network behavior analysis and identification.
In network information interaction based on the HTTPS protocol, the server interacts with the client according to certain rules, and the content of the interaction is encrypted by an encryption algorithm. Encrypted content is well known to be difficult to decipher directly, but many observations and experiments show that the volume of encrypted traffic varies with the size of the content [11]. This can serve as an external feature of a specific network behavior (such as chat or video). Different network behaviors produce different sets of traffic data, and these data sets have relatively stable characteristics as a whole. In addition, different instances of the same behavior also exhibit distance characteristics. Extracting and analyzing these features makes it possible to identify specific behaviors in the encrypted network. The basic model for automatic identification of encryption network behavior is shown in Fig. 2.

Fig. 2. Automatic identification model of encryption network behavior.
As Fig. 2 shows, the automatic recognition model of encrypted network behavior mainly comprises three parts: creation of the sample database, real-time data acquisition, and classification of encryption network behavior. To determine the characteristics of a given network behavior, sample training must be carried out in advance: the characteristics of different network behaviors are extracted, the corresponding samples are determined, a preliminary similarity is computed to set the initial threshold, and the training samples are built. In addition, for different network applications such as Twitter and Facebook, the network information carries different domain name features during real-time interactive transmission between the server and the client. The network data acquired in real time are therefore first screened and classified by domain name. The real-time data set and the samples are then analyzed statistically and by correlation. Finally, to filter out the interference of miscellaneous packets effectively, the KNN algorithm is used to remove false behaviors, which completes the fast online identification of encrypted network behavior in a big data environment.
To secure network data communication, Netscape developed the Secure Sockets Layer (SSL), which uses data encryption to ensure that data are not intercepted or eavesdropped on over the network. Network communication must first follow the SSL handshake protocol. After the SSL handshake is completed, the network data are encrypted and transmitted according to the encryption algorithm established for the session, a process that converts plaintext into unrecognizable ciphertext. Unauthorized persons cannot recognize or tamper with the handshake protocol. To hide the plaintext and key information in the ciphertext, encryption algorithms are designed to eliminate as much characteristic information from the ciphertext as possible. It is therefore difficult to recover the plaintext or key from the ciphertext, so the ciphertext can resist brute-force cracking and cryptanalysis attacks. However, encryption only changes the external form of the content, and the size of the communicated content remains proportional to the size of the traffic. Therefore, specific encryption network behavior can be analyzed through a large number of statistical characteristics of the traffic. This paper takes sending messages with the Android version of Twitter as an example. Sending a text message uses a long connection, and repeated sampling and observation show two sets of data representing the traffic of posting. In Fig. 3, the numbers marked by the wireframe are the sizes, in bytes, of the data packets generated during the interaction between the client and the Twitter server. For example, the topmost value, 199, indicates a packet length of 199 bytes during the session.

Fig. 3. A session of the data packets of Twitter.
The traffic in Fig. 3 is recorded as follows. Data packets of 66 bytes or fewer are ignored, because they usually consist only of a 14-byte MAC header, a 20-byte IP header and a 32-byte TCP header and carry no actual content. Only data packets larger than 66 bytes are selected, and the header information is subtracted from each packet. The direction of each packet is also considered by counting, for a given behavior (here, sending a Twitter message), the number of packets from the server to the client (Dx) and the number from the client to the server (Dy). The traffic is recorded as the vector X = [199, 407, 135, 183, 167, 199, 519, 135, 199, 327, 135, 199, 407, 151]. This procedure is repeated over hundreds of observations and experiments to find vectors with representative characteristics, which together form a vector set, namely the sample data set. During these observations and experiments, the correlation coefficient between the vectors is calculated to determine the similarity of instances of the same behavior (again taking Twitter as an example) and to set the threshold. In addition, the overall message flow characteristics of Twitter are similar to the sample data set and conform to a similar probability distribution.
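The recording step can be illustrated with a minimal Python sketch. The tuple layout, the direction labels "s2c" and "c2s", the function names and the sample capture are assumptions made for illustration; they do not come from the paper's capture platform. Header subtraction is left as an optional flag because the example vector X lists the packet sizes as they appear in Fig. 3.

```python
# Header sizes quoted in the text (bytes): MAC 14 + IP 20 + TCP 32.
HEADER_BYTES = 14 + 20 + 32  # = 66

def payload_packets(packets, strip_headers=False):
    """Keep only packets longer than the bare headers.

    `packets` is a list of (length_in_bytes, direction) tuples, where
    direction is "s2c" (server to client) or "c2s" (client to server).
    Both the layout and the labels are hypothetical.
    """
    kept = []
    for length, direction in packets:
        if length <= HEADER_BYTES:
            continue  # header-only packet, no content
        size = length - HEADER_BYTES if strip_headers else length
        kept.append((size, direction))
    return kept

def direction_counts(packets):
    """Return (Dx, Dy): packet counts server->client and client->server."""
    d_x = sum(1 for _, d in packets if d == "s2c")
    d_y = sum(1 for _, d in packets if d == "c2s")
    return d_x, d_y

# Hypothetical capture: sizes reuse values from X, directions are invented.
capture = [(66, "c2s"), (199, "c2s"), (54, "s2c"), (407, "s2c"), (135, "c2s")]
kept = payload_packets(capture)
X_example = [size for size, _ in kept]
d_x, d_y = direction_counts(kept)
```

Here the two header-only packets (66 and 54 bytes) are dropped and the remaining sizes form the behavior vector.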
In the actual data acquisition process, network data packets are ordered by time and by the interaction process, and different network behaviors use different links. These links can be classified by domain name filtering, which reduces the interference from other packets. Taking the Twitter application as an example, when a Twitter text message is sent, the domain name “twitter.com” can be obtained during the Secure Sockets Layer (SSL) interaction. Sending the message uses a long connection. Domain name filtering effectively removes the interference from other data traffic, so the data can be collected effectively.
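A minimal sketch of domain name filtering follows, assuming each captured flow has already been reduced to a record carrying the server name seen during the SSL interaction. The record layout, the "server_name" key and both function names are hypothetical.

```python
def matches_domain(server_name, domain="twitter.com"):
    """True if the name is the domain itself or one of its subdomains."""
    name = server_name.lower().rstrip(".")
    return name == domain or name.endswith("." + domain)

def filter_flows(flows, domain="twitter.com"):
    """Keep only the flows whose server name belongs to `domain`."""
    return [f for f in flows if matches_domain(f["server_name"], domain)]

# Hypothetical flow records.
flows = [
    {"server_name": "api.twitter.com", "sizes": [199, 407, 135]},
    {"server_name": "example.org", "sizes": [512, 90]},
    {"server_name": "twitter.com.example.org", "sizes": [300]},
]
twitter_flows = filter_flows(flows)
```

Matching on the full suffix with a leading dot avoids accepting unrelated names that merely contain the string "twitter.com".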
To perform online real-time identification, a sliding window is used to collect the data set to be tested, Y, as given in Equation (1).
Y has the same length as the reference sample. The collection method is shown in Fig. 4.

Fig. 4. The network data collected in real time.
Here, L is the maximum sample length in the sample library. While data are being collected during the sending of a Twitter message, the sliding window method successively obtains measured data of the same length as the reference sample, and the correlation coefficient between the collected data and the reference sample is computed continuously, as shown in Fig. 5.
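The sliding window collection can be sketched as follows. This is an illustrative reading of the description, not the paper's implementation: the function name is hypothetical, and the window is assumed to advance by one packet at a time.

```python
def sliding_windows(stream, window_len):
    """Yield (start_index, window) for every run of `window_len` values."""
    for start in range(len(stream) - window_len + 1):
        yield start, stream[start:start + window_len]

# Packet sizes taken from the beginning of the example vector X.
stream = [199, 407, 135, 183, 167]
windows = list(sliding_windows(stream, 3))
```

With L set to the longest sample in the library, a window can be truncated to the length of each shorter reference sample before the comparison, so that both vectors have the same length.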

Fig. 5. The newly collected network data when sending a Twitter message.
It can be analyzed from Fig. 5, Y vector is
To better observe the statistical characteristics of the data acquired in real time, Fig. 6 shows the probability distribution of the representative message-sending behavior in Fig. 5. Fig. 6 uses a Gaussian function as the kernel density function for the smoothing window: the abscissa is divided into 100 units, the frequency value in each unit is calculated, and the probability density of the vector Y is estimated.
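A Gaussian kernel density estimate evaluated on 100 grid points can be sketched in plain Python as below. The bandwidth rule (a Silverman-style rule of thumb), the padding of three bandwidths on each side and the function name are assumptions; the paper does not state how the smoothing window was chosen.

```python
import math

def gaussian_kde(values, grid_points=100, bandwidth=None):
    """Estimate a probability density with a Gaussian kernel.

    Returns (xs, density), both lists of length `grid_points`.
    """
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    if bandwidth is None:
        bandwidth = 1.06 * std * n ** (-1 / 5)  # rule of thumb (assumption)
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    lo = min(values) - 3 * bandwidth
    hi = max(values) + 3 * bandwidth
    step = (hi - lo) / (grid_points - 1)
    xs = [lo + i * step for i in range(grid_points)]
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    density = [
        norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
        for x in xs
    ]
    return xs, density

# The example vector X from the text stands in for collected data.
sizes = [199, 407, 135, 183, 167, 199, 519, 135, 199, 327, 135, 199, 407, 151]
xs, density = gaussian_kde(sizes)
area = sum(density) * (xs[1] - xs[0])  # should be close to 1
```

Plotting `density` against `xs` for a sample and for a collected window gives two curves that can be compared visually.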

Fig. 6. The probability distribution of real-time acquisition data.
As seen in Fig. 6, the probability distributions of the real-time acquisition data and the sample are strongly similar. In addition, the direction of the data must be recorded during real-time acquisition, in order to distinguish data sent from the server to the client from data sent from the client to the server. By changing the length of the acquired data, the data are compared in turn with each sample in the sample library.
The correlation coefficient [12] reflects the degree of correlation between two random vectors; its formula is given in Equation (2).
Here, X represents the sample data taken from the sample data set and is given in Equation (3).
Y represents the new data collected in real time and is given in Equation (1).
For the correlation coefficient ρ:
When ρ > 0, X and Y are positively correlated,
When ρ < 0, X and Y are negatively correlated,
When ρ = 0, X and Y are linearly uncorrelated.
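The equations themselves are not reproduced in this text. Assuming Equation (2) is the standard Pearson correlation coefficient, with X = (x_1, ..., x_L) the sample vector, Y = (y_1, ..., y_L) the collected vector, and D(·) denoting variance (notation chosen here, not taken from the paper), it can be written as:

```latex
\rho_{XY}
  = \frac{\operatorname{Cov}(X, Y)}{\sqrt{D(X)}\,\sqrt{D(Y)}}
  = \frac{\sum_{i=1}^{L} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{L} (x_i - \bar{x})^2}\,
          \sqrt{\sum_{i=1}^{L} (y_i - \bar{y})^2}},
\qquad -1 \le \rho_{XY} \le 1 .
```

Under this form, ρ depends only on how the two size sequences vary together, not on their absolute scale.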
Moreover, the larger the absolute value of ρ, the higher the degree of correlation. Using Equation (2), the correlation coefficient between the data collected in real time and each sample in the sample data set is calculated, and each result is compared with the preset threshold of the sample data set. A result below the threshold cannot be attributed to the behavior corresponding to the sample, whereas a result greater than or equal to the threshold is considered the same network behavior, i.e., the data collected in real time carry the characteristic information of the sample.
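The threshold decision can be sketched in Python as follows, assuming the coefficient is the standard Pearson form. The function names and the perturbed window are hypothetical; the default threshold of 0.7 is the value used in the experiments reported later in the paper.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant sequence carries no correlation information
    return cov / math.sqrt(var_x * var_y)

def matches_sample(window, sample, threshold=0.7):
    """True if the window reaches the similarity threshold for the sample."""
    return pearson(window, sample) >= threshold

# Sample vector X from the text, and a hypothetical collected window.
sample = [199, 407, 135, 183, 167, 199, 519, 135, 199, 327, 135, 199, 407, 151]
window = [v + 8 if i % 2 else v - 8 for i, v in enumerate(sample)]
same_behavior = matches_sample(window, sample)
```

A window that fails the threshold for every sample in the library is discarded; one that passes is handed on for the authenticity check.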
Usually, within the same network environment, different network behaviors also have similar characteristics. Therefore, besides the corresponding statistical characteristics, distance information should also be considered; that is, the distance between different streams of the same behavior is relatively small.
The correlation coefficient determines the degree of correlation between the sample library and the information collected in real time, but it neglects the distance between them. Because the data communication of network behaviors is complicated, some false traffic with very similar statistics inevitably exists, as shown in Fig. 7.

Fig. 7. The similar data traffic with false information.
As Fig. 7 shows, the three data streams are very strongly correlated; however, the dotted line represents a pseudo data stream, which easily causes misjudgment. To further improve the accuracy of behavior recognition, the distance between the real-time acquisition data and the sample library needs to be considered, and the authenticity of the message can be effectively identified by using the KNN algorithm.
In the KNN algorithm, a set of data with known categories is taken as the training sample set, whose format is given in Equation (4).
Each row of Equation (4) represents one training sample in the training sample set, and its calculation formula is given in Equation (5).
Moreover, the class label is attached at the end of each row: Y denotes a positive sample and N denotes a negative sample.
To determine the class of the test sample X, given in Equation (3), the distance between the test sample and each training sample is calculated with Equation (6).
The distances calculated with Equation (6) are sorted in ascending order, and the K training samples with the smallest distances are taken as the most similar observations; these are called the K nearest neighbors of X. Among these K nearest neighbors, the class with the largest number of members is found, and the new sample is assigned to that class.
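The classification steps above can be sketched in Python as follows. The experiments later in the paper describe the distance as an angle cosine distance; the form 1 - cos(angle) used here is an assumption, since Equation (6) is not reproduced in this text. The function names and all training vectors are hypothetical.

```python
import math
from collections import Counter

def cosine_distance(a, b):
    """1 - cos(angle between a and b); 0 means the same direction."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # convention chosen here for zero vectors
    return 1.0 - dot / (norm_a * norm_b)

def knn_classify(test, training, k=5):
    """Majority label among the k training samples nearest to `test`.

    `training` is a list of (vector, label) pairs, label "Y" or "N".
    """
    ranked = sorted(training, key=lambda item: cosine_distance(test, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training set: three positive and two negative samples.
training = [
    ([1100, 38, 150, 1050], "Y"),
    ([1090, 40, 160, 1040], "Y"),
    ([1120, 36, 140, 1060], "Y"),
    ([200, 900, 800, 100], "N"),
    ([150, 950, 700, 120], "N"),
]
label_true = knn_classify([1105, 39, 152, 1048], training, k=3)
label_false = knn_classify([180, 920, 760, 110], training, k=3)
```

An odd K avoids ties between the two classes. The toy set here is small, so K = 3 is used; the experiments in the paper use K = 5.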
Many experiments show that the traffic patterns of encrypted network behaviors differ between client versions, such as the PC client, the Android client and the iOS client. Therefore, a separate sample data set must be established for each type of client. That is, the network behavior data are collected in real time with the sliding window method, and the correlation coefficient between these data and each vector in the sample data set is calculated with Equation (2). The data packets that pass the similarity filtering are then classified as authentic or not by using Equations (4) and (6). The experiment takes sending a Twitter message from the Android client, the most popular internationally, as an example and applies the proposed method combining the correlation coefficient and the KNN algorithm. First, traffic is judged by the domain name “twitter.com”: according to the characteristics of different domain names, the traffic belonging to the Twitter message-sending behavior is separated from the massive network traffic.
Following the proposed algorithm based on the correlation coefficient and KNN, 1000 packet capture experiments of sending Twitter messages were conducted manually on the packet capture platform to establish the sample database. By content, 100 messages were in Chinese, 800 were in English and 100 contained text with expressions. By length, 500 were short messages of more than 0 and at most 50 characters, 350 were of more than 50 and at most 100 characters, and 150 were of more than 100 and at most 140 characters. The distribution of the training samples is shown in Table 1, where X is the message length in characters.
Table 1. Distribution of training samples for Twitter encrypted traffic identification
Then, the template of message sending for Twitter encrypted traffic identification is selected as y = [1100, 38, 150, 1050], and the similarity coefficient threshold is set to 0.7. The training samples in the KNN training sample set are shown in Table 2, where Y denotes a positive sample and N denotes a negative sample. The results on a test sample set containing true messages, false messages and miscellaneous packets are shown in Table 3.
Table 2. Training samples using Twitter encrypted traffic identification
Table 3. Experimental results using Twitter encrypted traffic identification
As the experimental results in Table 3 show, the 15 message-sending records comprise 8 real messages and 7 uncorrelated ones. With the correlation coefficient threshold preset to 0.7, 2 uncorrelated records (No. 14 and No. 15) are effectively filtered out, while the other 5 uncorrelated records (No. 9 to No. 13) are strongly similar to the real messages. To further verify authenticity, the KNN algorithm with Equations (4) and (6) is used to calculate the angle cosine distance between the training samples and the test sample, with the top K = 5 neighbors used for the decision, where 1 denotes a true sent message and 0 denotes a false one. The results in Table 3 show that the correct recognition rate reaches 100%.
To verify the real-time performance of the proposed algorithm based on the correlation coefficient and KNN, the network behavior of sending Twitter messages from the Android client is again taken as an example. The experiments run on a machine with a 2-core Intel(R) Xeon(R) CPU at 2.0 GHz and 16 GB of memory. A total of 877 MB of online data is collected in real time to simulate the packet sending test, and 686 MB of data packets remain after filtering by the domain name “twitter.com”. Based on many test results, the correlation coefficient threshold is set to 0.7. The correlation coefficient of Equation (2) is calculated 1,957,916 times and yields 2287 outputs, of which 434 are wrongly accepted, and 27 behaviors are missed, so the online recognition rate is 81.02% and the missed detection rate is 1.44%. After the KNN analysis with Equations (4) and (6), there are 1959 outputs, of which 118 are wrongly accepted, and 39 behaviors are missed, so the online recognition rate is 93.98% and the missed detection rate is 2.07%. The total time consumed by Equations (2), (4) and (6) is 6.96 s. Compared with the traditional correlation coefficient method, the proposed method greatly improves the recognition efficiency, reaching a correct recognition rate of about 94%, while the missed detection rate remains almost the same. The test results show that the proposed method can meet the requirements of real-time online identification.
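The reported rates can be reproduced with the short check below. The text does not define the two rates explicitly; the definitions used here (correct outputs divided by all outputs, and misses divided by all true behaviors) are assumptions, chosen because they match the reported figures. The function name is hypothetical.

```python
def rates(outputs, wrongly_accepted, missed):
    """Return (recognition_rate, missed_detection_rate).

    Assumed definitions:
      recognition rate = (outputs - wrongly_accepted) / outputs
      missed rate      = missed / (correct outputs + missed)
    """
    correct = outputs - wrongly_accepted
    return correct / outputs, missed / (correct + missed)

corr_only = rates(2287, 434, 27)   # correlation coefficient alone
with_knn = rates(1959, 118, 39)    # correlation coefficient followed by KNN

print(f"{corr_only[0]:.2%} {corr_only[1]:.2%}")  # 81.02% 1.44%
print(f"{with_knn[0]:.2%} {with_knn[1]:.2%}")    # 93.98% 2.07%
```

Under these definitions, both configurations imply the same number of true behaviors, 1853 + 27 = 1841 + 39 = 1880, which suggests the two sets of figures are internally consistent.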
Besides Twitter message sending, the proposed method can be extended to the online automatic identification of other encrypted network behaviors, such as those of Facebook, Instagram, IMO, Hangouts and Viber.
Through the analysis of Twitter and other communication tools, vulnerabilities in the communication tools, the communication process, browsers, operating systems and other related tools or applications are mined and collected, and the Twitter session key is then obtained. With this session key, the open network data are acquired by a web crawler. The open network data are then classified by means of a text classification method [13], and illegal speech is found.
Based on the content, behavior and time of the illegal speech found, the behavior, time, location and target of the encrypted network behavior identified by the proposed method are automatically compared and analyzed, which further narrows the range of candidate targets. Then, through further mining and analysis of the target’s virtual identities, it can finally be determined when and where the illegal speech was made by the target.
Conclusions
With SSL encrypted transmission, the content of network data cannot be identified, and the data volume is very large, which challenges real-time online identification. In this paper, based on the interactive characteristics of network data and on specific network behaviors, the behavior characteristics of the encrypted network are extracted through extensive analysis and observation, and a sample library of encrypted network behaviors is constructed. The target data packets are acquired in real time by domain name filtering. Automatic classification is completed by applying the correlation coefficient and KNN to the collected packet flow set and the sample library, which realizes online automatic identification of encrypted network behavior in a big data environment. Compared with the traditional correlation coefficient method, the recognition efficiency is greatly improved, with a correct recognition rate of about 94% and an almost unchanged missed detection rate. At the same time, the problems of the large development workload caused by frequent upgrades of encryption protocols and of the low efficiency of online identification are addressed. Moreover, by combining the identified encrypted network behavior with the analysis of the encrypted data obtained by mining, illegal public opinion can be found and located, which solves the problem of locating Internet targets behind encrypted traffic and provides strong support for public safety supervision. The experiments and the actual project results verify the effectiveness and feasibility of the proposed approach to encrypted network behavior recognition and mining in a big data environment.
