Abstract
Conventional brute-force attacks can now be detected and identified through statistical analysis of logs and traffic data. However, such analysis fails to detect low-frequency and distributed brute-force attacks, and new detection techniques have emerged to address these attack methods. This study compares various machine learning algorithms and selects three of them for learning from the data: the clustering algorithms k-means and DBSCAN, and the decision tree algorithm. In the first approach, normal user login data is integrated with enterprise email log data. The data is first statistically analyzed and filtered, and its characteristics are then quantified using information entropy. Subsequently, the clustering algorithms are applied and the results are visualized. In the second approach, labeled raw data is used to train a model with the decision tree algorithm. By comparing the results of the two analyses, a more accurate model can be obtained. These analytical methods can help enterprises strengthen email security and defend against low-frequency and distributed brute-force attacks.
Introduction
Brute-force attacks are a common method used in web attacks, especially in email attacks, to gain unauthorized access. The process involves repeatedly attempting authentication until a successful login is achieved. To improve efficiency, attackers often use tools with pre-existing dictionaries for these operations. With the advancement of computer technology, brute-force attacks can now generally be detected and identified. However, there is a growing trend of low-frequency and distributed attacks, where the traces of such attacks are becoming increasingly concealed. Email security experts speculate that these targeted attacks, characterized by broad targeting, precise attacks, and successful conversions, occur in a cyclical pattern. Therefore, it is necessary to research new detection and defense techniques for these two types of targeted attacks.
For email systems, particularly enterprise email systems, detection primarily relies on analyzing logs and traffic and computing statistics over them. The main method sets a threshold on the number of failed login attempts from a specific IP within a fixed time window. If the threshold is exceeded, the IP is blocked from making further requests for the next variable-length time window. However, this approach is no longer effective in defending against low-frequency and distributed brute-force attacks targeted at email systems. In low-frequency brute-force attacks, once the attacker has infiltrated the system, they can adjust their activity to stay below the detection threshold set by the system. In distributed brute-force attacks, the attacker controls multiple compromised hosts (a botnet) that simultaneously launch attacks at a similar frequency, so that no single host exceeds the threshold and the detection mechanism is bypassed.
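The threshold logic described above can be sketched as follows. This is a minimal illustration, not the detection code of any particular email system: the function name `find_blocked_ips`, the limit of 5 failures, and the 10-minute window are assumptions, and a sliding window is used in place of a fixed one for brevity.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def find_blocked_ips(failed_logins, max_failures=5, window=timedelta(minutes=10)):
    """Return the IPs that exceed max_failures failed logins inside one window.

    failed_logins: iterable of (ip, datetime) pairs sorted by time.
    The default limits are illustrative assumptions.
    """
    recent = defaultdict(deque)  # ip -> timestamps still inside the window
    blocked = set()
    for ip, ts in failed_logins:
        queue = recent[ip]
        queue.append(ts)
        # Drop attempts that are older than the window ending at ts.
        while ts - queue[0] > window:
            queue.popleft()
        if len(queue) > max_failures:
            blocked.add(ip)
    return blocked
```

An IP that fails six times within a minute is flagged, whereas an IP that fails once per hour never accumulates more than one attempt in the window. This is exactly the gap that low-frequency and distributed attacks exploit.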
Research status
Currently, the most efficient and accurate method for detecting brute-force attacks in email systems, based on the analysis of system log files, is the detection approach that relies on traffic features. In Yang Jia’s research paper titled “Research on Abnormal Behavior Detection in Campus Email Based on Big Data Analysis,” the author analyzed a dataset consisting of email log records. They filtered out the login failure entries from the logs and calculated the occurrence frequency of each IP address and each account. A threshold was then set, and by comparing the occurrences with the threshold, the presence of brute-force attacks was determined. After selecting relevant data, the author visualized the data using charts. By observing the visualized data patterns, they employed the information entropy method to quantify the data and identify collaborative attack IP addresses, thereby improving the detection accuracy [1].
In this study, three machine learning algorithms, namely k-means, DBSCAN, and decision trees, were employed for the analysis and detection of brute-force login behavior. The first method quantified data features such as entropy, mean, and standard deviation, and then standardized the data [2]. The quantified features were then clustered using the k-means and DBSCAN algorithms. The second method used a labeled dataset of real data, which is suitable for the decision tree algorithm and enables accurate judgment of the results. Finally, the results of the two methods were compared to ensure high accuracy in the detection outcomes [3].
Data description
In this study, the DataCon2021-Enterprise Email Security Dataset was used. This dataset is derived from the log data generated in the production environment of a large-scale enterprise email server (with sensitive data anonymized). The dataset reflects the real-world email environment, where attackers employ various attack methods with different objectives, techniques, and levels of threat. The complex nature of the environment contributes to enhancing the comprehensiveness, stability, and practicality of the designed detection techniques. Additionally, the dataset has been labeled using internal methods, making it suitable for the selected machine learning algorithms. The dataset consists of two log files: one containing records of real login behaviors and the other containing corresponding labels. Both files contain a total of 35,159,978 entries [4, 5, 6].
Here is an example of the data that records login behaviors: {"id": 1, "ip": "106.49.214.144", "logtime": "2021-07-25 18:56:30", "type": "IMAP", "isp": "China Telecom", "latitude": "34.775838", "longitude": "113.686037", "email": "ac98bbae313babad", "password_hash": "cc80f8ed", "geo": "Zhengzhou City, Henan Province, China", "client_software_fingerprint": "bf1e0fbbc4d0c6f2", "auth_result": 0}.
The labels for the login behaviors were obtained by analyzing the data and identifying eight features that describe the login behaviors. Here is an example of a record consisting of login behavior features and the corresponding label: {"id": 1, "BF attempt": false, "User miss": false, "Abnormal geo": false, "VPN": false, "Spam related": false, "BF succeed": false, "New spam Account": false, "label": false}.
Design of email brute-force detection system
The goal of this system is to identify brute-force attack entries in the DataCon2021-Enterprise Email Security Dataset. Implementing the system requires data processing, statistics, analysis, and the derivation of analysis results. Based on the results, adjustments and improvements are made to enhance the detection rate and accuracy of brute-force attacks. The system architecture is shown in Fig. 1 [7, 8, 9].
System architecture diagram.
Data processing
Data preprocessing
The original dataset contains a large number of entries (over 35 million) and occupies a large amount of memory (over 10 GB), so it is slow to read, and it also contains some irrelevant information. Therefore, the content is filtered based on requirements. In this case, the request IP, email, and time (date) are selected as the data to be analyzed. Regular expressions are used to extract these fields and to identify the failed login entries. The extracted fields are then separated by commas for easy re-reading and saved in a new file. However, this file is still large, which makes reading and writing it in scripts inconvenient. Since the dataset was generated by a real enterprise email system over one month, the date is a natural key for splitting it. Additionally, the entries are distributed fairly evenly across days. Therefore, the file generated in the first round of preprocessing is further divided into separate files by date, so that subsequent reading processes can determine the date from the file name without frequent conditional statements in the script, thus improving system efficiency [10].
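The two preprocessing rounds can be sketched in a single streaming pass, as shown below. This is an illustration under stated assumptions rather than the original script: it assumes one JSON record per input line and parses it with the `json` module instead of regular expressions, the function name `split_failed_logins` is hypothetical, and the `auth_result` value that marks a failed login (`failed_value`) is a parameter because the dataset description does not state it.

```python
import csv
import json
import os


def split_failed_logins(lines, out_dir, failed_value=0):
    """Write failed logins as "ip,email,date" rows into one CSV file per date.

    lines: iterable of JSON strings, one login record each (assumed layout).
    failed_value: auth_result value treated as a failure (assumption).
    Returns a dict mapping each date to the number of rows written.
    """
    handles, writers, counts = {}, {}, {}
    try:
        for line in lines:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            if record["auth_result"] != failed_value:
                continue
            date = record["logtime"].split(" ")[0]  # keep only YYYY-MM-DD
            if date not in writers:
                path = os.path.join(out_dir, f"{date}.csv")
                handles[date] = open(path, "w", newline="")
                writers[date] = csv.writer(handles[date])
            writers[date].writerow([record["ip"], record["email"], date])
            counts[date] = counts.get(date, 0) + 1
    finally:
        for handle in handles.values():
            handle.close()
    return counts
```

Because records are written as they are read, memory use stays flat regardless of input size, and at most one file per day of the month is open at a time.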
Email data statistics
For this problem, a multi-level classification based on different dimensions is chosen. Since the attack behavior is not fixed overall, different statistical methods and filtering logic are used to analyze the data. Because the detection logic differs, three multi-level dictionaries are generated to record the data patterns [11].
The first dictionary records the total number of logins by each individual IP. During an attack, the attacker usually holds the initiative, so the focus is on the attacker, that is, on the IP recorded for each login attempt. Since the entries saved after preprocessing are all failed logins, an IP with too few occurrences is considered to show normal login behavior and is excluded as a possible attacker. An example of this dictionary is shown in Table 1.
Storing samples by IP
The second dictionary records the total number of logins by each individual IP, aggregated on a daily basis. Traditional detection methods often set parameters for a short time window and a long time window, and the long time window is typically one day. Since the dataset used in this study includes all the real data within a natural month, it is divided by day, and each day's value in the dictionary is a secondary dictionary that stores the total number of occurrences of each IP on that day. An example of this dictionary is shown in Table 2.
Store samples by IP and group by day
The third dictionary is also divided by day, and each day's value is a secondary dictionary. The secondary dictionary is keyed by IP, and the value for each IP is a list of all email accounts that the IP attempted to log in to on that day. This dictionary provides more detailed and comprehensive statistics on the preprocessed dataset. Although it does not directly provide data for the final calculations, it can be used to perform internal statistics. An example of this dictionary is shown in Table 3.
Store samples with email by IP and group by day
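The three dictionaries described above can be built in one pass over the preprocessed rows. The following is a minimal sketch: the function name `build_dictionaries` and the `(ip, email, date)` tuple layout are assumptions for illustration.

```python
from collections import defaultdict


def build_dictionaries(rows):
    """Build the three statistics dictionaries from failed-login rows.

    rows: iterable of (ip, email, date) tuples (assumed layout).
    """
    by_ip = defaultdict(int)                                   # ip -> total count
    by_day_ip = defaultdict(lambda: defaultdict(int))          # date -> ip -> count
    by_day_ip_email = defaultdict(lambda: defaultdict(list))   # date -> ip -> [emails]
    for ip, email, date in rows:
        by_ip[ip] += 1
        by_day_ip[date][ip] += 1
        by_day_ip_email[date][ip].append(email)
    return by_ip, by_day_ip, by_day_ip_email
```

Keeping the email list unaggregated in the third dictionary preserves duplicates, which the later entropy calculation needs in order to derive per-email probabilities.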
Since the dataset is derived from real enterprise email log files, there are inevitably a large number of normal user login behaviors. Even though only login failure entries are considered during preprocessing, there may still be unintentional login errors included. To differentiate these normal login behaviors from malicious brute force attacks, additional data filtering is required [12]. Firstly, the dictionary containing the statistics “by IP address” is filtered. If the number of login attempts by an IP address is below a certain threshold, it is discarded. Secondly, the dictionary containing the statistics “grouped by day and IP address” is filtered. If the number of login attempts by an IP address on a specific day exceeds a certain threshold, it is considered suspicious. If the number is below a certain threshold, the IP address and its data are also discarded. Lastly, the dictionary containing the statistics “grouped by day, IP address, and occurrence of email addresses” is filtered [13]. The number of entries in the three-level dictionary is calculated, and if it is below a certain threshold, indicating a low number of unique email addresses attempted to log in, the corresponding IP address is discarded.
Email data visualization
We will visualize the contents of the dictionary “grouped by day and IP address” to represent the trend of login attempts by IP address over time. After traversing the dictionary and making slight modifications, we set the horizontal axis as the dates and the vertical axis as the number of login attempts, depicting the variation in the quantity for each IP address. We compare the daily sum of login attempts for each IP address, sort the sums in descending order, and divide all the data into groups of 20 IP addresses each for display. This ensures a balanced vertical axis within each group, allowing for clear visualization of the trends for most IP addresses [14, 15].
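The ranking and grouping step can be sketched as follows. Plotting itself is omitted, and the function names `group_ips_for_plotting` and `daily_series` are hypothetical. Ties in the ranking are broken by IP string, which is an assumption made only to keep the output deterministic.

```python
def group_ips_for_plotting(by_day_ip, group_size=20):
    """Rank IPs by their total login attempts and split them into groups.

    by_day_ip: mapping date -> {ip: count}.
    Returns a list of IP lists, highest totals first.
    """
    totals = {}
    for day_counts in by_day_ip.values():
        for ip, count in day_counts.items():
            totals[ip] = totals.get(ip, 0) + count
    ranked = sorted(totals, key=lambda ip: (-totals[ip], ip))
    return [ranked[i:i + group_size] for i in range(0, len(ranked), group_size)]


def daily_series(by_day_ip, ip):
    """Return (dates, counts) for one IP, with 0 on days it does not appear."""
    days = sorted(by_day_ip)
    return days, [by_day_ip[day].get(ip, 0) for day in days]
```

Each group can then be drawn on its own axes, with `daily_series` supplying the horizontal (date) and vertical (attempt count) values for every IP in the group.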
The visualization result of all the data in a single graph is shown in Fig. 2.
The visualization result of all the data.
Data grouping visualization diagram 1.
Data grouping visualization diagram 2.
It can be observed that, in order to accommodate a few IP addresses with a large number of login attempts, the vertical axis in the visualization was stretched too much, resulting in poor presentation of most entries. To address this, we sorted the sums and grouped them into sets of 20, visualizing each group separately with a more reasonable vertical axis scale. Here, we have selected three groups that exhibit better visualization results. These are shown in Fig. 3, Fig. 4, and Fig. 5.
Data grouping visualization diagram 3.
From the comparison between Fig. 2 and the other three figures, it can be concluded that visualizing the data as a whole is not effective, while grouped displays make the data more intuitive. Additionally, Fig. 3 clearly shows that some IP addresses exhibit very similar line graphs, and some lines even overlap. This suggests that these login behaviors are organized and planned, indicating a distributed brute-force attack. Figures 4 and 5 also contain IP addresses with similar patterns. The IP address 193.223.64.18 in Fig. 4 and the IP address 109.248.135.34 in Fig. 5 show evenly distributed patterns, which can be considered brute-force attack behaviors. However, further judgment is required to determine whether they are low-frequency attacks.
While visual analysis and observations can provide insights into IP addresses, the judgment process and obtained results tend to be subjective and lack mathematical support, leading to unstable judgments and a lack of persuasiveness. Therefore, quantifying the data to achieve stability and objectivity becomes a better choice [16].
We have obtained the number of login attempts on each email address for each day. From these counts and their total, we can calculate the probability of each email address being targeted by a login attempt, and from these probabilities we can calculate the entropy. Additionally, because the overall quantities differ between groups, an additional evaluation function is needed to reduce the gap between the data. Entropy reflects the strategy of an IP address in attempting to log in on a given day, indicating the level of uncertainty in its choice of target accounts. This allows us to evaluate the attempted login behavior from a mathematical perspective [17].
Since this system is attack-oriented, we choose to analyze IP addresses. We calculate the average and standard deviation of the entropy for each IP address on a daily basis, using these two quantification measures from different dimensions to describe the attempted login behavior. If these two results are relatively similar, meaning they are close or overlapping after visualization, we can consider the corresponding IP addresses to have similar attack strategies, indicating a possible distributed brute-force attack. The formula for calculating entropy is shown in Eq. (1):
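For reference, the standard Shannon form consistent with the description above is given here. The symbols $c_i$, $p_i$, and $n$, and the base-2 logarithm, are assumptions for illustration rather than the paper's confirmed notation: $c_i$ is the number of attempts by one IP on one day against email address $i$, and $n$ is the number of distinct email addresses that IP attempted on that day.

```latex
p_i = \frac{c_i}{\sum_{j=1}^{n} c_j}, \qquad
H = -\sum_{i=1}^{n} p_i \log_2 p_i
```

An IP that targets a single account all day has $H = 0$, while an IP that spreads its attempts evenly over $n$ accounts reaches the maximum $H = \log_2 n$.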
Here, due to the different numbers of entries within each group, we need to use an entropy evaluation function to achieve balance. The formula for this function is shown in Eq. (2):
As an entropy evaluation function, the average entropy and standard deviation are calculated to describe the relationships between each subgroup and the overall pattern of the larger group. The average entropy is obtained by summing up the entropies of each subgroup and dividing it by the total number of subgroups. It is used to describe the average uncertainty present in the larger group. The formula for calculating the average entropy is shown in Eq. (3):
The standard deviation is used to quantify the stability of entropy between subgroups and, together with the average entropy, to describe the overall pattern. The formula for calculating the standard deviation of entropy is shown in Eq. (4):
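Written out under stated assumptions, the two statistics take the following form. Here $H_k$ denotes the (evaluated) entropy of subgroup $k$, that is, of one IP on one day, and $m$ is the number of subgroups for that IP. The population form (division by $m$) is an assumption; it is consistent with a standard deviation of 0 when an IP appears on only one day.

```latex
\bar{H} = \frac{1}{m} \sum_{k=1}^{m} H_k, \qquad
\sigma_H = \sqrt{\frac{1}{m} \sum_{k=1}^{m} \left( H_k - \bar{H} \right)^2}
```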
Analyzing the dataset, we grouped the results by day and IP, calculating the entropy of the occurrence probabilities for internal email addresses. We applied two different entropy evaluation functions to adjust the entropy values. Then, we further grouped the data by IP and calculated the mean and standard deviation of entropy for each day. We visualized the average and standard deviation of entropy for each IP.
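The quantification pipeline described above can be sketched as follows. This is a minimal illustration under stated assumptions: Shannon entropy with a base-2 logarithm, the population standard deviation, and hypothetical function names. Because the exact form of the entropy evaluation function is not reproduced here, it is left as an optional callable `evaluate(entropy, emails)`; passing nothing corresponds to the variant without an evaluation function.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def shannon_entropy(emails):
    """Entropy (base 2, an assumption) of the emails targeted in one subgroup."""
    counts = Counter(emails)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_profile(by_day_ip_email, evaluate=None):
    """Return {ip: (mean_entropy, std_entropy)} over the days each IP appears.

    by_day_ip_email: mapping date -> {ip: [emails attempted that day]}.
    evaluate: optional callable (entropy, emails) -> adjusted entropy;
              a placeholder for the entropy evaluation function.
    """
    per_ip = {}
    for ips in by_day_ip_email.values():
        for ip, emails in ips.items():
            h = shannon_entropy(emails)
            if evaluate is not None:
                h = evaluate(h, emails)
            per_ip.setdefault(ip, []).append(h)
    # pstdev of a single value is 0, so one-day IPs land at std = 0.
    return {ip: (mean(hs), pstdev(hs)) for ip, hs in per_ip.items()}
```

Each resulting `(mean, std)` pair is one data point in the scatter plots that follow, so IPs with similar attack strategies end up close together.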
Entropy evaluation function using the total count.
Next, we applied the k-means and DBSCAN clustering algorithms to these data points and compared their performance. Finally, we assessed the results under three different adjustments: using the total count as the entropy evaluation function, using the count after removing duplicates as the entropy evaluation function, and not applying any entropy evaluation function.
Entropy evaluation function using the count after deduplication.
Without an entropy evaluation function.
The visualizations in Figs 6–8 show a large number of data points at a standard deviation of 0. Although this might suggest highly stable behavior for many IPs, the value is not meaningful. Upon investigation, almost all of these IPs have data for only one day, so all of their attempts fall within that single day (a probability of 1) and the standard deviation of a single daily entropy value is necessarily 0. Hence, the data points with a standard deviation of 0 are not discussed here.
Figure 6 represents the visualization using the total count as the entropy evaluation function. The data is concentrated within two ranges, but the clustering algorithm does not perform well. Figure 8 represents the visualization without applying any entropy evaluation function. The data appears too scattered, and the clustering algorithm's performance is also unsatisfactory. Therefore, for logical consistency in the evaluation process, we chose to use the count after removing duplicates as the entropy evaluation function, which is shown in Fig. 7.
K-means result with 20 clusters.
DBSCAN results with a distance parameter of 0.015.
After evaluating uncertainty using entropy and describing IP attack strategies using the mean and standard deviation, we visualized the data by plotting each IP as a data point, with the mean on the y-axis and the standard deviation on the x-axis. However, it is difficult to identify groups of similar points in this plot by eye. Since nearby points in the visualization correspond to IPs that can be considered part of a distributed brute-force attack, the first approach is to consider distance-based clustering algorithms such as k-means and DBSCAN.
We present the results of dividing the data into 20 clusters using k-means, DBSCAN with a distance parameter of 0.015, and DBSCAN with a distance parameter of 0.03, as shown in Figs 9–11 respectively.
In Fig. 9, k-means divides all nodes into 20 clusters, resulting in each point more or less belonging to a certain cluster. However, a clear issue arises as many nodes with apparently different patterns are grouped together, causing many non-distributed brute-force attack behaviors to be mistakenly included. This significantly reduces the accuracy of recognition. During the experiment, attempts were made to decrease or increase the number of clusters, but smaller cluster values performed worse, while larger cluster values would increase system load and reduce recognition efficiency.
In Fig. 10, DBSCAN automatically divides the results into 10 clusters. It can be observed that the clusters in the upper left corner are more refined, indicating effective results and identifying intra-cluster data as distributed attacks. However, there is still an issue as the data on the right side does not have any cluster results. Therefore, Fig. 11 shows the results after adjusting the parameter, resulting in five new clusters. Here, these clusters can also be considered as distributed attacks. Thus, by combining the results of these two approaches, we obtain quantified results for distributed attacks.
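To make the role of the distance parameter concrete, the following is a compact pure-Python sketch of DBSCAN. It is an illustration of the algorithm, not the implementation used in the experiments. The `min_samples` parameter is an assumption because its value is not stated, and the quadratic neighbor search is suitable only for small inputs.

```python
import math


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise.

    points: list of (std, mean) pairs; eps: distance parameter;
    min_samples: neighbors (including the point itself) needed for a core point.
    """
    n = len(points)

    def neighbors(i):
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_samples:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_samples:  # only core points extend the cluster
                queue.extend(j_nbrs)
    return labels
```

Unlike k-means, no cluster count is supplied: points in sparse regions are simply labeled as noise, which is why increasing the distance parameter from 0.015 to 0.03 lets additional, looser groups form.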
DBSCAN results with a distance parameter of 0.03.
The above method is applicable to real data in the dataset, where the content undergoes multiple classification processes. Through visualization and quantification, potential brute-force attack behaviors are analyzed. However, the dataset itself contains a set of data files labeled based on real information. Except for the data type of identifiers, which is integer, all other data types, including the judgment results, are Boolean. With such a large amount of data and corresponding Boolean labels, it is natural to consider using a binary classification decision tree algorithm.
A decision tree divides the dataset multiple times by binary splits to ultimately achieve classification or prediction of samples. The results obtained from the analysis are shown in Fig. 12. Here, brackets of the same color indicate the same level.
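A single binary split of the kind described above can be sketched as follows. The splitting criterion is not stated, so Gini impurity is assumed here. The function names and the toy rows in the usage note are hypothetical and are not taken from the dataset.

```python
def gini(labels):
    """Gini impurity of a list of Boolean labels: 2p(1 - p)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(rows, features, target="label"):
    """Return (feature, weighted_gini) for the purest split on a Boolean feature.

    rows: list of dicts with Boolean feature values and a Boolean target.
    """
    best = None
    n = len(rows)
    for feature in features:
        left = [r[target] for r in rows if not r[feature]]   # feature is false
        right = [r[target] for r in rows if r[feature]]      # feature is true
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or score < best[1]:
            best = (feature, score)
    return best
```

Applied recursively to each side of the chosen split, this procedure yields the tree. On a toy set where "Abnormal geo" perfectly separates the labels and "VPN" does not, `best_split` returns `("Abnormal geo", 0.0)`.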
Decision tree generation results.
Certainly, the decision-making process based on the obtained results can be used for making judgments. However, it is evident that the previous visualization is not suitable for manual reading and analysis. To address this, we can use the matplotlib and networkx packages to visualize the results. Figure 13 shows the visualization of the results, demonstrating the network structure and relationships between different IPs. This visualization provides a clearer and more intuitive representation of the data, allowing for easier interpretation and analysis.
Decision tree visualization results.
The visualization image has been successfully generated, but the results are still not satisfactory. After trying several other methods without achieving the desired outcome, we decided to manually create a decision tree diagram. Please refer to Fig. 14 for the generated diagram.
It is worth noting that in this diagram, we have replaced “false” values in the original dataset with 0, and “true” values with 1. This substitution was done to alleviate memory pressure and facilitate the decision-making process.
Final decision tree image.
From Fig. 14, it is evident that the condition with the highest weight is “abnormal geo,” which refers to non-standard geographic locations. The next significant conditions are “BF attempt” and “User miss,” which indicate whether the account has previously experienced brute force attempts and user input errors, respectively. The importance of other conditions follows in a similar manner.
Based on the statistics, there are 246,796 unique IPs in the dataset, of which 38,220 IPs were flagged as suspicious login behavior by the data filtering module and 68,267 IPs were identified as brute force attack behavior by the machine learning decision tree. The comparable magnitudes of these two methods indicate a relatively high level of reliability for both. However, there are still differences in the data obtained by the two methods. Through analysis, it is believed that the main reason for this discrepancy is the suboptimal threshold settings in the data filtering module. In the data analysis module, the thresholds were set as 5, 2, and 50, indicating that an IP must meet the criteria of having a login attempt count greater than 5 on at least two days and a total login attempt count greater than 50 to be considered suspicious. It appears that these thresholds were set too strictly. After conducting experiments, it was found that fine-tuning the thresholds resulted in a closer number of IPs identified by the two methods, leading to more rigorous experimental results.
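The filtering rule with thresholds 5, 2, and 50 can be expressed directly in code, which also makes the thresholds easy to tune. This is a minimal sketch: the function and parameter names are hypothetical, and the input is assumed to be the day-by-IP dictionary of failed login counts.

```python
def suspicious_ips(by_day_ip, daily_min=5, min_days=2, total_min=50):
    """Return IPs with more than daily_min attempts on at least min_days days
    and more than total_min attempts overall.

    by_day_ip: mapping date -> {ip: failed login count}.
    """
    totals, active_days = {}, {}
    for day_counts in by_day_ip.values():
        for ip, count in day_counts.items():
            totals[ip] = totals.get(ip, 0) + count
            if count > daily_min:
                active_days[ip] = active_days.get(ip, 0) + 1
    return {
        ip for ip, total in totals.items()
        if total > total_min and active_days.get(ip, 0) >= min_days
    }
```

Lowering `daily_min` or `total_min` relaxes the filter and increases the number of flagged IPs, which is the direction of adjustment suggested by the comparison with the decision tree result.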
Conclusion
This paper has implemented two methods for detecting brute-force attacks on enterprise email systems. The system demonstrated low resource consumption, accuracy, and efficiency during testing. It provides functionalities such as data filtering and analysis, visualization, quantitative comparison, machine learning-assisted analysis, and binary classification decision-making. The thresholds in the system can also be adjusted according to different data environments, providing users with flexibility. The system builds on existing techniques and makes reasonable improvements and optimizations based on practical scenarios. It compares and adjusts results obtained from different methods. However, the system still has several limitations, including the following:
The system relies on log files for data analysis and cannot perform real-time analysis. If used in an enterprise, it may lead to data leakage and asset losses within a time window.
Threshold settings are subjective. Even when results are compared and thresholds adjusted, the adjustment is based on this specific dataset, and it is uncertain whether the thresholds will remain reliable when applied to other datasets. Prior knowledge and manual prediction of thresholds are still required before comparing results.
The system relies on labels generated by the email system. In this dataset, complete labels were available, simplifying the analysis process and resulting in more accurate results. However, if label files are not available, alternative methods for result verification need to be considered.
The system has limited throughput. In the current analysis, the dataset covered a one-month time span, and the system already encountered issues with slow analysis and insufficient memory. Although these issues were addressed gradually, they may become more severe when dealing with larger datasets.
To address the limitations mentioned above, future improvements and optimizations can be made in the following areas:
Long-term tracking of datasets from different time periods for analysis, and adding repeatedly identified suspicious IPs to a blacklist for real-time IP interception.
Collecting a large variety of datasets and performing analyses to obtain more suitable threshold selections.
Optimizing the system by exploring more efficient statistical approaches, reducing memory consumption, and striving to improve efficiency further.
Acknowledgments
This work was supported by National Natural Fund “The research of the trusted and security environment for high energy physics scientific computing system.”
