Abstract
Entities providing services based on Information and Communications Technologies (Internet access providers, landline and mobile, among others) are targets of malicious activities that cause millions in losses and affect their prestige. In order to prevent such damage, it is necessary to analyze event streams generated by service provision. Event streams have special features, such as high speeds and large amounts of data, as well as diversity of sources and formats. Therefore, the use of effective models that can be used in real time are required. Rule-based models are reported as one of the most used for malicious activities detection. In this paper, several classification rule-based models are discussed. For a better understanding of each model, their general schemes are outlined. Finally, identified problems in the models are presented.
Introduction
Nowadays, there are several entities such as telephone companies, banks, among others, providing services based on Information and Communications Technologies (ICT). The execution of malicious activities (MA) through such services, causes millions in losses to the affected entities [5]. The malicious activities include actions such as fraud in telecommunications services, telecommunications network intrusion and fraud in banking transactions.
Those entities providing ICT-based services, require techniques to detect as soon as possible events associated to occurrence of malicious activities. These techniques should be designed to run in complex scenarios. Due to their specific features, the scenarios where malicious activities take place are considered as complex scenarios. Some features of these scenarios are described below:
Data streams with high number of instances, being able to reach the order of millions in minutes. Instances with high number of features. High amount of different classes of malicious activities for classification. Data describing malicious activities can be modified in order to make fail the detection method.
A wide variety of techniques have been proposed for the malicious activities detection task [17, 21]. These techniques are based in specific models. However, there are three models quite used in practice: rule-based, anomaly-based and hybrid.
The classification rule-based model requires previous domain knowledge for the rule generation process (see Fig. 1). The rule generation process can be done manually by an analyst, or automatically using data mining methods. In the manual process, the analyst creates the rules based on known malicious activities. Otherwise, performing automatic rule generation (ARG) process, a labeled training collection is used, where each instance is labeled with its respective class (the class can be normal or some kind of MA). Such training collection is then processed by a data mining method, where the output is a rule set. The created rules are evaluated in real time, and if any rule is satisfied then an alert is raised. The rule-based model has some advantages, including: (1) high effectivity over MA already known, (2) detection of MA in real time, and (3) a better insight to analysts about how malicious activities are described. Moreover, the main drawbacks identified in this model are: (1) fails to detect subtle changes in data, and (2) the rule set needs to be frequently updated, in order to detect new MA.
Rule-based model for malicious activities detection.
The anomaly-based model tries to define normal behavior and detect abnormal actions (see Fig. 2). Generally, the normal behavior is defined using an unlabeled training collection consisting of historical information. Then, the normal behavior defined can be compared regarding the current behavior in order to determine if occur significant changes indicating a possible anomaly. This model has advantages among which stand out: (1) subtle changes in the subscribers behavior can be detected, and (2) a prior domain knowledge is not required, which allows to identify new and unknown malicious activities. On the other hand, the drawbacks include: (1) the anomalies are associated to MA (increase of false positives) and (2) malicious activities can not be detected in real time.
Anomaly-based model for malicious activities detection.
There are also methods [34, 36] based on hybrid model, which is a combination of the previous two models (see Fig. 3). Hybrid model uses anomaly detection [32] to label the training collection in an unsupervised way. Then, rules defining anomalies are generated. Just as in the rules-based model, the generated rules are evaluated in real time. The advantages of this model are: (1) anomalies can be detected in real time, (2) a prior domain knowledge is not required, and (3) it provides a better insight to analysts, about how anomalies are defined. The disadvantages are consistent with some drawbacks of the models described above, among which are included: (1) the anomalies are associated to MA (increase of false positives), and (2) the rule set needs to be frequently updated, in order to detect new anomalies.
Hybrid model for malicious activities detection.
Even with advancing technology, which promises more speed on data analysis and tighter service levels, the malicious activities detection and prevention remains a major challenge for most entities providing ICT services. More than 50% of malicious activities are detected after loss has occurred [33]. Moreover, in many cases, the selected model for malicious activities detection generates a large number of false positives affecting correct flow of information, and therefore the quality of service (QoS). In this sense, entities also need a model to detect MA in real time, making possible to prevent or minimize damage. In addition, such model should operate without affecting the QoS provided by the entities.
In real time MA detection, the activity that generates an alert is blocked by some techniques in order to prevent damage. In this sense, models such as anomaly-based and hybrid have disadvantages in common. They associate anomalies to malicious activity, when an anomaly may represent a new normal behavior, also the MA with similar behavior to normal ones are likely to go undetected, and they fail to exploit prior knowledge about many known attacks [18]. This results in an increase of the number of false positives, and makes normal activities to be blocked affecting the QoS. On the other hand, classification rule-based model uses domain knowledge in order to define representative rules of MA, making the generation of false positives to be unlikely. For this reason, classification rule-based models are widely used in the scenarios described above.
This paper presents a review of classification rule-based models for malicious activities detection. The purpose is to analyze such models, and determine traits that affect their effectiveness over malicious activity detection. This work consists of five sections. In this section, an introduction to the subject dealt was presented. Next, some basic concepts are introduced. In the third section, classification rule-based models for malicious activities detection are analyzed. Later, the problems identified in such models are discussed. Finally, in the fifth section investigation overall conclusions are presented.
In this section some basic concepts are presented for a better understanding of the models analyzed later.
Quality of service (QoS) in the field of telecommunications can be defined as a set of specific requirements by a network to users, which are necessary to achieve the required functionality of a service [8]. The QoS parameters and measures are necessary to provide an indication of how well a service is working. Each entity providing ICT-based services is responsible for guaranteeing a certain level of performance for the data stream, which will define the QoS offered by such entity.
In order to understand the concept of rule, it is necessary to define some elements first. Let
Given this, a classification rule
The fuzzy rules analyzed in this paper are a special type of classification rules
On the other hand, there are methodologies such as KDD [13], SEMMA [2] and CRISP-DM [6] for the application of data mining techniques, which helps to link different processes to form an efficient and effective model. According to the poll published in [23], there is a preference regarding the CRISP-DM methodology to be used in data mining projects. The CRoss-Industry Standard Process for Data Mining (CRISP-DM) was proposed by Chapman et al. [6]. CRISP-DM helps in the planning and execution of data mining models. In addition, it is a standard process for data mining techniques which are used for big data analysis [1]. As is shown in Fig. 4, there are six phases [38] defining life cycle of a data mining project: business understanding, data understanding, data preparation, modeling, evaluation and deployment. The arrows indicate the most important and frequent dependencies between phases. In a particular project, arrows can indicate which phase has to be performed next.
Phases of the CRISP-DM process model [38].
Classification rule-based models represent very effective solution for data analysis in real time. Manual rules generation process in scenarios described above has an important drawback. The problem is that the analyst must manually define the specific rules for each existing class, which is an extremely complex and time consuming task [12]. Given this, the automatic rules generation has started to play a key role in classification rule-based models for malicious activities detection [20].
The analysis of such models allows us to define a general model (see Fig. 5). In this model, a labeled training collection
General classification rule-based model for malicious activities detection.
[h] General_Rule_Generation (
[h] Rule_Evaluation (
Taking into account the features of the scenario, a cyclic model is required, that is, data mining does not end with the evaluation process. The achieved results during each step in the model and from the deployed solution can contribute to improve the bottom line. In this way, data mining processes will benefit from the experiences of previous ones. This fact causes the general model bears some resemblance to the CRISP-DM process model.
The methods based on ARG can detect malicious activity patterns and represent them as rules. This fact provides a better assistance to analysts, since outputs are defined as expressions in an understandable language, unlike other ones such as those based on anomalies detection [32], considered as black boxes. Automatic rule generation methods require a labeled training collection, where the instances composing it must be properly labeled with the class that defines them. If the data set does not satisfy such requirement, learning process for ARG can not be performed. The generated rules are used in evaluation process to classify new instances.
The expressiveness and suitability of graphs makes them widely used to model data in various application contexts, including malicious activities detection. An intrusion detection technique that creates rules from HTTP logs for e-voting protection was presented by Supeno et al. [35]. Here, a graph-based model was used (see Fig. 6). In order to collect requests, a honeypot was used. A honeypot is a set of software or computers whose intention is to attract attackers, pretending to be vulnerable or weak systems.
Graph-based model.
The collected requests by honeypot are preprocessed to ensure that unnecessary characters are not included. Here the analyst must specify such characters. Then, a graph clustering step is performed. Here, the requests are stored in a graph
After clustering all requests, the rule generation takes place. In this process, the graph
These steps are repeated periodically. Each newly arrived request will be directly added to the existing graph. Then, the rules generation process is repeated for each subgraph.
In this section, methods based on greedy search model (AQ21 [28], X2R [26], RIPPER [7], PART [15] and Ant-Miner [29]) are described. As shown in Fig. 7, the first step consists in preprocess the training collection for subsequent actions. This step is not mandatory within a greedy method, because the training collection may be ready to be processed in the following steps. However, some of the analyzed methods use preprocessing to discretize the training collection (X2R) or assign an initial value to certain state variables (RIPPER). Both cases require the analyst to define initial parameters.
Greedy search model.
Then, an iterative process is performed. Generally, the stopping criterion checks if there still remaining instances to be processed. However, there are greedy methods that follow other strategies besides the above; for example: AQ21 also checks if training collection contains positive instances (in our case, instances representing malicious activities), and RIPPER in addition verifies if the last found rule is not very complicated (complication degree of a rule is given in terms of the total description length [30]).
Within each iteration, a training collection subset that will be used to generate candidate rules is selected. Next, a first set of candidate rules is generated. Some methods require the analyst to define initial parameters; for example: RIPPER needs the minimum total weight of the instances in a rule, PART requires the confidence factor used for pruning, and Ant-Miner needs the max number of iterations. After having generated the first rule set, a filtering process is carried out, adding the best rules to the filtered rule set. Generally, a quality measure defined by the analyst is used for filtering process. At the end of each iteration, those instances that have been covered by the filtered rules are removed from the training collection. Table 1 contains a summary of iteration strategies reported, showing different ways to implement the following processes: instances selection, rules generation and rules filter.
Summary of iteration strategies reported in methods based on greedy search model
After all iterations have been completed, the greedy methods may perform a last filtering process over the filtered rule set. This last step is not mandatory in greedy methods. However, some methods (AQ21) use measures such as Lexicographic Evaluation Functional [28] (LEF), to remove some rules from the filtered rule set, and other methods (X2R and RIPPER) remove redundant rules in each class and replace specific rules by others more general.
Fuzzy logic often can be used to generate an approximate fuzzy rule set for malicious activities detection. A fuzzy rule based model for Peer-to-Peer (P2P) botnet detection is proposed by Barthakur et al. [3]. In data preprocessing step, a detail analysis of behavioral characteristic of botnet traffic flow was performed, after which, useful features for classification were selected from packet headers.
The rule generation step takes as input a reduced data set obtained in the previous step. Then, a fuzzy rule set is generated using FURIA [22] algorithm. This algorithm performs an adaptation of the RIPPER [7] algorithm, whereby a first rule set is computed. Such rules are transformed into fuzzy rules, by processing conditions associated to numeric features. Fuzzy rules derived by FURIA fulfill the definitions referred in Section 2. In this case, the
Decision tree model
There are several methods based on decision trees model, for example C5.0 [24] and CART [4]. Here, a rule is generated for each decision tree leaf, following the path from the root to leaf.
As shown in Fig. 8, the first step in these models is to generate a decision tree. The criteria used to build the decision tree can vary depending on the proposal. Also, the analyst must define some parameters like the minimum number of instances per leaf for both C5.0 and CART. Next, a pruning process is applied to the generated decision tree, obtaining a collection of trees. According to the used method, the collection may comprise one or more trees. In this process also should be defined some parameters like the number of folds in the internal cross-validation of CART, and the confidence factor used by C5.0 for pruning (smaller values incur more pruning). Finally, an optimization process is performed. This process can be applied to a collection of trees
Decision trees model.
Summary of iteration strategies reported in methods based on decision trees model
The rules obtained with C5.0 can contain irrelevant conditions, which must be generalized removing such conditions without affecting their accuracy. Suppose we have the following rule
For some data sets, the process to select the final tree in CART method is usually unstable. The solution is to use the one-standard-error rule proposed by Breiman et al. [4]. Applying this rule, CART may reduce instability in choosing the final tree, and find the simplest pruned subtree where its performance is comparable with that obtained by the most optimal subtree. After the final tree selection, the rule set is generated.
A learning task is considered incremental, if the training instances become available over time [9]. There are some proposals as FLORA [37], AQ11-PM+WAH [27], FACIL [14], VFDR [16] and RILL [10] based on incremental model for rule generation.
As shown in Fig. 9, when a new learning instance
Incremental model.
FLORA framework uses three rule sets: accepted descriptors (ADES), negative descriptors (NDES) and potential descriptors (PDES). ADES contains rules covering only positive instances, NDES only negatives ones and PDES contains rules matching with both, positive and negative instances. During the covering test step, if a new instance
In FACIL method, when a new instance
AQ11-PM+WAH method checks in the covering test step, if a new instance is misclassified by the existing rules. Misclassified instances are combined with the ones in the partial memory (instance set) to form the current
In VFDR method, when a new learning instance
In RILL as well as in the VFDR method, when a new learning instance
After reviewing the proposals presented in this paper, some limitations in terms of effectiveness on scenarios as described above were detected.
The complexity and robustness are characteristics that must be considered when you need to build a model. The more complex a model is, the less effective it will be to predict future instances. In models based on decision trees, the data preprocessing dedicated solely to detect missing data, this fact can lead the tree obtained to reach a high level of complexity. To prevent this from happening, stopping rules can be used during the process of building a decision tree to prevent the model becomes too complex. In some situations, stopping rules do not work well. The error consists in the assumption that an appropriate threshold can be set without much understanding of the data. This may results in a large number of rules, and many of these with irrelevant conditions. An alternative way is to grow a decision tree that is too large, and then use an effective pruning process to reduce the complexity of such tree. However, the number of instances to be processed in the MA detection scenarios is considerably large, therefore, if a data preprocessing is not considered to reduce the initial training collection, the model performance may be affected by the increased of memory consumption and processing time.
The limitation in terms of the features that can be processed, also can be a disadvantage in the scenarios described. For example, some methods based on greedy search model only can process categorical features, and to processing continuous features a discretization process must be carried out. This fact limits such models in terms of operators which can be employed by the generated rules; forcing them to use only operators such as “
Both, the decision tree and the greedy search model are designed for static environments. In this way, the concept drift which can be held in data flows is not contemplated. This fact makes difficult to detect malicious activities that have been modified to evade detection mechanism. In addition, the existing rules do not evolve automatically, which prevents them to adapt to possible changes in scenarios. This fact implies that eventually some conditions of the existing rules become irrelevant.
The methods based on the incremental model that were presented in this paper, have not been evaluated over large volumes of data. These methods store instances after the learning process. Taking into account the characteristics of the scenarios described above, its performance could be affected due to the high number of instances and features. Also, the incremental model lacks the data preprocessing stage, which could affect the amount and complexity of the rules.
Focusing on the general classification rule-based model for malicious activities detection presented above, six problems were detected. Two of them related to the data preprocessing and four related to the rules generation.
Problems with data preprocessing:
In some cases it is not performed, ignoring the data dimensionality and the high number of instances (Affects the efficiency of the model.). In cases where the data preprocessing is applied, the high number of instances in the training collection is ignored (Affects the efficiency of the model.).
Problems with rule generation:
High number of rules (Affects the efficiency of the model and QoS.). Rules with irrelevant conditions (Affects the effectiveness of the model.). Inconsistency between rules in some cases (Affects the effectiveness of the model and QoS.). Existing rules can not be automatically modified in some cases (Affects the effectiveness of the model.).
Malicious activities detection recently became a popular topic of research. Rule-based models are reported as one of the most used for detecting events associated to malicious activities in the shortest time possible. In this paper, several classification rules-based models for malicious activities detection were analyzed. In order to achieve greater effectiveness in malicious activities detection, such models must be able to handle problems related to the data dimensionality and concept drift. In addition, a model should not reduce the QoS level established by the entity where it will be deployed.
It would be interesting to perform a comparison of the presented methods on different data sets. Unfortunately, their implementations are not publicly available. In addition, some of the data sets used are not publicly available, which is due to privacy reasons and legal limitations. This fact makes it difficult a comparison between the achieved results by different methods.
Some problems identified during the models analysis, indicate that both, their effectiveness and QoS may be affected. Those problems can be addressed in future research for proposing new solutions, designed to make the malicious activities detection more effective task.
