Classification rule-based models for malicious activity detection

Abstract

Entities providing services based on Information and Communications Technologies (Internet access providers, landline and mobile telephony, among others) are targets of malicious activities that cause millions in losses and damage their reputation. To prevent such damage, it is necessary to analyze the event streams generated by service provision. Event streams have special features, such as high speed and large volumes of data, as well as diverse sources and formats. Therefore, effective models that can be used in real time are required. Rule-based models are reported as among the most widely used for malicious activity detection. In this paper, several classification rule-based models are discussed. For a better understanding of each model, their general schemes are outlined. Finally, the problems identified in the models are presented.

Keywords

Malicious activities detection, rule generation, data mining

1. Introduction

Nowadays, several entities, such as telephone companies and banks, provide services based on Information and Communications Technologies (ICT). The execution of malicious activities (MA) through such services causes millions in losses to the affected entities [5]. Malicious activities include fraud in telecommunications services, telecommunications network intrusion and fraud in banking transactions.

Entities providing ICT-based services require techniques to detect, as soon as possible, events associated with the occurrence of malicious activities. These techniques should be designed to run in complex scenarios: because of their specific features, the scenarios where malicious activities take place are considered complex. Some of these features are described below:

• Data streams with a high number of instances, which can reach the order of millions within minutes.
• Instances with a high number of features.
• A large number of different classes of malicious activities for classification.
• Data describing malicious activities can be modified in order to make the detection method fail.

A wide variety of techniques have been proposed for the malicious activity detection task [17, 21]. These techniques are based on specific models, three of which are widely used in practice: rule-based, anomaly-based and hybrid.

The classification rule-based model requires prior domain knowledge for the rule generation process (see Fig. 1). Rule generation can be done manually by an analyst, or automatically using data mining methods. In the manual process, the analyst creates the rules based on known malicious activities. In the automatic rule generation (ARG) process, a labeled training collection is used, in which each instance is labeled with its respective class (the class can be normal or some kind of MA). This training collection is then processed by a data mining method whose output is a rule set. The resulting rules are evaluated in real time, and if any rule is satisfied, an alert is raised. The rule-based model has several advantages, including: (1) high effectiveness on already known MA, (2) detection of MA in real time, and (3) better insight for analysts into how malicious activities are described. The main drawbacks identified in this model are: (1) it fails to detect subtle changes in data, and (2) the rule set needs to be updated frequently in order to detect new MA.

Figure 1.
Rule-based model for malicious activities detection.

The anomaly-based model tries to define normal behavior and detect abnormal actions (see Fig. 2). Generally, normal behavior is defined using an unlabeled training collection consisting of historical information. The defined normal behavior is then compared with the current behavior in order to determine whether significant changes have occurred, indicating a possible anomaly. The advantages of this model include: (1) subtle changes in subscriber behavior can be detected, and (2) prior domain knowledge is not required, which allows new and unknown malicious activities to be identified. On the other hand, the drawbacks include: (1) anomalies are associated with MA (which increases the number of false positives), and (2) malicious activities cannot be detected in real time.

Figure 2.
Anomaly-based model for malicious activities detection.

There are also methods [34, 36] based on the hybrid model, which is a combination of the previous two models (see Fig. 3). The hybrid model uses anomaly detection [32] to label the training collection in an unsupervised way. Then, rules defining anomalies are generated. As in the rule-based model, the generated rules are evaluated in real time. The advantages of this model are: (1) anomalies can be detected in real time, (2) prior domain knowledge is not required, and (3) it provides analysts with better insight into how anomalies are defined. Its disadvantages coincide with some drawbacks of the models described above, including: (1) anomalies are associated with MA (which increases the number of false positives), and (2) the rule set needs to be updated frequently in order to detect new anomalies.

Figure 3.
Hybrid model for malicious activities detection.

Even with advancing technology, which promises faster data analysis and tighter service levels, malicious activity detection and prevention remain a major challenge for most entities providing ICT services. More than 50% of malicious activities are detected after the loss has occurred [33]. Moreover, in many cases, the model selected for malicious activity detection generates a large number of false positives, affecting the correct flow of information and, therefore, the quality of service (QoS). Entities thus need a model that detects MA in real time, making it possible to prevent or minimize damage. In addition, such a model should operate without affecting the QoS provided by the entities.

In real-time MA detection, the activity that generates an alert is blocked by some technique in order to prevent damage. In this respect, the anomaly-based and hybrid models share several disadvantages. They associate anomalies with malicious activity even though an anomaly may represent a new normal behavior; MA whose behavior is similar to normal behavior are likely to go undetected; and they fail to exploit prior knowledge about many known attacks [18]. This increases the number of false positives and causes normal activities to be blocked, affecting the QoS. On the other hand, the classification rule-based model uses domain knowledge to define rules that are representative of MA, which makes false positives unlikely. For this reason, classification rule-based models are widely used in the scenarios described above.

This paper presents a review of classification rule-based models for malicious activity detection. The purpose is to analyze such models and determine the traits that affect their effectiveness in detecting malicious activity. This work consists of five sections. This first section has introduced the subject. Next, some basic concepts are introduced. In the third section, classification rule-based models for malicious activity detection are analyzed. Then, the problems identified in such models are discussed. Finally, the fifth section presents the overall conclusions of the investigation.

2. Basic concepts

In this section some basic concepts are presented for a better understanding of the models analyzed later.

Quality of service (QoS) in the field of telecommunications can be defined as a set of specific requirements by a network to users, which are necessary to achieve the required functionality of a service [8]. The QoS parameters and measures are necessary to provide an indication of how well a service is working. Each entity providing ICT-based services is responsible for guaranteeing a certain level of performance for the data stream, which will define the QoS offered by such entity.

In order to understand the concept of rule, it is necessary to define some elements first. Let $F_{1},F_{2},...,F_{k}$ be sets of features. An instance $n=(f_{1},f_{2},...,f_{k})$ can be defined as a vector belonging to the universe $U=F_{1}\times F_{2}\times...\times F_{k}$ , where $f_{i}\in F_{i}$ for every $1\leqslant i\leqslant k$ .

Given this, a classification rule $r$ in the universe $U\times C$ , is represented by $r=\langle\overleftarrow{r},\overrightarrow{r}\rangle$ , where $C=\{c_{1},c_{2},...,c_{l}\}$ is a set of classes with $l$ being an integer $l\geqslant 2$ , $\overleftarrow{r}$ is the premise part of $r$ and $\overrightarrow{r}$ is an equality condition defined on the corresponding component classes $(C)$ . The premise part $\overleftarrow{r}$ is a conjunction of conditions. A single condition $m$ can be represented as $f_{i}\oplus\nu$ , where $f_{i}$ is a conditional feature, the variable $\nu$ is a value from the domain of $f_{i}$ , and $\oplus$ is a relational operator from the set of relations $\{<,\leqslant,=,\neq,>,\geqslant,\in\}$ . For simplicity, a rule is usually represented as $\langle\overleftarrow{r},c_{i}\rangle$ , where $c_{i}\in C$ . The intuitive meaning of a rule is that the fulfillment of the $\overleftarrow{r}$ premise in an instance implies that such instance belongs to the class $c_{i}$ .
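The definition above can be illustrated with a minimal Python sketch. The names `Condition`, `Rule`, `holds` and `covers`, as well as the example features and the class name "fraud", are assumptions introduced here for illustration and are not part of any reviewed method. An instance is represented as a tuple $(f_{1},...,f_{k})$ and a rule as a conjunction of conditions plus a class.

```python
import operator
from dataclasses import dataclass
from typing import Any, Callable, Dict, Sequence, Tuple

# Relational operators from the set {<, <=, =, !=, >, >=, in}.
OPERATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "<": operator.lt,
    "<=": operator.le,
    "=": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "in": lambda value, allowed: value in allowed,
}


@dataclass(frozen=True)
class Condition:
    feature_index: int  # position i of the feature f_i in the instance
    op: str             # one of the keys of OPERATORS
    value: Any          # value (or set of values) from the domain of f_i

    def holds(self, instance: Sequence[Any]) -> bool:
        return OPERATORS[self.op](instance[self.feature_index], self.value)


@dataclass(frozen=True)
class Rule:
    premise: Tuple[Condition, ...]  # conjunction of conditions
    label: str                      # class c_i assigned when the premise holds

    def covers(self, instance: Sequence[Any]) -> bool:
        # The premise is fulfilled only if every condition holds.
        return all(condition.holds(instance) for condition in self.premise)


# Hypothetical rule: more than 100 calls to an international or premium destination.
rule = Rule(
    premise=(
        Condition(0, ">", 100),
        Condition(1, "in", frozenset({"intl", "premium"})),
    ),
    label="fraud",
)
print(rule.covers((150, "intl")))
```

An instance that fulfills every condition of the premise is assigned the class of the rule; if any condition fails, the rule does not apply to that instance.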

The fuzzy rules analyzed in this paper are a special type of classification rules $\langle\overleftarrow{r},c_{i}\rangle$ , where the conditions in $\overleftarrow{r}$ differ somewhat from those of ordinary classification rules. Non-fuzzy rules produce conditions with exact boundaries, while fuzzy rules produce conditions with soft boundaries. For example, a condition constraining a numerical feature $f_{i}$ (with domain $\mathbb{D}_{i}=\mathbb{R}$ ) can be expressed in the form ( $f_{i}\in I$ ), where $I\subseteq\mathbb{R}$ is an interval: $I=(-\infty,x]$ if the rule contains a condition $(f_{i}\leqslant x)$ , or $I=[y,\infty)$ if it contains a condition $(f_{i}\geqslant y)$ . In the fuzzy case, such an interval is defined using a membership function $\lambda$ .

On the other hand, there are methodologies such as KDD [13], SEMMA [2] and CRISP-DM [6] for the application of data mining techniques, which help to link different processes into an efficient and effective model. According to the poll published in [23], CRISP-DM is the preferred methodology for data mining projects. The CRoss-Industry Standard Process for Data Mining (CRISP-DM) was proposed by Chapman et al. [6]. CRISP-DM helps in the planning and execution of data mining models. In addition, it is a standard process for the data mining techniques used in big data analysis [1]. As shown in Fig. 4, six phases [38] define the life cycle of a data mining project: business understanding, data understanding, data preparation, modeling, evaluation and deployment. The arrows indicate the most important and frequent dependencies between phases. In a particular project, the arrows can indicate which phase has to be performed next.

Figure 4.

Phases of the CRISP-DM process model [38].

3. Classification rule-based models for malicious activities detection

Classification rule-based models are a very effective solution for real-time data analysis. However, manual rule generation in the scenarios described above has an important drawback: the analyst must manually define the specific rules for each existing class, which is an extremely complex and time-consuming task [12]. For this reason, automatic rule generation has started to play a key role in classification rule-based models for malicious activity detection [20].

The analysis of such models allows us to define a general model (see Fig. 5). This model requires a labeled training collection $D_{t}$ , whose instances are properly labeled with the class that defines them as malicious or legitimate. First, a data preprocessing method is applied, which reduces the volume of the training collection and improves its quality by removing noisy data [19] (see General_Rule_Generation, line 3). Next, the resulting data collection $\widetilde{D_{t}}$ is processed by a rule generation algorithm [20] (see General_Rule_Generation, line 4), which creates a rule set $R$ used to classify new instances during the rule evaluation process (see Rule_Evaluation). When a rule is satisfied during the evaluation process, an alert is generated (see Rule_Evaluation, lines 3–5) and sent to the user interface, where the analyst is notified. Through the user interface, the analyst can check how the rules are behaving and determine whether to prepare a new training collection in order to update the existing rule set. In addition, this makes it possible to better adjust the parameters of each process and obtain an increasingly effective model. Note that the analyst can interact with the preprocessing and rule generation processes by adjusting certain parameters that contribute to improving the final rule set; for example, $P_{1}$ can represent the number of features to be selected during preprocessing, and $P_{2}$ can represent a threshold used to select rules during rule generation.

Figure 5.

General classification rule-based model for malicious activities detection.

Algorithm General_Rule_Generation( $D_{t},P_{1},P_{2}$ )

Input: $D_{t}$ - labeled training collection, $P_{1}$ - preprocessing parameter, $P_{2}$ - rule generation parameter
Output: $R$ - rule set

$R\leftarrow\emptyset$ ;
$\widetilde{D_{t}}\leftarrow\emptyset$ ;
$\widetilde{D_{t}}\leftarrow$ Preprocessing( $D_{t},P_{1}$ );
$R\leftarrow$ Rule_Generation( $\widetilde{D_{t}},P_{2}$ );
return $R$ ;

Algorithm Rule_Evaluation( $R, n$ )

Input: $R$ - rule set obtained by General_Rule_Generation, $n$ - new instance
Output: $A$ - alert

$A\leftarrow NULL$ ;
for each $r\in R$ do
    if $r$ covers $n$ then
        $A\leftarrow New\_Alert(r)$ ;
return $A$ ;
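The two algorithms can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the helpers `keep_first_features` and `threshold_rules` are hypothetical stand-ins for the preprocessing and rule generation steps (no reviewed method is implemented), $P_{1}$ is taken to be the number of features to keep, and $P_{2}$ a minimum number of supporting instances. As in the pseudocode, the evaluation loop does not stop at the first match, so the last satisfied rule determines the alert.

```python
from typing import Callable, List, Optional, Tuple

Instance = Tuple[float, ...]
Labeled = Tuple[Instance, str]
Rule = Tuple[Callable[[Instance], bool], str]  # (premise, class)


def general_rule_generation(d_t, p1, p2, preprocessing, rule_generation) -> List[Rule]:
    """Preprocess the labeled collection, then generate the rule set R."""
    reduced = preprocessing(d_t, p1)
    return rule_generation(reduced, p2)


def rule_evaluation(rules: List[Rule], n: Instance) -> Optional[str]:
    """Return an alert if some rule covers n, otherwise None."""
    alert = None
    for premise, label in rules:
        if premise(n):
            alert = f"ALERT: instance {n} matched a rule for class '{label}'"
    return alert


# Hypothetical preprocessing: keep only the first p1 features of every instance.
def keep_first_features(d_t: List[Labeled], p1: int) -> List[Labeled]:
    return [(x[:p1], y) for x, y in d_t]


# Hypothetical rule generation: one threshold rule on feature 0 for every
# non-normal class that has at least p2 training instances.
def threshold_rules(d_t: List[Labeled], p2: int) -> List[Rule]:
    rules: List[Rule] = []
    for cls in sorted({y for _, y in d_t if y != "normal"}):
        values = [x[0] for x, y in d_t if y == cls]
        if len(values) >= p2:
            low = min(values)
            rules.append((lambda n, low=low: n[0] >= low, cls))
    return rules


training = [((5.0, 1.0), "normal"), ((120.0, 3.0), "fraud"), ((150.0, 2.0), "fraud")]
rule_set = general_rule_generation(training, 1, 2, keep_first_features, threshold_rules)
print(rule_evaluation(rule_set, (130.0,)))
```

Passing the preprocessing and rule generation steps as arguments reflects the fact that the general model does not fix either of them.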

Taking into account the features of the scenario, a cyclic model is required; that is, data mining does not end with the evaluation process. The results achieved during each step of the model and by the deployed solution can contribute to improving the final result. In this way, data mining processes benefit from the experience of previous ones. For this reason, the general model bears some resemblance to the CRISP-DM process model.

Methods based on ARG can detect malicious activity patterns and represent them as rules. This provides better assistance to analysts, since the outputs are expressed in an understandable language, unlike those of other methods, such as those based on anomaly detection [32], which are considered black boxes. Automatic rule generation methods require a labeled training collection, in which each instance is properly labeled with the class that defines it. If the data set does not satisfy this requirement, the learning process for ARG cannot be performed. The generated rules are used in the evaluation process to classify new instances.

3.1 A graph-based model

The expressiveness and suitability of graphs make them widely used to model data in various application contexts, including malicious activity detection. An intrusion detection technique that creates rules from HTTP logs for e-voting protection was presented by Supeno et al. [35], using a graph-based model (see Fig. 6). A honeypot was used to collect requests. A honeypot is a set of software or computers intended to attract attackers by pretending to be vulnerable or weak systems.

Figure 6.

Graph-based model.

The requests collected by the honeypot are preprocessed to ensure that unnecessary characters are not included; the analyst must specify such characters. Then, a graph clustering step is performed, in which the requests are stored in a graph $G=\{V,E\}$ , where $V$ is a set of vertices and $E$ is a set of edges connecting the vertices. Each vertex $v\in V$ denotes an incoming request. After a new vertex is created, it is compared with the existing vertices to find how similar it is to them. The Euclidean distance is used to measure the distance between two vertices, and the label of the edge between them is the Euclidean distance value. Two vertices are connected if their distance is smaller than a threshold predetermined by the analyst, or smaller than the previous smallest distance of the vertex. The resulting graph $G$ can be a disconnected graph, consisting of a set $S$ of completely separated subgraphs.

After all requests have been clustered, rule generation takes place. In this process, the graph $G$ is pruned by a minimum spanning tree algorithm, and the resulting subgraphs in $S$ are analyzed. After pruning, each subgraph $s\in S$ is used to generate a rule, subject to a threshold $T_{v}$ defined by the analyst, which is the minimum number of vertices in a subgraph. If the number of vertices of a subgraph $s_{1}$ does not exceed $T_{v}$ , then $s_{1}$ is not used to generate a rule; if it exceeds $T_{v}$ , a rule is generated from $s_{1}$ . First, each root vertex is found, and then the subgraph $s$ of that root vertex is traversed. When a vertex is visited, its request data is taken as a string. Then, the longest common substring among all the collected strings is searched for. From the longest common substring found, an attack rule is created.
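The clustering and rule generation steps can be sketched as follows. This is a simplified illustration, not the implementation of [35]: it assumes that every request already has a numeric feature vector (how requests are vectorized is not specified here), it connects vertices using only the analyst threshold, and it omits the minimum spanning tree pruning, which does not change which vertices belong to each separated subgraph. The function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def connected_components(vectors: Sequence[Sequence[float]], threshold: float) -> List[List[int]]:
    """Group vertex indices whose Euclidean distance is below the threshold."""
    parent = list(range(len(vectors)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if math.dist(vectors[i], vectors[j]) < threshold:
                parent[find(i)] = find(j)

    groups: Dict[int, List[int]] = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def longest_common_substring(strings: Sequence[str]) -> str:
    """Brute-force search, adequate only for short request strings."""
    shortest = min(strings, key=len)
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in s for s in strings):
                return candidate
    return ""


def generate_signatures(requests, vectors, threshold, t_v) -> List[str]:
    """One signature per subgraph whose number of vertices exceeds t_v."""
    signatures = []
    for component in connected_components(vectors, threshold):
        if len(component) > t_v:
            signatures.append(longest_common_substring([requests[i] for i in component]))
    return signatures


requests = [
    "GET /index.php?id=1' OR '1'='1",
    "GET /item.php?id=7' OR '1'='1",
    "GET /home",
]
vectors = [(1.0, 0.0), (1.1, 0.0), (9.0, 9.0)]  # hypothetical feature vectors
print(generate_signatures(requests, vectors, threshold=0.5, t_v=1))
```

In this toy example, the two similar requests fall into the same subgraph and their longest common substring becomes the attack signature, while the isolated request is discarded because its subgraph does not exceed $T_{v}$.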

These steps are repeated periodically. Each newly arrived request will be directly added to the existing graph. Then, the rules generation process is repeated for each subgraph.

3.2 Greedy search model

In this section, methods based on the greedy search model (AQ21 [28], X2R [26], RIPPER [7], PART [15] and Ant-Miner [29]) are described. As shown in Fig. 7, the first step consists of preprocessing the training collection for subsequent actions. This step is not mandatory in a greedy method, because the training collection may already be ready for the following steps. However, some of the analyzed methods use preprocessing to discretize the training collection (X2R) or to assign an initial value to certain state variables (RIPPER). Both cases require the analyst to define initial parameters.

Figure 7.

Greedy search model.

Then, an iterative process is performed. Generally, the stopping criterion checks whether there are still instances remaining to be processed. However, some greedy methods apply additional strategies; for example, AQ21 also checks whether the training collection contains positive instances (in our case, instances representing malicious activities), and RIPPER additionally verifies that the last rule found is not overly complicated (the degree of complication of a rule is given in terms of the total description length [30]).

Within each iteration, a subset of the training collection is selected to generate candidate rules. Next, a first set of candidate rules is generated. Some methods require the analyst to define initial parameters; for example, RIPPER needs the minimum total weight of the instances in a rule, PART requires the confidence factor used for pruning, and Ant-Miner needs the maximum number of iterations. After the first rule set has been generated, a filtering process is carried out, adding the best rules to the filtered rule set. Generally, a quality measure defined by the analyst is used for the filtering process. At the end of each iteration, the instances covered by the filtered rules are removed from the training collection. Table 1 summarizes the reported iteration strategies, showing different ways to implement the following processes: instances selection, rules generation and rules filter.

Table 1

Summary of iteration strategies reported in methods based on greedy search model

	AQ21	Ant-Miner	PART	X2R	RIPPER
Instances selection	Select a positive instance called seed.	The selection process is not performed. The entire training collection is used for the following steps.		Select a subset with the most frequent instances in the training collection.	$2/3$ of the instances in the training collection are randomly selected.
Rules generation	A set of rules (approximate star) is obtained for the seed, none of which covers any negative instance.	A set of rules is computed using the ant colony method.	Using the C4.5 algorithm, the simplified decision tree associated with the training collection is computed. Then, the leaf covering the most instances is found, and the path leading from the root to that leaf is determined. The obtained path is represented as a rule.	An empty rule is created, and conditions learned from the training subset obtained in the previous step are added until the rule no longer covers instances of other classes. Each condition is added so as to maximize a quality measure defined according to the application context.
Rules filter	The rules are selected given a quality measure defined according to the application context.	Given a quality measure defined according to the application context, only the best rule is selected and the remaining ones are discarded.	No actions performed.		The rule is replaced with a more general one by removing conditions from the premise.

After all iterations have been completed, greedy methods may perform a final filtering process over the filtered rule set. This last step is not mandatory. However, some methods (AQ21) use measures such as the Lexicographic Evaluation Functional (LEF) [28] to remove some rules from the filtered rule set, while others (X2R and RIPPER) remove redundant rules in each class and replace specific rules with more general ones.
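The iterative scheme described in this section can be summarized with a generic sequential covering sketch. It is not an implementation of AQ21, X2R, RIPPER, PART or Ant-Miner; it assumes categorical features, equality conditions only, and precision as the quality measure, and all names are hypothetical. Each iteration grows one rule by adding the condition with the highest precision until no instance of another class is covered, and then removes the covered instances.

```python
from typing import Dict, List, Tuple

Instance = Dict[str, str]
Condition = Tuple[str, str]  # (feature, value), read as "feature = value"


def covers(premise: List[Condition], x: Instance) -> bool:
    return all(x.get(feature) == value for feature, value in premise)


def grow_rule(data: List[Tuple[Instance, str]], target: str) -> List[Condition]:
    """Start from an empty premise and add the most precise condition each time."""
    premise: List[Condition] = []
    covered = list(data)
    while any(y != target for _, y in covered):
        # Candidate conditions come from the positive instances still covered.
        candidates = sorted(
            {(f, v) for x, y in covered if y == target for f, v in x.items()} - set(premise)
        )
        best, best_precision = None, -1.0
        for feature, value in candidates:
            subset = [y for x, y in covered if x.get(feature) == value]
            precision = sum(y == target for y in subset) / len(subset)
            if precision > best_precision:
                best, best_precision = (feature, value), precision
        if best is None:  # no condition left to add (inconsistent data)
            break
        premise.append(best)
        covered = [(x, y) for x, y in covered if covers([best], x)]
    return premise


def sequential_covering(data: List[Tuple[Instance, str]], target: str):
    """Learn rules for the target class until no positive instance remains."""
    remaining = list(data)
    rules = []
    while any(y == target for _, y in remaining):
        premise = grow_rule(remaining, target)
        rules.append((premise, target))
        # Remove the instances covered by the new rule.
        remaining = [(x, y) for x, y in remaining if not covers(premise, x)]
    return rules


data = [
    ({"dest": "intl", "hour": "night"}, "fraud"),
    ({"dest": "intl", "hour": "day"}, "normal"),
    ({"dest": "local", "hour": "night"}, "normal"),
    ({"dest": "premium", "hour": "day"}, "fraud"),
]
for premise, label in sequential_covering(data, "fraud"):
    print(premise, "->", label)
```

The reviewed methods differ from this sketch mainly in how instances are selected, how candidate rules are generated and how they are filtered, as summarized in Table 1.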

3.3 A fuzzy rules model

Fuzzy logic can often be used to generate an approximate fuzzy rule set for malicious activity detection. A fuzzy rule-based model for Peer-to-Peer (P2P) botnet detection was proposed by Barthakur et al. [3]. In the data preprocessing step, a detailed analysis of the behavioral characteristics of botnet traffic flows was performed, after which useful features for classification were selected from packet headers.

The rule generation step takes as input the reduced data set obtained in the previous step. Then, a fuzzy rule set is generated using the FURIA [22] algorithm. This algorithm uses an adaptation of the RIPPER [7] algorithm to compute a first rule set. These rules are then transformed into fuzzy rules by processing the conditions associated with numeric features. Fuzzy rules derived by FURIA fulfill the definitions given in Section 2. In this case, the membership function $\lambda$ is defined as a trapezoidal function. Moreover, the premise of these rules is formed by the logical conjunction of one or more conditions. Therefore, the degree to which a rule covers an instance is given by the product of the membership values of all the conditions that make up the premise.
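A trapezoidal membership function and the product-based coverage can be sketched as follows. The parameterization $(a,b,c,d)$ , with full membership on $[b,c]$ and zero membership outside $(a,d)$ , and the function names are assumptions made for illustration; the sketch does not reproduce how FURIA learns these parameters.

```python
import math
from typing import Sequence, Tuple

Trapezoid = Tuple[float, float, float, float]  # (a, b, c, d) with a <= b <= c <= d


def trapezoid(x: float, a: float, b: float, c: float, d: float) -> float:
    """Membership of x: 1 on [b, c], linear on (a, b) and (c, d), 0 elsewhere.

    A one-sided interval is encoded with a = b = -inf or c = d = +inf.
    """
    if b <= x <= c:
        return 1.0
    if a < x < b:
        return (x - a) / (b - a)
    if c < x < d:
        return (d - x) / (d - c)
    return 0.0


def coverage(premise: Sequence[Tuple[int, Trapezoid]], instance: Sequence[float]) -> float:
    """Product of the membership degrees of all conditions in the premise."""
    return math.prod(trapezoid(instance[i], *params) for i, params in premise)


# Hypothetical fuzzy premise over two numeric features.
premise = [
    (0, (0.0, 2.0, 8.0, 10.0)),               # soft version of 2 <= f_0 <= 8
    (1, (-math.inf, -math.inf, 3.0, 5.0)),    # soft version of f_1 <= 3
]
print(coverage(premise, (1.0, 4.0)))
```

Unlike a crisp rule, which either covers an instance or does not, the fuzzy rule returns a degree of coverage between 0 and 1.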

3.4 Decision tree model

Several methods are based on the decision tree model, for example C5.0 [24] and CART [4]. Here, a rule is generated for each leaf of the decision tree, following the path from the root to the leaf.

As shown in Fig. 8, the first step in these models is to generate a decision tree. The criteria used to build the decision tree vary depending on the proposal. The analyst must also define some parameters, such as the minimum number of instances per leaf for both C5.0 and CART. Next, a pruning process is applied to the generated decision tree, producing a collection of trees. Depending on the method used, the collection may comprise one or more trees. Some parameters must also be defined in this process, such as the number of folds in the internal cross-validation of CART and the confidence factor used by C5.0 for pruning (smaller values incur more pruning). Finally, an optimization process is performed. This process can be applied to a collection of trees $T$ , or to a rule set $R$ obtained from a decision tree $t_{1}$ , where $t_{1}\in T$ . Table 2 contains the strategy followed in each step by the C5.0 and CART methods.

Figure 8.

Decision trees model.

Table 2

Summary of iteration strategies reported in methods based on decision trees model

	C5.0	CART
Decision tree generation	The maximum information gain criterion is used to determine which condition must be placed in a given decision node.	The Gini gain criterion is used to assess which division is optimal for partitioning the training set at a specific node.
Decision tree pruning	The pruning method begins at the bottom of the tree and examines each subtree that is not a leaf. If the estimated error rate is reduced by replacing the subtree under analysis with a leaf or with its most frequently used branch, then the tree is pruned by making that replacement.	The cost-complexity value is computed several times during the pruning process, each time as a result of replacing a branch of the tree by the root node of that branch. The lowest cost-complexity value obtained in this process indicates where the tree should be pruned. This process is repeated recursively until the subtree to be pruned consists only of the root node. Each time the pruning process is executed, the subtree obtained is added to a collection of subtrees.
Optimizing	The rules of the decision tree are extracted and optimized by a generalization process.	The optimal subtree from the collection of subtrees obtained previously is selected using cross-validation. Next, the selected subtree is used to create rules.
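The two splitting criteria named in Table 2 can be computed as in the following sketch, which uses the standard definitions of entropy and Gini impurity. It only evaluates a given partition of the class labels at a node; it does not build a tree, and the function names are chosen here for illustration.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of the class distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels: Sequence[str]) -> float:
    """Gini impurity of the class distribution."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gain(labels: Sequence[str], partitions: List[Sequence[str]],
         impurity: Callable[[Sequence[str]], float]) -> float:
    """Impurity of the node minus the weighted impurity of its partitions."""
    n = len(labels)
    weighted = sum(len(p) / n * impurity(p) for p in partitions if p)
    return impurity(labels) - weighted


node = ["mal", "mal", "ok", "ok"]
perfect_split = [["mal", "mal"], ["ok", "ok"]]
useless_split = [["mal", "ok"], ["mal", "ok"]]

print(gain(node, perfect_split, entropy))  # information gain
print(gain(node, perfect_split, gini))     # Gini gain
print(gain(node, useless_split, entropy))
```

With `entropy` as the impurity, `gain` is the information gain; with `gini`, it is the Gini gain. In both cases the split with the highest gain is preferred at a node.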

The rules obtained with C5.0 can contain irrelevant conditions; such rules must be generalized by removing these conditions without affecting their accuracy. Suppose we have a rule $r_{1}=\langle\overleftarrow{r_{1}},c\rangle$ and a more general rule $r_{2}=\langle\overleftarrow{r_{2}},c\rangle$ , where $\overleftarrow{r_{2}}$ is the result of removing a condition $m$ from $\overleftarrow{r_{1}}$ . The criteria for selecting $m$ may vary according to the application context.

For some data sets, the process of selecting the final tree in the CART method is unstable. The solution is to use the one-standard-error rule proposed by Breiman et al. [4]. By applying this rule, CART may reduce the instability in choosing the final tree and find the simplest pruned subtree whose performance is comparable to that of the optimal subtree. After the final tree is selected, the rule set is generated.
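The generalization step can be illustrated with the following sketch. Since the criterion for selecting $m$ depends on the application context, the one used here is an assumption: a condition is dropped whenever the precision of the rule on the training data does not decrease. The function names and the toy data are hypothetical, and this is not the procedure implemented in C5.0.

```python
from typing import Dict, List, Tuple

Instance = Dict[str, str]
Condition = Tuple[str, str]  # (feature, value), read as "feature = value"


def precision(premise: List[Condition], label: str,
              data: List[Tuple[Instance, str]]) -> float:
    """Fraction of the instances covered by the premise that belong to the class."""
    covered = [y for x, y in data if all(x.get(f) == v for f, v in premise)]
    return sum(y == label for y in covered) / len(covered) if covered else 0.0


def generalize(premise: List[Condition], label: str,
               data: List[Tuple[Instance, str]]) -> List[Condition]:
    """Greedily remove conditions as long as precision does not decrease."""
    premise = list(premise)
    improved = True
    while improved and premise:
        improved = False
        base = precision(premise, label, data)
        for m in list(premise):
            reduced = [c for c in premise if c != m]
            if precision(reduced, label, data) >= base:
                premise = reduced  # r_2 replaces r_1
                improved = True
                break
    return premise


data = [
    ({"dest": "intl", "hour": "night", "plan": "a"}, "fraud"),
    ({"dest": "intl", "hour": "night", "plan": "b"}, "fraud"),
    ({"dest": "intl", "hour": "day", "plan": "a"}, "normal"),
]
r1 = [("dest", "intl"), ("hour", "night"), ("plan", "a")]
print(generalize(r1, "fraud", data))
```

In this toy example, the conditions on `dest` and `plan` are irrelevant for separating the classes, so only the condition on `hour` survives.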
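The one-standard-error rule can be sketched as follows, assuming that each pruned subtree is summarized by its number of leaves, its mean cross-validation error and the standard error of that estimate. The tuple layout and the numbers in the example are hypothetical.

```python
from typing import List, Tuple

# (number of leaves, mean cross-validation error, standard error of that mean)
Candidate = Tuple[int, float, float]


def one_standard_error(candidates: List[Candidate]) -> Candidate:
    """Pick the simplest subtree whose error is within one SE of the best one."""
    best = min(candidates, key=lambda c: c[1])
    threshold = best[1] + best[2]
    eligible = [c for c in candidates if c[1] <= threshold]
    return min(eligible, key=lambda c: c[0])


# Hypothetical collection of pruned subtrees.
subtrees = [
    (25, 0.100, 0.010),  # lowest error, but the most complex tree
    (12, 0.105, 0.011),
    (5, 0.109, 0.012),
    (2, 0.150, 0.015),
]
print(one_standard_error(subtrees))
```

Here the subtree with 5 leaves is selected: its error is within one standard error of the minimum, and it is simpler than the two subtrees with lower error.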

3.5 Incremental model

A learning task is considered incremental if the training instances become available over time [9]. Several proposals, such as FLORA [37], AQ11-PM+WAH [27], FACIL [14], VFDR [16] and RILL [10], are based on the incremental model for rule generation.

As shown in Fig. 9, when a new learning instance $n$ is available, a covering test is performed to find out which rules cover $n$ . This step will provide the necessary information to determine whether to generalize an existing rule or create a new one during the rule induction step. After rule induction step, the instance set and rule set are updated. Finally, the rule set is evaluated over unclassified instances in order to detect malicious activities.

Figure 9.

Incremental model.
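The general incremental scheme (covering test, rule induction, update of the instance set and rule set) can be sketched as follows. This is a deliberately simplified illustration and does not implement FLORA, FACIL, AQ11-PM+WAH, VFDR or RILL. It assumes numeric features, rules represented as per-feature intervals, and a generalization that is accepted only if the extended rule covers no instance of another class in the window; the class name `IncrementalRuleLearner` is hypothetical.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple

Instance = Tuple[float, ...]
Box = List[Tuple[float, float]]  # one [low, high] interval per feature


class IncrementalRuleLearner:
    def __init__(self, window_size: int) -> None:
        # Sliding window: the oldest instance is dropped automatically when full.
        self.window: Deque[Tuple[Instance, str]] = deque(maxlen=window_size)
        self.rules: List[Tuple[Box, str]] = []

    @staticmethod
    def _covers(box: Box, x: Instance) -> bool:
        return all(low <= v <= high for (low, high), v in zip(box, x))

    def _generalize(self, x: Instance, label: str) -> bool:
        """Extend a rule of the same class to cover x if no conflict arises."""
        for idx, (box, rule_label) in enumerate(self.rules):
            if rule_label != label:
                continue
            grown = [(min(low, v), max(high, v)) for (low, high), v in zip(box, x)]
            conflict = any(l != label and self._covers(grown, z) for z, l in self.window)
            if not conflict:
                self.rules[idx] = (grown, label)
                return True
        return False

    def learn(self, x: Instance, label: str) -> None:
        # Covering test: is x already covered by a rule of its own class?
        covered = any(l == label and self._covers(b, x) for b, l in self.rules)
        # Rule induction: generalize an existing rule or add the most specific one.
        if not covered and not self._generalize(x, label):
            self.rules.append(([(v, v) for v in x], label))
        # Instance set update.
        self.window.append((x, label))

    def classify(self, x: Instance) -> Optional[str]:
        for box, label in self.rules:
            if self._covers(box, x):
                return label
        return None


learner = IncrementalRuleLearner(window_size=100)
for x, y in [((1.0, 1.0), "normal"), ((9.0, 9.0), "fraud"),
             ((2.0, 2.0), "normal"), ((10.0, 10.0), "fraud")]:
    learner.learn(x, y)
print(learner.classify((9.5, 9.5)))
```

The methods reviewed below refine each of these steps, for example by maintaining several rule sets, by tracking rule purity and support, or by using statistical bounds to decide when to extend a rule.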

FLORA framework uses three rule sets: accepted descriptors (ADES), negative descriptors (NDES) and potential descriptors (PDES). ADES contains rules covering only positive instances, NDES only negatives ones and PDES contains rules matching with both, positive and negative instances. During the covering test step, if a new instance $n$ is added to the learning window $W$ , its respective rule set (ADES if $n$ is positive, NDES if $n$ is negative) is tested in order to find a rule covering $n$ . If ${n}$ is not covered by any rule, a generalization of rules is performed. If there does not exist any generalization that matches $n$ , the instance full premise is added to its respective rule set in the rule induction step. In order to update the existing rules, PDES set is searched and counters of positive instances are incremented for the rules that cover $n$ . Finally, the opposite set (NDES if $n$ is positive, ADES if $n$ is negative) is visited, and rules that match with $n$ are moved to PDES and their counters are updated. Next, for instances update, if $W$ is full, the oldest instance is removed from $W$ and appropriate counters are decreased. This fact may result in a removal of a rule or its migration from PDES to ADES or NDES, with respect to the instance type: negative or positive.

In FACIL method, when a new instance $n$ arrives, the rules with same $c_{l}$ as $n$ are checked to find a candidate. If $n$ is not covered by any of rules with its same $c_{l}$ , the rest of the rules with different $c_{l}$ are checked. When the purity of a rule $r$ is reduced below the minimum threshold defined by the user, new rules are induced from instances associated with $r$ . The necessary generalization to describe a new instance $n$ is calculated according to growth measure $G$ [9]. For each rule with same class label $c_{l}$ as $n$ , $G$ value is computed. If a rule $r$ has the minimum $G$ value, and $n$ can be seized with a moderate growth, then $r$ becomes a candidate. When $r$ does not cover $n$ , the intersection of $r$ with candidate is calculated. In this way, if $\textit{intersection}\neq\emptyset$ , then, the candidate is rejected, and a specific rule $r_{n}$ describing $n$ is generated, and is added to the rule set. Otherwise, if $\textit{intersection}=\emptyset$ , then, the candidate is generalized regarding $n$ , and is added to the rule set. When a rule $r$ from different $c_{l}$ covers $n$ , its negative support is increased. Additionally, $n$ is added to the $r$ instance window (instance set). The rule set is updated based on the support of the rules. In this sense, if the support of a rule $r$ is less than the support of any rules generated from it, then $r$ is removed from the rule set. Moreover, those instances older than a defined threshold by a user, are removed from the instance set. When instances are no longer relevant (they no longer lie on any of the rules boundary) also are removed.

AQ11-PM+WAH method checks in the covering test step, if a new instance is misclassified by the existing rules. Misclassified instances are combined with the ones in the partial memory (instance set) to form the current $D_{t}$ . Then, AQ11-PM+WAH randomly selects a positive training instance, called “the seed”. The seed is generalized as much as possible regarding the constraints from the negative instances, and a single rule $r$ is induced. Positive instances from $D_{t}$ covered by $r$ are removed, and the whole process is repeated until all positive instances from $D_{t}$ are covered. Rules obtained in the rule induction step are added to rule set. During the update instances step, the rule set is modified to match the instances that lie on the rules boundaries. For each rule, the algorithm searches minimum and maximum values of each feature. Then, each condition in such rule is modified to form an interval between minimum and maximum values of condition. The instances from training collection $D_{t}$ that match the edges of the transformed rule using the strict matching strategy are the extreme ones. Extreme instances are combined with previously obtained ones. Instances are removed from the instance set, when no longer force a boundary.

In VFDR method, when a new learning instance $n$ is available, all rules are visited. VFDR employs a data structure, which contains information for new instances classification, and includes the statistics used for extending a rule. Each rule $r$ is associated to its corresponding data structure $L_{r}$ . This structure could be seen as the instance set, but $L_{r}$ stores additionally statistics to compute the entropy for every class label. During the rule induction step, the entropy $\epsilon$ is compared regarding Hoeffding boundary $S$ , which defines when a rule set should be updated either by extending some existing rules or inducting a new rule. Therefore, if $\epsilon>S$ , then a rule should be extended. If none of the rules covers $n$ , the default rule $r_{1}$ statistics $L$ is compared regarding $S$ , and if $L>S$ , then a new rule is induced from $r_{1}$ . As in the previous method, the rules obtained during the rule induction step are added to rule set. Also, when a rule $r$ covers $n$ , its corresponding statistics $L_{r}$ is updated.
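The Hoeffding bound used in this comparison can be computed as in the sketch below, which uses the standard formula $\sqrt{R^{2}\ln(1/\delta)/(2N)}$ , where $R$ is the range of the observed variable, $\delta$ the allowed probability of error and $N$ the number of observations. The helper `should_extend` simply mirrors the comparison of a statistic with the bound described in the text; both function names and the example values are assumptions, and the sketch does not reproduce the internals of VFDR.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Hoeffding bound for a variable with the given range after n observations.

    With probability 1 - delta, the true mean differs from the observed mean
    by at most this value.
    """
    if n <= 0 or not 0.0 < delta < 1.0:
        raise ValueError("n must be positive and delta must lie in (0, 1)")
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_extend(statistic: float, value_range: float, delta: float, n: int) -> bool:
    """Hypothetical decision helper: compare a statistic with the bound S."""
    return statistic > hoeffding_bound(value_range, delta, n)


# The bound shrinks as more instances are observed.
for n in (100, 1000, 10000):
    print(n, round(hoeffding_bound(1.0, 0.05, n), 4))
```

Because the bound decreases as $N$ grows, decisions that cannot be justified after a few instances become possible once enough instances have been observed.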

In RILL, as in the VFDR method, when a new learning instance $n$ is available, all rules are visited. Also, their index is incremented and $n$ is added to the sliding window $W$ (instance set). In this method, statistics such as the number of covered positive examples in the window and the timestamp of the last usage are updated and stored for every rule covering $n$. If no positive rule covers $n$, a generalization process is performed in the rule induction step. If both the search for positive coverage and the generalization process fail, the full description of $n$ is added to the rule set as the most specific rule. In this method, a rule is removed from the rule set if: (1) it has not been used for longer than a maximum age threshold, (2) its purity is too low, or (3) it makes too many prediction errors. Moreover, when the number of stored instances exceeds the maximum size of $W$, the oldest instance is removed from $W$.
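The window maintenance and the three rule removal criteria can be sketched in Python as follows. This is an illustration under stated assumptions, not the RILL implementation: age is read as the time elapsed since the last usage of a rule, purity as the fraction of covered instances that are positive, and all names and thresholds (`RuleStats`, `should_remove`, `add_to_window`, `max_age`, `min_purity`, `max_error_rate`) are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Any, Deque, Optional


@dataclass
class RuleStats:
    last_used: int        # timestamp (instance index) of the last usage
    positives: int = 0    # covered positive instances in the window
    negatives: int = 0    # covered negative instances in the window
    errors: int = 0       # wrong predictions made by the rule
    predictions: int = 0  # all predictions made by the rule


def should_remove(stats: RuleStats, now: int, max_age: int,
                  min_purity: float, max_error_rate: float) -> bool:
    """Apply the three removal criteria: age, purity and prediction errors."""
    too_old = now - stats.last_used > max_age
    covered = stats.positives + stats.negatives
    purity = stats.positives / covered if covered else 1.0
    error_rate = stats.errors / stats.predictions if stats.predictions else 0.0
    return too_old or purity < min_purity or error_rate > max_error_rate


def add_to_window(window: Deque[Any], instance: Any, max_size: int) -> Optional[Any]:
    """Append the instance; return the oldest one if the window overflowed."""
    window.append(instance)
    if len(window) > max_size:
        return window.popleft()
    return None


# Hypothetical usage with a window of two instances.
W: Deque[Any] = deque()
dropped = [add_to_window(W, x, max_size=2) for x in ("n1", "n2", "n3")]
```

Returning the dropped instance lets the caller decrement the statistics of the rules that covered it, which keeps the per-rule counts consistent with the current window.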

4. Discussion

A review of the proposals presented in this paper revealed some limitations in terms of effectiveness in scenarios such as those described above.

Complexity and robustness are characteristics that must be considered when building a model. The more complex a model is, the less effective it tends to be at predicting future instances. In models based on decision trees, data preprocessing is dedicated solely to detecting missing data, which can lead the resulting tree to reach a high level of complexity. To prevent this, stopping rules can be used while the decision tree is being built so that the model does not become too complex. In some situations, stopping rules do not work well. The error lies in assuming that an appropriate threshold can be set without much understanding of the data. This may result in a large number of rules, many of them with irrelevant conditions. An alternative is to grow a decision tree that is too large and then use an effective pruning process to reduce its complexity. However, the number of instances to be processed in MA detection scenarios is considerably large. Therefore, if no data preprocessing is applied to reduce the initial training collection, the model performance may be affected by increased memory consumption and processing time.

The limitation on the types of features that can be processed can also be a disadvantage in the scenarios described. For example, some methods based on the greedy search model can only process categorical features, so a discretization process must be carried out in order to process continuous features. This restricts the operators that the generated rules can employ, forcing them to use only operators such as “ $=$ ” or “ $\neq$ ”.
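The effect of discretization on rule conditions can be illustrated with a small Python sketch. It assumes equal-width binning, which is only one of several possible discretization strategies and is not attributed to any of the reviewed methods. The feature values and the function names are hypothetical.

```python
import bisect
from typing import List


def equal_width_cut_points(values: List[float], n_bins: int) -> List[float]:
    """Return the n_bins - 1 interior cut points of an equal-width discretization."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]


def discretize(value: float, cut_points: List[float]) -> int:
    """Map a continuous value to the index of the bin it falls into."""
    return bisect.bisect_right(cut_points, value)


# Hypothetical continuous feature: call duration in seconds.
durations = [0.0, 30.0, 60.0, 90.0, 120.0]
cuts = equal_width_cut_points(durations, n_bins=3)
bins = [discretize(d, cuts) for d in durations]


def rule_matches(duration_bin: int) -> bool:
    """A rule over the categorical feature can only test equality."""
    return duration_bin == 2
```

With these values, the condition `duration_bin == 2` stands in for `duration >= 80`. The threshold is fixed by the discretization before rule induction starts, so the rule generator cannot refine it.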

Both the decision tree model and the greedy search model are designed for static environments. As a result, the concept drift that can occur in data streams is not taken into account. This makes it difficult to detect malicious activities that have been modified to evade the detection mechanism. In addition, the existing rules do not evolve automatically, which prevents them from adapting to possible changes in the scenarios. This implies that some conditions of the existing rules eventually become irrelevant.

The methods based on the incremental model that were presented in this paper have not been evaluated over large volumes of data. These methods store instances after the learning process. Taking into account the characteristics of the scenarios described above, their performance could be affected by the high number of instances and features. Also, the incremental model lacks a data preprocessing stage, which could affect the number and complexity of the rules.

Focusing on the general classification rule-based model for malicious activity detection presented above, six problems were detected. Two of them are related to data preprocessing and four to rule generation.

Problems with data preprocessing:

•
In some cases it is not performed, so the data dimensionality and the high number of instances are ignored (affects the efficiency of the model).
•
In cases where data preprocessing is applied, the high number of instances in the training collection is still ignored (affects the efficiency of the model).

Problems with rule generation:

•
High number of rules (affects the efficiency of the model and the QoS).
•
Rules with irrelevant conditions (affects the effectiveness of the model).
•
Inconsistency between rules in some cases (affects the effectiveness of the model and the QoS).
•
Existing rules cannot be automatically modified in some cases (affects the effectiveness of the model).

5. Conclusions

Malicious activity detection has recently become a popular research topic. Rule-based models are reported as among the most widely used for detecting events associated with malicious activities in the shortest possible time. In this paper, several classification rule-based models for malicious activity detection were analyzed. In order to achieve greater effectiveness in malicious activity detection, such models must be able to handle problems related to data dimensionality and concept drift. In addition, a model should not reduce the QoS level established by the entity where it will be deployed.

It would be interesting to compare the presented methods on different data sets. Unfortunately, their implementations are not publicly available. In addition, some of the data sets used are not publicly available because of privacy reasons and legal limitations. This makes it difficult to compare the results achieved by the different methods.

Some of the problems identified during the analysis of the models indicate that both their effectiveness and the QoS may be affected. These problems can be addressed in future research by proposing new solutions designed to make malicious activity detection a more effective task.

References

1. Ahn S.H., Kim N.U. and Chung T.M., Big data analysis system concept for detecting unknown attacks, In Advanced Communication Technology (ICACT), 2014 16th International Conference on, IEEE, 2014, pp. 269–272.

2. Rojão-Lourenço Azevedo A.I., KDD, SEMMA and CRISP-DM: A parallel overview, In IADIS European Conference on Data Mining, Amsterdam, The Netherlands, 2008.

3. Barthakur, Dahal and Ghose M.K., Adoption of a fuzzy based classification model for P2P botnet detection, International Journal of Network Security 17(5) (2015), 522–534.

4. Breiman, Friedman, Stone C.J. and Olshen R.A., Classification and regression trees, CRC Press, 1984.

5. CFCA, 2013 global fraud loss survey. [Online]. Available: http://www.cvidya.com/media/62059/global-fraud_loss_survey2013.pdf, 2013.

6. Chapman, Clinton, Kerber, Khabaza, Reinartz, Shearer and Wirth, CRISP-DM 1.0: Step-by-step data mining guide, SPSS Inc., 2000.

7. Cohen W.W., Fast effective rule induction, In Proceedings of the 12th International Conference on Machine Learning, Tahoe City, 1995, pp. 115–123.

8. de Gouveia F.C. and Magedanz, Quality of service in telecommunication networks, Telecommunication Systems and Technologies, 2002.

9. Deckert, Incremental rule-based learners for handling concept drift: an overview, Foundations of Computing and Decision Sciences 38(1) (2013), 35–65.

10. Deckert and Stefanowski, RILL: Algorithm for learning rules from streaming data with concept drift, In Foundations of Intelligent Systems, Springer, 8502 (2014), 20–29.

11. Dorigo, Maniezzo and Colorni, Ant system: optimization by a colony of cooperating agents, Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on 26(1) (1996), 29–41.

12. Fanaee-T and Gama, Event labeling combining ensemble detectors and background knowledge, Progress in Artificial Intelligence 2(2-3) (2014), 113–127.

13. Fayyad, Piatetsky-Shapiro and Smyth, From data mining to knowledge discovery in databases, AI Magazine 17(3) (1996), 37–54.

14. Ferrer-Troyano F.J., Aguilar-Ruiz and Riquelme-Santos J.C., Incremental rule learning and border examples selection from numerical data streams, Journal of Universal Computer Science 11(8) (2005), 1426–1439.

15. Frank and Witten I.H., Generating accurate rule sets without global optimization, University of Waikato, Department of Computer Science, 1998.

16. Gama and Kosina, Learning decision rules from data streams, In IJCAI’11 Proceedings of the 22 International Joint Conference on Artificial Intelligence, Citeseer, Vol. 22, 2011, pp. 1255–1260.

17. Ghorbani A.A. and Tavallaee, Network intrusion detection and prevention: Concepts and techniques, Advances in Information Security, 2010.

18. Gurav S.S. and Todmal S.R., A survey on activity detection using data mining, International Journal of Innovative Research in Computer and Communication Engineering 2(11) (2014), 6947–6952.

19. Han, Kamber and Pei, Data mining: concepts and techniques, Elsevier, 2011.

20. Herrera-Semenets and Gago-Alonso, Búsqueda automática de reglas para detección de fraudes en flujos de eventos, Technical Report RT_029, Serie Gris, Advanced Technologies Application Center (CENATAV), La Habana, Cuba, January 2015.

21. Herrera-Semenets, Prado-Romero M.A. and Gago-Alonso, Análisis de los métodos de detección de fraude en servicios de telecomunicaciones, Technical Report RT_023, Serie Gris, Advanced Technologies Application Center (CENATAV), La Habana, Cuba, February 2014.

22. Hühn and Hüllermeier, FURIA: an algorithm for unordered fuzzy rule induction, Data Mining and Knowledge Discovery, Springer 19 (2009), 293–319.

23. KDnuggets, CRISP-DM, still the top methodology for analytics, data mining, or data science projects. [Online]. Available: http://www.kdnuggets.com/2014/10/crisp-dm-top-methodology-analytics-data-mining-data-science-projects.html, 2014.

24. Kuhn and Johnson, Applied predictive modeling, Springer, 2013.

25. Lawrence R.L. and Wright, Rule-based classification systems using classification and regression tree (CART) analysis, Photogrammetric Engineering and Remote Sensing 67(10) (2001), 1137–1142.

26. Liu and Tan S.T., X2R: A fast rule generator, In Proceedings of IEEE International Conference on Systems, Man and Cybernetics, Vancouver, Canada, October 1995.

27. Maloof, Incremental rule learning with partial instance memory for changing concepts, In Proceedings of the International Joint Conference on Neural Networks, IEEE, Vol. 4, 2003, pp. 2764–2769.

28. Michalski R.S., Kaufman K.A., Pietrzykowski, Wojtusiak, Mitchell and Seeman, Natural induction and conceptual clustering: A review of applications, Reports of the Machine Learning and Inference Laboratory, George Mason University, Fairfax, VA, USA, 2006.

29. Parpinelli R.S., Lopes H.S. and Freitas A.A., An ant colony algorithm for classification rule discovery, Data Mining: A Heuristic Approach 208 (2002), 191–208.

30. Quinlan J.R., MDL and categorical theories (continued), In Machine Learning: Proceedings of the Twelfth International Conference, Lake Tahoe, Morgan Kaufmann, 1995, pp. 464–470.

31. Quinlan J.R., C4.5: programs for machine learning, San Francisco, California, Morgan Kaufmann, 1993, 302.

32. Rivero-Pérez J.L., Técnicas de aprendizaje automático para la detección de intrusos en redes de computadoras, Revista Cubana de Ciencias Informáticas (RCCI) 8(4) (2014), 37–49.

33. SAP, Detect and prevent fraud to reduce financial loss. [Online]. Available: http://www.sap.com/bin/sapcom/de_de/downloadasset.2013-09-sep-17-10.detect-prevent-and-deter-fraud-in-big-data-environments-pdf.html, 2013.

34. Mohd Shukran M.A. and Maskat, An intelligent network intrusion detection using data mining techniques, Jurnal Teknologi 76(12) (2015), 127–131.

35. Supeno, Baskoro A.P., Hudan, Radityo and Henning T.C., Coro: Graph-based automatic intrusion detection system signature generator for evoting protection, Journal of Theoretical and Applied Information Technology 81(3) (2015), 535–546.

36. Tesfahun and Bhaskari D.L., Effective hybrid intrusion detection system: A layered approach, International Journal of Computer Network and Information Security (IJCNIS) 7(3) (2015), 35–41.

37. Widmer and Kubat, Learning in the presence of concept drift and hidden contexts, Machine Learning 23(1) (1996), 69–101.

38. Wirth and Hipp, CRISP-DM: Towards a standard process model for data mining, In Proceedings of the 4th International Conference on the Practical Applications of Knowledge Discovery and Data Mining, Citeseer, 2000, pp. 29–39.

39. Zhu, Wang, Yan and Wu, Research and application of the improved algorithm C4.5 on decision tree, In International Conference on Test and Measurement, ICTM’09, IEEE, Vol. 2, 2009, pp. 184–187.