A new method to classify malicious domain names using neutrosophic sets in DGA botnet detection

Abstract

In botnet detection, identifying the domain names produced by domain generation algorithms (DGAs) in captured traffic is one of the most effective approaches. In this article, we propose a new method to classify harmful domain names using Neutrosophic Sets. After feature selection, the domain name data are fuzzified into Neutrosophic Sets and used to classify domain names as benign, malicious, or indeterminate, minimizing false detection of benign domain names. The proposed model is tested and compared with other malicious domain detection models in terms of Precision, Accuracy, Recall, and F1-score, and the results show that it performs well.

Keywords

DGA domain detection neutrosophic clustering classifying

1 Introduction

Botnets are networks of computers and other internet-connected electronic devices that are infected with malicious code and controlled by a C&C Server [19]. The main connection architectures of botnets are the client-server (centralized), peer-to-peer, and hybrid models. The centralized architecture consists of a central controller, a system of controlling machines, and maliciously controlled workstations. The control station sends a message to the entire network to issue control commands, but firewalls and IDS/IPSs can easily detect it. The peer-to-peer model relays messages between the bot machines. Although the C&C Server is harder to detect in this model, the model itself is very complicated to design and build. In hybrid models, bots do not contact the C&C Server directly but listen for connections and commands from specific servers. C&C Servers scan the network for bots and send control messages when bots are detected.

Currently, most botnets still use a centralized architecture because it is easy to build and develop [37, 38]. DGA domains are designed to overcome the disadvantages of the centralized architecture [4]. In earlier botnets, bots periodically connect to the control server and wait for commands. Therefore, if the control server is detected, the botnet is destroyed. DGA domains are used to conceal the connections to the botnet’s control server [5].

In DGA botnets, the domain of the control server is randomly generated. When bots want to connect to the control server, they run the algorithm to generate a set of domain names and then try to connect to each domain name in this set in turn. Because the botmaster knows the set of domain names generated by the bots at any given time, the botmaster can register one of them as the address of the control server. Most of the generated domain names are unregistered and do not correspond to any IP address; these are non-existent domain names (NXDomain, Non-Existent Domain), also called DGA Domains.

The advantage of a DGA botnet is that even if the control server’s address is detected and all connections to it are blocked, the botnet is still not completely removed. The reason is that the domain name set differs at each point in time, so at subsequent connections the domain names will be different from the previous ones. The control server only needs to register a new address, and the bots will still work as usual.

In this paper, we propose a method of malicious domain classification using Neutrosophic Sets and apply it in a DGA botnet detection system. First, we extract all essential features of the sample domain names, then use the correlation map to select the critical characteristics that significantly affect the classification results, thereby reducing the number of dimensions. Next, we build a Neutrosophic Set by calculating a truth membership function (T), an indeterminacy membership function (I) and a falsehood membership function (F) for each element. Finally, we use the Neutrosophic C-Means clustering (NCM) algorithm to classify the elements.

Our contributions in this article include:

Reducing the number of features used: We select characteristics of domain names that are less affected by factors such as abbreviations, dialects, or the absence of English keywords. The correlation matrix is used to select important features, minimizing the number of characteristics and reducing the number of dimensions.

Proposing the use of the Neutrosophic C-Means (NCM) clustering algorithm to detect DGA domains. Experimental comparison with previous DGA domain detection models shows that NCM gives better results in detection and time.

Using neutrosophic algorithms helps the model identify domains that are questionable and label them for continued monitoring, which increases accuracy and minimizes false detection in the DGA domain detection problem.

The article is structured as follows: Section 2 presents related studies and comments. Section 3 presents the reduced set of attributes used for DGA domain detection. Section 4 presents the proposed model, which uses the NCM algorithm to detect DGA domains. Section 5 reports the experiments, with discussion and comparison with previous results. The last part gives the conclusion and directions for development.

2 Related works

In recent years, many studies on botnets have been published. Stalmans [8] proposed detecting botnet behavior through DNS traffic characteristics. This mechanism eliminates the need to maintain blacklists or update bot signatures. The method uses characteristics of DNS traffic such as the Server Name record, IP address, domain name life span, and the letters that appear in the domain name. Classification is performed with the Naive Bayes algorithm. The effectiveness of this approach is not high because these features and this algorithm are not sufficient for accurate detection. Subsequently, Rajalakshmi et al. [33] proposed a learning model that combines Naive Bayes classification with a CNN to classify domain data sets based on features such as numbers, letters, solid characters, and white space in order to find DGA domains. Wei et al. [43] proposed a method of detecting malicious domain names based on a new n-gram feature. Nhauo Davuth and Sung-Ryul Kim [9] provided a method for classifying domain names based on SVM and the bigram distribution of the data sets: bigram features of the domain name are extracted and filtered by a threshold, and then the SVM Light classifier is used to separate normal domain names from DGA Domains. This method is quite effective but can only distinguish known botnets; for unknown (untrained) botnets, classification efficiency decreases. Schiavoni et al. [36] proposed the Phoenix mechanism, which uses semantic information of domain names and IP-based features to detect domains generated by DGA. The system uses the Mahalanobis distance to evaluate compatibility and uses IP addresses to cluster DGA Domains.

Antonakakis et al. [27] developed a malware detection model based on Definitions and Notation, n-gram Features, and Structural Domain Features combined with the X-means algorithm, and discovered several new DGA variants.

Schuppen et al. [40] proposed a DGA domain detection system called FANCI. The system uses domain names similar to those used by Antonakakis et al. to build the training data set, then uses the SVM and Random Forest algorithms independently to classify and detect DGA domains. The results showed that FANCI performed significantly better than the model of Antonakakis et al.

The above approaches all have certain advantages and disadvantages. The feature set used in [8] misses some essential features for DGA domain detection. The models in [9, 33] depend on the DGA sample data set and cannot discover unknown DGA algorithms. Model [36] has poor real-time response because computing its features takes much time. Models [27, 40] solve the above problems and discover new DGA variants, but they are easily misled by abbreviated domain names and by domain names that use dialects or non-English words.

Data clustering, on which most of the above methods rely, has many vital applications in data mining, pattern recognition, information retrieval, and machine learning, and is an indispensable component of the DGA Domain detection problem. However, in practice, the data are often complicated, incomplete, vague, or uncertain. To address this, Zadeh proposed fuzzy set theory, in which uncertain information is modeled through the degree of membership of an element in a set. The fuzzy clustering algorithm built on Zadeh’s fuzzy sets, Fuzzy C-Means (FCM) by Bezdek et al. [16], has been applied in many different fields and yields better results than crisp (hard) clustering algorithms.

A fundamental limitation of Zadeh’s traditional fuzzy sets is their inability to represent information related to “non-membership” and “hesitation”. For example, when diagnosing a patient, the doctor usually concludes the severity of the patient’s disease but does not indicate what the disease is. The traditional fuzzy set of Zadeh is not suitable for modeling such “non-membership” and “hesitation” information. Similarly, in the DGA Domain detection problem, a typical-looking domain made up of a meaningful word can still be a malicious DGA Domain used to connect to the control server. Some extensions of the traditional fuzzy set have been proposed, such as Atanassov’s intuitionistic fuzzy sets and Smarandache’s Neutrosophic Sets. For fuzzy clustering on Neutrosophic Sets, the essential problem is to determine similarity measures for dividing the elements into clusters [34]. Sahin [32] proposed improvements to hierarchical clustering methods for Neutrosophic Sets. Guo and Sengur [48] extended the Fuzzy C-Means algorithm to Neutrosophic Sets to find neutral elements and noise elements; it has been applied very effectively in image processing and boundary-finding problems. Ye et al. [21–23] proposed three similarity measures for Neutrosophic Sets, namely the Jaccard, Dice, and cosine measures, and applied them to multi-criteria decision-making with simplified neutrosophic data.

Models [27, 40] improved on the disadvantages of previous models, but some issues remain unresolved: domain names that are vague, unclear, or uncertain are not labeled as doubtful, and the time needed to calculate the characteristics is still high. To address the shortcomings of models [27, 40], we selected a set of features based on these two models, eliminating some features without affecting the clustering results. Features that are sensitive to issues such as abbreviated keywords, domain names using dialects, or non-English domain names are removed. Correlation matrices are also used to select relevant features, helping to reduce computation time. This stage is described in Section 3 of this article.

3 Proposed characteristics for domain names in the DGA Domain detection system

For Neutrosophic clustering, the selection of characteristics is significant. We classify the characteristics of the domain name into three groups as follows:

Structural characteristics

Grammar characteristics

Semantic statistics characteristics

3.1 Structural characteristics

Table 1 describes the structural characteristics used by the algorithm:

Table 1
Structural characteristics

Characteristics	Meaning	vnexpress.net (Normal Domain)	tccyyuytiymh.pw (DGA Domain)
DNL	Domain Name Length	13	15
NoS	Number of Subdomains	1	1
SLM	Subdomain Length Mean	9	12
HwP	Has www Prefix	0	0
HVTLD	Has a Valid Top Level Domain	1	1
CTS	Contains Top Level Domain as Subdomain	0	0
UR	Underscore Ratio	0.0	0.0
CIPA	Contains IP Address	0	0

DNL: Domain Name Length.

Example: vnexpress.net has DNL = 13.

NoS: Number of Subdomains.

Example: finance.gov.ls has NoS = 3; baomoi.com has NoS = 2.

SLM: Subdomain Length Mean.

Example: vnexpress.net has SLM = 9.

HwP: Has www Prefix.

Example: www.edu.vn has HwP value = 1.

HVTLD: Has a Valid Top Level Domain: The domain name has a valid top-level domain. The list of valid top-level domains is taken from the Root-zone database (www.iana.org). Domains whose top-level domain is not in the Root-zone database are treated as DGA domains. Example: DGA Domain shtkwcex.bit has HVTLD value = 0 (.bit is not in the Root-zone database).

CTS: Contains Top Level Domain as Subdomain: The domain name contains a subdomain that is itself listed in the Root-zone database.

Example: dantri.com.vn has CTS value = 1.

UR: Underscore Ratio: The ratio of “_” characters in the domain name. DGA domains usually do not contain this character. UR is determined by the formula: $UR(domain) = \frac{\sum count(\_)}{len(domain)}$ (1)

count(“_”) is the number of “_” characters in the domain name; len(domain) is the length of the domain.

Example: 2018_smileshop99.com has the value UR = 0.0625

CIPA: Contains IP Address: The domain name is an IP address. Many websites are accessed through IP addresses rather than domain names; these IP addresses are considered valid addresses. Example: 8.8.8.8 (Google’s DNS) has CIPA value = 1; vnexpress.net has CIPA value = 0.
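Some of the structural characteristics above can be illustrated with a short Python sketch. It is not the authors' implementation: the function names are hypothetical, and computing UR over the left-most label rather than the full name is an assumption, chosen because it reproduces the worked example UR = 0.0625. NoS, SLM, HVTLD and CTS are left out because they depend on the Root-zone database or on how subdomains are delimited.

```python
import ipaddress


def dnl(domain):
    """DNL: length of the full domain name, dots included."""
    return len(domain)


def hwp(domain):
    """HwP: 1 if the name starts with the 'www.' prefix, else 0."""
    return int(domain.lower().startswith("www."))


def ur(domain):
    """UR: share of '_' characters in the left-most label (assumption)."""
    label = domain.split(".")[0]
    return label.count("_") / len(label) if label else 0.0


def cipa(domain):
    """CIPA: 1 if the whole name parses as an IP address, else 0."""
    try:
        ipaddress.ip_address(domain)
        return 1
    except ValueError:
        return 0


print(dnl("vnexpress.net"))        # 13
print(hwp("www.edu.vn"))           # 1
print(ur("2018_smileshop99.com"))  # 0.0625
print(cipa("8.8.8.8"))             # 1
```

The printed values match the examples given for DNL, HwP, UR and CIPA in this section.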

3.2 Grammar characteristics

The selected grammar characteristics are shown in Table 2:

Table 2
Grammar characteristics

Characteristics	Meaning	vnexpress.net (Normal Domain)	tccyyuytiymh.pw (DGA Domain)
contains_digit	Contains digit	0	0
Vowel_ratio	The ratio of vowels / length of the domain name	0.222222	0.166667
Digit_ratio	The ratio of digits / length of the domain name	0	0

contains_digit: Contains digit.

Example: naroberts27.github.io has contains_digit value = 1; DGA domain 7rsbs8sz1hq5y6qya1.ru has contains_digit value = 1.

Vowel_ratio: The ratio of vowels: the number of vowels divided by the length of the domain name. This value is determined by the formula:

$Vowel\_ratio(domain) = \frac{\sum count(Vowel(domain))}{len(domain)}$ (2)

Vowel(domain) has values: “a”, “e”, “i”, “o”, “u”.

Example: Valid domain name: kcsecurities.com has Vowel_ratio value = 0.416666667.

Digit_ratio: The ratio of digits: the number of digits divided by the length of the domain name. This value is determined by the formula: $Digit\_ratio(domain) = \frac{\sum count(Digit(domain))}{len(domain)}$ (3)

Digit(domain) denotes the digits in the domain.

Example: Valid domain name naroberts27.github.io has the value Digit_ratio = 0.181818182. DGA domain afj7jmxngvrsm4p15d.com has the value Digit_ratio = 0.071428571.
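As a rough illustration of Equations (2) and (3), the sketch below computes the three grammar characteristics in Python. The helper names are hypothetical, and taking len(domain) as the length of the left-most label is an assumption: it is the reading that reproduces the Vowel_ratio values in Table 2 and the examples for kcsecurities.com and naroberts27.github.io.

```python
VOWELS = set("aeiou")


def first_label(domain):
    """Left-most label of the name, lower-cased (assumed unit of analysis)."""
    return domain.lower().split(".")[0]


def contains_digit(domain):
    """1 if the label holds at least one digit, else 0."""
    return int(any(ch.isdigit() for ch in first_label(domain)))


def vowel_ratio(domain):
    """Equation (2): number of vowels divided by the label length."""
    label = first_label(domain)
    return sum(ch in VOWELS for ch in label) / len(label) if label else 0.0


def digit_ratio(domain):
    """Equation (3): number of digits divided by the label length."""
    label = first_label(domain)
    return sum(ch.isdigit() for ch in label) / len(label) if label else 0.0


print(round(vowel_ratio("vnexpress.net"), 6))          # 0.222222
print(round(vowel_ratio("kcsecurities.com"), 6))       # 0.416667
print(round(digit_ratio("naroberts27.github.io"), 6))  # 0.181818
print(contains_digit("7rsbs8sz1hq5y6qya1.ru"))         # 1
```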

3.3 Semantic statistic characteristics

Table 3 shows the Semantic statistic characteristics:

Table 3
Semantic statistic characteristics

Characteristics	Meaning	vnexpress.net (Normal Domain)	tccyyuytiymh.pw (DGA Domain)
RRC	The ratio of repeated characters in a subdomain	0.285714	0.428571
RCC	The ratio of consecutive consonants	0.555556	0.583333
RCD	The ratio of consecutive digits	0	0
Entropy	The entropy of subdomain	2.725481	2.584963

RRC: The ratio of repeated characters in a subdomain. This value is determined by the formula:

$Repeated\_ratio(domain) = \frac{\sum count(repeated(domain))}{len(domain)}$ (4)

Repeated(domain) is the number of times a character is repeated in the domain.

RCC: The ratio of consecutive consonants. This value is determined by the formula:

$consecutive\_consonants\_ratio(domain) = \frac{\sum count(Cconsonants(domain))}{len(domain)}$ (5)

Cconsonants(domain) is the number of consecutive consonants in the domain.

RCD: The ratio of consecutive digits. This value is determined by the formula:

$consecutive\_digits\_ratio(domain) = \frac{\sum count(Cdigits(domain))}{len(domain)}$ (6)

Cdigits(domain) is the number of times a digit is repeated consecutively in the domain.

Entropy: The entropy of subdomain. This value is determined by the formula: $E(d) = - \sum_{t \in p} \frac{count(t)}{len(domain)} * \log (\frac{count(t)}{len(domain)})$ (7)

t is a character in the domain, and p is the set of characters in the domain.
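Equation (7) can be checked with a few lines of Python. This is a minimal sketch under two assumptions that are not stated in the text: the entropy is taken over the left-most label only, and the logarithm is base 2. With these choices the sketch reproduces the Entropy values listed in Table 3. The function name is hypothetical.

```python
import math
from collections import Counter


def subdomain_entropy(domain):
    """Shannon entropy (base 2) of the characters in the left-most label."""
    label = domain.lower().split(".")[0]
    n = len(label)
    counts = Counter(label)  # count(t) for every distinct character t
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


print(round(subdomain_entropy("vnexpress.net"), 6))    # 2.725481
print(round(subdomain_entropy("tccyyuytiymh.pw"), 6))  # 2.584963
```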

4 Proposed malicious domain classification model using Neutrosophic Sets

Similar to the Phoenix model, we first remove blacklisted domain names, then classify the remaining domains according to the model below:

The classification model for detecting malicious domain names includes three stages, shown in Fig. 1:

Characteristics calculation phase: For each domain name, the related characteristics are calculated, creating a characteristic data source for the next selection phase. The characteristics calculation methods were presented in the previous section of this article.

Characteristics selection phase: We use the correlation map to select the influential characteristics in the characteristics data set in order to improve the clustering results, minimize interference, and increase the calculation performance of the algorithm. The more characteristics are used, the higher the accuracy; however, the computation time also increases. Therefore, the selection of the most influential characteristics is the critical phase of the model. From the selected characteristics, we build Neutrosophic Sets.

Neutrosophic clustering phase: Neutrosophic clustering is used to find the normal domain names, DGA domain names, and neutral domain names. In this model, we propose using the Neutrosophic C-Means clustering algorithm (NCM), which is suitable for problems that require short calculation time, large data sets, and good real-time processing.

Fig. 1

Diagram of the malicious domain classification model using Neutrosophic Sets.

4.1 Characteristic selection

In this model, we use the correlation matrix to select the most influential characteristics in the clustering process. There are several methods of correlation matrix construction, including Pearson’s method, Spearman’s method, Kendall’s method, and Goodman and Kruskal’s method. In this model, we use Pearson’s correlation coefficient to select the influential characteristics. Pearson’s method is as follows:

For two elements x and y we have: $r_{xy} = \frac{\sum_{i = 1}^{n} (x_{i} - \bar{x}) (y_{i} - \bar{y})}{\sqrt{\sum_{i = 1}^{n} {(x_{i} - \bar{x})}^{2}} \sqrt{\sum_{i = 1}^{n} {(y_{i} - \bar{y})}^{2}}}$ (8)

n is the number of elements; x_i and y_i are the i-th values of x and y.

$\bar{x}$ : the average value of x is determined by: $\bar{x} = \frac{1}{n} \sum_{i = 1}^{n} x_{i}$ (9)

$\bar{y}$ : the average value of y is determined analogously to (9).
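The following Python sketch shows Pearson's coefficient in its standard form (squared deviations under the square roots) and how a correlation matrix over characteristics could be assembled from it. The function names and the toy values are illustrative assumptions; they are not taken from the data sets used in the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Map every pair of named columns to its Pearson coefficient."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}


# Toy values for three characteristics (hypothetical, for illustration only)
columns = {
    "DNL": [13, 15, 10, 22],
    "SLM": [9, 12, 6, 18],
    "HwP": [0, 0, 1, 0],
}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {k: round(v, 3) for k, v in row.items()})
```

A column with zero variance would make the denominator zero, so constant characteristics need to be dropped before calling `pearson`.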

After determining the correlation values, we built the correlation matrix of characteristics from the data sets [1–3], shown in Fig. 2:

Fig. 2

Correlation matrix of the characteristics in the DGA Domain classification model.

We selected the six most informative features for classification: Entropy, SLM, DNL, RCC, RRC, and vowel ratio. The Type feature is used as the label for evaluating the clustering results.

As illustrated in Figs. 3, 4, 5 and 6, DGA domains always have higher entropy values than benign domains. RCC and SLM take high values for DGA domains, while characteristics such as RRC and vowel ratio tend to cluster within a narrow range of values.

After the essential characteristics are selected, they are used to conduct clustering on the Neutrosophic Sets; the clustering method is presented in the next section of this article.

Fig. 3

SLM characteristic.

Fig. 4

RRC characteristic.

Fig. 5

Vowel ratio characteristic.

Fig. 6

RCC characteristic.

4.2 Clustering method on Neutrosophic Sets

The concept of Neutrosophic Sets is as follows. Let X be a non-empty set, with an element of X denoted by x∈X. A Neutrosophic Set A defined on X is characterized by three functions: the truth-membership function T_A (x), the degree of belief that event x will occur; the indeterminacy-membership function I_A (x), the degree to which it is unknown whether event x occurs; and the falsehood-membership function F_A (x), the degree of belief that event x will not occur.

Unlike traditional clustering methods such as K-means and fuzzy C-means, the Neutrosophic clustering method can identify interference (noise) data, exceptions, and data that are neutral among clusters [26]. For DGA Domain detection, finding interference data and neutral data is very important, because these elements need to be identified for inspection and monitoring. Guo and Sengur [48] proposed the Neutrosophic C-Means clustering algorithm to find interference data and neutral data based on Neutrosophic Sets. For each input data point, this algorithm determines its degree of membership in each cluster, its degree of neutrality, and its degree of non-membership in all clusters. The objective function and the membership measures are as follows:

$J_{NCM} (T, I, F, c) = \sum_{i = 1}^{N} \sum_{j = 1}^{C} {(ω_{1} T_{ij})}^{m} ∥ x_{i} - c_{j} ∥^{2} + \sum_{i = 1}^{N} {(ω_{2} I_{i})}^{m} ∥ x_{i} - {\bar{c}}_{imax} ∥^{2} + δ^{2} \sum_{i = 1}^{N} {(ω_{3} F_{i})}^{m}$ (10)

T_ij, I_i, and F_i denote the degrees of membership, neutrality, and non-membership of each element, with 0 < T_ij, I_i, F_i < 1, and satisfy the following formula: $\sum_{j = 1}^{C} T_{ij} + I_{i} + F_{i} = 1$ (11)

For each point i, the value of ${\bar{c}}_{imax}$ is determined by the centers of the clusters with the largest and second-largest values of T_ij according to: ${\bar{c}}_{imax} = \frac{c_{p_{i}} + c_{q_{i}}}{2}$ (12) $p_{i} = argmax_{j = 1, 2, \dots, C} (T_{ij})$ (13) $q_{i} = argmax_{j \neq p_{i}, j = 1, 2, \dots, C} (T_{ij})$ (14)

Here m is a constant, and p_i and q_i are the indices of the clusters with the largest and second-largest values of T for point i. Once p_i and q_i are determined, the value ${\bar{c}}_{imax}$ is calculated and is fixed for data point i within the current iteration.

The values T_ij, I_i, F_i are defined as follows: $T_{ij} = \frac{K}{ω_{1}} ∥ x_{i} - c_{j} ∥^{- \frac{2}{m - 1}}$ (15) $I_{i} = \frac{K}{ω_{2}} ∥ x_{i} - {\bar{c}}_{imax} ∥^{- \frac{2}{m - 1}}$ (16) $F_{i} = \frac{K}{ω_{3}} δ^{- \frac{2}{m - 1}}$ (17)

$K = {[\frac{1}{ω_{1}} \sum_{j = 1}^{C} ∥ x_{i} - c_{j} ∥^{- \frac{2}{m - 1}} + \frac{1}{ω_{2}} ∥ x_{i} - {\bar{c}}_{imax} ∥^{- \frac{2}{m - 1}} + \frac{1}{ω_{3}} δ^{- \frac{2}{m - 1}}]}^{- 1}$ (18) $c_{j} = \frac{\sum_{i = 1}^{N} {(ω_{1} T_{ij})}^{m} x_{i}}{\sum_{i = 1}^{N} {(ω_{1} T_{ij})}^{m}}$ (19)

Clustering is iterative and optimizes the objective function. The degrees of membership, neutrality, and non-membership are updated according to the above expressions in each iteration.

The value of ${\bar{c}}_{imax}$ is also updated in each iteration.

The iteration stops when $| T_{ij}^{(k + 1)} - T_{ij}^{(k)} | < ɛ$, where 0 < ɛ < 1 is the termination threshold and k is the iteration number. The algorithm can be summarized in the following steps:

Algorithm 1: Neutrosophic C-Means

Input: A data set fuzzified into Neutrosophic Sets

Output: Data clusters, including the membership clusters, the falsehood cluster, and the indeterminacy cluster; the function NCM(X, ɛ, k) returns k clusters

Step 1: Initialize T⁽⁰⁾, I⁽⁰⁾, F⁽⁰⁾;

Step 2: Initialize parameters C, m, ɛ, δ, ω₁, ω₂, ω₃;

Step 3: Calculate the center vectors c^(k) at iteration k by Equation (19);

Step 4: Calculate the value of ${\bar{c}}_{imax}$ based on the clusters with the largest and second-largest values of T;

Step 5: Update T^(k+1), I^(k+1), F^(k+1) values based on T^(k), I^(k), F^(k) by Equations (15), (16), (17);

Step 6: If $| T_{ij}^{(k + 1)} - T_{ij}^{(k)} | < ɛ$ then the loop stops; otherwise go back to Step 3;

Step 7: Assign each data point x_i to the class k with the largest value in TM = [T, I, F], that is, k = argmax_j (TM_ij), j = 1, 2, …, C + 2.

If k = C + 1, the element belongs to the neutral class; if k = C + 2, the element does not belong to any cluster.
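The steps of Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the function name `ncm`, the random initialization of T, and all default parameter values (weights, δ, ɛ, iteration cap) are assumptions made for the example. Distances are Euclidean, labels are 0-based, so label C stands for the neutral class (C + 1 in the text) and label C + 1 for elements that belong to no cluster (C + 2 in the text).

```python
import random


def sqdist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def ncm(X, C=2, m=2.0, eps=1e-4, delta=5.0, w=(0.75, 0.125, 0.125),
        max_iter=100, seed=0):
    """Sketch of Neutrosophic C-Means for C >= 2 (illustrative defaults)."""
    w1, w2, w3 = w
    N, dim = len(X), len(X[0])
    power = -1.0 / (m - 1.0)  # on squared distances this is d^(-2/(m-1))
    tiny = 1e-12              # guards against a zero distance
    rng = random.Random(seed)
    # Step 1: random T with I = F = 0.1, so each row satisfies Equation (11)
    T = []
    for _ in range(N):
        row = [rng.random() + tiny for _ in range(C)]
        T.append([0.8 * t / sum(row) for t in row])
    I, F = [0.1] * N, [0.1] * N
    centers = []
    for _ in range(max_iter):
        # Step 3: cluster centers, Equation (19)
        centers = []
        for j in range(C):
            wts = [(w1 * T[i][j]) ** m for i in range(N)]
            total = sum(wts)
            centers.append(tuple(
                sum(wt * x[d] for wt, x in zip(wts, X)) / total
                for d in range(dim)))
        new_T, I, F = [], [], []
        for i, x in enumerate(X):
            # Step 4: mean of the two centers with the largest T, Eq. (12)
            p, q = sorted(range(C), key=lambda j: T[i][j], reverse=True)[:2]
            c_bar = tuple((a + b) / 2 for a, b in zip(centers[p], centers[q]))
            # Step 5: Equations (15)-(18)
            t_raw = [max(sqdist(x, c), tiny) ** power / w1 for c in centers]
            i_raw = max(sqdist(x, c_bar), tiny) ** power / w2
            f_raw = (delta ** 2) ** power / w3
            K = 1.0 / (sum(t_raw) + i_raw + f_raw)
            new_T.append([K * t for t in t_raw])
            I.append(K * i_raw)
            F.append(K * f_raw)
        # Step 6: stop when no T value changes by eps or more
        change = max(abs(new_T[i][j] - T[i][j])
                     for i in range(N) for j in range(C))
        T = new_T
        if change < eps:
            break
    # Step 7: pick the largest entry of TM = [T, I, F] for every point
    labels = []
    for i in range(N):
        tm = T[i] + [I[i], F[i]]
        labels.append(max(range(C + 2), key=lambda k: tm[k]))
    return labels, centers, T, I, F


# Toy 2-D points (hypothetical): two groups and one point between them
X = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3),
     (9.8, 10.0), (10.1, 9.9), (10.0, 10.2), (5.0, 5.0)]
labels, centers, T, I, F = ncm(X, C=2)
print(labels)
```

Because the memberships are normalized by K, each row of T together with I and F sums to one after every update, which is a convenient sanity check when experimenting with the parameters.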

In the detection model, the algorithm finds the elements of the DGA domain class and of the normal domain class. It also finds normal domain names that have the characteristics of DGA domain names, as well as hidden DGA domain names that have the characteristics of normal domain names. From these results, a level can be assigned to each domain name.

5 Experiments and evaluation

5.1 Experimental tools

In this section, we assess the NCM algorithm in the proposed model against other methods through the following three comparisons: comparison between neutrosophic algorithms: Sahin [32]; comparison with fuzzy clustering algorithms: FCM [16], FSVM [39]; and comparison between DGA detection models: X-means [27], SVM [40].

In the proposed model, we use the Manhattan distance as the similarity measure between two elements X and Y: $Sim (X, Y) = Dis (X, Y) = \sum_{i = 1}^{m} | x_{i} - y_{i} |$ (20)
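As a small worked example of Equation (20), the sketch below computes the Manhattan distance between two characteristic vectors. The function name is hypothetical; the two vectors hold the DNL and SLM values of vnexpress.net and tccyyuytiymh.pw from Table 1.

```python
def manhattan(x, y):
    """Equation (20): sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


normal = (13, 9)  # (DNL, SLM) of vnexpress.net
dga = (15, 12)    # (DNL, SLM) of tccyyuytiymh.pw
print(manhattan(normal, dga))  # 5
```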

The algorithm is implemented in the Python 3.5 programming language on a computer with an Intel (R) Core (TM) i7-3470QM CPU clocked at 2.7 GHz, 8192 MB of RAM, and the Windows 10 Professional 64-bit operating system.

5.2 Experimental database

We use training databases, including:

One million most visited domains by Alexa [2] statistics.

Malicious domain name database provided by Bambenek Consulting [3] at http://osint.bambenekconsulting.com/feeds/DGA feed.txt

DGA domain database provided by 360 Lab [1] at https://data.netlab.360.com/feeds/dga/dga.txt

Table 4 provides a general description of the empirical data set. Tables 11 and 12 describe in detail the number of DGA domains in the two data sets [1] and [3]. In these two DGA data sets, each entry contains information such as the botnet type and the URL generated by the DGA algorithm.

Table 4
General description of experimental data

Database	Number of elements	Number of classes
Alexa	1,000,000	1
Bambenek Consulting	1,169,720	35
360 Lab	872,763	41

A data set of 10,000 elements is extracted from the Alexa data set and combined with the remaining two data sets. The resulting data are labeled and mixed to build the training data set.

5.3 Measure evaluation

Our experiment evaluates whether a domain is a DGA domain, so the predicted value is binary. The classification results fall into four cases, as shown in Table 5:

Table 5
Classification result

Real sample	Predicted benign domain	Predicted DGA domain
Benign sample	True Positive (TP)	False Negative (FN)
DGA domain sample	False Positive (FP)	True Negative (TN)

We evaluate the accuracy of the model according to the Accuracy and Micro-averaging assessment methods as follows: $Accuracy = \frac{TP + TN}{TP + FP + TN + FN}$ (21) $Precision = \frac{\sum_{i} {TP}_{i}}{\sum_{i} ({TP}_{i} + {FP}_{i})}$ (22) $Recall = \frac{\sum_{i} {TP}_{i}}{\sum_{i} ({TP}_{i} + {FN}_{i})}$ (23) $F1\text{-}score = \frac{2}{1 / Precision + 1 / Recall}$ (24)
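Equations (21) to (24) translate directly into code. The sketch below is illustrative only: the function names are hypothetical and the counts are made-up numbers, not results from the experiments.

```python
def accuracy(tp, fp, tn, fn):
    """Equation (21)."""
    return (tp + tn) / (tp + fp + tn + fn)


def micro_scores(per_class):
    """Equations (22)-(24); per_class is a list of (TP_i, FP_i, FN_i)."""
    tp = sum(c[0] for c in per_class)
    fp = sum(c[1] for c in per_class)
    fn = sum(c[2] for c in per_class)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 / (1 / precision + 1 / recall)
    return precision, recall, f1


# Hypothetical counts for a single class
print(accuracy(tp=80, fp=20, tn=70, fn=30))  # 0.75
precision, recall, f1 = micro_scores([(80, 20, 30)])
print(round(precision, 4), round(recall, 4), round(f1, 4))  # 0.8 0.7273 0.7619
```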

We use two measures, the Davies-Bouldin index and the Calinski-Harabasz criterion, to compare our method with the models using FCM [16] and FSVM [39]:

- Davies-Bouldin (DB) measurement: $DB = \frac{1}{k} \sum_{l = 1}^{k} D_{l}$ (25) $D_{l} = max_{m \neq l} {D_{l, m}}; D_{l, m} = ({\bar{d}}_{l} + {\bar{d}}_{m}) / d_{l, m}$ (26)

Here ${\bar{d}}_{l}$ and ${\bar{d}}_{m}$ are the average within-cluster distances of the lth and mth clusters respectively, while d_l,m is the distance between these clusters. They are calculated as follows: ${\bar{d}}_{l} = \frac{1}{N_{l}} \sum_{x_{i} \in C_{l}} ∥ x_{i} - {\bar{x}}_{l} ∥; d_{l, m} = ∥ {\bar{x}}_{l} - {\bar{x}}_{m} ∥ .$ (27)

For the algorithms tested, the smaller the DB value, the better.
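A minimal Python sketch of the DB index in Equations (25) to (27) is given below. The function names and the toy clusters are assumptions for illustration; clusters are passed as lists of points, and centroids are the per-coordinate means.

```python
import math


def centroid(points):
    """Per-coordinate mean of a list of equal-length vectors."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def euclid(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def davies_bouldin(clusters):
    """DB index: mean over clusters of the worst (d_l + d_m) / d_lm ratio."""
    cents = [centroid(c) for c in clusters]
    spread = [sum(euclid(p, cents[l]) for p in c) / len(c)
              for l, c in enumerate(clusters)]
    k = len(clusters)
    total = 0.0
    for l in range(k):
        total += max((spread[l] + spread[m]) / euclid(cents[l], cents[m])
                     for m in range(k) if m != l)
    return total / k


# Two toy clusters (hypothetical): centroids (0, 1) and (4, 1), spread 1 each
clusters = [[(0, 0), (0, 2)], [(4, 0), (4, 2)]]
print(davies_bouldin(clusters))  # 0.5
```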

- Calinski-Harabasz Criterion (VRC):

The Calinski-Harabasz criterion is also called the variance ratio criterion (VRC). VRC is defined as ${VRC}_{k} = \frac{{SS}_{B}}{{SS}_{W}} \times \frac{(N - k)}{(k - 1)}$ (28) where SS_B is the overall between-cluster variance, SS_W is the overall within-cluster variance, k is the number of clusters, and N is the number of observations.

SS_B is defined as: ${SS}_{B} = \sum_{i = 1}^{k} n_{i} ∥ m_{i} - m ∥^{2}$ (29) where k is the number of clusters, m_i is the centroid of cluster i, m is the overall mean of the data, and ∥m_i - m∥ is the L² norm (Euclidean distance) between the two vectors.

SS_W is defined as: ${SS}_{W} = \sum_{i = 1}^{k} \sum_{x \in c_{i}} ∥ x - m_{i} ∥^{2}$ (30) where x is a data point, c_i is the ith cluster, m_i is the centroid of cluster i and ∥x - m_i∥ is the L² norm (Euclidean distance) between the two vectors.

A larger VRC value indicates better clustering performance.
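Equations (28) to (30) can be sketched in Python in the same style. The function names and the toy clusters are illustrative assumptions; n_i in Equation (29) is taken to be the number of points in cluster i.

```python
def centroid(points):
    """Per-coordinate mean of a list of equal-length vectors."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def sqdist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def vrc(clusters):
    """Calinski-Harabasz criterion, Equation (28)."""
    all_points = [p for c in clusters for p in c]
    n_obs, k = len(all_points), len(clusters)
    overall = centroid(all_points)
    # Equation (29): between-cluster sum of squares
    ss_b = sum(len(c) * sqdist(centroid(c), overall) for c in clusters)
    # Equation (30): within-cluster sum of squares
    ss_w = sum(sqdist(p, centroid(c)) for c in clusters for p in c)
    return (ss_b / ss_w) * (n_obs - k) / (k - 1)


# Two toy clusters (hypothetical): SS_B = 16, SS_W = 4, N = 4, k = 2
clusters = [[(0, 0), (0, 2)], [(4, 0), (4, 2)]]
print(vrc(clusters))  # 8.0
```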

5.4 Experimental results

Evaluating the proposed model against the algorithms of Sahin, X-means, SVM, FCM, TSVM, and FSVM on the data set above, we obtained the results shown in Tables 6, 7 and 8:

Table 6
Evaluate the results of DGA domain classification algorithms

Method	Precision	Recall	F1-score
NCM	0.7880	0.8543	0.8211
Sahin	0.4777	0.7953	0.5969
X-means	0.7532	0.8622	0.8077
SVM	0.82	0.72	0.77
FCM	0.7638	0.8262	0.7937
FSVM	0.83	0.73	0.78

Table 7

Comparing results between algorithms for normal domain classification

Method	Precision	Recall	F1-score
NCM	0.8629	0.76484	0.8109
Sahin	0.8053	0.5215	0.6330
X-means	0.8449	0.7312	0.7881
SVM	0.75	0.84	0.79
FCM	0.8107	0.7445	0.7762
FSVM	0.76	0.85	0.80

Table 8

Comparing Accuracy index between algorithms

Method	Accuracy
NCM	0.8109
Sahin	0.6710
X-means	0.7756
SVM	0.7815
FCM	0.7854
FSVM	0.7905

Based on these results, NCM achieves the highest F1-score and Accuracy for both DGA domain classification and normal domain classification compared with the Sahin, X-means, SVM, FCM, and FSVM algorithms. NCM also has the highest Precision for normal domains, while SVM and FSVM have higher Precision for DGA domains (0.82 and 0.83 versus 0.7880).

To illustrate the advantages of the method, we evaluated the two neutrosophic algorithms, NCM and Sahin, on neutral elements. The classification rates are as follows:

As can be seen from Fig. 10, the NCM classifier is highly effective in detecting DGA domains: it achieves the best DGA detection rate and yields fewer indeterminacy elements than the Sahin algorithm.

Although the NCM algorithm flags fewer domain names as questionable than Sahin, its Precision, Recall, F1-score, and Accuracy (Figs. 7, 8, 9) are higher than those of the Sahin algorithm. Comparing NCM with two other fuzzy clustering algorithms, FCM [16] and FSVM [39], using the DB and VRC measures, we obtained the results shown in Table 9:

Fig. 7

Comparing indicators between classification algorithms.

Fig. 8

Comparing indicators between classification algorithms.

Fig. 9

Comparing Accuracy indicators between classification algorithms.

Table 9

Comparing DB, VRC index between algorithms

Method	DB	VRC
NCM	1.32735	47344.1623
FCM	1.73976	42004.3925
FSVM	2.029794	40022.3472

The above results show that NCM achieves the best results on both the DB and VRC measures: its DB index is the smallest and its VRC is the largest.

The classification results of our proposed model, presented on some important characteristics such as Entropy, RRC, and DNL, are shown in Fig. 11:

Fig. 10

Classification results on two neutrosophic algorithms.

Fig. 11

Classification results of the proposed model.

The clustering results show the level of each element. Purple elements denote normal domains and dark green elements denote DGA domains. Yellow elements are considered interference elements, which should be evaluated on other features, and blue elements are considered neutral elements.

The neutrosophic-based clustering method is highly effective in the DGA domain filtering phase because it also finds interference elements and neutral elements. These elements need to be assessed by other methods, or on other characteristics, to determine their level of toxicity, which makes the method suitable for use within a multi-layer analysis model.

The runtime evaluation (Table 10) shows that FSVM and NCM reach a similar level of accuracy, but NCM needs far less computation time (42.44 s versus 715.71 s). X-means and FCM run faster than NCM, but with a lower Accuracy and F1-score. The model using the NCM algorithm therefore offers the best balance between calculation time and detection accuracy.
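One way to read the four groups is as a maximum-membership rule over the three kinds of degrees that a neutrosophic clustering produces for each domain name: a truth degree for each determinate cluster, an indeterminacy degree, and a falsity degree. The sketch below is hypothetical: the function name, the group names, and the mapping of "neutral" to the indeterminacy degree and "interference" to the falsity degree are assumptions made for illustration, not the authors' implementation.

```python
def assign_group(truth, indeterminacy, falsity, cluster_names=("normal", "DGA")):
    """Return the group whose membership degree is the largest.

    truth          -- degrees of belonging to each determinate cluster
    indeterminacy  -- degree of lying between clusters (assumed: "neutral")
    falsity        -- degree of being noise or an outlier (assumed: "interference")
    """
    if len(truth) != len(cluster_names):
        raise ValueError("one truth degree is needed per cluster name")
    candidates = dict(zip(cluster_names, truth))
    candidates["neutral"] = indeterminacy
    candidates["interference"] = falsity
    return max(candidates, key=candidates.get)


if __name__ == "__main__":
    # Hypothetical membership degrees for four domain names.
    samples = [
        ((0.80, 0.10), 0.07, 0.03),
        ((0.10, 0.75), 0.10, 0.05),
        ((0.30, 0.30), 0.35, 0.05),
        ((0.10, 0.10), 0.10, 0.70),
    ]
    for truth, ind, fal in samples:
        print(truth, ind, fal, "->", assign_group(truth, ind, fal))
```

Domain names that fall into the neutral or interference groups are not forced into the normal or DGA class, so a later analysis layer can examine them on other features.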

Table 10

Classification execution time of the algorithms

Method	Time (s)
NCM	42.4442
Sahin	262.2590
X-means	18.4375
SVM	53.7284
FCM	36.2356
FSVM	715.7148

Table 11

Detailed description of the 360 Lab data set

DGA type	Number	DGA type	Number	DGA type	Number	DGA type	Number	DGA type	Number
banjori	452428	virut	9833	dyre	1000	fobber_v1	298	tofsee	20
emotet	286816	murofet	8560	chinad	1000	tempedreve	204	blackhole	2
rovnix	179980	necurs	8192	vawtrak	812	pykspa_v2_real	200	xshellghost	1
tinba	94138	symmi	4256	pykspa_v2_fake	800	padcrypt	168	madmax	1
pykspa_v1	44647	shifu	2547	dircrypt	765	bamital	104	ccleaner	1
simda	23837	suppobox	2302	conficker	493	gspy	100
ramnit	18735	qadars	2000	matsnu	465	vidro	100
gameover	12000	locky	1149	nymaim	450	proslikefan	100
ranbyus	9845	cryptolocker	1000	fobber_v2	299	tinynuke	32

Table 12

Detailed description of the Bambenek Consulting dataset

DGA type	Number	DGA type	Number	DGA type	Number	DGA type	Number
banjori	439223	pykspa	14215	ramdo	2000	symmi	384
tinba	66688	shiotob/urlzone	12521	pushdo	1680	corebot	280
Post	66000	locky	8028	suppobox	1014	tempedreve	249
ramnit	56174	dyre	7998	Volatile	996	beebone	210
qakbot	40000	kraken	6958	dircrypt	720	hesperbot	192
necurs	32768	Cryptolocker	6000	virut	600	bedep	178
murofet	28520	nymaim	6000	fobber	600	cryptowall	94
ranbyus	26040	shifu	2331	padcrypt	576	matsnu	48
simda	14755	P2P	2000	geodo	576

6 Conclusion and future works

In this paper, we have proposed a model based on the NCM algorithm for classifying benign domains and DGA domains. Experiments on three data sets (Alexa, Bambenek Consulting and 360 Lab) show that our model obtains better results than recent classification methods, reaching the highest Accuracy and F1-Score. The computational time of the model is reduced compared to the FSVM model, while the Accuracy and F1-Score indicators are similar. In addition to higher accuracy, our model can also detect noise elements and identify neutral, exceptional cases. However, the model still needs additional testing for cases where the elements are described by other advanced fuzzy sets [7, 48], and the performance of the proposed method needs to be compared with other machine learning methods [6, 47]. This will be our next research direction.

Acknowledgment

This work was supported by the Domestic Master/PhD Scholarship Programme of VinGroup Innovation.

References

1. 360 Lab DGA Domains: https://data.netlab.360.com/feeds/dga/dga.txt.

2. Alexa Top Sites-Up-to-date lists of the top sites on the web: https://aws.amazon.com/alexa-top-sites/.

3. Bambenek Consulting provided malicious algorithmically-generated domains: http://osint.bambenekconsulting.com/feeds/DGA feed.txt.

4. Yu B., Gray D.L., Pan, De Cock and Nascimento A.C., Inline DGA detection with deep networks, in: IEEE International Conference on Data Mining Workshops (ICDMW), (2017), pp. 683–692.

5. Yu B., Pan, Hu, Nascimento and Cock M.D., Character Level Based Detection of DGA Domain Names, (2018).

6. Livadas, Walsh, Lapsley and Strayer, Using machine learning techniques to identify botnet traffic, Proceedings 2006 31st IEEE Conference on Local Computer Networks, (2006), pp. 967–974.

7. Dey, Pal and Long H.V., Fuzzy minimum spanning tree with interval type 2 fuzzy arc length: formulation and a new genetic algorithm, Soft Computing, 1–12.

8. Stalmans, A Framework for DNS-Based Detection and Mitigation of Malware Infections on a Network, Information Security South Africa Conference, (2011).

9. Davuth and Kim S.R., Classification of Malicious Domain Names using Support Vector Machine and Bi-gram Method, International Journal of Security and Its Applications 7(1) (2013).

10. Smarandache, Neutrosophy: Neutrosophic probability, set, and logic, American Research Press, Rehoboth, (1998).

11. Smarandache, A Unifying Field in Logics Neutrosophic Logic. Neutrosophy, Neutrosophic Set, Neutrosophic Probability, third ed., American Research Press, (2003).

12. Gu G., Perdisci, Zhang and Lee, BotMiner: clustering analysis of network traffic for protocol- and structure-independent botnet detection, in: Proceedings of the 17th USENIX Security Symposium (Security’08), (2008).

13. Gu G., Zhang and Lee, BotSniffer: Detecting botnet command and control channels in network traffic, in: Proceedings of the 15th Annual Network and Distributed System Security Symposium (NDSS’08), (2008).

14. Cheng H.D. and Guo, A new neutrosophic approach to image thresholding, New Math Nat Comput 499(3) (2009), 291–308.

15. Cohen, Cohen, West S.G. and Aiken L.S., Applied multiple regression/correlation analysis for the behavioral sciences, Psychology Press 3 (2006).

16. Bezdek J.C., Ehrlich and Full, FCM: The fuzzy c-means clustering algorithm, Computers & Geosciences 10 (1984), 191–203.

17. Gardiner and Nagaraja, On the security of machine learning in malware c&c detection: a survey, ACM Computing Surveys 49 (2016), 59.

18. Jha, Kumar, Priyadarshini, Smarandache and Long H.V., Neutrosophic image segmentation with dice coefficients, Measurement 134 (2019), 762–772.

19. Nazario and Holz, As the Net Churns: Fast-Flux Botnet Observations, 3rd International Conference on Malicious and Unwanted Software (MALWARE), (2008).

20. Saxe and Berlin, A character-level convolutional neural network with embeddings for detecting malicious URLs, Cryptography and Security, (2017).

21. Ye J. and Smarandache, Similarity measure of refined single-valued neutrosophic sets and its multicriteria decision making method, Neutrosophic Sets and Systems 12 (2016), 41–44.

22. Ye J. and Zhang Q.S., Single valued neutrosophic similarity measures for multiple attribute decision making, Neutrosophic Sets and Systems 2 (2014), 48–54.

23. Ye J., Clustering methods using distance-based similarity measures of single-valued Neutrosophic sets, Journal of Intelligent Systems 23 (2014), 379–389.

24. Woodbridge, Anderson H.S., Ahuja and Grant, Predicting domain generation algorithms with long short-term memory networks, Cryptography and Security, (2016).

25. Jayadeva and Khemchandni S.C., Twin support vector machines for pattern classification, IEEE Trans Pattern Anal Mach Intell 29 (2007), 905–910.

26. Long H.V., Ali, Khan and Tu D.N., A novel approach for fuzzy clustering based on neutrosophic association matrix, Computers & Industrial Engineering 127 (2019), 687–697.

27. Antonakakis, Perdisci, Nadji, Vasiloglou II, Abu-nimeh, Lee and Dagon, From Throw-Away Traffic to Bots: Detecting the Rise of DGA-Based Malware, in: 21st USENIX Security Symposium, (2012).

28. Yang M.S. and Tsai H.S., A Gaussian kernel-based fuzzy c-means algorithm with a spatial bias correction, Pattern Recognit Lett 29(12) (2008), 1713–1725.

29. Ménard, Demko and Loonis, The fuzzy c+2 means: solving the ambiguity rejection in clustering, Pattern Recognition 33 (2000), 1219–1237.

30. David M.W., Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation, Journal of Machine Learning Technologies 2 (2011), 37–63.

31. Pal N.R., Pal and Bezdek J.C., A mixed c-means clustering model, in: Proceedings of the Sixth IEEE International Conference on Fuzzy Systems, Barcelona, (1997), 11–21.

32. Şahin, Neutrosophic Hierarchical Clustering Algoritms, Neutrosophic Sets and Systems 2 (2014).

33. Rajalakshmi, Ramraj and Ramesh Kannan, Transfer Learning Approach for Identification of Malicious Domain Names.

34. Son N.T.K., Dong N.P. and Long H.V., Towards granular calculus of single-valued neutrosophic functions under granular computing, Multimedia Tools and Applications (2019), 1–37.

35. Son N.T.K., Dong N.P., Long H.V. and Khastan, Linear quadratic regulator problem governed by granular neutrosophic fractional differential equations, ISA Transactions (2019).

36. Schiavoni, Maggi, Cavallaro and Zanero, Phoenix: DGA Based Botnet Tracking and Intelligence, International Conference on Detection of Intrusions and Malware and Vulnerability Assessment, DIMVA (2014), pp. 192–211.

37. Yadav, Reddy A.K.K., Reddy A.N. and Ranjan, Detecting algorithmically generated malicious domain names, Proceedings of the 10th Annual Conference on Internet Measurement, IMC ’10, (2010), pp. 48–61.

38. Yadav, Reddy A.K.K., Reddy and Ranjan, Detecting algorithmically generated domain-flux attacks with dns traffic analysis, IEEE/ACM TON 20 (2012).

39. Wang S.D. and Lin C.F., Fuzzy support vector machines, IEEE Transactions on Neural Networks 13 (2002).

40. Schuppen, Teubert and Herrmann, FANCI: Feature-based Automated NXDomain Classification and Intelligence, in: 27th USENIX Security Symposium, (2018).

41. Tong and Nguyen, A method for detecting DGA botnet based on semantic and cluster analysis, Proceedings of the Seventh Symposium on Information and Communication Technology, ACM, (2016), 272–277.

42. Vinayakumar, Soman and Poornachandran, Detecting malicious domain names using deep learning approaches at scale, J Intell Fuzzy Syst 34(3) (2018), 1355–1367.

43. Lu W., Rammidi and Ghorbani A.A., Clustering botnet communication traffic based on n-gram feature selection, Computer Communications 34 (2011), 502–514.

44. Zhen, Zhongtian and Zhang, A Detection Scheme for DGA Domain Names Based on SVM, International Conference on Mathematics, Modelling, Simulation and Algorithms, (2018).

45. Jiang, Yi and Lu J.C., Fuzzy SVM with a new fuzzy membership function, Neural Comput & Applic 15 (2006), 268–276.

46. Akbulut, Sengur, Guo and Polat, KNCM: Kernel Neutrosophic c-Means Clustering, Applied Soft Computing 52 (2017), 714–724.

47. Tang, Deep learning using linear support vector machines, arXiv preprint arXiv:1306.0239, (2013).

48. Guo and Sengur, NCM: Neutrosophic c-means clustering algorithm, Pattern Recognit 48 (2015), 2710–2724.
