Abstract
Visualization-based malware detection is receiving increasing attention as a way to detect sophisticated malware that traditional antivirus software may miss. The approach creates a visual representation of memory or portable executable (PE) files. However, most current visualization-based malware classification models rely on convolutional neural networks (CNNs) rather than vision transformers (ViTs), even though ViTs achieve higher performance and capture the spatial representation of malware. Malware classification with vision transformers therefore deserves further study. This paper proposes a multi-variant vision transformer-based malware image classification model using multi-criteria decision-making. The proposed method employs multi-variant transformer encoders to produce different sets of visual representation embeddings for a single malware image. The proposed architecture contains five steps: (1) patch extraction and embedding, (2) positional encoding, (3) multi-variant transformer encoders, (4) classification, and (5) decision-making. The transformer encoder variants are transfer learning-based models, i.e., they were originally trained on the ImageNet dataset. Moreover, the proposed malware classifier employs MEREC-VIKOR, a hybrid standard evaluation approach that combines multiple inconsistent performance metrics. The performance of the transformer encoder variants is assessed both on individual malware families and across the entire set of malware families within two datasets, MalImg and Microsoft BIG, achieving overall accuracies of 97.64% and 98.92%, respectively. Although the proposed method achieves high performance, the metrics are inconsistent across some malware families. The standard evaluation metrics Q, R, and U show that TE3 outperforms the TE1, TE2, and TE4 variants, achieving minimal values equal to 0. Finally, the proposed architecture demonstrates performance comparable to state-of-the-art CNN-based approaches.
Introduction
Malware, short for “malicious software”, refers to any software intentionally created to inflict damage on computer systems, networks, cloud environments, and similar targets. Malware can manifest in various forms, such as adware, spyware, ransomware, trojans, viruses, and worms [18]. Recent instances of malware include Trickbot, Emotet, Ryuk ransomware, Conti ransomware, LokiBot, and Azorult. Malware detection methods are used to identify malicious software and prevent it from infecting computers, networks, and other digital devices. Detection techniques fall into two categories: signature-based and anomaly-based. Signature-based techniques compare the digital signatures of known malware with the code on the system to find a match, whereas anomaly-based techniques use machine learning to monitor a program’s behaviour and detect anomalies that may indicate malicious activity [3]. However, signature-based techniques have a significant limitation: they cannot detect new or unknown malware threats that do not match any existing signature in the database. Attackers can therefore easily bypass these techniques using new or modified malware code that security experts have not yet recognized [3]. Static analysis and dynamic analysis are two techniques for malware analysis. Static analysis is used as a first step to detect suspicious activity; it examines the code and structure of the malware without executing it. This can be done using tools such as disassemblers and decompilers to reverse engineer the code and understand its behaviour. The drawbacks of static analysis are the difficulty of identifying obfuscated code and the inability to analyze encrypted or polymorphic malware. Dynamic analysis, on the other hand, is used as a second step to detect suspicious activity; it runs the malware in a controlled environment and observes its behaviour.
This can be done using virtual machines or sandbox environments that isolate the malware from the rest of the system. By monitoring the malware’s behaviour, analysts can identify its malicious activities, such as attempts to steal data or launch attacks on other systems [17]. As dynamic malware analysis relies on ML-based techniques, improper feature engineering and false positives could adversely affect the accuracy of malware classification leading to security breaches or unnecessary alerts [17, 25].
Motivation
In recent studies, visualization-based malware detection has received increasing attention for detecting sophisticated malware that traditional antivirus software may miss [17]. The approach creates a visual representation of memory or portable executable (PE) files and then identifies fundamental characteristics of malware code that can be captured in that representation. Based on these characteristics, DL models can classify the code as either benign or malicious. Visualization-based malware detection is a dynamic technique that can detect unknown and obfuscated malware variants without the need for feature engineering [25]. Convolutional neural networks (CNNs) are the primary dynamic technique used in most recent studies on visualization-based malware detection [18, 34]. CNNs examine the visual representation of malware code and detect malicious patterns by extracting features from the input image. Nevertheless, CNNs have certain limitations in detecting minute distinctions among different malware versions [34]. Consequently, they may fail to detect slight modifications in malicious code that could affect the malware’s behaviour [34].
Although ViT [12] outperforms CNNs in image classification [7, 26] and maintains the spatial representation of malware [28], few studies have addressed malware classification using a vision transformer [4].
Research Objectives
This paper proposes a multi-variant vision transformer-based malware image classification model using multi-criteria decision-making. The key contributions of the proposed method can be outlined as follows. The proposed architecture is a visualization-based method for detecting obfuscated and packed malware that may be hidden within legitimate software. This approach eliminates the need for feature engineering or domain experts. The proposed method employs multi-variant transformer encoders to produce different sets of visual representation embeddings for a single malware image. These transformer encoders are based on transfer learning, as they were initially trained on the large-scale ImageNet dataset. The proposed malware classification model adopts a hybrid standard evaluation approach, known as MEREC-VIKOR, which combines various evaluation criteria to provide a comprehensive and reliable assessment of the classifier’s performance. The performance of the transformer encoder variants is assessed both on individual malware families and across the entire set of malware families within two datasets, MalImg and Microsoft BIG. Furthermore, the proposed classifier is compared with other state-of-the-art malware classification approaches.
Organization
The subsequent sections of the paper are structured as follows. Section 2 presents a literature review on malware classification models. Section 3 discusses the architecture and the phases of the proposed visualization-based malware classification method. Section 4 discusses the method’s performance on individual malware families and across the entire set of malware families within two datasets, MalImg and Microsoft BIG, and presents a comparative analysis with other state-of-the-art approaches. Section 5 concludes the paper and outlines its limitations and future scope.
Related work
Static methods for detecting malware identify malicious software without running it. They analyze the malware code and other static characteristics such as signatures, file headers, code structure, and syntax. Liangboonprakong et al. [20] use N-gram sequential pattern features to classify malware by disassembling the binary executable into a hexadecimal string and looking for frequently occurring sequences. Avdiienko et al. [2] use data flow analysis to detect malicious Android apps by mining the data flow of benign Android apps using FlowDroid. Narouei et al. [24] propose a technique that can detect malware accurately and resist code obfuscation or injection into legitimate software. Naik et al. [22] use fuzzy and import hashing techniques to improve the accuracy of malware detection. Although static malware detection techniques are useful for performance analysis, they have limitations, such as being vulnerable to code obfuscation and to encrypted or polymorphic malware.
Dynamic methods for detecting malware detect malicious code while it is running. These methods analyze the code’s behaviour and collect behavioural features such as API calls, system calls, and the status of resources, which help to identify whether the code is malicious or benign [17]. Li et al. [1] developed a DL-based framework for detecting malware. First, the software behaviour is represented by embedding several APIs; then, the semantic features of API calls are represented. Finally, a Bi-LSTM is used to extract the relationship among APIs. Li et al. [19] introduced the DMalNet approach for dynamic malware detection, which extracts semantic features from API arguments and names using a hybrid encoder and then represents the relationship between API calls in a graph structure. While detection approaches using API calls have shown high accuracy in detecting malware, their complexity makes them difficult to deploy in real-time environments. As a result, recent studies have proposed hardware-assisted detection approaches to enhance the performance of detection models or frameworks [33]. Chen et al. [9] proposed a DL-based malware detector based on traces of the target software’s control flow. Processors that support Intel Processor Trace are employed to minimize the overhead of execution tracing. Tian et al. [33] introduced a runtime detection technique, MDCHD, which examines processor traces to collect the target software’s control flows. Unfortunately, dynamic malware detection techniques are not scalable, as they require significant resources and are specific to the environment in which they are executed [17]. In addition, improper feature engineering and false positives could adversely affect the performance of these models [17].
Recent research has shown that visualization-based techniques are gaining popularity for detecting sophisticated malware that traditional techniques fail to detect [17]. These methods are cost-effective, accurate, and require fewer human resources, as they eliminate the need for feature extraction [25]. Falana et al. [13] presented a visualization-based method for malware detection that uses an ensemble method for feature extraction comprising Deep-GAN and Deep-CNN. The features are extracted from RGB images obtained from malware binary files. Vasan et al. [34] introduced a malware classification approach, IMCFN, that generates an RGB visual representation of malware binary sequences and then uses a fine-tuned CNN model to classify them. Kumar et al. [17] suggested a malware classifier (DTMIC) based on a pre-trained CNN. DTMIC produces grayscale visual representations of PEs, which are then classified into their respective malware families. Makandar et al. [21] create a texture feature vector using multi-resolution wavelets based on the discrete wavelet transform, GIST, the Gabor wavelet, etc. A multi-class SVM is then used to classify the malware. Landman et al. [18] developed a framework named Deep-Hook for detecting unknown malware in a Linux-based cloud. Deep-Hook produces a visual representation of the memory dump sequences of a VM, captured while it is operating. Narayanan et al. [23] produce a visual representation of malware and viruses and use PCA for feature extraction. The malware is then classified using a hybrid model. Tian et al. [33] introduced a visualization approach called MDCHD for detecting malware in VM environments. MDCHD generates an RGB visual representation of the target software’s control flow, which is captured at runtime using Intel Processor Trace (IPT). A CNN model is then used to detect the malware.
According to recent studies, Convolutional Neural Network (CNN) is a powerful tool frequently utilized in malware detection. However, CNNs may have difficulty distinguishing small differences and subtle changes between malware types, which leads to misclassified malware [34]. Many recent studies have employed attention mechanisms in conjunction with CNN models to improve their performance. Malbert, an approach for malware detection, was proposed by Xu et al. [36] in which a pre-trained DL model consists of 12 transformers, each comprising an attention layer followed by a DNN. Zhang et al. [37] introduced a static analysis method for ransomware detection using self-attention-based CNN and N-gram opcodes. Dosovitskiy et al. [12] introduced a vision transformer (ViT) approach, which employs an attention mechanism to classify images. ViT has been successfully employed in various domains, demonstrating superior performance compared to other methods. Xu et al. [35] used ViT to construct a framework for multi-task classification, which analyzes MR images and simultaneously predicts four glioma molecular expressions. Haurum et al. [15] classified sewer defects using a multi-scale hybrid ViT approach. Okolo et al. [26] proposed an improved ViT architecture called IEViT to classify chest X-ray images.
Although ViT [12] outperforms CNNs in image classification [7, 26] and maintains the spatial representation of malware [28, 31], few studies have addressed malware classification using a vision transformer. SHERLOCK [29] is a method for Android malware detection presented by Seneviratne et al. It uses a self-supervised ViT that has been trained on a sizable corpus of Android application (APK) files. SHERLOCK determines whether an APK file contains malware by capturing the representations of APK files. Park et al. proposed an improved ViT [27] for classifying malware. This method uses multiple patch encodings to capture positional information of both global and local characteristics. Belal et al. [4] proposed the butterfly vision transformer, which relies on global-local attention for visualization-based malware classification and detection. This paper proposes and analyzes the performance of a multi-variant vision transformer-based malware image classification model using multi-criteria decision-making.
Methodology
Generating grayscale malware images
In our approach, the binary sequences associated with Portable Executable (PE) files are transformed into grayscale malware images without the need for expert knowledge or feature engineering. These grayscale images serve as inputs to the proposed model, enabling malware classification. The malware visualization process is outlined in Algorithm 1. Initially, the binary PE import sequence is divided into s octets, each representing a pixel in the gray malware image. The binary pixel values are then converted into 8-bit unsigned integers and normalized to the range [0, 1]. The normalized pixel values are arranged into a vector, and this vector is reshaped into a 2D gray malware image with dimensions (h×w) and a single channel.
Algorithm 1
1:
2:
3: N ← k/8
4: M_1D ← [ ]
5:
6: R_j ← M_B[j·8+1], …, M_B[j·8+8]
7:
8:
9:
10: M_1D ← [
11: [M_2D]_{h×w} ← [M_1D]_{N×1}, x ∈ ℝ^{h×w}, M_1D ∈ ℝ^{N×1}
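The visualization steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `bytes_to_gray_image`, the caller-supplied `width`, and the choice to drop a trailing incomplete row are all assumptions made for the example.

```python
def bytes_to_gray_image(data: bytes, width: int) -> list[list[float]]:
    """Map each octet to one pixel in [0, 1] and reshape into rows of `width`."""
    if width <= 0:
        raise ValueError("width must be positive")
    # Each byte is already an 8-bit unsigned integer in 0..255; divide to normalize.
    pixels = [b / 255.0 for b in data]
    # Assumption: an incomplete last row is dropped rather than padded.
    height = len(pixels) // width
    return [pixels[r * width:(r + 1) * width] for r in range(height)]


# Seven bytes with width 3 give a 2 x 3 image; the seventh byte is dropped.
image = bytes_to_gray_image(bytes([0, 255, 128, 64, 32, 16, 8]), width=3)
```

In practice the width would be chosen from the file size so that h×w stays close to N; that choice is left to the caller here.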
The proposed multi-variant vision transformer-based malware image classification model
This manuscript proposes a multi-variant vision transformer-based malware image classification model using multi-criteria decision-making. The multi-variant transformer encoders are employed to produce different sets of visual representation embeddings for a single malware image. Moreover, the proposed malware classifier adopts a hybrid standard evaluation approach, known as MEREC-VIKOR, which combines various evaluation criteria to provide a comprehensive and reliable assessment of the classifier’s performance. The block diagram of the multi-variant vision transformer-based malware image classification model is illustrated in Fig. 1. The proposed architecture takes a grayscale malware image as its input and outputs the classification of that image. It contains five steps: (1) patch extraction and embedding, (2) positional encoding, (3) multi-variant transformer encoders, (4) classification, and (5) decision-making. The following sections describe each step in detail.

Multi-variants vision transformer-based malware image classification model.
This is the first step in the proposed malware classifier, which takes a grayscale malware image $M \in \mathbb{R}^{D_1 \times D_2 \times C}$ as input, where $(D_1 \times D_2)$ are the image dimensions and $C$ is the number of channels. The malware image is then divided into $K$ smaller sections, known as patches, $M^{(i)} \in \mathbb{R}^{D_3 \times D_3 \times C}$, $i = 1, \ldots, K$, using a sliding window approach.
As shown in Fig. 1, the malware image is partitioned using two patch sizes, $(D_3 \times D_3) = (16 \times 16)$ and $(32 \times 32)$, yielding $k_1$ and $k_2$ patches, respectively. The $(16 \times 16)$ patch embeddings are computed by mapping the flattened patches into 768- and 1024-dimensional vector spaces using the function $\Theta$. Similarly, the $(32 \times 32)$ patch embeddings are computed by mapping the flattened patches into 768- and 1024-dimensional vector spaces using $\Theta$.
Since the malware image patches are unordered and do not contain explicit positional information, positional encoding must be incorporated into the patch embeddings to provide spatial information to the transformer encoders and improve their ability to recognize complex visual patterns. Positional encoding is achieved by adding the spatial position embeddings of the patches, $pos^{(i)} \cdot E$, to the patch embeddings $PE^{(i)}$, as shown in Equation 3. The patch positions $pos^{(i)}$ are mapped into a $D$-dimensional vector space (the same size as the patch embeddings) to generate the position embeddings $pos^{(i)} \cdot E$.
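To make the patching step concrete, the sketch below splits a single-channel image into flattened patches and applies a linear projection. It rests on several assumptions that the text does not state: patches are non-overlapping (the window stride equals the patch size), the input is 224 × 224, and the mapping Θ is a plain linear projection. The names `extract_patches` and `embed` are hypothetical.

```python
def extract_patches(image, patch):
    """Split an H x W image (list of rows) into flattened, non-overlapping patches."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image sides must be multiples of the patch size")
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            flat = [image[top + r][left + c]
                    for r in range(patch) for c in range(patch)]
            patches.append(flat)
    return patches


def embed(flat_patch, weights):
    """Linear stand-in for the mapping function: one weight row per output dimension."""
    return [sum(w * x for w, x in zip(row, flat_patch)) for row in weights]


side = 224  # assumed input size, not taken from the paper
image = [[0.0] * side for _ in range(side)]
patches_16 = extract_patches(image, 16)
patches_32 = extract_patches(image, 32)
k1, k2 = len(patches_16), len(patches_32)  # 196 and 49 under these assumptions
```

Under these assumptions each 16 × 16 patch flattens to 256 values and each 32 × 32 patch to 1024 values, so the projection matrix has 256 or 1024 columns and 768 or 1024 rows depending on the variant.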
Then, the class embedding
As shown in figure 1, there are four groups of patches’ embeddings, each group encodes the visual features of the same malware image with different variants. Positional encoding for one group is achieved by adding the patches’ spatial position embeddings to the patches’ embeddings of the respective group.
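A small sketch of the addition step follows. It assumes one position embedding per patch, stored in a table of the same dimension as the patch embeddings; the class embedding is left out, and the numeric values are made up for illustration.

```python
def add_positions(patch_embeddings, position_embeddings):
    """Element-wise sum of each patch embedding with the embedding of its position."""
    if len(patch_embeddings) != len(position_embeddings):
        raise ValueError("one position embedding is needed per patch")
    return [
        [p + e for p, e in zip(patch_vec, pos_vec)]
        for patch_vec, pos_vec in zip(patch_embeddings, position_embeddings)
    ]


patches = [[1.0, 2.0], [1.0, 2.0]]    # two identical patches, D = 2
positions = [[0.1, 0.0], [0.0, 0.1]]  # hypothetical position table
encoded = add_positions(patches, positions)
```

The two input patches are identical, yet their encoded vectors differ, which is what lets the encoder tell apart the same byte pattern at different locations in the image.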
The transformer encoder is a key block of the ViT architecture, a deep-learning model used for image classification tasks. The transformer encoder block is stacked into l layers. The architecture of the transformer encoder is shown in Fig. 2. It consists of multi-head self-attention and a feedforward neural network, each followed by layer normalization, which helps stabilize the training process and enhances the overall performance of the model. The self-attention component uses H heads, which enables the model to capture different representations of the patch embeddings in parallel. The final attended vector is then obtained by concatenating and linearly projecting the output vectors of the heads. The feedforward neural network, which has two hidden layers, applies a non-linear transformation to the final attended vector.
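The attention computation can be illustrated with a stripped-down sketch. It assumes standard scaled dot-product attention and, for brevity, omits the learned query, key, and value projections, the final linear projection, layer normalization, and the feedforward network; each head simply attends over its own slice of the embedding. The function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention for one head over lists of vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out


def multi_head(x, heads):
    """Run each head on its own slice of the embedding and concatenate the results."""
    d = len(x[0]) // heads
    per_head = []
    for h in range(heads):
        part = [row[h * d:(h + 1) * d] for row in x]
        per_head.append(attention(part, part, part))
    return [sum((per_head[h][t] for h in range(heads)), []) for t in range(len(x))]


tokens = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
attended = multi_head(tokens, heads=2)
```

Each output vector is a weighted average of the input vectors within a head, so the output keeps the input shape (two tokens of dimension four here).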

Transformer encoder architecture.
As shown in figure 1, the proposed architecture utilizes four transformer encoders with four different variants. The variants of transformer encoders are shown in Table 1.
The variants of transformer encoders
Two of the transformer encoders, TE1 and TE2, each have 12 attention heads and an MLP size of (3072 × 768). The input data comprises a malware image partitioned in two ways, once into 16×16 patches and once into 32×32 patches. The patches are then represented as embeddings and mapped to a 768-dimensional space. The 12-layer transformer encoder processes the 16×16 patches, while the 24-layer transformer encoder processes the 32×32 patches. As shown in Fig. 2, the parameters of the multi-layer perceptron (MLP) are determined by the multi-head self-attention mechanism. Specifically, the MLP parameters in TE1 and TE2 are selected as follows: 12 (number of heads) × 256 (vector size of each head) = 3072, and 768 is the dimension of the patch embedding space. The other two transformer encoders, TE3 and TE4, each have 16 attention heads and an MLP size of (4096 × 1024). Their input is likewise a malware image partitioned into 16×16 patches and into 32×32 patches. The patches are then represented as embeddings and mapped to a 1024-dimensional space. The 12-layer transformer encoder processes the 16×16 patches, while the 24-layer transformer encoder processes the 32×32 patches. Specifically, the MLP parameters in TE3 and TE4 are selected as follows: 16 (number of heads) × 256 (vector size of each head) = 4096, and 1024 is the dimension of the patch embedding space.
The transformer encoders produce four different sets of visual representation embeddings for one malware image; these four embedding sets are used for malware classification.
In the proposed architecture, the transformer encoders are used as transfer learning-based models, meaning that they were previously trained on a large-scale dataset, in this case the ImageNet dataset. The parameters learned during this pre-training phase are used to initialize the parameters of the four transformer encoder variants. The MLP head classifier is then fine-tuned to adapt the model to the specific task, i.e., malware classification. The fine-tuning process updates the parameters of the MLP head classifier while keeping the parameters of the transformer encoders fixed. This approach saves considerable computational resources and time, as the model does not have to be trained from scratch. Additionally, it does not require a large number of training samples, and it mitigates the overfitting issue.
In the proposed architecture, four MLP head classifiers process the four different visual representation embeddings of one malware image. Each MLP head classifier is trained on the malware image training samples and evaluated on the malware image testing samples in terms of accuracy, F-score, recall, and precision. Two malware datasets, MalImg and Microsoft BIG, are used to train and evaluate the proposed architecture. Each MLP classifier has two hidden layers with (1024, 512) units and an output layer with x units, where x denotes the number of classes in the dataset: x = 25 for MalImg and x = 9 for Microsoft BIG.
During training, the proposed malware classifier is optimized using the following hyperparameters: a batch size of 16 and the Adam optimizer with a learning rate of $10^{-4}$. This set of hyperparameters was identified experimentally as yielding the best performance of the malware classifier. The loss function used in the proposed architecture is the cross-entropy loss, which is given by Equation 5.
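For illustration, the loss can be computed as below. The sketch assumes the standard categorical cross-entropy with integer class labels and softmax probabilities, averaged over a batch; the small `eps` guard and the function names are additions made for the example.

```python
import math


def cross_entropy(true_class, probabilities, eps=1e-12):
    """Negative log-probability assigned to the correct class for one sample."""
    return -math.log(max(probabilities[true_class], eps))


def mean_cross_entropy(labels, batch_probabilities):
    """Average the per-sample losses over a batch."""
    losses = [cross_entropy(y, p) for y, p in zip(labels, batch_probabilities)]
    return sum(losses) / len(losses)


# Two samples over three hypothetical classes.
loss = mean_cross_entropy([0, 2], [[0.9, 0.05, 0.05], [0.2, 0.3, 0.5]])
```

A confident correct prediction (0.9) contributes little to the loss, while a less certain one (0.5) contributes more, which is the behaviour the optimizer exploits during fine-tuning.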
Data augmentation methods are often employed in the training phase to address overfitting in transformer-based image classification models. These methods include scaling, shearing, rotating, shifting, flipping, and zooming. Applying them increases the number of training samples, which can improve the model’s performance and generalization capability. This study employed a set of data augmentation techniques to generate additional training samples from the original malware samples; Table 2 summarizes the augmentation steps used. The testing images were not augmented, to ensure a realistic evaluation of the model’s performance on the original data. The MalImg and Microsoft BIG datasets used to train the proposed architecture are imbalanced: before augmentation, some malware families have notably more or fewer images than others, as indicated in Figs. 3 and 4. The augmentation techniques in Table 2 were therefore applied to increase the number of training images in the under-represented families and reduce the impact of the imbalance, with the ultimate goal of a balanced dataset containing an equal number of samples in each malware family. The augmentation details for MalImg and Microsoft BIG are given in Tables 3 and 4.
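The balancing goal can be expressed as a simple calculation of how many augmented images each family needs. The sketch assumes the target is the size of the largest family, which is one plausible reading; the family names and counts are made up, and `augmentation_plan` is a hypothetical helper.

```python
def augmentation_plan(class_counts):
    """Augmented images needed per class so every class matches the largest one."""
    target = max(class_counts.values())
    return {name: target - count for name, count in class_counts.items()}


counts = {"FamilyA": 1200, "FamilyB": 300, "FamilyC": 80}  # made-up counts
plan = augmentation_plan(counts)
```

With these counts, the largest family needs no augmentation and the smallest needs the most, so the augmentation budget concentrates on the rarest families.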

Malware families in MalImg Dataset.

Malware families in Microsoft BIG Dataset.
The augmentation techniques used on Malware Images
Augmentation details of Malimg
Augmentation details of Microsoft BIG
The objective of this stage in the proposed architecture is to determine which of the four transformer encoder variants is best suited for malware classification. The transformer-based malware classifiers generate multiple performance metrics, such as precision, F1-score, accuracy, and recall. Evaluating the model on a single metric is insufficient to satisfy all the criteria, so a standard evaluation approach is required. Multiple Criteria Decision Making (MCDM) is an approach used to evaluate and rank alternatives, such as the multi-variant transformer-based malware classifiers, against multiple criteria to determine the best one.
Firstly, the decision matrix A is constructed by crossing the multi-variant transformer-based malware classifiers with the performance criteria, as shown in Equation 6. The decision matrix A consists of four rows, each corresponding to one of the transformer variants listed in Table 1. The four columns of A correspond to the performance metrics of the classifiers, namely precision, F1-score, accuracy, and recall.
Secondly, to determine the weights of the criteria, an objective weighting strategy called MEREC is employed [5]. Each weight assigned to a criterion represents its impact on the overall performance: criteria with a greater impact on performance are given higher weights, indicating their importance in the evaluation process. The MEREC method determines the weights of the criteria through the following procedure:
1- The performance criteria values in the decision matrix are normalized using Equation 7
2- The performance of each transformer-based malware classifier variant is calculated using Equation 8.
3- The performance of transformer-based malware classifier variants is calculated by considering the impact of eliminating each criterion on their performance using Equation 9.
4- The sum of absolute deviations is calculated using Equation 10.
5- The weights of criteria are determined using Equation 11.
where $W_j$ denotes the weight of the j-th criterion.
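The five MEREC steps can be sketched as follows. The code follows the commonly published MEREC formulation for benefit-type criteria (all four metrics here are "higher is better"); that the paper's Equations 7 to 11 match it term for term is an assumption, and the function name and sample matrix are illustrative only.

```python
import math


def merec_weights(matrix):
    """MEREC weights for a decision matrix of positive benefit criteria.

    Rows are alternatives (classifier variants); columns are criteria (metrics).
    """
    m = len(matrix[0])
    col_min = [min(row[j] for row in matrix) for j in range(m)]
    # Step 1: normalize benefit criteria as column minimum over each entry.
    norm = [[col_min[j] / row[j] for j in range(m)] for row in matrix]
    logs = [[abs(math.log(v)) for v in row] for row in norm]
    # Step 2: overall performance of each alternative.
    overall = [math.log(1 + sum(row) / m) for row in logs]
    effects = []
    for j in range(m):
        deviation = 0.0
        for i, row in enumerate(logs):
            # Step 3: performance of alternative i with criterion j removed.
            without_j = math.log(1 + (sum(row) - row[j]) / m)
            # Step 4: accumulate the absolute deviation caused by the removal.
            deviation += abs(without_j - overall[i])
        effects.append(deviation)
    # Step 5: normalize the removal effects into weights.
    total = sum(effects)
    return [e / total for e in effects]


# Hypothetical scores for four variants on four metrics.
scores = [
    [0.95, 0.94, 0.96, 0.93],
    [0.96, 0.95, 0.95, 0.95],
    [0.98, 0.97, 0.98, 0.97],
    [0.97, 0.97, 0.96, 0.98],
]
weights = merec_weights(scores)
```

A criterion whose values barely differ between variants has little removal effect and therefore receives a small weight; a criterion that is constant across all variants receives a weight of zero.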
Thirdly, in order to rank the transformer-based malware classifier variants, an MCDM-based ranking method called VIKOR is employed [5]. The following steps demonstrate the VIKOR method’s procedure:
1- The best
2- The weighted decision matrix is computed using Equation 13.
3- For each transformer-based malware classifier variant, the values $U_i$, $R_i$, and $Q_i$ are calculated as shown in Equations 14, 15, and 16, respectively.
4- Transformer-based malware classifiers are assessed using three standard evaluation lists: $R_i$, $U_i$, and $Q_i$. These lists are sorted in ascending order, forming a basis for ranking the classifiers. The best classifier is the one that exhibits the lowest Q value, indicative of superior performance in malware classification, as shown in Equation 17.
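The ranking procedure can be sketched as below, using the usual VIKOR formulation for benefit criteria. The compromise parameter `v = 0.5` is a common default and an assumption here, as are the function name and the sample values; U denotes the group utility and R the individual regret, following the notation of the steps above.

```python
def vikor(matrix, weights, v=0.5):
    """Return (U, R, Q) lists for benefit criteria; a lower Q is better."""
    m = len(weights)
    # Step 1: best and worst value of each criterion across alternatives.
    best = [max(row[j] for row in matrix) for j in range(m)]
    worst = [min(row[j] for row in matrix) for j in range(m)]
    U, R = [], []
    for row in matrix:
        terms = []
        for j in range(m):
            span = best[j] - worst[j]
            # Step 2: weighted, normalized distance from the best value.
            terms.append(weights[j] * (best[j] - row[j]) / span if span else 0.0)
        # Step 3: group utility (sum) and individual regret (maximum).
        U.append(sum(terms))
        R.append(max(terms))

    def scale(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]

    Q = [v * u + (1 - v) * r for u, r in zip(scale(U), scale(R))]
    return U, R, Q


# Three hypothetical variants scored on two equally weighted metrics.
U, R, Q = vikor([[0.90, 0.91], [0.95, 0.96], [0.97, 0.98]], [0.5, 0.5])
# Step 4: rank alternatives by ascending Q.
ranking = sorted(range(len(Q)), key=lambda i: Q[i])
```

An alternative that attains the best value on every criterion gets U, R, and Q all equal to 0, which is why a dominant variant shows minimal values of zero on all three lists.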
The performance evaluation of the proposed architecture focuses on its application to visualization-based malware classification. The performance of the transformer encoder variants is assessed both on individual malware families and across the entire set of malware families within two datasets, MalImg and Microsoft BIG. The performance of the four transformer encoder variants is evaluated in terms of multi-inconsistent metrics, namely accuracy (Acc), F1-score (F1), precision (Pr), and recall (Re), as well as the standard evaluation metrics (U, R, and Q). The multi-inconsistent metrics are defined in Equations 18, 19, 20, and 21, whereas the standard evaluation metrics are defined in Equations 14, 15, and 16.
Regarding the datasets, the MalImg dataset comprises a collection of more than 9,000 grayscale images, organized into 25 different malware families, as detailed in Fig. 3. The Microsoft BIG dataset comprises over 10,000 malware byte files, represented in hexadecimal format. These files are distributed among 9 different malware families, as detailed in Fig. 4. The samples in each dataset are split in an 80:20 ratio, with 80% allocated for training and the remaining 20% for testing during the experiments. Figure 5 shows some malware samples from the MalImg and Microsoft BIG datasets.
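The four performance metrics can be computed from confusion counts as in the sketch below. It assumes the standard one-vs-rest definitions for a single malware family (true and false positives and negatives); the function name and the counts are illustrative.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from one-vs-rest confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"Acc": accuracy, "Pr": precision, "Re": recall, "F1": f1}


# Made-up counts for one family in a 1,000-image test set.
metrics = classification_metrics(tp=45, fp=5, fn=15, tn=935)
```

With these counts the accuracy is 0.98 while the recall is only 0.75, a small instance of the inconsistency between metrics that motivates the multi-criteria ranking.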

Samples of malware families in MalImg and Microsoft BIG datasets.
Furthermore, a comparison has been made between the proposed architecture and other state-of-the-art malware classification approaches.
The proposed architecture was implemented on a platform with the following capabilities: Graphics card: an NVIDIA GeForce GTX 1080 Ti with CUDA 11.2; CPU: Intel Xeon 4 GHz; operating system: Windows 10 (64-bit); RAM: 64 GB; IDE: Spyder (Python 3.9.12); Libraries: vit_keras, TensorFlow, and Keras.
In this section, the performance of the four transformer encoder variants on each malware family in the two datasets, MalImg and Microsoft BIG, is analyzed in terms of multi-inconsistent and standard evaluation metrics. Figure 6 shows the results of the performance metrics, i.e., accuracy, F1-score, precision, and recall, for the four transformer encoder variants on each of the 25 malware families in the MalImg dataset. Figure 7 shows the same performance metrics for the four transformer encoder variants on each of the 9 malware families in the Microsoft BIG dataset. As shown in Figs. 6 and 7, the performance metrics are inconsistent across some malware families. This inconsistency implies that certain variants of the transformer encoder achieve higher performance on some metrics without necessarily excelling on others. Consequently, determining the best variant from these multi-inconsistent metrics is challenging, whereas it would be straightforward if a single metric were used for evaluation. For the C2LOP.gen!g malware family in MalImg, TE4 achieves the highest accuracy while having lower precision and F1-score than TE3. For the simda malware family in Microsoft BIG, TE4 achieves the highest recall while having lower precision and accuracy than TE3.

Multi-variant transformer encoders performance analysis on Malware family-wise in MalImg dataset.

Multi-variant transformer encoders performance analysis on Malware family-wise in Microsoft BIG dataset.
To overcome this challenge, the MEREC-VIKOR technique is used to select the best transformer encoder variant based on a standard multi-criteria evaluation system. MEREC assigns weights to the inconsistent metrics, and VIKOR ranks the transformer encoder variants to determine the optimal one. Table 5 shows the weights of the four performance metrics for each of the 25 malware families in the MalImg dataset. Table 6 shows the weights of the four performance metrics for each of the 9 malware families in the Microsoft BIG dataset. The weights W1, W2, W3, and W4 represent the importance assigned to accuracy, F-score, precision, and recall, respectively. The results show that the impact of the metrics varies among malware families and that no single metric can be deemed preferable. In MalImg, Dontovo.A, Allaple.L, Allaple.A, Obfuscator.AD, Skintrim.N, and Wintrim.BX prioritize accuracy as a significant metric. Adialer.C, Agent.FYI, Autorun.K, C2LOP.gen!g, Dialplatform.B, Fakerean, Instantaccess, Lolyda.AA2, Malex.gen!J, Rbot!gen, Swizzor.gen!, VB.AT, and Yuner.A prioritize precision. Alueron.gen!J, C2LOP.P, Lolyda.AA1, Lolyda.AA3, Lolyda.AT, Swizzor.gen!E, Rbot!gen, and Malex.gen!J prioritize recall. In Microsoft BIG, kelihos_ver1 and Gatak prioritize precision as a significant metric. Lollipop, Vundo, Obfuscator.ACY, Ramnit, kelihos_ver3, and simda prioritize recall. For Tracur, precision and recall have the same impact.
The metrics’ weights for each malware family in MalImg
The metrics’ weights for each malware family in Microsoft BIG
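As a rough illustration of how MEREC can derive such weights, the sketch below follows the commonly published MEREC steps: normalize the decision matrix, compute each alternative's overall performance, recompute it with one criterion removed, and weight each criterion by the total change its removal causes. It assumes all four metrics are benefit criteria with strictly positive values. The function name `merec_weights` and the example decision matrix are hypothetical and are not taken from the paper's tables.

```python
import math


def merec_weights(matrix):
    """Return one MEREC weight per criterion (column).

    matrix[i][j] > 0 is the value of benefit criterion j (e.g. accuracy,
    F1-score, precision, recall) for alternative i (e.g. TE1..TE4).
    """
    n_alt, n_crit = len(matrix), len(matrix[0])
    col_min = [min(row[j] for row in matrix) for j in range(n_crit)]
    # Benefit-type normalisation n_ij = min_k(x_kj) / x_ij, then |ln n_ij|.
    logs = [[abs(math.log(col_min[j] / row[j])) for j in range(n_crit)]
            for row in matrix]
    # Overall performance of each alternative using all criteria.
    overall = [math.log(1 + sum(r) / n_crit) for r in logs]
    effects = []
    for j in range(n_crit):
        effect = 0.0
        for i in range(n_alt):
            # Performance of alternative i with criterion j removed.
            without_j = math.log(1 + (sum(logs[i]) - logs[i][j]) / n_crit)
            effect += abs(without_j - overall[i])
        effects.append(effect)
    total = sum(effects)
    if total == 0:                      # all alternatives identical
        return [1.0 / n_crit] * n_crit
    return [e / total for e in effects]


# Hypothetical decision matrix: rows TE1..TE4, columns W1..W4 criteria.
scores = [[0.95, 0.90, 0.92, 0.88],
          [0.96, 0.93, 0.91, 0.95],
          [0.98, 0.97, 0.99, 0.95],
          [0.97, 0.94, 0.90, 0.99]]
print(merec_weights(scores))
```

Under this formulation, a metric on which all variants score identically receives zero weight, because removing it changes nothing. This is one way to see why the dominant metric can differ from family to family.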
Figure 8 shows the standard evaluation metrics, i.e., Q, R, and U, for the four transformer encoder variants on each of the 25 malware families in the MalImg dataset. Figure 9 shows the same metrics for each of the 9 malware families in the Microsoft BIG dataset. As mentioned previously, the best classifier is the one with the lowest Q value, which indicates superior malware classification performance. As shown in Fig. 8, the TE3 and TE4 variants outperform the TE1 and TE2 variants across all malware families, as substantiated by their lower Q values. Furthermore, between TE3 and TE4, TE3 performs better on all malware families except Agent.FYI, Allaple.A, Allaple.L, C2LOP.gen!g, VB.AT, and Wintrim.BX. Similarly, as shown in Fig. 9, TE3 outperforms the TE1, TE2, and TE4 variants on most malware families, as substantiated by its lower Q values.

Family-wise standard evaluation metrics of the multi-variant transformer encoders on the MalImg dataset.

Family-wise standard evaluation metrics of the multi-variant transformer encoders on the Microsoft BIG dataset.
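The Q, R, and U values discussed above can be sketched with the standard VIKOR formulation. The sketch assumes that U denotes the weighted group utility (often written S in the VIKOR literature), that R is the maximum individual regret, and that Q blends the two with a compromise coefficient v = 0.5; the paper's exact settings may differ. The function name `vikor`, the equal weights, and the example scores are hypothetical.

```python
def vikor(matrix, weights, v=0.5):
    """Return (U, R, Q) lists, one value per alternative; lower is better.

    matrix[i][j] is the value of benefit criterion j for alternative i,
    and weights[j] is the importance of criterion j (e.g. from MEREC).
    """
    n_crit = len(weights)
    best = [max(row[j] for row in matrix) for j in range(n_crit)]
    worst = [min(row[j] for row in matrix) for j in range(n_crit)]
    U, R = [], []
    for row in matrix:
        terms = []
        for j in range(n_crit):
            span = best[j] - worst[j]
            # Weighted, normalised distance from the best value of criterion j.
            gap = weights[j] * (best[j] - row[j]) / span if span else 0.0
            terms.append(gap)
        U.append(sum(terms))   # group utility
        R.append(max(terms))   # individual regret

    def scale(values):
        lo, hi = min(values), max(values)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in values]

    Q = [v * u + (1 - v) * r for u, r in zip(scale(U), scale(R))]
    return U, R, Q


# Hypothetical scores: rows TE1..TE4; columns accuracy, F1, precision, recall.
scores = [[0.95, 0.90, 0.92, 0.88],
          [0.96, 0.93, 0.91, 0.95],
          [0.98, 0.97, 0.99, 0.99],
          [0.97, 0.94, 0.90, 0.96]]
U, R, Q = vikor(scores, weights=[0.25, 0.25, 0.25, 0.25])
best_variant = min(range(len(Q)), key=Q.__getitem__)
print(Q, "best variant index:", best_variant)
```

In this toy input the third row is best on every criterion, so its U, R, and Q are all 0. This mirrors the situation in which a single variant attains the minimum value of all three standard metrics.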
In this section, the performance of the four transformer encoder variants over all malware families in the two datasets, i.e., MalImg and Microsoft BIG, is analyzed in terms of multi-inconsistent and standard evaluation metrics. Table 7 shows the performance metrics (accuracy, F1-score, precision, and recall) and the standard evaluation metrics (U, R, and Q) for the four transformer encoder variants across the entire set of 25 malware families in the MalImg dataset. Table 8 shows the same performance and standard evaluation metrics for the four variants across the entire set of 9 malware families in the Microsoft BIG dataset. As shown in Tables 7 and 8, TE3 outperforms the TE1, TE2, and TE4 variants across the entire set of malware families in both datasets, as substantiated by its lower Q value. Table 9 shows the weights of the four performance metrics across the entire set of malware families in the MalImg and Microsoft BIG datasets. Both datasets prioritize precision as a significant metric. Figures 10 and 11 show the relationship between the performance metrics and the standard evaluation metrics for the four transformer encoder variants in the MalImg and Microsoft BIG datasets, respectively.
The overall performance metrics (DM) and standard evaluation metrics of multi-variant transformer encoders for MalImg
The overall performance metrics (DM) and standard evaluation metrics of multi-variant transformer encoders for Microsoft BIG
The metrics’ weights for overall malware families in MalImg and Microsoft BIG

The relationship between the performance metrics and standard metrics for multi-variant transformer encoders in MalImg dataset. (a): TE1, (b): TE2, (c): TE3, (d): TE4.

The relationship between the performance metrics and standard metrics for multi-variant transformer encoders in Microsoft BIG dataset. (a): TE1, (b): TE2, (c): TE3, (d): TE4.
Table 10 shows a comparison between the proposed architecture and recent approaches to the malware classification task. The compared studies use the same datasets as the proposed architecture. As noted, the proposed method employs MCDM for standard evaluation, which has not been used in recent studies. Moreover, the multi-variant ViT outperforms the other ViT-based malware classifiers in terms of malware classification performance, i.e., accuracy, precision, recall, and F1-score. In addition, the multi-variant ViT outperforms the CNN-based classifiers in terms of computational efficiency, i.e., training time and prediction time.
A comparison between the proposed architecture and other state-of-the-art approaches
This paper proposes a butterfly construction-based vision transformer (B_ViT) model for visualization-based malware classification and detection. B_ViT is trained and evaluated on grayscale malware images collected from MalImg or Microsoft BIG datasets or converted from portable executable imports. B_ViT has four phases: image partitioning and patch embeddings; local attention; global attention; and training and malware classification. In the local attention phase, self-attention-based local transformer encoders, along with local positional encoding, process the input image’s patches simultaneously to capture the local representation and features of the malware image. In the global attention phase, one self-attention-based global transformer encoder, along with global positional encoding, processes the input image as one block to capture the global representation and features of the malware image. B_ViT is a transfer learning-based model: a ViT model pre-trained on the ImageNet dataset initializes the training parameters of the transformers, and B_ViT is then fine-tuned for the malware classification task. All B_ViT variants, i.e., BViT/B16, BViT/B32, BViT/L16, and BViT/L32, are evaluated on the MalImg and Microsoft BIG datasets for image malware classification and compared with the respective variants of IEViT and ViT. The comparative analysis shows that the B_ViT variants outperform the IEViT and ViT variants in visualization-based malware classification, achieving accuracies of 98.65%, 98.28%, 99.32%, and 99.11% on MalImg and 98.80%, 98.62%, 99.49%, and 99.26% on Microsoft BIG for BViT/B16, BViT/B32, BViT/L16, and BViT/L32, respectively. In addition, the B_ViT variants are evaluated using portable executable imports for image malware detection and compared with the respective variants of IEViT and ViT.
The comparative analysis shows that the B_ViT variants outperform the IEViT and ViT variants in visualization-based malware detection, achieving accuracies of 99.84%, 99.87%, 99.99%, and 99.97% for BViT/B16, BViT/B32, BViT/L16, and BViT/L32, respectively. The results show that B_ViT achieves an average F1-score improvement over IEViT and ViT of 1.21% and 2.48%, respectively. Since B_ViT is a parallel architecture, a parallel analysis of B_ViT against sequential B_ViT, IEViT, and ViT is performed. The results show that B_ViT is time-effective for malware classification and detection: the average speed-up of the B_ViT variants over the sequential B_ViT, IEViT, and ViT variants is 3.84, 2.42, and 1.81, respectively. Moreover, the analysis shows the efficiency of texture-based malware detection as well as the resilience of B_ViT to polymorphic obfuscation. The proposed malware classifier/detector is visualization-based, so it does not require domain experts for feature extraction, feature engineering, etc. Finally, the proposed method based on the B_ViT architecture outperforms recent visualization-based malware classification methods that use CNN architectures as well as ViT-based malware classifiers. The practical use of the B_ViT-based malware classifier/detector has certain limitations that should be acknowledged. Firstly, its implementation requires a high-resource platform, especially at a high degree of parallelism, because more local transformer encoders must be run to capture the local representation of malware images. Secondly, to ensure its effectiveness, the proposed method must be thoroughly tested in an on-site environment. To overcome these limitations, future work should focus on further optimization and refinement of the approach.
