Android malware detection framework based on sensitive opcodes and deep reinforcement learning

Abstract

Malware attacks are a growing problem on the Android mobile platform because of its popularity and openness. Although numerous malware detection approaches have been proposed, detection remains challenging because of the large number of constantly mutating apps. The opcode, as the most fundamental part of an Android app, resists obfuscation and Android version updates well. Because of the large number of opcodes, most opcode-based methods employ statistics-based feature selection, which disrupts the correlation and semantic information among opcodes. In this paper, we propose an Android malware detection framework based on sensitive opcodes and deep reinforcement learning. First, we extract sensitive opcode fragments based on sensitive elements and encode the features using n-grams. Next, we use deep reinforcement learning to select the optimal subset of features. Throughout opcode processing, we focus on preserving semantic information and the correlation among opcodes. Finally, our experimental results show an accuracy of 0.9670 using the 25 opcode features we obtained.

Keywords

Android malware; deep reinforcement learning; feature selection; machine learning

1 Introduction

As smartphones have become ubiquitous, the severity of Android malware has increased significantly, which poses risks to both user privacy and data security. In addition, the rapid growth and continuous variation of Android malware present serious challenges to traditional malware detection methods. Therefore, it is essential to explore and develop efficient and accurate Android malware detection methods.

In order to combat the threat of malware, researchers have proposed various methods in recent years. Extracting good features is of significant importance for detecting Android malware. Currently, feature extraction can be classified into two main categories: static analysis and dynamic analysis. Static analysis is completed by analyzing the app installation package, without running the app. For example, some studies construct detection frameworks by extracting static features such as APIs [1], permissions [2], and intents [3] from Android Package (APK) files. In contrast, dynamic analysis extracts features by running the app on real devices or emulators, including taint analysis [4], network flow extraction [5], and system call analysis [6], etc.

However, many existing dynamic analysis methods are inefficient and require substantial computational resources. While static analysis is more efficient, many static analysis techniques resist obfuscation poorly. Additionally, as Android system versions evolve, many static analysis models become obsolete. To address these problems, researchers have proposed using opcodes to detect malware. Opcodes are the instructions of the Android virtual machine language. They perform operations such as variable assignment, method invocation, and control flow transitions. Analyzing an app’s opcodes provides insight into its behaviors and execution paths, which is valuable for the analysis and detection of malware. Compared to other features such as permission requests and API calls, opcodes provide a lower-level and finer-grained representation. Because opcodes carry instruction-level information and change little across versions of the Android system, they offer better generality and adaptability.

However, due to the vast quantity of opcodes in each application, it is essential to efficiently decrease their dimensionality. Some researchers utilize filter-based approaches to perform feature selection on opcodes. Poudyal et al. [7] used term frequency–inverse document frequency (TF-IDF) for opcode dimensionality reduction. Sihag et al. [8] employed frequency-based feature selection methods to filter features. These methods ignore the correlations between features and fail to capture the semantic information among features. Moreover, using a wrapper-based feature selection approach is also impractical for high-dimensional features like opcodes, as it is difficult to exhaustively enumerate all possible feature subsets.

In this paper, we utilize a constructed sensitive API set to identify sensitive opcode sequences, which effectively reduces the dimensionality of features while preserving semantic information. Subsequently, we encode opcode features based on n-gram and employ deep reinforcement learning to select the optimal n-opcode features for classification. Deep reinforcement learning automatically selects the optimal feature subsets to identify malware by training an agent to learn and make decisions. In summary, the main contributions of this paper are as follows:

We propose a method to extract sensitive opcodes based on a constructed sensitive API set, which greatly reduces the number of opcodes while preserving semantic information and yields more representative sensitive opcode features.

Based on n-gram technology, we construct opcode features and utilize deep reinforcement learning to automatically select features, which efficiently selects a subset of features with good classification performance.

Our experimental results show that the final opcode features are effective and efficient for detection, and that they remain effective on unknown samples.

The rest of this paper is organized as follows. Section 2 introduces related work, Section 3 explains the proposed method in detail, Section 4 presents the experiments, including design, results, and evaluation, and Section 5 concludes the paper.

2 Related work

2.1 Android malware detection based on machine and deep learning

In recent years, deep learning and machine learning techniques have been widely applied in Android malware detection. Deep learning methods, such as convolutional neural networks, can avoid complex feature engineering and cope effectively with rapid updates of malicious software [9]. Naeem et al. [10] proposed a deep convolutional neural network stacked ensemble for malware threat classification, which detected potential operations within a diverse range of discrete-sized image features, enabling the identification of malware families. Amer et al. [11] relied on various static and dynamic features. They grouped correlated features into several cluster classes and trained a Long Short-Term Memory (LSTM) model using random snapshots of the newly constructed sequences of API and system call clusters. Machine learning methods, in contrast, emphasize feature construction: after careful feature engineering, the resulting features are fed into machine learning classifiers, which can yield more precise and efficient detection. SigPID [12] found that only 22 permissions are significant and detected malware with 67 machine learning algorithms using these permissions. Şahin et al. [13] combined 27 permission features with commonly used machine learning classifiers to detect malware.

In our paper, we employ deep reinforcement learning to automatically select features and input these obtained features into machine learning classifiers to achieve rapid and accurate detection.

2.2 Android malware detection based on opcodes

As a low-level feature of the Android virtual machine, opcodes contain rich semantic information and can resist obfuscation techniques. Consequently, researchers often employ them in the detection of malware. Kang et al. [14] proposed n-gram encoding of opcodes and performed feature selection using information gain on the encoded features. Niu et al. [15] constructed function call graphs based on opcodes and used LSTM for classification. Zhang et al. [16] proposed a lightweight framework based on graph theory and information theory for detecting Android malware variants. They constructed a weighted probability graph of opcodes and used information entropy to trim the graph while preserving as much information as possible. Finally, they extracted global topological features to represent malware. Pektas et al. [17] used features extracted from instruction call graphs combined with deep learning for malware detection. They applied grid search method to find the optimal parameters of the network and discovered combinations of hyperparameters that maximize the statistical metrics. Li et al. [18] detected malware by training an optimized convolutional neural network several times using raw opcode sequences extracted from decompiled Android files.

However, the aforementioned methods lose much semantic information while reducing the dimensionality of opcodes. In contrast, our method extracts sensitive opcodes with a constructed sensitive API set, which removes a large number of redundant opcodes while retaining the semantic information.

2.3 Feature selection in Android malware detection

Feature selection is an important step in Android malware detection tasks. Redundant features not only affect the effectiveness of classifiers but also consume excessive computational resources. Babaagba et al. [19] have experimentally demonstrated the significance of feature selection in malware analysis using machine learning. Currently, there are two main kinds of feature selection methods: filter-based and wrapper-based.

Filter-based feature selection methods are independent of any specific machine learning algorithm. These methods evaluate and rank features solely based on their correlation with the target variable. Visalakshi et al. [20] used an improved filter-based feature selection technique based on the K-nearest neighbors relief algorithm to detect malware. Şahin et al. [13] used eight filter-based feature selection methods to enhance machine learning algorithms and improved detection efficiency. Among them, information gain, odds ratio, Chi-square, and inverse document frequency have been used in various Android malware research. The remaining methods, namely document frequency thresholding, Acc and Acc2, M2 Method, and relevance frequency feature selection, are adopted from the field of text classification.

Wrapper-based feature selection methods embed the feature selection process into specific machine learning model training. These methods utilize an evaluation function to measure the performance of feature subsets and select the optimal feature subset through a search algorithm. Fatima et al. [21] employed a genetic algorithm for discriminative feature selection and compared the capabilities on identifying malware before and after feature selection. They reduced the feature dimensionality to less than half of the original features. Yang et al. [22] utilized recursive feature elimination with cross-validation for feature selection and employed DenseNet for classification.

However, filter-based feature selection methods only consider the correlation between features and the target variable, without consideration of the inter-feature associations. On the other hand, wrapper-based feature selection is hindered by an extensive search space, often rendering the search infeasible. Therefore, we propose a feature selection method based on deep reinforcement learning. The intelligent agent continuously adjusts its classification strategy based on feedback from the classifier, efficiently selecting the optimal feature subset while preserving the inter-feature associations.

3 Methodology

As shown in Fig. 1, our model consists of three parts: sensitive opcode extraction, feature selection based on deep reinforcement learning (FSDRL), and construction of the detection system. First, we obtain all opcodes of the APK through reverse engineering. Then, we use a constructed sensitive API list to filter the opcode sequence, where only opcode fragments that invoke APIs from the sensitive API list are considered sensitive opcodes. Next, we process the extracted opcodes using n-grams. In contrast to existing methods that rely on opcode frequency statistics, we encode whether a given n-gram opcode appears, so that the encoded features are not affected by size differences between malicious and benign samples. We then use deep reinforcement learning to select opcode features, which better preserves the relationships and semantics among opcodes. Finally, we re-encode each app using the selected features and combine the encoding with commonly used machine learning classifiers to implement malware detection.

Fig. 1

System structure.

3.1 Sensitive opcode extraction

The installation package of an Android app (.apk) is a compressed file that contains a manifest file, resource files, and Dalvik executable files (.dex). We use the Androguard tool to decompile the dex file into multiple smali files, where each smali file represents a class and includes the smali source code of all methods within that class. In the smali file, each method’s smali code starts with a .method tag and ends with an .end method tag. The segment between the start and end tags contains the instructions of the method, each consisting of an opcode and its operands.

After obtaining the smali instructions of all methods in an APK file, we need to select those that are more likely to carry malicious behavior. A single APK file can contain hundreds of thousands of smali instructions. Training a classifier on all of them not only requires more computational resources but also introduces redundant information that can interfere with the classification results. In our approach, we use the constructed sensitive APIs to identify smali code fragments that may exhibit malicious behavior or suspicious characteristics.

The constructed sensitive API list is derived from argusDroid [23]. If a method within the analyzed APK invokes an API from the sensitive API list, we consider the smali code fragment of that method to be sensitive. After extracting the sensitive smali code fragments from the APK, we discard the operands and retain only the opcodes. As shown in Fig. 2, the function calls the sensitive API “getLastKnownLocation” to obtain the location, so this segment of opcodes is considered sensitive. The corresponding opcodes for this segment are {..., invoke-virtual, const-string, invoke-virtual, move-result-object, iput-object, ...}.

Fig. 2

A sample of a sensitive opcode fragment.
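The extraction step described above can be illustrated with a minimal sketch that works on already-decompiled smali text. It is not the authors' implementation: the `SENSITIVE_APIS` set, the helper names, and the choice to match APIs by method name only are all assumptions made for illustration, and the decompilation itself (done with Androguard in the paper) is not shown.

```python
import re

# Hypothetical sensitive API names; the paper derives its list from [23].
SENSITIVE_APIS = {"getLastKnownLocation", "sendTextMessage"}


def split_methods(smali_text):
    """Yield the instruction lines of each .method ... .end method block."""
    body, inside = [], False
    for raw in smali_text.splitlines():
        line = raw.strip()
        if line.startswith(".method"):
            body, inside = [], True
        elif line.startswith(".end method"):
            if inside:
                yield body
            inside = False
        elif inside and line and not line.startswith((".", "#", ":")):
            # Skip directives (.locals), comments (#) and labels (:cond_0).
            body.append(line)


def calls_sensitive_api(instructions):
    """Return True if any invoke-* instruction targets a sensitive API name."""
    for ins in instructions:
        if ins.startswith("invoke-"):
            match = re.search(r"->([\w$<>]+)\(", ins)
            if match and match.group(1) in SENSITIVE_APIS:
                return True
    return False


def sensitive_opcode_fragments(smali_text):
    """Keep only methods that call a sensitive API; drop operands, keep opcodes."""
    fragments = []
    for instructions in split_methods(smali_text):
        if calls_sensitive_api(instructions):
            fragments.append([ins.split()[0] for ins in instructions])
    return fragments


SAMPLE = """
.method public locate()V
    .locals 2
    const-string v0, "gps"
    invoke-virtual {p0, v0}, Landroid/location/LocationManager;->getLastKnownLocation(Ljava/lang/String;)Landroid/location/Location;
    move-result-object v1
    iput-object v1, p0, Lcom/example/Main;->loc:Landroid/location/Location;
    return-void
.end method

.method public add(II)I
    add-int v0, p1, p2
    return v0
.end method
"""

print(sensitive_opcode_fragments(SAMPLE))
```

In this toy input only the first method is kept, because the second one never invokes an API from the assumed list. A production version would likely match on the full class and method signature rather than the bare method name.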

If we directly extract n-opcode features, the feature space will become very large due to the numerous kinds of opcodes. Therefore, before processing the opcodes, we classify them based on their functionalities. The classified result is shown in Table 1. The first seven categories in the table are opcodes classified according to their core functionalities, while the last category consists of opcodes that serve purposes other than the core functionalities, such as linking and initialization in programs.

Table 1

Opcode type mapping table

Type	Opcode
M	move, move/from16, move/16, move-wide...
R	return, return-void, return-wide, return-object
G	goto, goto/16, goto/32
I	if-eq, if-ne, if-lt, if-ge...
T	aget, aget-wide, aget-object...
P	aput, aput-wide, aput-object, aput-boolean...
V	invoke-virtual, invoke-super, invoke-direct...
X	other opcodes
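A mapping in the spirit of Table 1 could be sketched as follows. Because the table elides part of each category, the prefix rules below are an assumption inferred from the listed examples, and the function names are hypothetical.

```python
# Prefix rules inferred from the examples in Table 1. The table elides part of
# each category, so the exact membership of each type is an assumption.
PREFIX_TO_TYPE = [
    ("move", "M"),
    ("return", "R"),
    ("goto", "G"),
    ("if-", "I"),
    ("aget", "T"),
    ("aput", "P"),
    ("invoke-", "V"),
]


def opcode_type(opcode):
    """Map one opcode to its type symbol; anything unmatched falls into X."""
    for prefix, symbol in PREFIX_TO_TYPE:
        if opcode.startswith(prefix):
            return symbol
    return "X"


def to_type_string(opcodes):
    """Collapse an opcode fragment into a string of type symbols."""
    return "".join(opcode_type(op) for op in opcodes)


fragment = ["const-string", "invoke-virtual", "move-result-object", "iput-object", "return-void"]
print(to_type_string(fragment))
```

Collapsing the opcodes to eight symbols is what keeps the n-opcode feature space manageable in the next step.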

In this way, we can obtain multiple opcode fragments from an APK file. We traverse these fragments separately to obtain multiple n-opcode features. All distinct sequences form the feature space, which includes every kind of n-opcode that appears in the samples. We encode each app using all the features in the feature space. Existing n-opcode methods mostly count the occurrences of n-opcodes in the APK, which makes full use of all opcodes but is influenced by the sample size. In Android malware detection tasks, malicious samples are usually small, while benign samples are larger, so the number of n-opcodes in malicious samples is significantly lower than in benign samples. Counting n-opcodes may therefore mislead the classifier and cause misjudgment due to sample size. We encode each app as {b₁, b₂, b₃, . . . , b_n}, where b_i indicates whether a given n-opcode appears (1 for appearance, 0 for non-appearance) and n is the number of features in the feature space. Compared to counting n-opcodes, our method focuses on whether a certain behavior appears and is not affected by the sample size.
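The presence-based encoding can be sketched as below, assuming each app is given as a list of fragments that have already been mapped to type symbols. The function names and the toy apps are hypothetical.

```python
def ngrams(fragment, n):
    """Return the set of distinct n-opcodes in one fragment of type symbols."""
    return {"".join(fragment[i:i + n]) for i in range(len(fragment) - n + 1)}


def app_ngrams(fragments, n):
    """Union of n-opcodes over all sensitive fragments of one app."""
    found = set()
    for fragment in fragments:
        found |= ngrams(fragment, n)
    return found


def build_feature_space(apps, n):
    """All distinct n-opcodes seen in any app, in a fixed (sorted) order."""
    space = set()
    for fragments in apps:
        space |= app_ngrams(fragments, n)
    return sorted(space)


def encode(fragments, feature_space, n):
    """Binary vector: 1 if the n-opcode appears in the app, 0 otherwise."""
    present = app_ngrams(fragments, n)
    return [1 if feature in present else 0 for feature in feature_space]


# Toy apps: app_b repeats the same fragment, which changes counts but not presence.
app_a = [list("XVMXR")]
app_b = [list("XVMX"), list("XVMX")]
space = build_feature_space([app_a, app_b], n=2)
print(space)
print(encode(app_a, space, 2), encode(app_b, space, 2))
```

Because the encoding records presence only, repeating a fragment in `app_b` leaves its vector unchanged, which is the size-independence property the paragraph above argues for.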

3.2 Feature selection based on deep reinforcement learning

In order to select the optimal subset of features from the extracted n-opcode features for classification without disrupting the dependencies among features, we employ FSDRL. In FSDRL, we train an agent to continuously select feature subsets from the feature space and optimize them through interactions with the feature space until the best feature subset is determined. By receiving feedback from the classifier regarding the performance of features, we can effectively select features without exhaustively exploring the entire feature space.

3.2.1 Core components of reinforcement learning

We now introduce the core components of reinforcement learning and explain what they represent in our system.

Environment: In reinforcement learning tasks, the environment is the external context in which an agent operates. It serves as the space for interaction and learning for the agent. This dynamic system allows the agent to acquire observations, perform actions, and receive rewards through interaction with it.

In our approach, the environment contains the feature vectors of all the samples. The environment is responsible for providing the current state as input to the classifier, obtaining rewards, and returning them to the agent. The reward is defined as the accuracy obtained by using the features of a given state for detection. In each training iteration, we set a maximum number of selectable features. When the number of selected features reaches this limit, the agent terminates the feature selection for that iteration and returns the final reward.

Actions: Actions refer to the operations or decisions that an agent can undertake in a given state. Actions are the means through which the agent interacts with the environment.

In our framework, actions refer to selecting one feature from a feature set and adding it to the current state. The action space consists of all possible n-opcode sets, which represent the feature space.

Reward: In reinforcement learning tasks, reward is a signal that represents the goodness or badness of an agent’s actions. The reward function defines the immediate rewards obtained by the agent during the interaction with the environment, which guides the agent in optimizing its strategies and behaviors.

In our approach, reward refers to the accuracy obtained by the classifier after each action execution. After the agent selects an action, the environment enters a new state. Based on the updated state, the environment selects the corresponding features and inputs them into the classifier for classification. The accuracy obtained by the classifier is then returned by the environment to the agent as the reward.

Strategies for exploration and exploitation: In reinforcement learning tasks, a policy refers to the rules that the agent follows when interacting with the environment. In our approach, it specifically refers to how the agent selects features.

In our framework, we utilize the ε-greedy strategy to explore optimal features. The core idea of the ε-greedy strategy is to find a balance between exploration and exploitation. Exploration refers to the agent actively trying out new actions in unknown environments or unexplored states to discover more information and obtain more rewards. Exploitation refers to the agent selecting the action deemed optimal based on existing knowledge and experience to maximize cumulative rewards. We choose random features with a probability of ε and the current known optimal features with a probability of 1 - ε. Additionally, we decay the value of ε, prioritizing exploration in the early stages and exploitation in the later stages, in order to discover the optimal combination of features.
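The selection rule can be sketched as follows. Masking features that are already selected is an assumption of this sketch, as are the function names; the linear decay with a floor of 0.3 follows the update written in Algorithm 1.

```python
import random


def epsilon_greedy(q_values, selected, epsilon, rng=random):
    """Pick an unselected feature: random with probability epsilon, else best Q-value."""
    candidates = [i for i in range(len(q_values)) if i not in selected]
    if rng.random() < epsilon:
        return rng.choice(candidates)  # explore
    return max(candidates, key=lambda i: q_values[i])  # exploit


def decayed_epsilon(step, total_steps, floor=0.3):
    """Linear decay from 1 toward 0, clipped so epsilon never drops below the floor."""
    return max(floor, 1.0 - step / total_steps)


q = [0.1, 0.9, 0.5]
print(epsilon_greedy(q, selected={1}, epsilon=0.0))  # greedy choice among features 0 and 2
print([decayed_epsilon(t, 10) for t in (0, 5, 9)])
```

Keeping a floor on ε means some exploration continues even late in training, at the cost of occasionally picking a feature the agent already believes is suboptimal.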

3.2.2 Training phase

In our FSDRL process, we employ the Double Deep Q-Network (DDQN) algorithm for feature selection. DDQN is an improvement on the Deep Q-Network (DQN). In DQN, the same value estimates are used both to select the next action and to evaluate it, which can overestimate the value of certain actions and destabilize training. DDQN addresses this issue by decoupling the two roles with a target network, a delayed-updating network used to evaluate the action values of the next state. DDQN uses the current Q-network to select the optimal action in the subsequent state and the target network to evaluate the corresponding action value. This reduces overestimation and improves training stability and performance, which helps select a better subset of features. Additionally, we employ an ε-greedy strategy to balance exploration and exploitation. To break temporal correlation and improve training efficiency and stability, we use experience replay to store the agent’s experience samples from the environment.
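The difference between the two bootstrap targets can be shown with a small numeric sketch. The discount factor `gamma` is an assumption (the paper does not report it), and the Q-values are made-up numbers chosen so that the two targets differ.

```python
def dqn_target(reward, next_q_target, done, gamma=0.99):
    """DQN target: the target network both selects and evaluates the next action."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, next_q_online, next_q_target, done, gamma=0.99):
    """Double DQN target: online network selects the action, target network evaluates it."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


# Made-up Q-values for the next state, one entry per candidate feature.
online = [1.0, 2.0, 0.5]
target = [0.8, 1.5, 3.0]
print(dqn_target(0.9, target, done=False, gamma=0.5))
print(double_dqn_target(0.9, online, target, done=False, gamma=0.5))
```

Here the target network holds an inflated estimate for the third action. The DQN target bootstraps from that inflated value, whereas the Double DQN target uses the value of the action preferred by the online network, which is how the overestimation is dampened.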

Our reinforcement learning process is described in Algorithm 1. At the beginning of each training round, the feature list and ε value are reset. Then, either a random action is selected according to the policy or the optimal action is chosen based on known experience, and the relevant variables are updated. Experience data is selectively stored. The training round stops when the desired number of selected features is reached. To monitor feature selection during training, an evaluation is conducted every 10 episodes. During the evaluation, features are not selected randomly; only the optimal features based on existing experience are chosen, in order to assess the learning progress of the model. After multiple rounds of training, the final optimal subset of features is obtained.

Algorithm 1 Feature selection based on reinforcement learning

Input: FS: n-opcode set; N: max episode; MC: max feature count; I: interval for updating the two Q-networks
Output: FFS: final n-opcode subset

Initialize current Q-network, target Q-network, replay memory, ε ← 1
for episode from 1 to N do
    reset state
    for t from 1 to MC do
        choose a random action with probability ε, otherwise the action with the maximum Q-value
        State_t+1 = State_t + action_t
        calculate reward_t+1 with State_t+1
        selectively store {State_t, action_t, State_t+1, reward_t+1}
        every I steps, update the target Q-network with the current Q-network
        if number of selected features == MC then
            break
        end if
    end for
    update ε ← 1 - t/MC, ensure ε ≥ 0.3
end for
return final n-opcode subset FFS
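The environment side of Algorithm 1 can be sketched as follows. This is only an illustration of the state, action, and reward bookkeeping: the Q-networks and replay training are omitted, the class and function names are hypothetical, and `toy_accuracy` stands in for the classifier accuracy that the paper uses as the reward.

```python
class FeatureSelectionEnv:
    """State = features selected so far; action = index of the next feature to add."""

    def __init__(self, n_features, max_count, evaluate):
        self.n_features = n_features
        self.max_count = max_count
        self.evaluate = evaluate  # callable: frozenset of indices -> accuracy
        self.selected = []

    def reset(self):
        self.selected = []
        return tuple(self.selected)

    def step(self, action):
        if not 0 <= action < self.n_features or action in self.selected:
            raise ValueError("invalid or already selected feature")
        self.selected.append(action)
        reward = self.evaluate(frozenset(self.selected))
        done = len(self.selected) >= self.max_count
        return tuple(self.selected), reward, done


def run_episode(env, policy):
    """Roll out one episode and collect (state, action, next_state, reward) tuples."""
    state, done, transitions = env.reset(), False, []
    while not done:
        action = policy(state)
        next_state, reward, done = env.step(action)
        transitions.append((state, action, next_state, reward))
        state = next_state
    return transitions


def toy_accuracy(selected, informative=frozenset({2, 5})):
    """Stand-in reward: accuracy rises with each informative feature selected."""
    return 0.5 + 0.2 * len(selected & informative)


def first_unselected(state, n_features=8):
    """Deterministic placeholder policy: take the lowest index not yet selected."""
    return next(i for i in range(n_features) if i not in state)


env = FeatureSelectionEnv(n_features=8, max_count=3, evaluate=toy_accuracy)
for transition in run_episode(env, first_unselected):
    print(transition)
```

In a full implementation the placeholder policy would be replaced by the ε-greedy rule over Q-network outputs, and the collected transitions would be pushed to the replay memory.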

3.3 Building detection system

After the final optimal subset of features is obtained, we use these features to re-encode each app. Each app is represented as a new feature vector, denoted as P = {p₁, p₂, ⋯, p_n}, where n is the number of features finally preserved. For each app, if n-opcode feature i appears, the i-th value is set to 1; otherwise, it is set to 0. The encoded vectors are input into common classification algorithms for detection, including random forest (RF), decision tree (DT), k-nearest neighbors (KNN), support vector machines (SVM), and multilayer perceptron (MLP).

4 Experiment and evaluation

4.1 Dataset and experiment environment

The malicious samples in our dataset are sourced from two widely recognized and significantly different datasets, Drebin [24] and AndroZoo [25]. All benign samples are sourced exclusively from AndroZoo. Of the 16,000 samples in total, 10,000 are used to validate the effectiveness of sensitive opcodes and deep reinforcement learning. The remaining 6,000 samples, which are more recent, are used as unknown software to test the model’s generalization ability.

Our experiments were conducted on a server equipped with an Intel Xeon Silver 4210R (40) @ 3.200GHz CPU and an NVIDIA TITAN RTX GPU. We used the Androguard library for APK file decompilation and Python to develop the deep reinforcement learning part.

4.2 Results

4.2.1 Extracting n-opcode sequences

To examine the reliability and effectiveness of our method, we calculate the average number of original opcodes and sensitive opcodes in different samples, as shown in Fig. 3. The figure shows that the number of opcodes in the samples is drastically reduced after filtering with the constructed sensitive API list. In the Drebin samples, the number of opcodes is reduced by over 80%, while in the AndroZoo samples it decreases by over 90%. We can also see a significant difference in the number of opcodes between benign and malicious samples, as well as variations among samples from different sources. A frequency-based statistical method could therefore introduce classification bias due to sample size, which further supports the reliability of our encoding method.

Fig. 3

Average number of original opcodes and sensitive opcodes.

Any n-gram-based model faces exponential growth of the feature space as n increases, and our n-opcode model is no exception. Table 2 shows the number of n-opcodes and the detection performance for different values of n. Detection performance improves sharply up to n = 4. At n = 5, the number of n-opcodes grows more than sixfold (from 3882 to 25005), while accuracy rises only from 0.9609 to 0.9670. Consequently, we proceed with further experiments using n = 4.

Table 2

Results from different values of n

n	n-opcode count	Precision	Recall	F1-score	Accuracy
2	64	0.9387	0.6311	0.7548	0.7953
3	511	0.9538	0.9350	0.9443	0.9450
4	3882	0.9649	0.9565	0.9607	0.9609
5	25005	0.9680	0.9658	0.9669	0.9670
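The growth in Table 2 can be compared with the theoretical size of the feature space. The sketch below assumes that n-opcodes are built over the eight type symbols of Table 1, so that at most 8^n distinct n-opcodes exist; the observed counts are copied from Table 2.

```python
NUM_TYPES = 8  # categories M, R, G, I, T, P, V, X in Table 1
OBSERVED = {2: 64, 3: 511, 4: 3882, 5: 25005}  # n-opcode counts from Table 2


def coverage(n, observed):
    """Return the upper bound NUM_TYPES**n and the fraction of it that was observed."""
    upper_bound = NUM_TYPES ** n
    return upper_bound, observed / upper_bound


for n, count in OBSERVED.items():
    bound, ratio = coverage(n, count)
    print(f"n={n}: {count} of at most {bound} n-opcodes ({ratio:.1%})")
```

Under this assumption the observed counts stay close to the bound for small n and fall further below it as n grows, since many long sequences never occur in real code.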

4.2.2 Feature selection based on deep reinforcement learning

In order to explore the optimal number of features, we conducted multiple experiments using different numbers of features and different classifiers. Figure 4 shows the maximum reward obtained by reinforcement learning for each combination. The figure shows that performance largely plateaus once the number of features reaches 25; beyond that, adding features yields only a limited improvement in detection rate. Therefore, we select 25 features as the final feature set. Different machine learning algorithms yield different sets of features. We choose the feature set generated by the random forest algorithm for our subsequent experiments because the curve produced by this algorithm is relatively stable.

Discussion: As the number of features increases, the detection performance continuously improves. However, beyond 25 features, the improvement becomes increasingly marginal. Thus, we selected 25 features for the subsequent experiments.

Fig. 4

The number of features and accuracy using different machine learning algorithms.

4.2.3 Effectiveness and efficiency

Compared with the original features without feature selection, our method is far more efficient. Table 3 shows the runtime and accuracy of each algorithm before and after feature selection. The runtime of every algorithm drops substantially because the number of features is reduced from 3882 to 25. With 25 features, SVM is the fastest algorithm: its runtime is only 0.012 s, compared to 34.976 s with 3882 features.

Table 3

Runtime and accuracy before and after feature selection

Algorithm	Runtime (s), 3882 features	Accuracy, 3882 features	Runtime (s), 25 features	Accuracy, 25 features
RF	6.623	0.9915	0.441	0.9670
KNN	1.218	0.9935	0.435	0.9625
DT	9.370	0.9860	0.079	0.9585
SVM	34.976	0.9725	0.012	0.9590
MLP	38.468	0.9945	20.273	0.9635

Moreover, accuracy decreases only modestly across the different machine learning algorithms. Despite the substantial reduction in features, accuracy remains above 95% for every algorithm, which indicates that the selected subset retains most of the discriminative information.

We input the final 25 selected features into common machine learning classifiers and obtain the classification metrics shown in Fig. 5. The figure shows that all metrics exceed 95% for the common machine learning algorithms. The recall of the random forest algorithm exceeds 97%, which shows that the final 25 features can be used to distinguish malware from benign apps.

Fig. 5

Detection performance using different machine learning algorithms.

4.2.4 Detecting unknown samples

To assess the resilience of our method to unknown samples, which were not used in the preceding steps, we apply the extracted features to these samples and evaluate the detection performance, as shown in Table 4. The overall detection performance on unknown samples decreases slightly, but it still exceeds 90% for the common machine learning algorithms. Our method therefore also performs well in detecting unknown software. This is because we use sensitive opcodes as features, and opcodes are robust to variations in samples.

Table 4

Detection performance on unknown samples

Algorithm	Precision	Recall	F1-score	Accuracy
RF	0.9586	0.9267	0.9424	0.9433
KNN	0.9207	0.9100	0.9153	0.9158
DT	0.9172	0.9417	0.9293	0.9283
SVM	0.9428	0.9067	0.9244	0.9258
MLP	0.9454	0.9517	0.9485	0.9483
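As a reading aid for Table 4, the F1-score is the harmonic mean of precision and recall. The short check below recomputes it from the reported precision and recall values; the helper names and the rounding tolerance are assumptions of this sketch.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1-score) copied from Table 4.
TABLE_4 = {
    "RF": (0.9586, 0.9267, 0.9424),
    "KNN": (0.9207, 0.9100, 0.9153),
    "DT": (0.9172, 0.9417, 0.9293),
    "SVM": (0.9428, 0.9067, 0.9244),
    "MLP": (0.9454, 0.9517, 0.9485),
}


def consistent(precision, recall, reported_f1, tol=2e-4):
    """True if the reported F1 matches the recomputed value up to rounding."""
    return abs(f1_score(precision, recall) - reported_f1) < tol


for name, (p, r, f1) in TABLE_4.items():
    print(name, round(f1_score(p, r), 4), consistent(p, r, f1))
```

The recomputed values agree with the reported F1-scores to four decimal places, so the columns of the table are internally consistent.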

4.2.5 Comparison with related work

To demonstrate the effectiveness of our approach, we compared our method with existing related works. The well-known method Drebin [24] uses multiple features to detect malware. It provides an open dataset, which we use in our experiments. DroidRL [26] utilizes permissions and opcodes as features and also employs deep reinforcement learning for feature selection. MGOPDroid [27] utilizes multi-granular opcode features to detect obfuscated samples and employs TF-IDF for feature selection.

Table 5 shows the dataset and performance comparison between our method and these methods. From the table, we can see that our method achieves better detection performance.

Table 5

The performance comparison with other proposed systems

Method	Malware	Benign	Performance
Ours	8000	8000	Accuracy: 0.9670, Recall: 0.9723
Drebin [24]	5560	123453	Recall: 0.9390
DroidRL [26]	5560	5000	Accuracy: 0.956
MGOPDroid [27]	5560	4631	Accuracy: 0.9635

In addition, compared to related methods, our method has the following advantages.

Compared to multi-feature approaches, the single feature we extract is more efficient. Furthermore, opcode has good general applicability and stability on the Android platform, as it does not vary significantly across version iterations, which leads to a lower risk of model aging.

Compared with other opcode-based detection methods, our method extracts only sensitive opcode fragments. Relative to the original opcode sequence, this reduces the feature dimension, eliminates redundant opcodes, and lowers computational cost, while retaining the semantic information needed for effective detection.

Furthermore, we use deep reinforcement learning for feature selection, which does not disrupt the interdependencies among features. By utilizing the reward from the classifier, we optimize the feature selection process and improve its efficiency.
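
The idea of using the classifier's result as a reward for feature selection can be illustrated with a small sketch. It makes several simplifying assumptions that are not taken from our implementation: tabular Q-learning stands in for the deep Q-network, a nearest-centroid classifier stands in for the detection model, the reward is the change in held-out accuracy after adding a feature, and the data are synthetic. All function names and hyperparameters are hypothetical.

```python
import random
from collections import defaultdict

import numpy as np


def make_toy_data(n=400, n_features=8, informative=(0, 3), seed=0):
    """Synthetic stand-in for n-gram opcode features (not the paper's data)."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, n_features))
    for j in informative:
        X[:, j] += 2.0 * y  # only these columns carry class signal
    return X, y


def centroid_accuracy(X_tr, y_tr, X_te, y_te, cols):
    """Reward source: held-out accuracy of a nearest-centroid classifier."""
    cols = sorted(cols)
    c0 = X_tr[y_tr == 0][:, cols].mean(axis=0)
    c1 = X_tr[y_tr == 1][:, cols].mean(axis=0)
    d0 = np.linalg.norm(X_te[:, cols] - c0, axis=1)
    d1 = np.linalg.norm(X_te[:, cols] - c1, axis=1)
    return float(((d1 < d0).astype(int) == y_te).mean())


def select_features(X, y, k=2, episodes=300, eps=0.2, alpha=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning over feature subsets (assumed stand-in for a DQN)."""
    rnd = random.Random(seed)
    half = len(y) // 2
    split = (X[:half], y[:half], X[half:], y[half:])
    n_features = X.shape[1]
    q = defaultdict(float)  # (state, action) -> estimated return

    def candidates(state):
        return [a for a in range(n_features) if a not in state]

    for _ in range(episodes):
        state, prev_acc = frozenset(), 0.5
        while len(state) < k:
            if rnd.random() < eps:  # explore
                action = rnd.choice(candidates(state))
            else:  # exploit the current estimates
                action = max(candidates(state), key=lambda a: q[(state, a)])
            nxt = state | {action}
            acc = centroid_accuracy(*split, nxt)
            reward = acc - prev_acc  # classifier feedback as reward
            future = 0.0
            if len(nxt) < k:
                future = max(q[(nxt, a)] for a in candidates(nxt))
            q[(state, action)] += alpha * (reward + gamma * future - q[(state, action)])
            state, prev_acc = nxt, acc

    # Greedy rollout with the learned Q-values.
    state = frozenset()
    while len(state) < k:
        state = state | {max(candidates(state), key=lambda a: q[(state, a)])}
    return sorted(state), centroid_accuracy(*split, state)


X, y = make_toy_data()
selected, accuracy = select_features(X, y)
print(selected, round(accuracy, 3))
```

Because the state is the whole subset selected so far, the value of adding a feature depends on the features already chosen, which is the sense in which this kind of selection keeps feature interdependencies in view.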

5 Conclusion

In this paper, we propose an Android malware detection approach based on sensitive opcodes and deep reinforcement learning. We use a constructed list of sensitive APIs to extract sensitive opcode fragments, which greatly reduces the dimensionality of the opcodes while preserving their semantic information. Next, we encode the opcodes with n-grams and employ deep reinforcement learning for feature selection, which is efficient and preserves the correlation among opcodes. Finally, we obtain 25 n-opcode features, and our experimental results show that these features achieve an accuracy of 0.9670 in Android malware detection.
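
For readers unfamiliar with n-gram encoding, the sketch below shows how contiguous n-grams can be counted over an opcode sequence. The opcode fragment is a hypothetical example built from Dalvik mnemonics, and the function name is illustrative; it is not the extraction code used in our experiments.

```python
from collections import Counter


def opcode_ngrams(opcodes, n):
    """Return the contiguous n-grams of an opcode sequence as tuples."""
    return [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]


# Hypothetical opcode fragment around a sensitive API call (illustrative only).
fragment = [
    "const-string",
    "invoke-virtual",
    "move-result-object",
    "invoke-virtual",
    "move-result-object",
    "if-eqz",
]

# Each distinct n-gram becomes one candidate feature; its count is the value.
bigram_counts = Counter(opcode_ngrams(fragment, 2))
for gram, count in bigram_counts.most_common():
    print(" ".join(gram), count)
```

Keeping adjacent opcodes together in one feature is what lets the encoding retain local ordering information that single-opcode frequencies would lose.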

Acknowledgment

The work was funded by the Chongqing Technology Innovation and Application Project (Grant No. CSTB2022TIAD-KPX0054).

References

1. Jung, Kim, Shin, Lee, Cho S.-J., Suh, Android malware detection based on useful API calls and machine learning, in 2018 IEEE First International Conference on Artificial Intelligence and Knowledge Engineering (AIKE), pp. 175–178, IEEE, 2018.

2. Zarni Aung W.Z., Permission-based android malware detection, International Journal of Scientific & Technology Research 2(3) (2013), 228–234.

3. Verma and Muttoo, An android malware detection framework-based on permissions and intents, Defence Science Journal 66(6) (2016).

4. Enck, Gilbert, Han, Tendulkar, Chun B.-G., Cox L.P., Jung, McDaniel and Sheth A.N., Taintdroid: an information-flow tracking system for realtime privacy monitoring on smartphones, ACM Transactions on Computer Systems (TOCS) 32(2) (2014), 1–29.

5. Wang, Yan, Chen, Yang, Zhao and Conti, Detecting android malware leveraging text semantics of network flows, IEEE Transactions on Information Forensics and Security 13(5) (2017), 1096–1109.

6. Bhatia, Kaushal, Malware detection in android based on dynamic analysis, in 2017 International Conference on Cyber Security And Protection of Digital Services (Cyber Security), pp. 1–6, IEEE, 2017.

7. Poudyal, Dasgupta, Akhtar, Gupta, A multilevel ransomware detection framework using natural language processing and machine learning, in 14th International Conference on Malicious and Unwanted Software, MALCON, no. October 2015, 2019.

8. Sihag, Mitharwal, Vardhan, Singh, Opcode n-gram based malware classification in android, in 2020 Fourth World Conference on Smart Trends in Systems, Security and Sustainability (WorldS4), pp. 645–650, IEEE, 2020.

9. Shu, Dong, Su, Huang, Android malware detection methods based on convolutional neural network: A survey, IEEE Transactions on Emerging Topics in Computational Intelligence, 2023.

10. Naeem, Cheng, Ullah, Jabbar and Dong, A deep convolutional neural network stacked ensemble for malware threat classification in internet of things, Journal of Circuits, Systems and Computers 31(17) (2022), 2250302.

11. Amer and El-Sappagh, Robust deep learning early alarm prediction model based on the behavioural smell for android malware, Computers & Security 116 (2022), 102670.

12. , Sun, Yan, Li, Srisa-An and Ye, Significant permission identification for machine-learning-based android malware detection, IEEE Transactions on Industrial Informatics 14(7) (2018), 3216–3225.

13. Sahin D.Ö., Kural O.E., Akleylek, Kílíç, A novel android malware detection system: adaption of filter-based feature selection methods, Journal of Ambient Intelligence and Humanized Computing, pp. 1–15, 2021.

14. Kang, Yerima S.Y., McLaughlin, Sezer, N-opcode analysis for android malware classification and categorization, in 2016 International conference on cyber security and protection of digital services (cyber security), pp. 1–7, IEEE, 2016.

15. Niu, Cao, Zhang, Ding, Zhang and Li, Opcode-level function call graph based android malware classification using deep learning, Sensors 20(13) (2020), 3645.

16. Zhang, Qin, Zhang, Yin and Zou, Dalvik opcode graph based android malware variants detection using global topology features, IEEE Access 6 (2018), 51964–51974.

17. Pektas and Acarman, Learning to detect android malware via opcode sequences, Neurocomputing 396 (2020), 599–608.

18. , Zhao, Cheng, Lu and Shi, Opcode sequence analysis of android malware by a convolutional neural network, Concurrency and Computation: Practice and Experience 32(18) (2020), e5308.

19. Babaagba K.O., Adesanya S.O., A study on the effect of feature selection on malware analysis using machine learning, in Proceedings of the 2019 8th international conference on educational and information technology, 2019, pp. 51–55.

20. Visalakshi, et al., Detecting android malware using an improved filter based technique in embedded software, Microprocessors and Microsystems 76 (2020), 103115.

21. Fatima, Maurya, Dutta M.K., Burget, Masek, Android malware detection using genetic algorithm based optimized feature selection and machine learning, in 2019 42nd International conference on telecommunications and signal processing (TSP), pp. 220–223, IEEE, 2019.

22. Yang, Zhang and Fan, Android malware detection method based on highly distinguishable static features and densenet, Plos One 17(11) (2022), e0276332.

23. Bai, Chen, Xing and Li, Argusdroid: detecting android malware variants by mining permission-api knowledge graph, Science China Information Sciences 66(9) (2023), 1–19.

24. Arp, Spreitzenbarth, Hubner, Gascon and Rieck, Drebin: Effective and explainable detection of android malware in your pocket, in NDSS 14 (2014), 23–26.

25. Allix, Bissyandé T.F., Klein, Le Traon, Androzoo: Collecting millions of android apps for the research community, in 2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR), pp. 468–471, IEEE, 2016.

26. , Li, Zeng, Yang, Wang, Fang and Cheng, Droidrl: Feature selection for android malware detection with reinforcement learning, Computers & Security 128 (2023), 103126.

27. Tang, Li, Jiang, Gu and Li, Android malware obfuscation variants detection method based on multi-granularity opcode features, Future Generation Computer Systems 129 (2022), 141–151.

Number of features	3882		25
Algorithm	Runtime (s)	Accuracy	Runtime (s)	Accuracy
RF	6.623	0.9915	0.441	0.9670
KNN	1.218	0.9935	0.435	0.9625
DT	9.370	0.9860	0.079	0.9585
SVM	34.976	0.9725	0.012	0.9590
MLP	38.468	0.9945	20.273	0.9635
