Abstract
Malware attack is a growing problem on the Android mobile platform due to its popularity and openness. Although numerous malware detection approaches have been proposed, it still remains challenging for malware detection due to a large amount of constantly mutating apps. The opcode, as the most fundamental part of Android app, possesses good resistance against obfuscation and Android version updates. Due to the large number of opcodes, most opcode-based methods employ statistical-based feature selection, which disrupts the correlation and semantic information among opcodes. In this paper, we propose an Android malware detection framework based on sensitive opcodes and deep reinforcement learning. Firstly, we extract sensitive opcode fragments based on sensitive elements and then encode the features using n-gram. Next, we use deep reinforcement learning to select the optimal subset of features. During the process of handling opcodes, we focus on preserving semantic information and the correlation among opcodes. Finally, our experimental results show an accuracy of 0.9670 by using the 25 opcode features we obtained.
Introduction
As smartphones have become ubiquitous, the severity of Android malware has increased significantly, which poses risks to both user privacy and data security. In addition, the rapid growth and continuous variation of Android malware present serious challenges to traditional malware detection methods. Therefore, it is essential to explore and develop efficient and accurate Android malware detection methods.
In order to combat the threat of malware, researchers have proposed various methods in recent years. Extracting good features is of significant importance for detecting Android malware. Currently, feature extraction can be classified into two main categories: static analysis and dynamic analysis. Static analysis is completed by analyzing the app installation package, without running the app. For example, some studies construct detection frameworks by extracting static features such as APIs [1], permissions [2], and intents [3] from Android Package (APK) files. In contrast, dynamic analysis extracts features by running the app on real devices or emulators, including taint analysis [4], network flow extraction [5], and system call analysis [6], etc.
However, many existing dynamic analysis methods are inefficient and require a substantial amount of computational resources. While static analysis is more efficient, many static analysis techniques have poor resistance to obfuscation techniques. Additionally, with the evolution of Android system versions, many static analysis models are plagued by serious obsolescence issues. To address these problems, researchers have proposed using opcodes to detect malware. Opcode is a segment of the Android virtual machine language that represents instructions within the Android system. They are used to perform various operations such as variable assignment, method invocation, and control flow transitions. Analyzing an app’s opcodes provides insights into its behaviors and execution paths, valuable for the analysis and detection of malware. Compared to other features such as permission requests and API calls, opcode provides a lower-level and finer-grained representation. Opcode is based on instruction-level information and exhibits minimal changes across different versions of the Android system, thus offering better generality and adaptability.
However, due to the vast quantity of opcodes in each application, it is essential to efficiently decrease their dimensionality. Some researchers utilize filter-based approaches to perform feature selection on opcodes. Poudyal et al. [7] used term frequency–inverse document frequency (TF-IDF) for opcode dimensionality reduction. Sihag et al. [8] employed frequency-based feature selection methods to filter features. These methods ignore the correlations between features and fail to capture the semantic information among features. Moreover, using a wrapper-based feature selection approach is also impractical for high-dimensional features like opcodes, as it is difficult to exhaustively enumerate all possible feature subsets.
In this paper, we utilize a constructed sensitive API set to identify sensitive opcode sequences, which effectively reduces the dimensionality of features while preserving semantic information. Subsequently, we encode opcode features based on n-gram and employ deep reinforcement learning to select the optimal n-opcode features for classification. Deep reinforcement learning automatically selects the optimal feature subsets to identify malware by training an agent to learn and make decisions. In summary, the main contributions of this paper are as follows: We propose a method for to extract sensitive opcode based on a constructed sensitive API set, which greatly reduces the quantity of opcodes while preserving the semantic information, and extracts more representative sensitive opcode features. Based on n-gram technology, we construct opcode features and utilize deep reinforcement learning to automatically select features, which efficiently selects a subset of features with good classification performance. Our experiment results show that the final opcode features have good effectiveness and efficiency on detection. Additionally, it still has good detection effectiveness on unknown samples.
The rest of this paper is organized as follows. Section 2 introduces the related work of the paper, Section 3 provides a detailed explanation of the proposed method of this paper, Section 4 presents the experiments including design, results, and evaluation, and Section 5 concludes our entire article.
Related work
Android malware detection based on machine and deep learning
In recent years, deep learning and machine learning techniques have been widely applied in Android malware detection. Using methods of deep learning, such as convolutional neural networks, can avoid complex feature engineering and effectively cope with rapid updates of malicious software [9]. Naeem et al. [10] proposed a deep convolutional neural network stacked ensemble for malware threat classification, which detected potential operations within a diverse range of discrete-sized image features, enabling the identification of malware families. Amer et al. [11] relied on various static and dynamic features. They grouped correlated features into several cluster classes and trained the Long Short-Term Memory (LSTM) model using random snapshots of the newly constructed sequences of API and system call clusters. Emphasizing feature construction through machine learning methods, subsequent to intricate feature engineering, and inputting these features into machine learning classifiers, results in more precise and efficient detection. SigPID [12] found that only 22 permissions are significant and detected malware with 67 machine learning algorithms using these permissions. Şahin et al. [13] combined 27 permission features with commonly used machine learning classifiers to detect malware.
In our paper, we employ deep reinforcement learning to automatically select features and input these obtained features into machine learning classifiers to achieve rapid and accurate detection.
Android malware detection based on opcodes
As a low-level feature of the Android virtual machine, opcodes contain rich semantic information and can resist obfuscation techniques. Consequently, researchers often employ them in the detection of malware. Kang et al. [14] proposed n-gram encoding of opcodes and performed feature selection using information gain on the encoded features. Niu et al. [15] constructed function call graphs based on opcodes and used LSTM for classification. Zhang et al. [16] proposed a lightweight framework based on graph theory and information theory for detecting Android malware variants. They constructed a weighted probability graph of opcodes and used information entropy to trim the graph while preserving as much information as possible. Finally, they extracted global topological features to represent malware. Pektas et al. [17] used features extracted from instruction call graphs combined with deep learning for malware detection. They applied grid search method to find the optimal parameters of the network and discovered combinations of hyperparameters that maximize the statistical metrics. Li et al. [18] detected malware by training an optimized convolutional neural network several times using raw opcode sequences extracted from decompiled Android files.
However, the aforementioned methods lose much semantic information while reducing the dimensionality of opcodes. Whereas, our method extracts sensitive opcodes with a constructed sensitive API set, which effectively removes a large number of redundant opcodes while retaining the semantic information.
Feature selection in Android malware detection
Feature selection is an important step in Android malware detection tasks. Redundant features not only affect the effectiveness of classifiers but also consume excessive computational resources. Babaagba et al. [19] have experimentally demonstrated the significance of feature selection in malware analysis using machine learning. Currently, there are two main kinds of feature selection methods: filter-based and wrapper-based.
Filter-based feature selection methods are independent of any specific machine learning algorithm. These methods evaluate and rank features solely based on their correlation with the target variable. Visalakshi et al. [20] used an improved filter-based feature selection technique based on the K-nearest neighbors relief algorithm to detect malware. Şahin et al. [13] used eight filter-based feature selection methods to enhance machine learning algorithms and improved detection efficiency. Among them, information gain, odds ratio, Chi-square, and inverse document frequency have been used in various Android malware research. The remaining methods, namely document frequency thresholding, Acc and Acc2, M2 Method, and relevance frequency feature selection, are adopted from the field of text classification.
Wrapper-based feature selection methods embed the feature selection process into specific machine learning model training. These methods utilize an evaluation function to measure the performance of feature subsets and select the optimal feature subset through a search algorithm. Fatima et al. [21] employed a genetic algorithm for discriminative feature selection and compared the capabilities on identifying malware before and after feature selection. They reduced the feature dimensionality to less than half of the original features. Yang et al. [22] utilized recursive feature elimination with cross-validation for feature selection and employed DenseNet for classification.
However, filter-based feature selection methods only consider the correlation between features and the target variable, without consideration of the inter-feature associations. On the other hand, wrapper-based feature selection is hindered by an extensive search space, often rendering the search infeasible. Therefore, we propose a feature selection method based on deep reinforcement learning. The intelligent agent continuously adjusts its classification strategy based on feedback from the classifier, efficiently selecting the optimal feature subset while preserving the inter-feature associations.
Methodology
As shown in Fig. 1, our model consists of three parts: sensitive opcode extraction, feature selection based on deep reinforcement learning(FSDRL), and building detection system. Firstly, we obtain all opcodes of the APK through reverse engineering. Then, we use a constructed sensitive API list to filter the opcode sequence, where only opcode fragments that invoke APIs from the sensitive API list are considered sensitive opcodes. Next, we process the extracted opcodes using n-gram. In contrast to existing methods that rely on statistical opcode frequency, we encode whether a certain n-gram opcode appears, ensuring that the encoded features are not affected by differences in the sizes of malicious and benign samples. Then we utilize deep reinforcement learning to select opcode features, thereby better preserving the relationships and semantics among opcodes. Finally, we re-encode the app using the selected features and combine it with commonly used machine learning classifiers to implement malware detection.

System structure.
The installation package of an Android app (. apk) is a compressed file that contains a manifest file, resource files, and Dalvik executable files (. dex). We use the Androguard tool to decompile the dex file into multiple smali files, where each smali file represents a class and includes the smali source code of all methods within that class. In the smali file, each method’s smali code starts with a .method tag and ends with an .end method tag. The segment between the start and end tags comprises the brief instructions of each method, with each short instruction consisting of an opcode and multiple operands.
After obtaining the smali instructions of all methods in an APK file, we need to select those that are more likely to carry malicious behavior. For an APK file, the corresponding smali instructions can reach hundreds of thousands. To train a classifier, it requires not only more computational resources but also the presence of redundant information that can interfere with classification results. In our approach, we use constructed sensitive APIs to identify smali code fragments that may have malicious behavior or suspicious characteristics.
The constructed sensitive API list is derived from argusDroid [23]. For methods within the analyzed APK, if a method invokes an API from the sensitive API list, we consider the corresponding smali code fragment of that method to be sensitive. After extracting sensitive smali code fragments from the APK, we discard the operands of the smali code and only retain the opcodes. As shown in the Fig. 2, the function calls the sensitive API “getLastKnownLocation” to obtain the location, making this segment of opcodes considered sensitive. The corresponding opcodes for this segment are {..., invoke-virtual, const-string, invoke-virtual, move-result-object, input-object,...}.

A sample of a sensitive opcode fragment.
If we directly extract n-opcode features, the feature space will become very large due to the numerous kinds of opcodes. Therefore, before processing the opcodes, we classify them based on their functionalities. The classified result is shown in Table 1. The first seven categories in the table are opcodes classified according to their core functionalities, while the last category consists of opcodes that serve purposes other than the core functionalities, such as linking and initialization in programs.
Opcode type mapping table
In this way, we can obtain multiple opcode fragments from an APK file. We traverse these fragments separately to obtain multiple n-opcode features. All different sequences form the feature space, which includes all kinds of n-opcodes that appear in the samples. We encode each app using all the features in the feature space. Existing n-opcode methods mostly count the occurrences of n-opcodes in the APK, which fully utilize all opcodes but is influenced by the sample size. For Android malware detection tasks, malicious samples are usually small in size, while benign samples are more larger. The number of n-opcodes in malicious samples is significantly less than that in benign samples. Therefore, counting the number of n-opcodes may mislead the classifier and cause misjudgment due to sample size. For each app, we encode it as {b1, b2, b3, . . . , b n }, where the value of b i represents whether a certain n-opcode appears (1 for appearance, 0 for non-appearance), and n is the number of features in the feature space. Compared to the existing method of counting the number of n-opcodes, our method focuses on whether a certain behavior appears and is not affected by the sample size.
In order to select the optimal subset of features from the extracted n-opcode features for classification without disrupting the dependencies among features, we employ FSDRL. In FSDRL, we train an agent to continuously select feature subsets from the feature space and optimize them through interactions with the feature space until the best feature subset is determined. By receiving feedback from the classifier regarding the performance of features, we can effectively select features without exhaustively exploring the entire feature space.
Core components of reinforcement learning
Then we will introduce the core components of reinforcement learning and explain what they represent in our system.
Environment: In reinforcement learning tasks, the environment is the external context in which an agent operates. It serves as the space for interaction and learning for the agent. This dynamic system allows the agent to acquire observations, perform actions, and receive rewards through interaction with it.
In our approach, the environment contains feature vectors of all the samples. The environment is responsible for providing the current state as input to the classifier, obtaining rewards, and returning them to the agent. The reward is defined as the accuracy obtained by using the features of a certain state for detection. In each training iteration, we set a number of selectable features. When the maximum number of features reaches this limit, the agent terminates the feature selection for that iteration and returns the final reward.
Actions: Actions refer to the operations or decisions that an agent can undertake in a given state. Actions are the means through which the agent interacts with the environment.
In our framework, actions refer to selecting one feature from a feature set and adding it to the current state. The action space consists of all possible n-opcode sets, which represent the feature space.
Reward: In reinforcement learning tasks, reward is a signal that represents the goodness or badness of an agent’s actions. The reward function defines the immediate rewards obtained by the agent during the interaction with the environment, which guides the agent in optimizing its strategies and behaviors.
In our approach, reward refers to the accuracy obtained by the classifier after each action execution. After the agent selects an action, the environment enters a new state. Based on the updated state, the environment selects corresponding features and inputs them into the classifier for classification. The accuracy obtained by the classifier is then returned as a reward to the environment.
Strategies for exploration and exploitation: In reinforcement learning tasks, a policy refers to the rules that the agent follows when interacting with the environment. In our approach, it specifically refers to how the agent selects features.
In our framework, we utilize the ∈ - Greedy strategy to explore optimal features. The core idea of the ∈ - Greedy strategy is to find a balance between exploration and exploitation. Exploration refers to the agent actively trying out new actions in unknown environments or unexplored states to discover more information and obtain more rewards. Exploitation refers to the agent selecting the action deemed optimal based on existing knowledge and experience to maximize cumulative rewards. We choose random features with a probability of ∈ and the current known optimal features with a probability of 1 - ∈. Additionally, we decay the value of ∈, prioritizing exploration in the early stages and exploitation in the later stages, in order to discover the optimal combination of features.
Training phase
In our FSDRL process, we employ the Double Deep Q-Network (DDQN) algorithm for feature selection. DDQN is an improved algorithm proposed based on the Deep Q-Network (DQN). In DQN, the current Q-network is used to evaluate the value of actions when selecting the next action. However, it may lead to overestimation of the value of certain actions, further resulting in training instability and overestimation. DDQN addresses this issue by introducing an additional target network. The target network is a delayed-updating network used to evaluate the action values of the next state. DDQN leverages the current Q-network for selecting the optimal action in the subsequent state, and the target network for evaluating the corresponding action value. This approach reduces overestimation, improves training stability and performance, thereby selecting a more optimal subset of features. Additionally, we employ an ∈ - Greedy strategy to balance the exploration and exploitation process. To break the temporal correlation, improve training efficiency and stability, we utilize experience replay to store the agent’s experience samples in the environment.
Our reinforcement learning process is described in Algorithm 1. At the beginning of each training round, the feature list and ∈ value are reset. Then, actions are randomly selected based on the policy or the optimal action is chosen based on known experience, and relevant variables are updated. Experience data is selectively stored. The training round stops when the desired number of selected features is reached. To monitor the feature selection during the training process, an evaluation is conducted every 10 episodes. During the evaluation, features are not randomly selected, only the optimal features based on existing experience are chosen to assess the learning progress of the model. After multiple rounds of training, the final optimal subset of features will be discovered.
1: Initialize current Q-network, target Q-network, replay memory, ∈ ← 1
2:
3: reset state
4:
5: according ∈ value to choose random action or choose max Q-value action
6: Statet+1 = State t + action t
7: calculate rewardt+1 with Statet+1
8: selectively store {State t , action t , Statet+1, rewardt+1}
9: every I steps update target Q-network with current Q-network
10: if number of selected features == MC
11: break
12:
13:
14: update ∈ ← 1 - t/MC, ensure ∈ ≥ 0.3
15:
16:
Building detection system
After the final optimal subset of features is obtained, we use the features to re-encode app. Each app is represented as a new feature vector, denoted as P = {p1, p2, ⋯ p n }, where n represents the number of features preserved in the end. For each app, if n-opcode feature i appears, the corresponding i-th value is set to 1, if not, the value is set to 0. The encoded vectors are input into common classification algorithms for detection, including random forest (RF), decision tree (DT), k-nearest neighbors (KNN), support vector machines (SVM), multilayer perceptron (MLP), etc.
Experiment and evaluation
Dataset and experiment environment
The malicious samples in our dataset are sourced from two widely recognized and significantly different datasets, Drebin [24] and AndroZoo [25]. All benign samples are sourced exclusively from AndroZoo. Among a total of 16,000 samples, 10,000 samples are used to validate the effectiveness of sensitive opcode and deep reinforcement learning. The rest 6,000 samples, which are more recent, are used as unknown software to test the model’s generalization ability.
Our experiments were conducted on a server equipped with an Intel Xeon Silver 4210R (40) @ 3.200GHz CPU and NVIDIA TITAN RTX GPU. We used the Androguard library for APK file decompilation and the python for the development of the deep reinforcement learning part.
Results
Extracting n-opcode sequences
To examine the reliability and effectiveness of our method, we calculate the average number of original opcodes and sensitive opcodes in different samples shown in Fig. 3. From the figure, it can be found that the number of opcodes in the samples is drastically reduced after we filter them using the constructed sensitive API list. In the Drebin samples, there is a reduction of over 80% in the number of opcodes, while in the AndroZoo samples, the number of opcodes decreases by over 90%. Additionally, we can also find that there is a significant difference in the number of benign and malicious samples, as well as variations among samples from different sources. If we use a frequency-based statistical method, it could result in classification bias due to the influence of sample size. This further validates the reliability of our encoding method.

Average number of original opcodes and sensitive opcodes.
Any n-gram-based model has to face the issue of exponential growth of the feature space as n increases, and our n-opcode model is no exception. Table 2 shows the number of n-opcodes and the detection performance for different values of n. The table demonstrates that the detection performance peaks at n = 4. Beyond this point, while the number of n-opcodes significantly increases, the performance improvement is not substantial. Consequently, we proceed with further experiments using the value of 4 for n.
Results from different values of n
In order to explore the optimal number of features, we conducted multiple experiments using different features and classifiers. Figure 4 shows the maximum reward obtained by reinforcement learning using different classifiers and numbers of features. From the figure, we find that the performance reaches its peak when the number of features reaches 25. After that, increasing the number of features only leads to a limited improvement in detection rate. Therefore, we ultimately select 25 features as the final feature set. Different machine learning algorithms yield different sets of features. We choose the feature set generated by the random forest algorithm for our subsequent experiments because the curve produced by this algorithm is relatively stable.
Discussion: As the number of features increases, the detection performance continuously improves. However, after 25 features, the improvement in the detection performance is more and more weak. Thus, we ultimately selected 25 features for the subsequent experiments.

The number of features and accuracy using different machine learning algorithms.
Compared with the original features without feature selection, our method has greatly improved in efficiency. Table 3 shows the runtime overhead and accuracy of the algorithm before and after feature selection. From the table, it can be observed that the algorithms’ runtime have been significantly reduced, greatly improving the computational efficiency. This is because the number of features has been reduced from the original 3382 to the current 25. As shown in the table, SVM is the most efficient machine learning algorithm. When SVM is used with 25 features, the runtime is only 0.012 s, compared to 34.976 s using 3882 features.
Runtime and accuracy before and after feature selection
Runtime and accuracy before and after feature selection
Moreover, among different machine learning algorithms, our accuracy has not shown a significant decrease. Despite the substantial reduction in features, the accuracy remains above 95% for various machine learning algorithms. This also demonstrates that we have selected an excellent subset of features to the greatest extent.
We input the final selected 25 features into common machine learning classifiers and obtain various performance metrics for classification shown in Fig. 5. From the figure, we can see that our metrics for common machine learning algorithms have all achieved over 95%. The recall rate on the random forest algorithm exceeded 97%, thereby proving the final selected 25 features can be used to classify malware and benign apps.

Detection performance using different machine learning algorithms.
In order to assess the resilience of our method to unknown samples, which are not used in previous process. We apply the extracted features to unknown samples and evaluate the detection performance as shown in Table 4. From the table, we can see that the overall detection performance for unknown malicious samples slightly decreased, but it still exceeds 90% for common machine learning algorithms. Our method can also achieve good performance in detecting unknown software. This is because we used sensitive opcodes as features, and opcodes have good resistance to variations in samples.
Detection performance on unknown samples
Detection performance on unknown samples
To demonstrate the effectiveness and superiority of our approach, we conducted a qualitative comparison of our method with existing relevant works. The well-known method Drebin [24] uses multiple features to detect malware. It provides open dataset which is used in our method. DroidRL [26] utilizes permissions and opcodes as features and also employs deep reinforcement learning for feature selection. MGOPDroid [27] utilizes multi-granular opcode features to detect obfuscated samples and employs TF-IDF for feature selection.
Table 5 shows the dataset and performance comparison between our method and these methods. From the table, we can see that our method achieves better detection performance.
The performance comparison with other proposed systems
The performance comparison with other proposed systems
In addition, compared to related methods, our method has the following advantages.
Compared to multi-feature approaches, the single feature we extract is more efficient. Furthermore, opcode has good general applicability and stability on the Android platform, as it does not vary significantly across version iterations, which leads to a lower risk of model aging.
Compared to other detection methods using opcode, our method extracts sensitive opcode and can keep more semantic information. In contrast to the original opcode, our approach reduces the feature dimension, eliminates redundant opcodes, and reduces computational costs. While reducing dimensionality, our method retains semantic information, resulting in both highly efficient and effective detection.
Furthermore, we use deep reinforcement learning for feature selection, which does not disrupt the interdependencies among features. By utilizing the reward from the classifier, we optimize the feature selection process and improve its efficiency.
In this paper, we propose an Android malware detection approach based on sensitive opcodes and deep reinforcement learning. We use constructed sensitive API list to obtain sensitive opcode fragments, greatly reducing the dimensionality of opcodes while preserving the semantic information of opcodes. Next, we utilize n-gram encoding for opcodes and employ deep reinforcement learning for feature selection, which is efficient and preserves the correlation among opcodes. In the end, we obtain 25 n-opcode features, and our experimental results show that the features we extracted have a good performance in Android malware detection.
Footnotes
Acknowledgment
The work was funded by the Chongqing Technology Innovation and Application Project (Grant No. CSTB2022TIAD-KPX0054).
