Abstract
Much current research on malware detection concentrates on proposing and improving neural network structures, but because Android is updated constantly, these detection methods are in a race against time. Analyzing these methods, we found that their basic processes are roughly the same and that they rely on professional reverse engineering tools for malware analysis and feature extraction. These tools generally suffer from high time and space costs, make concurrent analysis of a large number of APKs difficult, and produce output that is inconvenient for feature extraction.
Is it possible to propose a general malware detection process implementation platform that optimizes each process of existing malware detection methods while being able to efficiently extract various features on malware datasets with a large number of APK? To solve this problem, we propose an automated platform, AmandaSystem, that highly integrates the various processes of deep learning-based malware detection methods.
At the same time, the openness of the Android system leads to the problem of excessive privileges. Addressing it has always required accurately constructing the mapping relationships between permissions and API calls, while the current methods based on function call graphs suffer from inefficiency and low accuracy.
To solve this problem, we propose a new bottom-up static analysis method based on AmandaSystem and use it to build PerApTool, an efficient and complete tool for mapping Android permissions to API calls.
Finally, we conducted tests on three publicly available malware datasets, CICMalAnal2017, CIC-AAGM2017, and CIC-InvesAndMal2019, to evaluate AmandaSystem against existing methods in terms of the time efficiency of APK parsing, space occupancy, and the comprehensiveness of the extracted features.
Introduction
With the accelerated development of smartphones, Android has gradually become the main platform for the mobile Internet, with a global share of 84.1% [1]. However, with the popularity of Android, malware developed for Android is also booming. Meanwhile, with the remarkable achievements of deep learning in NLP and CV, researchers have proposed various methods for malware detection using deep neural networks.
Current research of this kind on Android mainly focuses on proposing new neural networks to improve the accuracy of malware detection, and these methods usually follow the same process. First, an apk file is parsed. Second, the feature information required by the detection method is extracted; this may be inherent properties of the Android system, such as the invoked permissions and APIs, or features that must be derived from detailed analysis of the apk file, such as system call sequences, taint propagation paths, and opcode call sequences. Finally, a feature space that can represent all samples is created and a classifier is trained to distinguish malware from benign software.
Focusing on the first step, a large number of malware detection methods use third-party reverse engineering tools such as Jadx, ApkTool, or dex2jar. However, these tools only provide the decompiled smali files of an apk and suffer from long running times and large space overhead. All they do is turn a dex file into a smali file, whereas researchers often need deep features of the apk: in static analysis, the mapping between sensitive permissions and APIs, the data flow graph of malicious behavior, and the frequency of malicious behavior across the life cycles of the malware; in dynamic analysis, the relationship between system calls and functions and the transfer of handles between different system calls. To obtain these features from the decompiled files, researchers need a professional background in reverse engineering and must master smali syntax, reorganize the program structure, and write analysis scripts for different apks.
Of course, using Androguard to extract static features or DroidBox [2] to extract dynamic features may seem to be a solution. However, the same problem is inevitable. Judging from the commands it provides, Androguard exposes only inherent, shallow features of the apk, such as permissions, strings, and component information; deep features, such as the flow of Intents between components, cannot be extracted. DroidBox requires researchers to analyze the dynamic behavior of the apk from log files. Moreover, all these tools need to be deployed and executed separately, and each run analyzes only one apk. When feature spaces must be constructed for a large number of samples and each apk needs a complete, multi-level analysis (i.e., combining the outputs of different static and dynamic features), the existing approach is clearly not a good choice.
In this work, we present AmandaSystem, an intelligent and fully automated malware detection platform tailored specifically to Android. AmandaSystem is the first platform that systematizes the entire malware detection process based on deep neural networks. Rather than proposing a particular neural network structure to improve detection accuracy, AmandaSystem provides an all-in-one tool for researchers who want to detect malware with neural networks: it performs code inspection to retrieve a wide range of features and processes all the collected information to extract multi-source features that model the behavior of a sample and can be used to identify whether it is malware or benign software.
AmandaSystem's main contribution is the fast and efficient extraction of static and dynamic features from Android applications. This is made possible by a custom lightweight apk parsing module that outperforms the existing decompiling tools Apktool and Jadx in terms of time efficiency, space occupation, and the analysis results it includes. We have also tightly coupled the apk analysis and feature extraction processes so that the required feature information is available as soon as the apk analysis ends.
We simplified the malware detection process, modularized its individual fine-grained steps, and systematized them on this modular basis. We design and implement a lightweight apk parsing module for malware datasets, which can analyze a large number of apks concurrently and outputs features in a convenient form, making it better suited to malware detection scenarios. The subsequent feature extraction stage makes full use of the parsing results from the previous stage to complete feature extraction efficiently, accurately, and easily. In addition, AmandaSystem implements the main Android feature types and encoding methods used by existing detection methods and provides templates for generic neural network models.
Meanwhile, constructing a mapping relationship between permissions and APIs invoked by Android applications is the key to solve the problem of excessive permissions in Android applications. This mapping relationship well connects Android permissions and fine-grained API invocations, providing a basis for feature encoding, sensitive information location, and identification of sources and sinks for taint analysis. There are three current solutions:
1. Timothy et al. [3] refer to the Android API documentation to manually build a database of permission-API mapping relationships. They find all API calls by traversing the application source code, query the database, and compare the result with the permissions declared in the manifest.xml file to determine whether there is an excessive permission problem. However, the most serious flaw of this method is that it cannot analyze APKs for which no source code is available, and the constructed database becomes inaccurate as the Android system is updated.
2. Stowaway [4] finds the permissions required by each method in the Android API through an automated testing tool. However, Stowaway requires manual specification of some API input parameters and call sequences, and its specification is incomplete because it relies on feedback-directed API fuzzing to extract it. Stowaway can only exercise the API calls it can find, of which it successfully executes only 85%.
3. Finally, the current call graph based approach [5] requires the AOSP source code for analysis. It first establishes the complete mapping relationship within the framework, then collects the API calls of an apk with a UI fuzzer, and finally constructs a coarse-grained mapping between permission declarations and API calls by comparing against the complete mapping relationship.
In addition, the call graph based method restricts the analysis to the Android API and does not consider the relationship between permissions and third-party API calls. As the Android version is updated, the complete mapping relationship originally constructed from AOSP is no longer accurate. How to accurately build the mapping relationship between permissions and API calls inside an apk has therefore become a new problem. To solve the problems of these methods, we propose a bottom-up static analysis method that relies on the calling and called mechanism of AmandaSystem and on the permission request process of Android to build a complete mapping between Android permissions and the APIs (from Android, Java, and third parties) called inside an apk.
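To make the role of such a mapping concrete, the following minimal sketch shows how a permission-API mapping can be used to flag excessive permissions. The mapping contents, function names, and sample inputs are hypothetical and only illustrate the idea; they are not PerApTool's actual data or interface.

```python
from typing import Dict, Iterable, Set


def required_permissions(api_calls: Iterable[str],
                         api_to_perms: Dict[str, Set[str]]) -> Set[str]:
    """Union of the permissions needed by the API calls found in the code."""
    needed: Set[str] = set()
    for api in api_calls:
        needed |= api_to_perms.get(api, set())
    return needed


def excessive_permissions(declared: Iterable[str],
                          api_calls: Iterable[str],
                          api_to_perms: Dict[str, Set[str]]) -> Set[str]:
    """Permissions declared in the manifest but not required by any call."""
    return set(declared) - required_permissions(api_calls, api_to_perms)


# Hypothetical mapping; a real one would be produced by a mapping tool.
API_TO_PERMS = {
    "android.telephony.SmsManager.sendTextMessage": {
        "android.permission.SEND_SMS",
    },
}

declared = ["android.permission.SEND_SMS", "android.permission.CAMERA"]
calls = ["android.telephony.SmsManager.sendTextMessage"]
unused = excessive_permissions(declared, calls, API_TO_PERMS)
print(sorted(unused))
```

The quality of the result depends entirely on the completeness of the mapping, which is why an inaccurate or outdated mapping leads directly to wrong over-privilege reports.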
AmandaSystem can not only be used to train neural networks for malware detection, but also help researchers to quickly replicate existing experiments or modularly implement new detection methods. Moreover, AmandaSystem is scalable. Although we have implemented mainstream feature types and encoding methods on the platform, we still support users to write scripts to call our parsing results to extract special types of features and implement their unique encoding methods.
To summarize, this work presents the following original contributions:
1. AmandaSystem is the first platform that systematically realizes the whole process of apk parsing, feature extraction, feature fusion, and neural network modeling.
2. We implement a lightweight apk parsing module.
3. We propose a new static analysis method that establishes the mapping relationship between Android permissions and the APIs inside an apk, and develop PerApTool based on it.
4. Users can use the platform to reproduce experiments related to malware detection or to design their own detection methods.
5. We fully evaluate the performance of the tool on three publicly available malware datasets: CICMalAnal2017, CIC-AAGM2017, and CIC-InvesAndMal2019.
The paper is organized as follows. Section 2 discusses related work, Section 3 explains the overall architecture of our Android malware detection platform and details its workings and systematic advantages, Section 4 presents PerApTool, Section 5 evaluates AmandaSystem and PerApTool, and Section 6 concludes.
Related work
The closest related work is AndroPyTool [6], proposed by Alejandro et al. It integrates third-party tools from the malware research domain, such as AndroGuard, VirusTotal, FlowDroid, and DroidBox, and outputs the extracted dynamic and static features in CSV or JSON form. However, the framework is only a simple integration of these tools. Each tool has its own analysis process, so an apk is analyzed repeatedly by the various third-party tools, which adds unnecessary analysis steps and wastes a lot of time and space. Because AndroPyTool only integrates these third-party tools, users must go through a tedious and complicated environment configuration process and resolve the version dependencies between each environment and tool themselves, which raises the cost of learning the tool. Moreover, as the Android system environment is constantly updated and some tools stop being maintained, AndroPyTool, which relies completely on third-party tools, is difficult to update and maintain over the long term. In contrast, AmandaSystem is a custom malware analysis framework implemented in a unified environment. It not only simplifies the process of malware detection and provides diverse static and dynamic analysis results, but also allows concurrent analysis of a large number of APKs, which helps researchers build feature spaces on large malware datasets.
Table 1 summarizes the static features used by current malware detection methods, based on the work of Liu et al. [7]. Most of the existing methods focus on the inherent properties of Android, while features that involve program analysis, such as CFG/DFG or Source/Sink paths, are used very rarely.
Examples of some static features used in deep learning-based malware detection methods
We list some of the more representative methods. These methods [8–12] share the idea of producing decompiled files with reverse engineering tools such as ApkTool, dex2jar, or jadx and then performing feature extraction on that basis. The largest number of malware detection methods use features that are easy to extract, such as information stored directly in the Manifest.xml file (permissions, content providers, etc.). The drawback of these methods is that they cannot dig into the deeper properties of the program. For example, the permissions in Manifest.xml are the most widely used feature, but Android programs also suffer from excessive permissions, that is, the permissions declared in manifest.xml are not necessarily used in the program code. Using such shallow features to represent program properties is therefore problematic. For researchers who focus on modifying the structure of neural networks, two things pose difficulties: third-party tools cannot quickly and efficiently produce, in batch, the program analysis results needed to construct feature spaces from static features, and shallow features cannot represent program properties well.
The main work related to establishing mapping relationships between Android permissions and API calls comprises the Stowaway project [4], Bartel et al. [13], Vidas et al. [14], and PScout [5].
Stowaway [4] uses API fuzzing to extract Android permission specifications, but it does not extract all of them as PerApTool does. For their purposes this is sufficient, since the main goal of their work is to measure permission over-declaration. Bartel et al. [13] perform a call graph based analysis of the Android framework, which is very similar in this respect to PScout [5] and less efficient than PerApTool's analysis based on the calling and called mechanism; in addition, Bartel et al. do not provide a check on permission functions. Vidas et al. [14] extracted a permission specification by scanning the Android documentation. As a result, their specification is the least complete of all previous work, as the Android documentation is incomplete.
PScout [5] also builds the mapping relationship between permissions and APIs by analyzing the call graph, and it also handles intent and content provider functions. However, it is only available for Android 4.0; on higher versions, in particular, the handling it proposes for intent and content provider functions is no longer applicable. In contrast, PerApTool is based on AmandaSystem's high-precision analysis at four granularity levels and stores calling and called information for each granularity, which makes it more accurate and efficient than PScout.
These static analysis methods based on call graphs first need to build a complete call graph of the Android framework and find the mapping between permissions and Android APIs by analyzing path reachability. The analysis of an apk then compares the permissions declared in manifest.xml with the Android API library calls made by the apk to establish the mapping between the apk's permissions and the Android API. Analyzing AOSP is a time-consuming operation. The call graph also has several limitations. First, it is built on the Application Framework of Android and cannot establish mappings for native APIs developed in C/C++. Second, since the analysis is based on AOSP, no mapping can be established for the third-party libraries integrated in an apk; as apks that integrate a large number of third-party libraries are becoming mainstream, it is necessary to analyze the mapping between permissions and the APIs of third-party libraries. Third, the current methods only support Android 5.0 and below; since Android 6.0, Google manages dangerous permissions at runtime and the permission request process has changed, so these earlier methods no longer apply.
We propose a new platform in this paper, AmandaSystem. It is the first platform that realizes the whole process of apk parsing, feature extraction, feature fusion, and neural network model systematically.
AmandaSystem is capable of concurrently processing the large number of apks in a malware dataset, and it provides a variety of malware feature forms, encoding methods, and neural network templates, contributing a complete one-stop solution for deep neural network-based malware detection. Researchers can assemble their own malware detection models by selecting the desired feature forms and neural network models according to their needs.
AmandaSystem consists of four modules: the Apk Parsing module, the Feature Extraction module, the Feature Encoding module, and the Neural Network module.
Among them, Apk Parsing module is a lightweight apk parsing module we customized, which can efficiently analyze a large number of apk concurrently, output good feature forms, and construct mapping relationships between apk call permissions and APIs.
The Feature Extraction module implements 11 kinds of feature types of Android applications which are more commonly used at present.
In the Feature Encoding module, 4 of the more common feature encoding forms are implemented. Finally, in the Neural Network module, 4 types of neural networks are implemented using standardized templates.
AmandaSystem is mainly written in Python, and the operations related to third-party components are written in Java. The general framework of AmandaSystem in this paper is shown in Fig. 1, and the specific design of each module of the platform is described in detail below.

AmandaSystem architecture.
An Android APK (Android Package Kit) file is essentially a compressed archive, mainly divided into resource files and classes.dex files. In AmandaSystem, the Apk Parsing module is mainly composed of two parts, ApkAnalysis and PerApTool. ApkAnalysis analyzes an individual apk; only the decompilation of .dex files into smali depends on DAD, the decompilation engine built into the open source tool Androguard, while we analyze and manage the decompiled data ourselves and generate a good data form for the subsequent modules. PerApTool uses forward and reverse tracing algorithms to establish the permission-API mapping relationship inside an apk. We focus on ApkAnalysis in this section and introduce PerApTool in Section 4.
Concurrent design with a combination of multi-process and multi-thread
In the Apk Parsing module, a multi-process pool is first created whose size is determined by the hardware of the host, so that the CPU cores are used as fully as possible. Each sub-process runs one ApkAnalysis. In addition, for the different extraction units of the feature extraction module, corresponding global collections (dictionaries, lists, etc.) are created to gather the feature information returned by the different ApkAnalysis instances.
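The process-level design described above can be sketched as follows. This is a minimal illustration using Python's standard `concurrent.futures`; the function names `analyze_apk` and `analyze_dataset`, the stub feature values, and the layout of the collected dictionary are assumptions made for the example, not AmandaSystem's actual code.

```python
import os
from concurrent.futures import Executor, ProcessPoolExecutor
from typing import Callable, Dict, List, Type


def analyze_apk(apk_path: str) -> Dict[str, List[str]]:
    """Stand-in for one ApkAnalysis run; returns features grouped by type."""
    # A real implementation would parse the apk here. This stub derives a
    # fake result from the file name so the example is self-contained.
    name = os.path.basename(apk_path)
    return {"permissions": [f"{name}:PERMISSION"], "apis": [f"{name}:API"]}


def analyze_dataset(apk_paths: List[str],
                    executor_cls: Type[Executor] = ProcessPoolExecutor,
                    worker: Callable[[str], Dict[str, List[str]]] = analyze_apk,
                    ) -> Dict[str, Dict[str, List[str]]]:
    """Run one worker per apk in a pool sized to the host's CPU count."""
    workers = os.cpu_count() or 1
    # Global collection: feature type -> apk path -> feature values.
    collected: Dict[str, Dict[str, List[str]]] = {}
    with executor_cls(max_workers=workers) as pool:
        # map() preserves input order, so results line up with apk_paths.
        for path, features in zip(apk_paths, pool.map(worker, apk_paths)):
            for kind, values in features.items():
                collected.setdefault(kind, {})[path] = values
    return collected


if __name__ == "__main__":
    result = analyze_dataset(["a.apk", "b.apk"])
    print(result["permissions"])
```

Passing the executor class as a parameter is a design choice of this sketch: it lets the same code run with a thread pool in environments where spawning processes is inconvenient.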
At the same time, for a single apk, ApkAnalysis also manages multiple threads internally. Resource files other than classes.dex are parsed by a separate thread, SourceAnalysis, while ClassesAnalysis first establishes a global calling and called mechanism between the different DvmFormats. This mechanism (described in Section 4) avoids the high time and space costs of constructing complete call graphs and allows each dex file to be processed concurrently through a thread pool.
Each dex file is put into a DvmFormat for parsing, and each DvmFormat runs in its own sub-thread, which must be associated with a DAD engine. By default there are 4 engines, so four sub-threads can decompile at the same time; sub-threads that have not been assigned a DAD enter a blocking queue and wait. When a sub-thread finishes decompiling, it is automatically disconnected from its DAD, and the threads in the blocking queue compete for the freed engine.
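The engine-limited threading described above can be approximated with a counting semaphore. The sketch below is only an illustration under assumptions: `EnginePool`, `decompile_dex`, and the simulated work are hypothetical stand-ins, and no real DAD engine is invoked.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List

ENGINE_COUNT = 4  # default number of decompiler engines, as stated in the text


class EnginePool:
    """Hands out a fixed number of engine slots; extra callers block."""

    def __init__(self, size: int = ENGINE_COUNT) -> None:
        self._slots = threading.Semaphore(size)
        self._lock = threading.Lock()
        self._busy = 0
        self.peak_busy = 0  # highest number of engines in use at once

    def __enter__(self) -> "EnginePool":
        self._slots.acquire()          # waits here when all engines are taken
        with self._lock:
            self._busy += 1
            self.peak_busy = max(self.peak_busy, self._busy)
        return self

    def __exit__(self, *exc) -> None:
        with self._lock:
            self._busy -= 1
        self._slots.release()          # frees the engine for a waiting thread


def decompile_dex(dex_name: str, engines: EnginePool) -> str:
    """Placeholder for decompiling one dex file with a borrowed engine."""
    with engines:
        time.sleep(0.01)               # simulate decompilation work
        return f"{dex_name}.smali"


def decompile_all(dex_names: List[str], engines: EnginePool) -> List[str]:
    # One sub-thread per dex file; the semaphore caps concurrent decompiles.
    with ThreadPoolExecutor(max_workers=max(1, len(dex_names))) as pool:
        return list(pool.map(lambda d: decompile_dex(d, engines), dex_names))


engines = EnginePool()
outputs = decompile_all([f"classes{i}.dex" for i in range(1, 11)], engines)
print(outputs[0], engines.peak_busy)
```

Here ten sub-threads are started, but at most four of them hold an engine at any moment; the others wait inside `acquire()`, which plays the role of the blocking queue.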
After a single dex is decompiled into smali, the result is passed to a ClassManager as a file stream, and the ClassManager encapsulates each instruction in an Instruction object. Each ClassManager builds a unitchain, which is associated with the indexes of the classes and methods in the same dex. The index of a class is relative to a single dex file and can be regarded as an absolute address, while the index of a method is relative to the address of the class that contains it; similarly, the indexes of the strings and fields in a method are relative to the address of that method, and their values are passed up to the higher levels. The design of the Apk Parsing module is shown in Fig. 2.

The design of Apk Parsing module.
First of all, the traditional decompiling tools apktool and jadx cannot analyze multiple apks concurrently; they handle only one apk at a time. Secondly, both are implemented in Java: apktool parses classes.dex into smali files, while jadx decompiles dex into Java files.
When we analyzed the source code of Apktool, we found that it first calls dexlib.jar to process dex files and then calls the baksmali tool to generate smali files. baksmali is also written in Java; although it implements concurrent multi-threaded parsing of multiple dex files, it generates smali files on a class-by-class basis, which results in far more smali files than dex files and makes it more expensive in terms of space. To make matters worse, ApkTool parses resource files, manifest.xml, and classes.dex sequentially, unlike AmandaSystem, which parses resource files and dex files concurrently; this also results in a high time cost.
For classes.dex analysis, ApkTool simply copies out the smali files generated by baksmali. Any researcher using the tool must therefore first reconstruct the program's architecture from these smali files before pursuing the subsequent analysis goal, which raises the analysis threshold; the same problem exists in jadx. In contrast, AmandaSystem not only constructs a complete chain of relationships from class to method, field, and string when parsing a dex file in multiple threads, but also establishes a global point-to-point calling and called mechanism through ClassesAnalysis, which manages multiple dex files and links classes handled in different threads. This provides the basis for the subsequent construction of more complex data flow graphs and the extraction of different features.
Secondly, when analyzing the source code of the decompiler jadx, we found that its decompilation process first parses the dex into smali, then converts the smali into Java class files through asm.jar, and finally parses the classes to obtain Java source code. Although the conversion to Java source code improves readability, it is costly in time and space for batch apk decompilation and analysis.
In terms of details, jadx decompiles through the class JadxDecompiler. JadxDecompiler uses Java's Executors.newFixedThreadPool() function to create a thread pool; however, we found that this does not enable thread multiplexing, so decompiling a large apk (>50 MB) under jadx's default configuration is prone to hangs and crashes. In addition, although jadx analyzes resource files and classes.dex files concurrently, the concurrent analysis of classes.dex exists only in the last step of the decompilation process, in which classes are parsed into Java source code. This is because jadx is designed to support a variety of Java-based decompilation tasks and is not specifically designed for apks, so decompiling apks with it costs more time and space than with other decompiling tools.
To sum up, through this concurrent design combining multiple processes and multiple threads, AmandaSystem is better than traditional decompiling tools in both time efficiency and space efficiency. A quantitative comparison of AmandaSystem with traditional decompiling tools is given in Section 5.
Efficient multi-grain analysis
Tools that analyze decompiled smali files, such as AndroPyTool and Androguard, generally build their multi-granularity analysis in the traditional way: they scan the instructions one by one, collect the information of each class from each smali file, and then traverse the instructions of each granularity in turn, in order of decreasing granularity. For example, they traverse a class to find each method and then traverse each method to find the finer-grained fields and strings.
However, this approach requires multiple global iterations over all instructions, which is inefficient. ApkAnalysis instead takes full advantage of multi-threading: rather than collecting all class information globally, each ClassManager creates ClassAnalysis, MethodAnalysis, and FieldAnalysis objects in sequence, according to smali syntax, as it iterates over the instructions. For example, when a .class directive is encountered, a ClassAnalysis is created and stays open until the next .class; likewise, when a .method directive is encountered, a MethodAnalysis is created and stays open until the matching .end method. This approach ensures that a ClassManager completes the multi-granularity analysis in a single traversal of the decompiled instructions. Each ClassManager collects all the ClassAnalysis objects of a single file and finally passes them to ClassesAnalysis for global class management.
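The single-traversal idea can be illustrated with a small directive-driven parser. This is a simplified sketch under assumptions: the data classes reuse the names ClassAnalysis and MethodAnalysis from the text, but their fields, the `parse_smali` function, and the sample input are hypothetical, and only a handful of smali directives are handled. The sample places two classes in one stream, as the description above implies.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MethodAnalysis:
    name: str
    start: int                      # index of the .method line
    end: Optional[int] = None       # index of the matching .end method line
    instructions: List[str] = field(default_factory=list)


@dataclass
class ClassAnalysis:
    name: str
    start: int                      # index of the .class line
    fields: List[str] = field(default_factory=list)
    methods: List[MethodAnalysis] = field(default_factory=list)


def parse_smali(lines: List[str]) -> List[ClassAnalysis]:
    """Build class, method, and field records in one pass over the lines."""
    classes: List[ClassAnalysis] = []
    current_class: Optional[ClassAnalysis] = None
    current_method: Optional[MethodAnalysis] = None
    for index, raw in enumerate(lines):
        line = raw.strip()
        if not line:
            continue
        if line.startswith(".class "):
            current_class = ClassAnalysis(name=line.split()[-1], start=index)
            classes.append(current_class)
        elif line.startswith(".end method"):
            if current_method is not None:
                current_method.end = index
            current_method = None
        elif line.startswith(".method ") and current_class is not None:
            current_method = MethodAnalysis(name=line.split()[-1], start=index)
            current_class.methods.append(current_method)
        elif line.startswith(".field ") and current_class is not None:
            # Drop an optional " = value" initializer before taking the name.
            declaration = line.split(" = ")[0]
            current_class.fields.append(declaration.split()[-1])
        elif current_method is not None:
            current_method.instructions.append(line)
    return classes


SAMPLE = """
.class public Lcom/example/Foo;
.super Ljava/lang/Object;
.field private count:I
.method public run()V
    const-string v0, "android.permission.SEND_SMS"
    return-void
.end method
.class public Lcom/example/Bar;
.method public stop()V
    return-void
.end method
""".splitlines()

parsed = parse_smali(SAMPLE)
print([c.name for c in parsed])
```

Each line is visited exactly once, and the start and end indexes of every method are known as soon as its `.end method` directive is read.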
In addition, each granularity analysis has the following main functions: parsing the basic information of its granularity, such as the name, parameters, and return value of a method; determining the start and end indexes of its granularity in the smali file; and building the set of instructions it contains. MethodAnalysis additionally constructs the BasicBlocks of each method. Finally, ClassManager constructs a calling and called mechanism between different granularities, as well as between granularities of the same level.
Feature extraction module
Using various types of features of Android applications to represent program behavior more comprehensively is a currently accepted research strategy in the field of malware detection, and multiple features also help models learn the intrinsic associations between features. Following many current research approaches, we classify Android features into two main categories: static features and dynamic features. Static features mainly include permissions, intents, API calls, etc. Dynamic features mainly include network data, file read and write operations, etc. These two categories are implemented by two sub-modules of the Feature Extraction module, StaticFeatureExtraction and DynamicFeatureExtraction, respectively.
Feature triggers and static feature extraction
In most malware detection methods, apk parsing and feature extraction are performed separately, because feature extraction in these methods depends on the output of third-party tools and has to collect features by actively traversing the program instructions after the apk has been parsed.
However, many static features of Android, such as the identification of taint sources/sinks, the invocation of sensitive permissions, and data leaks caused by Intents, can be collected entirely during program analysis. Because apk parsing in existing methods does not involve the targeted collection of specific features, the aggregated information has to be retrieved after apk parsing or, in the worst case, all the decompiled code has to be traversed again. Compared with this approach, AmandaSystem's process is more streamlined, and the full set of features can be output as soon as apk parsing ends.
In the feature extraction process, AmandaSystem does not adopt the approach of the previous literature, which waits for program decompilation and parsing to finish before extracting features. Instead, we design and implement feature triggers, modeled on database triggers. AmandaSystem uses the feature triggers during apk analysis to passively collect feature information throughout all stages of the analysis. The biggest difference from past methods is that feature extraction does not affect the apk parsing process, yet feature collection is complete at the moment apk parsing is complete.
For example, for PermissionTrigger, when SourceAnalysis traverses manifest.xml, it visits each AxmlNode and collects the <uses-permission> entries, and when the ClassManager in ClassAnalysis traverses all instructions, it determines whether the register in a const-string vx instruction contains data with .permission. Figure 3 shows the structure of the PermissionTrigger and how features are collected throughout the apk parsing process.

Structure of PermissionTrigger.
All feature triggers inherit a unified interface, and different feature triggers need to be written depending on the features to be extracted. Sources/Sinks are different in that they need to collect information from MethodAnalysis. But essentially, they only need to process elements from the Axml and Instruction data structures, as the internal structure of PermissionTrigger in Fig. 3 demonstrates.
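A minimal sketch of the trigger idea is given below. The interface shape, the callback names (`on_manifest_node`, `on_instruction`), and the toy `parse_apk` loop are assumptions made for illustration; the text only states that all triggers share a unified interface and process Axml and Instruction elements.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List, Set, Tuple


class FeatureTrigger(ABC):
    """Common interface: the parser notifies triggers as it walks the apk."""

    def on_manifest_node(self, tag: str, attributes: Dict[str, str]) -> None:
        """Called for every manifest node; the default is to ignore it."""

    def on_instruction(self, instruction: str) -> None:
        """Called for every decompiled instruction; the default ignores it."""

    @abstractmethod
    def result(self) -> List[str]:
        """Return the features collected so far."""


class PermissionTrigger(FeatureTrigger):
    def __init__(self) -> None:
        self.declared: Set[str] = set()    # from <uses-permission> nodes
        self.referenced: Set[str] = set()  # from const-string operands

    def on_manifest_node(self, tag: str, attributes: Dict[str, str]) -> None:
        if tag == "uses-permission" and "android:name" in attributes:
            self.declared.add(attributes["android:name"])

    def on_instruction(self, instruction: str) -> None:
        if instruction.startswith("const-string") and ".permission." in instruction:
            # The string literal is the text between the first pair of quotes.
            parts = instruction.split('"')
            if len(parts) >= 3:
                self.referenced.add(parts[1])

    def result(self) -> List[str]:
        return sorted(self.declared | self.referenced)


def parse_apk(manifest_nodes: Iterable[Tuple[str, Dict[str, str]]],
              instructions: Iterable[str],
              triggers: List[FeatureTrigger]) -> None:
    """Toy parser loop: features are collected as a side effect of parsing."""
    for tag, attributes in manifest_nodes:
        for trigger in triggers:
            trigger.on_manifest_node(tag, attributes)
    for instruction in instructions:
        for trigger in triggers:
            trigger.on_instruction(instruction)


trigger = PermissionTrigger()
parse_apk(
    [("uses-permission", {"android:name": "android.permission.INTERNET"}),
     ("activity", {"android:name": ".Main"})],
    ['const-string v0, "android.permission.SEND_SMS"', "return-void"],
    [trigger],
)
print(trigger.result())
```

As with a database trigger, the parser does not know which features are being collected; adding a new feature type means registering another trigger rather than traversing the decompiled code again.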
For the collection of dynamic features, we integrated the Monkey Runner tool provided by Google. Our scripts install the apk on an Android emulator, exercise it with Monkey, use strace to obtain the sequence of system calls, capture traffic data by filtering the HTTP packets sent by the apk to obtain malicious URLs, and finally obtain log information from logcat. In addition, because the Android environment compiled by DroidBox [2] is still at around version 4.0, which is too old, we do not collect dynamic taint analysis data based on the DroidBox patch files.
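As a rough illustration of the steps such a script has to issue, the sketch below only builds the command lines and executes nothing. The function name, the event count, and the trace output path are hypothetical; it also assumes that a strace binary is available on the emulator and that the process id of the app has been obtained separately. In a real run the strace command would be started before Monkey and left running alongside it.

```python
from typing import List


def dynamic_analysis_commands(apk_path: str, package: str, pid: int,
                              events: int = 500) -> List[List[str]]:
    """Return the argv lists for one dynamic run; nothing is executed here."""
    trace_file = "/data/local/tmp/syscalls.txt"  # hypothetical output path
    return [
        # Install (or reinstall) the apk on the connected emulator.
        ["adb", "install", "-r", apk_path],
        # Send pseudo-random UI events to the given package.
        ["adb", "shell", "monkey", "-p", package, "-v", str(events)],
        # Trace the system calls of the app process and its children.
        ["adb", "shell", "strace", "-f", "-p", str(pid), "-o", trace_file],
        # Dump the current log buffer and exit.
        ["adb", "logcat", "-d"],
    ]


commands = dynamic_analysis_commands("sample.apk", "com.example.app", 1234)
for argv in commands:
    print(" ".join(argv))
```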
To better integrate some hard-to-extract features such as IR and RTA. we also integrate soot [15] and spark [16] as third-party components, which will not be started by default unless users need them, because their start-up will seriously affect the running efficiency of AmandaSystem.
Feature encoding module
After apk analysis and feature extraction, the extracted features need to be encoded as feature vectors. There are many forms of feature encoding; AmandaSystem provides four of the more common ones: One-hot, N-gram, Image, and Graph.
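To make two of these encodings concrete, the following minimal Python sketch shows one-hot encoding of a permission set and n-gram extraction from an opcode sequence. The function names and the sample vocabulary are assumptions for illustration and are not part of AmandaSystem.

```python
def one_hot(features, vocabulary):
    """Return a 0/1 vector with one position per vocabulary entry."""
    present = set(features)
    return [1 if item in present else 0 for item in vocabulary]


def ngrams(sequence, n):
    """Return all contiguous windows of length n from the sequence."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


# Example: three known permissions, one of which the apk declares.
vocabulary = ["CAMERA", "SEND_SMS", "INTERNET"]
vector = one_hot(["INTERNET"], vocabulary)

# Example: opcode bigrams from a short instruction sequence.
bigrams = ngrams(["const-string", "invoke-static", "move-result", "if-eqz"], 2)
```

The one-hot vector has a fixed length given by the vocabulary, while the number of n-grams depends on the sequence length, so n-gram features are usually counted or hashed into a fixed-size vector before being passed to a model.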
Neural network module
Currently, we provide several conventional neural network models, such as DNN, CNN, and RNN. Since users have different requirements for hyperparameters such as the number of layers and the training time, they can specify these hyperparameters according to their needs or use our default parameters directly. In addition, AmandaSystem modularizes the individual neural network models so that users can easily combine these basic models into more complex networks.
PerApTool
PerApTool is a tool that maps permissions to API calls inside an apk, implemented on top of AmandaSystem’s systematic calling and called mechanism. In the following, we first introduce the design of the calling and called mechanism, then describe how PerApTool establishes the mapping between permissions and API calls, and finally compare it with other methods.
Calling and called mechanisms
Inside an Android application, one class usually accesses the properties and methods of another class. From the perspective of software analysis, analyzing the data flow between different classes or functions is called inter-procedural analysis.
In previous methods, obtaining the invocation relationships between methods requires constructing the complete call graph globally. AmandaSystem takes a different approach. For example, at class granularity, when we parse a class, we record the other classes it calls in the ClassAnalysis of that class, and also record the current class in the ClassAnalysis of each called class. Similarly, the functions that use a string or field can be found from that string or field, and the class containing a function can be found from the function. This establishes relationships within and across granularity levels. Because such relationships are recorded only point to point and do not require generating a complete data flow graph of the apk, the time and space costs are smaller. The calling and called mechanism is illustrated in Fig. 4.

Calling and called mechanisms.
The advantage of this approach is that when we retrieve permission strings, permission checking functions, and Intent.Action values in the decompiled code, we can quickly establish the calling relationships between functions, and between strings and functions, using the records that already exist. There is no need to build a complete call graph first and then search and analyze it step by step from top to bottom, as in traditional call-graph-based analysis.
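A minimal Python sketch of such point-to-point records is given below. The XrefIndex class and its method names are hypothetical and only illustrate the idea of storing each relation in both directions at parse time; AmandaSystem keeps the corresponding records inside its ClassAnalysis, MethodAnalysis, and StringAnalysis objects.

```python
from collections import defaultdict


class XrefIndex:
    """Bidirectional cross-references recorded while parsing."""

    def __init__(self):
        self.calls = defaultdict(set)         # caller -> callees
        self.called_by = defaultdict(set)     # callee -> callers
        self.string_users = defaultdict(set)  # string -> methods using it

    def add_call(self, caller, callee):
        # Record the edge on both endpoints at once.
        self.calls[caller].add(callee)
        self.called_by[callee].add(caller)

    def add_string_use(self, method, value):
        self.string_users[value].add(method)

    def callers_of(self, method):
        return set(self.called_by.get(method, ()))

    def methods_using(self, value):
        return set(self.string_users.get(value, ()))


# Usage: record relations as instructions are parsed, then look them up directly.
index = XrefIndex()
index.add_call("MainActivity.onCreate", "CameraHelper.open")
index.add_string_use("CameraHelper.open", "android.permission.CAMERA")
users = index.methods_using("android.permission.CAMERA")
callers = {m: index.callers_of(m) for m in users}
```

Each lookup is a dictionary access, so finding the functions that reference a permission string, or the callers of a function, does not require traversing a global call graph.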
PerApTool constructs a mapping relationship between permissions and API calls by tracing permission requests from permission strings, permission checking functions, and simulated permission checking mechanisms to APIs called by Android applications. It divides the process of establishing the mapping relationship between permissions and API calls into three parts.
By permission string
First, PerApTool finds all the permissions declared in the AndroidManifest.xml of the apk and stores them in a set called P1. Second, it passes the set P1 to the StringAnalysis submodule of AmandaSystem and finds the functions that reference each permission string through AmandaSystem’s calling and called mechanism. Third, it locates where the permission string appears in each function and follows the forward tracing and backward tracing algorithms to find the API calls made after that permission is requested, which establishes the mapping between permissions and API calls and generates the result set R1.
By the permission check function
There are four categories of permission checking functions.
ContextCompat.checkSelfPermission() checks whether the application has been granted a given dangerous permission.
ActivityCompat.requestPermissions() lets the application request permissions dynamically; calling it brings up a dialog box prompting the user to grant the requested permission.
ActivityCompat.shouldShowRequestPermissionRationale() returns true if the application has requested the permission before but the user rejected the request.
onRequestPermissionsResult() is the callback that receives the user's response after the application requests a permission.
First, we identify all permission checking functions through the MethodAnalysis submodule of AmandaSystem and, using the calling and called mechanism, find all functions that call a permission checking function.
Second, we locate the position of the permission checking function in each of these calling functions and find the API calls that appear after it. Third, the API calls that follow the permission checking function are UI fuzzed with the declared permission set P1 to establish the mapping between permissions and API calls, generating the result set R2.
By Intent.Action
First, we extract the file describing the mapping between Intent.Action values and permissions from PScout and compare it with the Intent.Action values in AndroidManifest.xml. The Intent.Action values that appear in AndroidManifest.xml, together with their associated permissions, are saved as the set P2.
Second, we find all Intent.Action values in StringAnalysis and, using the calling and called mechanism of AmandaSystem, locate where each Intent.Action appears in the functions that reference it.
Third, the APIs invoked through each Intent.Action are found with the forward tracing and backward tracing algorithms, and the mapping between permissions and API calls is established based on the set P2, generating the result set R3.
The mapping sets R1, R2, and R3 produced by these three methods are merged to produce the final set R of mappings between permissions and API calls, that is, R = R1 ∪ R2 ∪ R3.
Forward tracing and backward tracing algorithms
Google introduced a permission request mechanism in Android 6.0 that divides all permissions into normal and dangerous permissions. Every time an app uses a dangerous permission, it needs to dynamically request and get authorization from the user to use it.
Algorithm 1 Main loop of the forward tracing algorithm
  initialize ins_list, the instruction list of the method that references the string
  initialize flow(), the permission check process function
  api_flag := False
  For each i ∈ ins_list
    op_value := opcode of ins_list[i]
    If op_value is 0x1a    (const-string)
      reg := the virtual register of ins_list[i]
      meth_flag := False
      For each j ∈ ins_list, starting from j = i + 1
        op := opcode of ins_list[j]
        If op ∈ [0x6e, 0x72] ∪ [0x74, 0x78] ∧ ¬meth_flag    (invoke and invoke/range instructions)
          If reg ∈ ins_list[j]
            meth_flag := True
            api_flag := flow()
          EndIf
        EndIf
        If meth_flag
          break
        EndIf
      EndFor
    EndIf
  EndFor
  return api_flag
Algorithm 2 Main loop of the backward tracing algorithm
  If api_flag is False
    initialize cm_list, the set of functions that call the current function
    For each c ∈ cm_list
      If c.name() contains "check"
        initialize cm2_list, the set of functions that call c
        repeat the same process, tracing the calling functions backward for three levels, up to c3
        ...
        map the permission to c3
        ...
      Else
        initialize the instruction list of c
        apply the same process as forward tracing after locating the string
      EndIf
    EndFor
  EndIf
The runtime permission request mechanism introduced in Android 6.0 allows us to establish a mapping from permissions to API calls by simulating the runtime permission checking process and determining which API calls require which permissions. The forward and backward tracing algorithms are implemented based on this idea.
We give the implementation logic of the forward and backward tracing algorithms in Algorithm 1 and Algorithm 2. In the forward tracing algorithm, after locating the permission string, we check whether the subsequent decompiled code contains an if-eqz or if-nez structure, and find the API call that requires the permission by following the virtual register through the jump instruction.
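The main loop of forward tracing can be sketched in Python as follows. This is an illustration under stated assumptions rather than PerApTool's implementation: instructions are simplified to (opcode, registers) pairs, and flow is a hypothetical callback that stands in for the permission check process function, receiving the index of the matching invoke instruction. The sketch also accumulates the flag with a logical or across several permission strings, whereas Algorithm 1 assigns the result of flow() directly.

```python
CONST_STRING = 0x1A  # Dalvik const-string
# invoke-virtual .. invoke-interface (0x6e-0x72) and their /range forms (0x74-0x78)
INVOKE_OPCODES = set(range(0x6E, 0x73)) | set(range(0x74, 0x79))


def forward_trace(ins_list, flow):
    """Return True if a permission string reaches an invoke that flow() accepts.

    ins_list: list of (opcode, registers) pairs (simplified representation).
    flow:     hypothetical callback, flow(index) -> bool.
    """
    api_flag = False
    for i, (opcode, registers) in enumerate(ins_list):
        if opcode != CONST_STRING:
            continue
        reg = registers[0]  # register that holds the string
        # Scan forward for the first invoke that consumes this register.
        for j in range(i + 1, len(ins_list)):
            op, regs = ins_list[j]
            if op in INVOKE_OPCODES and reg in regs:
                # Assumption: keep an earlier positive result.
                api_flag = flow(j) or api_flag
                break
    return api_flag


# Usage with a toy instruction list: v0 receives a string and is passed to an invoke.
visited = []
toy = [(0x1A, ("v0",)), (0x12, ("v1",)), (0x6E, ("v2", "v0")), (0x71, ("v0",))]
result = forward_trace(toy, lambda j: visited.append(j) or True)
```

The sketch does not model a register being overwritten between the const-string and the invoke; a full implementation would need to track such reassignments.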
Backward tracing starts when forward tracing fails, for example when the code jumps directly to a return instruction after the permission string, or when the calling function is itself a check-type function. In that case we rely on AmandaSystem’s calling and called mechanism to move up the call hierarchy and trace how the result of the permission request is used; the subsequent processing is the same as forward tracing after locating the permission string.
Together, the forward and backward tracing algorithms establish the mapping between permissions and API calls by simulating the runtime permission checking process and relying on AmandaSystem’s systematic calling and called mechanism.
Note that backward tracing covers at most three levels of nesting. Starting from the function F that references the input permission string, we trace back to the function F1 that calls it. To account for multi-layer check functions and returns, we trace back again to the function F2 that calls F1 and, if there is still no result, perform a last round of backward tracing to the function F3 that calls F2. If the three levels of backward tracing yield no result, the mapping between the permission and F3 is established, indicating that the function performs permission-related operations without any permission checking process, which results in permission abuse.
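The three-level limit can be sketched in Python as follows. The function is a simplified illustration, not PerApTool's code: callers_of and resolve are hypothetical callbacks, where callers_of(m) returns the functions that call m (for example, from the calling and called records) and resolve(m) returns the API calls that forward tracing finds inside m. The check on function names containing "check" from Algorithm 2 is omitted for brevity.

```python
def backward_trace(start, callers_of, resolve, max_depth=3):
    """Walk up at most max_depth caller levels from `start`.

    Returns (targets, checked):
      checked is True  -> targets are API calls found by forward tracing in a caller;
      checked is False -> targets are the highest functions reached (F3 in the text),
                          to which the permission is mapped directly.
    """
    frontier = {start}
    for _ in range(max_depth):
        callers = {c for m in frontier for c in callers_of(m)}
        if not callers:
            break  # no caller recorded: stop at the highest level reached
        frontier = callers
        found = set()
        for caller in frontier:
            found |= set(resolve(caller))
        if found:
            return found, True
    return frontier, False


# Usage on a toy call chain F <- F1 <- F2 <- F3 <- F4.
chain = {"F": ["F1"], "F1": ["F2"], "F2": ["F3"], "F3": ["F4"]}
lookup = lambda m: chain.get(m, [])
unchecked = backward_trace("F", lookup, lambda m: set())
checked = backward_trace("F", lookup, lambda m: {"Camera.open"} if m == "F2" else set())
```

In the first call no level produces a result, so the permission is mapped to F3 and flagged as unchecked; in the second call the search stops at F2, where an API call is found.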
The advantages of PerApTool
First, PerApTool works directly on Android applications: it requires neither the source code of the APK nor compilation of the whole AOSP, so it can build permission and API mapping relationships for any apk, unlike the methods introduced in our related work, which all have certain limitations.
Second, PerApTool works by locating permission functions and permission strings. Through the calling and called mechanism of AmandaSystem, we can quickly and accurately retrieve the calling relationships associated with a string or function, without building a complete function call graph by traversing all API functions of AOSP or the apk, as previous methods do.
Third, PerApTool simulates the permission request process algorithmically to find the API calls related to a permission. Whether the API belongs to the Android system or to a third-party library, the corresponding mapping is established as long as it is related to the permission, which ensures the completeness of the tool. Other current methods can only establish mappings between permissions and Android system APIs.
Fourth, as long as the permission request mechanism of the Android system remains unchanged, PerApTool can accurately establish the mapping inside an apk without relying on the Android system version or the API manual, so it does not fail because of the rapid evolution of the Android system or iterative updates of the apk.
Finally, PerApTool locates the possible locations of permission calls through the three parts introduced earlier, and then simulates the permission request process with the forward and backward tracing algorithms to establish the mapping, which ensures the accuracy of the tool.
Evaluation
In this paper, we analyze 2000 apk in 5 major categories and 42 minor categories from the CICMalAnal2017, CIC-AAGM2017, and CIC-InvesAndMal2019 datasets. The experiments run on an Intel Core i9 2.5GHz CPU with 32GB RAM and Ubuntu 21.04, using the Python programming language and the PyCharm platform.
Assessment indicators
To better evaluate AmandaSystem, we evaluate its two most important parts, ApkAnalysis and PerApTool, separately. To demonstrate the time efficiency of ApkAnalysis, we select a group of apk from the entire dataset and measure the runtime of ApkAnalysis and of existing third-party apk decompilation tools on this same group, to show that ApkAnalysis requires less time when parsing a single apk. Because ApkTool and Jadx do not support concurrent analysis of multiple apk, we sum their runtimes for the individual analysis of each apk in the group to obtain the runtime for the whole group. We also measure the runtime of ApkAnalysis when parsing the same group concurrently, to demonstrate the advantage of concurrent parsing. Finally, we measure the space cost as the size of the output files after decompilation.
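The measurement procedure can be illustrated with a short Python sketch. It is not the ApkAnalysis implementation: parse_one is a hypothetical stand-in for parsing a single apk, and the use of a thread pool with a fixed number of workers is an assumption made for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed(func, *args):
    """Run func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def parse_sequential(paths, parse_one):
    # Baseline: analyze each apk individually; the group runtime is the sum.
    return [parse_one(path) for path in paths]


def parse_concurrent(paths, parse_one, workers=8):
    # Concurrent parsing of the same group; results keep the input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(parse_one, paths))


# Usage with a stand-in parser that only sleeps briefly.
def fake_parse(path):
    time.sleep(0.01)
    return path.upper()


group = [f"app{i}.apk" for i in range(6)]
sequential_result, sequential_time = timed(parse_sequential, group, fake_parse)
concurrent_result, concurrent_time = timed(parse_concurrent, group, fake_parse, 3)
```

Both variants must be run on the same group of apk for the runtimes to be comparable. In CPython, threads mainly help when the work releases the global interpreter lock, as file I/O does; CPU-bound parsing may need processes instead.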
We also measure the accuracy and completeness of the mappings extracted by PerApTool. Accuracy is defined as the proportion of the mappings extracted by PerApTool that actually exist, and completeness is defined as the proportion of all existing mappings that PerApTool finds. We first obtain the mapping relationships with PerApTool on a set of apk with known source code, then obtain the total number of mappings within each apk by static analysis, and finally calculate the accuracy and completeness of PerApTool.
The time and space cost of ApkAnalysis
We compare ApkAnalysis with the current mainstream apk decompilation tools ApkTool and Jadx; according to the results in Table 2, both have an 82% usage rate in the existing literature related to malware detection. Other apk decompilation tools are not comparable: dex2jar can only parse a single dex file rather than a complete apk, and ApKill is built on top of Jadx, so it is necessarily less time and space efficient than Jadx. We therefore compare only with ApkTool and Jadx.
Comparison of the runtime of AmandaSystem’s apk parsing module and other apk reversal tools
Concurrently parsing 2055 apk from the CIC-AAGM2017 dataset with ApkAnalysis took 51 minutes; Table 2 shows the runtime measurements for the same set of apk.
When analyzing a single apk, our average runtime is 2.49s, while ApkTool’s average runtime is 3.32s; our runtime is thus about 75% of ApkTool’s and 70% of Jadx’s. When analyzing the same group concurrently, we reduce the runtime to 50% of ApkTool’s and 46% of Jadx’s.
Space cost
We show a comparison of the size of the same set of apk output files in Table 3, where we store the ApkAnalysis output stream into a single text file and compare the size of that text file with the output files of existing tools.
Comparison of the space cost of AmandaSystem’s apk parsing module and other apk reversal tools
When analyzing a single apk, the size of our output data is approximately one thousandth of that of other third-party tools. Our output is also already organized as feature data, as shown in the right image of Fig. 4, which facilitates feature extraction and encoding in later processes.
Completeness and accuracy
First, we extract the API call list for a set of apk with known source code using static analysis and establish the mapping with the permission declarations in AndroidManifest.xml. Then we analyze the same set of apk with PerApTool to output the mapping between permissions and APIs, and finally calculate the accuracy and completeness from these two sets of mapping data.
To evaluate the completeness of the permission mapping created by PerApTool, we denote a single apk in the dataset as $apk_i$, where $i \in \{0, \dots, n\}$.
We define $P_{i,1}$ as the total number of mapping relations of $apk_i$ obtained by static analysis, and $P_{i,2}$ as the total number of mapping relations extracted by PerApTool, of which $TM_i$ are correct mapping relations and $FM_i$ are incorrect mapping relations.
So we get the following relation:
$$P_{i,2} = TM_i + FM_i$$
We can now give a definition of mapping completeness $C_i$:
$$C_i = \frac{TM_i}{P_{i,1}}$$
Accuracy $A_i$ is defined as:
$$A_i = \frac{TM_i}{P_{i,2}}$$
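As a worked illustration of these definitions, the following Python sketch computes completeness and accuracy for one apk. Representing a mapping as a (permission, API) pair and the sample values are assumptions for illustration; they are not results from our evaluation.

```python
def evaluate_mapping(reference, extracted):
    """Return (completeness, accuracy) for one apk.

    reference: set of (permission, api) pairs obtained by static analysis.
    extracted: set of (permission, api) pairs reported by the tool.
    """
    true_mappings = extracted & reference  # correct mappings (TM)
    # Completeness: correct mappings over all mappings that exist.
    completeness = len(true_mappings) / len(reference) if reference else 0.0
    # Accuracy: correct mappings over all mappings the tool reported (TM + FM).
    accuracy = len(true_mappings) / len(extracted) if extracted else 0.0
    return completeness, accuracy


# Hypothetical example: 4 reference mappings, 3 extracted, 2 of them correct.
reference = {
    ("CAMERA", "Camera.open"),
    ("SEND_SMS", "SmsManager.sendTextMessage"),
    ("INTERNET", "URL.openConnection"),
    ("READ_CONTACTS", "ContentResolver.query"),
}
extracted = {
    ("CAMERA", "Camera.open"),
    ("INTERNET", "URL.openConnection"),
    ("INTERNET", "Camera.open"),  # incorrect mapping (FM)
}
completeness, accuracy = evaluate_mapping(reference, extracted)
```

In this example the completeness is 2/4 = 0.5 and the accuracy is 2/3. The two measures correspond to recall and precision in standard terminology.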
We show, in Table 4, the results of our evaluation of a set of apk mappings.
Mapping evaluation of a set of apk
With the rise of deep learning, the number of malware detection methods based on deep learning keeps growing. However, given the large number of apk in malware datasets, it becomes increasingly important to perform apk analysis, feature extraction, and feature encoding and fusion quickly and efficiently, and finally to feed the results into suitable neural network models, so that the various processes of malware detection are connected coherently. AmandaSystem offers a solution, especially for deep learning researchers without a reverse engineering or security background who want to focus on innovation in a single process without handling the other, more complicated parts of the experiment. AmandaSystem simplifies the operational process of malware detection with a lightweight parsing module and a multi-threaded mechanism, which effectively uses CPU resources to process a large number of apk concurrently and produces output in a form well suited to feature extraction. In addition, taking advantage of the systematic design of AmandaSystem, we propose a new static analysis method for building mapping relationships between sensitive permissions and API calls and implement it as PerApTool.
Finally, in future work, we plan to extend the architecture of AmandaSystem so that part of the current feature extraction work gradually stops depending on DroidBox and FlowDroid, further reducing the time cost.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (grant number 62166041).
