Scramblesuit: An effective timing side-channels framework for malware sandbox evasion

Abstract

Online malware scanners are one of the best weapons in the arsenal of cybersecurity companies and researchers. A fundamental part of such systems is the sandbox that provides an instrumented and isolated environment (virtualized or emulated) for any user to upload and run unknown artifacts and identify potentially malicious behaviors. The provided API and the wealth of information in the reports produced by these services have also helped attackers test the efficacy of numerous techniques to make malware hard to detect.

The most common technique used by malware for evading the analysis system is to monitor the execution environment, detect the presence of any debugging artifacts, and hide its malicious behavior if needed. This is usually achieved by looking for signals suggesting that the execution environment does not belong to a native machine, such as specific memory patterns or behavioral traits of certain CPU instructions.

In this paper, we show how an attacker can evade detection on such analysis services by incorporating a Proof-of-Work (PoW) algorithm into a malware sample. Specifically, we leverage the asymptotic behavior of the computational cost of PoW algorithms when they run on some classes of hardware platforms to effectively detect a non-bare-metal environment of the malware sandbox analyzer. To prove the validity of this intuition, we design and implement Scramblesuit, a framework to automatically (i) implement sandbox detection strategies, and (ii) embed a test evasion program into an arbitrary malware sample. We perform a comprehensive evaluation of Scramblesuit across a wide range of: 1) COTS architectures (ARM, Apple M1, i9, i7 and Xeon), 2) malware families, and 3) online sandboxes (JoeSandbox, Sysinternals, C2AE, Zenbox, Dr.Web VX Cube, Tencent HABO, YOMI Hunter). Our empirical evaluation shows that a PoW-based evasion technique is hard to fingerprint, and reduces existing malware detection rate by a factor of 10. The only plausible counter-measure to Scramblesuit is to rely on bare-metal online malware scanners, which is unrealistic given they currently handle millions of daily submissions.

Keywords

Malware, malware analysis, sandbox evasion, PoW

1. Introduction

This paper is an extended version of the work published at ESORICS 2021 [67]. Malware attacks have a significant financial cost, estimated at around 1.5 trillion dollars annually (or 2.9 million dollars per minute) [45], with predictions hinting that this cost would reach 6 trillion dollars by 2021 [28]. Due to the sheer number of known malware samples [25,101], manual analysis neither scales nor allows building any comprehensive threat intelligence around the detected cases (e.g., malware clustering by specific behavior, family or infection campaign). To address this problem, security researchers have introduced sandboxes [13]: isolated environments that automate the dynamic execution of malware and monitor its behavior under different scenarios. Sandboxes usually comprise a set of virtualized or emulated machines, instrumented to gather fundamental information about the malware execution, such as system calls, registry keys accessed or modified, new files created, and memory patterns.

As a next step, online services appeared to bring malware analysis from security experts to common users [75]. Online malware scanners are not only useful for the users but also for the attackers. By allowing an artifact to be checked multiple times against various state-of-the-art malware analysis sandboxes, attackers can tune the evasiveness of their malware samples by exploiting the feedback reported by these services and try various techniques before making the sample capable of detecting the presence of a sandbox. Specific CPU instructions, registry keys, memory patterns, and red pills [62,76,79] are only a few of the signals used by attackers for identifying glitches of the emulated environment that can disclose the presence of a sandbox environment. These techniques have triggered an arms race, with the more sophisticated web malware scanners rushing to spoof any such exploitable signals [47].

In this work, we show how an attacker can evade malware analysis in these scanning services by leveraging Proof-of-Work (PoW) [34] algorithms. The key intuition of our technique lies in the fact that, like NP-class problems [106], the asymptotic behavior of a PoW algorithm is constant in terms of computational power [34]. This means that CPU and memory consumption remain stable over time. Therefore, PoW algorithms are perfect candidates for benchmarking the computation capability of the underlying hardware. Such a benchmark can be used as a fingerprint of the underlying computing infrastructure, in particular to reveal the presence of a sandbox since its fingerprint deviates statistically from the one observed in a native hardware platform.

A key advantage of using PoW techniques is that they are a time-proof and self-contained mechanism compared to other more fine-grained timing side-channel approaches that try to detect the underlying hardware machine. In fact, our system does not require access to precise timing resources for detecting the emulated environment (e.g., network or fine-grained timers). In our evaluation we empirically validate that a PoW-based technique can detect an emulated environment with high precision just by looking at the output of the algorithm (i.e., execution time, and number of successful iterations). Furthermore, PoW implementations do not raise any suspicion to automated malware sandboxes compared with the stalling code (e.g., infinite loops and/or sleep) that is easier to detect because of CPU idleness [57]. An additional advantage of this approach is that current defensive techniques that aim at spoofing the virtualization signals present in contemporary sandboxes cannot act as countermeasures against the stable timing side-channels that our technique exploits. Fingerprinting PoW algorithms as a malware component is feasible, e.g., by checking the usage of particular cryptographic instructions. However, using it as a proxy signal for detecting malware would produce a large number of false positives since PoW algorithms are part of legitimate applications such as Filecoin [53] and Hashcash [10].

Contributions. In this paper, we make the following contributions:

We design and implement Scramblesuit, a framework to automatically create, inject, and evaluate PoW-based evasion strategies in arbitrary programs. Scramblesuit operates as a three-step pipeline. First (step 1) multiple PoW algorithms are thoroughly tested across different hardware platforms, operating systems (Linux Ubuntu 22.04, Windows 10, and MacOS Big Sur), and machine loads. The outcome of these tests (step 2) is used to build a statistical characterization of each PoW’s execution time under each setting. We use the Bienaymé–Chebyshev inequality [2] to obtain statistical evidence about the expected execution time. Next, an adversary can upload its malware to the Scramblesuit framework and select the evasion mechanism to be used. Finally (step 3), Scramblesuit evaluates the accuracy of the evasion mechanism based on our proposed technique and tested on multiple online sandbox services such as: JoeSandbox, Sysinternals, C2AE, Zenbox, Dr.Web VX Cube, Tencent HABO, YOMI Hunter [75].

We empirically evaluate each step of Scramblesuit’s pipeline. For the PoW threshold estimation, we have tested three popular PoW algorithms (Catena [103], Argon2 [15,16] and Yescrypt [43]) using multiple configurations. During 24 hours of testing, we find Chebyshev inequality values higher than 97% regardless of the PoW and setting used. This result confirms high determinism in the PoW execution times on real hardware, thus validating the main intuition behind this work. We test our technique when applied to six known malware families by submitting to twelve sandboxes several variants that include PoW-based evasion. The results demonstrate how PoW-based evasion reduces the number of detections, even in presence of anti-analysis techniques such as code virtualization or packing.

To further quantify the efficacy of PoW-based evasion with real-world sandboxes, we wrote a fully functional malware sample for several operating systems (Linux, Windows, MacOS) which we integrated with an evasion mechanism based on Argon2, and submitted it to fourteen online sandboxes. Reports from each sandbox mark our malware as clean. We have tested six different malware families and recycled them as brand-new samples, beyond creating our own new variant. We further discuss the behavioral analysis for our malware, as well as potential countermeasures to this novel PoW-based evasion mechanism we have proposed. To ensure the reproducibility of our results and foster further research on this topic, we make the source code of Scramblesuit publicly available [69].²

² https://github.com/artifactrepo/Esorics2021_Paper159

We evaluate Scramblesuit extensively across the following computing architectures: Raspberry Pi 3, Dual Intel Xeon, Intel i9, Intel i7, and the recent Apple M1. These computing architectures cover the whole spectrum used by malware analysis sandboxes. It follows that our statistical results are an upper bound for evading the state of the art of the virtualized hardware system used as malware analysis sandbox.

2. Background

Fig. 1.

WannaCry ransomware variant dynamic execution graph (VirusTotal).

Malware Analysis. Researchers and professionals have evolved their tools and skills in response to the evolution of malicious software. There is a substantial amount of literature devoted to analyzing and countering malware [21,39,51,52,65,71,102,107]. Every aspect of the phenomenon has been taken into consideration, from its network infrastructure, to the code that gets reused among samples, unexplored paths in the control-flow, sandbox design and instrumentation. Nonetheless, the arms race continues: as new analysis evasion techniques are found, new countermeasures are developed. Figure 1 gives an idea of how sandbox environments can be useful to study malware samples and their behaviour. In this example, we see a WannaCry variant, a popular ransomware, which contacts different servers between Europe and the US to perform its operations. The figure gives an idea of the scale of the malware phenomenon and the amount of revenue that can be gained using infected machines.

Anti-Analysis Techniques. Several anti-analysis techniques have been developed during the years by miscreants: packers [14,60,98], emulators [93], anti-debugging and anti-disassembly tricks and stalling code. Most techniques have been promptly countered by our community, with the exception of stalling code. This anti-analysis technique is very difficult to detect and poses a problem for commercial sandboxes [72].

2.1. PoW as part of Scramblesuit

Proof-of-Work (PoW) [34] is a cryptographic technique used to guarantee that a party (the prover) has spent a certain amount of computational effort. A key feature of PoW algorithms is their asymmetry: the work imposed on the client is moderately hard but it is easy for a server to check the computed result. There are two types of PoW protocols: (a) challenge-response protocols, which require an interactive link between the server and the client, and (b) solution-verification protocols, which allow the client to solve a self-imposed problem and send the solution to the server to verify the validity of the problem and its solution. Such PoW protocols (also known as CPU cost functions) leverage algorithms like hashcash with double iterated SHA256 [56], momentum birthday collision [55], cuckoo cycle [96], and more. These algorithms may be:

Memory-bound, where computation speed is bound by main memory accesses. The performance of such algorithms is expected to be less sensitive to hardware evolution.

CPU-bound, where computation speed is bound by the system’s processor, which greatly varies in time, as well as between high-end and low-end devices.

Network-bound, where a client must collect tokens from remote nodes before querying the service provider. In this sense, the work is not performed by the client, but they incur delays because of the latency to get the required tokens.
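The asymmetry described above (moderately hard to solve, cheap to verify) can be illustrated with a small hashcash-style solution-verification puzzle using double iterated SHA-256. This is only a minimal sketch: the function names, the 8-byte nonce encoding, and the difficulty value are assumptions chosen for illustration and are not taken from Scramblesuit or from any specific PoW protocol.

```python
import hashlib


def leading_zero_bits(digest: bytes) -> int:
    """Count the number of leading zero bits in a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def puzzle_digest(challenge: bytes, nonce: int) -> bytes:
    """Double iterated SHA-256 over challenge || nonce (nonce encoding is an assumption)."""
    inner = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return hashlib.sha256(inner).digest()


def solve(challenge: bytes, difficulty: int) -> int:
    """Prover side: brute-force a nonce; expected cost grows as 2**difficulty hashes."""
    nonce = 0
    while leading_zero_bits(puzzle_digest(challenge, nonce)) < difficulty:
        nonce += 1
    return nonce


def verify(challenge: bytes, nonce: int, difficulty: int) -> bool:
    """Verifier side: a single double-hash evaluation is enough."""
    return leading_zero_bits(puzzle_digest(challenge, nonce)) >= difficulty
```

Solving with `difficulty=12` takes on the order of a few thousand hash evaluations on average, whereas `verify` always costs exactly one double hash, which is the asymmetry the text refers to.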

PoW has gained much popularity in recent years by becoming the founding block of the blockchain technology. PoW is the (trustless) consensus algorithm in a blockchain network that is used to confirm transactions and append new blocks to the chain. In particular, performing PoW – which in this context is known as mining – is required as a way to force miners to compete against each other in solving complex computational puzzles. Whenever a miner solves the PoW computational puzzle of a new block, it broadcasts that block to the network. All other miners can then easily verify that the solution is correct and the block is confirmed. The difficulty of each PoW puzzle is periodically adjusted to keep the block construction time around a target time [34]. While designing Scramblesuit, we considered three PoW implementations: (1) Catena [103], (2) Argon2 [15,16] and (3) Yescrypt [43]. These algorithms were among the finalists of the Password Hashing Competition (PHC) [9,36,104], which led to the design of significant improvements in the PoW implementation, namely stability in memory footprint, duration, and CPU usage.

2.1.1. The choice of Argon2

Argon2 (the PHC winner) is the PoW strategy used in Scramblesuit. Argon2 guarantees that, by using the same input parameters, the amount of computation performed is asymptotically constant; hence, the variance of Argon2’s execution time T is very small on the same platform. Argon2 is also based on a memory-hard function which, even in the case of parallel or specialized execution (e.g., ASICs or FPGAs), will not enhance scalability, and hence remains computationally bounded due to its asymptotic behavior. The Argon2 algorithm takes the following inputs:

A message string P, which is a password for password hashing applications. Its length must be within 32-bit size.

A nonce S, which is used as salt for password hashing applications. Its length must be within 32-bit size.

A degree of parallelism p that determines how many independent (but synchronized) threads can be run. Its value should be within 24-bit size (minimum is 1).

A tag with length within 2 and 32-bit

A memory size m, which is a number expressed in Kibibytes.³

³ A kibibyte (KiB) is equal to 1024, or $2^{10}$, bytes. The term distinguishes this binary unit from the kilo prefix of the metric system, which denotes a factor of $10^{3}$ (1000). Strictly speaking, a kilobyte should therefore be 1000 bytes, but in computer science it has commonly been interpreted as 1024 bytes.

A number of internal iterations t, which is used to tune the running time independently of the memory size. Its value should be within 32-bit size (minimum is 1).

These input parameters will be used in our framework to define the computational boundary of the algorithm execution on a specific class of hardware machines. Once the parameters are set, the output of the PoW algorithm only depends on the hardware platform. Argon2 has two implementations: Argon2d and Argon2i. Argon2d maximizes resistance to GPU-based password cracking attacks [11] by accessing the memory array in a data-dependent order, which also reduces the possibility of time–memory trade-off (TMTO) attacks [41]. However, this comes with the associated cost of being vulnerable to side-channel attacks [104]. Conversely, Argon2i does not have data-dependency (i.e., the next iteration of the algorithm does not depend on the previous one) and thus it is resilient against such attacks. As a consequence, Argon2i became popular as a PoW in cryptocurrencies and back-end servers as it makes it difficult to observe memory patterns. Argon2i is our choice for Scramblesuit evasion for its security features (e.g., memory protection against side channels), as well as the results shown in Section 4. Below, we summarize the main features of the Argon2i algorithm:

Performance: the selected memory area is filled very fast, this plays against adversaries with specialized ASIC.

Tradeoff Resilience: with default number of memory passes an ASIC equipped adversary cannot decrease the time-area product even if the memory size used is reduced by a factor of 4 or more. This has been proved not to work with Catena and Lyra algorithms.

GPU-FPGA-ASIC Unfriendly: implementing a dedicated cracking hardware will not improve the performance (timewise) of the algorithm.

2.2. Timing side channels, stalling code and Scramblesuit’s approach

Various techniques have been proposed to detect if applications are running inside a sandbox, emulated or virtualized environment. The most reliable and stealthy ones are based on timing measurements [49]. Such an approach bases its success on the use of fine-grained timing instructions provided by the operating system. Recently, access to these instructions has been limited by the operating system to protect against very dangerous micro-architectural attacks such as Spectre and Meltdown [48,59]. Consequently, fine-grained timing measurements are no longer a viable option for sandbox detection.

Unlike the classic timing techniques, our mechanism is based on PoW algorithms and offers strong cryptographic premises with a very stable complexity growth, which makes an evasion technique resilient to countermeasures such as using more powerful bare-metal machines to enhance performance or reducing the granularity of time measurements. By exploiting the asymptotic behavior of the PoW algorithms, we build a statistical model that can be used to identify the class of environment where the algorithm is running and consequently distinguish between physical and virtualized, emulated or simulated architectures, like different flavors of malware sandboxes. Indeed, even fine-grained red pill techniques [76], such as CPU instruction misbehaviour, can be easily fixed in the sandbox or spoofed to thwart evasion techniques. In contrast, PoW relies on well-defined mathematical and computational behaviour. Moreover, a simple modification of the PoW library prevents the malware sample from being fingerprinted with statistical techniques.

Consider, as an example, the PoW used by Bitcoin: by design, the computational complexity of the algorithm increases for each new block of blockchain transactions [66]. This is due to the fact that resources (cryptocoins) of the blockchain ecosystem are limited; as resource availability decreases, their price in terms of computational power grows, hence the PoW computation becomes more difficult at every iteration. Such computation increase shows the asymptotic behavior that can be exploited by our technique. To approve a pool of transactions, an amount of computation is needed to solve the PoW and compute the next block of the chain.

3. Our approach: Scramblesuit

This section describes Scramblesuit in detail. We first introduce our threat model (Section 3.1) and then provide an overview of the technique (Section 3.2) along with its workflow. We then describe how the key parameters are estimated (Sections 3.3 and 3.4) and how an arbitrary sample can be equipped with the evasion module (Section 3.5).

3.1. Threat model

We assume a malware scanning service based on virtualized or emulated sandboxes, which allows users to upload and scan their individual files for free as many times as they need. Such service combines results from various state-of-the-art malware analysis sandboxes before returning to the user a detailed report about the detection outcome of each sandbox scanner used.

We further assume an attacker who developed a program that includes (i) some malicious payload along with (ii) a technique to pause or alter the execution of the malicious program itself, when a possible malware analysis environment is detected. Before distributing the malicious program to the victims, the attacker may use a malware scanning service to assess its evasiveness.

Beyond the fact that an attacker can hide its own artifact, in this paper we demonstrate that it is possible to recycle an existing malware and make it look like benign software. This could enable an attacker to recycle billions of malicious samples and eventually break current detection mechanisms.

Fig. 2.

High level overview of Scramblesuit. Step 1: execution of the PoW on several hardware/OSes using different configuration settings and system load to have a broad coverage of execution power. Step 2: threshold estimation based on execution time of the PoW per configuration/architecture. Step 3: malware integration and submission test.

3.2. System design

As described in Section 2, PoW puzzles have moderately high solving cost and a very small verification time, like problems in the NP complexity class [106]. This implies that their asymptotic behavior is constant in terms of computational cost [34], e.g., CPU and memory consumption. Scramblesuit exploits this asymptotic behavior to build a statistical model that can be used to identify the class of hardware machines where the algorithm is running. Such a model can later be used to distinguish between physical and virtualized architectures, like those used by malware sandboxes. Scramblesuit is a three-step pipeline (see Fig. 2):

Performance Profiling. It executes multiple PoW algorithms on several hardware and operating systems using different configuration settings and system loads.

Model estimation. The previous step provides the system with a measurement of the amount of time needed to execute the PoW on real hardware. By using the Bienaymé–Chebyshev [2] inequality, it then estimates the time (threshold) expected for a particular configuration to run on a given architecture.

Integration. Once the models are built, a malware developer can select a specific PoW and parameters to associate with an arbitrary malware sample. Scramblesuit then generates a module with the chosen PoW, which is integrated with the sample by building a single statically-linked executable.

We leverage a custom Cuckoo Sandbox [40] and popular crowd-sourced malware scanning services (like VirusTotal or similar [75]) to form a testbed allowing to report on the accuracy of the evasiveness of the malware in real-world settings.

3.3. Performance profiling

The first step in Scramblesuit’s pipeline produces a number of PoW executions using different algorithms, parameters, hardware, operating systems, and load settings.

Hardware. Scramblesuit leverages three classes of machines: low, medium, and high-end. The high-end machines are a desktop equipped with an Intel(R) Core(TM) i9-9900X CPU @ 3.50 GHz with 10 physical cores and 20 threads, a PCI-e M2 512 GB disk and 32 GB of RAM, and an iMac 24-inch (dated 2021) with an M1 chip, 16 GB of integrated RAM, and a 1 TB SSD. The medium-end machines are a workstation equipped with a Dual Intel(R) Xeon(R) CPU E5-2643 0 @ 3.30 GHz with 16 physical cores and 64 GB of RAM and an Intel i7-2640M @ 3.50 GHz with 16 GB of RAM. Finally, the low-end device is a Raspberry Pi 3 which comes with a quad core ARMv7 Processor rev 4 (v7l) and 1 GB of RAM.

Systems and loads. With the exception of the Raspberry Pi 3 and the iMac, the other hardware platforms are set up in dual boot, supporting both Linux (Ubuntu 22.04, 64 bits) and Windows 10 (64 bits). Each platform can be further configured in idle and busy mode. The latter is achieved using iperf [33], a CPU-bound network traffic generator, to keep the operating system and the CPU occupied.

PoW and parameters. Scramblesuit supports three popular PoW algorithms: Catena [103], Argon2 [15,16], and Yescrypt [43]. Each PoW algorithm is executed multiple times with different input parameters on each hardware platform, operating system, and load setting. The parameters of each algorithm allow control over the amount of memory, parallelism, and complexity of the PoW. Our selection is based on common configurations of commercial off-the-shelf (COTS) hardware devices, with respect to memory and CPU. However, not all the selected algorithms have these parameters available for tuning and, in some cases, their tuning is more coarse-grained [103].
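The kind of measurement collected in this step can be sketched as a simple timing loop over a memory-hard function with fixed parameters. As a loud assumption, the sketch below swaps in scrypt from the Python standard library (`hashlib.scrypt`) because it needs no external dependency; it is not one of the three PoW algorithms used by Scramblesuit, and the parameter values, input string, and function names are purely illustrative.

```python
import hashlib
import statistics
import time


def time_kdf_runs(runs: int = 5, n: int = 2**12, r: int = 8, p: int = 1) -> list[float]:
    """Time repeated executions of scrypt with fixed cost parameters.

    scrypt stands in for a memory-hard PoW here (assumption); n, r and p
    control memory and CPU cost, loosely analogous to the tunable
    memory/parallelism/complexity parameters discussed in the text.
    """
    durations = []
    for i in range(runs):
        salt = i.to_bytes(16, "big")  # a fresh salt per run, same cost settings
        start = time.perf_counter()
        hashlib.scrypt(b"benchmark-input", salt=salt, n=n, r=r, p=p, dklen=32)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations: list[float]) -> dict[str, float]:
    """Per-configuration statistics of the kind reported in Tables 2 and 3."""
    return {
        "min": min(durations),
        "max": max(durations),
        "mean": statistics.fmean(durations),
        "sigma": statistics.pstdev(durations),
    }
```

Running `summarize(time_kdf_runs())` on the same idle machine several times gives a feel for how stable the execution time of a fixed-cost memory-hard function is; the actual values depend entirely on the hardware and load.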

3.4. Threshold estimation

The second step in Scramblesuit’s pipeline aims at estimating the PoW thresholds for different settings (PoW algorithm, parameters, hardware, operating system, and load). This is achieved through a statistical characterization of the execution time in each setting using the Bienaymé–Chebyshev inequality [2]. This is a well-known result in probability theory stating that, for a large class of distributions, no more than a fraction $\frac{1}{k^{2}}$ of the values of a distribution X can be k or more standard deviations (σ) away from the mean (μ):

$$\Pr(|X - \mu| \geq k\sigma) \leq \frac{1}{k^{2}} \qquad (1)$$
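Read the other way around, the inequality gives a lower bound of $1 - \frac{1}{k^{2}}$ on the fraction of executions that fall within k standard deviations of the mean. The short sketch below evaluates this bound for the K values listed in Table 2; the function name is hypothetical and the coverage reading of the "Chebyshev" column is our interpretation of the table.

```python
def chebyshev_coverage(k: float) -> float:
    """Lower bound on Pr(|X - mu| < k * sigma) given by Chebyshev's inequality."""
    if k <= 0:
        raise ValueError("k must be positive")
    # For k <= 1 the bound is vacuous, so clamp it at zero.
    return max(0.0, 1.0 - 1.0 / (k * k))


# K values from Table 2 (Catena, garlic graph sizes 15, 18 and 20).
for k in (9.99, 7.94, 8.26):
    print(f"K = {k}: coverage >= {chebyshev_coverage(k):.2%}")
```

For K = 9.99 and K = 7.94 the bound agrees with the 99.00% and 98.41% reported in Table 2 after rounding to two decimals.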

Using the empirical distribution of execution time observed in the previous step, this inequality allows us to select a threshold T (i.e., a maximum execution time) which guarantees a high sample population coverage. The previous deduction enables us to determine with high probability the time T it will take for a PoW to run if the underlying platform is not virtualized. To reduce false positives, the evasion rule can be generalized to “the execution environment is virtualized if the PoW does not complete N executions in less than T seconds.”
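One way to make the threshold selection concrete is to pick k from a target coverage and set T = μ + kσ over the measured execution times. This is a sketch under stated assumptions: the paper does not spell out the exact formula used for T, so the rule below, the function names, and the sample values (synthetic, not measurements from the paper) are all illustrative.

```python
import math
import statistics


def k_for_coverage(coverage: float) -> float:
    """Smallest k such that Chebyshev guarantees at least `coverage`: 1 - 1/k^2 = coverage."""
    if not 0.0 < coverage < 1.0:
        raise ValueError("coverage must be strictly between 0 and 1")
    return 1.0 / math.sqrt(1.0 - coverage)


def estimate_threshold(samples: list[float], coverage: float = 0.98) -> float:
    """Assumed rule: T = mean + k * sigma, using the population standard deviation."""
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)
    return mu + k_for_coverage(coverage) * sigma


def empirical_coverage(samples: list[float], threshold: float) -> float:
    """Fraction of observed execution times that do not exceed the threshold."""
    return sum(s <= threshold for s in samples) / len(samples)


# Synthetic execution times in seconds (illustrative only).
samples = [1.00, 1.10, 0.90, 1.05, 0.95, 1.20, 0.80, 1.00]
T = estimate_threshold(samples, coverage=0.98)
print(f"T = {T:.3f} s, empirical coverage = {empirical_coverage(samples, T):.2%}")
```

Because Chebyshev's inequality is distribution-free, the resulting T is conservative: for well-behaved timing distributions the observed coverage is typically higher than the guaranteed bound.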

3.5. Malware integration and testing

The final step in Scramblesuit’s pipeline is PoW integration with a malware sample provided as input. At this step, the attacker can upload its sample to Scramblesuit and select the PoW-based evasion mechanism to be used, along with its parameters. Scramblesuit further informs the attacker about the predicted accuracy of this selection.

Scramblesuit integrates the uploaded malware with the selected PoW and the Boost C++ libraries [95], which ease the OS interaction, to build a single statically-linked executable. The compilation stage is automated with an Ansible [8] playbook and clang [26]. The integration is achieved at the linking stage, so the malware will have a stub call to an external symbol that will be linked with the chosen PoW. Scramblesuit’s pipeline then starts the Ansible scripts, which run some tests and launch the compilation of the final binary for multiple platforms automatically.

Testing. To evaluate the accuracy of the newly generated evasion mechanism, we rely both on a local sandbox – a custom Cuckoo Sandbox [40] equipped with Windows 10 (64 bits), which is the most targeted OS for malware campaigns [105] – and several on-line free-of-charge sandbox services [75]. Once this step is completed, Scramblesuit offers to the user access to the set of reports generated by each sandbox.

Static Evasion vs. Sandbox Evasion. In our experiments, we use a paid VirusTotal subscription. This subscription grants us access to 8 more sandboxes. Since our goal is to trigger the dynamic analysis of the sandboxes themselves, we implement a naïve packing technique that is able to pass the static checks and run our sample inside the sandbox. Using this technique, we managed to run our PoW mechanism and successfully test our evasion technique.

4. Evaluation

In this section, we evaluate Scramblesuit’s pipeline. We first analyze the combination of PoWs and their parameters currently supported by Scramblesuit. The outcome of this evaluation is the pair of parameters N (the number of executions completed in less than T seconds) and T (the maximum execution time) to be associated with the malware sample. We then discuss the accuracy of our evasion mechanism using various case studies across three public malware scanning services: ([4–6]), [7,20], along with our own Cuckoo Sandbox instance.

Table 1

Number of consecutive PoW executions per hardware and OS combination over 24 hours. For a given platform, the first line refers to results obtained with the idle setting, while the second line refers to the busy setting.

Platform	Status	Win 10	Ubuntu 22.04	MacOS Big Sur
Intel i9	idle	4,500	9,325	–
Intel i9	busy	3,642	8,867	–
Apple iMac M1	idle	–	–	9,964
Apple iMac M1	busy	–	–	7,890
Intel i7	idle	2,365	6,689	–
Intel i7	busy	1,322	4,356	–
Dual Intel Xeon	idle	6,005	7,897	–
Dual Intel Xeon	busy	4,320	7,012	–
Raspberry Pi 3	idle	–	300	–
Raspberry Pi 3	busy	–	143	–

Table 2

Statistical measurement results for Catena

Garlic graph size	Min	Max	Sigma	Mean	K	Chebyshev
15	0.12	5.35	0.503	0.209	9.99	99.00%
18	1.13	35.61	4.22	1.86	7.94	98.41%
20	5.11	165.57	19.01	8.26	8.26	98.54%

4.1. Threshold estimation and PoW algorithm choice

For each PoW, we have selected different configurations with respect to memory footprint, parallelism, and algorithm internal iterations (see Table 2 for Catena and Table 3 for Argon2i and Yescrypt). Argon2i and Yescrypt have similar parameters (memory, number of threads, blocks), whereas Catena’s only parameter is a graph size: as it increases, memory usage grows and the computation becomes harder.

Scramblesuit executes each PoW configuration on the low-end (Raspberry Pi 3), medium-end (Dual Intel Xeon, Intel i7), and high-end (Intel i9, and Apple iMac 24-inch with M1 chip) machines. All PoW configurations are executed sequentially during 24 hours on each machine for both idle and busy conditions. As pointed out in Section 3, with the exception of the Raspberry Pi 3 and the M1, all tests are performed on two operating systems per hardware platform: Linux (Ubuntu 22.04.3, 64 bits) and Windows 10 (64 bits).

Table 1 shows the total number of PoW executions completed over 24 hours per hardware platform, operating system, and CPU load (idle or busy). Regardless of the CPU load on each machine, we observe two key insights. First, there is a significant drop in the number of PoW executions on Windows compared with Linux, which is close to a 50% reduction on the high-end machine (4,500 versus 9,325 executions on the idle Intel i9). This is due to operating system interaction, ABI and binary format, and ultimately idle cycle management. Second, there is a roughly 30x reduction in the number of PoW executions when comparing high-end and low-end platforms, e.g., under no additional load the Raspberry Pi 3 completes 300 executions versus an average of 8,611 executions on both the high and medium-end machines. Finally, extra load on the medium and high-end machines causes a reduction in the number of proofs computation of around 6-10%, averaging out to 7,300 executions between the two machines. A more dramatic 50% reduction was instead measured for the Raspberry Pi 3.

Next, we statistically investigate PoW execution times by means of the Bienaymé–Chebyshev inequality (see Section 3.4). To balance equally sized datasets, we sampled 150 random executions (i.e., the total number of executions that were possible to complete on the low-end platform) from the 9,325 executions available from both the medium and high-end platforms. Tables 2 and 3 show, for each PoW and configuration, several statistics (min, max, σ, mean, K, and Chebyshev coverage) of the PoW execution time computed across hardware platforms, OSes (when available), and load conditions (idle, busy). Overall, we measured Chebyshev inequality values higher than 97% regardless of the PoW and its configuration. This confirms high determinism in the PoW execution times on real hardware, validating the main intuition behind this work.

Algorithm choice. The results above provide the basis to select a PoW algorithm, along with its parameters, to integrate with the input malware sample. These results indicate that PoW selection has minimal impact on the expected accuracy of the proposed evasion mechanism. We then selected Argon2i (with 8 threads, 100 internal functions and 4KiB of memory), motivated by its robustness and maturity. We leverage the results from Table 3 (top, second line) to set the parameters N (number of PoW executions) and T (evasion threshold) of an Argon2i-based evasion mechanism. The table shows that $K = 8.1$ seconds gives good coverage of the execution time population (98.3%). We opted for a more conservative value of $T = 10$ and further performed multiple tests on our internal Cuckoo Sandbox. Given that our Cuckoo Sandbox could not even execute 1 PoW with $T = 10$, we simply set $N > 1$. We will use this configuration for the experimentation described in the remainder of this paper.

4.2. Case study: Known malware

We first analyze the effect of adding our PoW-based evasion strategy to the code of six well-known malware families: Relec, Forbidden Tear, Zeus, Jigsaw, Pony and AveMariaRAT (i.e., botnets, RATs, and ransomware). The use of real-world malware, which is well known and thus easy to detect, allows us to comment on the impact that PoW-based evasion has on malware reuse, the practice of recycling old malware for new attacks. We use Scramblesuit to generate various combinations of each original malware sample with/without the PoW-based evasion strategy, code virtualization,⁴ and packing offered by Themida, a well-known commercial packer [74]. We verify that all the malicious operations of the original malware were preserved across the generated versions.

⁴ This cannot be applied to Forbidden Tear since it is written in .NET.

Table 3

Statistical measurement results for Argon2i (top) and yescrypt (bottom). Thr. = number of threads. It. = number of algorithm steps. Mem. = amount of memory used in KiB. Cheb. = Chebyshev coverage

Thr.	It.	Mem.	Min	Max	Sigma	Mean	K	Cheb.
1	10	1 KB	0.01	0.70	0.09	0.02	7.9	98.4%
8	100	4 KB	0.20	9.28	1.07	0.46	8.1	98.3%
16	500	8 KB	2.03	88.8	10.5	3.85	7.9	98.4%
1	1K	8 KB	0.00	0.02	0.00	0.01	6.1	97.3%
8	2K	32 KB	0.03	0.56	0.05	0.05	10.5	99.1%
16	4K	64 KB	0.08	5.00	0.51	0.19	9.4	98.9%

Table 4

Online AntiVirus and sandbox detection results for six malware families samples and a benign test program using various anti-analysis configurations. a

Test	Relec	Jigsaw	Zeus	Pony	AveMariaRAT	Forbidden Tear	Hello World
Original	70%	85%	95%	96%	57%	36%	0.4%
Original + Code Virtualizer	30%	48%	45%	40%	43%	n/a b	26%
Original + PoW	0.1%	0.1%	0.1%	0.1%	0.1%	0.1%	0.1%

a Results are reported as percentages because the number of AV engines used by VirusTotal changes between scans; each percentage refers to the engines involved in that specific scan.

b .NET binaries cannot be virtualized.

We submitted all malware variants to online sandboxes for analysis and checked how many AV engines (antivirus products) and sandboxes [75] flag each variant as malicious (see Table 4). Embedding the PoW technique inside the malware decreases the detection rate across roughly 70 AVs by a factor of 10 [70, 89–92], reaching a level where the difference between a malicious label and a false positive is negligible. It is worth noting that the overall detection rate is computed by considering the results provided by all sandboxes run inside the VirusTotal framework plus the AV engines. To achieve a low detection score with no sandbox flagging the samples as malicious, we applied some simple static code transformations similar to packing. It is fundamental to understand that while these trivial static code transformations influence static AV detection, they never influence sandbox detection, which was always thwarted exclusively by the PoW.

Table 4 also shows results when submitting several variants of a standard Hello World program. Note that the original code has been flagged as malicious by 3 antivirus engines, though, as can be seen from the report, the detections are mislabeled, i.e., Relec is not recognized. This false positive could be due to a large number of submissions of the same code hash (due to its simplicity and popularity), our source IP being flagged, or other unknown factors that may influence the scoring. The table also shows that adding code virtualization or packing translates into a substantial increase in false positive detections even for a simple Hello World program, confirming our intuition above. Instead, adding our PoW-based evasion strategy results in fewer false positives, one fewer than the original code. This is likely because our code on top of Hello World has more entropy than a very simple one-line program, which makes it look like a realistic non-malicious program to engines that measure such parameters.

Overall, these three case studies show that a PoW-based evasion strategy reduces the number of detections of known malware by one order of magnitude by preventing the sample from executing in the analysis sandbox. This result demonstrates a large potential for malware reuse when coupled with a PoW-based evasion strategy. In the next section, we perform more controlled experiments based on fresh (i.e., previously unseen) malware.

4.3. Case study: Fresh malware sample

In order to further explore the results obtained in the previous case studies, we wrote a simple malware PoC (roughly 150 LoC) for Windows 10 (VC++), Linux (C++), and Apple macOS (for M1). Our malware sample implements basic ransomware functionality that scans the entire hard drive and encrypts all its files. This behavior should be easy to detect by any malware scanning service.⁵

⁵ The malware detection report for this malware without our PoW-based evasive measure has been anonymized [68,81].

Using Scramblesuit, we automatically embed a PoW (Argon2) and make sure the sample exhibits its malicious activity only if the PoW is successfully executed at least N times before a timeout T.

We submitted different variants of this malware sample (with PoW, without PoW, with static sanitization) to several online sandboxes and the results were disheartening (see Table 5). For the static sanitization we remove the symbol tables and debugging symbols. Note that very similar results were also achieved with our local Cuckoo Sandbox. To check whether the malware payload executes, we insert a create-file function at the beginning of the payload itself. This file creation is visible in the behavioral report of the analyzed sandboxes whenever the malware payload is executed.⁶

⁶ This reference has been anonymized not to violate the terms of service of sandbox vendors [81].

We used this simple test to check whether the PoW algorithm detects the emulated environment and thus validates our technique. If the file is not present in the behavioral report, the PoW algorithm detected the emulated environment and stopped the payload execution. None of the analyzed sandboxes is able to execute more than 1 PoW during $T = 10$ (or even $T = 20$ sec), which is worse than what a Raspberry Pi 3 can do even in the presence of some extra load (e.g., see max value in the top of Table 1).

Table 5

Execution results of a custom ransomware sample on several sandboxes a

Sandbox	Evasion timeout	PoW timeout	# of PoW executed	Timeout	Notes
Sandbox1	10 secs	50	1	120	Clean
Sandbox1	15 secs	45	1	180	Clean
Sandbox1	20 secs	40	1	240	Clean
Sandbox1	20 secs	15	1	500	Clean
Sandbox2	20 secs	15	0	N/A	Clean
Sandbox3	20 secs	45	N/A	N/A	Clean
Sandbox4	N/A	N/A	N/A	120	Clean
Sandbox5	N/A	N/A	N/A	120	Clean
Sandbox6	N/A	N/A	N/A	120	Clean
Sandbox7	N/A	N/A	N/A	120	Clean
Sandbox8	N/A	N/A	N/A	120	Clean
Sandbox9	N/A	N/A	N/A	120	Clean
Sandbox10	N/A	N/A	N/A	120	Clean
Sandbox11	N/A	N/A	N/A	120	Clean
Sandbox12	N/A	N/A	N/A	120	Clean

a The nine additional malware sandboxes (4-12) provided by VirusTotal come with preconfigured settings; our submissions were never detected.

We made all the reports of our analysis publicly available, including screenshots of evasive malware samples.⁷

⁷ The references have been anonymized not to violate the terms of service of sandbox vendors [82–88].

It has to be noted that not all sandbox reports are the same, but they all flag the hard drive scan (ransomware behavior) when the sample lacks full static protection (i.e., is built with the default compiler options). In Table 5, the number of PoW executed is reported only if a screenshot of the sandbox is available. As for the sandbox execution timeout, not all the analysis services made it available for selection.

Table 6

Malware execution results on hardware (busy with CPU-z on Windows and SOHO programs on other platforms)

Machine	Timeout	μ of # PoW Win32	μ of # PoW Win64	μ of # PoW Linux32	μ of # PoW Linux64	# PoW MacOS Big Sur
Intel i9	10 sec	17.5	28.3	33.2	48.2	–
Apple M1	10 sec	–	–	–	–	43
Intel i7	10 sec	11.2	17.2	23.3	38.1	–
Dual Xeon	10 sec	10.1	15.3	20.2	44.2	–

Table 7

Malware execution results on hardware (idle)

Machine	Timeout	μ of # PoW Win32	μ of # PoW Win64	# PoW MacOS Big Sur
Intel i9	10 sec	15.5	23.5	–
Apple M1	10 sec	–	–	25
Intel i7	10 sec	9.2	15.1	–
Dual Xeon	10 sec	12.2	13.4	–

Multi Architecture Sandbox Resistant Malware. We have analyzed a variety of COTS architectures, including the recent M1 from Apple Inc. We find that the performance of the M1 chip is indeed within the bounds of our statistical analysis, making our threshold estimation accurate and resilient to future changes. Tables 6 and 7 show that all the platforms stay within 50 executions of the PoW during 10 seconds. Our results show the effectiveness and future-proof nature of Scramblesuit’s approach based on PoW algorithms. For this reason, we believe that Scramblesuit’s approach will remain effective against new platforms and architectures. These results are an upper bound from the computer architecture point of view since they are obtained on a wide range of devices that differ in year of manufacture, target end-user, RISC or CISC architecture, and performance.
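The bound stated above can be checked directly against the published measurements. The following sketch copies the mean execution counts from Tables 6 and 7 and verifies that none exceeds 50 within the 10-second timeout; the variable and function names are illustrative only.

```python
# Mean number of PoW executions within the 10 s timeout, copied from Tables 6 and 7.
TABLE_6 = {
    "Intel i9": [17.5, 28.3, 33.2, 48.2],
    "Apple M1": [43],
    "Intel i7": [11.2, 17.2, 23.3, 38.1],
    "Dual Xeon": [10.1, 15.3, 20.2, 44.2],
}
TABLE_7 = {
    "Intel i9": [15.5, 23.5],
    "Apple M1": [25],
    "Intel i7": [9.2, 15.1],
    "Dual Xeon": [12.2, 13.4],
}
UPPER_BOUND = 50  # bound on executions per 10 s stated in the text


def all_values(table):
    """Flatten the per-machine lists of a table into a single list."""
    return [v for values in table.values() for v in values]


measurements = all_values(TABLE_6) + all_values(TABLE_7)
peak, lowest = max(measurements), min(measurements)
within_bound = peak <= UPPER_BOUND
print(f"peak = {peak}, lowest = {lowest}, within bound = {within_bound}")
```

The lowest mean on real hardware in these tables is 9.2 executions, whereas Table 5 reports at most one execution for the sandboxes.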

5. Security analysis

The results shown in the previous section demonstrate that a Scramblesuit-ed malware can effectively detect a sandbox and abort the execution of any malicious payload. This strategy is effective in getting a malware sample marked as “clean” by all sandboxes tested by Scramblesuit (see Table 5). Scramblesuit’s technique is simple to deploy, it does not require precise timing measurements and, thanks to its algorithmic properties, it will last for many years as a potential threat.

We next discuss in detail the behavioral analysis of our malware. This is an analysis produced by a sandbox that describes how a malware sample interacts with the file system, network, and memory. If any of the monitored operations matches a known pattern, the sandbox can raise an alarm.

Fig. 3.

Behavioral map of the malware PoC without PoW and without full static protection enabled.

Figures 3, 4, and 5 show the behavioral analysis of our malware on a radar plot, labeled with most prevalent AV labels. The samples were submitted with different combinations of PoW and static protection. In Fig. 3, the radar plot is mostly “green” (benign) with respect to some operations like phishing, banker and adware for which we would not expect otherwise. However, four “suspicious” (orange) behaviors are reported with respect to evader, spyware, ransomware, Trojan operations. While our malware PoC is not labeled as “malicious” (red), the suspicious flags for our binary would trigger further manual analysis that could reveal its maliciousness. It is thus paramount to investigate and mitigate such suspicious flags.

Fig. 4.

Behavioral map of the malware PoC without PoW and with full static protection enabled.

Fig. 5.

Behavioral map of the malware PoC with PoW and with full static protection enabled.

Our intuition is that the suspicious flags are due to the fact that our malware is neither packed nor stripped, and hence some of its functionality i.e., exported functions, linked libraries, and function names are visible through basic static analysis that is usually also implemented in the dynamic sandbox environment. Accordingly, we strip out the whole static information from our binary and resubmit it as a new binary. Figure 4 shows the behavioral analysis of our PoC malware without PoW-based sandbox detection but with full static protection enabled. As expected, various signals have dropped from the behavioral report. Finally, Fig. 5 shows the result of adding PoW to the last binary. A completely green radar plot which does not raise any suspicion illustrates the evasion effect of Scramblesuit.

Fig. 6.

CPU consumption of our malware PoC (Argon2d) Malware:red line, System Idle (PID 0):green line.

Fig. 7.

Memory consumption of our malware PoC (Argon2d) Malware:red line, System Idle (PID 0):green line.

Fig. 8.

CPU consumption of our malware PoC. T = 60 seconds and 0.5 seconds between each PoW execution. Malware:red line, System Idle (PID 0):green line.

CPU and memory usage. The main downside of associating a PoW with a malware sample is an increase in both CPU and memory consumption. We here report on CPU and memory consumption as measured by our sandbox. Figures 6 and 7 compare, respectively, CPU and memory utilization of our malware (red line) with System Idle (PID 0). With respect to CPU usage, the PoW associated with our malware causes an (expected) 100% utilization for the whole duration of the PoW ($T = 10$ sec). With respect to memory utilization, our malware only requires about 17 MB versus the 7 MB used by a sample system process such as System Idle (PID 0). This is a minor increase, unlikely to raise any suspicion.

Next, we investigate whether we can reduce the CPU usage of our PoC ransomware by setting a longer T (e.g., 60 sec) and a sleep command of 0.5 sec between each PoW execution. Despite such sleeps, Fig. 8 still shows 100% CPU utilization for the whole T (60 sec in this test). The lack of CPU reduction associated with the extra sleep commands is counter-intuitive. The likely explanation is that the sandbox leverages a coarse CPU monitoring tool and, thus, the CPU reduction associated with our extra sleep commands gets averaged out. These results provide a foundation to detect evasion techniques based on PoW. A sandbox could attempt heuristics based on a binary’s CPU and memory consumption. We argue, however, that this is quite challenging because of the potential high number of false positives that can be generated.
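To make the defensive heuristic mentioned above concrete, the following is a minimal sketch of a detector that flags a process whose CPU utilization stays saturated for a sustained run of samples. The sampling interval, the 95% threshold, and the run length of 10 samples are hypothetical values chosen for illustration; they are not taken from any sandbox product.

```python
from typing import Sequence


def sustained_cpu_saturation(samples: Sequence[float],
                             threshold: float = 95.0,
                             min_run: int = 10) -> bool:
    """Return True if at least `min_run` consecutive samples are >= `threshold` percent."""
    run = 0
    for value in samples:
        # Extend the current run of saturated samples, or reset it.
        run = run + 1 if value >= threshold else 0
        if run >= min_run:
            return True
    return False


pow_like = [100.0] * 12       # saturated for the whole observation window
bursty = [100.0, 20.0] * 6    # alternating load with idle gaps
print(sustained_cpu_saturation(pow_like))  # True
print(sustained_cpu_saturation(bursty))    # False
```

Legitimate compute-heavy software (e.g., compression or video encoding) would trigger the same rule, which illustrates the false-positive problem raised above.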

6. Countermeasures

Evasion techniques are comparable with other anti-analysis techniques such as packing. Packing techniques have evolved to such sophistication that it has become practically impossible to unpack a malware sample without dynamically executing it [98,99]. However, dynamically executing a sample can trigger evasion techniques like stalling code. To counter evasion techniques, and especially the ones that Scramblesuit implements, one idea would be to fingerprint the algorithms, e.g., through their CPU and memory footprint. However, it would be very easy for attackers to apply code polymorphism techniques and produce variants that diverge from the original implementation, as is done with packers. This would be a challenge for the sandbox, which could produce a false negative by failing to spot the algorithm. In Table 4, the Hello World program is detected as malicious; our technique reduces its detection rate and, combined with a code virtualizer, it makes the sample completely stealthy.

Fingerprinting evasion. A common solution against red pills [76] is to reduce the number of instructions that fail due to emulation. As Martignoni et al. [61,62] show, the analysis can be automated and the fixes can be easily produced. However, with PoW the computational model is not looking for emulation/virtualization failures or malfunctions. Instead, the PoW acts as a probe that exposes a side channel, in this case the execution time of the algorithm.

Virtualized instructions set. Native execution of the cryptographic instructions is another potential countermeasure that could be considered to mitigate our approach. In such a case, the cryptographic instructions of the PoW algorithm are not emulated by the sandbox environment, but directly executed on the native CPU. Avoiding the emulation of the cryptographic instructions could clearly improve the computational performance of the PoW algorithm and reduce the success probability of the evasive behavior shown by Scramblesuit. The technique described in the Inspector Gadget paper [51], which works at the program analysis level, may also work to avoid the execution of our evasion code. Once the sample is unpacked, it would be possible to extract and execute only the malware branch of the code as a gadget and analyze its behavior in isolation. However, a sufficiently complex packer or emulator would make such a process very tedious and require manual effort, which makes this solution too complex to implement in an automated malware analysis service.

Specialized hardware. Even if our choice, Argon2, is resilient to specialized circuits for mining (ASICs and FPGAs), other PoW algorithms are not, and hence an analyst could equip their sandbox with a miner [97]. Such dedicated hardware is expensive for a non-professional user (around $3,000 at the time of writing). Nonetheless, if the phenomenon of sandbox evasion due to PoW proliferates, having such a platform would be of great help to offload the PoW calculations, through a tailored interface, and continue the execution of the malware sample inside the sandbox. The cost/benefit trade-off of adopting such a measure really depends on the intended scale of the analysis platform. For example, according to VirusTotal statistics [25], the service receives more than 3M PE binaries weekly. Hence, dedicated hardware to defeat PoW-based evasion techniques seems to be a good compromise, since it allows analysts to analyze and discover new malicious behaviors.

Spoofing timers. A sandbox that receives a Scramblesuit-ed malware could try to slow down the clock, i.e., make our $T = 10$ seconds last much longer so that the payload executes. This approach may work well. However, if we expect a total of at least 50 PoW iterations (see Section 3.4) and the sandbox is not able to execute more than one in about a minute for a single malware sample, the analysis would take more than one hour. This would eventually extract the payload, which would then require extra work to be reverse engineered, understood, and fingerprinted. Hence, this approach may not scale in terms of time/cost for the large number of samples that online sandboxes analyze daily.

Bare-Metal Sandboxes. Using bare-metal hardware is a reasonable solution that might be adopted within corporate environments, but it is not possible to use such technology at Internet scale, i.e., in cloud-based solutions like VirusTotal. Also, isolated sandboxes do not benefit from the information available to online cloud services, which leverage large-scale cross-correlations.

7. Discussion

7.1. Ethical considerations

The results obtained by Scramblesuit on the analyzed publicly available sandboxes, normally used by malware analysts under their terms of service (ToS), demonstrate that our technique works consistently both in our custom Cuckoo Sandbox implementation and in proprietary solutions. Our aim, though, is neither to disrupt any business nor to hinder the operation of companies that profit from providing malware behavior analysis. We contacted all the platforms and vendors that we tested with Scramblesuit and notified them about our findings. Some vendors were very positive and agreed to collaborate further on practical countermeasures. Unfortunately, other vendors opposed any dissemination of our results, adopting a shortsighted security-through-obscurity approach which is not novel in our community. Consequently, tested vendors have been anonymized to avoid violating their ToS. We purposely kept the number of new variants submitted to the bare minimum, but our approach may easily transform any existing sample into a new one. The authors are available for contact for further information disclosure.

7.2. Bare-metal environments

In [47] the authors present BareCloud, a bare-metal system that helps detect evasive malware. To execute malware, this system trades visibility for transparency. In other words, it makes the analysis system transparent (non-detectable by malware) at the cost of less powerful analysis data (limited instrumentation). Indeed, its detection technique leverages hierarchical similarity [35] comparison between malware execution traces collected on different (virtualized and emulated) systems, i.e., Ether [31], Anubis [13], and VirtualBox [40]. One of the biggest problems of hierarchical similarity algorithms is scalability, which requires the algorithm to be polynomial in time and space. An example [73] of application and analysis of hierarchical similarity for binary program comparison shows $O(n^{2})$ complexity. Hence, using BareCloud as a production system, for example for VirusTotal, which claims [25] about 1.5M daily submissions, means that the hierarchical comparison would require approximately $2.25 \times 10^{12}$ (2,250 billion) operations daily to detect evasive malware with bare-metal equipment. It is evident that BareCloud can be useful in special cases, as briefly stated above, where a manual analyst can also make the difference. For the sake of scalability, though, virtualization and emulation methods cannot be fully replaced. Even if it were possible to instrument an entire system in hardware [57], the approach would suffer from many other issues, for instance acquiring and maintaining a large amount of physical hardware.
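The daily operation count quoted above follows directly from squaring the number of submissions. The short sketch below reproduces the arithmetic; it ignores constant factors, and the function name is illustrative.

```python
DAILY_SUBMISSIONS = 1_500_000  # daily submissions attributed to VirusTotal in the text


def quadratic_cost(n: int) -> int:
    """Number of elementary operations for an O(n^2) comparison, ignoring constants."""
    return n * n


ops = quadratic_cost(DAILY_SUBMISSIONS)
print(f"{ops:.3e} operations per day")  # 2.250e+12
```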

Evasive malware on the rise. Over 70% of all malware attacks involved evasive zero-day malware in Q2 of 2020: a 12% rise on the previous quarter [27]. This denotes that evasive malware is a phenomenon that will hardly disappear and there will always be continuous research in evading analysis systems. Recently, a well-known sandbox vendor declared that “detecting evasive samples is not a priority for their business” (via private communication). While this statement suggests that online services might not pay special attention, if samples try to evade their aggregated detection system, this does not imply that such evasion is going to disappear. This statement, though, may open avenues for a different attack surface, such as Economical Denial of Sustainability [24].

7.3. Economical denial of sustainability

Online sandboxes, like any other business, have costs to sustain. Ignoring evasive malware to avoid an additional cost is (for now) understandable. Unfortunately, malware that exploits Scramblesuit’s technique incurs additional energy and memory costs, especially if submitted at large scale to such systems, opening avenues to EDoS attacks, which try to make the online service economically unsustainable. These online services receive on average 1.5M samples daily. It is not difficult to imagine how much energy just a tenth of the total submissions can consume if they run a PoW. Such an algorithm is one of the most energy-intensive operations that a computer can perform. For instance, the yearly energy consumption of Bitcoin’s blockchain is comparable to that of a country such as Tunisia or the Czech Republic [30]. We stress that not all evasion techniques are the same, and every technique that exploits hardware consumption side channels should be properly analyzed to avoid service disruption.

7.4. Malware automation

Following this research line, we plan to continue studying the phenomena that surround malware, its analysis, and the countermeasures that can be applied to thwart such analysis. We envision that attack-driven research is a necessary approach to demonstrate the duality of conflicting phenomena. In particular, we argue that in the future, malware could thwart analysis by trying to mine the last block of a public blockchain, as some JS-based malware did to get rewarded. In this case, the mining would serve as stalling code to evade detection. However, it would be connected to a decentralized live network and its analysis would depend on transaction confirmation, causing a delay until the block gets mined. For instance, in Bitcoin the average transaction confirmation takes roughly 10 minutes, while current sandboxes offer 5 minutes to perform an analysis.

Moreover, updates to the malware evasion time can be done on the fly when the next block of the blockchain is calculated, thus automating the evasive behavior upgrade.

7.5. CPU fingerprinting

With a very granular source of time, PoW algorithms, given their asymptotic behavior, may be used to fingerprint hardware platforms instead of simply distinguishing sandboxes from real hardware. The potential of this proposal is to perform such fingerprinting remotely, i.e., through the browser, as Sánchez-Rola [80] did. This opens many new questions about user profiling, privacy and security. However, since the disclosure of the well-known micro-architectural bugs Spectre, Meltdown, and Foreshadow [48,59,100], the resolution of browser timers, normally accessed through the performance.now() JavaScript function [63], has been greatly reduced, so fingerprinting a specific CPU based on execution time can become a daunting task that requires a lot of time or generates many false positives.
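The effect of a coarse timer on fingerprinting can be illustrated with a small sketch. It models a clock that rounds durations down to its resolution and shows that two workloads whose durations differ by less than the resolution become indistinguishable. The durations and the two resolutions (0.005 ms and 1 ms) are hypothetical values chosen for illustration, not measured browser settings.

```python
import math


def quantize(duration_ms: float, resolution_ms: float) -> float:
    """Round a duration down to the timer resolution, as a coarse clock would report it."""
    return math.floor(duration_ms / resolution_ms) * resolution_ms


# Two hypothetical CPUs whose workloads differ by 0.4 ms.
cpu_a_ms, cpu_b_ms = 12.3, 12.7
for resolution in (0.005, 1.0):
    a = quantize(cpu_a_ms, resolution)
    b = quantize(cpu_b_ms, resolution)
    print(f"resolution {resolution} ms: distinguishable = {a != b}")
```

With the fine resolution the two readings differ, while with the 1 ms resolution both collapse to the same value, so many repeated measurements would be needed to separate them.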

8. Related work

There is a significant body of research [19,22,38,54,93,94,108] focusing on both designing novel evasion techniques for malware and also providing mechanisms to detect them. We next discuss the most relevant works related to ours.

Evading Malware Analysis. Packer identification tools such as Yara and others [1,3,78] use signatures to detect whether an executable is packed with a specific, previously known packer. Moreover, these tools can be used in combination with manually written static unpackers to enhance their functionality and customize their behaviour. Also, generic dynamic unpackers have been proposed [18,46,60] to avoid writing a specific static unpacker that may be useful for only a single malware sample. Ugarte et al. [98] propose a taxonomy of packer complexity and perform an exhaustive study of custom and COTS packers. D’Elia et al. [29] give a comprehensive description of anti-analysis techniques, including one of the first descriptions of how to counter stalling code, one of the most resilient anti-analysis techniques; they state that the approach to dismantle such artifact code is manual.

Fingerprinting emulated environments. By recognizing the sandboxes of different vendors, malware can identify the distinguishing characteristics of a given emulated environment and alter its behavior accordingly. The work in [79] introduced the notion of red pill and released a short exploit code snippet that could be used to detect whether the code is executed under a VM or on a real platform. In [76], the authors propose an automatic and systematic technique (based on EmuFuzzer [61]) to generate red pills for detecting whether a program is executed inside a CPU emulator. In [62], the authors build KEmuFuzzer, which leverages protocol-specific fuzzing and differential analysis. KEmuFuzzer forces the hosting virtual machine and the underlying physical machine to execute specially crafted snippets of user- and system-mode code before comparing their behaviors. In [17], the authors present AVLeak, a tool that can fingerprint emulators running inside commercial antivirus (AV) software, which are used whenever AVs detect an unknown executable. The authors developed an approach that allows them to deal with these emulators as black boxes and then use side channels for extracting fingerprints from each AV engine. Instead, we show that even with completely transparent analysis programs, the real environment can be used by the malware to determine that it is under analysis. In [64], the authors propose an ML-based approach to detect emulated environments. This technique is based on the use of features such as the number of running processes, shared DLLs, size of temporary files, browser cookies, etc. These features are named by the authors “wear-and-tear artifacts” and are present in real systems as opposed to sandboxes. The authors use such features to train an SVM classifier. We also rely on modeling a distinguishing feature, which in our case is a timing channel arising from the asymptotic behavior of a PoW rather than the presence or absence of system artefacts.

In [37], authors introduce the virtual machine monitor (VMM) detection and they propose a fuzzy benchmark approach that works by making timing measurements of the execution time of particular code sequences executed on the remote system. The fuzziness comes from heuristics which they employ to learn characteristics of the remote system’s hardware and its configuration. In [77], authors analyze a number of possibilities to detect system emulators. Their results show that emulation can be successfully detected as easily as virtual machines, mainly because the task of perfectly emulating real hardware is complex. In [23], the authors present a technique that leverages TCP timestamps to detect anomalous clock skews in VMs. A downside of the approach is that it requires the transmission of streams of hundreds of SYN packets to the VM, something that can be detected in the case of a honeypot VM and flagged as malicious behavior. Finally, in [42] authors present a number of red pills based on timing side channels that are implemented in JavaScript and run in the browser, aiming to detect whether the browser is running inside a virtual machine. Compared to the previous approaches, Scramblesuit is more principled and offers a solid basis founded on cryptographic primitives (PoW) with predictable and reproducible computational behavior on different tested platforms.

Detecting evasive malware. In [32], the authors propose Ether, a malware analyzer that eliminates in-guest software components vulnerable to detection. Ether leverages hardware virtualization extensions such as Intel VT, thus residing outside of the target OS environment. In [47], the authors present an automated evasive malware detection system based on bare-metal dynamic malware analysis. Their approach is designed to be transparent and thus robust against sophisticated evasion techniques. The evaluation results showed that it could automatically detect 5,835 evasive malware samples out of 110,005 tested samples. In [12], the authors propose a technique to detect malware that deploys evasion mechanisms. Their approach works by comparing the system call trace recorded when running a malware program on a reference system with the behavior observed in the analysis environment. In [58], the authors propose a system for detecting environment-sensitive malware by automatically comparing its behavior in multiple analysis sandboxes. Compared to previous techniques, our approach is agnostic to system artifacts and cannot be recognized by only monitoring the system operations.

Timing Side Channels. One of the most dangerous side channels is time, because algorithms have different complexity and hence require different amounts of computational power and time to execute. The seminal work on timing side channels by Kocher [49] dates back to 1996, and the corpus of literature on these attacks is vast, for the simple reason that everything that has a clock consumes time. To the best of our knowledge, the most recent attack has been published by Huo et al. [44] against the Intel SGX Trusted Execution Environment (TEE), which has also been found to be vulnerable to Foreshadow [100]. In general, speculative execution features of CPUs expose caches to timing side channels [48,59]. In Scramblesuit’s case we do not exploit micro-architectural vulnerabilities: PoW is costly enough to allow a more coarse-grained approach. Whereas exploiting speculative execution requires a very high resolution timer, our resolution is on the order of $10^{1}$ seconds, which makes the attack quite practical and easy to test.

Other Side Channels. Power consumption is another way to spot that something “costly” is happening on a machine; the seminal work in this area was published by Kocher et al. in 1999 [50]. PoW algorithms are well known to consume a lot of energy: the Bitcoin network and its PoW consume roughly 70 TWh per year on average, a figure comparable to the energy consumption of a small country such as the Czech Republic [30]. In the malware arms race, such side channels can also be exploited together with other features [64], either as a defence mechanism or as an attack. Scramblesuit uses time to expose sandboxes because timing does not require very fine-grained measurements, but power consumption could be used to obtain a similar result, as we show in Section 6 with respect to CPU usage.
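As a worked check on the units of the yearly Bitcoin figure: it is an amount of energy, so it is naturally expressed in terawatt-hours (TWh) per year. Dividing it by the number of hours in a year gives the corresponding average power draw. The symbols $E_{\text{year}}$, $T_{\text{year}}$ and $\bar{P}$ are introduced here only for this illustration, and the 70 TWh value is the approximate figure quoted from [30].

$$
\bar{P} = \frac{E_{\text{year}}}{T_{\text{year}}} \approx \frac{70 \times 10^{12}\ \text{Wh}}{8760\ \text{h}} \approx 8 \times 10^{9}\ \text{W} = 8\ \text{GW}
$$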

9. Conclusion

Online malware scanning services are becoming increasingly popular, allowing users to upload artefacts and scan them against AV engines and malware analysis sandboxes. In this paper, we explored the use of Proof-of-Work (PoW) as a malware evasion technique across the spectrum of computer architectures used in malware analysis sandboxes. In particular, we analyzed two further fundamental computer architectures, Intel i7 and Apple M1, and provided empirical evidence of how PoW can be used to evade real-world malware analysis sandboxes. We tested six different malware families, recycling them as brand-new samples, in addition to creating our own new variant. Since our techniques have been applied to the most common computer architectures, our statistical results can be used as an upper bound for evading state-of-the-art virtualized hardware in the malware analysis context.

Footnotes

Acknowledgments

This research was supported by the Spanish AEI grant ODIO (PID2019-111429RB-C21), by the Region of Madrid grant CYNAMON-CM (P2018/TCS-4566), co-financed by European Structural Funds ESF and FEDER, and by the Excellence Program EPUC3M17. The opinions, findings, conclusions, or recommendations expressed are those of the authors and do not necessarily reflect those of any of the funders.

References

1. wolfram77web. PEiD packer detector, 2007, https://github.com/wolfram77web/app-peid.

2. Alsmeyer, Chebyshev’s inequality, in: International Encyclopedia of Statistical Science, Springer, Berlin Heidelberg, 2011.

3. anonymized. Yara Signature Detector. anonymized, 2007.

4. anonymized. Sandbox 3. anonymized, 2020.

5. anonymized. Sandbox 1, 2020, anonymized.

6. anonymized. Sandbox 2, 2020, anonymized.

7. anonymized. Sandbox 2, 2020, https://codesandbox.io/examples/package/tencent.

8. Red Hat Inc. Ansible IT automation. https://github.com/ansible, 2020.

9. J.-P. Aumasson, Arcieri, Chestnykh, Gosney, Graves, Green, Gutmann, Junod, P.-H. Kamp, Lucks, Neves, Percival, Peslyak, Ray, Steube, Thomas, Sonmez Turan, Wilcox-O’Hearn, Winnerlein and Yarrkov, Password hashing competition, 2015, https://password-hashing.net/.

10. Back, Hashcash: anti-spam tool. http://www.hashcash.org/, 2020.

11. Bakker and Van Der Jagt, GPU-based password cracking, Technical Report, 2010.

12. Balzarotti, Cova, Karlberger and Vigna, Efficient detection of split personalities in malware, in: Proc. 17th Annual Network and Distributed System Security Symposium (NDSS), 2010.

13. Bayer, Milani Comparetti, Hlauschek, Krügel and Kirda, Scalable, behavior-based malware clustering, in: NDSS, The Internet Society, 2009.

14. Bilge, Kirda, Kruegel and Balduzzi, EXPOSURE: Finding malicious domains using passive DNS analysis, in: Proceedings of the Network and Distributed System Security Symposium, NDSS 2011, San Diego, California, USA, 6th February – 9th February 2011, The Internet Society, 2011.

15. Biryukov, Dinu and Khovratovich, Argon2: New generation of memory-hard functions for password hashing and other applications, in: IEEE European Symposium on Security and Privacy, EuroS&P 2016, Saarbrücken, Germany, March 21–24, 2016, 2016.

16. Biryukov, Dinu, Khovratovich and Josefsson, Argon2 RFC, 2019, www.tools.ietf.org/id/draft-irtf-cfrg-argon2-05.html.

17. Blackthorne, Bulazel, Fasano, Biernat and B. Yener, AVLeak: Fingerprinting antivirus emulators through black-box testing, in: 10th USENIX Workshop on Offensive Technologies (WOOT 16), USENIX Association, Austin, TX, 2016.

18. Bonfante, Fernandez, J.-Y. Marion, Rouxel, Sabatier and A. Thierry, CoDisasm: Medium scale concatic disassembly of self-modifying binaries with overlapping instructions, in: Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, CCS’15, 2015.

19. Brengel, Backes and Rossow, Detecting hardware-assisted virtualization, in: Proceedings of the 13th International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment – Volume 9721, DIMVA 2016, Springer-Verlag, Berlin, Heidelberg, 2016, pp. 207–227.

20. C2AI Sandbox. C2AI Sandbox. https://blog.virustotal.com/search/?q=multisandbox, 2022.

21. Caballero, Grier, Kreibich and Paxson, Measuring pay-per-install: The commoditization of malware distribution, in: Proceedings of the 20th USENIX Security Symposium, 2011.

22. Canali, Lanzi, Balzarotti, Kruegel, Christodorescu and Kirda, A quantitative study of accuracy in system call-based malware detection, in: International Symposium on Software Testing and Analysis, ISSTA 2012, Minneapolis, MN, USA, July 15–20, 2012, M.P.E. Heimdahl and Su, eds, ACM, 2012, pp. 122–132.

23. Chen, Andersen, Morley Mao, Bailey and Nazario, Towards an understanding of anti-virtualization and anti-debugging behavior in modern malware, in: 2008 IEEE International Conference on Dependable Systems and Networks with FTCS and DCC (DSN), IEEE, 2008, pp. 177–186. doi:10.1109/DSN.2008.4630086.

24. F.Z. Chowdhury, L.B.M. Kiah, M.A.M. Ahsan and M.Y.I.B. Idris, Economic denial of sustainability (EDoS) mitigation approaches in cloud: Analysis and open challenges, in: 2017 International Conference on Electrical Engineering and Computer Science (ICECOS), 2017, pp. 206–211. doi:10.1109/ICECOS.2017.8167135.

25. Chronicle Security. File statistics during last 7 days. https://www.virustotal.com/en/statistics/, 2020.

26. LLVM, Clang: a C language family frontend for LLVM, 2020, https://clang.llvm.org/.

27. Coker, Evasive malware threats on the rise despite decline in overall attacks, 2020, https://www.infosecurity-magazine.com/news/evasive-malware-rise-decline/.

28. Cybersecurity Ventures. Global cybercrime damages predicted to reach $6 trillion annually by 2021. https://cybersecurityventures.com/cybercrime-damages-6-trillion-by-2021/, 2018.

29. D.C. D’Elia, Coppa, Palmaro and Cavallaro, On the dissection of evasive malware, Vol. 15, 2020, pp. 2750–2765.

30. Digiconomist. Yara Signature Detector. https://digiconomist.net/bitcoin-energy-consumption, 2007.

31. Dinaburg, Royal, Sharif and W. Lee, Ether: Malware analysis via hardware virtualization extensions, in: Proceedings of the 15th ACM Conference on Computer and Communications Security, CCS’08, Association for Computing Machinery, New York, NY, USA, 2008, pp. 51–62.

32. Dinaburg, Royal, Sharif and Lee, Ether: Malware analysis via hardware virtualization extensions, in: Proceedings of the 15th ACM Conference on Computer and Communications Security, 2008, pp. 51–62. doi:10.1145/1455770.1455779.

33. Dugan, Elliott, B.A. Mah, Poskanzer and Prabhu, iperf – the ultimate speed test tool for TCP, UDP and SCTP, 2020, https://iperf.fr/.

34. Dwork and Naor, Pricing via processing or combatting junk mail, in: Proceedings of the 12th Annual International Cryptology Conference on Advances in Cryptology, CRYPTO’92, Springer-Verlag, 1992.

35. Feldman and Dagan, Knowledge discovery in textual databases (KDT), in: Proceedings of the First International Conference on Knowledge Discovery and Data Mining, KDD’95, AAAI Press, 1995, pp. 112–117.

36. Fisher, Cryptographers aim to find new password hashing algorithm, 2013, https://threatpost.com/cryptographers-aim-find-new-password-hashing-algorithm-021513/77535/.

37. Franklin, Luk, J.M. McCune, Seshadri, Perrig and Van Doorn, Remote detection of virtual machine monitors with fuzzy benchmarking, ACM SIGOPS Operating Systems Review 42(3) (2008), 83–92. doi:10.1145/1368506.1368518.

38. Graziano, Canali, Bilge, Lanzi and Balzarotti, Needles in a haystack: Mining information from public dynamic analysis sandboxes for malware intelligence, in: Proceedings of the 24th USENIX Security Symposium (USENIX Security), 2015.

39. Gu, Yegneswaran, Porras, Stoll and Lee, Active botnet probing to identify obscure command and control channels, in: Proceedings of 2009 Annual Computer Security Applications Conference (ACSAC’09), 2009.

40. Guarnieri, Cuckoo sandbox. https://cuckoosandbox.org/, 2010.

41. Hellman, A cryptanalytic time-memory trade-off, IEEE Transactions on Information Theory 26(4) (1980), 401–406. doi:10.1109/TIT.1980.1056220.

42. Ho, Boneh, Ballard and Provos, Tick tock: Building browser red pills from timing side channels, in: 8th USENIX Workshop on Offensive Technologies (WOOT 14), 2014.

43. Hornby and Peslyak, yescrypt – scalable KDF and password hashing scheme, 2015, www.openwall.com/yescrypt.

44. Huo, Meng, Wang, Hao, Zhao, Zhai and Li, Bluethunder: A 2-level directional predictor based side-channel attack against SGX, Vol. 2020, 2019, pp. 321–347.

45. Infosecurity Magazine. Cybercrime costs global economy $2.9m per minute. https://www.infosecurity-magazine.com/news/cybercrime-costs-global-economy/, 2019.

46. M.G. Kang, Poosankam and H. Yin, Renovo: A hidden code extractor for packed executables, in: Proceedings of the 2007 ACM Workshop on Recurring Malcode, WORM’07, Association for Computing Machinery, New York, NY, USA, 2007, pp. 46–53. doi:10.1145/1314389.1314399.

47. Kirat, Vigna and C. Kruegel, BareCloud: Bare-metal analysis-based evasive malware detection, in: 23rd USENIX Security Symposium (USENIX Security 14), USENIX Association, San Diego, CA, 2014, pp. 287–301.

48. Kocher, Horn, Fogh, Genkin, Gruss, Haas, Hamburg, Lipp, Mangard, Prescher, Schwarz and Yarom, Spectre attacks: Exploiting speculative execution, in: 40th IEEE Symposium on Security and Privacy (S&P’19), 2019.

49. P.C. Kocher, Timing attacks on implementations of Diffie-Hellman, RSA, DSS, and other systems, in: Proceedings of the 16th Annual International Cryptology Conference on Advances in Cryptology, CRYPTO’96, Springer-Verlag, Berlin, Heidelberg, 1996, pp. 104–113.

50. P.C. Kocher, Jaffe and Jun, Differential power analysis, in: Proceedings of the 19th Annual International Cryptology Conference on Advances in Cryptology, CRYPTO’99, Springer-Verlag, Berlin, Heidelberg, 1999, pp. 388–397. doi:10.1007/3-540-48405-1_25.

51. Kolbitsch, Holz, Kruegel and Kirda, Inspector gadget: Automated extraction of proprietary gadgets from malware binaries, in: 31st IEEE Symposium on Security and Privacy, S&P 2010, 16–19 May 2010, IEEE Computer Society, Berkeley/Oakland, California, USA, 2010, pp. 29–44. doi:10.1109/SP.2010.10.

52. Kotzias, Bilge and Caballero, Measuring PUP prevalence and PUP distribution through pay-per-install services, in: Proceedings of the 25th USENIX Security Symposium, 2016.

53. Protocol Labs, Filecoin: a decentralized storage network, 2020, https://filecoin.io/.

54. Lanzi, Balzarotti, Kruegel, Christodorescu and Kirda, AccessMiner: Using system-centric models for malware protection, in: Proceedings of the 17th ACM Conference on Computer and Communications Security, CCS 2010, Chicago, Illinois, USA, October 4–8, 2010, Al-Shaer, A.D. Keromytis and Shmatikov, eds, ACM, 2010, pp. 399–412.

55. Larimer, Momentum – a memory-hard proof-of-work via finding birthday collisions, Technical report, 2014.

56. Laurie and Clayton, Proof-of-work proves not to work; version 0.2, in: Workshop on Economics and Information Security, 2004.

57. L.W. Li, Duc and Pacalet, Hardware-assisted memory tracing on new SoCs embedding FPGA fabrics, in: Proceedings of the 31st Annual Computer Security Applications Conference, ACSAC 2015, Association for Computing Machinery, New York, NY, USA, 2015, pp. 461–470.

58. Lindorfer, Kolbitsch and Milani Comparetti, Detecting environment-sensitive malware, in: International Workshop on Recent Advances in Intrusion Detection, Springer, 2011, pp. 338–357. doi:10.1007/978-3-642-23644-0_18.

59. Lipp, Schwarz, Gruss, Prescher, Haas, Fogh, Horn, Mangard, Kocher, Genkin, Yarom and Hamburg, Meltdown: Reading kernel memory from user space, in: 27th USENIX Security Symposium (USENIX Security 18), 2018.

60. Martignoni, Christodorescu and S. Jha, OmniUnpack: Fast, generic, and safe unpacking of malware, in: ACSAC07, 2007.

61. Martignoni, Paleari, Fresi Roglia and Bruschi, Testing CPU emulators, in: Proceedings of the 2009 International Conference on Software Testing and Analysis (ISSTA), ACM, Chicago, Illinois, USA, 2009, pp. 261–272.

62. Martignoni, Paleari, Fresi Roglia and Bruschi, Testing system virtual machines, in: Proceedings of the 2010 International Symposium on Testing and Analysis (ISSTA), Trento, Italy, 2010.

63. MDN web docs. performance.now(). https://developer.mozilla.org/en-US/docs/Web/API/Performance/now, 2020.

64. Miramirkhani, M.P. Appini, Nikiforakis and Polychronakis, Spotless sandboxes: Evading malware analysis systems using wear-and-tear artifacts, in: 2017 IEEE Symposium on Security and Privacy (SP), 2017, pp. 1009–1024. doi:10.1109/SP.2017.42.

65. Moser, Krügel and Kirda, Exploring multiple execution paths for malware analysis, in: 2007 IEEE Symposium on Security and Privacy (S&P 2007), 20–23 May 2007, IEEE Computer Society, Oakland, California, USA, 2007, pp. 231–245. doi:10.1109/SP.2007.17.

66. Nakamoto, Bitcoin: A peer-to-peer electronic cash system, http://bitcoin.org/bitcoin.pdf.

67. Nappa, Papadopoulos, Varvello, Aceituno Gomez, Tapiador and Lanzi, PoW-How: An enduring timing side-channel to evade online malware sandboxes, in: Computer Security – ESORICS 2021, Bertino, Shulman and Waidner, eds, Springer International Publishing, Cham, 2021, pp. 86–109. doi:10.1007/978-3-030-88418-5_5.

68. Nappa, Papadopoulos, Varvello, Tapiador and Lanzi, PoC Behaviour (No Evasion) – anonymized, 2020, anonymized.

69. Nappa, Papadopoulos, Varvello, Tapiador and Lanzi, Artifact repository, 2021, https://github.com/artifactrepo/Esorics2021_Paper159.

70. Nappa, Papadopoulos, Varvello, Tapiador and Lanzi, Relec + PoW + static sanitization) – anonymized, 2021, anonymized.

71. Nappa, Xu, Zubair Rafique, Caballero and G. Gu, CyberProbe: Towards Internet-scale active detection of malicious servers, in: Proceedings of the 21st Annual Network and Distributed System Security Symposium (NDSS’14), 2014.

72. Lastline Inc. Not so fast my friend – using inverted timing attacks to bypass dynamic analysis. www.lastline.com/labsblog/not-so-fast-my-friend-using-inverted-timing-attacks-to-bypass-dynamic-analysis/, 2014.

73. Oprişa and Ignat, A measure of similarity for binary programs with a hierarchical structure, in: 2015 IEEE International Conference on Intelligent Computer Communication and Processing (ICCP), 2015, pp. 117–123. doi:10.1109/ICCP.2015.7312615.

74. Oreans. Advanced windows software protection system. https://www.oreans.com/themida.php, 2020.

75. Ozarslan, Online malware sandboxes, 2016, www.medium.com/@su13ym4n/15-online-sandboxes-for-malware-analysis-f8885ecb8a35.

76. Paleari, Martignoni, Fresi Roglia and Bruschi, A fistful of red-pills: How to automatically generate procedures to detect CPU emulators, in: Proceedings of the 3rd USENIX Conference on Offensive Technologies, WOOT’09, USENIX Association, USA, 2009, p. 2.

77. Raffetseder, Krügel and Kirda, Detecting system emulators, in: Information Security, 10th International Conference, ISC 2007, Proceedings, Valparaíso, Chile, October 9–12, 2007, J.A. Garay, A.K. Lenstra, Mambo and Peralta, eds, Lecture Notes in Computer Science, Vol. 4779, Springer, 2007, pp. 1–18.

78. RDG Software. RDG Packer Detector. http://www.rdgsoft.net/, 2007.

79. Rutkowska, Red pill ... or how to detect VMM using (almost) one CPU instruction, 2004, https://securiteam.com/securityreviews/6z00h20bqs/.

80. Sánchez-Rola, Santos and Balzarotti, Clock around the clock: Time-based device fingerprinting, in: CCS’18, 2018.

81. Sandbox. Evasive malware analysis report. anonymized, 2020.

82. Sandbox. Evasive malware analysis report. anonymized, 2020.

83. Sandbox. Evasive malware analysis report. anonymized, 2020.

84. Sandbox. Evasive malware analysis report. anonymized, 2020.

85. Sandbox. Evasive malware analysis report – 1. anonymized, 2020.

86. Sandbox. Evasive malware analysis report – 2. anonymized, 2020.

87. Sandbox. Evasive malware analysis report – 3. anonymized, 2020.

88. Sandbox. Evasive malware analysis sandbox. anonymized, 2020.

89. Sandboxes. Avemaria + pow. https://www.virustotal.com/gui/file/371032f96b0fc980fe3aecb408cbcda24cfe6de92a17f50b5a510dc4ba6bb516?nocache=1, 2022.

90. Sandboxes. Jigsaw + pow. https://www.virustotal.com/gui/file/b15f0a7711fb1010cac259761e40c5ef9aac8bb9f8fe3e9d0a453f92b2b94b79?nocache=1, 2022.

91. Sandboxes. Pony + pow. https://www.virustotal.com/gui/file/100eebd4e7237c369466876c7b00ece40c06160e646d50004c6ccdad5fcd9998?nocache=1, 2022.

92. Sandboxes. Zeus + pow. https://www.virustotal.com/gui/file/8193241eb14e5de669ecd4b32b430aea9a12d6ededdf59415e2ad3443bcccf61?nocache=1, 2022.

93. Sharif, Lanzi, Giffin and Lee, Automatic reverse engineering of malware emulators, in: IEEE Symposium on Security and Privacy, 2009, pp. 94–109.

94. Tanabe, Ueno, Ishii, Yoshioka, Matsumoto, Kasama, Inoue and Rossow, Evasive malware via identifier implanting, in: Detection of Intrusions and Malware, and Vulnerability Assessment, Giuffrida, Bardin and Blanc, eds, Springer International Publishing, Cham, 2018, pp. 162–184. doi:10.1007/978-3-319-93411-2_8.

95. The Boost organization. Boost C++ libraries, 2020, https://www.boost.org/.

96. Tromp, Cuckoo cycle: A memory bound graph-theoretic proof-of-work, in: International Conference on Financial Cryptography and Data Security, Springer, 2015, pp. 49–62. doi:10.1007/978-3-662-48051-9_4.

97. Tuwiner, Bitmain antminer s9 review, 2017, https://www.buybitcoinworldwide.com/mining/hardware/antminer-s9/.

98. Ugarte-Pedrero, Balzarotti, Santos and P.G. Bringas, SoK: Deep packer inspection: A longitudinal study of the complexity of run-time packers, in: 2015 IEEE Symposium on Security and Privacy, 2015, pp. 659–673. doi:10.1109/SP.2015.46.

99. Ul Haq, Chica, Caballero and Jha, Malware lineage in the wild, Computers & Security 78(C) (2018), 347–363.

100. Van Bulck, Minkin, Weisse, Genkin, Kasikci, Piessens, Silberstein, T.F. Wenisch, Yarom and R. Strackx, Foreshadow: Extracting the keys to the Intel SGX kingdom with transient out-of-order execution, in: Proceedings of the 27th USENIX Conference on Security Symposium, SEC’18, USENIX Association, USA, 2018, pp. 991–1008.

101. VirusShare. Virusshare.com – because sharing is caring. https://virusshare.com/l, 2020.

102. Wang, Wei, Gu and W. Zou, TaintScope: A checksum-aware directed fuzzing tool for automatic software vulnerability detection, in: Proceedings of the 31st IEEE Symposium on Security and Privacy (Oakland’10), 2010.

103. Wenzel, Forler and Lucks, The Catena password-scrambling framework, www.uni-weimar.de/fileadmin/user/fak/medien/professuren/Mediensicherheit/Research/Publications/catena-v3.1.pdf, 2015.

104. Wetzels, Open sesame: The password hashing competition and Argon2, 2016, arXiv preprint arXiv:1602.03097.

105. Wikipedia, Wannacry ransomware hits prevalently windows, 2017, https://en.wikipedia.org/wiki/WannaCry_ransomware_attack/.

106. Wong, NP complexity, 2013, https://www.cryptologie.net/article/43/np-complexity/.

107. Xu, Nappa, Baykov, Yang, Caballero and G. Gu, AutoProbe: Towards automatic active malicious server probing using dynamic binary analysis, in: Proceedings of the 21st ACM Conference on Computer and Communication Security, 2014.

108. Yokoyama, Ishii, Tanabe, Papa, Yoshioka, Matsumoto, Kasama, Inoue, Brengel, Backes and C. Rossow, SandPrint: Fingerprinting malware sandboxes to provide intelligence for sandbox evasion, in: Research in Attacks, Intrusions, and Defenses, Monrose, Dacier, Blanc and Garcia-Alfaro, eds, Springer International Publishing, 2016.

² https://github.com/artifactrepo/Esorics2021_Paper159

⁴ This cannot be applied to ForbiddenTear since it is written in .NET.

⁵ The malware detection report for this malware without our PoW-based evasive measure has been anonymized [68,81].