Abstract
Online malware scanners are one of the best weapons in the arsenal of cybersecurity companies and researchers. A fundamental part of such systems is the sandbox that provides an instrumented and isolated environment (virtualized or emulated) for any user to upload and run unknown artifacts and identify potentially malicious behaviors. The provided API and the wealth of information in the reports produced by these services have also helped attackers test the efficacy of numerous techniques to make malware hard to detect.
The most common technique used by malware for evading the analysis system is to monitor the execution environment, detect the presence of any debugging artifacts, and hide its malicious behavior if needed. This is usually achieved by looking for signals suggesting that the execution environment does not belong to a native machine, such as specific memory patterns or behavioral traits of certain CPU instructions.
In this paper, we show how an attacker can evade detection on such analysis services by incorporating a Proof-of-Work (PoW) algorithm into a malware sample. Specifically, we leverage the asymptotic behavior of the computational cost of PoW algorithms when they run on some classes of hardware platforms to effectively detect a non bare-metal environment of the malware sandbox analyzer. To prove the validity of this intuition, we design and implement
Introduction
This paper is an extended version of the work published at ESORICS 2021 [67]. Malware attacks have a significant financial cost, estimated around $1.5 trillion dollars annually (or $2.9 million dollars per minute) [45], with predictions hinting at this cost to reach $6 trillion dollars by 2021 [28]. Due to the sheer amount of known malware samples [25,101], manual analysis neither scales nor allows to build any comprehensive threat intelligence around the detected cases (e.g., malware clustering by specific behavior, family or infection campaign). To address this problem, security researchers have introduced sandboxes [13]: isolated environments that automate the dynamic execution of malware and monitor its behavior under different scenarios. Sandboxes usually comprise a set of virtualized or emulated machines, instrumented to gather fundamental information of the malware execution, such as system calls, registry keys accessed or modified, new files created, and memory patterns.
As a next step, online services appeared to bring malware analysis from security experts to common users [75]. Online malware scanners are not only useful for the users but also for the attackers. By allowing an artifact to be checked multiple times against various state-of-the-art malware analysis sandboxes, attackers can tune the evasiveness of their malware samples by exploiting the feedback reported by these services and try various techniques before making the sample capable of detecting the presence of a sandbox. Specific CPU instructions, registry keys, memory patterns, and red pills [62,76,79] are only a few of the signals used by attackers for identifying glitches of the emulated environment that can disclose the presence of a sandbox environment. These techniques have triggered an arms race, with the more sophisticated web malware scanners rushing to spoof any such exploitable signals [47].
In this work, we show how an attacker can evade malware analysis in these scanning services by leveraging Proof-of-Work (PoW) [34] algorithms. The key intuition of our technique lies in the fact that, like NP-class problems [106], the asymptotic behavior of a PoW algorithm is constant in terms of computational power [34]. This means that CPU and memory consumption remain stable over time. Therefore, PoW algorithms are perfect candidates for benchmarking the computation capability of the underlying hardware. Such a benchmark can be used as a fingerprint of the underlying computing infrastructure, in particular to reveal the presence of a sandbox since its fingerprint deviates statistically from the one observed in a native hardware platform.
A key advantage of using PoW techniques is that they are a time-proof and self-contained mechanism compared to other more fine-grained timing side-channel approaches that try to detect the underlying hardware machine. In fact, our system does not require access to precise timing resources for detecting the emulated environment (e.g., network or fine-grained timers). In our evaluation we empirically validate that a PoW-based technique can detect an emulated environment with high precision just by looking at the output of the algorithm (i.e., execution time, and number of successful iterations). Furthermore, PoW implementations do not raise any suspicion to automated malware sandboxes compared with the stalling code (e.g., infinite loops and/or sleep) that is easier to detect because of CPU idleness [57]. An additional advantage of this approach is that current defensive techniques that aim at spoofing the virtualization signals present in contemporary sandboxes cannot act as countermeasures against the stable timing side-channels that our technique exploits. Fingerprinting PoW algorithms as a malware component is feasible, e.g., by checking the usage of particular cryptographic instructions. However, using it as a proxy signal for detecting malware would produce a large number of false positives since PoW algorithms are part of legitimate applications such as Filecoin [53] and Hashcash [10].
Contributions. In this paper, we make the following contributions:
We design and implement We empirically evaluate each step of To further quantify the efficacy of PoW-based evasion with real-world sandboxes, we wrote a fully functional malware sample for several operating systems (Linux, Windows, MacOS) which we integrated with an evasion mechanism based on Argon2, and submitted it to fourteen online sandboxes. Reports from each sandbox mark our malware as clean. We have tested six different malware families and recycled them as brand-new samples, beyond creating our own new variant. We further discuss the behavioral analysis for our malware, as well as potential countermeasures to this novel PoW-based evasion mechanism we have proposed. To ensure the reproducibility of our results and foster further research on this topic, we make the source code of
We evaluate

WannaCry ransomware variant dynamic execution graph (VirusTotal).
Malware Analysis. Researchers and professionals have evolved their tools and skills in response to the evolution of malicious software. There is a substantial amount of literature devoted to analyze and counter malware [21,39,51,52,65,71,102,107]. Every aspect of the phenomenon has been taken into consideration, from its network infrastructure, to the code that gets reused among samples, unexplored paths in the control-flow, sandbox design and instrumentation. Nonetheless the arms race continues and as new analysis evasion techniques are found also new countermeasures are developed. Figure 1 gives an idea of how sandboxes environment can be useful to study malware samples and their behaviour. In this example, we see a WannaCry variant – a popular ransomware – which contacts different servers between Europe and US to perform its operations. The figure gives an idea of the scale of the malware phenomenon and the amount of revenue that can be gained using infected machines.
Anti-Analysis Techniques. Several anti-analysis techniques have been developed during the years by miscreants: packers [14,60,98], emulators [93], anti-debugging and anti-disassembly tricks and stalling code. Most techniques have been promptly countered by our community, with the exception of stalling code. This anti-analysis technique is very difficult to detect and poses a problem for commercial sandboxes [72].
Proof-of-Work (PoW) [34] is a cryptographic technique used to guarantee that a party (the prover) has spent a certain amount of computational effort. A key feature of PoW algorithms is their asymmetry: the work imposed on the client is moderately hard but it is easy for a server to check the computed result. There are two types of PoW protocols: (a) challenge-response protocols, which require an interactive link between the server and the client, and (b) solution-verification protocols, which allow the client to solve a self-imposed problem and send the solution to the server to verify the validity of the problem and its solution. Such PoW protocols (also known as CPU cost functions) leverage algorithms like hashcash with double iterated SHA256 [56], momentum birthday collision [55], cuckoo cycle [96], and more. These algorithms may be:
Memory-bound, where computation speed is bound by main memory accesses. The performance of such algorithms is expected to be less sensitive to hardware evolution.
CPU-bound, where computation speed is bound by the system’s processor, which greatly varies in time, as well as between high-end and low-end devices.
Network-bound, where a client must collect tokens from remote nodes before querying the service provider. In this sense, the work is not performed by the client, but they incur delays because of the latency to get the required tokens.
PoW has gained much popularity in recent years by becoming the founding block of the blockchain technology. PoW is the (trustless) consensus algorithm in a blockchain network that is used to confirm transactions and append new blocks to the chain. In particular, performing PoW – which in this context is known as mining – is required as a way to force miners to compete against each other in solving complex computational puzzles. Whenever a miner solves the PoW computational puzzle of a new block, it broadcasts that block to the network. All other miners can then easily verify that the solution is correct and the block is confirmed. The difficulty of each PoW puzzle is periodically adjusted to keep the block construction time around a target time [34]. While designing
The choice of Argon2
Argon2 (the PHC winner) is the PoW strategy used in
A message string P, which is a password for password hashing applications. Its length must be within 32-bit size.
A nonce S, which is used as salt for password hashing applications. Its length must be within 32-bit size.
A degree of parallelism p that determines how many independent (but synchronized) threads can be run. Its value should be within 24-bit size (minimum is 1).
A tag with length within 2 and 32-bit
A memory size m, which is a number expressed in Kibibytes 3
A “Kibibyte” is equal to 1024, or
A number of internal iterations t, which is used to tune the running time independently of the memory size. Its value should be within 32-bit size (minimum is 1).
These input parameters will be used in our framework to define the computational boundary of the algorithm execution on a specific class of hardware machines. Once the parameters are set, the output of the PoW algorithm only depends on the hardware platform. Argon2 has two implementations: Performance: the selected memory area is filled very fast, this plays against adversaries with specialized ASIC. Tradeoff Resilience: with default number of memory passes an ASIC equipped adversary cannot decrease the time-area product even if the memory size used is reduced by a factor of 4 or more. This has been proved not to work with Catena and Lyra algorithms. GPU-FPGA-ASIC Unfriendly: implementing a dedicated cracking hardware will not improve the performance (timewise) of the algorithm.
Various techniques have been proposed to detect if applications are running inside a sandbox, emulated or virtualized environment. The most reliable and stealthy ones are based on timing measurements [ 49 ]. Such an approach bases its own success on the use of fine-grained timing instructions provided by the operating system. Recently, access to these instructions have been limited by the operating system to protect against the very dangerous micro-architectural attacks such as Spectre and Meltdown [ 48 , 59 ]. Consequently, timing measurements are no more a viable option for sandbox detection.
Different from the classic timing techniques our mechanism is based on PoW algorithms and offer strong cryptographic premises with a very stable complexity growth, which make an evasion technique resilient to any countermeasure, such as using more powerful bare-metal machines to enhance performance or reduce the time measurements granularity. By exploiting the asymptotic behavior of the PoW algorithms, we build a statistical model that can be used to identify the class of environment where the algorithm is running and consequently distinguish between physical and virtualized, emulated or simulated architectures, like different flavors of malware sandboxes. Indeed, even fine grained red pills techniques [76] such as CPU instruction misbehaviour can be easily fixed in the sandbox or spoofed to thwart evasion techniques. Differently, PoW relies on well defined mathematical and computational behaviour. Moreover, a simple modification of the PoW library avoids the malware sample to be fingerprinted with statistical techniques.
If we take as an example of PoW complexity the one that is used by bitcoin, by design the computational complexity of the algorithm increases for each new block of the blockchains transaction [66]. For instance, the computation complexity of the PoW used by blockchains increases for each new block of blockchain transactions [66]. This is due to the fact that resources (cryptocoins) of the blockchain ecosystem are limited; as resources availability decreases, their price in terms of computational power grows, hence the PoW computation becomes more difficult at every iteration. Such computation increase shows the asymptotic behavior that can be exploited by our technique. To approve a pool of transactions an amount of computation is needed to solve the PoW and compute the next block of the chain.
Our approach: Scramblesuit
This section describes
Threat model
We assume a malware scanning service based on virtualized or emulated sandboxes, which allows users to upload and scan their individual files for free as many times as they need. Such service combines results from various state-of-the-art malware analysis sandboxes before returning to the user a detailed report about the detection outcome of each sandbox scanner used.
We further assume an attacker who developed a program that includes (i) some malicious payload along with (ii) a technique to pause or alter the execution of the malicious program itself, when a possible malware analysis environment is detected. Before distributing the malicious program to the victims, the attacker may use a malware scanning service to assess its evasiveness.
Beyond the fact that an attacker can hide its own artifact, in this paper we demonstrate that it is possible to recycle an existing malware and make it look like benign software. This could enable an attacker to recycle billions of malicious samples and eventually break current detection mechanisms.

High level overview of
As described in Section 2, PoW puzzles have moderately high solving cost and a very small verification time, like problems in the NP complexity class [106]. This implies that their asymptotic behavior is constant in terms of computational cost [34], e.g., CPU and memory consumption.
Performance Profiling. It executes multiple PoW algorithms on several hardware and operating systems using different configuration settings and system loads.
Model estimation. The previous step provides the system with a measurement of the amount of time needed to execute the PoW on real hardware. By using the Bienaymé–Chebyshev [2] inequality, it then estimates the time (threshold) expected for a particular configuration to run on a given architecture.
Integration. Once the models are built, a malware developer can select a specific PoW and parameters to associate with an arbitrary malware sample.
We leverage a custom Cuckoo Sandbox [40] and popular crowd-sourced malware scanning services (like VirusTotal or similar [75]) to form a testbed allowing to report on the accuracy of the evasiveness of the malware in real-world settings.
Performance profiling
The first step in
Hardware.
Systems and loads. With the exception of the Raspberry Pi 3 and the iMac, the other hardware platforms are set up in dual boot, supporting both Linux (Ubuntu 22.04, 64 bits) and Windows 10 (64 bits). Each platform can be further configured in idle and busy mode. The latter is achieved using
PoW and parameters.
Threshold estimation
The second step in
Using the empirical distribution of execution time observed in the previous step, this inequality allows us to select a threshold T (i.e., a maximum execution time) which guarantees a high sample population coverage. The previous deduction enables us to determine with high probability the time T it will take for a PoW to run if the underlying platform is not virtualized. To reduce false positives, the evasion rule can be generalized to “the execution environment is virtualized if the PoW does not complete N executions in less than T seconds.”
Malware integration and testing
The final step in
Testing. To evaluate the accuracy of the newly generated evasion mechanism, we rely both on a local sandbox – a custom Cuckoo Sandbox [40] equipped with Windows 10 (64 bits), which is the most targeted OS for malware campaigns [105] – and several on-line free-of-charge sandbox services [75]. Once this step is completed,
Static Evasion vs. Sandbox Evasion. In our experiments, we use a paid VirusTotal subscription. This subscription grants us access to 8 more sandboxes. Since our goal is to trigger the dynamic analysis of the sandboxes itself we implement a naïve packing technique that is able to pass the static checks and run our sample inside the sandbox. Using this technique, we managed to run our PoW mechanism and successfully test our evasion technique.
Evaluation
In this section, we evaluate
Number of consecutive PoW executions per hardware and OS combination over 24 hours. For a given platform, the first line refers to results obtained with the idle setting, while the second line refers to busy setting
Number of consecutive PoW executions per hardware and OS combination over 24 hours. For a given platform, the first line refers to results obtained with the idle setting, while the second line refers to busy setting
Statistical measurement results for Catena
For each PoW, we have selected different configurations with respect to memory footprint, parallelism, and algorithm internal iterations (see Tables 2 for Catena and 3 for Argon2i and Yescrypt). Argon2i and Yescrypt have similar parameters (memory, number of threads, blocks) whereas Catena’s only parameter is a graph size which grows in memory and will make its computation harder as the graph size increases.
Table 1 shows the total number of PoW executed over 24 hours per hardware, operating systems, and CPU load (idle or busy). Regardless of the CPU load on each machine, we observe two key insights. First, there is a significant drop in the number of PoW executions when considering Linux vs Windows, which is close to a 50% reduction in the high-end machine. This is due to operating system interaction, ABI and binary format, and ultimately idle cycle management. Second, a 30x reduction in the number of PoW executions when comparing high-end and low-end platforms, e.g., under no additional load the Raspberry Pi 3 completes 300 executions versus an average of 8,611 executions on both the high and medium-end machines. Finally, extra load on the medium and high-end machines causes a reduction in the number of proofs computation of around 6-10%, averaging out to 7,300 executions between the two machines. A more dramatic 50% reduction was instead measured for the Raspberry Pi 3.
Next, we statistically investigate PoW execution times by mean of the Bienaymé–Chebyshev inequality (see Section 3.4). To balance equally sized datasets, we sampled 150 random executions (i.e., the total number of executions that were possible to complete on the low-end platform) from the 9,325 executions available from both the medium and high-end platforms. Tables 2 and 3 show for each PoW and configuration, several statistics (min, max, σ, and K, Chebyshev inequality) of the PoW execution time computed across hardware platforms, OSes (when available), and load condition (idle, busy). Overall, we measured Chebyshev inequality values higher than 97% regardless of the PoW and its configuration. This confirms high determinism in the PoW execution times on real hardware, validating the main intuition behind this work.
Algorithm choice. The results above provide the basis to select a PoW algorithm along with its parameters to integrate with the input malware sample. These results indicate that PoW selection has minimal impact on the expected accuracy of the proposed evasion mechanism. We then selected Argon2i (with 8 threads, 100 internal functions and 4KiB of memory) motivated by its robustness and maturity. We leverage the results from Table 3 (top, second line) to set the parameters N (PoW execution) and T (evasion threshold) of an Argon-based evasion mechanism. The table shows that
Case study: Known malware
We first analyze the effect of adding our PoW-based evasion strategy to the code of six well-known malware families: Relec, Forbidden Tear, Zeus, Jigsaw, Pony and AveMariaRAT (i.e., Botnets, RATs, Ransomware). The use of real-world malwares, which are well known and thus easy to detect, allows us to comment on the impact that PoW-based evasion has on malware reuse, the practice of recycling old malware for new attacks. We use
This cannot be applied to ForbiddenTear since it is written in .NET.
Statistical measurement results for Argon2i (top) and yescrypt (bottom). Thr. = number of threads. It. = number of algorithm steps. Mem. = amount of memory used in KiB. Cheb. = Chebyshev coverage
Online AntiVirus and sandbox detection results for six malware families samples and a benign test program using various anti-analysis configurations. a
Results are in percentage because the number of AV programs used by VirusTotal changes, though percentages are absolute of that specific detection.
.NET Binaries cannot be virtualized.
We submitted all malware variants to online sandboxes for analysis and checked how many AV engines (antivirus products) and sandboxes [ 75 ] flag each variant as malicious (see Table 4 ). The embedding of the PoW technique inside the malware is able to decrease the detection rate on roughly 70 AVs, by a factor of 10 [ 70 , 89 – 92 ], reaching a level where the difference between the label malicious and false positive is evanescent. It is worth noting that the total percentage of the detection rate is computed by considering the results provided by all sandboxes run inside the VirusTotal framework plus AV engines. In order to achieve a low detection score and no sandbox flagging the samples as malicious we applied some simple static code transformation similar to the packing technique. It is fundamental to understand that while these trivial static code transformations influence static AV detection they never influence sandbox detection which were always thwarted exclusively by the PoW.
Table 4 also show results when submitting several variants of a standard Hello World program. Note that the original code has been flagged as malicious by 3 anti-viruses, though as it is possible to see from the report the detections are mislabeled i.e., Relec is not recognized. This false positive could be due to a large number of submissions of the same code hash (due to its simplicity and popularity), our source IP being flagged, and other unknown factors which may influence the scoring. The table also shows that adding code virtualization or packing translates into a substantial increase in false positive detections even of a simple Hello World program, confirming our intuition above. Instead, adding our PoW-based evasion strategy results in less false positives, one less than the original code. This is likely due to the fact that our code on top of Hello World has more entropy, with respect to a very simple one line program. This makes it look like a realistic non-malicious program to engines that measure such kinds of parameters.
Overall, these three case studies show that a PoW-based evasion strategy reduces the number of detections by one order of magnitude with known malware by preventing the sample from executing in the analysis sandbox. This result demonstrates large potential for malware reuse by coupling it with PoW-based evasion strategy. In the next section, we perform more controlled experiments based on fresh (i.e., previously unseen) malware.
In order to further explore the results obtained in the previous case studies, we wrote a simple malware PoC (roughly 150 LoC) for Windows 10 (VC++) and Linux (C++) and Apple MacOS (for M1). Our malware sample implements a basic ransomware functionality which scans the entire hard drive and encrypts all its files. This behavior should be easy to detect by any malware scanning service.5
The malware detection report for this malware without our PoW-based evasive measure has been anonymized [68,81].
We submitted different variants of this malware sample (with PoW, without PoW, with static sanitization) to several on-line sandboxes and the results were disheartening (see Table 5). For the static sanitation we remove the symbol tables and debugging symbols. Note that very similar results were also achieved with our local Cuckoo Sandbox. It is important to note that to check the execution of the malware payload we insert a create-file function at the beginning of the malware payload itself. Such file creation is visible on the behavioral report of the analyzed sandboxes in case the malware payload is executed.6
This reference has been anonymized not to violate the terms of service of sandbox vendors [81].
Execution results of a custom ransomware sample on several sandboxes a
The eight additional malware sandboxes (4-12) provided by VirusTotal come with preconfigured settings; our submission were always not detected.
We made all the reports of our analysis publicly available, including screenshots of evasive malware samples. 7
It has to be noted that not all sandboxes report are the same, but they all signal the hard drive scan (Ransomware behavior) without full static protection (i.e., with the default compiler options). In Table 5 the number of PoW executed is visible only if a screenshot of the sandbox is available. As for the sandbox execution timeout, not all the analysis services had it available for selection.Malware execution results on hardware (busy with CPU-z on windows and SOHO programs on other platforms)
Malware execution results on hardware (idle)
Multi Architecture Sandbox Resistant Malware. We have analyzed a variety of COTS architectures, including the recent M1 from Apple Inc. We find that the performance of the M1 chip are indeed within the bounds of our statistical analysis making our threshold estimation accurate and resilient to future changes. Tables 6 and 7 show how all the platforms stay within 50 executions of the PoW during 10 seconds. Our results show the effectiveness and future proofness approach of
The results shown in the previous section demonstrate that a
We next discuss in detail the behavioral analysis of our malware. This is an analysis produced by a sandbox related to how a malware interacts with file system, network, and memory. If any of the monitored operations matches a known pattern, the sandbox can raise an alarm.

Behavioral map of the malware PoC without PoW and without full static protection enabled.
Figures 3, 4, and 5 show the behavioral analysis of our malware on a radar plot, labeled with most prevalent AV labels. The samples were submitted with different combinations of PoW and static protection. In Fig. 3, the radar plot is mostly “green” (benign) with respect to some operations like phishing, banker and adware for which we would not expect otherwise. However, four “suspicious” (orange) behaviors are reported with respect to evader, spyware, ransomware, Trojan operations. While our malware PoC is not labeled as “malicious” (red), the suspicious flags for our binary would trigger further manual analysis that could reveal its maliciousness. It is thus paramount to investigate and mitigate such suspicious flags.

Behavioral map of the malware PoC without PoW and with full static protection enabled.

Behavioral map of the malware PoC with PoW and with full static protection enabled.
Our intuition is that the suspicious flags are due to the fact that our malware is neither packed nor stripped, and hence some of its functionality i.e., exported functions, linked libraries, and function names are visible through basic static analysis that is usually also implemented in the dynamic sandbox environment. Accordingly, we strip out the whole static information from our binary and resubmit it as a new binary. Figure 4 shows the behavioral analysis of our PoC malware without PoW-based sandbox detection but with full static protection enabled. As expected, various signals have dropped from the behavioral report. Finally, Fig. 5 shows the result of adding PoW to the last binary. A completely green radar plot which does not raise any suspicion illustrates the evasion effect of

CPU consumption of our malware PoC (Argon2d) Malware:red line, System Idle (PID 0):green line.

Memory consumption of our malware PoC (Argon2d) Malware:red line, System Idle (PID 0):green line.

CPU consumption of our malware PoC. T = 60 seconds and 0.5 seconds between each PoW execution. Malware:red line, System Idle (PID 0):green line.
CPU and memory usage. The main downside of associating a PoW with a malware sample is an increase in both CPU and memory consumption. We here report on CPU and memory consumption as measured by our sandbox. Figures 6 and 7 compare, respectively, CPU and memory utilization of our malware (red line) with System Idle (PID 0). With respect to CPU usage, the PoW associated with our malware causes an (expected) 100% utilization for the whole duration of the PoW (
Next, we investigate whether we can reduce the CPU usage of our PoC ransomware by setting a longer T (e.g., 60 sec) and a sleep command of 0.5 sec between each PoW execution. Despite such sleeps, Fig. 8 still shows 100% CPU utilization for the whole T (60 sec in this test). The lack of CPU reduction associated with the extra sleep commands is counter-intuitive. The likely explanation is that the sandbox leverages a coarse CPU monitoring tool and, thus, the CPU reduction associated with our extra sleep commands gets averaged out. These results provide a foundation to detect evasion techniques based on PoW. A sandbox could attempt heuristics based on a binary’s CPU and memory consumption. We argue, however, that this is quite challenging because of the potential high number of false positives that can be generated.
Evasion techniques are easily comparable with other anti-analysis techniques like packing. Packing techniques have evolved to such sophistication that it has become practically impossible to unpack a malware sample without dynamically executing it [98,99]. However, dynamically executing a sample can indeed trigger evasion techniques like stalling code. To counter evasion techniques, and especially the ones that
Fingerprinting evasion. A common solution against red pills [76] is to reduce the amount of instructions failing due to emulation. As Martignoni et al. [61,62] show, the analysis can be automated and the fixes can be easily produced. However, with PoW the computational model is not seeking for emulation/virtualization failures or malfunctions. Instead, PoW is acting as a probe to spot a side channel in the execution time of the algorithm, which in this case is time-based.
Virtualized instructions set. Native execution of the cryptographic instructions is another potential countermeasure that could be considered to mitigate our approach. In such a case, the cryptographic instructions of the PoW algorithm are not emulated by the sandbox environment, but directly executed on the native CPU. Avoiding the emulation of the cryptographic instructions could clearly improve the computational performance of the PoW algorithm and reduce the success probability of the evasive behavior shown by
Specialized hardware. Even if our choice, Argon2, is resilient to specialized circuits for mining (ASICs and FPGAs), other PoW algorithms are not, and hence an analyst could equip his sandbox with a miner [97]. Such dedicated hardware is expensive for a non-professional user (around $3,000 at the time of writing). Nonetheless, if the phenomenon of sandbox evasion due to PoW proliferates, having such a platform would be of great help to offload the PoW calculations, through a tailored interface, and continue the execution of the malware sample inside the sandbox. The cost/benefit trade-off of adopting such a measure really depends on the intended scale of the analysis platform. For example, according to VirusTotal statistics [25], the service receives weekly more than 3M PE binaries. Hence, a dedicated hardware to defeat PoW evasion based techniques seems to be a good compromise, since it allows to analyze and discover new malicious behaviors.
Spoofing timers. The sandbox that gets a
Bare-Metal Sandboxes. Using bare metal hardware represents a reasonable solution that might be adopted within corporate companies but it is not possible to use such technology at Internet scale, i.e., cloud-based solutions like Virus Total. Also, isolated sandboxes do not benefit from the information that on-line in cloud services have which leverages large scale cross-correlations.
Discussion
Ethical considerations
The results obtained by
Bare-metal environments
In [47] the authors present BareCloud: a bare-metal system which helps to detect evasive malware. This system in order to execute malware trades visibility against transparency. In other words it makes the analysis system transparent (non-detectable by malware) and produces less powerful analysis data (limited instrumentation). Indeed their detection technique leverages hierarchical similarity [35] comparison between different malware execution traces (virtualized and emulated) systems i.e., (Ether [31], Anubis [13], and VirtualBox [40]). One of the biggest problem of hierarchical similarity algorithms is scalability, which means that the algorithm should be polynomial in time and space. An example [73] of application and analysis of hierarchical similarity for binary program comparison shows
Evasive malware on the rise. Over 70% of all malware attacks involved evasive zero-day malware in Q2 of 2020: a 12% rise on the previous quarter [27]. This denotes that evasive malware is a phenomenon that will hardly disappear and there will always be continuous research in evading analysis systems. Recently, a well-known sandbox vendor declared that “detecting evasive samples is not a priority for their business” (via private communication). While this statement suggests that online services might not pay special attention, if samples try to evade their aggregated detection system, this does not imply that such evasion is going to disappear. This statement, though, may open avenues for a different attack surface, such as Economical Denial of Sustainability [24].
Economical denial of sustainability
Online sandboxes, like any other business, have costs to sustain. Ignoring evasive malware to avoid an additional cost is (for now) understandable. Unfortunately, malware that exploits
Malware automation
Following this research line we plan to continue studying the phenomena that surrounds malware, its analysis, and the countermeasures that can be applied to thwart such analysis. We envision that attack-driven research is a necessary approach to demonstrate the duality of every phenomena that are in conflict. In particular we argue that in the future, malware could thwart analysis trying to mine the last block of a public blockchain as some JS based malware was doing to get rewarded. In this case, the mining would serve as stalling code to evade detection. Though, it will be connected with a decentralized live network and its analysis would be dependent on transaction confirmation. Hence, causing a delay, until the block gets mined. For instance, in Bitcoin the average transaction confirmation lasts roughly 10 minutes. While nowadays sandboxes, offer 5 minutes to perform an analysis
Moreover, updates to the malware evasion time can be done on the fly when the next block of the blockchain is calculated, thus automating the evasive behavior upgrade.
CPU fingerprinting
With a very granular source of time, PoW algorithms – given their asymptotic behavior – may be used to fingerprint hardware platforms instead of simply distinguishing sandboxes from real hardware. The potential of this proposal is to perform such fingerprinting remotely as Sánchez-Rola [80] did, i.e., through the browser. This opens many new questions about user profiling, privacy and security. However, since the well-known micro-architectural bugs Spectre, Meltdown, and Foreshadow [48,59,100] the timer resolution granularity of browsers, normally accessed through the
Related work
There is a significant body of research [19,22,38,54,93,94,108] focusing on both designing novel evasion techniques for malware and also providing mechanisms to detect them. We next discuss the most relevant works related to ours.
Evading Malware Analysis. Packer identification tools such as Yara and others [1,3,78] use signatures to detect if an executable is packed with a specific previously known packer. Moreover, these tools can be used in combination with manually written static unpackers to enhance their functionality and customize their behaviour. Also, generic dynamic unpackers have been proposed [18,46,60] to avoid writing a specific static unpacker which may be useful for a single malware sample. Ugarte et al. [98] propose a taxonomy of packer complexity and perform an exhaustive study of custom and COTS packers. In the work by D’Elia et al. [29] a comprehensive description of anti-analysis techniques, there is one of the first description on how to counter stalling code which is one of the resilient anti-analysis technique and they claim the approach to dismantle artifacts code is manual.
Fingerprinting emulated environments. By recognizing the sandboxes of different vendors, malware can identify the distinguishing characteristics of a given emulated environment and alter its behavior accordingly. The work in [79] introduced the notion of red pill and released a short exploit code snippet that could be used to detect whether the code is executed under a VM or in a real platform. In [76], the authors propose an automatic and systematic technique (based on EmuFuzzer [61]) to generate red pills for detecting whether a program is executed inside a CPU emulator. In [62], the authors build KEmuFuzzer, which leverages protocol-specific fuzzing and differential analysis. KEmuFuzzer forces the hosting virtual machine and the underlying physical machine to execute specially crafted snippets of user- and system-mode code before comparing their behaviors. In [17] authors presented AVLeak, a tool that can fingerprint emulators running inside commercial antivirus (AV) software, which are used whenever AVs detect an unknown executable. The authors developed an approach that allows them to deal with these emulators as black boxes and then use side channels for extracting fingerprints from each AV engine. Instead, we show that even with completely transparent analysis programs, the real environment can be used by the malware to determine that it is under analysis. In [64] authors propose a ML-based approach to detect emulated environments. This technique is based on the use of features such as the number of running processes, shared DLLs, size of temporary files, browser cookies, etc. These features are named by the authors “wear-and-tear artifacts” and are present in real systems as opposed to sandboxes. The authors use such features to train an SVM classifier. We also rely on modeling a distinguishing feature, in our case is a time channel arising from the asymptotic behavior of a Pow, not the presence or absence of system artefacts.
In [37], authors introduce the virtual machine monitor (VMM) detection and they propose a fuzzy benchmark approach that works by making timing measurements of the execution time of particular code sequences executed on the remote system. The fuzziness comes from heuristics which they employ to learn characteristics of the remote system’s hardware and its configuration. In [77], authors analyze a number of possibilities to detect system emulators. Their results show that emulation can be successfully detected as easily as virtual machines, mainly because the task of perfectly emulating real hardware is complex. In [23], the authors present a technique that leverages TCP timestamps to detect anomalous clock skews in VMs. A downside of the approach is that it requires the transmission of streams of hundreds of SYN packets to the VM, something that can be detected in the case of a honeypot VM and flagged as malicious behavior. Finally, in [42] authors present a number of red pills based on timing side channels that are implemented in JavaScript and run in the browser, aiming to detect whether the browser is running inside a virtual machine. Compared to the previous approaches,
Detecting evasive malware. In [32], the authors propose Ether, a malware analyzer that eliminates in-guest software components vulnerable to detection. Ether leverages hardware virtualization extensions such as Intel VT, thus residing outside of the target OS environment. In [47], the authors present an automated evasive malware detection system based on bare-metal dynamic malware analysis. Their approach is designed to be transparent and thus robust against sophisticated evasion techniques. The evaluation results showed that it could automatically detect 5,835 evasive malware out of 110,005 tested samples. In [12], authors propose a technique to detect malware that deploys evasion mechanisms. Their approach works by comparing the system call trace recorded when running a malware program on a reference system with the behavior observed in the analysis environment. In [58], authors propose a system for detecting environment-sensitive malware by comparing its behavior in multiple analysis sandboxes in an automated way. Compared to previous techniques, our approach is agnostic to system artifacts and cannot be recognized by only monitoring the system operations.
Timing Side Channels. One of the most dangerous side channels is time, because algorithms have different complexity and hence require more computational power and time to execute. The seminal work on timing side-channels by Kocher [49] dates back to 1996 and the corpus of literature of these attacks is very huge, for the simple reason that everything that has a clock consumes time. To the best of our knowledge the most recent attack has been published by Huo et al. [44] against the Intel CPUs SGX Trusted Execution Environment (TEE) which has been found to be vulnerable to Foreshadow [100]. In general speculative execution features of CPUs expose caches to timing side channels [48,59], in
Other Side Channels. Power consumption is also a way to spot something “costly” is happening in a machine, the seminal work in this area appeared in 1999 by Kocher et al. [50]. PoW algorithms are well known to consume a lot of energy, the Bitcoin network and its PoW yearly consumes roughly 70 TeraWatts/hour on average, a figure comparable to a small country energy consumption e.g., Czech Republic [30]. In the arms-race of malware also such side channels can be exploited together with other features [64] as a defence mechanism or attack. Even though
Conclusion
Online malware scanning services are becoming more and more popular, allowing users to upload and scan artefacts against AV engines and malware analysis sandboxes. In this paper, we explore the usage of Proof-of-Work (PoW) as a malware evasion technique on the entire spectrum of computer architecture used in the malware analysis sandboxes context. In particular we analyzed two more fundamental computer architectures: Intel i7 and Apple M1 and we provided empirical evidence of how PoW can be used to evade real-world malware analysis sandboxes and we have tested six different malware families and recycled them as brand-new samples, beyond creating our own new variant. Since our techniques have been applied to the most common computer architectures our statistical results can be used as an upper bound for evading the state of the art of the virtualized hardware system in the malware analysis context.
Footnotes
Acknowledgments
This research was supported by the Spanish AEI grant ODIO (PID2019-111429RB-C21), by the Region of Madrid grant CYNAMON-CM (P2018/TCS-4566), co-financed by European Structural Funds ESF and FEDER, and by the Excellence Program EPUC3M17. The opinions, findings, conclusions, or recommendations expressed are those of the authors and do not necessarily reflect those of any of the funders.
