Abstract
Cache timing side channels allow a remote attacker to disclose the cryptographic keys, by repeatedly invoking the encryption/decryption functions and measuring the execution time.
Introduction
In practical implementations of cryptographic algorithms, the cryptographic keys could be leaked through side channels on timing [2,3,8,9,14,34,37,46,50,57,58], electromagnetic fields [29], power [10,49], ground electric potential [30] or acoustic emanations [31], even when the algorithm is semantically secure. Among these vulnerabilities, timing side-channel attacks are launched without any special probe devices. In particular, remote timing side channels allow attackers, who have no system privilege or physical access to the computer, to discover the keys by repeatedly invoking the encryption/decryption functions and measuring the execution time [9,14,17,18,46].
One type of remote timing side channels, called remote cache timing side channels in this paper, exploits the time differences of data-cache hits and misses. Such cache timing side channels are widely found in the implementations of block ciphers [4,7,9,53,58]. In the implementations, table lookup is the primary time-consuming operation in encryption/decryption. The overall execution time is influenced by the number of cache misses/hits in table lookup. Therefore, from the execution time, attackers are able to infer the inputs of table lookup operations which are determined by the secret keys (and the known plaintext/ciphertext). Generally, for a block cipher, encryption and decryption are symmetric computations with same basic operations. So we only emphasize on encryption in the rest of the paper, but all discussions and conclusions are applicable to decryption.
Compared with other side channels, such as power, electromagnetic fields, ground electric potential and acoustic emanations, cache-based timing side-channel attacks do not require special equipments or extra physical access to the victim system. Moreover, such remote attacks only require the least privilege to (remotely) invoke the encryption functions, while other active cache side-channel attacks involve operations by a malicious task that shares caches with the victim cryptographic engine [34,52,61,63].
Various methods are proposed against cache timing side-channel attacks for block ciphers, destroying the relationship between the secret keys and the execution time. The intuitive design is to perform encryption, a) with the lookup tables completely inside or outside caches, or b) in a constant period of time, regardless of cache utilization.
Another choice is to perform encryption without caches, which is inefficient. In fact, a typical
These two common mechanisms prevent cache-based timing attacks at different phases, but they introduce extra operations and significantly degrade the performance. In this paper, we investigate the performance of symmetric cryptographic functions that adopt
Each encryption is protected by
In order to optimize the performance,
When
The integrated scheme does not require any privileged operation on computer systems. The conditions to perform
We apply the integrated
To the best of our knowledge, this is the first to systematically analyze the optimal performance of integrating
We investigate the overheads of
We implement the optimal integration scheme on Linux with Intel CPUs for AES-128, and experimentally confirm that it (a) eliminates cache timing side channels, (b) outperforms other different integration strategies, and (c) works without any privileged operation on the system.
The remainder of this paper is organized as follows. Section 2 presents the background and related works. Section 3 discusses the
AES and block cipher implementation
AES is a block cipher with 128-bit blocks, and the key is 128, 192 or 256 in bits. AES encryption (or decryption) consists of several rounds of transformations and the number depends on the key length. Each round consists of
The last round of AES encryption performs only
The straightforward AES implementation needs four 1 KB tables in all rounds,
Four tables are encoded into a 2 KB lookup table in the compact implementation,
Implementations of other block ciphers, usually consist of table lookup operations and basic computations without data-dependent branches in the execution path, such as 3DES, Blowfish [54], CAST-128 [5] in OpenSSH-7.4p1 [47].
Caches, a small amount of high-speed memory cells located between CPU cores and RAM, are designed to temporarily store the data recently accessed by CPU cores, avoiding accessing the slow RAM chips. When the CPU core attempts to access a data block, the operation takes place in caches if the data have been cached (i.e., cache hit); otherwise, the data block is firstly read from RAM into caches (i.e., cache miss) and then the operation is performed in caches.
Cache timing side-channel attacks exploit the fact that accessing cached data is about two orders of magnitude faster than those in RAM, to recover the keys based on the execution time. Typically, it takes 3 to 4 cycles for a read operation in L1D caches, while an operation in RAM takes tens or hundreds of cycles [27].
Cache side-channel attacks on cryptographic engines are roughly divided into three categories: trace-driven, time-driven and access-driven attacks. The trace-driven attackers probe the variation of electromagnetic fields or power, to capture the profile of cache activities and deduce cache hits and misses [2]. In time-driven attacks, the adversary passively measures the execution time to disclose the secret keys. In these two categories of attacks, the cache states are changed by the victim cryptographic engine (and the system activities), while an access-driven attacker actively manipulates the cache states by running a malicious process that shares caches with the victim.
Time-driven side channel
Two typical remote cache timing side channel attacks on block cipher implementations are outlined as follows.
Bernstein’s attack. It statically analyzes the relation between the overall execution time and lookup table indexes [9]. The attacker firstly obtains the AES execution time for each possible value of
Internal collision attack. This attack exploits the fact that less time is needed for accessing the lookup table entries in the same cache line. Each 64-byte cache line has 16 4-byte entries, and in the first round of AES [14], the inputs of lookup table
Access-driven cache side channel
When the caches are shared by the attacker and the victim, the attacker could actively control the cache states and monitor the cache access by the victim process. Then, the keys are derived from the access patterns.
There are two categories of active cache side channels: synchronous and asynchronous [50]. In a synchronous attack, the attacker first sets the cache states, triggers the encryption, and monitors the memory access by detecting the change of cache states after the encryption. While in an asynchronous attack, the attacker sets the cache states and detects the changes by the victim process running concurrently. The synchronous attack monitors the cache state once per encryption, while the asynchronous one does it for several times per encryption with higher performance.
In synchronous Evict+Time attacks [50,57], the attacker sets the cache states as follows: triggering the encryption to load data into caches and explicitly evicting some from caches, which will be monitored. Then, the attacker triggers the encryption again, and records the time to infer whether the monitored addresses are accessed or not.
Prime+Probe [44,57] is launched in both scenarios. The attacker sets the cache states by filling one cache set with its own data, called Prime. Then, it accesses these data and measures the access time, to determine whether the monitored cache set is used by the encryption after the Prime phase, called Probe. Different cache sets are monitored one by one in synchronous attacks, or the access pattern of a cache set is detected in an asynchronous attack.
Flush+Reload [61] is another asynchronous threat that exploits shared memory. The attack process first flushes the shared memory data out of caches, and then measures the access time of some addresses in the shared memory to find whether they are accessed by the concurrent encryption or not (called Reload). Flush+Flush [32] and Prime+Abort [26] work in the asynchronous mode and exploit different methods to observe the cache access patterns. Flush+Flush is a variant of Flush+Reload, which monitors the cache state by flushing some addresses in the shared memory instead of reading them. Prime+Abort accesses the Prime data within Intel TSX, and then if some of them is accessed by the victim during the encryption, an abort occurs. So the attacker studies the cache access pattern through the abort events.
Defense against cache timing side channels
Countermeasures against the remote cache timing attacks have been proposed at different levels, and some of them also are effective against the active attacks.
Eliminating the difference of execution time. The straightforward method is to perform encryption without caches, by disabling caches [51], exploiting the no-fill cache mode [50,56] or loading lookup tables into registers [50]. But all of them result in very low performance. Bitsliced implementations [35,39,50] of AES without lookup tables, effectively defend against the cache timing attacks. However, they are specific to AES, and the performance is even lower.
AES-NI is a set of hardware instructions for AES, free of timing side channels. It is available only on Intel CPUs, and Intel provides the hardware support only for AES.
Loading lookup tables into caches before encryption [50,51,58], or using a compact table [34,50], eliminates cache misses remarkably, but not completely. There are still time differences exploitable to disclose the keys.
Confusing the cache access. Delay (or inserting padding instructions) after encryption, confuses the cache access leaked through the execution time. Random delay [50,51] or delay for the constant time of encryption [6,21,28,62], at the user-space or system level [20,22,64], were proposed. Instruction-based task scheduling [21,55,59] also confuses the cache access. These methods are algorithm-independent and effectively defend against the timing side-channel attacks, but result in a large performance overhead.
[51] suggests to access extra data arrays, while the permutation of lookup tables and caches is proposed in [13]. Rescheduling the instructions achieves the same goal, and is enforced at different levels [23,51]. The masking technique is used to the cache attack-resistant algorithm implementations [12,50]. The combination of blinding and delay is designed [40] to confuse the cache access against the cache timing side channels. These algorithm-dependent designs are not implementation-transparent, and also suffer from the performance problem to some extent.
Breaking the precision of timing confuses a local attacker’s observations [41,45,51], but it does not work against a remote attacker with its own timer.
Limiting the cache sharing. Cache partition prevents the interference of active cache side-channel attackers. Cache coloring partitions the caches at the operating system (OS) level [21]. STEALTHMEM [36] allows each VM to load the sensitive data into its own locked cache lines. Partition-locked caches are designed [38,60] to prevent the cache interference. Although they defend against the cache attacks, the usage of caches is significantly degraded and then such designs are rarely deployed on commodity systems.
CATalyst [43] and Cloak [25] employ CPU hardware features to limit the cache sharing. CATalyst [43] adopts Intel Cache Allocation Technology (CAT), a way-based cache-partitioning mechanism, to divide the caches into two partitions: non-secure one and secure one. The secure applications store sensitive data and codes in the secure cache partition which will never be evicted, against active cache side-channel attack. Cloak [25] uses hardware transactional memory (HTM) to provide non-shared caches for sensitive data and codes; that is, the sensitive codes and data are executed and accessed in HTM-backed transactions which will abort when attackers attempt to probe their cache access-patterns.
Integrated methods. Different methods are integrated to defend against the cache timing attacks. The scheme [16] integrates compact lookup tables, table permutation (or randomization), and cache line preloading, so that cache misses occur as little as possible while the relationship between the secret keys and the cache misses/hits is secret-independent. Dynamic instruction padding, isolated cache resources, and lazily-cleaned cache states are integrated against cache side-channel attacks [15]. These schemes are not implementation-transparent, or require special system privileges.
Eliminating remote cache timing side channels by integrating Warm and Delay
This section first introduces the threat model and the objectives. Then we describe the
Threat model
We consider the remote cache timing side-channel attacks on block ciphers which are implemented with lookup tables. The cryptographic algorithms and their implementation details are publicly known, but the keys are kept secret, which are the attack targets. The table-based implementations are very efficient and widely adopted by popular cryptographic engines, e.g., AES in OpenSSL and mbed TLS[42], 3DES and Blowfish [54] in GPG and SSH. In such implementations, no branches exist, and no operation except the table lookup operation results in the cache-based execution time variation.
This threat model is adapted in most of the typical scenarios of remote cache timing side-channel attacks. The remote attackers know the software/hardware configurations of the target system, including the OS, the cache hierarchy, etc. However, the running states of the system are unknown to the attackers, due to non-deterministic interrupts, task scheduling and other system activities. The attackers are able to invoke the encryption function arbitrarily – they could continuously invoke the function so that all lookup table entries are cached, or stop using the function for a long time so that all entries are highly likely to be evicted from caches. They are able to measure the overall execution time for each execution of the block cipher.
In the following analysis in Section 3 and Section 4, we first assume that the execution time variation is only caused by the cache hits/misses of lookup table entries, by ignoring the variation caused by the system architecture issues. Then, in Section 5.1, we demonstrate the mitigations to prevent the attacks exploiting the tiny time variation caused by these issues (e.g., cache replacement policy and cache structure) [9,19].
The cache timing side-channels by active attackers are out of the scope of this paper. That is, the attackers cannot run privileged or unprivileged processes on the target system to modify or probe the cache state. The attackers do not have physical access to the hardware, either.
Objective
This work aims to provide an in-depth investigation of the performance of integrating
Algorithm-independent. It is generally applicable to block ciphers implemented with lookup tables (so that they are vulnerable to cache timing attacks).
Implementation-transparent. It does not require modification to the source code of the encryption functions.
Unprivileged. It requires no privileged operations. It works well in user space or kernel space, to protect the computations in user mode and kernel mode.
More importantly, the objectives of our research are elaborated as follows:
We attempt to develop the first quantitative model to formally analyze the performance of block cipher implementations that integrate Follow the model, we examine the strategies of integrating In the optimal strategy, the parameters are configurable. By configuring appropriate parameters, the optimal performance is achieved for different block-cipher implementations and computing platforms.
The Warm+Delay scheme
There are two basic common countermeasures against remote cache timing side channels [9,19,34,50]: a)
A naive integration scheme is to perform both
When we consider that the cryptographic function is invoked continuously for large amounts of data, the time of each
The

The
In addition to
In general,
The timing side channels result from the relation between the measured execution time and the data processed; that is, different data processed (i.e., keys and plaintexts/ciphertexts) determines the lookup table access, resulting in different measured time as some table entries are in caches and others are not. We ensure that the measured time does not leak any exploitable information about the status of lookup tables in caches, and then destroy the relationship between the execution time and the data processed in encryption.
The execution time of
When applying the
The values of
In the following, we further explain that the integration scheme successfully prevents typical remote cache timing side-channel attacks in the literature. For Bernstein’s attack [9], although the attacker obtains the maximal AES execution time for each
To launch an internal collision attack [4,14], the attackers need to determine the least execution time for all possible
Towards the optimal performance of the Warm+Delay scheme
In this section, we apply the
Symbols in the performance analysis
Symbols in the performance analysis
In the
Of
What is the best strategy for the
What is the best strategy for the
When attempting to answer these questions, we focus on the optimization of the long-term performance. That is, our objective is to optimize the overall performance of encrypting large amounts of data, instead of the performance of encrypting one block (i.e., short-term performance).
The first question is relatively easy to answer. The
In our analysis (as in other implementations), an instruction loop imposes
Meanwhile, the
In the next, we first examine the optimal conditions for
The optimal conditions for Delay
To effectively eliminate the cache timing side channels, in the
From the attacker’s perspective,
If we delay the execution time to any value greater than
Performing the
As described above, if the encryption execution time is equal to or less than
When the execution time is in
It is easy to verify that, the
We apply the
The amount of data loaded in Warm
A
We denote the size of a cache line as C, and the time cost to load a cache line of data from RAM to L1D caches as
If
Not loading N cache lines of table entries saves the execution time of
If a table entry is uncached but accessed during encryption, the execution time is greater than
Benefit and expected cost of not loading N cache lines of table entries (in CPU cycles)
The execution time shall be delayed to
The time cost to load a cache line of data is
For the AES-128 implementation with a 2 KB lookup table, loading all entries of the lookup table into caches in the
We finished the experiments on a Lenovo ThinkCentre M8400t PC with an Intel Core i7-2600 CPU and 2 GB RAM.
On commodity systems, the cache is enough for all entries of the lookup tables. For example, AES is implemented with the lookup tables of 2 KB, 4 KB, 4.25 KB or 5 KB, and the lookup tables of DES/3DES are 2 KB. Meanwhile, the typical L1D cache is 32 KB or 64 KB for Intel CPUs, and 16 KB or 32 KB for ARM CPUs.
Moreover, in our
We use Markov Chain to analyze the states of the lookup table in caches during the continuous invocations of AES encryption. Then the extra costs are calculated based on the probability of each state in the stationary distributions, to determine the optimal
The state is estimated after each execution of AES encryption. The transition among three states is triggered by three events:
Without loss of generality, we assume that during one state transition, each of the events occurs once at most.3
Even if some occurs more than once actually, they can be viewed as one event with the strengthened impact on the state transition.
We consider three strategies:
Conditional
Less
More
Markov Chain is used to simulate the state transition. The state of the lookup table denoted as
Next, the stationary distributions for the three different strategies are analyzed as follow.

The Markov state transition probability diagram.
Conditional
In the state
In the state
We use
Less
In the state
In the state
The stationary distribution of three states satisfies with the following equations:
More
In the state
The stationary distribution of three states satisfies with the following equations:
We compare the performance of conditional
We denote the extra time cost introduced by
The conditional
The expected extra time costs introduced by less
We find that
Next, we compare conditional
The conditional
We compare It is drawn that,
Conditional
Firstly, consider an unusual strategy that performs Any It has been proved that conditional For the AES-128 implementation with a 2 KB lookup table, performing First of all, On this platform, the range of We find that when Finally, according to Theorem 6, performing
In the above analysis, we ignore some effects on caches by encryption execution, task scheduling and interrupts, and these effects are discussed as below. We also summarize the relation between the measured time and the cache states.
One encryption execution is a partial
Therefore, when the system is in
If task scheduling or interrupts occur during the encryption,
Table 3 shows the results on the Lenovo ThinkCentre M8400t PC, and performing
The introduced time by performing warmup operation or not (in CPU cycles)
The introduced time by performing warmup operation or not (in CPU cycles)
If task scheduling is triggered but the encryption execution is not suspended and resumed immediately,
The execution time does not exactly reflect the cache state. In the above, we prove that,
Therefore, performing
For other key lengths with different numbers of rounds and implementations with different sizes of tables, Theorems 1, 2, 3, 4, 5, and 6 still hold, but we need to re-calculate the equations (for example, in Theorem 3,
Different algorithms
For other block cipher implementations with lookup tables, Theorems 1, 2, 3, 4, 5, and 6 still hold. We still use
Performance evaluation
We evaluate the scheme on a Lenovo ThinkCentre M8400t PC with an Intel Core i7-2600 CPU and 2 GB RAM. The CPU has 4 cores and each core has a 32 KB L1 data cache, and the operating system is Linux v3.6.2 (32-bit).
Implementation
We apply
However, even when all table entries are in L1D caches, the execution time of encryption still has variations and these variations could be exploited to launched side-channel attackers [9,19]. The variations result from two types of time differences within L1D caches. One is the cache bank conflict. The L1D cache is divided physically into several banks. Each core has its own L1D cache and the banks are not across the L1D cache. Different banks can be concurrently accessed, but one bank only serves one request at a time. So, multiple access operations to the same cache bank are slower than those to different banks. Because the L1D cache is unshared and the L1D cache is not accessed by other cores, we disable Hyper-Threading of the system to ensure the protected process itself not to produce concurrent access to L1D caches with cache bank conflicts. In all the following experiments, Hyper-Threading is disabled.
The other cause is the read-write conflict that the load from L1D caches takes slightly more time if it involves the same set of cache lines as a recent store. Figure 2a shows the distribution of the AES execution time for

The distribution of AES encryption time with a 2 KB lookup table in L1D caches.

The relation between the delayed time and the number of instruction loops.
The cost of
We measure the time for different loop numbers of the

The implementation of

The distribution of AES execution time, with one cache line of data uncached.
The fixed parameters (
Figure 4 shows the distribution of AES execution time for
This section shows that, with the
The distribution of the encryption execution time with
Figure 5 shows the distributions of the AES execution time for two implementations, which will be observed by attackers. With

The distribution of the observed AES execution time with different plaintexts.
Performance of different
With low workload, no other task except the evaluated AES implementation runs. We use the SysBench benchmark to simulate the CPU and memory workloads. We run SysBench in its CPU mode, which launches 16 threads to issue 10K requests to search the primes up to 300K, to produce the CPU workload. For the memory workload, we adopt SysBench with 16 threads in its memory mode, which reads or writes 1 KB blocks each time to access total 3 GB data on one CPU core. The evaluated AES implementation and the concurrent workload run on the same CPU core. The interval of task OS scheduling is set as
Figure 6 shows that, in any case, conditional

AES execution performance of different
Comparison with other defenses.
Figure 7 shows that, hardware AES-NI has the best performance, and the

Performance of different defense methods.
Performance when integrated in HTTPS servers. We applied each solution to protect the AES implementation in OpenSSL, which serves as the cryptographic engine in an Apache HTTPS server. Apache serves several web pages of different sizes with TLSv1.2. The TLS cipher suit is ECDHE-RSA-AES128-SHA256. The client runs on another computer in 1 Gbps LAN with the server. ApacheBench issues 10K requests with various request sizes, and we measure the HTTPS server throughput.
The HTTPS throughput is shown in Fig. 8. When the unprotected AES is used, the throughput is 27.2 requests per second for 1 MB data. With

Performance of different defense methods in apache HTTPS servers.
Timing variations with fully-cached lookup tables
In the evaluation, we use the AES implementation with a 2 KB lookup table. The L1D cache of 32 KB, is typically divided into 64 cache sets and each set has 8 cache lines. The 2 KB lookup table occupies exactly one cache line in each of 32 cache sets. So we declare a 2 KB array using the other 32 cache sets as the stack, the address of which does not intersect with the lookup table module 4096.
However, when a larger lookup table is used, all cache sets will be occupied by default due to the continuity of lookup tables in memory. In these cases, to avoid the read-write conflict of cache sets between the look table and the stack, we shall reserve the lookup table within 32 cache sets. It will be achieved by discontinuous table entries. For example, when using 4 KB lookup tables (
Other cache-based timing side channels
The
The
Conclusion
This work investigates the integration of
Footnotes
Acknowledgments
This work was partially supported by National Natural Science Foundation of China (No. 61772518) and the Information Construction Project of the 13th Five-year Plan of Chinese Academy of Sciences (XXH13505). We would like to thank the associate editor and the reviewers for the insightful comments.
Proofs of Theorems 4 and 5
P a for different AES key lengthes and implementations
The size of a cache line is C bytes.
When the AES implementation uses 5 KB lookup tables,
When the AES implementation uses 4 KB lookup tables,
When the AES implementation uses 2 KB lookup table,
