Evolution of application-specific cache mappings

Abstract

Reconfigurable caches offer an intriguing opportunity to tailor cache behavior to applications for better run-times and lower energy consumption. While one may adapt structural cache parameters such as cache and block sizes, we adapt the memory-address-to-cache-index mapping function to the needs of an application. Using a LEON3 embedded multi-core processor with reconfigurable cache mappings, a metaheuristic search procedure, and MiBench applications, we show in this work how to accurately compare the non-deterministic performance of applications and how to use this information to implement an optimization procedure that evolves application-specific cache mappings for the LEON3 multi-core processor.

Keywords

Evolvable hardware, caches, Cartesian genetic programming

1. Introduction

One of the major challenges processor architects have to face is the so-called memory bottleneck: the disparity between the frequency of memory requests issued by processing units and the latency of DRAMs. To mask the memory access delays, multi-layered cache hierarchies were introduced in the 1970s [4], mimicking a high-speed and large main memory while using slow and inexpensive DRAM chips. This principle was used successfully for many decades before running into performance issues in recent years. For instance, processing large data structures fills caches with data lacking temporal locality, degrading the performance of other tasks executed on the same processor [11]. Varying memory access pattern types of applications executed by many-core systems interfere with each other, making it much more difficult for the cache to implement a coherent memory model efficiently. As a consequence, processor manufacturers have introduced reconfigurability to caches to adapt the cache behavior to the requirements of applications [12].

Reconfigurable caches have only recently found their way into off-the-shelf processors [11]. Research in this area has started earlier, with the rise of reconfigurable logic in the 90s. The primary motivation for reconfigurable caches is that while well-configured caches with fixed architecture perform well for a broad range of applications, it is sometimes desirable to change the configuration of the cache to handle applications with atypical memory access patterns more energy-efficiently or to tailor a cache to specific use cases such as numerical simulation.

The research on reconfigurable caches subdivides roughly into structurally reconfigurable caches and application-specific memory-to-cache-index mapping functions. The work on structural cache reconfiguration investigates the benefits of dynamically changing the geometry of the cache memories, i.e., the cache’s size, the number of ways, and the number of cache blocks. Work on reconfigurable memory-address-to-cache index functions has the goal of distributing accesses to cache memories more evenly for a reduced number of conflict misses. Usually, permutation- and single-level XOR-based functions, as well as prime-moduli and multi-stage mappings, are used in related work [15].

While reconfigurability helps caches to improve their performance, trade-offs such as larger logic areas, longer hit times, and a bigger overall number of memory cells may arise. The memory overhead is, to some extent, bearable since today, most processor designs are not restricted by silicon area but by performance and performance per energy. The increase in hit time is more critical. For many embedded processors with clock frequencies well below one GHz, the pressure on the timing is moderate. On the other hand, high-performance processors have several cache levels where only the first level is optimized for hit time. Here, the reconfigurability can be applied to higher-level caches more easily.

In this work, we present an FPGA-based implementation of a processor that can freely configure its memory-to-cache-index mapping functions. We show the challenge and the solution of how to reliably compare performances of non-deterministic applications and present an implementation of an optimization procedure that can evolve application-specific cache mappings.

2. Related work

A conventional cache consists of one or multiple separate memories, called ways, that are usually word addressable. $2^{l}$ consecutive words define a cache block and $2^{k}$ blocks define a way. Whenever a processor would like to access data, it puts the corresponding memory address

$\displaystyle A=[a_{n-1}\dots a_{0}]=[t_{m-1}\dots t_{0}][i_{k-1}\dots i_{0}][b_{l-1}\dots b_{0}][a_{1}a_{0}]$

onto the memory bus. The first-level cache checks whether it stores the requested data in one of its ways by splitting the memory address into the block offset

$\displaystyle B=[b_{l-1}\dots b_{0}]=[a_{l+2-1}\dots a_{2}],$

block or set index

$\displaystyle I=[i_{k-1}\dots i_{0}]=[a_{k+l+2-1}\dots a_{l+2}],$

and a tag

$\displaystyle T=[t_{m-1}\dots t_{0}]=[a_{m+k+l+2-1=n-1}\dots a_{k+l+2}].$

If one of the cache blocks in the set selected by the block index bits $I$ contains valid data and the stored tag is identical to the tag bits $T$ , the requested data is returned to the processor. Otherwise, the cache passes the request to the next memory stage.
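
The address decomposition above can be illustrated with a minimal Python sketch. The function name `split_address` and the default address width of 32 bits are assumptions made for this illustration; the parameters `k` and `l` follow the definitions in the text, and the two lowest bits select a byte within a word.

```python
def split_address(addr, k, l, n=32):
    """Split an n-bit byte address into (tag, index, block offset, byte offset).

    k: number of index bits (2**k blocks per way)
    l: number of block-offset bits (2**l words per block)
    """
    byte_off = addr & 0b11                        # [a_1 a_0]
    block_off = (addr >> 2) & ((1 << l) - 1)      # B = [a_{l+1} ... a_2]
    index = (addr >> (l + 2)) & ((1 << k) - 1)    # I = [a_{k+l+1} ... a_{l+2}]
    m = n - k - l - 2                             # remaining bits form the tag
    tag = (addr >> (k + l + 2)) & ((1 << m) - 1)  # T = [a_{n-1} ... a_{k+l+2}]
    return tag, index, block_off, byte_off


# Example: a 4 KB direct-mapped cache with 32-byte lines has
# 8 words per block (l = 3) and 128 blocks (k = 7).
tag, index, block_off, byte_off = split_address(0x12345678, k=7, l=3)
```

For the example address, the lowest 12 bits (`0x678`) hold the index and the offsets, and the upper 20 bits (`0x12345`) form the tag.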

In a non-conventional cache mapping scheme, the tag and index bits are transformed by two functions, $f$ and $g$ , before relaying their outputs to the cache. A common approach is, for instance, implementing $g$ as a permutation on the bits $I^{\prime}\subset T\cup I$ and setting $f=id((T\cup I)\setminus I^{\prime})$ . The elegance of this approach, investigated, for instance, by Givargis [7], Stanca et al. [20], and Patel et al. [16], is that shuffling the cache index lines does not require changing any of the other components of a cache, preserving the hit-times and implementation simplicity. An extension to the bit shuffling approach was investigated by Ros et al. in [17], where index bits $c^{\prime}$ are computed using the index and tag bits $[a_{n-1}\dots a_{k}]$ . In such an addressing scheme, the same index bits $[a_{m+k-1}\dots a_{k}]$ can be mapped to different cache sets. The index bits have, therefore, to be stored together with the tag bits $[a_{n-1}\dots a_{m+k}]$ for each block to distinctively identify a cache hit. Although the scheme of Ros et al. induces overheads on cache’s tag memory and tag comparators, it also has the potential to outperform the approach of Givargis and Patel et al.

Figure 1.

Hashing virtual memory addresses for indexing a TLB. The AMDAHL 470 V/6 Machine Reference Manual [6, 4].

To distribute accesses to cache lines more evenly, address bit shuffling can be combined with an additional translation function. Already in the 1970s, the first mainframes with virtual memories, such as the IBM 360 and the Amdahl 470, incorporated XOR-based mappings in their Translation Lookaside Buffers (TLB) [19, 4]. This helped to avoid congestion, as the OS assigned identical virtual address spaces to all executed applications. The Amdahl mainframes, for instance, randomized the memory-to-TLB-index mapping by XOR-ing the reversed application ID bits with the virtual page number bits (cf. Fig. 1).

The idea of hashing TLB index bits has been adopted by Vandierendonck and De Bosschere to caches [21, 22]. The authors have introduced a heuristic and an optimal algorithm for determining the input bits to the XOR planes computing the cache index depending on the memory access stream of a specific application. The authors have employed an SA-110 ARM processor model configured with a 4 KB L1 direct-mapped cache and benchmarks from the PowerStone, MediaBench, and MiBench suites. The presented cross profiling results are particularly interesting. These demonstrate the generalization behavior of the XOR mapping functions trained on a specific application and input data for different test input data sets. Two metrics are reported, the reduction of the miss rate and the reduction of the overall run time. While in nearly all cases, the miss rates were reduced over a modulo cache, the corresponding run times do not strictly follow this trend, and, occasionally, slowdowns were observed. In more recent work, Wang et al. extended this approach to the caches of a GPU in [24].
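
To make the idea of a single-level XOR index function concrete, the following Python sketch computes every index bit as the parity of a selected subset of address bits. It is an illustration under assumptions, not the algorithm of the cited authors: the helper names and the particular "folded" mask choice (each index bit XOR-ed with the tag bit located `k` positions above it) are hypothetical.

```python
def parity(x):
    """Return 1 if x has an odd number of set bits, else 0."""
    return bin(x).count("1") & 1


def xor_index(addr, masks):
    """Index bit j is the XOR of all address bits selected by masks[j]."""
    idx = 0
    for j, mask in enumerate(masks):
        idx |= parity(addr & mask) << j
    return idx


def modulo_masks(k, l):
    """Conventional mapping: index bit j is address bit l+2+j."""
    return [1 << (l + 2 + j) for j in range(k)]


def folded_masks(k, l):
    """Hypothetical XOR mapping: index bit j XOR the tag bit k positions above."""
    return [(1 << (l + 2 + j)) | (1 << (l + 2 + k + j)) for j in range(k)]


# Two addresses that are 4 KB apart collide in a 4 KB modulo cache (k=7, l=3)
# but are separated by the folded XOR mapping.
a, b = 0x1000, 0x2000
collide_modulo = xor_index(a, modulo_masks(7, 3)) == xor_index(b, modulo_masks(7, 3))
collide_folded = xor_index(a, folded_masks(7, 3)) == xor_index(b, folded_masks(7, 3))
```

Choosing which address bits feed each XOR gate is exactly the application-dependent decision that the heuristic and optimal algorithms mentioned above address.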

The function $g$ may also be implemented as a multi-stage function, e.g., $g_{1}$ and $g_{2}$ . In the case of a first-stage miss using $g_{1}$ , the next-stage function $g_{2}$ is evaluated at the same cache level to examine whether the requested memory cell is possibly stored at a back-up location. Agarwal et al. [2] investigated this idea in a conventional direct-mapped cache using the regular modulo ( $g_{1}$ ) and a second, non-modulo ( $g_{2}$ ) cache mapping. This “hash-rehash” scheme does not change the critical hit access time and can theoretically be extended to more rehash stages to mimic higher set-associativity [3]. In 1992, Seznec and Bodin presented a similar idea of having different mapping functions for each of the ways of a cache [18]. The authors demonstrated their idea of a “skewed-associative cache” for a two-way set-associative cache and mapping functions constructed by XOR-ing bits from the block address field. The authors were able to reduce the miss rate of a two-way set-associative cache to the level of a four-way set-associative cache with a negligible hardware overhead [18].

Another way of defining the memory-to-cache-index function is to use a different modulus. Diamond et al. [5] investigated prime and non-prime moduli to minimize bank conflicts of a GPU. Kim et al. [15] compared XOR-based permutation, polynomial modulus, prime modulus, and own indexing schemes for various GPGPU workloads. They were able to reduce computation time and energy consumption significantly.

Our approach can be seen as a methodological extension of the work of Vandierendonck et al. [23]. Similarly to Vandierendonck et al., we introduce a mapping function consisting of Boolean gates between the processor and the cache memories. However, we do not restrict the translation complexity to one level of XOR gates. Instead, the cache translation function is defined as a multi-level circuit composed of 2-input Look-up Tables (LUTs). The LUTs are embedded into a butterfly network, and their content is subject to an optimization algorithm. Such translation functions can compute any Boolean circuit of a certain size [10, 9, 14] and have the potential to outperform Vandierendonck’s approach. Additionally, as the LUTs’ content is reconfigurable at run-time, application-specific cache mappings can be realized naturally. That is, during an offline phase, a well-performing cache mapping for an application is evolved and embedded into a segment of the application’s binary. Later, in the regular operation phase of the application, the application’s optimized cache mapping is configured into the memory-to-cache-index translation block, e.g., on a task switch by the OS scheduler, and allows the application to execute with fewer cache misses.

This work is organized as follows. Section 3 presents the LEON3 processor implementation with reconfigurable cache address translation functions. The optimization method and the performance assessment procedure are presented in Sections 4 and 5. Section 6 shows how the optimization algorithm has been configured, and Section 7 presents the results of this paper. Finally, Section 8 summarizes the insights and concludes this work.

3. The reconfigurable cache mapping architecture

Figure 2.

Relevant components of a LEON3 core and our extensions to the LEON3 architecture, colored in grey. The instruction cache is indexed using virtual and the data cache using physical addresses.

Figure 3.

a. The butterfly network of an RCB with 4 $\times$ 3 LUTs. b. The architecture of a reconfigurable LUT and the reconfiguration architecture.

For this work, we are using the embedded open-source LEON3 processor with one level of private instruction and data caches per core [1]. An open-source L2 cache implementation was not available for LEON3 at the starting time of our work.

The reconfigurable cache mapping architecture consists of Cache Mapping Controllers (CMC), the Reconfiguration Controller (RC), and the Reconfigurable Blocks (RCB) (c.f. Fig. 2). The task of the CMCs that are located in each core is to relay reconfiguration requests from the cores to the RC and to manage RCBs’ cache mapping reconfiguration process. The RC gets and serializes reconfiguration requests from CMCs and uses DMA to fetch reconfiguration bitstreams for the RCBs efficiently. The RCBs implement the reconfigurable cache mapping functions. Each core has a set of three active RCBs (L1:I, L1:D, and the snooping mechanism of L1:D) and at least one set of shadow RCBs for masking the reconfiguration time. LEON3’s L1:I caches do not implement a coherent memory model and need no synchronization of modified cache blocks among the cores.

An RCB implements a 16 $\times$ 5 grid of 2-input LUTs embedded into a feed-forward butterfly network (cf. Fig. 3a). While Xilinx’ SRLC32E primitive can reconfigure LUTs at run-time, no such method exists in the public domain for run-time reconfiguration of FPGA’s routing. The butterfly network offers a solution to this situation. It allows a primary output to be a function computed on any of the primary inputs. The identification of an appropriate configuration of the LUTs is subject to the optimization algorithm.
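
The following Python sketch illustrates how such a grid of 2-input LUTs could be evaluated. The grid dimensions follow the text, but the concrete wiring is an assumption: here, the node in row `r` of column `c` reads row `r` and row `r XOR 2^(c mod 4)` of the previous column, which is one common butterfly pattern and not necessarily the one used in the RCB. The 16 input bits, the truth-table bit order, and all names are likewise hypothetical.

```python
ROWS, COLS = 16, 5             # grid size of one RCB as given in the text
PASS_A, XOR = 0b1010, 0b0110   # 4-bit truth tables; bit (a + 2*b) is the output


def eval_rcb(inputs, luts):
    """Evaluate the LUT grid column by column.

    inputs: list of ROWS bits fed into the first column
    luts:   luts[c][r] is the 4-bit truth table of the node in column c, row r
    """
    signals = list(inputs)
    for c in range(COLS):
        stride = 1 << (c % 4)  # assumed butterfly partner distance
        signals = [
            (luts[c][r] >> (signals[r] | (signals[r ^ stride] << 1))) & 1
            for r in range(ROWS)
        ]
    return signals


def identity_config():
    """All LUTs forward their first input, so the grid acts as plain wires."""
    return [[PASS_A] * ROWS for _ in range(COLS)]


cfg = identity_config()
cfg[0][0] = XOR                # row 0 now outputs in[0] XOR in[1]
bits = [1, 1, 0, 1] + [0] * 12
out = eval_rcb(bits, cfg)
```

With all LUTs set to pass-through, the grid reproduces its inputs, which corresponds to the conventional modulo mapping; changing individual truth tables turns selected outputs into functions of several inputs.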

The reconfiguration of cache mappings is done through a Linux driver. The driver allows userspace programs to load and read cache mapping configuration bitstreams and test their functionality. During the power-up of the LEON3 cores, all RCBs are initialized with the conventional modulo mapping to mimic the standard behavior of a cache. The applications’ performance when using custom cache mappings is metered through the Linux’ perf_tool command [8].

The implementation overheads for the presented modules are shown in the left part of Table 1. The baseline configuration is a LEON3 4-core processor synthesized with direct-mapped 4KB instruction and data caches. The implementation consumes 13 Distributed RAMs (DRAMs), where RAM32x1D primitives are used. There is an overhead for the implementation of cache memories/-controllers, as the comparators for hit/miss detection are wider. In the original LEON3 cache implementation, not all bits of Block RAMs (BRAM), which are used to store cache tags and blocks, are employed. This results in steady BRAM usage.

Table 1

Hardware resources used by a LEON3 core and the parameters of the LEON3 platform implementing reconfigurable cache mappings

Module	FFs	LUTs	DRAMs	BRAMs
RC	176	557	13 (RAM32x1Ds)	0
RCB & controllers	2972	1558	80 $\times$ 6 (SRL16Es)	0
Cache controllers
4KB, 1-way	969	2543	0	0
overhead	39.4%	23.8%	0.0%	0.0%
Cache tags and memories
4KB, 1-way	46	47	0	7
overhead	21.1%	17.5%	0.0%	0.0%

Generic system configuration
Parameter	Configuration
Clock frequency	50 MHz
Floating point	Hardware/Software
Memory	1GB DRAM
I/D-TLB	8 entries
Linux kernel	2.6.36.4 from Gaisler
Cache configuration (with/without FPU hardware: 4 cores)
L1:I & L1:D	4KB:1-way, {16,32}-bytes/line
Coherency	Snooping protocol

The right part of Table 1 summarizes the parameters of the prototype system. The prototype is implemented on a Xilinx ML605 board equipped with a Virtex-6 FPGA. The reconfigurable circuits are implemented for L1:I and L1:D caches.

4. Optimization methodology

To find well-performing cache mappings, we rely on a heuristic search method. The three main components of a heuristic are (a) a formal representation of a candidate solution, i.e., the encoding model, (b) functions that are able to combine and modify candidate solutions, i.e., variation operators, and (c) a metric that assesses the performance of a candidate solution, i.e., the objective function. This section presents the encoding model, the variation operators, and the heuristic procedure. The subsequent section shows how the performance of an application can be reliably assessed.

4.1 The encoding model for the bitstream of an RCB

Figure 4.

Cartesian Genetic Program (top), its encoding (bottom), and the set of nodes’ functions (right).

To encode a function consisting of a two-dimensional grid of reconfigurable LUTs that maps memory addresses to cache indexes, we employ the Cartesian Genetic Programming (CGP) model [13]. CGP is an evolutionary optimization technique and was invented to capture FPGA circuits. It encodes a circuit as a directed acyclic graph (DAG). An example of the encoding is illustrated in Fig. 4. The graph nodes are arranged as an $n_{c}\times n_{r}$ grid and are connected by feed-forward wires (cf. top of Fig. 4). Inputs to the graph are supplied by $n_{i}$ primary inputs, and the graph’s outputs are fed into $n_{o}$ primary output nodes. All nodes are enumerated column-wise and from left to right. A wire can be restricted to span at most $l$ columns. Each node may have up to $n_{a}$ in-going wires and computes a single output. The lengths of wires connecting primary outputs with inner nodes and inner nodes with primary inputs are not restricted.

The graph in CGP is encoded by a linear list of integers, which is called a genotype and shown at the bottom of Fig. 4. The first $n_{a}+1$ integers, also called genes, encode the $n_{a}$ input wires (connection genes) and the function of the upper left node of the grid. In Fig. 4, the upper left node with the index 3 connects to the primary inputs 1 and 0 and computes the function $f_{2}=OR$ . The encoding of node three is, therefore, “1 0 2”. The next $n_{a}+1$ integers encode the configuration of node four and so on. The encoding scheme proceeds row-by-row and from left to right until all inner nodes are specified. The $n_{o}$ primary outputs are encoded by $n_{o}$ connection genes.

We adopt CGP for the encoding of the configuration of an RCB (cf. Fig. 3b) by setting the configuration of connection genes such that the resulting wires define a butterfly network. The connection genes are then fixed and are not subject to the optimization algorithm. The encoding of the function genes of $n_{c}\times n_{r}=5\times 16=80$ nodes, which are 2-input LUTs in our RCB implementation, requires in total $80\times 2^{2}=320$ bits.

4.2 The variation operator

CGP’s variation operator is called mutation. Given the mutation rate and the size of the genotype, the operator computes how many genes have to be mutated. These genes are then selected randomly. A function gene is mutated by choosing a random function descriptor. A connection gene is mutated by randomly rewiring it to a preceding node within the range given by the levels back parameter.

We adopt CGP’s mutation operator by letting it mutate only the function genes. More precisely, given a mutation rate of, e.g., 1%, $\lfloor\frac{320}{100}\rfloor=3$ bits are selected randomly and flipped.
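
A minimal Python sketch of this restricted mutation operator is given below. It assumes that the function genes are stored as a flat list of bits and that the flipped positions are distinct; the function name and the optional `rng` parameter are introduced only for this illustration.

```python
import math
import random


def mutate(bits, rate, rng=random):
    """Return a copy of `bits` with floor(len(bits) * rate) distinct bits flipped."""
    n_flips = math.floor(len(bits) * rate)
    child = list(bits)                      # the parent stays untouched
    for pos in rng.sample(range(len(bits)), n_flips):
        child[pos] ^= 1
    return child


parent = [0] * 320                          # 80 LUTs x 4 truth-table bits
child = mutate(parent, rate=0.01, rng=random.Random(1))
```

For a 320-bit genotype and a mutation rate of 1%, exactly three bits differ between parent and child.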

Algorithm 1. ( $1+\lambda$ ) Evolutionary Algorithm

Input: initial candidate solution $p$
Output: best candidate solution $p$
while termination condition not met do
    $P\leftarrow\emptyset$
    for $i=1\dots\lambda$ do
        $c\leftarrow\text{mutate}(p)$
        evaluate $(c)$
        $P=P\cup\{c\}$
    end for
    $p\leftarrow\text{select-best}(\{p\}\cup P)$
end while
return $(p)$

4.3 The optimization algorithm

CGP is traditionally optimized by an ( $1+\lambda$ ), $\lambda=4$ , Evolutionary Algorithm (EA, c.f. Algorithm 1). When starting the optimization, the initial candidate solution (parent individual) is sampled randomly. After initialization, a loop is iterated for a predefined number of cycles or generations creating in each generation $\lambda$ new candidate solutions (off-spring or child individuals) by duplicating the bitstring of the parent and mutating it. Each off-spring individual is evaluated, and the best individual proceeds to the next generation as the new parent. In case the best off-spring individual has the same functional quality as the parent, the off-spring individual proceeds to the next generation.
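
The loop described above can be sketched in Python as follows. This is a simplified illustration: it assumes a scalar, deterministic fitness value that is maximized, whereas the selection in this work relies on statistical comparisons of measurement sets (Section 5). The toy objective (counting ones) and the single-bit mutation are placeholders chosen only to make the sketch runnable.

```python
import random


def one_plus_lambda(initial, evaluate, mutate, lam=4, generations=100):
    """(1 + lambda) EA; offspring replace the parent if they are at least as good."""
    parent, parent_fit = initial, evaluate(initial)
    for _ in range(generations):
        offspring = [mutate(parent) for _ in range(lam)]
        scored = [(evaluate(c), c) for c in offspring]
        best_fit, best = max(scored, key=lambda s: s[0])
        if best_fit >= parent_fit:          # ties favor the offspring
            parent, parent_fit = best, best_fit
    return parent, parent_fit


rng = random.Random(0)


def flip_one(bits):
    """Placeholder mutation: flip a single randomly chosen bit."""
    child = list(bits)
    child[rng.randrange(len(child))] ^= 1
    return child


start = [0] * 32
best, fit = one_plus_lambda(start, sum, flip_one, lam=4, generations=200)
```

Accepting offspring of equal quality lets the search drift across plateaus of the fitness landscape instead of stalling on the current parent.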

5. The challenge of an accurate performance estimation

Figure 5.

The distributions of Misses Per 1000 Instructions (MPKI) measured for the conventional (left column) and two randomly sampled cache mappings (center and right columns). The L1:D cache is configured as a 4 kB direct-mapped and physically-addressed cache. The experiments are conducted for the CJPEG application and four different input images (vectors 0 …3).

When metering a deterministic application, the general assumption is usually that the experiment will produce either deterministic measurements or measurements with negligible deviations. This assumption often holds for many of the related works that are using software simulators for system analysis. In our case, however, execution of an application on a multi-core system in an environment where concurrent applications are competing for the resources such as disk IO, cache space, processing time, and network bandwidth results in measurement variations that make comparing different computing architectures difficult. An example is given in Fig. 5, where sub-plots are showing histograms for L1:D miss-rate measurement variations of the CJPEG executable employing three different cache mapping functions and applied on four different images. The $x$ -axes of the subplots are showing the average number of Misses Per 1000 Instructions (MPKI)

$\displaystyle\text{MPKI}=\frac{M}{IC}\times 1000,$

where $M$ and $IC$ are the numbers of misses and retired instructions, respectively. The $y$ -axes show the frequencies of particular MPKI measurements. The left column plots the MPKI distributions for the standard cache mapping, and the central and right columns plot them for two randomly generated cache mappings. The MPKI distributions in each row are computed using a different benchmark image.

The measurements in Fig. 5 were obtained in experiments in which we pinned CJPEG to a dedicated core and reduced the execution of concurrent applications (context switches) to a single-digit rate. Despite all measures, and this is the first observation, MPKI deviations for the original and unchanged LEON3 processor executing a JPEG benchmark have not vanished completely, as can be observed in the left column of Fig. 5. The variation magnitude corresponds to the number of cache blocks evicted by concurrent applications and Linux kernel service functions. As these interference sources can hardly be eliminated on a multi-tasking system, we assume that this is the most accurate MPKI measurement precision that can be achieved on our system on average.

The deviation magnitude of MPKI measurements increases for the two randomly generated cache mappings (middle and right-hand column in Fig. 5). The mechanism behind this observation originates in Linux’s virtual-to-physical page mapping randomization. To prevent malicious software from reliably guessing physical pages with sensitive application data, virtual pages are allocated to randomly selected physical pages. If one would record the issued physical addresses of two benchmark executions, only the lower address bits indexing within a physical page would match. Higher bits would be randomized by Linux’s virtual-to-physical page mapping obfuscation mechanism. We have configured the L1:D cache to be of the same size as a physical page. This means that the conventional cache mapping operates only on non-randomized address bits, i.e., the cache mapping experiences for every benchmark execution the same sequence of input bit vectors. This is different for the randomized cache mappings, as these mappings are defined on all address bits and are experiencing for every benchmark execution addresses with varying upper bits. Consequently, cache blocks are indexed in a different order for every CJPEG execution, which results in a changing number of cache block evictions and miss rates. The middle and right columns in Fig. 5 show how Linux’ virtual-to-physical page randomization increases the variance of MPKI values compared to baseline MPKI distributions (left column in Fig. 5). Given these observations, the conclusion is that the typical performance of a cache mapping needs to be derived from a set of measurements.
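
The effect can be reproduced with a small Python sketch. It assumes 4 KB pages, a 4 KB direct-mapped cache with 32-byte lines, and a hypothetical non-conventional mapping that XORs the index with the lowest bits of the physical frame number; the frame numbers stand in for two different page placements chosen by the kernel in two runs.

```python
PAGE_BITS = 12                 # 4 KB pages
K, L = 7, 3                    # 128 blocks, 8 words per block (4 KB cache)


def modulo_index(paddr):
    """Conventional mapping: uses only bits inside the page offset."""
    return (paddr >> (L + 2)) & ((1 << K) - 1)


def hashed_index(paddr):
    """Hypothetical mapping that also depends on the physical frame number."""
    frame_bits = (paddr >> PAGE_BITS) & ((1 << K) - 1)
    return modulo_index(paddr) ^ frame_bits


def physical(frame, offset):
    """Compose a physical address from a frame number and a page offset."""
    return (frame << PAGE_BITS) | offset


# The same virtual page lands in different physical frames in two runs.
run1 = physical(0x111, 0x678)
run2 = physical(0x2A7, 0x678)
same_modulo = modulo_index(run1) == modulo_index(run2)
same_hashed = hashed_index(run1) == hashed_index(run2)
```

The modulo index is identical in both runs because it depends only on the page offset, while the hashed index changes with the frame number, which is the source of the additional variance discussed above.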

To compare populations of measurements, one usually resorts to statistical methods. The very first question is whether the samples of a population are normally distributed. This leads us to the second observation: The histograms in the central and left columns of Fig. 5 already suggest that MPKI values of some cache mappings do not follow the normal distribution. To verify this, we have tested during an optimization run the MPKI distributions generated by 7508 cache functions using the Shapiro-Wilk, Kolmogorov-Smirnov, and Anderson-Darling tests at the confidence level of $1-\alpha=$ 95%. The tests confirm that 47.9%, 41.1%, and 46.7% of MPKI measurements do not follow the normal distribution.

The third observation is that MPKI distributions are not only non-normal but also differ from each other. The central and right columns in Fig. 5 show that the histogram shapes are dissimilar and have differently distributed peaks. This is a frequent finding throughout our experiments. When distributions are different, non-parametric tests that compare central tendencies while relying on the prerequisite of identical distributions can no longer be used ( $\text{CDF}(M(t))\neq\text{CDF}(N(t-\Delta))$ , where CDF is the cumulative distribution function). To compare the performance of cache mappings, we resort, therefore, to the Wilcoxon rank-sum test, also called the Mann-Whitney U test, which is sensitive to differences between distributions. The comparison methodology is presented in the following section.

5.1 Functional quality computation procedure

The search for good performing cache mapping functions has to ensure that candidate solutions excel for a wide range of potential input vectors for some application $a$ . We, therefore, select a set $V$ of different and representative input vectors to evaluate the performance of $a$ ’s candidate cache mapping $f$ during the training phase. The non-determinism of performance evaluation requires us to record for every tuple $(a,f,v_{i})$ , $v_{i}\in V$ , a set of experiments. The reference performance of the modulo cache mapping function $f_{\text{mod}}$ on $v_{i}$ for $a$ is therefore captured by the set

$\displaystyle R(a,v_{i},n_{\text{max}})=\bigcup_{j=1}^{n_{\text{max}}}\text{mpki}(a,f_{\text{mod}},v_{i}).$

$n_{\text{max}}$ is the number of executions of the application $a$ and its candidate cache mapping $f$ on an input vector $v_{i}$ . With $\tilde{R}(a,v_{i},n_{\text{max}})$ as the median MPKI value of $R(a,v_{i},n_{\text{max}})$ , the inverse normalized MPKI performance $m$ of a cache mapping function $f$ on the input vector $v_{i}$ is defined as

$\displaystyle m(a,f,v_{i})=\frac{\tilde{R}(a,v_{i},n_{\text{max}})}{\text{mpki}(a,f,v_{i})}.$

Recording $m(a,f,v_{i})$ for all $v_{i}\in V$ $n_{\text{max}}$ times defines the set of performance numbers $M(f)$ that can be used for statistically comparing cache mapping functions:

$\displaystyle M(f)=M(a,f,V,n_{\text{max}})=\bigcup_{i=1}^{|V|}\bigcup_{j=1}^{n_{\text{max}}}m(a,f,v_{i}).$
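
As a minimal sketch of these definitions, the following Python code computes MPKI values and the pooled set of inverse normalized measurements. The function names, the dictionary layout (input vector name mapped to a list of repeated MPKI measurements), and the sample numbers are assumptions made for this illustration and are not measured data.

```python
from statistics import median


def mpki(misses, instructions):
    """Misses per 1000 retired instructions."""
    return 1000.0 * misses / instructions


def inverse_normalized(reference_mpki, candidate_mpki):
    """Pool m(a, f, v_i) over all input vectors and repetitions.

    reference_mpki: {vector: [MPKI of the modulo mapping, one per run]}
    candidate_mpki: {vector: [MPKI of the candidate mapping, one per run]}
    Values above 1 mean fewer misses than the median reference run.
    """
    samples = []
    for vector, runs in candidate_mpki.items():
        ref = median(reference_mpki[vector])   # median of R(a, v_i, n_max)
        samples.extend(ref / value for value in runs)
    return samples


# Illustrative numbers only.
reference = {"img0": [10.0, 12.0, 11.0]}
candidate = {"img0": [5.5, 11.0, 22.0]}
m_values = inverse_normalized(reference, candidate)
```

Normalizing by the per-vector reference median puts measurements from different input vectors on a common scale before they are pooled into one set.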

As motivated previously, we have selected the Wilcoxon rank-sum test to compare cache mappings. When comparing the sample populations $M(f_{i})$ and $M(f_{j})$ of two candidate cache mappings $f_{i}$ and $f_{j}$ , the hypotheses of the left-tailed Wilcoxon rank-sum test state for the CDFs $F(M(f_{i}))$ and $F(M(f_{j}))$ :

$\displaystyle H_{0}:F(M(f_{i}))=F(M(f_{j})),$

$\displaystyle H_{1}:F(M(f_{i}))\leqslant F(M(f_{j})).\qquad(1)$

Under the null hypothesis $H_{0}$ , we interpret that distributions of objective values for two cache mapping functions $f_{i}$ and $f_{j}$ are equal at the confidence level of $1-\alpha=$ 95%. If $H_{0}$ is accepted, we break the tie by selecting the cache mapping with the better median as the winner. If $H_{0}$ is rejected, $H_{1}$ implies that the distribution of MPKI ${}^{-1}$ values of $f_{j}$ is shifted to the right of $f_{i}$ . $f_{j}$ produces, therefore, lower MPKI values and is preferable over $f_{i}$ .
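
A one-sided rank-sum comparison of two measurement sets can be sketched in pure Python as shown below. This sketch uses the normal approximation with tie and continuity corrections instead of exact $p$ -values, which is an assumption of this illustration; the text does not state which implementation of the test was used, and the function name is hypothetical.

```python
import math


def rank_sum_less(x, y):
    """One-sided Mann-Whitney U test (normal approximation).

    A small p-value suggests that values in x tend to be smaller than in y.
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    tie_term = 0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0        # mean of the ranks i+1 .. j
        for t in range(i, j):
            ranks[t] = avg_rank
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    n1, n2 = len(x), len(y)
    n = n1 + n2
    r1 = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u1 = r1 - n1 * (n1 + 1) / 2.0           # U statistic of sample x
    mean = n1 * n2 / 2.0
    var = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        return 1.0                          # all values identical: no evidence
    z = (u1 - mean + 0.5) / math.sqrt(var)  # continuity correction
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


low = list(range(16))
high = list(range(16, 32))
p_less = rank_sum_less(low, high)           # low is clearly shifted to the left
p_not_less = rank_sum_less(high, low)
```

Applied to sets of inverse normalized MPKI values, a $p$ -value below $\alpha=0.05$ for `rank_sum_less(M_child, M_parent)` would indicate that the child mapping tends to perform worse than the parent.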

5.2 The early-stop strategy

The run-time of an optimization algorithm is typically dominated by the complexity of the performance assessment of a candidate solution. The cache mapping’s performance is defined in our work as a set $M(f,V,n)$ with up to $|V|\cdot n_{\text{max}}=4\cdot 64=256$ measurements.

During the initial experiments, we observed that most of the off-spring individuals had a lower performance than their parent. We use this observation to minimize the number of candidate solution executions $n$ when computing $M(f,V,n)$ . For this, we stop after, e.g., $n=4$ measurements and perform an early test of whether the new candidate cache mapping is inferior to its parent and can be dropped. If inferiority cannot be confirmed, the next batch of executions is performed (e.g., $n=4\rightarrow 8$ ), and the test is repeated. This procedure iterates until the upper bound for the maximal number of executions (e.g., $n_{\text{max}}=64$ ) is hit. In case no statistical difference between candidates can be identified, the cache mapping with a better median is selected as the winner. With the presented idea in mind, we have formalized the following “early-stop” strategy that minimizes the complexity of a performance comparison for underperforming off-spring candidates:

Figure 6.

Statistical Wilcoxon rank-sum test: The upper triangular parts show the CDFs for all pair comparisons of a (1 $+$ 4) EA. The lower triangular parts show the $p$ -values returned by the test for two one-sided alternatives: left-tailed test (or “less” comparison) and right-tailed test (or “greater” comparison). The diagonal shows the distribution of $m(f,V)$ values.

For, e.g., a (1 $+$ 4) EA, the performance evaluation of 4 off-spring candidates subdivides into two phases. In the first phase, all off-spring individuals are evaluated on $|V|=4$ different input vectors $n=16$ times. The resulting measurements $M(f_{i},V,16)$ , $i=1,\dots,\lambda=4$ , are compared to the performance measurements of the parent individual $M(f_{\text{parent}},V,n_{\text{max}})$ with $n_{\text{max}}=64$ using the left-tailed Wilcoxon rank-sum test.

Off-spring individuals are removed if the test rejects the null-hypothesis for them.

The remaining individuals are executed again on four input vectors 16 times and the resulting sets $M(f_{i},V,32)$ are tested for inferiority against $f_{\text{parent}}$ deleting underperforming solutions. The procedure is repeated for $n=32\rightarrow 40\rightarrow 48\rightarrow 56\rightarrow 64$ until either $M(f_{i},V,64)$ has been computed for all remaining off-spring candidates or no off-spring candidates remain.

If at least one off-spring candidate remains after the first phase, children and the parent are joined into a common set $P$ and the following procedure is repeated until $|P|$ becomes one: A random 2-tuple is selected from $P$ and left- as well as right-tailed Wilcoxon rank-sum tests are applied. If one of the tests rejects $H_{0}$ , the according candidate solution is inferior and is removed. If the tests confirm $H_{0}$ , the candidate solution with the worse median is removed. After all individuals except one have been removed from $P$ , the remaining candidate becomes the new parent for the next generation.
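
The first phase of this strategy can be sketched in Python as follows. The measurement routine and the inferiority test are passed in as callables, so the sketch does not fix a particular statistical test; the function name, the callable signatures, and the stub values in the example are hypothetical. The batch schedule follows the values given in the text.

```python
from statistics import median


def early_stop(measure, parent_samples, is_inferior,
               schedule=(16, 32, 40, 48, 56, 64)):
    """Collect measurements in batches and stop once the candidate is inferior.

    measure(n_new):          returns the pooled measurements of n_new further
                             executions on every input vector
    is_inferior(cand, par):  True if the candidate samples are significantly
                             worse than the parent samples
    Returns (samples, survived).
    """
    samples, done = [], 0
    for n in schedule:
        samples.extend(measure(n - done))   # only run the missing executions
        done = n
        if is_inferior(samples, parent_samples):
            return samples, False           # drop the candidate early
    return samples, True                    # candidate enters the second phase


# Stub example: 4 input vectors, constant measurements, median-based test.
parent = [1.0] * (4 * 64)
worse = lambda n_new: [0.5] * (4 * n_new)
better = lambda n_new: [1.5] * (4 * n_new)
stub_test = lambda cand, par: median(cand) < median(par)

dropped_samples, dropped_ok = early_stop(worse, parent, stub_test)
kept_samples, kept_ok = early_stop(better, parent, stub_test)
```

In this stub, the inferior candidate is discarded after the first batch of 16 executions per input vector, while the better candidate is measured up to the full 64 executions.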

To illustrate the procedure, we have collected sample data from one generation of a CJPEG optimization run. Figure 6 shows a side-by-side comparison of CDFs for all 2-tuples (parent and four children) in the upper-right triangle, the distribution of $m(f,v_{i})$ values for the five individuals in the diagonal and the $p$ -values generated by the Wilcoxon rank-sum test in the lower-left triangle. When analyzing the CDF comparisons of the off-spring individuals with the parent in the top row, one can observe that only the CDF of the fourth individual is shifted to the left of the CDF of the parent. This indicates that the cache mapping function no. 4 usually produces CJPEG runs with higher L1:D miss rates than those produced by the parent cache mapping. The significance of this observation is confirmed at the $1-\alpha=$ 95% level in the corresponding $p$ -value diagram in the lower-left corner. This means that the computation of $M(f_{4},V,n)$ was stopped during the first phase, and unnecessary CJPEG executions could be avoided. The CDFs for all remaining individuals are not or not as clearly left-shifted from the CDF of the parent as is the case for individual no. 4. The $p$ -values of the left-sided rank-sum tests confirm this (left column of Fig. 6). Consequently, individuals 1–3 survive into the second phase where the rank-sum test establishes the following order: off-spring 2 and the parent are mutually non-dominant. At the same time, candidate 1 dominates the parent and candidate 2, and candidate 3 dominates all other individuals. In consequence, the off-spring 3 becomes the new parent for the next generation.

Figure 7.

Breakdown of the distribution of the cumulated number of CJPEG executions on four input vectors before a sufficient amount of data is collected for the rank-sum test to make a decision.

Figure 7 shows the potential of the “early-stop” strategy. It plots how many CJPEG executions are required to make a comparison. The data was collected in three CJPEG (1 $+$ 4)-EA runs with $|V|=4$. In the breakdown, one can see that most of the decisions (84.9%) can be made after 32 executions of CJPEG for each of the four input vectors. After 40 executions, another 13.2% of the decisions can be made. Only the remaining 1.9% of the decisions need more or fewer CJPEG executions to collect a sufficient amount of data for statistical discrimination. On average, 134.24 CJPEG executions are required for a comparison, which reduces our initial optimization times by roughly 50% to 75%.
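As a rough consistency check of these numbers, assume that "32 (or 40) executions for each of the four input vectors" means $4\cdot 32=128$ (or $4\cdot 40=160$) executions per comparison. This reading is an assumption and is not stated explicitly in the text. The two dominant cases then contribute

$$0.849\cdot 128+0.132\cdot 160=108.67+21.12=129.79$$

executions to the mean. The difference to the reported average, $134.24-129.79\approx 4.45$ executions, would accordingly be contributed by the remaining 1.9% of the decisions.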

6. Setting up the optimization algorithm

The main parameters responsible for the convergence rate of the $(1+\lambda)$ EA employed in this work are the mutation rate and $\lambda$ . In this section, we prepare our main experiments by examining test runs and adjusting $\lambda$ and the mutation rate.

6.1 Configuring the $(1+\lambda)$ evolutionary algorithm

Figure 8.

(1 $+$ 4) EA vs. (1 $+$ 1) EA: (1 $+$ 1) EA needs fewer candidate evaluations to reach similar quality regions as (1 $+$ 4) EA.

We tested the EA using different values for $\lambda$ on the CJPEG benchmark. Each configuration has been executed three times for 1000 generations. The results for the conventional setting ($\lambda=4$) and for the best-performing $\lambda$ are presented in Fig. 8. The $x$-axis shows the number of evaluated candidate solutions and the $y$-axis the functional quality, which is maximized. One can see that, after 1000 generations, (1 $+$ 1) EA is on par with or slightly better than (1 $+$ 4) EA regarding the functional quality. However, (1 $+$ 1) EA requires far fewer evaluations of candidate cache mappings to reach the quality levels of (1 $+$ 4) EA. Therefore, $\lambda=1$ is our choice for the main experiments in Section 7.
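To make the evaluation budget explicit, the following minimal sketch implements a generic $(1+\lambda)$ EA that counts candidate evaluations. It is an illustration under assumptions: a deterministic toy fitness (number of ones in a bit tuple) stands in for the statistical comparison of cache mappings, offspring with equal fitness are accepted, and the helper names `one_plus_lambda`, `flip_one`, and `ones` are hypothetical.

```python
import random


def one_plus_lambda(fitness, mutate, parent, lam, generations, rng):
    """Maximize `fitness`; returns (best, best_fitness, evaluations)."""
    best_fit = fitness(parent)
    evaluations = 1
    for _ in range(generations):
        # All lam children are derived from the same parent.
        children = [mutate(parent, rng) for _ in range(lam)]
        for child in children:
            fit = fitness(child)
            evaluations += 1
            # Assumption: equal fitness is accepted (neutral drift).
            if fit >= best_fit:
                parent, best_fit = child, fit
    return parent, best_fit, evaluations


def flip_one(bits, rng):
    """Toy mutation: invert one randomly chosen bit of a tuple."""
    i = rng.randrange(len(bits))
    return bits[:i] + (1 - bits[i],) + bits[i + 1:]


def ones(bits):
    """Toy fitness: number of ones."""
    return sum(bits)


if __name__ == "__main__":
    start = (0,) * 32
    for lam in (1, 4):
        _, fit, evals = one_plus_lambda(
            ones, flip_one, start, lam, 50, random.Random(0))
        print(f"lambda={lam}: fitness={fit}, evaluations={evals}")
```

For a fixed number of generations, the number of evaluations grows linearly with $\lambda$, which is why Fig. 8 compares the two variants over evaluated candidates rather than over generations.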

6.2 Finding a good mutation rate

Figure 9.

Amount of mutated genetic material that led to improved L1:D cache mappings.

Figure 10.

Amount of mutated genetic material that led to improved L1:I cache mappings.

Our strategy for finding well-performing mutation rates is to let the mutation operator sample the amount of mutated genetic material each time freely between 0% and 25%. The mutation rates that created better off-spring candidates during the optimization runs were recorded.

Figures 9 and 10 present stacked distributions of the amount of mutated genetic material that led to improved L1:D and L1:I cache mappings. The first general observation is that mutating less genetic material is more effective than mutating many genes. The second observation is that roughly half of the successful mutations are the consequence of changing up to three genes. Based on these insights, we have defined gene mutation probability distributions of 48%, 31%, 21% and 46%, 31%, 23% for flipping 1, 2, and 3 genes when mutating data and instruction cache mappings, respectively.
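A mutation operator that follows these distributions could look like the sketch below. The probabilities are taken from the text; the representation of the genome as a flat list of binary genes and the names `GENE_COUNT_WEIGHTS` and `mutate` are assumptions made for illustration only.

```python
import random

# Probabilities of flipping 1, 2, or 3 genes, as given in the text.
GENE_COUNT_WEIGHTS = {
    "L1:D": (0.48, 0.31, 0.21),
    "L1:I": (0.46, 0.31, 0.23),
}


def mutate(genome, cache, rng):
    """Return a mutated copy of `genome` with 1 to 3 genes flipped."""
    # Draw the number of genes to flip from the cache-specific distribution.
    k = rng.choices((1, 2, 3), weights=GENE_COUNT_WEIGHTS[cache])[0]
    child = list(genome)
    # Pick k distinct gene positions and flip them.
    for pos in rng.sample(range(len(child)), k):
        child[pos] ^= 1  # assumption: binary genes, flipping inverts the bit
    return child
```

Sampling distinct positions guarantees that exactly the drawn number of genes differs between parent and child.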

Table 2

Evolution of custom cache mappings for a 4 KB direct mapped cache. Normalized median training and test miss-rate reductions (red.[%]) of the best evolved cache mapping of an application. Median absolute and relative numbers of miss-rates during testing ( $\times 10^{6}$ , rate[%]). Median run-time reductions during testing

Application	L1:D					L1:I
	Training	Testing				Training	Testing
	Cache misses				Run-time	Cache misses				Run-time
	red.[%]	$\times 10^{6}$	rate[%]	red.[%]	red.[%]	red.[%]	$\times 10^{6}$	rate[%]	red.[%]	red.[%]
CJPEG	32.7	9.5	41.1	35.2	16.8	12.4	2.2	1.9	$-$ 0.6	$-$ 1.6
DJPEG	3.6	3.7	32.0	3.3	2.1	64.5	2.7	5.2	67.7	8.7
FFT	10.1	0.7	10.2	5.0	0.2	4.6	6.9	16.3	4.7	2.6
DBLOWFISH	6.5	0.1	13.6	4.8	2.0	30.7	0.01	0.5	19.6	0.0
PATRICIA	10.7	3.9	21.4	6.4	0.2	3.0	31.0	25.7	3.3	3.1
DIJKSTRA	0.6	2.3	22.4	0.4	0.9	13.6	2.2	4.5	10.6	9.7
RIJNDAEL-EN	2.7	7.8	41.2	$-$ 12	$-$ 3.1	15.6	29.2	36.0	15.7	18.8
RIJNDAEL-DE	6.2	7.6	42.3	$-$ 48.9	$-$ 12.5	14.6	24.2	31.6	14.5	26.3
CRC32	4.5	4.7	9.3	$-$ 16.6	$-$ 2.1	33.3	32.2	16.0	33.3	37.1

7. Experiments

To employ custom cache mappings, an application has to evolve these mappings in a preparatory step. Once the custom mappings are evolved, no further optimization is required. Later, during the regular operation of the application, its custom cache mappings can be activated by the task scheduler on a context switch.

To simulate the lifetime of a custom cache mapping, we subdivide the experiments into two phases. In the first, the training phase, custom cache mappings are evolved and stored along with the application’s binary. The results of this phase are presented in Section 7.2. In the second, the validation phase presented in Section 7.3, the regular operation of an application is simulated. The application is executed under typical conditions, and its performance is measured using the conventional and the custom cache mappings.

7.1 The experimental system

The overall experimental system consists of a host computer carrying out the EA and distributing candidate cache mapping evaluation jobs to multiple FPGA boards. When a client FPGA board receives an application, its candidate cache mapping, and the application’s input data vector, the cache mapping circuits are reconfigured on all affected LEON3 cores. Then the client process calls fork() and migrates to the target cores. There, the memory image of the process is duplicated by Linux’s perf, which executes the target application with the corresponding input vector. Once the execution is finished, perf returns the collected metrics to the host process.

We evolve cache mappings for applications from the MiBench suite. For each application, data and instruction cache mappings are evolved in separate experiments. Three optimization runs are executed for 3000 generations for each application, and the best-performing cache mapping out of these runs is selected and tested in the validation phase. During the optimization, the performances of candidate cache mappings are evaluated on four different training data vectors, and the best-evolved cache mappings are evaluated on ten different validation data vectors. In the training phase, an application is executed up to 12 times on a single data vector, and in the validation phase 16 times. The performance of a cache mapping is reported as the normalized median miss rate over all execution repetitions of an application with this cache mapping on all of its input data vectors.
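The reported metric can be sketched as follows. The exact normalization is not spelled out in this section, so the sketch assumes that every miss count is divided by the median miss count of the conventional mapping on the same input vector before the median over all repetitions and vectors is taken. The function names and the data layout (a dict from input-vector name to a list of miss counts) are hypothetical.

```python
from statistics import median


def normalized_median(candidate_runs, baseline_runs):
    """Median of candidate miss counts, normalized per input vector.

    Both arguments map an input-vector name to the miss counts measured
    over repeated executions on that vector. Assumption: normalization
    uses the baseline (conventional mapping) median of the same vector.
    """
    normalized = [
        misses / median(baseline_runs[vector])
        for vector, runs in candidate_runs.items()
        for misses in runs
    ]
    return median(normalized)


def reduction_percent(candidate_runs, baseline_runs):
    """Miss reduction in percent; negative values mean more misses."""
    return 100.0 * (1.0 - normalized_median(candidate_runs, baseline_runs))
```

With this convention, a value such as 32.7 in a "red.[%]" column would correspond to a normalized median of 0.673, and a negative entry to a mapping that performs worse than the conventional one.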

Table 3

Evolution of custom cache mappings for a 4 KB 2-way cache. Normalized median training and test miss-rate reductions (red.[%]) of the best evolved cache mapping of an application. Median absolute and relative numbers of miss-rates during testing ( $\times 10^{6}$ , rate[%]). Median run-time reductions during testing

Application	L1:D					L1:I
	Training	Testing				Training	Testing
	Cache misses				Run-time	Cache misses				Run-time
	red.[%]	$\times 10^{6}$	rate[%]	red.[%]	red.[%]	red.[%]	$\times 10^{6}$	rate[%]	red.[%]	red.[%]
CJPEG	9.08	5.8	25.0	5.3	3.3	10.5	1.0	0.9	3.5	$-$ 0.5
DJPEG	3.8	3.1	29.6	0.1	0.0	35.0	0.4	0.7	11.6	$-$ 4.8
FFT	6.3	0.5	8.0	5.0	0.4	6.5	5.8	13.8	5.4	2.8
DBLOWFISH	4.2	4.6	13.0	4.0	0.8	20.4	0.9	0.5	9.7	$-$ 2.7
PATRICIA	11.3	3.1	17.5	0.8	0.3	2.5	25.9	21.5	2.5	2.6
DIJKSTRA	1.4	2.1	20.0	$-$ 30.0	$-$ 3.8	24.7	1.9	3.8	21.5	2.3
RIJNDAEL-EN	14.9	8.2	42.9	$-$ 25.2	$-$ 6.7	14.7	30.0	35.8	14.7	5.2
RIJNDAEL-DE	5.9	7.9	43.5	$-$ 1.1	0.3	22.1	25.6	33.3	22.4	9.7
CRC32	3.0	4.5	8.8	2.4	0.4	32.6	0.2	0.09	18.7	$-$ 2.5

7.2 The evolution of cache mappings: The training phase

The training results of the best-evolved cache mappings are presented in the “Training red.[%]” columns for the L1:D and L1:I caches in Tables 2 and 3. The first observation is that the optimization process is always able to find better cache mappings than the standard mapping for data and instruction caches. The improvements mostly lie at up to 10% for the data caches and usually above 10% for the instruction caches. Sometimes, the MPKI improvements are rather large. For CJPEG and RIJNDAEL (L1:D, direct-mapped) and DBLOWFISH as well as CRC32 (L1:D, direct-mapped and 2-way associative), the improvements are above 30%. For DJPEG (L1:I, direct-mapped), the MPKI reduction reaches 64.5%. These numbers show the potential for the improvement of cache miss rates.

The second observation is that instruction fetches can be predicted better than data memory accesses. This holds even though the L1:D cache shows higher relative miss rates than the L1:I cache for many applications and therefore, intuitively, should have a higher potential for miss-rate improvement (column: testing $\rightarrow$ cache misses $\rightarrow$ rate[%]). On the other hand, the absolute number of misses is often much higher for the L1:I cache (column: testing $\rightarrow$ cache misses $\rightarrow\times 10^{6}$ ). One reason why the miss-rate reductions are small for some applications may be the randomization of the virtual-to-physical page mapping in Linux, which could limit the achievable miss-rate reductions.

7.3 Testing custom cache mappings: The validation phase

To investigate how an application with a custom cache mapping behaves in a regular operation mode, we test the best cache mappings on input data vectors of the application that have not been used in the training phase. We report the absolute numbers of cache misses (testing $\rightarrow$ cache misses $\rightarrow\times 10^{6}$ ), the relative cache miss rates (testing $\rightarrow$ cache misses $\rightarrow$ rate[%]), the miss rate reductions compared to the conventional cache (testing $\rightarrow$ cache misses $\rightarrow$ red.[%]), and the reductions of execution times (testing $\rightarrow$ run-time $\rightarrow$ red.[%]).

The first observation is that the reduction of cache misses (testing $\rightarrow$ cache misses $\rightarrow$ red.[%]) usually follows the MPKI reductions of the training phase (training red.[%]). Notable exceptions are the RIJNDAEL and CRC32 (L1:D, direct-mapped), DIJKSTRA and RIJNDAEL-EN (L1:D, 2-way set associative), and the DJPEG (L1:I, 2-way set associative) benchmarks, where MPKI values are more than 10% behind the training performance.

The next observation is that, similar to training results, the instruction cache accesses can be predicted more accurately than data cache accesses. For data caches, the MPKI improvements lie up to roughly 5% except for the CJPEG benchmark (L1:D, direct-mapped) with 35.2% MPKI improvement, while for instruction caches the improvements are often above 10%.

The final observation is that the run-times can be improved, although not to the extent reached by the MPKI reductions. This is because only one of the two caches, either the data or the instruction cache, is optimized and tested at a time. For example, reducing the miss rate of CJPEG by 35% reduces the execution time by only 16.8% (L1:D, direct-mapped). The run-time improvements for the direct-mapped cache are pronounced for the CJPEG (L1:D, 16.8%), DJPEG (L1:I, 8.7%), DIJKSTRA (L1:I, 9.7%), RIJNDAEL-DE (L1:I, 26.3%), RIJNDAEL-EN (L1:I, 18.8%), and CRC32 (L1:I, 37.1%) applications. For the set-associative cache, the run-time improvements are, as expected, smaller and lie around the performance of the conventional cache, with the exception of the CJPEG and RIJNDAEL-DE (L1:I, 9.7%) benchmarks.

8. Conclusion

The trend for more cores on a single die challenges the conventional processor design. Applications with fundamentally different memory access behaviors interfere with each other making it difficult for the cache logic to provide a uniform, coherent and efficient memory model. Shared resources allow for information leaks among unprivileged tasks. The memory bottleneck and caches are becoming one of the popular research and development fields of processor design again.

In this work, we have investigated the idea that a reconfigurable memory-address-to-cache-index mapping that is tailored to a specific application by a search algorithm may outperform the conventional modulo mapping. We have extended the instruction and data caches of LEON3 with reconfigurable address mapping functions, interfaced them with the Linux kernel, adapted the Linux scheduler, and evolved application-specific L1:I and L1:D cache mappings for nine applications. For most of the applications, the L1:I miss rates could be improved by more than 10%. Most of the execution times for L1:I and L1:D lie, however, around the performance of systems with a conventional cache. Large run-time improvements have been achieved for the Rijndael-DE (26.3%), Rijndael-EN (18.8%), CRC32 (37.1%), and CJPEG (16.8%) benchmarks.
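The difference between the conventional modulo mapping and an alternative index function can be illustrated with a small simulation of a 4 KB direct-mapped cache. This is a sketch under assumptions: the line size of 32 bytes is chosen for illustration and is not taken from this section, and the XOR-based index is a simple hand-written example in the spirit of the XOR-indexing schemes from the related work, not one of the evolved mappings.

```python
LINE_BYTES = 32                          # assumption: cache line size
NUM_LINES = 4096 // LINE_BYTES           # 4 KB direct-mapped -> 128 lines
INDEX_BITS = NUM_LINES.bit_length() - 1  # 7 index bits


def modulo_index(addr):
    """Conventional mapping: low-order bits of the block address."""
    return (addr // LINE_BYTES) % NUM_LINES


def xor_index(addr):
    """Illustrative alternative: XOR index bits with the next higher bits."""
    block = addr // LINE_BYTES
    return (block ^ (block >> INDEX_BITS)) % NUM_LINES


def count_misses(addresses, index_fn):
    """Simulate a direct-mapped cache and count misses for a trace."""
    lines = {}
    misses = 0
    for addr in addresses:
        block = addr // LINE_BYTES
        idx = index_fn(addr)
        if lines.get(idx) != block:
            misses += 1
            lines[idx] = block
    return misses


if __name__ == "__main__":
    # Four addresses exactly one cache size apart, visited in ten rounds.
    trace = [i * 4096 for i in range(4)] * 10
    print("modulo:", count_misses(trace, modulo_index))
    print("xor:   ", count_misses(trace, xor_index))
```

With the modulo mapping, all four addresses of this trace fall into the same cache line and evict each other on every access, whereas the XOR-based index spreads them over four different lines so that only the first access to each address misses. Which index function helps depends entirely on the access pattern, which is the motivation for searching application-specific mappings.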

References

1. Aeroflex Gaisler. Grlib.

2. Agarwal, Analysis of cache performance for operating systems and multiprogramming, PhD thesis, Stanford University, Tech. Rep. 87–332, 1987.

3. Agarwal and Pudar S.D., Column-associative Caches: A Technique for Reducing the Miss Rate of Direct-mapped Caches, in: Proc. Intl. S. Computer Architecture (ISCA), ISCA, ACM, 1993, pp. 179–190.

4. Corporation, Amdahl 470V/6 Machine Reference Manual, 1976.

5. Diamond J.R., Fussell D.S. and Keckler S.W., Arbitrary modulus indexing, in: Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), IEEE Computer Society, 2014, pp. 140–152.

6. Fotland D.A., Shelton J.F., Bryg W.R., La Fetra R.V., Boschma S.I., Yeh A.S. and Jacobs E.M., Hardware design of the first HP precision architecture computers, Hewlett Packard J 38(3) (Mar. 1987), 4–17.

7. Givargis, Improved indexing for cache miss reduction in embedded systems, in: Proceedings Design Automation Conference (DAC), IEEE, 2003, pp. 875–880.

8. Kaufmann and Platzner, A Hardware/Software Infrastructure for Performance Monitoring on LEON3 Multicore Platforms, in: Proc. Intl. Conf. on Field Programmable Logic and Applications (FPL), 2014, pp. 1–4.

9. Kaufmann and Platzner, Towards Self-Adaptive Caches: a Run-Time Reconfigurable Multi-Core Infrastructure, in: IEEE Intl. Conf. on Evolvable Systems (ICES), IEEE, 2014, pp. 31–37.

10. Kaufmann and Platzner, Optimization of Application-specific L1 Cache Translation Functions of the LEON3 Processor, in: 9th World Congress on Nature and Biologically Inspired Computing (NaBIC), Advances in Nature and Biologically Inspired Computing, Springer, 2019.

11. Intel, Improving Real-Time Performance by Utilizing Cache Allocation Technology, Technical report, Intel, 2015.

12. Intel, Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 3B: System Programming Guide, Part 2, Technical report, Intel, 2015.

13. Kaufmann and Platzner, Advanced Techniques for the Creation and Propagation of Modules in Cartesian Genetic Programming, in: Genetic and Evolutionary Computation (GECCO), ACM Press, 2008, pp. 1219–1226.

14. Kaufmann, Plessl and Platzner, EvoCaches: Application-specific Adaptation of Cache Mappings, in: IEEE Adaptive Hardware and Systems (AHS), IEEE CS, 2009, pp. 11–18.

15. Kim K.Y. and Baek, Quantifying the performance and energy efficiency of advanced cache indexing for GPGPU computing, Microprocessors and Microsystems 43 (2016), 81–94.

16. Patel, Macii, Benini and Poncino, Reducing cache misses by application-specific re-configurable indexing, in: Proceedings of the 2004 IEEE/ACM Intl. Conf. on Computer-aided Design (ICCAD), IEEE Computer Society, 2004, pp. 125–130.

17. Ros, Xekalakis, Cintra, Acacio M.E. and García J.M., Adaptive selection of cache indexing bits for removing conflict misses, IEEE Trans. Computers 64(6) (2015), 1534–1547.

18. Seznec and Bodin, Skewed Associative Caches, Technical Report 1655, INRIA, 1992.

19. Smith A.J., Cache memories, ACM Comput. Surv. 14(3) (1982), 473–530.

20. Stanca, Vassiliadis, Cotofana and Corporaal, Hashed Addressed Caches for Embedded Pointer Based Codes, in: Intl. Conf. on Parallel Processing (Euro-Par), volume 1900 of LNCS, Springer, 2000, pp. 965–968.

21. Vandierendonck and Bosschere K.D., Constructing Optimal XOR-Functions to Minimize Cache Conflict Misses, in: 21st Intl. Conf. on Architecture of Computing Systems (ARCS ’08), volume 4934 of LNCS, Springer, 2008, pp. 261–272.

22. Vandierendonck, Manet and Legat, Application-specific reconfigurable XOR-indexing to eliminate cache conflict, in: Proc. Design, Automation and Test in Europe (DATE), 2006, pp. 357–362.

23. Vandierendonck, Manet and Legat, Application-specific reconfigurable xor-indexing to eliminate cache conflict misses, in: Proceedings Design, Automation and Test in Europe (DATE), IEEE, 2006, pp. 1–6.

24. Wang, Liu, Wang and Yu, Eliminating intra-warp conflict misses in gpu, in: Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE), EDA Consortium, 2015, pp. 689–694.
