Abstract
As the number of processors integrated on a single chip increases with the fast pace dictated by Moore’s Law, multi-core systems-on-chip (MPSoCs) are becoming truly distributed systems at the micro-scale. From the application viewpoint, requirements for high performance and low power have increased at a breakneck speed in many embedded computing domains like wireless communication, audio and video processing. Applications in these areas are highly parallelizable and feature significant functional parallelism. In the past decades several programming models and tools aimed at easing the task of efficiently coding and mapping parallel applications. Most of them are tied to a specific platform or find far from optimal solutions due to the hardness of the allocation and scheduling problem. In this paper we presents (1) a Constraint Programming-based solver generating optimal scheduling and mapping decisions for such applications, optimized for maximal throughput, and (2) the framework used to validate the solver on a real multicore platform.
Introduction
Multicore processors have been embraced by the whole computer industry, since hardware manufacturers finally realized that the effort required for further improvements of single core chips trying to increase instruction-level parallelism is no longer worth the benefits eventually achieved. Thread-level parallelism (TLP) is currently under the focus of microprocessor vendors by designing chips with multiple internal parallel cores (such as, for instance the NVIDIA CUDA Programming Guide 1 and the Intel Single-chip Cloud Computer 2 ). However, this process does not automatically translate into greater system performance. The multicore solution exhibits for sure a superior peak performance, which can however only be achieved at the cost of significant software development effort [1, 2].
In the last two decades, the international scientific community has aimed at designing computer languages and tools to support application design/port and performance tuning for such parallel architectures [3]. Modern multicore processors are more and more limited by communication rather than computation [4]. However, communication or dependency information can often be difficult to determine at compile time. To effectively exploit this parallelism it is often necessary that the programmer manually re-structures the application in such a way that communication and task dependencies are explicit [5]. This is the case of streaming programming models, which have demonstrated significant performance advantages over standard (automatic) parallelization techniques in domains such as signal processing, multimedia and graphics [6, 7].
Synchronous Data-Flow SDF (see Section 3) is a representative example of such model of computation. Stream programs provide a vast amount of parallelism, which makes them well-suited to run efficiently on multi-core architectures, in particular on distributed memory architectures where communication is overlapped with computation (i.e. stream processors) [8].
The stream processing abstraction comes with some drawbacks: algorithms that cannot be naturally mapped to the paradigm must often be completely rewritten [9, 10]. Moreover, current programming practices and performance demands dictate that the programmer chooses a low-level language in which he can explicitly control the degree of parallelism and arduously tune his code for performance [11]. A good stream processing abstraction should reflect the underlying hardware model to properly map desired computation to the target hardware architecture [12]. Programming frameworks and middleware support for streaming applications (i.e. compilers, libraries, tools, runtime, etc.) need to be strongly specialized for the target architecture. The bottom layers of the framework stack (i.e. the compiler backend and runtime system) have to be highly tuned for the target hardware resources.
At the topmost levels of the software development stack, on the contrary, the programming abstractions should be as generic and architecture-agnostic as possible to increase ease of use and productivity. Still, knowledgeable programmers should be allowed to specialize the compilation flow for the target architecture by providing “hints” to the compiler. Simple language features should thus be designed to achieve both goals.
MPOpt-Cell [13] is a highly optimized framework for efficient development and execution of stream applications on the Cell BE Processor 3 . Cell is a heterogeneous multicore architecture composed by a standard general purpose microprocessor (called PPE), with eight coprocessing units (called SPEs) integrated on the same chip [14]. Cell has already demonstrated impressive performance ratings in computationally intensive applications and kernels. It has been exploited by several application domains, ranging from gaming to high performance computing [15, 16]. Thanks to its innovative architectural features, this architecture has been also adopted in the embedded system domain: e.g. in a work of Daniel Stasiak et al. [17] and in the Toshiba Spurs Engine 4 .
The real-time requirements frequently found at the heart of many streaming applications promote predictability as a first-class design goal. To achieve this goal MPOpt-Cell approach relies on strong off-line optimization and static scheduling. This allows to provide both robustness and high performance, different from several other related techniques based on dynamic scheduling and simple heuristics. One of the limitations of static approaches is that they often trade robustness for performance due to the introduction of schedule over-constraining. We address this issue by introducing additional precedence relations in the form of low-overhead fake data communications. This allows the schedule to stretch depending on the actual execution time of tasks at runtime.
The MPOpt-Cell framework
5
is a software chain structured in three separated components: A runtime system (RTS) which enables efficient execution of SDF applications on the Cell processor. The RTS manages all the aspects related to efficient hardware resource management, like task dispatching, data movement and task synchronization. Resource management adopts a centralized approach, being all decisions (e.g. task activations and DMA commands) issued by the PPE. This allows to effectively exploit advanced architectural features like double buffering, task migration and memory management, thus leading to a more efficient and flexible management of the overall system. A compiler backend which leverages on the Cyclic Resource-cOnstrained Scheduling Solver [18] (CROSS) to generate accurate scheduling and mapping decisions for the target SDF application, optimized for maximal throughput. Key of its efficiency is the combination between the strength of the constraint programming paradigm and the use of modular algebra. A simple and intuitive programming interface based on standard C augmented with annotations. A set of compiler directives have been identified, that capture the key abstractions of SDF. The programmer can easily describe SDF Graphs (SDFG) by enclosing code portions within our custom directives and specifying the flow of data among tasks. The compiler automatically extracts the tasks and the data-flow.
The three components, integrated in an unique application development framework, achieve the two-fold goal of easing application development and obtaining high performance from Cell-based processors.
The rest of the paper is structured as follows: in Section 2 we report the state-of-the-art framework supporting streaming computation while the background is presented in Section 3. Section 4 provides a compact overview of the runtime system (RTS) and of the programming interface. In section 5 we describe in details the MPOpt-Cell framework solver. The experimental results are presented in Section 6.
Related work
In the past few years several programming models and tools aimed at easing the task of efficiently mapping parallel applications on top of the Cell processors have seen the light.
Sequoia [19] is a programming language that abstractly exposes hierarchical memory in the programming model and provides language mechanisms to describe communication vertically through the machine and to localize computation to particular memory locations within it. This execution model is particularly well suited to data parallel computations. It enforces strict locality of computation, since tasks run in isolation on a processor and can only access data from within local memories. On the other hand, inter-node communication is much more complicated and much less performance efficient, since it has to take place through dedicated sub-tasks. This, in turn, makes it very difficult to model different kind of parallelism (such as those targeted by this work) with Sequoia constructs.
Offload [20] is a programming model from Codeplay for offloading parts of a C++ application to run on the SPEs of the Cell. Offload provides an efficient mechanism to automatically generate code for different ISAs (Instruction Set Architecture) in a heterogeneous Multi-Processor System-on-Chip (MPSoC), and to orchestrate data transfers in a transparent manner to the programmer. However, Offload constructs are not specific for data-flow applications, which can only be modeled at the price of significant coding effort. Moreover, no support for efficient scheduling of streaming tasks is natively provided by Offload.
Cell-Space [21] is a framework for developing streaming applications for the Cell. Developers construct applications by means of data flow components that are then scheduled to PPE/SPEs by a runtime system acting as a streaming communication interface. Different from our user-friendly annotation-based programming interface, with Cell-Space developers are required to construct data flow applications from a library of components which presents an application as an XML description of a data flow graph. In our framework a front-end compiler abstracts away these details of the internal representation of a data-flow application, thus requiring much less programmer involvement. Furthermore Cell-Space runtime system ensures load balancing through dynamic scheduling techniques, which however cannot guarantee robust and predictable execution times, such as ours, essential for real-time streaming applications.
StreamIt [6] is probably one of the most representative examples of a streaming language based on SDF available for the Cell processor. The StreamIt project provides a source language, a publicly available compiler, and a benchmark suite. Writing a StreamIt program, however, requires significant effort. Outlining all the tasks and communication channels that describe a streaming computation is left to the programmer. Moreover, the stream structures supported by StreamIt are limited to three representative patterns, namely pipelines (i.e., sequential composition), split-joins (i.e., parallel composition), and feedback loops (i.e., cyclic composition). MPOpt-Cell coding style based on annotations provides a much easier and expressive interface to data-flow programming than StreamIt constructs.
The most wide-spread programming model based on code annotations is undoubtedly OpenMP 6 . OpenMP provides a set of compiler directives that allow to outline parallelism and work sharing within a standard (sequential) C 7 program. While an OpenMP implementation for the Cell has been provided by authors of [22], the standard OpenMP model of computation is mainly focused on data parallelism at the loop level, and thus is not suitable to describing streaming parallelism. Still, the appealing easy-to-use coding style of OpenMP has led several researchers to extend the basic interface with custom constructs to describe data-flow parallelism.
Streaming extensions for OpenMP have been proposed within the ACOTES project [23]. Similar to what MPOpt proposes, the ACOTES programming model is based on a small set of key compiler directives that allow a programmer to identify streaming tasks, streams and ports. The focus of the optimization engine in ACOTES, however, is on loop transformations based on the polyhedral model [24] for efficient loop parallelization and vectorization. We do not target data parallelism in our work, and our optimization framework is rather aimed at guaranteeing a task schedule which maximizes throughput.
Cell SuperScalar [25] is another project which uses compiler directives as code annotations to model data-flow computation. The user has to identify the parallel parts of the application, which are then automatically offloaded. CellSS focuses on high-level parallelism and maintains a data flow graph of pending tasks. Among all the cited approaches, CellSS is probably the most closely related to the MPOpt-Cell one, and for this reason we chose it as a direct term of comparison for our experiments.
Background
Target architecture
Figure 1 shows a pictorial overview of the STI Cell Broadband Engine Hardware Architecture. The Cell is a non-homogeneous multi-core processor which includes a 64-bit PowerPC processor element (PPE) and eight synergistic processor elements (SPEs), connected by an internal high bandwidth Element Interconnect Bus (EIB).
The PPE is dedicated to the operating system and acts as the master of the system, while the eight synergistic processors are optimized for computation-intensive applications. The PPE is a multithreaded core featuring two levels of on-chip cache. The SPE is a computation-intensive coprocessor designed to accelerate media and streaming workloads. Each SPE consists of a synergistic processor unit (SPU) and a memory flow controller (MFC). The MFC includes a DMA controller, a memory management unit (MMU), a bus interface unit, and an atomic unit for synchronization with other SPUs and the PPE.
Efficient SPE software should heavily consider memory usage, since the SPEs operate on a limited on-chip memory (only 256 KB local store) that stores both instructions and data required by the program. The local memory of the SPEs is not coherent with thePPE main memory, and data transfers to and from the SPE local memories must be explicitly managed by using asynchronous coherent DMA commands.
Synchronous data-flow graphs
Synchronous Data-Flow Graphs (SDFGs) [26] are used to model periodic applications that must be bound to a Multi Processor System on Chip. They allow modeling of both pipelined streaming and cyclic dependencies between tasks. This model of computation represents data movements between activities through tokens (i.e. dot on the arcs of the graph). To assess the performances of an application on a platform, one important parameter is the throughput. In the following we provide some preliminary notions on synchronous data flow graphs used in this paper.
An SDFG is a pair consisting of a finite set of activities (also called, nodes, tasks or actors) and a finite set of dependency arcs. A dependency arc denotes a dependency of activity j on i, with . When i executes, it produces p tokens on and when j executes, it consumes q tokens from . Arcs 8 may contain initial tokens .
An activity execution is defined in terms of firings. An essential property of SDFGs is that every time an activity fires it consumes a given and fixed amount of tokens from its input edges and produces a known and fixed amount of tokens on its output edges. These amounts are called rates.
SDFGs in which all rates equal 1 are called Homogeneous Synchronous Data Flow Graphs (HSDFGs, [26]). Every SDFG can be converted to an equivalent HSDFG , by using the conversion algorithm in [27].
In Fig. 3 we report the HSDFG corresponding to the SDFG in Fig. 2 where the (unary) rates areomitted.
Throughput
Throughput is an important design constraint for embedded multimedia systems. The throughput of an SDFG refers to how often an activity produces tokens. To compute throughput, a notion of time must be associated with the execution of each activity (i.e., each activity has a duration d
i
also called execution time) and an execution scheme must be defined. We consider as execution scheme the
Working with Synchronous Data-Flow models of computation, it becomes natural to adopt a scheduling strategy which defines only the allocation and the ordering, and let the run-time scheduler to decide the start times.
Framework Overview
In this section we describe our approach to mapping a data-flow application on the Cell processor. The overall flow of our framework is depicted in Fig. 4.
The interaction between framework components is based on three hardware-software models: a Architecture Description Language (ADL), providing an abstract view of the target hardware platform. a Graph Description Language (GDL), providing a unified notation for distinct graph-based software design methodologies. GDL is used as an input format for the MPOpt Solver, which produces as output an enhanced version of the same model (we refer to as GDL+). GDL+ contains information about the affinity between tasks and SPEs as well as scheduling decisions. a Mapping Description Language (MDL), handling features that affect both hardware and software domains.
We consider a layered bottom-up approach, where three main building blocks incrementally abstract away architectural details from the developer’s view.
Framework models and languages
Architecture Description Language
The Architecture Description Language (ADL) describes hardware platforms from an abstract perspective, defining functional entities and their relationship and omitting structural details and temporaldynamics. This approach allows the framework to support different hardware platforms, providing a unified view to the solver component. Therefore, the MPOpt framework can be easily adapted to different architectures.
The Connector entity acts for any interconnection element in the system, regardless of real topology (e.g. buses, communication channels, etc). Connectors induce a logical partitioning of the global system environment into local ones, which are connected each other through gateway paths. The required attributes of Connector are bandwith and latency. A repeat attribute is also available to define a multiplicity of idempotent functional units. The self-association between Connectors enables a hierarchical view of hardware resources. Assuming this feature we can define application mapping constraints at different abstraction levels without loss of generality and using a single hardware model for different software requirements.
The resource has two abstract subtypes, LocalResource and Gateway. LocalResource is an abstract entity representing a strictly local resource. The local resources actually supported by ADL are processors and memories. Processor models any computational unit in the system (excluding DMA controllers). The inherited attribute slots specifies the computational load for the unit (e.g. total number of supported hardware threads). Memory represents a generic memory device: in this case the inherited attribute slots specifies the byte size of the device. Gateway entities are instead a means to establish a link between two connectors, with the aim to enable communication between corresponding local environments. The attributes of Gateway are sink, which specifies the linked Connector, and control, which declares the control permissions in relation to current local environment (internal control, external control or both). The model provides several concrete subtypes, including: DMAController, a DMA controller with slots transfer channels available; MemoryInterface, a local memory interface with slots ports. MessagingInterface, a message passing interface based on input and output queues.
Figure 5 shows an ADL mapping for Cell Architecture.
Graph Description Language
The Graph Description Language (GDL) is a general-purpose language for parallel software description, supporting various graph-based design methodologies, including SDFG [29]. GDL defines three base entities: Graph is a global container for all instance entities, Node is a vertex of the graph and Arc is an edge of the graph. Each entity contains a list of properties, which represents a set of features in the form of name-value pairs. Node has the optional attributes input (and, or, xor) and output (branch, fork), describing respectively input and output port semantics. Node has two concrete subtypes: SimpleNode, representing an atomic node, and ClusterNode, modeling a cluster node, which hierarchically contains a subgraph or an external reference. Cluster nodes can be interpreted as co-allocation constraints or library solutions [30, 31]. In the first case they force the allocation of internal nodes to the same local environment (see ADL Connector), in the second one they reference a predefined hardware/software subsystem. Arc references both a source node and a sink node. Furthermore each arc refers to two Endpoint, start and end, containing additional lists of properties. The framework uses GDL to describe SDFGs, in accordance with these rules: a graph property (token) specifies token byte size; a node property (codePtr) references the task code; arc start/end properties (rate) specify token rates; an arc property (tokenNum) defines initial tokens; an arc property (tokenPtr) references the values of initial tokens.
Mapping Description Language
The Mapping Description Language (MDL) describes all features which affect both hardware and software domains, e.g. the durations for task executions and data transfers. This approach properly decouples hardware and software requirements, respecting the separation of concerns principle [32]. A MappingSpace is a vector space D1 × D2 × … × D
n
; each D
i
is a finite domain (Domain entity) of elements (Element entity) related to entities in an ADL instance (adl_ref attribute). In practical cases resources of the same type (processors, memories, etc) logically belong to the same mapping domain. Dependency defines a tuple containing one element per domain (in the same order of domains in the mapping space) and associates a set of properties to it. Each element may assume the following values: a valid element for the domain; * (anything), in case that the dependency properties hold for each domain element; - (don’t care), in case that the dependency properties are totally unrelated to the domain.
In this way it is possible to associate properties to groups of resources (resource assignment) with the granularity necessary to meet a wide range of requirements. Defined properties are: duration - duration of the activity for this assignment (i.e. task execution duration); invalid - invalid resource assignment for the referred node; consumption_dname - resource consumption of this assignment, referred to the domain identified by dname.
GDLElement groups a collection of dependencies related to the same element in GDL instance (gdl_ref attribute), establishing an effettive connection between the two domains. Figure 6 shows an example of use for MDL: it creates a MappingSpace based on ADL entities and then defines a tuple related to a specific GDL node (P1 with any memory). The properties of this tuple specify an execution time (duration) and the resource consumptions for processor and memory.
The MDL notation is minimal and flexible, with the possibility for the designer to properly define decoupled mapping information for a specific application while considering different hardware platforms.
Front-end compiler
This section briefly describes MPOpt-Cell Programming Model, namely the front-end compiler.
MPOpt-Cell adopts a novel programming model based on the familiar C language, augmented with some compiler directives that allow the programmer to easily describe data-flow computation. The front-end compiler automatically outlines a parallel C program – describing tasks and data streams of the SDFG – which is sent to the backend Compiler.
Our approach borrows from the coding style of the well-known shared memory programming model OpenMP.
A compiler is in charge of transforming the annotations into code which spawns parallel threads at runtime and manages data sharing among them.
Therefore we define a set of directives which provide the needed abstractions to model a SDFG (i.e. actors, incoming/outgoing arcs, etc.). The role of our front-end compiler is twofold. First, the directives inserted by the programmer are processed so as to generate C code which will be compiled for the SPEs. Second, the structure of the SDFG described through the custom directives is extracted into a GDL representation for the solver to process. The compiler automatically extracts the SDFG representation required by the Solver.
The main benefits of this approach reside in an increased ease of use and productivity, since a programmer does not have to learn new language constructs or a brand-new programming language.
Runtime system
The bottom layer of the MPOpt framework consists of a target-specific
The runtime system introduces an abstraction layer which hides the difficulties of the parallel architecture, such as load balancing, synchronization, and communication between the main components of the streaming application. It orchestrates the execution of all the components in the data flow graph application and provides both streaming and event communication primitives to the components. Moreover, using centralized resource management, the runtime can dynamically balance the load over the available SPE processors. The runtime system leverages different structures for its execution, namely SDFG tasks, queues, executors and a resource manager.
Note that this is the only component specifically designed for the Cell platform. Hence, using the framework with another platform is feasible, and only the RTS system would require some changes.
CROSS solver
To ensure the most efficient task schedule for the considered application (i.e. the application considered in the framework) we developed a solver block operating on top of the RTS. The computed solution is guaranteed to satisfy user defined constraint which may be specified (e.g. a minimum throughput requirement).
The solver block requires an input description of hardware, software and cross-domain data in the form of ADL, GDL and MDL documents. A pre-processing step transforms the input SDFG into the corresponding Homogeneous SDF Graph (HSDFG) (see [27] for a wide overview of the transformation algorithm), with unary rates over each arc. Eventually, the homogeneous graph is transformed into a perfect-rate HSDF graph (see [33]). Both transformations involve polynomial-time algorithms.
Constraint programming model
The solver used in this approach is based on CROSS (presented in [18]). The main underlying idea is to focus on a λ-width time window. We adopt a start/end time decomposition similar to that in [34]. The start time of execution 0 of activity i (i.e. start (i, 0)) can be expressed as:
Figure 7 shows two different schedules: the former (a) is a classic schedule, computer over a horizon-length time window (usually the sum of the execution times of all activities), while the latter (b) is the relative λ-width time window schedule (based on modular algebra).
Precedence relations are modelled through the following temporal constraint:
where δ(i,j) is the number of tokens on the arc (i, j), d i is the duration if the activity i, and θ(i,j) is the latency of the communication between i and j (i.e. a time-lag).
Note that if the (modular) precedence constraints are satisfied for iteration 0, then they are satisfied for every other iteration.
Beside the application model, the Constraint Programming model takes into account the architectural components by describing resource constraints and architectural features.
In this framework, we describe the Cell model as a single cumulative resource of capacity equal to the number of SPU.
The restrictions on the resources are therefore modeled through a cumulative constraint 9 ; the constraint prevents the solver from finding a number of concurrent executions higher than the capacity of the resource.
The basic idea of the solving process is to model the effects of mapping and scheduling choices by means of graph modifications. The solver block assumes a self-timed scheduling policy 10 and the execution order is determined on the modified graph. In details, we modify the graph adding edges that force precedence relations between scheduled tasks so as to prevent tasks from competing for the SPEs. The runtime scheduler processes these edges as fake data communications between actors. Note that differently from a state-of-the-art disjunctive approach presented in [36], where the graph is modified during search, in the MPOpt-Cell solver the augmented graph is constructed when the search stops with a feasible solution.
The self-timed execution of a periodic application modeled with a synchronous data-flow graph consists of two different phases (see [37]): the transition phase and the periodic phase. The former appears only once at the beginning of the execution while the latter is periodically repeated ad infinitum. In this context, a static scheduling strategy [27] should include a different schedule for each execution phase. Conversely, thanks to the use of modular algebra, our solver produces a single static schedule that represents both phases (see Fig. 8). The transient and the periodic phase schedules are easily inferred by considering the iteration values of each task in the solution. In fact, the solution includes the modulus value λ and the set of start times 𝒮 i and iteration numbers β i . Before the application enters the periodic behaviour, some of the tasks (namely those for which β i > ω, with ω as current execution iteration) simply do notstart.
The type of solution provided by the CROSS solver 11 naturally suggests the use of a time-triggered schedule; on the opposite the MPOpt-Cell runtime system is based on a self-timed policy. The motivation for this choice is that modeling the precedence relations with arcs on the graph allows the solver to obtain flexible solutions. Moreover, if the actual execution times are shorter than the expected times it allows the runtime to deliver higher throughput.
At the end of the search, a novel algorithm translates an assignment of start and iteration variables to graph modifications and constructs the new GDL+ document. Therefore the output of the MPOpt-Cell solver consists of one “enhanced” GDL document (referred to as GDL+), which is essentially the input GDL with additional allocation and scheduling information. Namely, we specify the node affinity that references a resource unit chosen for task allocation or we add dummy arcs (precedence arcs) used to force task ordering on a specific resource. The algorithm operates into two steps: Allocation: the algorithm extracts from the solution the mapping association for each node (even if the allocation problem is not considered in CROSS). In fact, the use a single shared resource of capacity equal to the number of SPU guarantees that the maximum number of concurrent tasks is equal to the number of SPU (see Section 3 for details). Scheduling: the algorithm inserts into the original graph new precedence arcs (called ) that guarantee the predicted execution. These guarantee the correctness of both the transient and the periodic execution phases.
Note that counter-intuitively an (i, j) can have a negative number of tokens (i.e. ). This means that the source activity i has to execute at least times before the first execution of sink activity j.
The framework has been tested using three representative algorithms from the multimedia domain, namely a FFT kernel, a block matrix multiplication and a FM radio demodulator.
We compared our approach to Cell Superscalar (CellSs), a framework based on code annotation that provides a compiler and a runtime library for Cell platform programming; since the approach is quite similar to MPOpt-Cell, CellSs is a good candidate for comparative benchmarks.
All experiments were executed on a PlayStation 3 (3.2 GHz Cell) running Yellow Dog Linux 6.0. Reference implementations for the benchmark algorithms are available on StreamIt website: the serial code has been annotated using both Cell Superscalar and MPOpt-Cell annotations, and it has been compiled activating maximum optimization level (O3) for both compilation chains.
Experimental results were calculated considering a set of 50 program executions: each execution runs the algorithm and collects statistics for a stream of 200 input data. Consequently, the calculated throughput on each run is equal to the ratio 200/execution time.
The evaluation of framework performance follows two fundamental metrics: predictability andperformance.
Predictability is essential in the presence of hard real-time constraints: in this case, local throughput variations may result in violation of the deadlines, making the computation useless or even harmful.
On the other hand, a predictable schedule takes the risk to be extremely conservative, with an outcome of poor performance. For this reason we also decided to evaluate the overall performance of our approach, in terms of average throughput, making a comparison with Cell Superscalar results.
Predictability evaluation
Predictability has two main facets: the capability to produce a regular throughput with minor local variations and the ability to predict whether throughput constraints may be satisfied or not before deployment stage.
In the case of real-time applications it is essential to ensure the respect of definite deadlines and consequently a wide variability range is unacceptable.
Figure 9 depicts the results of a first experimental set: it graphically represents the variability of the measured execution times on both CellSs (left side) and MPOpt-Cell (right side).
A numerical scale was not reported in the chart because absolute values are not significant to understand the variability dynamics.
The distance between a point and the chart center shows the execution time of a single experiment. Hence, all points equidistant from the center represent different experimental values with the same execution time. Each application is depicted by a line whose points map the values measured at subsequent application launches. The circularity factor of each line directly traces the regularity of the throughput for the correspondent application: the closer the line to a perfect circle, the more regular execution times.
The experimental values related to MPOpt-Cell have a good circular factor, namely a limited variability. This aspect can be attributed to the high determinism derived by the use of a static scheduler.
Conversely, Cell Superscalar provides a dynamic scheduler which is based on a task dependency graph which sometimes allows locally better results, but exhibits an unpredictable runtime overhead. Therefore, the bad circular factor of corresponding lines highlights this behaviour.
Figure 10 shows the result for a second experimental setup: this was conducted by feeding the solver with task worst case/best case execution times, instead of average ones.
The off-line predicted throughput values are compared with the measured one at runtime, which is based on a schedule obtained by average execution times. In detail, each column in the figure refers to a different schedule, computed by taking into account worst case/average case/best case execution times; for each benchmark, the left- and right-most columns present off-line predicted throughput values of the considered schedule, while the middle column reports an experimentally measured value.
One can see that the worst case execution times (WCET) and the best case execution times (BCET) schedules establish strong bounds for the eligibility range of runtime throughput values (depicted in Fig. 10 by overlying vertical lines). In particular, the solution provided when all tasks are assumed to execute with the BCET provides a conservative upper bound on the throughput. More interestingly, the WCET solution provides an estimate of the best safe throughput value, i.e. the tightest throughput requirement the system can met.
Moreover, by providing the solver with WCETs, the predicted value is very close to the actual one, assessing the high accuracy of the adopted model. This allows to check in the early steps of the design process if the performance requirements can be met or an upstream optimization stage is required. Overall, this can potentially reduce the development time.
Performance evaluation
Figure 11 shows the results of a bare performance comparison (in terms of throughput) with Cell Superscalar. The figure bars represent the throughput values calculated as a mean of 200 program iterations: overlying vertical lines outline the interval of samples variability, i.e. the range of measured throughput values. The endpoints of these lines represent the maximum and minimum values originated by experimental results.
The throughput mean value is an assessment of the performance one can expect with a high number of iterations: it is particularly significant in a scenario characterized by continuous data streaming and in this context it can be naturally used as a performance indicator. Our approach is comparable to Cell Superscalar on FFT, it has better performance on FM radio and even better performance on matrixmultiplication.
The runtime performance of MPOpt-Cell framework is due to a set of concurrent factors: (1) the effective use of the SDF model, that fully exploits the data parallelism of streaming applications and avoids unnecessary synchronization barriers; (2) the static scheduling technique, which provides low-overhead resolution of possible resource conflicts by compile time allocation and scheduling; (3) the double buffering technique, that reduces the overhead of data transfers; (4) the use of an optimized schedule, which is based on appended precedence relations and allows the framework to stretch to accommodate actual execution times.
The low performance measured on the matrix multiplication benchmark using CellSS (40:1) motivated a deeper analysis. The performance gap appears to be attributable to framework design issues. Since CellSS does not natively support a streaming model, it is necessary to insert a synchronization barrier after the computation of the submatrices multiplication at each iteration step: in this way we can obtain the correct result without using ad-hoc implementation tricks. To investigate the impact of such limitation, we ran some experiments on MPOpt-Cell by introducing an unnecessary barrier. We observed that the barrier represent an important limitation to the application throughput, indeed with this barrier the throughput achieved on MPOpt-Cell is halved. Still, the barrier alone does not explain the remaining performance gap (20:1). A further conjecture involves optimization issues related to data transfers, which are treated by CellSS using a locality-aware heuristic.
Conclusions
In this paper we described (1) a CP based approach for Resource Allocation and Scheduling of data-flow applications and (2) MPOpt-Cell, a complete environment for enabling efficient development and execution of streaming programs on the Cell Broadband Engine processor. Its infrastructure leverages on an intuitive programming model based on compiler directives which allows designers to easily describe streaming applications. Compile-time and run-time optimizations ensure an efficient execution of the application through a finely tuned mapping of tasks and data-flow on top of available hardware resources. Experimental results demonstrate the efficiency of the framework. The whole framework is described and can be downloaded at http://mpopt.ing.unibo.it.
Footnotes
4
http://www.semicon.toshiba.co.jp/eng/shared/∖∖pdf/SpursEngine∖_leaf∖_e∖_2008-11.pdf.
7
or C++, or Fortran.
8
Note that a dependency arc of a SDF graph could be physically represented as a FIFO Memory Buffer where data (i.e. tokens) are stored.
9
10
Where each activity executes as soon as all of its input data are available.
11
We recall that CROSS solver computes static-time schedules. The ordering decision are extracted after the search.
