Chapter 23

Design Space Exploration for Efficient Data Intensive Computing on SoCs

Rosilde Corvino, Abdoulaye Gamatié, and Pierre Boulet

R. Corvino, University of Technology Eindhoven, Den Dolech 2, 5612 AZ, Eindhoven, The Netherlands, e-mail: [email protected]
A. Gamatié and P. Boulet, LIFL/CNRS and Inria, Parc Scientifique de la Haute Borne, Park Plaza – Bâtiment A, 40 avenue Halley, Villeneuve d'Ascq, France, e-mail: [email protected]; [email protected]

B. Furht and A. Escalante (eds.), Handbook of Data Intensive Computing, DOI 10.1007/978-1-4614-1415-5_23, © Springer Science+Business Media, LLC 2011

1 Introduction

In the last half century, parallel computing has been used to model and solve complex problems related to meteorology, physical phenomena, mathematics, etc. Today, a large part of commercial applications use parallel computing systems to process large sets of data in sophisticated ways, e.g., web searching, financial modeling and medical imaging. A certain form of parallelism is simple to put in practice and may consist in merely replicating overloaded key components or even whole complex processors. But the design of a well-balanced processing system that is free from computing bottlenecks and provides high computing performance with high area and power efficiency is far more challenging [1]. Especially in embedded systems devoted to data intensive computing, the hard design requirements on computing performance and hardware efficiency make the use of application-specific co-processors, usually realized as hardware accelerators, compulsory.

The optimized design of these parallel hardware accelerators involves many complex research issues and calls for automatic design flows that address the complexity of the design process and allow a fast and efficient design space exploration for data intensive computing systems. According to many existing works [2–4], such a flow should refine one or more abstract specifications of an application and provide a systematic exploration method that effectively selects optimized design solutions. For example, the platform-based design methodology [4–6] abstracts the underlying hardware architecture implementation by means of an Application Programming Interface (API).


Typically, most of the existing methods [7] use, as the starting point of the exploration analysis, two abstract specifications: one giving the behavior to be realized, i.e., the application functional specification, and one giving the resources used to realize the intended behavior, i.e., the architecture structural specification. The use of abstract input specifications makes complex design processes tractable. However, by abstracting implementation details, several correlated problems of the architecture exploration are orthogonalized, i.e., considered as independent. This can significantly decrease the quality of the design exploration and selection results.

A largely accepted orthogonalization for data intensive computing systems considers separately (1) the design of data transfer and storage micro-architectures and (2) the synthesis of simple hardwired computing cores or instruction sets [8–10]. Actually, these two problems overlap. Thus, in order to have an efficient synthesis process, their orthogonalization should be chosen reasonably. Moreover, the synthesis process should extract the constraints for one problem from the solutions of the other and vice versa, so that the two problems mutually influence each other. In this way, the effect of the orthogonalization on the quality of the selected design solutions is minimized.

The two orthogonalized problems are defined as follows. The synthesis of computational cores is mostly related to the instruction scheduling and binding problems and is a very active research topic [11], but it usually does not consider aspects related to data intensive computing behaviors. On the contrary, the design of an efficient data transfer and storage micro-architecture is the major and most complex problem in designing hardware accelerators for data intensive applications. In fact, the applicability of data parallelism itself depends on the possibility to easily distribute data on parallel computing resources. This problem is strictly related to computation and data partitioning, and to data parallelism exploration. Moreover, the data transfer and storage micro-architecture affects the three most important optimization criteria of a hardware accelerator design: power consumption, area occupancy and temporal performance [9, 10].

2 Related Works

We review some of the existing literature about the previously introduced research challenges. We present these challenges incrementally, starting from the more general context of design automation. Then, we present a design problem strictly related to data intensive computing systems, i.e., the design of the data transfer and storage micro-architecture. Finally, we discuss challenges about the application functional specification, with the corresponding data partitioning and parallelization problem, and challenges related to the architecture structural specification, with the associated resource usage scheduling problem. For each of these research challenges, we enumerate the major issues and briefly introduce possible solutions.


2.1 Design Automation

Traditionally, Electronic Design Automation (EDA) tools have supported the design process from the Register Transfer Level (RTL) to the circuit layout. Since the 1990s, EDA has expanded to new tasks in order to support Electronic System Level (ESL) design. The tasks that support ESL design involve, among others, system level specification and exploration, and behavioral synthesis. Behavioral synthesis, also called High Level Synthesis (HLS), infers an RTL model from a functional specification of a system. Three main ESL design paradigms are becoming increasingly accepted [2]: the function-based ESL methodology, the architecture-based ESL methodology and the function/architecture codesign methodology.

The function-based methodology [12] is based on Models of Computation (MoCs), which specify in an abstract way how several computing tasks communicate and execute in parallel. The MoC-based specification can be directly translated into an RTL model or into a compiled program. Consequently, in such a methodology, the application-specific architecture (or architecture template) is already known and automatically generated from a given MoC-based specification.

The architecture-based ESL methodology uses hardware component models that can be specified at different abstraction levels. It usually classifies the hardware components into two categories: computing processors and transactions, i.e., communications between processors. A computing processor can represent either a programmable, flexible computer, whose functionality is completely specified after the hardware implementation in a traditional compilation phase, or a customizable hardware template, whose behavior and structure can be specified at design time. The architecture-based ESL methodology has to solve the problem of scheduling and binding applications on a given architecture. Even if this approach allows architectural exploration, it is limited by the complexity of the input specifications, usually written in SystemC or in Architecture Description Languages. Indeed, the architectural exploration lacks automation and thus cannot be performed systematically and extensively. Moreover, ESL architecture exploration is often based on HLS, whose capability does not scale with the complexity of the analyzed systems and is usually limited to mono-processor systems.

The function/architecture codesign-based ESL methodology [7] is the most ambitious one. It uses a MoC to specify the functional application behavior and an abstract architectural model to specify the architecture of the system. It refines these two models concurrently and in a top-down manner in order to achieve the actual system realization. This approach has two limitations: first, it always constructs the architecture from scratch and does not take into account that most industrial flows use, and possibly extend, already existing platforms; second, it does not exploit the fact that for a given application domain, optimal architectures or architecture templates are already known. For these reasons, a meet-in-the-middle method is required [13] to reasonably bound and lead the design space exploration of application-specific systems.


Data intensive computing systems are typical systems that could benefit from such an ESL methodology. In this context, HLS would be used to synthesize the single processors or hardware accelerators of the whole system; it would need an ESL front-end exploration to orchestrate the system level data transfer and storage mechanisms. One of the biggest challenges in providing such a competitive ESL front-end exploration is to perform a design space exploration (DSE) able to find an optimal hardware realization of a target application in a short time. We can distinguish two kinds of DSE approaches: exact, exhaustive approaches and approximated, heuristic-based approaches. Ascia et al. [14] briefly survey these approaches and propose to mix them in order to reduce the exploration time and approximate the optimal solution more precisely. In this chapter, we present a methodology aimed at providing an ESL exploration front-end for an HLS-based design flow. The DSE used is exhaustive and thus provides exact solutions, but it includes intelligent pruning mechanisms that dramatically decrease the exploration time.

2.2 Data Transfer and Storage Micro-Architecture Exploration

The problem of finding an efficient system level data transfer and storage micro-architecture has been largely studied in the literature and is related to several research issues [8]. We classify these issues into two categories: issues concerning the refactoring of functional specifications and issues concerning the architecture design process. Usually, the functional behavior of an application is specified according to a sequential (or partially parallel) computing model. Such a specification may not expose all the parallelism possibilities intrinsic to an application, i.e., possibilities that are only limited by the data and computational dependencies of the application. On the other side, the requirements of the architectural design usually constrain the actually achievable parallelism level. For this reason, application refactoring and architectural structure exploration are needed concurrently. These involve application task partitioning, data partitioning, and the selection of an appropriate and efficient memory hierarchy and communication structure for the final system.

An efficient memory hierarchy exploration includes an estimation of the storage requirement and the choice of a hierarchy structure, with the corresponding data mapping. Balasa et al. [15] survey works on storage requirement estimation and conclude that they are limited to the case of a single memory. One of the limits to parallelism exploitation, especially in data intensive computing systems, is the usage of shared memories requiring memory access arbitration, which further delays inter-task communications; the authors of [10, 16] advocate the development of distributed shared memories to improve the performance of data intensive computing systems. Many studies exist on methods to restructure a high-level specification of an application in order to map it onto a parallel architecture.


These techniques are usually loop transformations [17], which enhance data locality and allow parallelism. Data-parallel applications are usually mapped onto systolic architectures [18], i.e., nets of interconnected processors in which data are streamed through in a rhythmic way. Amar et al. [19] propose to map a high-level data parallel specification onto a Kahn Process Network (KPN), which is a network of processing nodes interconnected by FIFO queues. We argue that KPNs are not suitable to take into account the transfer of multidimensional data. The FIFO queues can be used only when two communicating processors produce and consume data in the same order; in the other cases, a local memory is necessary. When FIFO queues are used, two communicating processors can simultaneously access the queue to read or write data, except when the first processor attempts to write into a full queue or the second processor attempts to read from an empty queue. In these cases, the processors have to stall, and if two inter-depending processors stall, a deadlock occurs. When a local memory is used, the risk of conflicts on a memory location imposes that the processor producing and the processor consuming data execute at different times. This creates a bottleneck in the pipeline execution, which can be avoided by using a double buffering mechanism [20]. Such a mechanism consists in using two buffers that can be accessed at the same time without conflict: the first buffer is written while the second one is read, and vice versa.

The works exploring the hierarchy structure [21, 22] are usually limited to the usage of cache memories. Caches are power- and time-consuming memories because of their fetch-on-miss and cache prediction mechanisms. We argue that an internal memory with precalculated data fetching is more efficient for application-specific circuits. In addition, Chen et al. [23] show that local memories with a "pre-execution pre-fetching" are a suitable choice to hide the I/O latency in data intensive computing systems.
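A minimal sketch of the double buffering mechanism, in Java for concreteness, is given below; the two-thread structure, the barrier used as the swap point and the buffer sizes are illustrative assumptions rather than details of any particular design from the literature above.

// Minimal double-buffering (ping-pong) sketch: while the consumer reads one
// buffer, the producer fills the other; the roles swap at each round.
import java.util.concurrent.CyclicBarrier;

public class DoubleBuffer {
    static final int BLOCK = 1024;                       // assumed block size
    static final int[][] buffers = {new int[BLOCK], new int[BLOCK]};

    public static void main(String[] args) throws Exception {
        CyclicBarrier swap = new CyclicBarrier(2);       // both sides meet here
        Thread producer = new Thread(() -> {
            try {
                for (int round = 0; round < 8; round++) {
                    int[] write = buffers[round % 2];    // buffer being filled
                    for (int i = 0; i < BLOCK; i++) write[i] = round * BLOCK + i;
                    swap.await();                        // hand the full buffer over
                }
            } catch (Exception e) { throw new RuntimeException(e); }
        });
        Thread consumer = new Thread(() -> {
            try {
                for (int round = 0; round < 8; round++) {
                    swap.await();                        // wait until the buffer is full
                    int[] read = buffers[round % 2];     // buffer being drained
                    long sum = 0;
                    for (int v : read) sum += v;
                    System.out.println("round " + round + " sum " + sum);
                }
            } catch (Exception e) { throw new RuntimeException(e); }
        });
        producer.start(); consumer.start();
        producer.join(); consumer.join();
    }
}

After each barrier crossing, the producer fills one buffer while the consumer drains the other, so the data access of the next round is overlapped with the computation of the current one, which is exactly the conflict-free overlap described above.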

2.3 Application Functional Specification

Our target applications are data-parallel and have a predictable behavior, i.e., a static number and address of data accesses and a compile-time known number of iterations. Examples of such applications are numerous image processing, radar and matrix computation algorithms. A well-adapted MoC for these applications is the multidimensional synchronous data flow (MDSDF) model, which represents an application as a set of repeated actors communicating by multidimensional FIFOs [24]. This MoC is an abstraction of a loop-based code with static data accesses [25], compared to which it has a higher expressivity to infer data parallelism. In MDSDF, data dependencies and rates are given, but the model lacks information on processing and data access latencies. For an efficient exploration of possible hardware implementations, a temporal analysis is needed. A way to capture the temporal behavior of a system is to associate abstract clocks with the system components. In particular, for iterative applications, such a clock can be a periodic binary word capturing the rhythm of an event's presence [26].


Since the 1970s, loop transformations such as fusion, tiling, paving and unrolling [27] have been used to adapt a loop-based model to an architecture. For the past ten years, they have increasingly been used in loop-based high level synthesis tools and SDF-based ESL frameworks, e.g., Daedalus [28], in order to optimize the memory hierarchy and the parallelism [27, 29]. These ESL frameworks use the polyhedral model to represent loop-based application specifications and their corresponding transformations. The polyhedral model only allows affine and disjoint data and computation partitioning, while MDSDF-like models allow the analysis of more complex data manipulations; e.g., they handle sliding windows to minimize the number of data accesses when the application uses superimposed data blocks. In [30], loop transformations are adapted to the case of an MDSDF-like model. In our method, we also use an MDSDF-like model, called Array-OL, and restructure the MoC-based application specification through refactorings equivalent to loop transformations, as illustrated by the sketch below. This makes it possible to enhance the data locality and the parallelism level exposed by the application specification, and to explore possible optimized data and computation partitionings.
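To make these transformations concrete, the following sketch shows loop fusion and loop tiling on a deliberately simple one-dimensional, two-stage pipeline; the functions f and g and all sizes are placeholders, not operators taken from this chapter.

// Illustrative loop transformations on a pipeline b[i] = f(a[i]); c[i] = g(b[i]).
public class LoopTransforms {
    static int f(int x) { return x + 1; }   // placeholder stage 1
    static int g(int x) { return 2 * x; }   // placeholder stage 2

    // Original form: two loops, b is a full-size intermediate array.
    static int[] separate(int[] a) {
        int n = a.length;
        int[] b = new int[n], c = new int[n];
        for (int i = 0; i < n; i++) b[i] = f(a[i]);
        for (int i = 0; i < n; i++) c[i] = g(b[i]);
        return c;
    }

    // After fusion: one loop, the intermediate array shrinks to a scalar.
    static int[] fused(int[] a) {
        int n = a.length;
        int[] c = new int[n];
        for (int i = 0; i < n; i++) c[i] = g(f(a[i]));
        return c;
    }

    // After tiling the fused loop: blocks of size T improve locality and
    // expose block-level parallelism (each tile could run on its own unit).
    static int[] tiled(int[] a, int T) {
        int n = a.length;
        int[] c = new int[n];
        for (int t = 0; t < n; t += T)
            for (int i = t; i < Math.min(t + T, n); i++)
                c[i] = g(f(a[i]));
        return c;
    }

    public static void main(String[] args) {
        int[] a = new int[100];
        for (int i = 0; i < a.length; i++) a[i] = i;
        System.out.println(java.util.Arrays.equals(separate(a), fused(a))
                && java.util.Arrays.equals(fused(a), tiled(a, 16))); // true
    }
}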

2.4 Architecture Structural Specification

The growth of single-processor performance has reached a limit. Nowadays, dealing with parallelism is one of the major challenges in computer design. In domain-specific computers, the parallelization techniques used have mostly been based on the Message Passing Interface. The design of such complex computing architectures requires a specific abstract model and analysis to address the complexity of the application mapping. Until now, no method seems to provide a widely applicable solution to map applications on wide arrays of processors [31].

For predictable applications, a possible well-adapted platform is a tile-based multiprocessor system whose abstract model, referred to as a template, is described in [6]. This abstract model involves processing tiles communicating through a Network on Chip (NoC). Each tile contains local memories (for an inter-tile communication based on distributed shared memories) and computing elements. It is able to perform computing tasks, local memory access arbitration and management of data transfers on the NoC. Using a comparable underlying tile-based platform in our method, we propose a different abstract model that uses hardware optimizations to improve the performance of data intensive computing systems. Our architecture abstract model involves close and remote communications that can be mapped onto NoC, point-to-point or bus-based communications. The communications are realized through a double buffering mechanism, i.e., two buffers are used in an alternate manner to store the data of a producing or a consuming processor. This makes it possible for each processor to perform the data access and the computation in parallel. Such a mechanism dramatically enhances the system computing capabilities and avoids the data access bottleneck that is the main cause of computing performance decrease, especially in data intensive computing systems.


The actual realization of such a processor has been studied in [10] and implemented through an HLS-based flow [32]. The proposed method uses a static scheduling to map the application tasks onto the hardware architecture. This reduces the complexity of the data transfer controller and the corresponding area overhead in the implemented hardware.

3 Overview of Our Method

Our method provides an exact DSE approach to map data parallel applications onto a specific hardware accelerator. The analysis starts from a high level specification defined in the Array-OL language [33]. The method is based on a customizable architecture including parallel processors and distributed local memories used to hide the data transfer latency. The target application is transformed in order to enhance its data parallelism through data partitioning, and the architecture is customized by exploring different application-specific configurations. The blocks of manipulated data are streamed into the architecture. For the implementation of data intensive computing systems, several parallelism levels are possible:

• Inter-task parallelism, realized through a systolic processing of blocks of data
• Parallelism between data access and computation, realized through the double buffering mechanism
• Data parallelism in a single task, realized through pipelining the processing of the data stream or through the instantiation of parallel hardware resources

The architecture parallelism level and the size of the data blocks streamed through the architecture are chosen in order to hide the latency of data transfers behind the computing time.

Figure 23.1 presents the general flow of our methodology. The inputs are a MoC-based specification of the application and a library of abstract configurable components, namely processing elements and connections. The ESL design consists in exploring different refactoring transformations of the specification and possible data transfer and storage micro-architectures for the hardware system. Mapping constraints relate a given refactored application specification to a micro-architecture customization. Indeed, the micro-architecture configurations are inferred from a static resource-based scheduling of the application tasks. This scheduling has to meet the mapping constraints that are due to the architecture properties and are aimed at implementing optimized data intensive computing systems. This method can be considered as a meet-in-the-middle approach because many hardware optimizations for data intensive computing applications are already included in the architecture template. The data transfer and storage configuration of the template is inferred from the analysis and refactoring of the application specification.

To summarize, the proposed method uses a tiling of task repetitions to model the partitioning and mapping of the application tasks onto a multiprocessor system. A large space of mapping solutions is automatically explored and efficiently pruned to the Pareto solutions of the explored space.

Fig. 23.1 Overview of the proposed method (diagram: a MoC-based specification and a library of abstract components feed the system level exploration, which iterates refactoring, customization and mapping steps and produces estimations; the selected solutions are handed to code generation for ASIP synthesis or HLS)

As shown in Fig. 23.1, one or more Pareto mapping solutions can be realized through an HLS-based flow or through an ASIP customization flow. In this case, a back-end code generator has to be added to the proposed method to ensure the interface with the HLS or ASIP customization. In [32], a code generator is presented, which builds a loop-based C code from a customization of the abstract model of a single processor. The generated C code is used as an optimized input of an HLS-based design flow.

To illustrate our approach, we refer to an application called the low-pass spatial (LPS) filter. The LPS filter is used to eliminate the high spatial frequencies in the retina model [34, 35]. It is composed of four inter-dependent filters, as shown in Fig. 23.2. Each filter performs a low-pass filtering according to a given direction. For example, HFLR computes the value of the pixel of coordinates (i, j) at instant t = 1 as the weighted sum of the pixel (i, j - 1) at instant t = 1 and the pixel (i, j) at the previous instant t = 0.
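A direct sketch of this recurrence is given below; the exact weighting of the retina model in [34, 35] is not reproduced here, so the coefficient layout with a and (1 - a) is an assumption for illustration only.

// Sketch of the HFLR recurrence described above: the output pixel (i, j) at
// t = 1 is a weighted sum of its left neighbor (i, j-1) at t = 1 and of the
// same pixel (i, j) at t = 0. The weights a and (1 - a) are assumed.
public class HFLR {
    static double[][] filter(double[][] prev, double a) {
        int h = prev.length, w = prev[0].length;
        double[][] cur = new double[h][w];
        for (int i = 0; i < h; i++) {
            cur[i][0] = prev[i][0];                 // no left neighbor on the border
            for (int j = 1; j < w; j++)
                cur[i][j] = a * cur[i][j - 1] + (1 - a) * prev[i][j];
        }
        return cur;
    }

    public static void main(String[] args) {
        double[][] frame = new double[4][8];
        frame[2][3] = 1.0;                          // a single bright pixel
        double[][] out = filter(frame, 0.5);
        System.out.println(out[2][4]);              // smeared rightwards: 0.25
    }
}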

3.1 Outline

The rest of this chapter is organized as follows.


Fig. 23.2 Specification of a Low Pass Spatial Filter: HFLR, VFTD, HFRL and VFBU respectively stand for Horizontal Filter Left Right, Vertical Filter Top Down, HF Right Left and VF Bottom Up. p_t(i, j) is a pixel of coordinates (i, j) at an instant t, and a is the filtering factor

In Sect. 4, we describe the Array-OL formalism for the application specification and its associated refactoring transformations. Then, in Sect. 5, we propose an abstract architecture model for data parallel applications. In Sect. 6, we define a corresponding mapping model based on a static scheduling of the application tasks. In Sect. 7, we define a method to systematically apply a set of Array-OL refactoring transformations in order to optimize the memory hierarchy and the communication sub-system for a target application. In Sect. 8, we define the DSE approach used and illustrate the method on the low-pass spatial filter example. In Sect. 9, we show a detailed example in which our method is applied. Finally, in Sect. 10, we give a summary of our proposition.

4 High-level Design of Data Parallel Applications with Array-OL

Array-OL (Array-Oriented Language) is a formalism able to represent data intensive applications as a pipeline of tasks applied to multidimensional data arrays [33]. The analyzed applications must have affine array references and a compile-time known behavior. An Array-OL representation of the LPSF is given in Fig. 23.3. Each filter has an inter-repetition dependency and inter-task dependencies, i.e., the result of a task repetition is either used to compute the next task repetition or passed to the following task. In an Array-OL representation, we distinguish three kinds of tasks: elementary, compound (or composed) and repetitive tasks. An elementary task, e.g., the image generator in Fig. 23.3, is an atomic black box taken from a library; it cannot be decomposed into simpler tasks. A compound task, e.g., the LPSF filter in Fig. 23.3, can be decomposed into a hierarchy of simpler interconnected tasks. A repetitive task, e.g., the HFLR filter in Fig. 23.3, specifies how a task is repeated on different subsets of data: data parallelism. In an Array-OL representation, no information is given on the hardware realization of tasks. Figure 23.4 shows a detailed Array-OL representation of the HFLR filter.

Fig. 23.3 Sketch of an Array-OL specification for the LPSF filter (the image generator feeds the LPSF compound task, composed of the HFLR, VFTD, HFRL and VFBU filters, whose result goes to the image display; tasks are connected through input and output ports)

Fig. 23.4 (diagram) The HFLR repetitive task, with repetition space R1 = <480, 1080, ∞>, input and output arrays of size <1920, 1080, ∞>, an inter-repetition dependency vector d, and tilers T1 and T2 both defined by

\[ O = \begin{pmatrix} 0\\0\\0 \end{pmatrix}, \quad F = \begin{pmatrix} 1\\0\\0 \end{pmatrix}, \quad P = \begin{pmatrix} 4&0&0\\0&1&0\\0&0&1 \end{pmatrix}, \quad s_p = \langle 4 \rangle \]

Fig. 23.4 Array-OL specification of a HFLR filter. The filter executes 480 × 1080 repetitions (R1) per frame of 1920 × 1080 pixels. The task repetitions are executed on an infinite flow of frames. Temporal and spatial dimensions are all array dimensions, without any distinction from each other, but usually the infinite dimension is dedicated to time. The tilers (T1 and T2) describe the shape, the size and the parsing order of the patterns that partition the processed frames. The pattern s_p, which tiles the input data and is described by T1, is monodimensional. It contains four pixels and lies along the horizontal spatial dimension. The vector d is an inter-repetition dependency vector

The information on the array tiling is given by the origin vector O, the fitting matrix F and the paving matrix P. The origin vector specifies the coordinates of the first datum to be processed. The fitting matrix F says how to parse each data pattern, and the paving matrix P says how the chosen pattern covers the data array. The size of the pattern is denoted by s_p.

Array-OL is a single-assignment and deterministic language, i.e., each datum can be written only once, and any possible scheduling that respects the data dependencies produces the same result. For this reason, an application described in Array-OL can be statically scheduled. A crucial problem for an Array-OL description is to define the optimal granularity of the repetitions, i.e., to find an optimal data paving that optimizes the target implementation architecture. Glitia et al. [36] describe several Array-OL transformations. In our work, we use the fusion, the change paving and the tiling transformations, which are respectively equivalent to loop fusion [37], loop unrolling [38] and loop tiling [39, 40]. One of the biggest problems for a competitive exploration is to define an optimal composition of these transformations.


In fact, by composing them we can optimize the following criteria: the calculation overhead, the hierarchy complexity, the size of the intermediate arrays and the parallelism level. We propose to compose them in order to find an efficient architecture that masks the data access and transfer latencies. A small sketch of how a tiler (O, F, P) enumerates the data elements of a pattern is given below.
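As a concrete illustration of the tiler roles just described, the sketch below enumerates the elements addressed by one pattern, assuming the usual Array-OL addressing rule in which the element index is O + P·r + F·k for repetition index r and pattern-internal index k, taken modulo the array shape (Array-OL arrays are toroidal); the two-dimensional restriction and the printed values are illustrative.

// Enumerating the array elements addressed by a tiler (O, P, F).
public class Tiler {
    static int[] address(int[] O, int[][] P, int[] r, int[][] F, int[] k, int[] shape) {
        int d = O.length;
        int[] idx = new int[d];
        for (int a = 0; a < d; a++) {
            long v = O[a];
            for (int b = 0; b < r.length; b++) v += (long) P[a][b] * r[b]; // paving step
            for (int b = 0; b < k.length; b++) v += (long) F[a][b] * k[b]; // fitting step
            idx[a] = Math.floorMod((int) v, shape[a]);                     // toroidal wrap
        }
        return idx;
    }

    public static void main(String[] args) {
        // A pattern of 4 horizontal pixels (F), paved every 4 pixels horizontally
        // and every line vertically, as in the HFLR tiler of Fig. 23.4 restricted
        // to its two spatial dimensions.
        int[] O = {0, 0};
        int[][] P = {{4, 0}, {0, 1}};
        int[][] F = {{1}, {0}};
        int[] shape = {1920, 1080};
        int[] r = {2, 5};                 // third tile of line 5
        for (int k = 0; k < 4; k++) {
            int[] idx = address(O, P, r, F, new int[]{k}, shape);
            System.out.println("(" + idx[0] + ", " + idx[1] + ")"); // (8,5) ... (11,5)
        }
    }
}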

5 A Target Customizable Architecture

We propose a synchronous architectural model (Fig. 23.5) to realize data parallel applications. This model can be customized with respect to the data transfer, the data storage and the computing process to be accomplished. All the data transfers are synchronized and managed by FIFO queues, and the task execution depends on the presence of data in the FIFO queues.

Fig. 23.5 Target customizable architecture. (a) Abstract model: a mesh of processors P1–P6 with a global controller (CTRL) and an external memory (EXT. MEM) interface. (b) Detailed model: customizable sub-system for data transfer and storage, where each processor combines a local memory controller (CTRL), local memories (LM) and computing units (CU)


The data transferred by the FIFOs are stored into local memories, and the access to these memories is managed by a local memory controller. At the higher abstraction level (Fig. 23.5a), the architecture is a mesh of communicating processors Pk with an external memory (EXT. MEM) interface and a global controller (CTRL). The global controller starts the communication with the external memory by injecting data into the mesh of processors; it also synchronizes the external memory accesses. As shown in Fig. 23.5b, each single processor has essentially three customizable parts: a local memory controller (CTRL), a set of buffers (local memory, LM) and a computing unit (CU). The behavior of such a processor has been tested at the RTL level in [32] and can be summarized as follows. The local memories are based on a double buffering mechanism, so that the local memory controller and the computing unit can execute their tasks in parallel. The local memory controller performs two main actions: (1) managing the double buffering mechanism and (2) computing the local memory addresses to load data from the external memory and to read and transfer data to the computing units. A computing unit realizes an application-specific computing task.

Two processors may communicate through a stand-alone FIFO queue or through a local memory. In the former case, the local memory controller and the buffers are not instantiated; in the latter case, they are. Each buffer is a single-port memory, because the realization of multi-port memories is not yet mature and is subject to many technological problems [41]. For this reason, a processor receiving several data flows includes several distinct buffers, one for each communication flow. Consequently, all data transfers concerning different data flows can be realized in parallel with different local buffers and without any conflict.

For a given application, the data to be processed are first stored in the global shared external memory. Then, once the global controller has started the communication with the external memory, the data are streamed in groups of parallel blocks into the architecture. The size of the data blocks is chosen, on the one hand, to mask the data access latency (thanks to the double buffering) and, on the other hand, to respect the constraints on the external memory bandwidth.

In the proposed abstract model of the application architecture, several features can be customized through an application-driven design process:

1. The communication structure of the whole system, i.e., deciding whether two processors communicate through a distributed shared local memory or through the global external memory. The local communication speeds up the processing temporal performance but requires the instantiation of additional local memories. An exploration is needed to establish which kind of communication is the most profitable for the analyzed application, according to the data and task dependencies intrinsic to the application.
2. The number of computing units in a single processor. A single processor realizes a given repetitive task. Having more computing units makes it possible to improve the data parallelism level of the system and to treat in parallel different blocks of data from the same input data set.


3. The number and size of the local memories. This feature supports the realization of data parallelism. Indeed, by exploring different partitionings of the manipulated data, the method can infer the sizes and the number of needed local memories.
4. The functional behavior of a computing unit. A computing unit realizes the core instructions of an application task and has to be specified by the user of the design methodology.

A small configuration structure summarizing these four features is sketched below.
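The four customizable features can be captured in a container like the following; all field and type names are illustrative assumptions, not the API of the actual tool.

// Illustrative container for the customizable features of one processor of
// the template; the names are assumptions, not the chapter's actual API.
public class ProcessorConfig {
    enum CommKind { LOCAL_SHARED_MEMORY, EXTERNAL_MEMORY, FIFO_ONLY }

    final CommKind communication;   // feature 1: how the processor communicates
    final int numComputingUnits;    // feature 2: Ncu, parallel computing units
    final int[] localBufferSizes;   // feature 3: one double buffer per input flow
    final String computingTask;     // feature 4: user-specified core behavior

    ProcessorConfig(CommKind c, int ncu, int[] bufs, String task) {
        communication = c; numComputingUnits = ncu;
        localBufferSizes = bufs; computingTask = task;
    }
}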

6 A Static Temporal Model for Data Transfer and Storage Micro-Architecture Exploration

In this section, we specify the temporal model used to estimate the static scheduling and mapping of the application tasks onto a customized system architecture. We present this model progressively, giving details on the final system. First, in Sect. 6.1, we present the static scheduling of a single processor with respect to its computing task. Then, in Sects. 6.2 and 6.3, we present the static scheduling of communicating processors with respect to their data transfer and storage tasks. For each of the presented schedulings, we give a sketch of the underlying configurable hardware architecture, a generic execution timeline of the application tasks running on it, and the associated mapping constraints. The mapping constraints ensure the correctness of the computed system response; they are based on the use of optimized hardware mechanisms for data intensive computing, and they make it possible to explore possible hardware configurations and to select those that best fit the analyzed application.

6.1 Temporal Behavior of a Single Processor and Its Computing Time

A given processor is allocated to the execution of a repeated task of an application. For such a task, its core part, i.e., the set of instructions encompassed by the innermost task repetition, is realized by the computing unit inside the dedicated processor. This computing unit is pipelined and can be either an application-specific hardwired datapath or an application-specific instruction set. As shown in Fig. 23.6, the iterations of a task can be pipelined on the same hardware, but in order to avoid conflicts on stream or memory accesses, a minimum interval of time, called the Initiation Interval (II), has to pass between the beginnings of two successive iterations [42]. The time to access data grows with the number of iterations, while the datapath latency remains constant.

Fig. 23.6 A pipeline of iterations (IT1 ... ITN) during which two sub-tasks are realized: access to a local memory (LM) or FIFO, and output computing with an output FIFO transfer. If the number of iterations N is very high, the datapath latency can be neglected w.r.t. the time spent to access data

If the number of pipelined iterations is sufficiently high, the datapath latency can be neglected with respect to the time spent accessing data. This simplification leads to a transaction-based model with a reasonably high precision on the data transfer latencies and a lower precision on the execution time of a computing unit. Finally, the computing time of a processor is defined as follows:

Definition 1. Let c_{x_k} be the Initiation Interval of a processor P_k. The time to compute a set of data x_j on a computing unit of the processor P_k is:

\[ t_{com}^{k}(x_j) = c_{x_k} \, x_j . \]

As shown in Fig. 23.6, the Initiation Interval of a new task repetition depends on the latency of the data transfers on either the input or the output FIFO queues of the repeated task. The FIFO latency giving the worst-case time response is taken as the Initiation Interval of the repeated task [42].

6.2 Data Transfer and Storage Model of Communicating Processors

As shown in Fig. 23.7a, a processor can have input data blocks coming from the external memory and data blocks coming from neighbor or own local memories. The transfers of data blocks coming from the external memory are always sequentialized to avoid possible conflicts on concurrent accesses to the external memory.

Fig. 23.7 The double buffering mechanism allows to mask the time to fetch with the time to compute. (a) Sketch of a processor P_k with its input data flows (transfers from the external memory and from local memories) and its single output flow. (b) Timeline for sequential groups of iterations on a single processor: data pre-fetch and output computation run in parallel and are synchronized at the end of each group. The processor has multiple input flows (j) and a single output flow (i)

The transfers of data blocks coming from local memories are realized in parallel. Indeed, when a processor receives different data blocks from uncorrelated data sets, it allocates a dedicated local double buffer to store each transferred data block. To summarize, each processor of the proposed architecture can store several input data blocks, cumulatively called x_j, but produces a single output data block, called x_i. Thanks to the double buffering mechanism, each processor fetches the data needed to compute the next group of iterations while it computes the current group of iterations. One of the aims of the proposed method is to choose a data block size that allows masking the time to fetch with the time to compute. As a result, such a block size must respect the following mapping constraint:

\[ \max_j \{ t_{fetch}(x_j) \} \le t_{com}(x_i) \qquad (23.1) \]

As illustrated in Fig. 23.7b, the execution timeline of a processor has two parallel contributions: the time to pre-fetch and the time to compute. The duration of the longest of these contributions synchronizes the beginning of the computations of a new set of application task repetitions. The time to pre-fetch a set of data depends on the communication type, i.e., on whether the processor receives data from the external memory or from another processor. We propose a model for each of these transfer types.

6.2.1 Data Transfer with External Memory

The external memory is accessed in burst mode [43], i.e., there is a latency before accessing the first datum of a burst, then a new datum of the burst can be accessed at each processor cycle. A whole data partition is contained in a memory burst, and the access to a whole data partition is an atomic operation. It is possible to store m data per word of the external memory (for example, we can have four pixels of eight bits in each memory word of 32 bits). Under the above hypotheses, we propose the following definition:

Definition 2. Given L and m denoting respectively the latency before accessing a burst and the number of data per external memory word, the time to fetch a set of data x_j from an external memory is:

\[ t_{fetch}(x_j) = L(j) + \frac{x_j}{m} \quad \text{with } x_j, m \in \mathbb{N}. \]

The latency L(j) has three contributions: the latency due to the other data transfers between the target architecture and the external memory, the latency due to the burst address decoding, and the latency due to other data transfers that are external to the target architecture (these last two contributions are denoted L_m). In the worst case, this latency is:

\[ L(j) = L_m + \sum_{z \neq j} \left\{ L_m + \frac{x_z}{m} \right\} \qquad (23.2) \]

with x_z ∈ X^Mem, where X^Mem is the set of all the data transfers (in input and output) between the given architecture and the external memory. Hence, we have:

\[ t_{fetch}(x_j) = N_k L_m + \frac{1}{m} \, Vect(1) \, X^{Mem} \qquad (23.3) \]

where N_k is the number of data blocks exchanged between the target architecture and the external memory, and Vect(1) denotes a line vector whose coordinates are all 1.

6.2.2 Transfer Between Two Processors Through the Local Memories

In the proposed architecture, the data computed by a processor are immediately fetched into a second processor, in a dedicated buffer. Thus, the time to fetch a set of data x_j into a processor P_k corresponds to the time to compute the set of data x_j on a processor P_l. Consequently, the latency of the data transfer between two processors is defined as follows:

Definition 3. Given a couple of producing and consuming processors that communicate through a distributed shared local memory LM_j and exchange data blocks of size x_j, the time that the consuming processor employs to pre-fetch x_j is equal to the time that the producing processor employs to compute x_j, i.e.:

\[ t_{fetch}^{consumer}(x_j) = t_{com}^{producer}(x_j). \]


Fig. 23.8 A processor P_k receiving three input data blocks (x_0 from the external memory bus, x_j from a processor P_l with II = 3, and x_d from an inter-repetition dependency) and producing a single output data block x_i with II = 6 (R: number of input data read to compute a single output; II: Initiation Interval)

The following example applies the mapping constraint (23.1) and Definitions 1, 2 and 3.

Example 1. Let P_k be a processor, as presented in Fig. 23.8. It has three input flows: one from the external memory, one from a processor P_l and another due to an inter-repetition dependency. To avoid a deadlock, all the data needed by an inter-repetition are stored in internal memories without any latency. From Definitions 1 and 2, the time to fetch is t_fetch = max{L + x_0/m, 3 x_j} and the time to compute is t_com = 6 x_i. Thanks to the access affinity, we know that x_0 = 6 x_i and x_j = 3 x_i. By applying the mapping constraint (23.1), we must have either L + 6 x_i / m ≤ 6 x_i, which is possible only if m > 1, or 9 x_i ≤ 6 x_i, which is impossible. In this last case, the usage of data parallelism can mask the time to fetch.
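The arithmetic of this example can be checked with a few lines of code; the concrete values of L, m and x_i below are illustrative assumptions.

// Numeric check of Example 1: with x0 = 6*xi and xj = 3*xi, the external
// memory flow masks only if L + 6*xi/m <= 6*xi (needs m > 1), while the
// local-memory flow needs 9*xi <= 6*xi, which never holds without extra
// data parallelism. L, m and xi are illustrative values.
public class Example1Check {
    public static void main(String[] args) {
        double L = 30, m = 4, xi = 60;
        double x0 = 6 * xi, xj = 3 * xi;               // access affinity of the example
        double tFetch = Math.max(L + x0 / m, 3 * xj);  // Definitions 2 and 3
        double tCom = 6 * xi;                          // Definition 1 with c_x = 6
        System.out.println("t_fetch = " + tFetch + ", t_com = " + tCom
                + ", masked = " + (tFetch <= tCom));   // false: 9*xi > 6*xi
    }
}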

6.2.3 Scaling the Parallelism Level

The method, as presented until now, takes into account two levels of parallelism: (1) the parallelism between data access and computation and (2) the parallelism due to the pipelining of iterations on the same processor. It is possible to further scale the parallelism level by computing independent groups of data in parallel. We denote by N_{cu_k} the number of parallel computation units of a processor P_k, which is then able to compute N_{cu_k} blocks of data in parallel. To apply data parallelism, we distinguish between the processors communicating with the external memory and the processors communicating with each other. For the processors communicating with the external memory, the number of parallel computation units N_{cu_k} is limited by the memory bandwidth.


When a processor writes a data flow into the external memory, its number of parallel computation units N_{cu_k} is limited by the number of data per memory word m:

\[ N_{cu_k} \le m. \qquad (23.4) \]

When a processor receives an input data flow from the external memory, all the data are sequentially streamed on a unique input FIFO queue. It is possible to duplicate the internal memories and the computation units, but the time to fetch the input data is then multiplied by the number of parallel computation units N_{cu_k}, while the computation time is divided by a factor N_{cu_k}:

\[ N_{cu_k} \left( L + \frac{x_j}{m} \right) \le \frac{c_{x_k} x_i}{N_{cu_k}}. \qquad (23.5) \]

For processors communicating with each other, the parallelism level reduces both the time to fetch and the time to compute. The inequality (23.1) becomes \( \frac{c_{x_l} x_j}{N_{cu_l}} \le \frac{c_{x_k} x_i}{N_{cu_k}} \). Thanks to the affinity of the array accesses, i.e., x_j = k_{ij} x_i with k_{ij} ∈ ℕ, we can infer a constraint on the parallelism levels of two communicating processors:

\[ \frac{N_{cu_k}}{c_{x_k}} \le \frac{N_{cu_l}}{c_{x_l} \, k_{ij}}. \qquad (23.6) \]

6.3 Temporal Behavior of the Whole Architecture

The proposed method uses a load balancing criterion to execute all the network communications and applies a periodic global synchronization of the task repetitions. The number of instantiated local memories and computation units is inferred in order to maximize the load balancing between the communicating processors. To exemplify the temporal behavior of the whole architecture, we describe two kinds of communications: the communication between processors having the same parallelism level (Fig. 23.9a) and the communication between processors having different parallelism levels (see Fig. 23.9b, c). The corresponding timelines are given in Fig. 23.10. If the two processors have the same parallelism level, their computing units have comparable execution times. If the two processors have different parallelism levels, the computation units of the processor with the lower parallelism level are faster than those of the processor with the higher parallelism level. In the current analysis, we suppose that the whole system has a single clock frequency. Furthermore, the execution time of a computation unit depends on the number of data accessed by the unit itself; as a result, a faster computation unit accesses a lower number of data. Another solution could be to use a different clock frequency for each processor.

Fig. 23.9 Three kinds of communications. (a) Two processors having the same parallelism level use a point-to-point communication controller. (b) A processor having a higher parallelism level passes its output to its neighbor through a multiplexing communication controller (CTRL MUX). (c) A processor having a lower parallelism level passes its output to its neighbor through a demultiplexing communication controller (CTRL DeMUX)

The different clock cycles would be multiples of an equivalent clock cycle, and the multiplicity factors would be used to correct the formulas determining the execution time and, as a consequence, the inferred parallelism level.

Fig. 23.10 Examples of communication timelines (sets of iterations computed in parallel by the computing units of two communicating nodes, in cycles). (a) Case of two communicating nodes having the same parallelism level: usage of a CTRL PtoP. (b) Case of two communicating nodes where the first node has a higher parallelism level: usage of a CTRL MUX. (c) Case of two communicating nodes where the first node has a lower parallelism level: usage of a CTRL DeMUX

7 Implementation of Our Methodology

We generalize the mapping constraint (23.1) and Definitions 1, 2 and 3 to the whole processor network by taking into account the processor interconnections. For that, we consider the following definitions, presented in a progressive way.

Definition 4. Given a network of n processors, a column vector X of possible data block sizes produced or consumed by the processors in the network is:

\[ X = (x_0, \ldots, x_{n-1}) \]

where x_i is a data block produced or consumed by P_k, ∀k ∈ [0, n−1]. The vector X is associated with two vectors X_out and X_in. The coordinates of X_in (respectively X_out) are either 0 or those of X at the same position, corresponding to the input (respectively output) data blocks in the network. A coordinate x_j of X_in equals k_{ij} x_i, where x_i ∈ X_out and k_{ij} ∈ ℕ. The data blocks of size x_j can be received either from the external memory or from another processor of the network.


In the following definitions, we give the relation between X_in and X_out, and we distinguish the case where the input data blocks are received from the external memory.

Definition 5. Let X_in^Mem be a vector of possible sizes for the data blocks read from the external memory. The matrices K_δ and K_δ^Mem giving the mapping between the sizes of input and output data blocks are:

\[ X_{in} = K_\delta X_{out} \quad \text{for all the input communications} \]
\[ X_{in}^{Mem} = K_\delta^{Mem} X_{out} \quad \text{for the communications from the external memory} \]

where X_out and X_in are vectors of possible sizes for respectively output and input data blocks. An element δ_ij of K_δ (or K_δ^Mem) is defined as follows:

\[ \delta_{ij} = \begin{cases} k_{ij} & \text{if } x_j = k_{ij} x_i \text{ and } x_j \in X_{in} \ (\text{or respectively } x_j \in X_{in}^{Mem}) \\ 0 & \text{otherwise.} \end{cases} \]

Definition 6. Let X_out^Mem and X_out^Com be two vectors of possible sizes of data blocks respectively written into the external memory and exchanged by the processors. We define two matrices I_δ^Mem and I_δ^Com so that:

\[ X_{out}^{Mem} = I_\delta^{Mem} X_{out} \]
\[ X_{out}^{Com} = I_\delta^{Com} K_\delta X_{out} \]

where K_δ gives the mapping between the sizes of input and output data blocks for all the input communications. An element δ_ij of I_δ^Mem (or I_δ^Com) is:

\[ \delta_{ij} = \begin{cases} 1 & \text{if } x_j \in X_{out}^{Mem} \ (\text{or respectively } x_j \in X_{in} \setminus X_{in}^{Mem}) \\ 0 & \text{otherwise.} \end{cases} \]

Definition 7. A column vector C_x giving the Initiation Interval per processor is:

\[ C_x = \{ c_{x_k} : c_{x_k} \text{ is the Initiation Interval of processor } P_k \}. \]

Definition 8. A column vector N_cu giving the number of parallel computation units per processor is:

\[ N_{cu} = \{ N_{cu_k} : N_{cu_k} \text{ is the number of parallel computation units in } P_k \}. \]

From the above definitions, we infer the mapping criteria to mask the data access latency for the input, output and internal communications of the target architecture.

Mapping Criterion 1. Input communication from the external memory. Let X^Mem = X_in^Mem + X_out^Mem = (K_δ^Mem + I_δ^Mem) X_out, and let Diag(N_cu) be the diagonal matrix whose diagonal elements are the coordinates of the vector N_cu. It is possible to write Eq. 23.3 as follows:

\[ t_{fetch}(X^{Mem}) = Diag(N_{cu}) \, Vect(1)^T \left( N_k L_m + \frac{Vect(1)(K_\delta^{Mem} + I_\delta^{Mem}) X_{out}}{m} \right) \]

and Definition 1 as follows:

\[ t_{com}(X_{in}^{Mem}) = \frac{Diag(C_x) \, K_\delta^{Mem} X_{out}}{Diag(N_{cu})}. \]

The substitution of t_fetch(X^Mem) and t_com(X_in^Mem) in the mapping constraint (23.1) gives:

\[ \left( \frac{Vect(1)^T Vect(1) (K_\delta^{Mem} + I_\delta^{Mem})}{m} - \frac{Diag(C_x) K_\delta^{Mem}}{Diag(N_{cu}) \, Diag(N_{cu})} \right) X_{out} \le - Vect(1)^T N_k L_m. \]

Mapping Criterion 2. Output communication to the external memory. Let N_cu^T I_δ^Mem be the number of parallel computing units of the processors communicating with the external memory. From inequality (23.4), we infer:

\[ Diag(N_{cu}) \, I_\delta^{Mem} \le m. \]

Mapping Criterion 3. Given X_out^Com of Definition 6, let I_i and I_j be two square matrices such that the left multiplication I_j I_δ^Com selects the j-th line of I_δ^Com. From inequality (23.6), we infer:

\[ \forall i, j: \quad I_j \, Diag(C_x) \, I_\delta^{Com} K_\delta \, Diag(N_{cu})^{-1} \le I_i \, Diag(C_x) \, Diag(N_{cu})^{-1}. \]

The above mapping criteria form a system of inequalities whose variables are the parallelism levels N_cu and the sizes of the transferred data blocks X_out. Solving the system of inequalities means finding N_cu and X_out that mask the time to access data and respect the external memory bandwidth. A sketch of such a feasibility check is given below.
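The sketch below illustrates one solving step on a single producer/consumer pair, using scalar versions of the criteria; it is a simplified stand-in for the matrix formulation above, and all numeric parameters are illustrative assumptions.

// Enumerate candidate (Ncu, Xout) pairs and keep those that satisfy scalar
// forms of the mapping criteria: bandwidth (23.4), masking of the external
// memory fetch (23.5), and load balance between two processors (23.6).
public class MappingSearch {
    public static void main(String[] args) {
        double Lm = 30; int m = 4;              // user parameters as in Example 2
        int cxProd = 1, cxCons = 1, kij = 1;    // initiation intervals, access affinity
        for (int ncu = 1; ncu <= 8; ncu++) {
            for (int x = 60; x <= 960; x += 60) {
                boolean bandwidth = ncu <= m;                           // (23.4)
                boolean masked = ncu * (Lm + (double) x / m)
                               <= (double) cxCons * x / ncu;            // (23.5)
                boolean balanced = (double) ncu / cxCons
                               <= (double) ncu / (cxProd * kij);        // (23.6)
                if (bandwidth && masked && balanced)
                    System.out.println("Ncu=" + ncu + " Xout=" + x);    // feasible pair
            }
        }
    }
}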

8 Design Space Exploration

We propose a design space exploration (DSE) approach in order to efficiently map an application described in Array-OL onto an architecture as proposed in Fig. 23.5. The DSE chooses a task fusion that improves the execution time and uses the minimal inter-task patterns. Then, it changes the data paving in order to mask the latency of the data accesses. The exploration space is a set of solutions with a given parallelism, a fusion configuration and data block sizes that meet the constraints on the external memory bandwidth. The result of the exploration is a set of solutions that are optimal (also termed Pareto) with respect to two optimization criteria: the architecture latency and the amount of internal memory.

8.1 Optimization Criteria

A processor P_i contains a buffer of size LM_j for each input flow:

\[ LM_j(P_i) = 2 \, N_{cu_i} \, x_j . \]


The factor 2 is due to the double buffering. The total amount of used internal memory is:

\[ IM = \sum_i \sum_j LM_j(P_i). \]

We define the architecture latency as the latency to compute a single output image. As in our model the times to access data are always masked, we can approximate the architecture latency with the time necessary to execute all the output transfers towards the external memory:

\[ AL = Image_{size} \cdot \triangle(I_\delta^{Mem}) \]

where △ denotes the determinant. Image_size can be inferred from the Array-OL specification.
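Both criteria are straightforward to compute for a candidate solution, as the following sketch shows; the block sizes and parallelism levels used in the example correspond to the LPSF solution discussed later (Example 2), and the scalar stand-in for △(I_δ^Mem) is an assumption.

// Computing the two optimization criteria for a candidate solution:
// IM = sum over processors and input flows of 2*Ncu_i*x_j (double buffering),
// and AL approximated by the output traffic towards the external memory.
public class Criteria {
    static long internalMemory(int[] ncu, int[][] inputBlockSizes) {
        long im = 0;
        for (int i = 0; i < ncu.length; i++)
            for (int xj : inputBlockSizes[i])
                im += 2L * ncu[i] * xj;        // factor 2: double buffering
        return im;
    }

    static long architectureLatency(long imageSize, long outputTransfers) {
        return imageSize * outputTransfers;    // scalar stand-in for Image_size * det(I_delta^Mem)
    }

    public static void main(String[] args) {
        int[] ncu = {4, 4, 4, 4};                    // LPSF solution of Example 2
        int[][] in = {{480}, {480}, {480}, {480}};   // one input block per filter
        System.out.println(internalMemory(ncu, in)); // 15360, i.e. ~15K data
    }
}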

8.2 Array-OL Model on the Target Architecture

An Array-OL model can be directly mapped onto an architecture like the one presented in Fig. 23.5 by using the following rules:

1. The analysis starts from a canonical Array-OL specification, which is defined to be equivalent to a perfectly nested loop code [44]: it cannot contain a repetition around a composition of repetitive tasks.
2. The hierarchy of the Array-OL model is analyzed from the highest to the lowest level.
3. At the highest level, we infer the parameters of the mapping, i.e., C_x, I_δ^Mem, I_δ^com, K_δ and K_δ^Mem. Given a task task_i, let sp_out(task_i) and sp_in(j) be respectively the size of the output pattern and of the input patterns (j). The elements of K_δ and K_δ^Mem are computed as:

\[ \delta_{i,j} = \frac{\triangle(Diag(sp_{in}(j)))}{\triangle(Diag(sp_{out}(task_i)))} \qquad (23.7) \]

An element c_{x_i} of C_x is c_{x_i} = max_j {δ_{i,j}}. The values of the I_δ^Mem and I_δ^com elements depend on the inter-task links.
4. At each level, we distinguish among an elementary, a composite or a repetitive task.
   • If the task is elementary or a composition of elementary tasks, a set of library elements is instantiated to realize it.
   • When a task is a repetition, we distinguish two cases: if it is a repetition of a single task, we instantiate a single processor; if it is a repetition of a compound task, we instantiate a set of processors in a SIMD, MIMD or pipelined configuration. The repetitions of the same task are iterated on the same hardware or executed in parallel according to the mapping criteria.
5. Each instantiated processor contains at least a datapath (of library elements) and may contain some local buffers and a local memory controller.

Fig. 23.11 A possible canonical Array-OL specification for a LPSF filter (a chain of four repetitive tasks connected through their input and output ports)

Example 2. Mapping constraints for a LPSF filter in an Array-OL specification. Figure 23.11 shows a possible canonical specification for a LPSF filter. The specification contains four elementary tasks, thus we instantiate four processors (P_1, P_2, P_3, P_4). The user-specified parameters are m = 4 and L_m = 30. From the Array-OL specification analysis, we infer the following parameters of the mapping criteria:

\[ C_x = (1, 1, 1, 1); \]

\[ I_\delta^{Mem} = \begin{pmatrix} 0&0&0&0&0&0\\ 0&0&0&0&0&0\\ 0&0&1&0&0&0\\ 0&0&0&0&0&0\\ 0&0&0&0&0&0\\ 0&0&0&0&0&1 \end{pmatrix}; \quad I_\delta^{com} = \begin{pmatrix} 0&0&0&0&0&0\\ 0&1&0&0&0&0\\ 0&0&0&0&0&0\\ 0&0&0&0&0&0\\ 0&0&0&0&1&0\\ 0&0&0&0&0&0 \end{pmatrix}; \]

\[ K_\delta = \begin{pmatrix} 0&1&0&0&0&0\\ 0&0&1&0&0&0\\ 0&0&0&0&0&0\\ 0&0&0&0&1&0\\ 0&0&0&0&0&1\\ 0&0&0&0&0&0 \end{pmatrix}; \quad K_\delta^{Mem} = \begin{pmatrix} 0&1&0&0&0&0\\ 0&0&0&0&0&0\\ 0&0&0&0&0&0\\ 0&0&0&0&1&0\\ 0&0&0&0&0&0\\ 0&0&0&0&0&0 \end{pmatrix}. \]

A 4,000-line Java code (available on demand at http://www.es.ele.tue.nl/rcorvino/tools.html) automatically extracts and processes these parameters and obtains the following results by solving the inequality system of the mapping criteria: N_cu ≤ (4, 4, 4, 4); X_out ≤ (480, 480, 480, 480). These constraints are used to change the data paving in the Array-OL specification.

The Java program also computes the optimization criteria to enable the subsequent design space exploration and performs the consequent solution classification and selection: IM = 15K (data), AL = 4M (cycles).
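The core of rule 3 above reduces to a ratio of pattern volumes; a minimal sketch of this computation is shown below, where the pattern sizes are those of the LPSF filters.

// Sketch of rule 3: delta_ij = det(Diag(sp_in(j))) / det(Diag(sp_out(i)))
// (Eq. 23.7), i.e. the ratio of the pattern volumes, and c_xi = max_j delta_ij.
public class MappingParams {
    static double volume(int[] pattern) {      // det(Diag(sp)) = product of sizes
        double v = 1;
        for (int s : pattern) v *= s;
        return v;
    }

    static double delta(int[] spIn, int[] spOut) {
        return volume(spIn) / volume(spOut);
    }

    public static void main(String[] args) {
        // LPSF case: every filter consumes and produces patterns of the same
        // volume (four pixels), so delta = 1 and Cx = (1,1,1,1) as in Example 2.
        System.out.println(delta(new int[]{4}, new int[]{4})); // 1.0
    }
}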

8.3 DSE Flow

Starting from a canonical form, all the possible task fusions are explored (according to the method explained in the next section). The obtained Array-OL model is mapped onto an architecture as presented in Fig. 23.5, and the repetition granularity of the merged tasks is changed in order to mask the external memory latency. The granularity is changed through a change paving, so as to have △(Diag(sp_out(task_i))) ≥ x_i. Finally, the obtained solutions are evaluated against the amount of internal memory used and the architecture latency, and the Pareto solutions are chosen.

8.4 Reducing the Space of Possible Fusions

To reduce the exploration space of the possible fusions, we propose to adapt the method used by Talal et al. [45] to generate optimal coalition structures. Given a number of tasks n, we can map the space of possible fusions onto the integer partition of n. An integer partition is a set of positive integer vectors whose components add up to n. For example, the integer partition of n = 4 is [1,1,1,1], [2,1,1], [2,2], [3,1], [4]. Each vector of the integer partition can be a mapping for several possible fusions, as shown in Fig. 23.12. Let a macrotask be the result of a fusion; as proposed by Talal et al., we reduce the number of sub-spaces by merging the sub-spaces whose solutions contain the same number of macrotasks. For the example of Fig. 23.12, the sub-spaces mapped on the integer partitions [3,1] and [2,2] are merged. In this way, the number of sub-spaces is limited to n. A sketch of the partition generation is given below.

This mapping reduces the number of comparisons between the possible solutions. In fact, we search for the Pareto solutions of each sub-space and compare them to each other in order to find the Pareto solutions of the whole space. In Fig. 23.12, we perform 32 comparisons instead of 56. The Pareto solutions are denoted by Psol_i in Fig. 23.12. Among these solutions, a user can choose the one most adapted to his or her objectives. In our case, we have chosen the solution Psol_1, which has the most advantageous trade-off between the used internal memory and the architecture latency. The mapping constraints for this solution are given in Example 2.
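The enumeration of integer partitions that underlies this reduction can be sketched as follows; the recursive generator below is a standard formulation, not code from the chapter's tool.

// Generating the integer partitions of n used to structure the fusion space
// (for n = 4: [4], [3,1], [2,2], [2,1,1], [1,1,1,1]); sub-spaces whose
// partitions have the same number of parts are then merged, bounding the
// number of sub-spaces by n.
import java.util.ArrayList;
import java.util.List;

public class Partitions {
    static void partitions(int n, int max, List<Integer> cur, List<List<Integer>> out) {
        if (n == 0) { out.add(new ArrayList<>(cur)); return; }
        for (int k = Math.min(n, max); k >= 1; k--) {  // keep parts non-increasing
            cur.add(k);
            partitions(n - k, k, cur, out);
            cur.remove(cur.size() - 1);
        }
    }

    public static void main(String[] args) {
        List<List<Integer>> out = new ArrayList<>();
        partitions(4, 4, new ArrayList<>(), out);
        System.out.println(out); // [[4], [3, 1], [2, 2], [2, 1, 1], [1, 1, 1, 1]]
    }
}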

Fig. 23.12 Exploration for a LPSF filter (m = 4, Lm = 30), plotting the internal memory IM (data) against the architecture latency AL (cycles) for the fusion sub-spaces [1,1,1,1], [2,1,1], [2,2]/[3,1] and [4]; the Pareto solutions are marked Psol1 and Psol2. Solutions merging VFTD and HFRL have to store a whole image; thus, they use 2M of internal memory

Blocks of data scanning the input image from the top left to the bottom right corner are pipelined on the HFLR and VFTD filters. It is possible to process up to four parallel blocks of data per filter. Each block has to contain at least 60 data items to mask the fetch time. The execution of the HFRL and VFBU filters is similar, except that it starts from the bottom right corner and stops at the top left corner.

9 Case Study: Application to Hydrophone Monitoring

In this section, we apply our method to a more complex example. The considered application implements hydrophone monitoring. A hydrophone is an electroacoustic transducer that converts underwater sounds into electrical signals for further analysis. The signals corresponding to underwater sounds first undergo a spectral analysis performed by a Fast Fourier Transform (FFT). Then, they are partitioned into different beams representing different monitored spatial directions, frequencies and sensors. The beams are processed in successive steps in order to extract characteristic frequencies. The functional blocks of the application are represented in Fig. 23.13. Table 23.1 gives the sizes of the input and output arrays processed by each functional block. It also gives the sizes of the input and output patterns with which each task consumes and produces its input and output arrays, respectively.


Fig. 23.13 Array-OL specification of a hydrophone monitoring application

Table 23.1 Sizes of input and output arrays of the hydrophone monitoring application blocks

Block   Input array                          Output array       Input pattern   Output pattern   Number of task repetitions
FFT     {512 × ∞}                            {512 × 256 × ∞}    512             256              {512, ∞}
Beams   {512 × 256 × ∞}, {128 × 200 × 192}   {128 × ∞ × 200}    192, 192        1                {128, 200, ∞}
Norm    {128 × ∞ × 200}                      {128 × ∞ × 200}    1               1                {128, 200, ∞}
Bands   {128 × ∞ × 200}                      {128 × ∞}          200             1                {128, ∞}
Int1    {128 × ∞}                            {128 × ∞}          8               1                {128, ∞}
Stab    {128 × ∞}, {128 × 8}                 {128 × ∞}          8, 8            1                {128, ∞}
Int2    {128 × ∞}                            {128 × ∞}          8               1                {128, ∞}

Each block takes its inputs from the preceding block. The Beams and Stab blocks have secondary inputs that are tables of coefficients.

9.1 Analysis Steps

The input specification of the application is given as a series of repeated tasks. The textual representation of this specification, i.e., the actual input of our automatic DSE tool, is shown in Fig. 23.14. The set of task repetitions can be multidimensional, with an infinite dimension. In order to realize such an application, it is necessary to specify which repetitions are mapped in space (thus realized by parallel physical processors) and which are mapped in time (thus realized in a pipeline on the same hardware). In our analysis, we consider that all repetitions indexed by finite numerical values can potentially be realized in parallel on different hardware resources. The repetitions indexed by ∞ (written -1 in the textual format) are realized in a pipeline. It is possible to transform the specified model to redistribute repetitions from time to space and conversely. Once we have the input specification, the analysis goes through the following steps:


#####################################
# TN=Task Name
# I(O)DN=Input (Output) Dependency Name
# I(O)O=Input (Output) Origin
# I(O)P=Input (Output) Paving
# I(O)F=Input (Output) Fitting
# I(O)A=Input (Output) Array
# I(O)M=Input (Output) Motif
#####################################
# FFT
#####################################
TN FFT
TR {512,-1}
IDN FFT_in
IA {512,-1}
IO {0,0}
IP {{1,0},{0,512}}
IF {0,1}
IM {512}
ODN FFT_out
OA {512,256,-1}
OO {0,0,0}
OP {{1,0,0},{0,0,1}}
OF {{0,1,0}}
OM {256}
END
#####################################
...

#####################################
# Beams
#####################################
TN Beams
TR {128,200,-1}
IDN Beams_in
IA {512,256,-1}
IO {0,28,0}
IP {{4,0,0},{0,1,0},{0,0,1}}
IF {{1,0,0}}
IM {192}
IDN Beams_in1
IA {128,200,192}
IO {0,0,0}
IP {{1,0,0},{0,1,0},{0,0,0}}
IF {{0,0,1}}
IM {192}
ODN Beams_out
OA {128,200,-1}
OO {0,0,0}
OP {{1,0,0},{0,1,0},{0,0,1}}
OF {{0,0,0}}
OM {1}
#####################################
...

Fig. 23.14 Partial textual input specification of our DSE tool for the hydrophone monitoring application
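The fields of this textual format can be parsed mechanically. For instance, the TR line is what separates the repetitions mapped in space from those mapped in time, as discussed above; the following mini-parser is a hypothetical illustration, not the tool's actual front end.

def split_repetitions(tr_field: str):
    # Split a TR field such as '{128,200,-1}' into the finite repetition
    # dimensions (candidates for parallel hardware) and the number of
    # infinite dimensions, written -1, which are realized as a pipeline.
    dims = [int(v) for v in tr_field.strip("{}").split(",")]
    space = [d for d in dims if d != -1]
    time = dims.count(-1)
    return space, time

print(split_repetitions("{128,200,-1}"))  # ([128, 200], 1)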

1. For each task, the Initiation Interval is computed according to formula (23.7). The following results are obtained:

II(FFT) = 2, II(Beams) = 192, II(Norm) = 1, II(Bands) = 200, II(Int1) = 8, II(Stab) = 8, II(Int2) = 8.
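These values are consistent with the initiation interval being the ratio between the sizes of the consumed and produced patterns of Table 23.1. Under that assumption (the actual formula (23.7) appears earlier in the chapter), the computation can be sketched as follows; the PATTERNS table is a transcription of Table 23.1, primary inputs only.

# (size of consumed input pattern, size of produced output pattern) per block.
PATTERNS = {
    "FFT": (512, 256), "Beams": (192, 1), "Norm": (1, 1),
    "Bands": (200, 1), "Int1": (8, 1), "Stab": (8, 1), "Int2": (8, 1),
}

def initiation_interval(block: str) -> int:
    # Assumed reading of formula (23.7): consumed over produced pattern size.
    consumed, produced = PATTERNS[block]
    return consumed // produced

print({b: initiation_interval(b) for b in PATTERNS})
# {'FFT': 2, 'Beams': 192, 'Norm': 1, 'Bands': 200, 'Int1': 8, 'Stab': 8, 'Int2': 8}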

2. In order to divide and conquer the space of possible task fusions within the application, we consider the integer partitions of 7, where 7 is the number of tasks in the hydrophone monitoring application. There are 15 integer partitions of 7, as presented in Table 23.2. Each of these partitions maps a group of possible task fusions, represented in the column "Fusions" of Table 23.2. In this column, the notation *—t1, t2—* represents the fusion of the two tasks t1 and t2. For each integer partition, we only consider the possible fusions. Other fusion configurations are not possible because of inter-task dependencies, which are captured and analyzed through a Links matrix, given in Table 23.3 (a sketch of the corresponding feasibility test follows the table).
3. For each fusion, we compute the parametric matrices of the mapping criteria 1, 2 and 3, which are Diag(Cx), KδMem and IδMem. These matrices are used to automatically infer the inequality constraints on the internal buffer sizes Xout and the parallelism level NCU of a task-fusion implementation.


Table 23.2 Possible integer partitions of 7, mapping sub-spaces of fusion solutions

Partition          Fusions
[1,1,1,1,1,1,1]    [FFT, Beams, Norm, Bands, Int1, Stab, Int2]
[2,1,1,1,1,1]      [*—FFT, Beams—*, Norm, Bands, Int1, Stab, Int2]
                   [FFT, *—Beams, Norm—*, Bands, Int1, Stab, Int2]
                   [FFT, Beams, *—Norm, Bands—*, Int1, Stab, Int2]
                   [FFT, Beams, Norm, *—Bands, Int1—*, Stab, Int2]
                   [FFT, Beams, Norm, Bands, *—Int1, Stab—*, Int2]
                   [FFT, Beams, Norm, Bands, Int1, *—Stab, Int2—*]
[3,1,1,1,1]        [*—FFT, Beams, Norm—*, Bands, Int1, Stab, Int2]
                   [FFT, *—Beams, Norm, Bands—*, Int1, Stab, Int2]
                   ...
[2,2,1,1,1]        [*—FFT, Beams—*, *—Norm, Bands—*, Int1, Stab, Int2]
                   [*—FFT, Beams—*, Norm, *—Bands, Int1—*, Stab, Int2]
                   [*—FFT, Beams—*, Norm, Bands, *—Int1, Stab—*, Int2]
                   [*—FFT, Beams—*, Norm, Bands, Int1, *—Stab, Int2—*]
                   [FFT, *—Beams, Norm—*, Bands, Int1, *—Stab, Int2—*]
                   ...
[4,1,1,1]          ...
[3,2,1,1]          ...
[2,2,2,1]          ...
[5,1,1]            ...
[4,2,1]            ...
[3,3,1]            ...
[3,2,2]            ...
[6,1]              ...
[5,2]              ...
[4,3]              ...
[7]                [*—FFT, Beams, Norm, Bands, Int1, Stab, Int2—*]

Table 23.3 Links matrix giving the task dependencies of the hydrophone application. A "1" means that task j depends on task i

         FFT   Beams   Norm   Bands   Int1   Stab   Int2
FFT       0      1       0      0       0      0      0
Beams     0      0       1      0       0      0      0
Norm      0      0       0      1       0      0      0
Bands     0      0       0      0       1      0      0
Int1      0      0       0      0       0      1      0
Stab      0      0       0      0       0      0      1
Int2      0      0       0      0       0      0      0
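The feasibility test that prunes the impossible fusions can be sketched as follows: a candidate macrotask is accepted only if its tasks form a chain of direct dependencies in the Links matrix. This is an illustrative reconstruction, not the tool's actual code.

TASKS = ["FFT", "Beams", "Norm", "Bands", "Int1", "Stab", "Int2"]
# LINKS[i][j] == 1 iff task j directly depends on task i (Table 23.3).
LINKS = [[1 if j == i + 1 else 0 for j in range(7)] for i in range(7)]

def fusion_is_possible(group):
    # A candidate macrotask is feasible iff its tasks form a dependency chain.
    idx = sorted(TASKS.index(t) for t in group)
    return all(LINKS[a][b] == 1 for a, b in zip(idx, idx[1:]))

print(fusion_is_possible(["FFT", "Beams"]))  # True
print(fusion_is_possible(["FFT", "Norm"]))   # False: Beams sits between them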

These matrices have different values and dimensions depending on the considered fusion. Below, we illustrate the method by giving the parametric matrices and the mapping criteria for two possible fusions.


Fig. 23.15 Example of configuration: no merged tasks

Example 3. This example considers the case where no tasks are merged, which is denoted [FFT, Beams, Norm, Bands, Int1, Stab, Int2]. Figure 23.15 gives a simplified representation of the tasks and how they communicate with each other and with the external memory. In this case, as the tasks are not merged, all communications are achieved through the external memory. Thus, there are 16 links, associated with unknown variables X. These variables give qualitative and quantitative characterizations of a specific communication. From a qualitative viewpoint, they can be classified as output or input links, denoted Xout and Xin respectively. They can also be classified as links with the external memory or inter-task links, denoted XMem and XCom respectively. For the example of Fig. 23.15, all the links are XMem. From a quantitative viewpoint, the value of the X variables, computed through the mapping criteria 1, 2 and 3, gives the minimum storage requirement needed to ensure the correctness of communications. The parametric matrices are:

Diag(Cx) = diag(2, 2, 192, 192, 192, 1, 1, 200, 200, 8, 8, 8, 8, 8, 8, 8),

a 16 × 16 diagonal matrix with one entry per link, whose values match the initiation intervals computed at step 1 (two links for FFT, three for Beams, two for Norm, two for Bands, two for Int1, three for Stab and two for Int2).

The matrices Kδ and KδMem (which coincide here, since every link is a memory link) couple each input link with the output link of the same task, weighted by the task granularity: their nonzero entries are (1,2) = 2 for FFT, (3,5) = (4,5) = 192 for Beams, (6,7) = 1 for Norm, (8,9) = 200 for Bands, (10,11) = 8 for Int1, (12,14) = (13,14) = 8 for Stab and (15,16) = 8 for Int2. IδMem is the diagonal binary matrix selecting the seven output links, IδMem = diag(0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1), and, since no communication is inter-task in this configuration, IδCom is the 16 × 16 zero matrix.
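This structure is mechanical enough to generate automatically: given the task chain, the per-task granularities and the secondary inputs, the link list and the nonzero entries of Kδ follow. A hypothetical sketch (0-based indices, so the pair (0, 1) below corresponds to the entry (1,2) above):

II = {"FFT": 2, "Beams": 192, "Norm": 1, "Bands": 200, "Int1": 8, "Stab": 8, "Int2": 8}
EXTRA_INPUTS = {"Beams": 1, "Stab": 1}  # secondary coefficient inputs

# Build the 16 links of Example 3: each task gets its input link(s), then its output link.
links, k_delta = [], {}
for task, ii in II.items():
    n_in = 1 + EXTRA_INPUTS.get(task, 0)
    in_ids = [len(links) + i for i in range(n_in)]
    links += [(task, "in")] * n_in
    out_id = len(links)
    links.append((task, "out"))
    for i in in_ids:
        k_delta[(i, out_id)] = ii  # couple each input link to the task's output link

print(len(links))                   # 16
print(sorted(k_delta.items())[:3])  # [((0, 1), 2), ((2, 4), 192), ((3, 4), 192)]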

By using these matrices with the mapping criteria 1, 2 and 3, we obtain a system of linearly independent inequalities in the unknown variables NCU and Xout. The resolution of this system of inequalities gives:

Xout ≥ (0.0, 640.0, 0.0, 0.0, 4.987013, 0.0, 960.0, 0.0, 9.552238, 0.0, 213.33333, 0.0, 0.0, 112.94118, 0.0, 213.33333)^T,

where Xout = Diag(0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1) · X, i.e., X restricted to the output links, and

Ncu ≤ (4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0)^T.

Example 4. This example considers the case where all tasks of the application are merged, denoted [*—FFT, Beams, Norm, Bands, Int1, Stab, Int2—*]. Figure 23.16 gives the communication structure, which counts ten links: four with the external memory and six inter-task links. For this example, the parametric matrices are:

Diag(Cx) = diag(2, 2, 192, 192, 1, 200, 8, 8, 8, 8),

a 10 × 10 diagonal matrix with one granularity entry per link. As in Example 3, Kδ and KδMem couple each input link with the output link of the same stage, IδMem is the binary matrix selecting the four links that go through the external memory, and IδCom is the binary matrix selecting the six inter-task links.

By using these matrices with the mapping criteria 1, 2 and 3, we obtain a system of linearly independent inequalities in the unknown variables NCU and X. The resolution of this system yields:

Xout ≥ (0.0, 240.0, 0.0, 2.5, 768000.0, 3840.0, 480.0, 0.0, 60.0, 480.0)^T,

where Xout = Diag(0, 1, 0, 1, 1, 1, 1, 0, 1, 1) · X, and

Ncu ≤ (4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0)^T.


Fig. 23.16 Example of configuration: all tasks are merged

4. The previous method is applied to all fusion possibilities of Table 23.2. For each possibility, the method computes the maximum parallelism level and the data granularity that meet the external memory bandwidth. The total number of analyzed fusions is 54; among them, three are selected. The overall exploration takes 3 s.

10 Summary

We presented a method to explore the space of possible data transfer and storage micro-architectures for data intensive applications. Such a method is very useful for finding efficient implementations of these applications that meet the performance requirements of a system-on-chip. It starts from a canonical representation of an application in a language named Array-OL and applies a set of loop transformations so as to infer an application-specific architecture that masks the data transfer time with the time needed to perform the computations. For that purpose, we proposed a customizable model of the target architecture, including FIFO queues and a double buffering mechanism. The mapping of an application onto this architecture is performed through a flow of Array-OL model transformations aimed at improving the parallelism level and reducing the size of the used internal memories. We used a


method based on integer partitions to reduce the space of explored transformation scenarios. The method has been illustrated on a case study consisting of an implementation of a hydrophone monitoring application, as found in sonar signal processing. Our method is intended to serve in Gaspard2, an Array-OL framework able to map Array-OL models onto different kinds of target architectures [46]. While the proposed implementation of our method uses mapping (and scheduling) constraints to provide optimal solutions, it is limited to the case of simple linear pipelines and does not provide sufficient precision on the temporal behavior of the data transfers. We are currently working on an improved method using abstract clocks to precisely describe the data transfer and storage of data intensive computing systems. The new method will be able to explore several application model transformations such as task fusion (as in the current method), paving change, unrolling and tiling.

References

1. Tony Hey, Stewart Tansley, and Kristin Tolle, editors. The Fourth Paradigm: Data-Intensive Scientific Discovery. 2009.
2. Jianwen Zhu and Nikil Dutt. Electronic system-level design and high-level synthesis. In Laung-Terng Wang, Yao-Wen Chang, and Kwang-Ting (Tim) Cheng, editors, Electronic Design Automation, pages 235–297. Morgan Kaufmann, Boston, 2009.
3. Felice Balarin, Massimiliano Chiodo, Paolo Giusto, Harry Hsieh, Attila Jurecska, Luciano Lavagno, Claudio Passerone, Alberto Sangiovanni-Vincentelli, Ellen Sentovich, Kei Suzuki, and Bassam Tabbara. Hardware-Software Co-Design of Embedded Systems: The POLIS Approach. Kluwer Academic Publishers, Norwell, MA, USA, 1997.
4. R. Ernst, J. Henkel, Th. Benner, W. Ye, U. Holtmann, D. Herrmann, and M. Trawny. The COSYMA environment for hardware/software cosynthesis of small embedded systems. Microprocessors and Microsystems, 20(3):159–166, 1996.
5. B. Kienhuis, E. Deprettere, K. Vissers, and P. van der Wolf. An approach for quantitative analysis of application-specific dataflow architectures. In Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures and Processors, pages 338–349, July 1997.
6. Sander Stuijk. Predictable Mapping of Streaming Applications on Multiprocessors. PhD thesis, Technische Universiteit Eindhoven, The Netherlands, 2007.
7. Andreas Gerstlauer and Daniel D. Gajski. System-level abstraction semantics. In Proceedings of the 15th International Symposium on System Synthesis (ISSS '02), pages 231–236, New York, NY, USA, 2002. ACM.
8. P. R. Panda, F. Catthoor, N. D. Dutt, K. Danckaert, E. Brockmeyer, C. Kulkarni, A. Vandercappelle, and P. G. Kjeldsberg. Data and memory optimization techniques for embedded systems. ACM Transactions on Design Automation of Electronic Systems, 6:149–206, April 2001.
9. F. Catthoor, K. Danckaert, C. Kulkarni, E. Brockmeyer, P. G. Kjeldsberg, T. Van Achteren, and T. Omnes. Data Access and Storage Management for Embedded Programmable Processors. Springer, 2002.
10. Rosilde Corvino, Abdoulaye Gamatié, and Pierre Boulet. Architecture exploration for efficient data transfer and storage in data-parallel applications. In Pasqua D'Ambra, Mario Guarracino, and Domenico Talia, editors, Euro-Par 2010 – Parallel Processing, volume 6271 of Lecture Notes in Computer Science, pages 101–116. Springer Berlin/Heidelberg, 2010.


11. Lech Józwiak, Nadia Nedjah, and Miguel Figueroa. Modern development methods and tools for embedded reconfigurable systems: A survey. Integration, the VLSI Journal, 43(1):1–33, 2010.
12. Edward A. Lee and David G. Messerschmitt. Synchronous data flow. Proceedings of the IEEE, 75(9):1235–1245, September 1987.
13. A. Sangiovanni-Vincentelli and G. Martin. Platform-based design and software design methodology for embedded systems. IEEE Design & Test of Computers, 18(6):23–33, November/December 2001.
14. Giuseppe Ascia, Vincenzo Catania, Alessandro G. Di Nuovo, Maurizio Palesi, and Davide Patti. Efficient design space exploration for application specific systems-on-a-chip. Journal of Systems Architecture, 53(10):733–750, 2007.
15. F. Balasa, P. Kjeldsberg, A. Vandecappelle, M. Palkovic, Q. Hu, H. Zhu, and F. Catthoor. Storage estimation and design space exploration methodologies for the memory management of signal processing applications. Journal of Signal Processing Systems, 53(1):51–71, November 2008.
16. Yong Chen, Surendra Byna, Xian-He Sun, Rajeev Thakur, and William Gropp. Hiding I/O latency with pre-execution prefetching for parallel applications. In ACM/IEEE Supercomputing Conference (SC '08), page 40, 2008.
17. P. R. Panda, F. Catthoor, N. D. Dutt, K. Danckaert, E. Brockmeyer, C. Kulkarni, A. Vandercappelle, and P. G. Kjeldsberg. Data and memory optimization techniques for embedded systems. ACM Transactions on Design Automation of Electronic Systems, 6(2):149–206, 2001.
18. H. T. Kung. Why systolic architectures? Computer, 15(1):37–46, 1982.
19. Abdelkader Amar, Pierre Boulet, and Philippe Dumont. Projection of the Array-OL specification language onto the Kahn process network computation model. In Proceedings of the 8th International Symposium on Parallel Architectures, Algorithms and Networks (ISPAN '05), pages 496–503, 2005.
20. D. Kim, R. Managuli, and Y. Kim. Data cache and direct memory access in programming mediaprocessors. IEEE Micro, 21(4):33–42, July 2001.
21. Jason D. Hiser, Jack W. Davidson, and David B. Whalley. Fast, accurate design space exploration of embedded systems memory configurations. In Proceedings of the 2007 ACM Symposium on Applied Computing (SAC '07), pages 699–706, New York, NY, USA, 2007. ACM.
22. Q. Hu, P. G. Kjeldsberg, A. Vandecappelle, M. Palkovic, and F. Catthoor. Incremental hierarchical memory size estimation for steering of loop transformations. ACM Transactions on Design Automation of Electronic Systems, 12(4):50, 2007.
23. Yong Chen, Surendra Byna, Xian-He Sun, Rajeev Thakur, and William Gropp. Hiding I/O latency with pre-execution prefetching for parallel applications. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing (SC '08), pages 1–10, 2008.
24. P. K. Murthy and E. A. Lee. Multidimensional synchronous dataflow. IEEE Transactions on Signal Processing, 50(8):2064–2079, August 2002.
25. F. Deprettere and T. Stefanov. Affine nested loop programs and their binary cyclo-static dataflow counterparts. In Proceedings of the Conference on Application-Specific Systems, Architectures, and Processors, pages 186–190, 2006.
26. Albert Cohen, Marc Duranton, Christine Eisenbeis, Claire Pagetti, Florence Plateau, and Marc Pouzet. N-synchronous Kahn networks: a relaxed model of synchrony for real-time systems. In Conference Record of the 33rd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL '06), pages 180–193, 2006.
27. Sylvain Girbal, Nicolas Vasilache, Cédric Bastoul, Albert Cohen, David Parello, Marc Sigler, and Olivier Temam. Semi-automatic composition of loop transformations for deep parallelism and memory hierarchies. International Journal of Parallel Programming, 34:261–317, 2006.
28. Mark Thompson, Hristo Nikolov, Todor Stefanov, Andy D. Pimentel, Cagkan Erbas, Simon Polstra, and Ed F. Deprettere. A framework for rapid system-level exploration, synthesis, and programming of multimedia MP-SoCs. In Proceedings of the 5th IEEE/ACM International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS '07), pages 9–14, New York, NY, USA, 2007. ACM.

616

R. Corvino et al.

29. Scott Fischaber, Roger Woods, and John McAllister. SoC memory hierarchy derivation from dataflow graphs. Journal of Signal Processing Systems, 60:345–361, 2010.
30. Calin Glitia and Pierre Boulet. High Level Loop Transformations for Systematic Signal Processing Embedded Applications. Research Report RR-6469, INRIA, 2008.
31. S. H. Fuller and L. I. Millett. Computing performance: Game over or next level? Computer, 44(1):31–38, January 2011.
32. Rosilde Corvino. Exploration de l'espace des architectures pour des systèmes de traitement d'image, analyse faite sur des blocs fondamentaux de la rétine numérique. PhD thesis, Université Joseph-Fourier – Grenoble I, France, 2009.
33. Calin Glitia, Philippe Dumont, and Pierre Boulet. Array-OL with delays, a domain specific specification language for multidimensional intensive signal processing. Multidimensional Systems and Signal Processing (Springer Netherlands), 2010.
34. B. C. de Lavarene, D. Alleysson, B. Durette, and J. Herault. Efficient demosaicing through recursive filtering. In IEEE International Conference on Image Processing (ICIP '07), volume 2, October 2007.
35. Jeanny Hérault and Barthélémy Durette. Modeling visual perception for image processing. In Computational and Ambient Intelligence (LNCS, Springer Berlin/Heidelberg), pages 662–675, 2007.
36. Calin Glitia and Pierre Boulet. High level loop transformations for systematic signal processing embedded applications. In Embedded Computer Systems: Architectures, Modeling, and Simulation (Springer), pages 187–196, 2008.
37. Ken Kennedy and Kathryn S. McKinley. Maximizing loop parallelism and improving data locality via loop fusion and distribution. In Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing, pages 301–320, London, UK, 1994. Springer-Verlag.
38. Frank Hannig, Hritam Dutta, and Jürgen Teich. Parallelization approaches for hardware accelerators – loop unrolling versus loop partitioning. In Architecture of Computing Systems – ARCS 2009, pages 16–27, 2009.
39. Jingling Xue. Loop Tiling for Parallelism. Kluwer Academic Publishers, 2000.
40. Preeti Ranjan Panda, Hiroshi Nakamura, Nikil D. Dutt, and Alexandru Nicolau. Augmenting loop tiling with data alignment for improved cache performance. IEEE Transactions on Computers, 48:142–149, 1999.
41. Lushan Liu, Pradeep Nagaraj, Shambhu Upadhyaya, and Ramalingam Sridhar. Defect analysis and defect tolerant design of multi-port SRAMs. Journal of Electronic Testing, 24(1–3):165–179, 2008.
42. Robert Schreiber, Shail Aditya, Scott Mahlke, Vinod Kathail, B. Rau, Darren Cronquist, and Mukund Sivaraman. PICO-NPA: High-level synthesis of nonprogrammable hardware accelerators. The Journal of VLSI Signal Processing, 31(2):127–142, June 2002.
43. G. C. Imondi, M. Zenzo, and M. A. Fazio. Pipelined burst memory access. US patent, August 2008.
44. Nawaaz Ahmed, Nikolay Mateev, and Keshav Pingali. Synthesizing transformations for locality enhancement of imperfectly-nested loop nests. International Journal of Parallel Programming, 29(5):493–544, October 2001.
45. Talal Rahwan, Sarvapali Ramchurn, Nicholas Jennings, and Andrea Giovannucci. An anytime algorithm for optimal coalition structure generation. Journal of Artificial Intelligence Research (JAIR), 34:521–567, April 2009.
46. Abdoulaye Gamatié, Sébastien Le Beux, Éric Piel, Rabie Ben Atitallah, Anne Etien, Philippe Marquet, and Jean-Luc Dekeyser. A model driven design framework for massively parallel embedded systems. ACM Transactions on Embedded Computing Systems (TECS) (to appear); preliminary version at http://hal.inria.fr/inria-00311115/, 2010.
