Application and Architecture Modeling for Parallel Execution of Jacobi-type Algorithms

Ed F. Deprettere and Edwin Rijpkema
Delft University of Technology, 2628 CD Delft, The Netherlands
[email protected]

Abstract

In wireless communications of tomorrow, high-resolution space-time filtering techniques will play a crucial role. To make these techniques broadly available, low-power, low-cost embedded processors must be designed for their efficient implementation. As many of these space-time filters implement routines that are built on robust Jacobi-type algorithms, it makes sense to design a processor with a good performance-cost ratio over a subset of the Jacobi-type algorithms. This paper summarizes the steps in a design procedure that takes as its input executable specifications in Matlab of a subset of Jacobi-type algorithms and outputs one or more candidate processors for efficient execution of these algorithms. The procedure relies heavily on models of computation and models of realization to specify algorithm and architecture instances, whose performance and cost metrics are evaluated through extensive simulation. Details of the various steps can be found in the list of references.
Bart Kienhuis
University of California, Berkeley, CA 94720, USA
[email protected]

1. Introduction

In wireless communications of tomorrow, high-resolution space-time filtering techniques will play a crucial role. These techniques will also be used extensively in future advanced imaging applications, such as underwater acoustics and radio astronomy. What these techniques have in common is that they rely on antenna/sensor array signal processing procedures, which aim at the detection, recovery, separation or suppression of signals in space and time. In wireless communications, for example, multiple antennas can be used to jointly estimate the directions and frequencies of narrow-band signal sources, to increase the capacity and reliability of channels, or to ((semi-)blindly) estimate/equalize channels. The number of antennas in this type of application is typically on the order of 10. For an underwater acoustic camera application, the number of sensors will be one or two orders of magnitude higher, and for a future distributed radio telescope it may be as high as 10^6 (see http://www.nfra.nl/).
1.1. Jacobi-Type Algorithms

In the last decade, a whole variety of array processing methods have been developed and evaluated on artificial as well as recorded data. Many of these methods are stochastic or deterministic signal processing procedures that rely on subspace projection techniques to separate desired signals from signals that are not of interest or are interfering [12]. These high-resolution beam-forming techniques have a highly appealing property in common: they can be implemented in the form of one or more Jacobi-type algorithms, which are known to be very robust. This robustness can be exploited heavily in the implementation step: the low sensitivity of these algorithms to parameter perturbations allows approximate computing in special arithmetic with a reduced number of bits. Moreover, 'fast versions', i.e., reduced-complexity versions for space-time adaptive applications, can be readily derived from the non-adaptive counterparts. However, the beam-forming algorithms we are referring to here are at least one order of magnitude more complex than the more traditional ones. In addition, the applications they are intended for require that they be implemented in (ultra) low-cost and (ultra) low-power embedded processors. For ease of reference, we will refer to these processors as Jacobi processors: they are low-cost, low-power processors that are highly efficient in executing Jacobi-type algorithms, in particular as they appear in real-time adaptive array signal processing applications. Although a Jacobi processor need not be programmable to the same extent as, e.g., media processors, it cannot be a dedicated processor either, as it must support the execution of a variety of Jacobi algorithms, including matrix QR decomposition, eigenvalue and singular value decomposition, and extensions of these. The question that arises now is: how to design a Jacobi processor with the required high performance and low cost?
Jacobi processors that satisfy these constraints are points in a design space that is, however, far too large to explore exhaustively. There are too many design choices involved for them to be explored systematically in a reasonable amount of time. Moreover, the dimensions of the search space will keep growing with the increasing complexity and number of imposed constraints. To overcome this problem, we therefore have to reduce the dimension of the search space by pre-selecting an architecture class and providing a specification of a member of that class. It is advantageous to give the specification at a high level, i.e., close to the application's executable specification, because performance improvement and cost reduction are affected more by abstract-level than by detailed-level parameters. Thus, high-level specification and exploration of algorithms and architectures is faster, more substantial, and more effective in terms of the performance-cost ratio than exploration at lower levels [5].
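To make the computational kernel concrete: Jacobi-type algorithms are built from plane (Givens) rotations that successively annihilate off-diagonal matrix entries. The following plain-Python sketch of the classical cyclic Jacobi eigenvalue method is an illustration only; the function names and sweep budget are our own, and the paper's targets are the adaptive array-processing variants of such kernels.

```python
import math

def jacobi_rotation(app, apq, aqq):
    """Return (c, s) of the plane rotation that annihilates the
    off-diagonal entry apq of a symmetric 2x2 block [[app, apq], [apq, aqq]]."""
    if apq == 0.0:
        return 1.0, 0.0
    theta = 0.5 * math.atan2(2.0 * apq, aqq - app)
    return math.cos(theta), math.sin(theta)

def rotate(A, p, q):
    """Apply the similarity transform J^T A J in place on rows/columns p, q."""
    c, s = jacobi_rotation(A[p][p], A[p][q], A[q][q])
    n = len(A)
    for k in range(n):                       # column update: A <- A J
        akp, akq = A[k][p], A[k][q]
        A[k][p] = c * akp - s * akq
        A[k][q] = s * akp + c * akq
    for k in range(n):                       # row update: A <- J^T A
        apk, aqk = A[p][k], A[q][k]
        A[p][k] = c * apk - s * aqk
        A[q][k] = s * apk + c * aqk

def jacobi_sweeps(A, sweeps=15):
    """Cyclic sweeps over all (p, q) pairs; the diagonal of the
    symmetric matrix A converges to its eigenvalues."""
    n = len(A)
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                rotate(A, p, q)
    return [A[i][i] for i in range(n)]
```

Because every step is an orthogonal similarity transform, the trace and determinant are preserved, which is also why the method is so robust under perturbed or low-precision arithmetic.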
1.2. Dataflow Paradigm

Pre-selecting an architecture class and a high-level specification of one of its members is equivalent to providing a platform, i.e., a parameterized processor template in which the number of parameters to be decided upon, through exploration of the reduced design space, is tractable. For example, if the target architecture is a programmable DSP processor, then the Jacobi-algorithm-specific instruction set is one of the parameters to be explored. Of course, a flexible (multiple-)DSP software platform may violate performance and cost requirements. Similarly, if the target architecture is a dedicated VLSI chip, then the Jacobi-algorithm-specific component set is one of the parameters to be explored. Again, a high-performance (multi-)chip hardware platform may violate efficiency and flexibility constraints. In this paper we select neither the software nor the hardware extreme, but a programmable dataflow processor platform that is more appropriate for the cost-efficient execution of Jacobi-type algorithms [9]. The selection of a dataflow processor template results from an analysis of the application domain, which in our case is array signal processing and, more specifically, array signal processing dominated by Jacobi-type algorithms. Indeed, Jacobi-type algorithms have a natural underlying parallel dataflow structure, though they are commonly specified in non-parallel imperative nested-loop form. Deriving the corresponding dataflow specification has been one of our concerns.

2. Problem Statement and Design Approach

So far, we have selected an application domain, array signal processing, and, more specifically, a subset of Jacobi-type algorithms which play a dominant role in this domain. From an analysis of this domain, that is, of this subset of algorithms, we have chosen to select a dataflow processor template, i.e., a high-level parameterized architecture of which the values of the parameters are as yet undecided. A high-level view of the selected template is shown in Figure 1. It consists of a number of processing elements (PEs) that are configured in a communication network of some type. One of these PEs is a simple, low-power programmable core (the processor controller); the others are weakly programmable and/or configurable signal processing elements that are tailored specifically to the fast execution of large amounts of typical Jacobi operations. As shown in the inset in Figure 1, each PE is equipped with a local controller, local memory, a router, and a computational node. Communication and buffering appear global in the figure, yet it is obvious that both must be as local as possible. The PEs may be satellites to the global controller, as, e.g., in [1], or they may be connected to a 1D or 2D switch network as in [8]. The problem now is to determine values for the template's parameters, i.e., to derive one or more specific processors, Jacobi processors, that satisfy the performance-cost ratio requirements. To derive the Jacobi processor, we use the so-called Y-chart approach [6].
Figure 1. Jacobi Processor Template (PEs with buffers on a communication network, together with a source, a sink, and a global controller; inset: local controller, router, memory, and computational node).
Figure 2. The Y-chart approach (Jacobi Applications and an Architecture Template enter the Mapper; the Retargetable Simulator produces Performance Numbers).
2.1. Y-chart Approach

The Y-chart approach is conceptually as shown in Figure 2. In this figure, Applications stands for the subset of Jacobi-type algorithms. We assume they are given in the form of some executable dataflow specification (see Section 3). By selecting particular values in the value ranges of the parameters in the template, we effectively define architecture instances of the template. Applications are then mapped onto an architecture instance, and the performance is measured by evaluating performance metrics through simulation.

2.2. Design Space Exploration

By relying on some form of design space exploration, one can establish the relation between the metrics and the parameter values, and hence select parameter values that lead to processors that satisfy the given requirements. A naive way to do so is to repeat the procedure of the previous subsection for sufficiently many values in the value ranges of the parameters. Each experiment gives a point in the performance-cost plane, and all experiments together will reveal the hyperbolic boundary on which, or close to which, the candidate architectures are located [2]. However, both the mapping and simulation steps require that both architecture and application specifications be available.

2.3. Modeling of Architecture and Applications

From the exploration point of view, the specification of the architecture should be flexible enough to allow alternatives (instances) onto which we can map the applications. To achieve this goal, we rely heavily on models. For the applications, we use models of computation (MoC); for the architectures, we use models of realization (MoR). An intuitive and appealing approach is to separate both models and to establish a relationship between them for the purpose of fast simulation, performance analysis, and design space exploration. Two different approaches currently exist. In [10] the separated MoC and MoR are connected (for simulation) through (symbolic) instruction traces between application processes and architecture processors. The MoR consists of interconnected realization modules taken from a building-block library; the MoC is Process Networks. That approach uses two different simulation engines, one for the MoC and one for the MoR, and aims at making the evaluation of different classes of architectures (heterogeneous architectures) easier. In [5], only one class of architectures is considered (homogeneous architectures) and design choices are very much constrained in terms of parameters. The MoC is also Process Networks, but with more structure and exhibiting a fire-and-exit behavior; its processes are referred to as SBF objects (see Section 3). That approach uses only one simulation engine for both the MoR and the MoC, and the relationship between MoC and MoR is established through a relation between a generic model of the template's PE and the structure of the processes in the MoC. For the case of the Jacobi processor, we take the view described in [5]. The relation between the MoR and the MoC is as shown in Figure 3; more about this in Section 3.

Figure 3. Relationship between MoC and MoR (an SBF object with state, read and write ports, functions f1 ... f|F|, and a controller, related to a PE with router, memory, computational node, and a local controller issuing enable signals).
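The naive exploration of Section 2.2 amounts to sweeping the template parameters, recording each instance as a point in the performance-cost plane, and keeping the non-dominated points that form the boundary. A small sketch follows; the parameter names and metric models are invented placeholders, since a real flow obtains these numbers from the retargetable simulator.

```python
import itertools

def pareto_front(points):
    """Keep the non-dominated points. A point dominates another if it is
    no worse in both metrics; metrics are (cost, time), lower is better."""
    return [p for p in points
            if not any(q[0] <= p[0] and q[1] <= p[1] and q != p
                       for q in points)]

# Hypothetical template parameters (placeholders, not the paper's).
num_pes      = [2, 4, 8]
buffer_depth = [4, 16, 64]

def cost(pes, depth):        # placeholder area/power proxy
    return pes * 10 + depth * 0.5

def exec_time(pes, depth):   # placeholder latency proxy
    return 1000.0 / pes + 100.0 / depth

# One "experiment" per parameter combination: a point in the plane.
points = [(cost(p, d), exec_time(p, d))
          for p, d in itertools.product(num_pes, buffer_depth)]
front = pareto_front(points)
```

The points on `front` are the candidate architectures near the boundary referred to in [2]; everything else is dominated and can be discarded without simulation of further refinements.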
2.4. Process Networks

In both modeling approaches presented above, an inherently parallel model of computation is chosen to represent applications, in particular Kahn Process Networks [4, 7]. Such a description consists of a network of processes that are interconnected by channels. A channel is an unbounded FIFO queue that can contain an infinite sequence of tokens, i.e., a stream. Processes can write to a channel unconditionally (non-blocking write), but can only read from a channel when its queue is non-empty (blocking read). Characteristic of this model of computation is that it describes parallelism naturally, from the very fine-grained to the very coarse-grained, in a deterministic manner. Recall that we have assumed that the Jacobi applications are written in the form of some executable dataflow specification. However, the specification will most likely be given in an imperative MoC like C or Matlab, which is not the assumed form. Thus, we need a compiler to turn the imperative MoC into a dataflow MoC. For general-purpose processors, compilers exist that can extract fine-grained, instruction-level parallelism from application descriptions written in C or Matlab. They lack, however, the ability to exploit the coarse-grained parallelism offered by the coprocessors in our architecture template (see Figure 1). Therefore, we had to come up with a compiler that extracts the available parallelism from an application described as an NLP in Matlab and automatically converts it into a process network description. We have described this compiler in detail in [11]. With the developed compiler we extended the Y-chart environment shown in Figure 2 to obtain the environment shown in Figure 4. In the remainder of the paper we focus on the upper right corner of Figure 4.
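A Kahn process network of the kind just described can be sketched in a few lines. This is an illustrative toy, not the paper's compiler output; the process names (producer, scale, consumer) and the end-of-stream sentinel are our own conventions. Channels are FIFO queues, reads block, writes do not, and the result is deterministic regardless of thread scheduling.

```python
import threading
import queue

def producer(out_ch, n):
    """Writes a finite stream of tokens, then an end-of-stream marker."""
    for i in range(n):
        out_ch.put(i)            # non-blocking write (unbounded queue)
    out_ch.put(None)

def scale(in_ch, out_ch, k):
    """Reads a token (blocking), applies a function, writes the result."""
    while True:
        token = in_ch.get()      # blocking read: waits if queue is empty
        if token is None:
            out_ch.put(None)
            return
        out_ch.put(k * token)

def consumer(in_ch, results):
    while True:
        token = in_ch.get()
        if token is None:
            return
        results.append(token)

c1, c2 = queue.Queue(), queue.Queue()   # unbounded FIFO channels
results = []
threads = [threading.Thread(target=producer, args=(c1, 5)),
           threading.Thread(target=scale, args=(c1, c2, 3)),
           threading.Thread(target=consumer, args=(c2, results))]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Whatever interleaving the scheduler chooses, `results` is always the same stream, which is exactly the determinism property that makes this MoC attractive for exploration.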
3. SBF Model and NLP-to-SBF Compiler

In this section we briefly review the Stream-based Function model (SBF) as well as the compiler that converts imperative nested loop program (NLP) specifications to stream-based dataflow specifications.

Figure 4. Extended Y-chart (Applications in Matlab pass through algorithmic transformations and the compiler to yield Networks of SBF Objects; together with the Architecture Template and a Library of SBF Objects these are mapped and simulated in the retargetable simulator SBFsim to produce Performance Numbers, with validation at each stage).

3.1. Stream-based Function Model

To simulate an architecture instance that is executing an application on a single execution engine, it is necessary to go back and forth between architecture simulation time and application simulation time. As a consequence, simulation control has to be switched from the architecture to the application and vice versa. This implies that the processes in the application model must have a fire-and-exit behavior, i.e., at each firing a particular quantum of computation is performed and control is given back to the simulation engine to continue the simulation of the architecture. The processes in the SBF model do precisely that. In the SBF model, a process is given in terms of a controller, a state, and a set of functions, as shown in Figure 5.

Figure 5. An SBF object (read ports feed a repertoire of functions $f_1, \ldots, f_{|F|}$ that feed write ports; a controller issues enable signals, and the object carries a state).

The set of functions defines the function repertoire $F = \{f_1, f_2, \ldots, f_{|F|}\}$. The state of the object consists of a control state $c$ and a data state $d$. The controller operates on the control state and defines two functions, the transition function $\omega$ and the binding function $\mu$:

$$\omega : C \rightarrow C, \qquad \mu : C \rightarrow F, \qquad\qquad (1)$$

where $C$ is the space of all possible values of $c$. The transition function $\omega$ determines the new state $s'$ from the current state $s$. The binding function $\mu$ determines which function has to be enabled for the current state $s$; exactly one function is associated with each state. Enabling a function is called a firing. When a process executes, a sequence of firings occurs, as given in equation (2):

$$f_{\mathit{init}} \;\stackrel{\mu(\omega(c))}{\hookrightarrow}\; f_a \;\stackrel{\mu(\omega(c))}{\hookrightarrow}\; f_b \;\stackrel{\mu(\omega(c))}{\hookrightarrow}\; \cdots \; f_x \;\stackrel{\mu(\omega(c))}{\hookrightarrow}\; \cdots \qquad\qquad (2)$$

In equation (2) the functions describe the quantum of computation, and the arrows indicate precisely the moments at which a process relinquishes control to the simulation of the architecture. Observing Figure 3, notice that the structure of an SBF object has a direct relation to the PE of the Jacobi processor: the state, the function repertoire, and the controller of an SBF object correspond to the local memory, the instruction set of the computational node, and the local controller, respectively. This makes an SBF object a high-level model of a PE in the Jacobi processor.

3.2. From NLPs to SBFs

To transform an imperative MoC such as a Matlab specification into a specification that uses the SBF model, we divide the transformation into three steps, as shown in Figure 6.

Figure 6. The three-step approach used to transform a Matlab specification into a specification that uses the SBF model (Step 1: Matlab is converted by HiPars into single assignment code; Step 2: DgParser produces a polyhedral reduced dependence graph; Step 3: SBF network generation and SBF objects generation produce the SBF network and SBF objects). In this figure, a box represents a result and an ellipse represents an action.

Given a Matlab specification, the first step is to convert the specification into single assignment code (SAC) using HiPars. The SAC makes all parallelism available in the original Matlab specification explicit. The second step is to convert the SAC description into a polyhedral reduced dependence graph (PRDG) [3] specification using DgParser. The PRDG is a mathematical description of the available parallelism, making further manipulation with linear algebra tools possible. The third step is to go from the PRDG to a description in terms of the SBF model. This implies that a network is derived from the PRDG together with the available individual SBF objects. The above three steps, in particular the third one, are outlined in more detail in [11]. For more information about the tools, we refer to http://cas.et.tudelft.nl/research/jacobium/ and http://www.gigascale.org/compaan.
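The SBF object of Section 3.1 can be rendered as a small class. This is a hypothetical sketch, with class and function names of our own choosing, that shows the fire-and-exit behavior: each call to fire() executes exactly one function of the repertoire, selected by the binding function μ, advances the control state with the transition function ω, and then returns control, as a single-engine simulator requires.

```python
class SBFObject:
    """Minimal sketch of an SBF object: a function repertoire F,
    a transition function omega (C -> C), a binding function mu
    (C -> F), and a control state c."""

    def __init__(self, repertoire, omega, mu, c0):
        self.repertoire = repertoire   # F = {f1, ..., f|F|}
        self.omega = omega             # omega : C -> C
        self.mu = mu                   # mu : C -> F (one function per state)
        self.c = c0                    # control state

    def fire(self, token):
        """One firing: bind a function to the current control state,
        execute its quantum of computation, advance the state, and exit."""
        f = self.repertoire[self.mu(self.c)]
        result = f(token)
        self.c = self.omega(self.c)
        return result

# Example repertoire: alternate between two functions.
obj = SBFObject(
    repertoire={"inc": lambda x: x + 1, "dbl": lambda x: 2 * x},
    omega=lambda c: (c + 1) % 2,       # cycle through two control states
    mu=lambda c: "inc" if c == 0 else "dbl",
    c0=0,
)
trace = [obj.fire(x) for x in [1, 2, 3, 4]]   # firings: inc, dbl, inc, dbl
```

Each fire() is one arrow of equation (2): between two firings, control is back at the simulation engine, which can advance architecture time before triggering the next firing.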
4. Conclusion

When designing a single-chip processor that must satisfy stringent performance and cost requirements over a set of applications, one can rely on an architect's experience and go through an extensive and costly sequence of simulations to validate the expected performance-cost metrics. However, the more stringent the requirements, the greater the chance of failure. Design space exploration is a valuable, indeed powerful, alternative that may reveal unexpected and unprecedented candidates. In this paper, we have advocated an approach in which exploration is undertaken at a high level of abstraction. The argument is that high-level performance evaluation is faster, hence covers a larger part of the design space, and, in addition, yields orders-of-magnitude improvements in performance and reductions in cost. For an effective and efficient exploration of the design space, it is mandatory that the applications and the architecture be specified in terms of models of computation and models of realization, respectively, and that both models be interrelated in a way that is convenient for simulation and exploration. In this paper, the application domain has been rather narrow and the architecture rather dedicated, up to parameter value ranges, which is a consequence of relating the application modeling and the architecture modeling in a single simulation engine. We have summarized a design procedure for a specific application domain, the domain of adaptive array signal processing dominated by Jacobi-type algorithms. The target processor has been a domain-specific Jacobi processor whose structure has been pre-defined as the result of a domain analysis. We have related the processor's structure, in particular the structure of its PEs, to a particular process-network model of computation, and we have proposed a compiler to convert imperative model-of-computation specifications into the stream-based parallel dataflow model of computation. This greatly facilitates the mapping and execution of applications onto architecture instances.
5. Acknowledgement

The authors wish to thank P. Lieverse, P. van der Wolf, and K. Vissers for stimulating discussions and valuable suggestions in the course of the work that led to this paper.
References

[1] A. Abnous and J. Rabaey. Ultra-low-power domain-specific multimedia processors. In VLSI Signal Processing, IX, pages 461–470, 1996.
[2] G. Hekstra, G. La Hei, P. Bingley, and F. Sijstermans. TriMedia CPU64 design space exploration. In International Conference on Computer Design, Austin, Texas, Oct. 10–13 1999.
[3] P. Held. Functional Design of Dataflow Networks. PhD thesis, Delft University of Technology, May 1996.
[4] G. Kahn. The semantics of a simple language for parallel programming. In Proc. of the IFIP Congress 74. North-Holland Publishing Co., Aug. 5–10 1974.
[5] B. Kienhuis. Design Space Exploration of Stream-based Dataflow Architectures: Methods and Tools. PhD thesis, Delft University of Technology, The Netherlands, Jan. 1999.
[6] B. Kienhuis, E. Deprettere, K. Vissers, and P. van der Wolf. An approach for quantitative analysis of application-specific dataflow architectures. In Proc. 11th Int. Conf. on Application-specific Systems, Architectures and Processors, Zurich, Switzerland, July 14–16 1997.
[7] E. A. Lee and T. M. Parks. Dataflow process networks. Proceedings of the IEEE, 83(5):773–799, May 1995.
[8] J. A. Leijten, J. L. van Meerbergen, A. H. Timmer, and J. A. Jess. Prophid, a data-driven multi-processor architecture for high-performance DSP. In Proc. ED&TC, Mar. 17–20 1997.
[9] P. Lieverse, E. Deprettere, B. Kienhuis, and E. de Kock. A clustering approach to explore grain-sizes in the definition of weakly programmable processing elements. In 1997 IEEE Workshop on Signal Processing Systems: Design and Implementation, pages 107–120, De Montfort University, Leicester, UK, Nov. 3–5 1997.
[10] P. Lieverse, P. van der Wolf, E. Deprettere, and K. Vissers. A methodology for architecture exploration of heterogeneous signal processing systems. In IEEE Workshop on Signal Processing Systems (SiPS'99), Taipei, Taiwan, Oct. 20–22 1999.
[11] E. Rijpkema, B. Kienhuis, and E. F. Deprettere. Compilation from Matlab to process networks. Presented at the Second International Workshop on Compiler and Architecture Support for Embedded Systems (CASES'99), Oct. 1999.
[12] A. van der Veen. Algebraic methods for deterministic blind beamforming. Proceedings of the IEEE, 86(10):1987–2008, Oct. 1998.