The Evolution of the PAPRICA System

A. Broggi, G. Conte (Dip. di Ingegneria dell'Informazione, Università di Parma, Italy)
F. Gregoretti, C. Sansoe (Dip. di Elettronica, Politecnico di Torino, Italy)
L. M. Reyneri (Dip. di Ingegneria dell'Informazione, Università di Pisa, Italy)
Abstract
The PAPRICA project started in 1988 as an experimental VLSI architecture devoted to the efficient computation of data with a two-dimensional structure. The main goal of the project is to develop a subsystem that can operate as an attached processing unit to a standard workstation and, in perspective, as a specialized processing module in dedicated systems devoted to low-level image analysis, cellular neural network emulation, and DRC algorithms. The architecture has been extensively used for basic low-level image analysis tasks up to optical flow computation and feature tracking, showing encouraging performance even in the first prototype version. The paper discusses the current implementation and presents a critical analysis of the project, identifying some crucial points of the PAPRICA design (and of array processors in general) that must be carefully considered in the case of a redesign.
Introduction

The PAPRICA (PArallel PRocessor for Image Checking and Analysis) project (Gregoretti, Reyneri, Sansoe, & Rigazio, 1992; Conte, Gregoretti, Reyneri, & Sansoe, 1991) was started in late 1988 in order to evaluate the application of massively parallel architectures (NCR Corporation, 1984; Fountain, 1987; Reddaway, 1973) and of mathematical morphology (Serra, 1982) to a number of tasks related to the verification of integrated circuits at the layout level. As a second step it was decided to build a prototype of an architecture with a bidimensional interprocessor connection as a coprocessor of a standard workstation. The main goal was to keep the system powerful, but simple enough to allow a low-cost implementation which could follow the technological evolution. The prototype, named PAPRICA-1, had an instruction set oriented to the above-mentioned applications and underwent extensive testing in 1992. Several other application fields were tested, among which the emulation of cellular neural networks, the verification of integrated circuits, and vision systems for vehicle driving assistance. In the frame of the Prometheus EUREKA Project a demonstrator project in the automotive field was started in 1991, but the real-time, size, and power requirements of the application were not compatible with the performance of the first prototype, and a second version of the system and of its components was designed. It has 3 times the
performance of the previous one and 1/4 of the size; it is currently used at the two academic sites involved in the project and installed on the experimental vehicle used as a demonstrator workbench. At the end of the PAPRICA-1 implementation a major critical review of the project was undertaken in order to evaluate the performance and pinpoint the bottlenecks of the architecture. This led first to the proposal of a PAPRICA-2 system (Gregoretti, Reyneri, Sansoe, Broggi, & Conte, 1993) which extended the memory addressing space while maintaining an instruction set and a memory organization compatible with the previous version. This choice, however, implied several problems in the physical implementation of the system. A second solution came with the PAPRICA-3 proposal, in which both the instruction set and the interconnection structure were changed. In particular, the instruction set is more RISC oriented and the topology has become one-dimensional instead of two-dimensional. First the mathematical morphology computational paradigm which has been the base of the project is presented, together with the external architecture of the PAPRICA-1 prototype. The following sections describe the PAPRICA-1 implementation and analyze its performance as a function of the hardware and of the application characteristics. The implementation of a widely known benchmark for parallel architectures is described, and the performance of the system is compared to that of a number of machines. In addition a sample application is presented. The final sections are devoted to the critical review of the project and to the presentation of the current evolution of the architecture, its characteristics, the foreseen performance and the implementation status.
The computational paradigm: mathematical morphology

The computational model which has inspired the design of the PAPRICA family derives from the work of Serra (Serra, 1982), who introduced the concept of mathematical morphology as a theoretical instrument to deal with bit-mapped images, although that concept has since been expanded for use with hierarchical architectures such as pyramids. Mathematical morphology (Serra, 1982; Haralick, Sternberg, & Zhuang, 1987) provides a bitmap approach to the processing of discrete images, which is derived from set theory. In fact, images are defined as collections of pixels P, which are N-tuples of integers:

P = (P_1, P_2, …, P_i, …, P_N),   (1)

where P_i ∈ Z is the i-th coordinate, while Z is the set of integers. The universe set of pixels is the discrete Euclidean N-space Z^N. In practice, images and similar types of data are always "size-limited", especially when they are stored in a computer memory. Therefore the concept of N-dimensional frame F_N ⊆ Z^N is introduced, defined as a convex, size-limited, "rectangular" subset of Z^N:

F_N = { P ∈ Z^N | F_i,min ≤ P_i ≤ F_i,max, for every i ∈ [1…N] },   (2)
where F_i,min and F_i,max are the bounds of the i-th coordinate. The two pixels F_min = (F_1,min, F_2,min, …, F_N,min) and F_max = (F_1,max, F_2,max, …, F_N,max) can be viewed as the "corner" coordinates of F_N, while the value f_i = (F_i,max − F_i,min + 1) is the frame size along the i-th coordinate. A framed image I is then defined as a subset of F_N:

I = { P | P ∈ F_N } ⊆ F_N,   (3)
where an element present/absent in I can correspond either to "1/0" or to "white/black" (or vice versa) pixels. Consequently the binary morphological operators of dilation ⊕ and erosion ⊖, and the unary operators of complement (·)^c, translation (·)_x and transposition (·)ˇ in Z^N, introduced in (Haralick et al., 1987), have been adapted to framed images. If A, B ⊆ F_N and x ∈ F_N are respectively two framed images and a pixel, the following definitions apply:
A ⊕ B = { y ∈ F_N | y = (a + b), for some a ∈ A and b ∈ B }   (4)
A ⊖ B = { y ∈ F_N | (y + b) ∈ A, for every b ∈ B }   (5)
A^c = { y ∈ F_N | y ∉ A }   (6)
A_x = { y ∈ F_N | y = (a + x), for some a ∈ A }   (7)
Ǎ = { y ∈ F_N | y = −a, for some a ∈ A }   (8)
Furthermore F_N is defined as the universe set of images over the frame F_N (i.e. the set of all framed images).
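The definitions (4)-(8) are easy to experiment with on small binary images. The following Python sketch is purely illustrative (it is not PAPRICA code; the representation of a framed image as a set of 2-D pixel tuples and all function names are ours):

```python
# Framed-image operators of equations (4)-(8); images are Python sets of
# 2-D pixel tuples, and the frame F_N is passed explicitly where needed.

def dilation(A, B):
    """A (+) B: all sums a + b (equation 4)."""
    return {(a[0] + b[0], a[1] + b[1]) for a in A for b in B}

def erosion(A, B):
    """A (-) B: pixels y such that y + b is in A for every b in B (eq. 5)."""
    # Candidate positions are a - b for a in A, b in B; keep those where
    # the whole translated template B fits inside A.
    candidates = dilation(A, transposition(B))
    return {y for y in candidates
            if all((y[0] + b[0], y[1] + b[1]) in A for b in B)}

def complement(A, frame):
    """A^c relative to an explicit frame (equation 6)."""
    return frame - A

def translation(A, x):
    """A_x: shift every pixel by x (equation 7)."""
    return {(a[0] + x[0], a[1] + x[1]) for a in A}

def transposition(A):
    """A-check: negate every pixel (equation 8)."""
    return {(-a[0], -a[1]) for a in A}
```

For instance, eroding the 2×2 square {(0,0), (1,0), (0,1), (1,1)} by the horizontal pair {(0,0), (1,0)} keeps only the left column of the square, as the definition predicts.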
Matching Operators
The computational paradigm of the PAPRICA family is based on the concept of matching operator ⊛, which is derived from the hit-miss transform described in (Serra, 1982). This is a rather general approach and includes the other morphological operators as special cases. Matching operators in PAPRICA are defined as set transforms F_N → F_N, as follows: a simple N-dimensional matching template is a couple Q = (Q_0, Q_1), where Q_0, Q_1 ⊆ F_N, with the constraint that Q_0 ∩ Q_1 = ∅ (i.e. the empty set). An elementary matching with a simple matching template is therefore defined as:

A ⊛ Q = { y ∈ F_N | (y + q_1) ∈ A and (y + q_0) ∉ A, for every q_1 ∈ Q_1, q_0 ∈ Q_0 } ⊆ F_N   (9)
or, in terms of the erosion operator:

A ⊛ Q = (A ⊖ Q_1) ∩ (A^c ⊖ Q_0)   (10)
A complemented matching ⊛^c is also defined, as:

A ⊛^c Q = (A ⊛ Q)^c   (11)
A simple 3×3 bidimensional matching template Q can be sketched as a 3×3 matrix of symbols, where each symbol is either 0, 1 or − if the corresponding pixel of the matching template is an element of Q_0, of Q_1, or of none of the two, respectively. The matrix center coincides with pixel (0, 0) of Q_0 and Q_1. PAPRICA-1 uses a 3×3 template, while PAPRICA-3 uses a larger 5×5 template, as shown in Fig. 9b. A composite matching with a matching list Q_L = {Q^1, …, Q^k, …} is the union of elementary matchings:

A ⊛ Q_L = ∪_k (A ⊛ Q^k)   (12)
A matching list can be sketched as a list of simple matching templates:

Q_L = { T_1 ; T_2^c ; … ; K T_k ; … }   (13)

where each T_i is a 3×3 template sketched as above and K ∈ {2, 4, 8}. The superscript c declares that a complemented matching ⊛^c must be used instead of ⊛ with that specific template, while the numeric constant K is a short form for the list of the K possible rotations of the matching template by 360/K degrees.
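As an illustration of equations (9) and (12), the following sketch (our own construction, not PAPRICA assembly) implements an elementary and a composite matching on binary images stored as sets of pixel tuples. The example template is an isolated-point detector: Q_1 is the center pixel and Q_0 its eight neighbours:

```python
# Elementary matching (equation 9): y is kept iff the image "hits" every
# Q1 offset around y and "misses" every Q0 offset around y.
def elementary_matching(A, Q1, Q0, frame):
    return {y for y in frame
            if all((y[0] + q[0], y[1] + q[1]) in A for q in Q1)
            and all((y[0] + q[0], y[1] + q[1]) not in A for q in Q0)}

# Composite matching (equation 12): union of elementary matchings
# over a list of (Q1, Q0) templates.
def composite_matching(A, templates, frame):
    result = set()
    for Q1, Q0 in templates:
        result |= elementary_matching(A, Q1, Q0, frame)
    return result

# Isolated-point template: center must be set, all 8 neighbours clear.
CENTER = {(0, 0)}
NEIGHBOURS = {(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)} - CENTER
```

Applied to an image containing one isolated pixel and a pair of adjacent pixels, the matching returns only the isolated one, since the adjacent pixels fail the Q_0 ("miss") condition.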
Data Structures and Hierarchies
Mathematical morphology as described above matches primarily computing architectures based on N-dimensional meshes (Duff, 1979; Golay, 1969; Graham & Norgren, 1980; Kruse, 1977) and applies primarily to binary images. It cannot be used directly for other computing architectures having a more complex data organization, such as pyramids, hierarchical systems, multi-layered images, and the PAPRICA system itself. In this section morphological operators are extended to such systems, without however entering too deeply into theory. A data structure S of parallelism C over the frame F_N is defined as a C-tuple of framed images S_j ⊆ F_N, also called layers:

S = (S_1, S_2, …, S_j, …, S_C).   (14)

Data structures can be used for gray-scale images, color and multi-layered images, and numerical matrices. An example of a simple data structure is the memory organization of a computer, where N = 1 (linear organization) and C = 16 or 32 bits is the memory parallelism. Obviously the frame limits are F_1,min = 0 and F_1,max = (MemorySize − 1). Several independent data structures (with different dimensions, sizes, and parallelisms) can be combined together into what has been called a data hierarchy H of deepness D:
H = ( H^1, H^2, …, H^ℓ, …, H^D ).   (15)

This is a D-tuple of possibly different data structures H^ℓ, where the index ℓ ∈ [1…D] identifies the hierarchy level, of parallelism C_ℓ over a frame F_{N_ℓ}.
Data hierarchies H can also be used to describe the "data storage organization" M_L of several computing architectures (e.g. disks, main memory, caches, registers, etc.). Note that the memory organization M_L of a computing system is not sufficient to completely define its interconnection topology. In fact the latter must also be defined in combination with the processor instruction set and the mechanisms devoted to the management of data interchanges inside the system.
PAPRICA-1: its memory organization is a two-level data hierarchy (D = 2). The first level (ℓ = 1) is the bidimensional data structure K of the Processor Array (PA), with dimension N_1 = 2, parallelism C_1 = 64 and frame sizes f_1^1 = f_2^1 = 16. The size of this data structure can be easily extended to larger values by "virtualizing" the processor array, by means of a specific Update Block construct, provided that the program satisfies a simple semantic constraint. This results in a "virtual data structure" K′ with no limitations in size. This reconfigurable data structure is the key feature used to map other computing architectures, as described in the following. The second level of the PAPRICA-1 hierarchy (ℓ = 2) is the Image Memory, which is logically organized as a data structure of dimension N_2 = 2. The parallelism C_2 and both frame sizes f_1^2 and f_2^2 are user-programmable. The product (C_2 · f_1^2 · f_2^2) is limited by the total memory size (4 to 8 MB in the current version).
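For concreteness, the two-level PAPRICA-1 hierarchy just described can be modelled as follows. This is a sketch with our own naming, and the Image Memory figures chosen here are just one legal programming, since C_2 and the frame sizes are user-selectable:

```python
from dataclasses import dataclass

@dataclass
class DataStructure:
    """A data structure of parallelism C over an N-dimensional frame."""
    dims: int          # N, dimension of the frame
    parallelism: int   # C, number of layers
    sizes: tuple       # frame size along each coordinate

    def bits(self):
        """Total number of bits held by the structure."""
        total = self.parallelism
        for s in self.sizes:
            total *= s
        return total

# Level 1: the Processor Array K (N = 2, C = 64, 16 x 16 frame).
pa = DataStructure(dims=2, parallelism=64, sizes=(16, 16))

# Level 2: the Image Memory; e.g. 8 layers over a 512 x 512 frame
# (hypothetical choice; C_2, f_1^2 and f_2^2 are programmable).
im = DataStructure(dims=2, parallelism=8, sizes=(512, 512))

hierarchy = (pa, im)   # a data hierarchy of deepness D = 2
```

The PA holds 64 × 16 × 16 = 16384 bits, while the example Image Memory programming occupies 2 Mbit, well within the 4 to 8 MB limit on C_2 · f_1^2 · f_2^2.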
Extensions to hierarchical morphology
The concept of matching operators can be further extended to data structures and hierarchies. A simple structure template is a couple R = (R_0, R_1), where R_0, R_1 are two structures which satisfy the constraint

R_0j ∩ R_1j = ∅, for every j ∈ [1…C].   (16)

If A is a data structure, an elementary structure matching with a simple structure template R is a set transform given by:

A ⊛ R = { y ∈ F_N | (y + r_1j) ∈ A_j and (y + r_0j) ∉ A_j, for every j ∈ [1…C], r_1j ∈ R_1j, r_0j ∈ R_0j } ⊆ F_N   (17)

while a composite structure matching is entirely defined by a matching structures list R_L = { R^1, …, R^k, … }:

A ⊛ R_L = ∪_k (A ⊛ R^k) ⊆ F_N.   (18)
The above definitions can also be extended to data hierarchies and to hierarchical matching operators (Gregoretti et al., 1993).
Update Blocks and array virtualization
PAPRICA uses a specific construct U called Update Block, which is a tuple of either 16, 32, 48 or 64 composite structure matchings (R_L0, R_L1, …, R_L63), and which operates on the PA K. In practice an Update Block U(·) is a set transform defined as

U(I) = ( I ⊛ R_L0, I ⊛ R_L1, …, I ⊛ R_L63 )   (19)

where I ⊆ K. Each matching list R_Li of the Update Block U is made of a sequence of several simpler operators taken from the elementary instruction set of PAPRICA-1. Since the size of most "real-world" images (often over 10^6 pixels) is much larger than that of a PA of reasonable size (a few thousand PEs at most), it is necessary to introduce a mechanism to "virtualize" the PA for large images. This mechanism, which is called UPDATE, requires the presence of an Image Memory of sufficient size to store the whole image to be processed, and it is based on splitting the input image I into an array of smaller sub-windows I_k such that size(I_k) ≤ size(K), as shown in Fig. 1.
Figure 1: UPDATE mechanism and area of validity

A PAPRICA program is made of a sequence of one or more Update Blocks U(·), which are lists of instructions delimited by specific UPDATE instructions, placed according to a set of simple semantic rules. The controller first loads into the PA the pixels from the first sub-window, then it executes the program up to the next UPDATE instruction, and finally it stores the results back into the Image Memory. These operations are repeated for all the sub-windows until the image is completely scanned, and for all the Update Blocks. The effect of these steps is to apply the list of instructions between two consecutive UPDATEs to all the pixels of the image I, as if they all fitted into a large "virtual" PA. The only limitation is obviously given by the total size of the Image Memory. If an Update Block satisfies the following semantic constraint:
U(I) = U(U(I)), for every I ∈ F_N,   (20)

and the sub-windows are properly overlapped (see further), the result is independent of the size of the PA (not proven here). Because of the presence of "borders" in the frame F_N and in K, it is intuitive that the image resulting from a structure matching may be "undefined" along the borders. The area where results are valid is defined more formally below. A few definitions are necessary to better understand the formalism (see also Fig. 1):
an N-cube Γ_m of half-size m is the image:

Γ_m = { P ∈ F_N | −m ≤ P_i ≤ +m, for every i ∈ [1…N] }   (21)
the margin width M(R_L) of a matching structures list R_L is the half-size of the smallest N-cube which fully "contains" all the list elements:

M(R_L) = min { m ∈ Z | Γ_m ⊇ (R_0j^k ∪ R_1j^k), for every j ∈ [1…C], k ≥ 1 }   (22)
the margin width M(U(·)) of an Update Block U(·) is defined as:

M(U(·)) = max { M(R_Lk), for every k ∈ [1…63] }   (23)
the Limit Area A_L is the N-dimensional frame A_L = F_N in which the original image is defined; the Validity Area A_V of an Update Block U(·) is the N-dimensional frame A_V ⊆ A_L in which the processed image is correct. It is the erosion of the limit area A_L by Γ_M(U(·)) as structuring element:

A_V = A_L ⊖ Γ_M(U(·))   (24)
The margin M(U(·)) is also the amount by which the sub-windows I_k must be overlapped for proper array virtualization.
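The tiling implied by the UPDATE mechanism can be sketched numerically. The sketch below uses assumptions of our own (1-D coordinates, half-open intervals, an image size compatible with the tiling): windows of width Q are stepped by Q − 2M, so that each window shares a margin-wide band with its neighbours and the valid interiors, once the margin M is stripped, abut exactly:

```python
def subwindow_origins(image_size, Q, M):
    """Origins of Q-wide sub-windows, overlapped so that the valid
    interiors (margin M stripped on each side) tile the image."""
    step = Q - 2 * M
    assert step > 0, "margin too large for this PA size"
    return list(range(0, image_size - 2 * M, step))

def valid_interval(origin, Q, M):
    """Half-open [lo, hi) span of a window whose results are valid."""
    return (origin + M, origin + Q - M)
```

With Q = 16 and M = 2 on a 64-pixel row, the origins are 0, 12, 24, 36, 48 and the valid spans [2, 14), [14, 26), …, [50, 62) cover everything except the outermost margin, which is exactly the validity-area reduction of equation (24).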
Mapping Data Structures to PAPRICA Image Memory
So far only the logical organization M_L of data within a processing system has been considered. There is also a physical data organization M_P, which usually coincides with the traditional uni- (or bi- or tri-) dimensional organization of computer memories (i.e. M_X × M_Y × M_Z words by B bits). M_X, M_Y, M_Z and B are respectively the maximum memory addresses along the three coordinates and the memory parallelism (i.e. the number of bits per word). The physical organization of the PAPRICA-1 image memory, as seen from the host computer, has B = 16/32 bits, while both M_X = f_1^1 and M_Y = f_2^1 must be powers of two and M_Z = C_1/16 is an integer number. Logical and physical organizations are related by means of the so-called memory mappings.

A 3-D Memory Mapping O of an N-dimensional frame F_N is a one-to-one mapping F_N → [0…M_X] × [0…M_Y] × [0…M_Z] × [0…B], such that O(P) gives the physical address of pixel P in the computer memory (namely, the word "coordinates" and bit). There is also an Inverse Memory Mapping O^−1, which is obviously a one-to-one mapping [0…M_X] × [0…M_Y] × [0…M_Z] × [0…B] → F_N. By combining different Memory Mappings, Inverse Memory Mappings and Update Blocks together it is possible to emulate several non-mesh computing architectures, such as pyramids.
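A minimal sketch of such a mapping pair follows. This is our own simple layout, not the actual skip-parameter scheme of PAPRICA-1: layer j of pixel (p_x, p_y) is placed at word (x, y, z) = (p_x, p_y, j div B), bit j mod B, which makes the inverse mapping immediate:

```python
def memory_mapping(px, py, layer, B=16):
    """O(P): physical (x, y, z, bit) address of layer `layer` of a pixel,
    for a word parallelism of B bits (hypothetical layout)."""
    return (px, py, layer // B, layer % B)

def inverse_mapping(x, y, z, bit, B=16):
    """O^-1: back from word coordinates and bit to (px, py, layer)."""
    return (x, y, z * B + bit)
```

Because both functions are one-to-one on their domains, composing them is the identity, which is the defining property of the mapping pair.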
External architecture of PAPRICA-1
The instruction set of the PAPRICA family can be partitioned into three different classes, namely:

Elementary instructions, to perform logical and morphological operations; each is executed within a single clock cycle, in parallel, by every PE of the array.

Mapping statements, to implement the virtualization of the "processor array" K.

Flow control statements.

The following is a short description of the instruction set of PAPRICA-1.
Elementary instruction set
Each PAPRICA-1 instruction is a structure matching with a structure template built from the cascade of a graphic operator G(·) and a logic operator Θ. A PAPRICA instruction in assembly format is:

L_D = G(L_S1) Θ L_S2 [%A]   (25)

where L_D, L_S1 and L_S2 ∈ [0…63] are three layer identifiers (Destination, Source1, and Source2). This specific structure matching operates on K_LS1 (i.e. the L_S1-th layer of K) and on K_LS2, and stores the result into K_LD. The graphic operator G is one of the sixteen composite matching operators listed in Table 1, while the logic operator Θ is one of the eight unary, binary or ternary Boolean operators listed in Table 2. The optional switch %A introduces an additional operation

K_0 = K_0 ∪ K_LD   (26)

which is useful in many circumstances (e.g. in CAD tools, to accumulate errors). This last operation is not compatible with the constraint given by formula (20).
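The semantics of equations (25) and (26) can be sketched as follows. This is a toy model of ours, not the PE hardware: layers are sets of pixels, the graphic operator is applied to source layer L_S1, the logic operator combines the result with layer L_S2, and the optional %A switch accumulates into layer 0:

```python
from itertools import product

def execute(K, ld, ls1, ls2, graphic_op, logic_op, acc=False):
    """One elementary instruction: K[ld] = logic_op(graphic_op(K[ls1]), K[ls2]).
    K is a dict mapping layer index -> set of pixels; modified in place."""
    K[ld] = logic_op(graphic_op(K[ls1]), K[ls2])
    if acc:                       # the optional %A switch of equation (26)
        K[0] = K[0] | K[ld]
    return K

def exp_op(A):
    """EXPansion: dilation by the full 3x3 structuring element (cf. Table 1)."""
    return {(x + dx, y + dy) for (x, y) in A
            for dx, dy in product((-1, 0, 1), repeat=2)}

def and_op(a, b):
    """The AND logic operator (cf. Table 2)."""
    return a & b
```

For example, `execute(K, 3, 1, 2, exp_op, and_op, acc=True)` expands layer 1, masks it with layer 2, stores the result in layer 3 and accumulates it into layer 0, mimicking one `L3 = EXP(L1) & L2 %A` instruction.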
Memory mapping statements
PAPRICA-1 defines and uses only simple Memory Mappings, which can be defined by six "skip" parameters. These parameters are modified by means of the SKIP.READ and SKIP.WRITE control instructions. The mapping statements of PAPRICA-1 are listed in Table 3.
Flow control statements
PAPRICA-1 flow control statements are listed in Table 4. During processing, three global flags are computed for each Update Block, set according to the result of the last operator of that block: the SET, RESET and NOCHANGE flags are respectively set iff (for the last operator of that block) K_LD ⊇ A_V, K_LD = ∅, or K_LD = K′_LD, where K′_LD is the value of K_LD before the operator was computed. A PAPRICA Program P is a tuple of hierarchical operators implemented as the cascade of one or more Update Blocks U_r, some control instructions and optionally one or more Mappings O_t, which concur to perform a desired image-processing function. Although it is made of Update Blocks, a PAPRICA Program as a whole usually does not satisfy the semantic constraint of the individual Update Blocks (see formula (20)).
Table 1: List of PAPRICA-1 Graphic Operators G(L_S1)

NAME           DESCRIPTION
NOP(L_S1)      No OPeration
INV(L_S1)      INVersion
NMOV(L_S1)     North MOVe
SMOV(L_S1)     South MOVe
WMOV(L_S1)     West MOVe
EMOV(L_S1)     East MOVe
EXP(L_S1)      EXPansion
VEXP(L_S1)     Vertical EXPansion
HEXP(L_S1)     Horizontal EXPansion
NEEXP(L_S1)    NorthEast EXPansion
ERS(L_S1)      ERoSion
VERS(L_S1)     Vertical ERoSion
HERS(L_S1)     Horizontal ERoSion
NEERS(L_S1)    NorthEast ERoSion
BOR(L_S1)      BORder
LS2(L_S1)      LesS than 2

(The STRUCTURING ELEMENT and EQUIVALENT TO columns of the original table, which sketch each operator's 3×3 template and its morphological expression in terms of K_LS1, are not legible in this copy.)
The PAPRICA-1 implementation

The PAPRICA-1 system operates as a specialized coprocessor attached to a general-purpose host workstation. It comprises three major functional parts, namely:

A Processing Array (PA), hosting a 16×16 square matrix of 1-bit PEs, each one with full 8-neighbour connectivity.

A Program Memory storing up to 256k instructions, and a 3 MB to 7 MB Image Memory; both memories are dual-port and can be accessed directly from the host.

A VLSI-based Control Unit (CU), which drives the virtualization mechanism and controls the program flow.

The relationships between these sub-units are shown in Fig. 2a, while Fig. 3 shows photographs of the processing chip and of the complete PAPRICA-1 system.
Table 2: List of PAPRICA-1 Logic Operators Θ

NAME        DESCRIPTION            EQUIVALENT TO
(omitted)   No operation           K_LD = G(K_LS1)
            NOT                    K_LD = (G(K_LS1))^c
&           AND                    K_LD = G(K_LS1) ∩ K_LS2
&           ANDNOT                 K_LD = G(K_LS1) ∩ (K_LS2)^c
|           OR                     K_LD = G(K_LS1) ∪ K_LS2
|           ORNOT                  K_LD = G(K_LS1) ∪ (K_LS2)^c
^           EXOR                   K_LD = (G(K_LS1) ∩ (K_LS2)^c) ∪ ((G(K_LS1))^c ∩ K_LS2)
+           PLUS (Arithm. Plus)    K_LD = "arithmetic sum" of G(K_LS1) + K_LS2 + K_L0;
                                   K_0 = "carry out" of the sum of G(K_LS1) + K_LS2 + K_L0

Table 3: List of PAPRICA-1 mapping statements

STATEMENT    MEANING
MEMORG       Sets frame sizes f_1^2 and f_2^2 and parallelism C_2 of the second level of the PAPRICA-1 hierarchy (ℓ = 2).
SKIP.READ    Sets 3 of the parameters which define the memory mapping.
SKIP.WRITE   Sets the other 3 parameters which define the memory mapping.
VALIDITY     Sets the validity area for the following Update Blocks.
LIMIT        Sets the limit area A_L for the following Update Blocks.
MARGIN       Sets the margin width for the Update Blocks which follow.
SETDEF       Sets the base address for the based addressing mode.

The core of the system is the PE, which executes in a single cycle one of the elementary instructions described in the previous section. The internal structure of a PE, shown in Fig. 2b, is composed of:

a 64-bit register file, which is used to store one pixel of the active Image Window; the corresponding bits belonging to different PEs form a binary layer, which can be used as source or destination of a PAPRICA instruction;

an execution unit, internally subdivided into a Graphic and a Logic processor, which executes one instruction per cycle, storing the result in the appropriate bit of the register file;

two address decoders, one used when accessing the register file during load or store operations from the Image Memory, the other used to access the source and destination layers of a PAPRICA instruction.

The processor chip (hosting 16 PEs) has been designed using a full-custom methodology and fabricated with a 1.5 μm CMOS technology, with a total complexity of approximately 35,000 transistors and an area of 35 mm². The PE instruction cycle time is 350 ns.
Table 4: List of PAPRICA-1 flow control instructions

STATEMENT                                      MEANING
UPDATE                                         Separates two Update Blocks. All the other mapping and flow control instructions are also implicit updates which separate Update Blocks.
FOR <count> instructions ENDFOR                Repeats the enclosed instructions <count> times.
IF [NOT] <flag> instructions ELSE instructions ENDIF   Executes the first program segment if <flag> is [NOT] set; otherwise executes the second.
REPEAT instructions UNTIL [NOT] <flag>         Repeats the enclosed instructions until <flag> is [NOT] set.
CALL <address>                                 Calls a subroutine at <address>.
RET                                            Returns from subroutine.
JUMP <address>                                 Jumps the Program Counter to <address>.
Figure 2: (a) Block diagram of PAPRICA-1 system; (b) Internal structure of a Processing Element

The PA as a whole may operate in Memory or Processor mode. In Memory mode it behaves like a standard RAM device composed of the 256 individual 64-bit PE registers, for a total of 16k bits organized as 1k × 16-bit words. In Processor mode all PEs execute the same array instruction fetched by the CU from the Program Memory.

The CU drives the virtualization mechanism and controls the program flow. These two tasks are quite different in nature and have been assigned to two sub-units: the Window Manager and the Execution Unit. The Window Manager is responsible for data transfers between the PA and the Image Memory, ensuring that every portion of the image is processed correctly for each Update Block. This unit is particularly complex, because it has to perform all the array virtualization operations previously described. It was designed as a standard-cell chip using 1.5 μm technology and can handle a data transfer cycle of 100 ns.

Figure 3: Photograph of the processing chip and of the complete board (PAPRICA-1)

The Execution Unit was designed using one ALTERA FLEX 81188 (232-pin PGA). The FPGA solution was chosen because it makes it possible to alter the behavior of the unit via software. This allows even a change in the flow control instruction set with a very fast turn-around time. This flexibility is obtained at the price of some speed reduction, but, as will be explained in the following section, it affects the overall performance only marginally.

The Program and Image Memory are dual-port RAMs, giving access to both the host computer and the PAPRICA-1 system. The cycle time in PA to Image Memory transfers is 100 ns/word. The PAPRICA-1 system is built as a 6U slave VME board with dimensions compatible with those of VXI instruments. The host processor has direct access to the Image and Program Memory and can control PAPRICA-1 program execution and video frame acquisition via some control and status registers.

Table 5 reports some performance figures of PAPRICA-1, referring to basic low-level image processing algorithms performed on grey-level (8 bits/pixel) images. Table 5 shows:

The number of clock cycles required to process a single sub-window loaded into the PA. It is a rough index of the amount of computation needed by the algorithm, and it is derived from the evaluation of the Assembly code.

The neighborhood. It indicates the locality of the computation, and it is an index of the amount of I/O needed by the algorithm.

The processing speed. It takes into account the overall performance of the system in the case of a 16×16 PE array.
Table 5: Performance figures of the current implementation of PAPRICA-1

Operations                 Cycles      Neighborhood   Processing Speed
Addition, Subtraction      ≈ 10        1×1            3.84 Mpixel/s
Maximum, Minimum           ≈ 70        1×1            2.04 Mpixel/s
x and y Derivatives        ≈ 80        3×3            2.32 Mpixel/s
Average                    ≈ 90        3×3            2.26 Mpixel/s
Kirsh Gradient             ≈ 450       3×3            1.10 Mpixel/s
Prewitt Gradient           ≈ 450       3×3            1.10 Mpixel/s
3×3 Gradient               ≈ 490       3×3            1.04 Mpixel/s
Binary Thinning            ≈ 50        17×17          0.50 Mpixel/s
Clustering (1 iteration)   ≈ 2000      5×5            0.24 Mpixel/s
Optical Flow               ≈ 845700    7×7            52 kpixel/s

Performance analysis

The processing speed S_pr is a function of system technology and architectural parameters (i.e. the number of Processing Elements Q², the memory cycle time T_M, and the processor cycle time T_C), and of the parameters
of the particular computational task (the program length L, the number n_upd and margin M of the Update Blocks, and the sum G = Σ_{i=1}^{n_upd} M(U_i) of the margin values of the Update Blocks). It is defined as the number of pixels over the total time needed for processing:

S_pr = (Number of pixels processed) / (Processing time).   (27)

If the application does not involve neighborhood-based operations (i.e. the program is made of just one Update Block), the time required to process a single sub-window is given by the sum of three components: first, the time Q² T_M required to load the array from the Image Memory; second, the time L T_C required to execute the L instructions (n_upd = 1); third, another time Q² T_M to store back the results. Since a sub-window is formed by Q² elements, S_pr is then given by:

S_pr = Q² / (2 Q² T_M + L T_C).   (28)

Since T_C and T_M are of the same order of magnitude, if L ≪ 2Q² (for example when a large PA and a fairly simple application are considered), S_pr reduces to S_pr ≈ 1/(2 T_M), showing that the computational time is bound by the I/O time. On the other hand, if L ≫ 2Q² (for example when complex computations, such as number crunching, are performed on small-sized PAs), then equation (28) reduces to S_pr ≈ Q² / (L T_C), showing that for processing-bound problems the speedup is linear with the number of processors Q². As an example, a complex computation (floating-point multiplication) consisting of about 2000 Assembly instructions (L ≈ 2000) has been run on architectures with different PA sizes, thanks to a software simulator. Fig. 4 represents the processing speed as a function of Q²: it can be noticed that, according to the previous assessments, S_pr vs Q² has a saturation behavior (at Q² ≈ L/2): when Q² ≪ L/2 the speedup is linear with Q², while when Q² ≫ L/2, S_pr tends asymptotically to its maximum value.
Q²       S_pr [kFLOP/s]
16       22
36       48
64       84
144      176
256      286
576      514
1024     714
4096     1142
16384    1344
65536    1406

Figure 4: Performance on architectures with different PA sizes, considering the case in which T_M = T_C = 350 ns
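Equation (28) is easy to evaluate numerically. The sketch below uses T_M = T_C = 350 ns and L = 2000, the figures of the floating-point example, and reproduces the saturation towards 1/(2 T_M) ≈ 1.43 Mpixel/s, with values that closely track those listed in Fig. 4:

```python
# Processing-speed model of equation (28): S_pr = Q^2 / (2 Q^2 T_M + L T_C).
T_M = T_C = 350e-9   # memory and processor cycle times, seconds
L = 2000             # program length in elementary instructions

def s_pr(q2):
    """Pixels processed per second on a PA with q2 processing elements."""
    return q2 / (2 * q2 * T_M + L * T_C)

for q2 in (16, 256, 4096, 65536):
    print(f"Q^2 = {q2:6d}: {s_pr(q2) / 1e3:7.1f} kpixel/s")
```

Doubling Q² beyond L/2 = 1000 barely improves the speed, which is the I/O-bound regime discussed above; below it, the speedup is essentially linear in Q².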
Moreover, considering also the Validity Area reduction and assuming that all the margin values of the Update Blocks are almost the same, after a few approximations (Broggi, 1994) S_pr becomes

S_pr ≈ (Q − 2G/n_upd)² / (2 Q² T_M n_upd + L T_C),   (29)

showing its dependence on the various parameters. n_upd is the only parameter that can be tuned with the aim of maximizing S_pr: assuming 2Q²n_upd ≫ L, the optimal value for n_upd is determined as the maximum of S_pr, and is given by the solution of

d S_pr / d n_upd ≈ d/d n_upd [ (Q − 2M̄)² / (2 Q² T_M n_upd) ] = 0,   (30)

where M̄, defined as M̄ = G/n_upd, indicates the average margin value of the Update Blocks. Let us now define the parameter x as the ratio between the linear size of the PA (Q) and the average linear reduction of the VA within each Update Block (2M̄): x = Q/(2M̄). The meaningful range for x is given by 1 < x ≤ QL/(2G), where the limit values refer to:

x = 1 ⟺ Q = 2M̄ ⟺ M̄ = Q/2: in this case the VA A_V vanishes completely and S_pr reduces to 0.

x = QL/(2G) ⟺ n_upd = L: in this case each instruction is followed by an UPDATE operation. n_upd > L would lead to a program with two or more adjacent UPDATE operations, which is meaningless.

Using the parameter x, equation (30) becomes d/dx [ (x − 1)² / x³ ] = 0 (positive constant factors dropped), and its two solutions are:

x_1 = 1 ⟹ M̄ = Q/2 ⟹ n_upd,1 = 2G/Q
x_2 = 3 ⟹ M̄ = Q/6 ⟹ n_upd,2 = 6G/Q   (31)

Figure 5: Processing speed S_pr versus x

The diagram presented in Fig. 5 (S_pr vs x) shows that the first solution corresponds to a minimum, while the second one identifies a maximum. The optimal value of n_upd is then given by:

n_upd,opt = ⌈6G/Q⌉.   (32)

The same result can be achieved by considering the PA efficiency η_PA in processing a single sub-window (Q × Q) of the data set. PA efficiency η_PA is defined as the ratio between the number of PEs which