IEEE Conference - CiteSeerX

Optimized Data-Reuse in Processor Arrays

Sebastian Siegel, Renate Merker
Institute of Circuits and Systems
Department of Electrical Engineering and Information Technology
Dresden University of Technology, Germany
@iee1.et.tu-dresden.de

Abstract— In this paper, we present a method for co-partitioning affine indexed algorithms, resulting in a Processor Array with optimized data-reuse. Through this method, a memory hierarchy with an optimized data transfer is derived, which allows a significant reduction of the power consumption caused by memory accesses. In contrast to former design flows, which begin with a space-time transformation, we start with the co-partitioning of the iteration space. This allows an adaptation of the resulting Processor Array to the constraints of the target architecture at the beginning of the design. We illustrate our method for the motion estimation algorithm, which bears a high potential of data-reuse.

I. INTRODUCTION

A large amount of computational resources is often required in modern technical equipment, e.g. in multimedia and communication devices. Besides the fact that most of these devices can only be realized by the use of parallel processing, the growing demand for handheld equipment raises the need for energy-efficient hardware. As parallel processing allows a lower system speed, it can greatly reduce the amount of energy needed to run an algorithm. On the other hand, a well-designed memory architecture, e.g. a memory hierarchy, can achieve a significant reduction of the power consumption [1].

Parallel processing is achieved by the usage of a number of (typically 10 to 100) Processing Elements (PEs) forming a Processor Array (PA). Realizations in hardware include reconfigurable systems in FPGAs, ASIC designs, and CPU cores. Software also bears the potential of parallel processing, e.g. by the use of sub-word parallelism. It is very important to map algorithms onto target architectures systematically and to derive quantitative optimization criteria. Only then can the large design space be exploited effectively.

In the 1980s, methods for mapping algorithms to PAs were developed [2], [3]. The chances to implement such a design of a PA are often very limited due to the constraints of the target architecture, e.g. chip size and communication resources. Therefore, in the past few years, new methods were derived which allow the adaptation of a design to the target architecture [4], [5], [6], and tools for the computer-aided design of PAs were developed [7], [8], [9]. The issue of an optimized memory organization for digital signal processors is investigated in [10].

The entire sequence of design steps is as follows: embedding and localization, space-time transformation, partitioning,

and relocalization, as shown in Fig. 1 (a). A major disadvantage of this classical design flow comes from the fact that only in the last two steps is the design adapted to the target architecture, which causes problems for choosing an optimal space-time transformation. In [11], Teich and Thiele present a first approach to partitioning affine dependence algorithms, and they conclude that partitioning should be performed first in the design flow. [11] concentrates on the issue of the localization of affine data dependencies, whereas we present a modified design flow whose main contribution is a method for an optimal partitioning in the following sense.

Based on the results in [12] and [13], we present a method for co-partitioning affine indexed algorithms which optimizes the reuse of data. This method is favorable in particular for algorithms which bear a high potential of data-reuse, such as the motion estimation algorithm, where data is reused up to 1024 times (search range 32 × 32). We apply our method to this algorithm in detail. Fig. 1 (b) shows our method for mapping affine indexed algorithms to Processor Arrays. This method allows us to take the constraints of an architecture into account at the beginning of the design. In this paper, we focus on the design step co-partitioning.

Fig. 1. Classical design flow (a) and modified design flow (b):
(a) Affine Indexed Algorithm → Embedding/Localization → Space-Time Transformation → Partitioning → Relocalization → Processor Array
(b) Affine Indexed Algorithm → Embedding → Co-partitioning → Localization → Processor Array

The remaining part of this paper is organized as follows. The notation of affine indexed algorithms is given in section II. Section III shows the principle of co-partitioning. The issue of an optimized data-reuse is discussed in section IV. The modified design flow is applied to the example of the motion estimation algorithm in section V. In section VI, we give a summary of our contribution.

II. ALGORITHM CODING

We regard affine indexed algorithms and start with their definition based on [14] and [12].

Definition 1 (Affine indexed algorithm, AIA): An affine indexed algorithm is a set of statements

y_j[l_j(o)] := Φ_{h∈H_j} ( f_j(..., y_i[r_{i,j}(o+h)], ...) ),  ∀o ∈ O_j.

The indexing functions l_j and r_{i,j} are affine mappings l_j(o) = L_j·o + l_j and r_{i,j}(o+h) = R_{i,j}·(o+h) + r_{i,j}, respectively, where L_j ∈ Z^{m_j×n}, l_j ∈ Z^{m_j}, R_{i,j} ∈ Z^{q_{i,j}×n}, and r_{i,j} ∈ Z^{q_{i,j}}. The domains O_j and H_j are polyhedral subsets of Z-modules. Variables on the left-hand side are uniquely defined (single assignment property). Φ is an associative and commutative operator such as max, min, Σ, Π, gcd, and lcm. The images of the indexing functions form the index spaces I_j^Ind = l_j(O_j) and I_{i,j}^Ind = r_{i,j}(O_j ⊕ H_j). The iteration space for statement j is given by I_j = O_j ⊕ H_j. The embedding procedure aligns all iteration spaces I_j into a common iteration space I. After the embedding, the algorithm is given by its iterations i ∈ I.

Example 1: Consider the following difference equation of a special IIR filter:

y[k] := Σ_l b[l]·y[k−l],  2 ≤ l ≤ 7, 0 ≤ k ≤ 7.   (1)

An AIA notation is then given by:

y_y[(1 0)·o] := Σ_{h∈H} y_b[(0 1)·(o+h)+2] · y_y[(1 −1)·(o+h)−2],   (2)
                        \________________________________________/
                                       f(y_b, y_y)

where o ∈ O = {o = (i1 0)^T, 0 ≤ i1 ≤ 7} and H = {h = (0 i2)^T, 0 ≤ i2 ≤ 5}. The iteration space I consists of eight by six iterations, as depicted in Fig. 2.

III. CO-PARTITIONING

In this section, we present the co-partitioning (Fig. 1 (b)), which transforms an embedded AIA into a PA. It consists of two steps: tiling and scheduling. The shape of the PA is derived by the tiling procedure. The scheduling defines the succession of the iterations.

A. Tiling

Before we derive the tiling for co-partitioning, we show the general tiling scheme within an iteration space I ⊂ Z^n. The basis of a tiling is the tiling matrix Θ = diag(ϑ_1 ··· ϑ_n) ∈ N_+^{n×n}, whose diagonal elements represent the extension of a tile in each direction of the iteration space. A tile corresponds to a multidimensional rectangle. Each iteration i is separated as follows:

i = Θ·κ̂ + κ,  0 ≤ κ_k < ϑ_k, 1 ≤ k ≤ n,  κ̂ ∈ K̂ ⊂ Z^n, κ ∈ K ⊂ N^n,   (3)

where κ shows the position of i within the corresponding tile and κ̂ represents the position of that tile.

The tiling for co-partitioning consists of two consecutive tilings. The first tiling according to (3) derives the co-partitions, as given by (4), where Θ^(1) represents the size and shape of a co-partition, κ̂^(1) points to a co-partition, and κ^(1) points to an iteration within a co-partition. The second tiling according to (3) is given by (5). It separates the iterations κ^(1) within the co-partitions into LS-partitions. The non-zero entries ϑ_k^(2) of Θ^(2) must divide ϑ_k^(1) for all k ∈ {1, 2, ..., n} so that a whole number of LS-partitions fits into a co-partition. As we regard co-partitioning, we rewrite (5) and obtain (6), where we use a different notation:

i = Θ^(1)·κ̂^(1) + κ^(1),   (4)

κ^(1) = Θ^(2)·κ̂^(2) + κ^(2),   (5)

i = Θ^S·Θ^P·κ^Co + Θ^S·κ^P + κ^S,  with Θ^Co = Θ^S·Θ^P.   (6)

For (6) we introduce the following sets: K^Co consists of all κ^Co, K^P consists of all κ^P, and K^S consists of all κ^S.

Example 2: The two-dimensional iteration space from example 1 is to be tiled according to the tiling matrices Θ^S = diag(2 1) and Θ^P = diag(2 2). The result is visualized in Fig. 2. The tiles of the first tiling (co-partitions) are surrounded by a solid line. Their representatives κ^Co are marked by the dark gray filling of the iterations. The tiles gained from the second tiling are shaded and surrounded by a dotted line. Their representatives are marked by the hatched iterations κ^P. The iterations within these tiles differ by κ^S.

Fig. 2. Co-partitioning of a two-dimensional iteration space, Θ^S = diag(2 1) and Θ^P = diag(2 2)
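The two-step tiling of (4)-(6) can be checked numerically. The following sketch (illustrative only, not part of the paper; function names are our own) decomposes an iteration i into its co-partition, LS-partition, and inner coordinates for the diagonal tiling matrices of example 2, and verifies the reconstruction i = Θ^Co·κ^Co + Θ^S·κ^P + κ^S:

```python
# Sketch: decompose an iteration i into (kappa_Co, kappa_P, kappa_S)
# for diagonal tiling matrices Theta_S and Theta_P (cf. (6), example 2).
def decompose(i, theta_S, theta_P):
    # Theta_Co = Theta_S * Theta_P (diagonal entries multiply)
    theta_Co = [s * p for s, p in zip(theta_S, theta_P)]
    kappa_Co = [ik // tc for ik, tc in zip(i, theta_Co)]    # which co-partition
    rest     = [ik %  tc for ik, tc in zip(i, theta_Co)]    # offset inside it
    kappa_P  = [rk // ts for rk, ts in zip(rest, theta_S)]  # which LS-partition (PE)
    kappa_S  = [rk %  ts for rk, ts in zip(rest, theta_S)]  # iteration inside LS-partition
    return kappa_Co, kappa_P, kappa_S

def recompose(kappa_Co, kappa_P, kappa_S, theta_S, theta_P):
    # i = Theta_Co * kappa_Co + Theta_S * kappa_P + kappa_S, cf. (6)
    return [s * p * kc + s * kp + ks
            for s, p, kc, kp, ks in zip(theta_S, theta_P, kappa_Co, kappa_P, kappa_S)]

theta_S, theta_P = [2, 1], [2, 2]   # Theta_S = diag(2 1), Theta_P = diag(2 2)
for i1 in range(8):                 # the 8 x 6 iteration space of example 1
    for i2 in range(6):
        kc, kp, ks = decompose([i1, i2], theta_S, theta_P)
        assert recompose(kc, kp, ks, theta_S, theta_P) == [i1, i2]
```

The divisibility condition on Θ^(2) ensures that the integer divisions above tile the co-partition exactly, without partial LS-partitions.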

The tiling for co-partitioning defines the alignment of the Processing Elements in the PA as follows: each element κP ∈ KP corresponds to one PE. This property restricts the choice of a suitable tiling matrix ΘP . As PAs consist of PEs that are

aligned along two dimensions at most, the number of entries in Θ^P that are greater than one may not exceed two.

B. Scheduling

The scheduling defines the succession of the iterations in the PEs of the PA. In co-partitioning, the scheduling consists of two sequential schedules. The inner sequential schedule defines the succession of the iterations κ^S within an LS-partition, and the outer sequential schedule defines the succession of the co-partitions represented by κ^Co. In the following, we derive a linear schedule that serves as a sequential schedule.

We show the principle of a linear schedule for the succession of the co-partitions within an iteration space I that corresponds to a multidimensional rectangle. We introduce a matrix Θ̂ whose diagonal elements represent the number of co-partitions in each direction as follows:

Θ̂ = diag(ϑ̂_1 ··· ϑ̂_n)  and  ϑ̂_k = ⌈ϑ_k^I / ϑ_k^Co⌉ = ⌈ϑ_k^I / (ϑ_k^S·ϑ_k^P)⌉,   (7)

where ϑ_k^I is given by the number of iterations in direction k. This leads to the property det(Θ̂) = |K^Co|.

Each co-partition represented by κ^Co ∈ K^Co is assigned to a different time t^Co using the scheduling vector τ^Co as follows:

t^Co = τ^Co · κ^Co,  τ^Co ∈ Z^{1×n}, κ^Co ∈ N^{n×1}.   (8)

All values t^Co for different co-partitions represented by κ^Co must be unique. We get a dense linear schedule with:

max_{∀{κ_1^Co, κ_2^Co} ∈ K^Co} ( τ^Co·κ_1^Co − τ^Co·κ_2^Co ) = det(Θ̂) − 1.   (9)

Let m denote the number of entries in the matrix Θ̂ that are greater than one. Then there are m!·2^m different scheduling vectors for which (9) holds. This number of different scheduling vectors gives us a quantification of the degree of freedom when using the linear schedule to define the succession of co-partitions. Note that we have an additional degree of freedom when defining the sequential schedule within the LS-partitions. Here, an analog investigation of the tiling matrix Θ^S determines that degree.

From (9) we know that a scheduling vector τ^Co depends on Θ̂, and Θ̂ depends on the extension of the iteration space I. In order to define the sequential succession of co-partitions independently from the extension of the iteration space, we define a succession vector α^Co.

Definition 2 (Succession vector α^Co): The succession vector α^Co ∈ Z^{1×n} determines the succession of co-partitions within an n-dimensional iteration space I ⊂ Z^n. Each entry α_k^Co in α^Co defines the sequential schedule of co-partitions in the scheduling direction k to be along e_{|α_k^Co|}, where 1 ≤ k ≤ n and e_l denotes the l-th column vector of the identity matrix E ⊂ Z^{n×n}, which serves as an orthonormal basis in Z^n. The sign of α_k^Co distinguishes whether time is counted increasingly (positive) or decreasingly (negative) in that scheduling direction.

For each succession vector α^Co, the following equation determines the entries of the corresponding scheduling vector τ^Co = (τ_1^Co ··· τ_n^Co):

τ_{|α_1^Co|}^Co = sign(α_1^Co),
τ_{|α_k^Co|}^Co = sign(α_k^Co) · Π_{l=1}^{k−1} ϑ̂_{|α_l^Co|},  2 ≤ k ≤ m,
τ_{|α_k^Co|}^Co = 0,  m < k ≤ n.   (10)

Example 3: For co-partitions aligned in Z², all succession vectors α^Co are to be determined. There are 2!·2² = 8 solutions:

α^Co ∈ {(1 2), (2 1), (−1 2), (−2 1), (1 −2), (2 −1), (−1 −2), (−2 −1)}.

Given Θ̂ = diag(2 3), we have m = n = 2. Therefore, we get a different scheduling vector for each of the succession vectors. The corresponding scheduling vectors τ^Co are:

τ^Co ∈ {(1 2), (3 1), (−1 2), (3 −1), (1 −2), (−3 1), (−1 −2), (−3 −1)}.

Note that definition 2 refers to the notation of α^Co and τ^Co from the outer sequential schedule given by t^Co = τ^Co·κ^Co, where the scheduling vector τ^Co is determined by α^Co and Θ̂. The inner sequential schedule is given by t^S = τ^S·κ^S, where the scheduling vector τ^S is determined by α^S and Θ^S.

All iterations that differ only in the value of κ^P are executed in parallel on the PA. In order to allow data dependencies between these iterations, we introduce a scheduling offset:

t^offs = τ^offs · κ^P,  τ^offs ∈ Z^{1×n}, κ^P ∈ Z^{n×1}.   (11)

The schedule for co-partitioning is determined by both linear schedules and the scheduling offset as follows:

t = det(Θ^S) · τ^Co·κ^Co + τ^S·κ^S + τ^offs·κ^P.   (12)

Example 4: Fig. 3 shows an example of a schedule according to the tiling given in example 2 (Θ^S = diag(2 1), Θ^P = diag(2 2), and Θ̂ = diag(2 3)). The time t according to (12) is shown in each circle (iteration). The scheduling properties are as follows: α^Co = (−2 1) (and therefore τ^Co = (3 −1)), τ^S = (1 0), and τ^offs = (1 −1). The succession vector α^Co can be read as follows: the sequential schedule of co-partitions in the first scheduling direction is determined by α_1^Co. Therefore, it is along −e_2 = (0 −1)^T (down), and the sequential schedule of co-partitions in the second scheduling direction is along e_1 = (1 0)^T (right). The scheduling offset ensures that the data dependencies for y_y do not cause any causality conflicts. In Fig. 3, the dependency vectors for y_y[0] are exemplified. The computation of y_y[0] is completed at the lower left iteration, where all six dependency vectors meet at time t = 0. At the positions these dependence vectors point to, the value is used again (at t = 1 at the earliest).

The co-partitioning presented here covers both the locally sequential, globally parallel partitioning and the locally parallel, globally sequential partitioning, each as a special case. Co-partitioning has several advantages, e.g. a balancing of I/O-behavior and memory [15].
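The construction of τ^Co from α^Co in (10) and the composed schedule (12) can be sketched as follows (an illustrative sketch with our own function names; the numbers are those of examples 3 and 4):

```python
from itertools import product

# Sketch: scheduling vector tau_Co from a succession vector alpha_Co per (10).
def tau_from_alpha(alpha, theta_hat):
    n = len(alpha)
    m = sum(1 for v in theta_hat if v > 1)     # entries of Theta_hat greater than one
    tau = [0] * n
    running = 1                                # product of theta_hat entries seen so far
    for k in range(n):                         # k = 0 is the first scheduling direction
        d = abs(alpha[k]) - 1                  # which axis this direction schedules
        if k < m:
            tau[d] = (1 if alpha[k] > 0 else -1) * running
            running *= theta_hat[d]
        # entries for k >= m stay 0
    return tau

theta_hat = [2, 3]                             # Theta_hat = diag(2 3)
assert tau_from_alpha([-2, 1], theta_hat) == [3, -1]   # example 4

# Sketch: the co-partitioning schedule (12) for example 4.
def t_schedule(k_Co, k_S, k_P, det_theta_S=2,
               tau_Co=(3, -1), tau_S=(1, 0), tau_offs=(1, -1)):
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    return det_theta_S * dot(tau_Co, k_Co) + dot(tau_S, k_S) + dot(tau_offs, k_P)

# The co-partition times are dense, cf. (9): their spread is det(Theta_hat) - 1.
times = [3 * k1 - k2 for k1, k2 in product(range(2), range(3))]
assert max(times) - min(times) == 2 * 3 - 1 and len(set(times)) == 6
```

Enumerating all κ^Co, κ^P, κ^S of example 4 with `t_schedule` reproduces the time range −5 to 8 shown in Fig. 3.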

Fig. 3. Schedule for co-partitioning according to t = 2·(3 −1)·κ^Co + (1 0)·κ^S + (1 −1)·κ^P

Fig. 4. Data-reuse for y_y

IV. OPTIMIZATION OF DATA-REUSE

In this section, we derive a method for co-partitioning which maximizes the reuse of data between adjacent co-partitions. By storing data that was used in one co-partition and is reused in another co-partition in a local memory (cache) on the PA, the communication between the PA and the peripheral memory can be reduced significantly.

Example 5: Let us consider the potential data-reuse of the variable y_y from example 1. In Fig. 4, all iterations in the iteration space where the same instance of the variable y_y is referenced are connected by a solid line. In each co-partition, five different instances of the variable y_y must be made available. Given the sequential order of the co-partitions as shown by the numbers in Fig. 4, three out of five instances can be reused in two adjacent co-partitions scheduled in the first scheduling direction given by α_1^Co = −2. Data-reuse can also be exploited in scheduling directions given by α_k^Co with k > 1. In this example there is only one such scheduling direction, given by α_2^Co = 1. To determine the reuse of data in that scheduling direction, we consider the two blocks that are surrounded by the dashed line in Fig. 4. Nine instances of the variable y_y are referenced within each block, and five instances can be reused in the adjacent block.

Fig. 5 shows a PA with a memory hierarchy. Each PE has a local memory L0. Some PEs at the border of the array have access to cache L1. The cache Lk has access to cache Lk+1, and the cache of the highest level is connected to the external memory. In Fig. 5, two levels of cache (L1 and L2) are shown. We suggest that one way of using the memory hierarchy efficiently is to store the instances of each variable that are reused along the sequential schedule of co-partitions in the first scheduling direction in the cache of the first level. The second-level cache stores the instances of each variable that are reused in the second scheduling direction, and so on. We follow an optimization which results in a memory hierarchy where the lowest-level cache has the most memory accesses, whereas the highest-level cache is accessed the fewest times.
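The reuse counts of example 5 can be confirmed by brute force. The following sketch (illustrative only; the helper name is ours) enumerates the instances of y_y, whose index is (1 −1)·(o+h) − 2 = i1 − i2 − 2, referenced in adjacent co-partitions and blocks:

```python
# Sketch: brute-force check of the data-reuse counts of example 5.
# The instance of y_y referenced at iteration (i1, i2) has index i1 - i2 - 2.
def instances(i1_range, i2_range):
    return {i1 - i2 - 2 for i1 in i1_range for i2 in i2_range}

# Co-partitions have size Theta_Co = diag(4 2).
a = instances(range(0, 4), range(0, 2))   # co-partition at the origin
b = instances(range(0, 4), range(2, 4))   # its neighbor in the first scheduling
                                          # direction (alpha_1_Co = -2, along -e2)
assert len(a) == 5                        # five instances per co-partition
assert len(a & b) == 3                    # three of them can be reused

# Second scheduling direction (alpha_2_Co = 1): the two dashed 4x6 blocks.
left  = instances(range(0, 4), range(0, 6))
right = instances(range(4, 8), range(0, 6))
assert len(left) == 9                     # nine instances per block
assert len(left & right) == 5             # five of them can be reused
```

Such an exhaustive count is feasible here because the iteration space is tiny; the volume function introduced below captures the same quantities analytically.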

Fig. 5. PA with a memory hierarchy

In the following, we quantify the performance of data-reuse along the sequential schedule of co-partitions in the first scheduling direction. We use the volume function [12], [13]:

V_{i,j,1}^Co = Υ(Θ^Co, R_{i,j})   (13)

to determine the number of instances that are referenced within one co-partition for variable y_i in the j-th statement of the AIA-notation. Θ^Co = Θ^S·Θ^P refers to the size of the co-partition, and R_{i,j} follows from the AIA-notation of the algorithm (see definition 1). In the same way, we can determine the number of instances referenced within two consecutive and adjacent co-partitions:

V_{i,j,2}^Co = Υ(Θ^Co·E_2, R_{i,j}),   (14)

where

E_2 = (e_kl), with  e_kl = 0 if k ≠ l;  e_kl = 1 if k = l ≠ |α_1^Co|;  e_kl = 2 if k = l = |α_1^Co|,   (15)

and α_1^Co refers to the sequential schedule of co-partitions in the first scheduling direction. The following equation determines the number of instances that cannot be reused in two adjacent co-partitions:

ΔV_{i,j}^Co = V_{i,j,2}^Co − V_{i,j,1}^Co.   (16)

In order to determine the performance of the data-reuse, we introduce the average communication rate c_{i,j}:

c_{i,j} = (1 / det(Θ^Co)) · (V_{i,j,1}^Co + (ϑ̂_{|α_1^Co|} − 1)·ΔV_{i,j}^Co) / ϑ̂_{|α_1^Co|}.   (17)

The average communication rate determines the number of instances referenced per iteration for one variable, as an average over all iterations within all co-partitions in the first scheduling direction. The term ϑ̂_{|α_1^Co|} represents the number of co-partitions in the first scheduling direction. The smaller the value of c_{i,j}, the less communication between the PA and its peripheral memory is necessary. In order to allow a ranking of the average communication rate c_{i,j}, we determine the theoretically achievable minimal average communication rate c_{min,i,j}:

c_{min,i,j} = Υ(Θ^I, R_{i,j}) / det(Θ^I),   (18)

where Θ^I represents the size of the iteration space I.

Example 6: Suppose the numbers in Fig. 4 show the succession of the co-partitions. Let us regard variable y_y from example 1, indexed by R = (1 −1). We get V_1^Co = 5, V_2^Co = 7, ΔV^Co = 2, and c = (1/8) · (5 + (3−1)·2)/3 = 9/24 = 0.375. Instead of loading one instance for each iteration (no data-reuse), only 0.375 instances on average need to be fetched for each iteration. The theoretically achievable minimal average communication rate c_min = 13/48 ≈ 0.271 is reached to approx. 72.2 percent.



V. EXAMPLE: MOTION ESTIMATION

In this section, we show how to derive a Processor Array with an optimized memory hierarchy that can be used efficiently for the motion estimation algorithm. The algorithm consists of the computation of the sums of absolute differences (SAD) for blocks within the current frame c and (shifted) blocks of a reference frame r, and an evaluation of these SAD which determines the motion vector (MV) according to the best match (smallest SAD). We refer to the frames as r(x, y) = r[(x y)^T] and c(x, y) = c[(x y)^T], where 0 ≤ x ≤ X−1 and 0 ≤ y ≤ Y−1, and a frame consists of X by Y pixels. For the computation of the motion vectors including the intermediate SAD, six indices have to be regarded:

(x̃, ỹ) ... block index
(Mx, My) ... block shift
(i, j) ... position within a block

With a block size of M by N we get x̃ = ⌊x/M⌋ and ỹ = ⌊y/N⌋, and their sets: x̃ ∈ {0, 1, ..., X̃−1} and ỹ ∈ {0, 1, ..., Ỹ−1}, where X̃ = ⌈X/M⌉ and Ỹ = ⌈Y/N⌉. Further, we define the sets for Mx and My:

Mx ∈ M_Mx = {−MX, −MX+1, ..., MX−1},
My ∈ M_My = {−MY, −MY+1, ..., MY−1}.

The following two equations represent the motion estimation algorithm:

SAD[(x̃ ỹ Mx My)^T] := Σ_{i=0}^{M−1} Σ_{j=0}^{N−1} | c[(M·x̃+i, N·ỹ+j)^T] − r[(M·x̃+i+Mx, N·ỹ+j+My)^T] |,   (19)

MV[(x̃ ỹ)^T] := argmin_{∀(Mx,My) ∈ M_Mx × M_My} SAD[(x̃ ỹ Mx My)^T].   (20)

With (19), the SAD for block (x̃, ỹ) in the current frame c and a block in the reference frame r shifted by (Mx, My) is computed. The best matching block is detected with the argmin-operator (20), which returns the pair (Mx, My) for which the SAD is minimal for a constant (x̃, ỹ).

An AIA-notation of (19) according to Def. 1 is as follows:

y_SAD[l_SAD(o)] := Σ_{h∈H_SAD} | y_c[r_{c,SAD}(o+h)] − y_r[r_{r,SAD}(o+h)] |,  o ∈ O_SAD,   (21)

where y_SAD[·], y_c[·], and y_r[·] are affine indexed with:

L_SAD = ( 1 0 0 0 0 0 / 0 1 0 0 0 0 / 0 0 1 0 0 0 / 0 0 0 1 0 0 ),  l_SAD = ( 0 0 −MX −MY )^T,

R_{c,SAD} = ( M 0 0 0 1 0 / 0 N 0 0 0 1 ),  r_{c,SAD} = 0,

R_{r,SAD} = ( M 0 1 0 1 0 / 0 N 0 1 0 1 ),  r_{r,SAD} = ( −MX −MY )^T,

O_SAD = { o = (i1 i2 i3 i4 0 0)^T },  H_SAD = { h = (0 0 0 0 i5 i6)^T },  with 0 ≤ i1
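A direct, unoptimized evaluation of (19) and (20) can be sketched as follows. This is an illustrative reference only; the block size, search range, and test frames are chosen freely for the example and are not taken from the paper:

```python
# Sketch: direct evaluation of the SAD (19) and the argmin motion vector (20).
M, N = 4, 4            # block size (illustrative choice)
MX, MY = 2, 2          # search range: Mx in [-MX, MX), My in [-MY, MY)

def sad(c, r, bx, by, mx, my):
    # (19): sum of absolute differences for block (bx, by) and shift (mx, my)
    return sum(abs(c[M * bx + i][N * by + j] - r[M * bx + i + mx][N * by + j + my])
               for i in range(M) for j in range(N))

def motion_vector(c, r, bx, by):
    # (20): argmin over all candidate shifts; assumes all shifts stay inside r
    return min(((mx, my) for mx in range(-MX, MX) for my in range(-MY, MY)),
               key=lambda s: sad(c, r, bx, by, *s))

# Synthetic frames: c is r shifted by (1, 1), so the best match for block
# (1, 1) of c is the block of r displaced by the motion vector (1, 1).
r = [[7 * x + 13 * y for y in range(12)] for x in range(12)]
c = [[r[x + 1][y + 1] for y in range(8)] for x in range(8)]
assert motion_vector(c, r, 1, 1) == (1, 1)
```

Each pixel of the current block is referenced by (2·MX)·(2·MY) candidate shifts, which is the data-reuse potential (up to 1024 for a 32 × 32 search range) that the co-partitioned memory hierarchy exploits.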