SYSTEMATIC HIGH-LEVEL ADDRESS CODE TRANSFORMATIONS FOR PIECE-WISE LINEAR INDEXING: ILLUSTRATION ON A MEDICAL IMAGING ALGORITHM

C. Ghez, M. Miranda, A. Vandecappelle, F. Catthoor, D. Verkest

IMEC, Kapeldreef 75, 3001 Leuven, Belgium
Also professor at the Katholieke Universiteit Leuven, Belgium

Abstract

Exploring data transfer and storage issues is crucial to efficiently map data intensive applications (e.g., multimedia) onto programmable processors. Code transformations are used to minimise the main memory bus load, and hence also power consumption, while improving system performance. However, this typically incurs a considerable arithmetic overhead in the addressing and local control. For instance, memory optimising in-place and data-layout transformations add costly modulo and integer division operations to the initial addressing code. In this paper, we show how this cycle overhead can be almost completely removed. This is done according to a systematic methodology which combines an algebraic transformation exploration approach for the (non)linear arithmetic with an efficient transformation technique for reducing the piece-wise linear indexing to linear pointer arithmetic. The approach is illustrated on a real-life medical application, using a variety of programmable processor architectures. Total gains in cycle count ranging between a factor of 5 and 25 are obtained compared to conventional compilers.

INTRODUCTION AND RELATED WORK

Specialised programmable units and/or fully custom hardware units add to the design process complexity and cost. To improve design productivity at low cost, a compiler approach is particularly attractive since it enables potential performance gains without an increase in design complexity. However, it requires an optimised mapping of the application onto the target programmable platform. For instance, data transfer and storage exploration (DTSE) issues are crucial to efficiently map data intensive applications (e.g., multimedia) onto programmable platforms. These techniques [14] transform the initial code to minimise the load of the (shared) memory bus. A substantial reduction in bus load also translates into significant savings in energy consumption and system performance (at the board level), and even a potential speed-up of the CPU. However, this typically comes at the expense of a considerable arithmetic overhead in the addressing and manifest branching, resulting in an actual CPU performance degradation. For instance, the memory in-place [19] and data-layout [18] DTSE stages modify the initial index expressions by adding costly integer modulo and division operations to the addressing code. In this paper, we show that using a systematic approach, it is possible to remove these operations again, arriving at nearly no cycle overhead while of course still maintaining all the advantages in memory access and bus load.

In the context of address code generation, traditional compiler approaches have targeted only low (pointer) level optimisations. Two techniques have been widely addressed: the optimal assignment of offset values to index registers [4, 5, 7, 8, 9, 12] and the optimal layout for allocating scalar data in memory [6, 10, 11]. Both techniques aim at minimising the number of register loads by maximally exploiting the auto-increment features of the instruction-set ACU (Address Calculation Unit). These techniques assume an explicit expansion of the initial addressing code into pointer arithmetic which is later optimised. For non-linear arithmetic (e.g., modulo arithmetic), current compilers expand the non-linear code using generic function libraries. At best, the compiler can detect particular cases like power-of-two constants which can be mapped onto an ACU shift operator. Even if more efficient library functions are devised [2], compilers lack the scope for exploration during the expansion phase and thus implement an overly general functionality.

Based on our previous experience with the custom ACU oriented high-level address optimisation approach [13], we have found that huge gains can be obtained by performing aggressive source-to-source optimisations and then using standard compilers for low-level code generation [17]. For instance, modulo operations affecting affine arithmetic can be modelled as a piece-wise linear arithmetic problem for which effective transformations can be applied before the pointer level optimisation techniques take place. In this paper, we extend this script with transformation techniques aiming at reducing the strength of integer modulo and division operations in the context of piece-wise linear addressing.
This is done by transforming the modulo affected affine index expressions into a combination of linear induction variables and conditional code. To avoid unnecessary index register allocation and local control overhead, modulo related algebraic transformations complemented by aggressive code hoisting techniques are first applied to the initial code to reduce the number of modulo related expressions. We illustrate the application of our high-level address optimisation script, augmented with the proposed approach, on top of conventional compilers. The selected driver is a cavity detection algorithmic kernel, which is used in real-life medical applications. Performance results are reported for a large variety of programmable processor architectures and their state-of-the-art compilers.

In contrast to these low-level approaches, it is possible to apply code transformations for mapping modulo operations on programmable processors. Franssen et al. [3] propose to transform affine expressions of a loop index affected by a modulo by using linearly auto-incremented pointers to represent the piece-wise linear behaviour of the modulo, combined with conditions that reset the pointers to mimic its repetitive nature. However, this technique is formalised neither for nested loops nor for integer division operations, and in its initial form it is limited to constant loop bounds. Balasa [1] eliminates modulo operations inside nested loops even if there are affine dependencies between the loop iterators. Extra loops are added to reproduce the repetitive behaviour of the index function. However, this technique assumes code in single-assignment form, which is usually not the case in our context. For integer division operations, linear induction variable analysis [16] can be used if the step increment of the loop iterator is a multiple of the divisor. In that case, the entire expression has a linear variation and can be replaced by an auto-incremented pointer. This special case is only rarely encountered, though.
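This special case can be sketched in C (a hedged illustration of ours, not code from any cited tool; the divisor 4 and the array contents are arbitrary): when the loop steps by 4, the quotient i/4 advances by exactly 1 per iteration, so a plain counter can replace the division.

```c
#include <assert.h>   /* used by the checks accompanying this sketch */

/* Before: the addressing contains a costly integer division. */
int sum_div_original(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i += 4)   /* step (4) is a multiple of the divisor (4) */
        s += a[i / 4];
    return s;
}

/* After: i/4 is a linear induction variable and becomes the counter q. */
int sum_div_transformed(const int *a, int n) {
    int s = 0, q = 0;                /* invariant: q == i/4 */
    for (int i = 0; i < n; i += 4) {
        s += a[q];
        q++;                         /* auto-increment replaces the division */
    }
    return s;
}
```

Both versions touch a[0] .. a[n/4 - 1] and return identical sums; when the step is not a multiple of the divisor, i/4 is no longer linear in the iteration count and this rewrite does not apply.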

POINTER SUBSTITUTION TECHNIQUE FOR PIECE-WISE LINEAR INDEXING

Modulo and division are expensive operations on all commercial processors. Even with hardware support, they are multi-cycle operations. Moreover, during the DTSE stage, a significant amount of modulo and division operations is introduced in the addressing of the inner loops. An expansion technique for piece-wise linear operators like modulo and division is therefore needed. The technique we propose transforms both modulo and division operations placed inside nested loops. It allows dependencies between the loop iterators in the loop bounds. It is limited to affine expressions of the iterators in both the index and loop start expressions, but that is compatible with all practical occurrences in multimedia algorithms. The concept of our technique is illustrated in Figure 1. Modulo operations on affine expressions are substituted by pointers. These pointers are initialised before the loop and auto-incremented with a compile-time known offset in every iteration. In addition, conditions are added to reset the pointer (here by subtracting 4) when its maximum value is reached, so that it follows the same piece-wise linear evolution as i%4:

  /* before */                     /* after */
  for (i=0; i<N; i++) {            tmp = 0;
    ... i%4 ...                    for (i=0; i<N; i++) {
  }                                  if (tmp >= 4) tmp -= 4;
                                     ... tmp ...
                                     tmp++;
                                   }
Figure 1: Illustration of a piece-wise linear evolution sequence.

Figure 2 shows typical addressing code obtained after the DTSE transformations, in particular after the data layout optimisation step. The figure shows array A being affected by this transformation: both a modulo and an integer division operation are introduced in the addressing code. Figure 3 illustrates the resulting code after substituting the modulo and division by pointers. Also the in-place optimisation step of DTSE introduces modulo operations, which can be removed by this technique.

  for (iter1 = d11; iter1 < stop1; iter1 += step1) {
    ........
    for (iterN = dN1*iter1 + .. + dNN; iterN < stopN; iterN += stepN) {
      A[(c1*iter1 + .. + cN*iterN) % val + (c1*iter1 + .. + cN*iterN) / val];
    }
    ........
  }

Figure 2: Typical addressing code after the DTSE data layout step.

  modPtr1 = ...; divPtr1 = ...;
  for (iter1 = d11; iter1 < stop1; iter1 += step1) {
    if (modPtr1 >= val) { modPtr1 -= val; divPtr1++; }
    modPtr2 = modPtr1; divPtr2 = divPtr1;
    ........
    for (iterN = dN1*iter1 + .. + dNN; iterN < stopN; iterN += stepN) {
      if (modPtrN >= val) { modPtrN -= val; divPtrN++; }
      A[modPtrN + divPtrN];
      modPtrN += mod_incrN; divPtrN += div_incrN;
    }
    ........
    modPtr1 += mod_incr1; divPtr1 += div_incr1;
  }

Figure 3: Illustration of modulo/division substitution on the code of Figure 2.

On entry of each loop, the pointers are initialised by copying the current value of the pointers corresponding to the surrounding loop (modPtr_i = modPtr_(i-1); divPtr_i = divPtr_(i-1)). The initialisation of the outermost pointers is given by the manifest (computable at compile time) values:

  modPtr_1 = (c_1*init_1 + c_2*init_2 + ... + c_N*init_N) % val
  divPtr_1 = (c_1*init_1 + c_2*init_2 + ... + c_N*init_N) / val

where init_i = d_i1*init_1 + d_i2*init_2 + ... + d_ii is the lower bound of loop i. In the transformed code of Figure 3, each pointer modPtr_i and divPtr_i is incremented at the end of loop i with a pre-computed value given by mod_incr_i = base_i % val and div_incr_i = base_i / val, where:

  base_i = c_i * step_i                                 (1)
         + step_i * sum_(j=i+1..N) c_j * d_ji           (2)
Formula (1) is related to the affine expression affected by the modulo and division operations and takes care of the pointer variation along the loop execution. Therefore, it depends on the index coefficients c_1, c_2, ..., c_N and on the loop step. Formula (2) adjusts the increment computation in case dependencies exist between the loop iterators (it depends on the coefficients of the loop start expressions) and is equal to 0 otherwise. Note that there is no dependency on the upper bounds of the loops; thus, the technique is also applicable to loops with a data-dependent number of iterations. The increment and initialisation values can be pre-computed since they only depend on the values c_i, d_ij and step_i. Even if these are not statically known, they can be computed outside of the loop nest and thus have to be computed only once. The modulo substitution technique is general enough to efficiently transform piece-wise linear array indexing. It can even be extended to cope with other non-linear cases like piece-wise polynomial indexing, using non-linear induction variable replacement techniques [17] to express the pointer increments, but that falls outside the scope of this paper.

The transformation presented introduces extra pointers and conditions. Therefore, to avoid unnecessary index register allocation and local control overhead, modulo related algebraic transformations complemented by aggressive code hoisting are first applied to reduce the number of modulo related expressions to a minimum set. Designers already apply many such techniques manually in an ad-hoc way, but formalisation is crucial for compiler automation covering all possible cases. To our knowledge, such a general technique has not been published earlier [16, 24, 25].
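As a hedged end-to-end sketch of the transformation of Figures 2 and 3 (the index expression 3*y + x, the divisor val = 5 and the loop bounds are our own illustrative choices, not taken from the driver), the modulo and division are replaced by the induction-variable pairs (mp1, dp1) and (mp2, dp2), with pre-computed increments 3 % 5 = 3 and 3 / 5 = 0 for the outer loop, and 1 % 5 = 1 and 1 / 5 = 0 for the inner loop:

```c
#include <assert.h>   /* used by the checks accompanying this sketch */

/* Before: address computed with modulo and integer division. */
void addr_original(int *out, int Y, int X) {
    for (int y = 0; y < Y; y++)
        for (int x = 0; x < X; x++)
            out[y*X + x] = (3*y + x) % 5 + (3*y + x) / 5;
}

/* After: mp/dp track (3*y + x) % 5 and (3*y + x) / 5 respectively. */
void addr_transformed(int *out, int Y, int X) {
    int mp1 = 0, dp1 = 0;                 /* outer pair, value at x == 0 */
    for (int y = 0; y < Y; y++) {
        if (mp1 >= 5) { mp1 -= 5; dp1++; }   /* reset mimics wrap-around */
        int mp2 = mp1, dp2 = dp1;            /* copy into the inner pair */
        for (int x = 0; x < X; x++) {
            if (mp2 >= 5) { mp2 -= 5; dp2++; }
            out[y*X + x] = mp2 + dp2;        /* no modulo, no division */
            mp2 += 1;                        /* mod_incr = 1, div_incr = 0 */
        }
        mp1 += 3;                            /* mod_incr = 3, div_incr = 0 */
    }
}
```

The conditional reset keeps each mod pointer inside [0, val); this works here because both increments are smaller than val, which holds for the pre-computed base values after taking them modulo val.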

ADDRESS OPTIMISATIONS FOR THE CAVITY DETECTION ALGORITHM

The main goal of DTSE is a reduction of the system bus load and of the power consumption in the background memory hierarchy. It consists of several code transformation steps: breaking data flow bottlenecks, increasing the locality of memory accesses, and optimising the storage in a hierarchical memory organisation [21, 22]. On the other hand, an overhead in addressing is introduced, affecting the CPU performance. For instance, the loop transformation step of DTSE introduces many conditions; the data reuse step creates small arrays and therefore more addressing; the in-place mapping step performs a windowing in these small arrays, introducing many modulo operations; and the memory data layout optimisation step introduces divisions and also modulo operations by re-arranging the layout of the data in main memory to improve the mapping to the cache architecture. The ADOPT methodology [17] can eliminate nearly all this overhead without affecting the benefits of DTSE. A combination of DTSE and ADOPT therefore gives good results on both power consumption and system/CPU performance.

Our ADOPT script for programmable processors is now illustrated on a real-life driver: the cavity detection algorithm, an image processing application used mainly in the medical field. The starting point of the experiments has been a version of the code after DTSE [15]. The initial cavity detection algorithm consists of three functions. Each has one image frame (of 1280 x 1000 pixels) as input and one as output. After DTSE, only one image frame is present and the number of background accesses has drastically decreased, but the executed arithmetic operations now amount to 68.9 million modulo operations and 2.6 million integer divisions for addressing, plus 2.5 million integer divisions and 7.6 million multiplications for data processing. The application is hence dominated by arithmetic, mainly inside the addressing, which now becomes the main bottleneck.

High-level address transformations script The ADOPT script aims at high-level address code optimisations and consists of two stages: a processor independent stage and a processor (or processor family) specific one. The first stage removes the arithmetic bottlenecks by reducing the complexity of the non-linear arithmetic and the amount of linear operations. No trade-off is involved because these transformations remove essentially redundant computations. Then, for the final algorithm mapping, decisions have to be taken concerning the remaining linear arithmetic and the control flow issues. This is done in the processor specific stage. For example, in the cavity detection algorithm, many conditions have been introduced by DTSE. These can be removed from the heart of the loop nest by a loop folding transformation [15] during the processor specific transformations.

The script comprises three main processor independent steps (after address expression extraction), namely: Algebraic Cost Minimisation (ACM), Advanced Code Hoisting (ACH) and Non-linear Operator Strength Reduction (NOSR). The ACM step reduces the number of operation instances. ACH reduces the number of times the operations are executed. NOSR, finally, reduces the cost of non-linear operations like modulo. The next three subsections show examples of these transformations on the cavity detection driver. The main focus of this paper is on the novel aspects related to piece-wise linear arithmetic, but the other steps are also briefly described to provide the necessary context. Most of the transformations applied for this work have been done manually, but in a systematic way. Because of the formalism applied, we believe that they can be automated in a high-level code optimiser, which is our current work. Some of the steps have already been automated in prototype tools, such as the exploration support for the algebraic transformations on linear arithmetic and the substitution of piece-wise linear operations by pointers.

Algebraic Cost Minimisation

In the ACM step, algebraic transformations are performed on the address expressions to minimise the number of operation instances. The algebraic transformations open up opportunities for hidden Common Sub-expression Elimination (CSE), constant propagation, dead code elimination, etc. The goal is to find a factorisation of the address expressions which allows computations to be reused as much as possible [20].

  (a) distributivity:          (x+4)%3 = (x%3 + 4%3)%3
      constant folding:                = (x%3 + 1)%3
      constant unfolding:              = (x%3 + 1%3)%3
      invert distributivity:           = (x+1)%3

  (b) modulo expansion:        (x+2)%3 = 3 - x%3 - (x+1)%3

Figure 4: Illustration of the factorisation exploration for modulo expressions using algebraic transformations, applied on the DTSE optimised driver.

For example, in the cavity detection code, we find many expressions affected by a modulo. The substitution technique described in the previous section can be applied to these modulo expressions. However, every modulo operation is mapped onto a pointer plus a conditional, which creates a potential overhead in index register allocation and local control that can be avoided at this stage. The easiest way to avoid this is to first reduce the number of modulo operations to a minimum set by exploiting modulo related algebraic transformations. Figure 4(a) illustrates some of the transformations applied to our driver. One possible transformation exploited in this code is to take advantage of the circular property of the values generated by a modulo operation. Figure 4(b) shows an example in the cavity detection driver where the modulo value (equal to 3) is very small. In this case, it is beneficial to expand the initial expression and to collapse the three common sub-expression variables created during ACM onto just two of them. The third one is then produced outside the expansion. Figure 5 shows the effects of the transformations on a fragment of the DTSE optimised cavity detection algorithm. After the ACM step, only two modulo expressions are needed instead of the four initially present.

  /* (a) before ACM */          /* (b) after ACM */
  for (x=..) {                  for (x=..) {
    array1[x%3];                  cse1 = x%3;
    array1[(x+4)%3];              cse2 = (x+1)%3;
    array2[(x+1)%3];              array1[cse1];
    array2[(x+2)%3];              array1[cse2];
  }                               array2[cse2];
                                  array2[3-cse1-cse2];
                                }
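The identities explored in Figure 4 can be verified directly in C (for non-negative x, where C's % operator agrees with the mathematical modulo):

```c
#include <assert.h>   /* used by the checks accompanying this sketch */

/* Figure 4(a): (x+4)%3 simplifies to (x+1)%3. */
int lhs_a(int x) { return (x + 4) % 3; }
int rhs_a(int x) { return (x + 1) % 3; }

/* Figure 4(b): modulo expansion. The values x%3, (x+1)%3 and (x+2)%3
 * are always a permutation of {0, 1, 2}, so they sum to 3. */
int lhs_b(int x) { return (x + 2) % 3; }
int rhs_b(int x) { return 3 - x % 3 - (x + 1) % 3; }
```

The (b) identity is what allows the third common sub-expression to be produced from the other two, saving one modulo operation per iteration.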

  /* (c) before ACH */          /* (d) after ACH */
  for (x=...) {                 for (x=...) {
    if (x_range1) {               cse = x%3;
      cse1 = x%3;                 if (x_range1)
      array[cse1];                  array[cse];
    }                             if (x_range2)
    if (x_range2) {                 array[cse];
      cse2 = x%3;                }
      array[cse2];
    }
  }

Figure 5: Fragment of DTSE optimised code (a) initially and (b) after algebraic transformations; (c) code before ACH and (d) after ACH, showing overlapping ranges of the loop iterator inside condition test expressions.

Algebraic transformations are intensively applied in the cavity detection driver, allowing a reduction of the number of executed operations (see Table 1). Simply exposing the factorisation opportunities for CSE and letting the compiler do the actual CSE does not show any difference in performance. The reason is that the compiler does not see the opportunities for CSE, due to the presence of the non-linear modulo operations. For the integer divisions, similar opportunities may be available. However, the cavity detection code after DTSE contains only two integer division instances, which do not expose opportunities for the application of the transformations.

                      DTSE    ACM    ACH    NOSR
  modulo operations   68.9    21.7   5.1    0

Table 1: Number (in millions) of calls to modulo operations after the different steps.

Advanced Code Hoisting

In the Advanced Code Hoisting step, the arithmetic optimised during ACM is spread over higher level scopes. The goal is to minimise the number of times the operations are executed [17], especially for the non-linear arithmetic. Figure 5 (c and d) illustrates a piece of code from the cavity detection algorithm after the DTSE transformations, before and after applying code hoisting aggressively across conditional scopes. Most of the modulo operations present in the cavity detection code are embedded inside different conditions. Therefore, CSE between expressions used in different conditions requires the common sub-expression to be hoisted outside the conditional scope. However, this is only advantageous if both conditions are true most of the time. Therefore, the condition ranges are analysed to determine whether there is sufficient overlap. In the cavity detection driver, this is the case for all CSE opportunities. In addition, the number of executed modulo operations is significantly reduced here because the expressions depending only on the outer loop iterator have been moved to the scope of the outer loop.
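A minimal sketch of hoisting across conditional scopes (the range predicates x >= 1 and x >= 2 are hypothetical stand-ins for x_range1 and x_range2, which are not spelled out in the driver fragment):

```c
#include <assert.h>   /* used by the checks accompanying this sketch */

/* (c)-style code: x%3 is evaluated inside each condition separately,
 * so no CSE across the two scopes is possible. */
int use_original(int x) {
    int r = 0;
    if (x >= 1) {            /* x_range1 (hypothetical) */
        int cse1 = x % 3;
        r += cse1;
    }
    if (x >= 2) {            /* x_range2 (hypothetical) */
        int cse2 = x % 3;
        r += cse2;
    }
    return r;
}

/* (d)-style code: x%3 is hoisted above both conditions and shared.
 * This pays off only when the ranges overlap most of the time. */
int use_hoisted(int x) {
    int r = 0;
    int cse = x % 3;         /* evaluated once per iteration */
    if (x >= 1) r += cse;
    if (x >= 2) r += cse;
    return r;
}
```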

Non-linear Operator Strength Reduction

After ACM and ACH, a significantly lower number of non-linear operations has to be executed. The Non-linear Operator Strength Reduction (NOSR) step transforms the remaining costly operations to reduce the number of cycles needed to compute them. In particular, the modulo and integer division operators are very costly because they are implemented by calls to complex library routines containing many instructions and requiring many cycles.

  for (y = 0; y < 1000; y++) {
    y_cse = y%3;
    for (x = 0; x < 1280; x++) {
      x_cse = x%3;
      ....
      ARRAY[y_cse + x_cse];
    }
  }
          (a)

  y_ptr = 0;
  for (y = 0; y < 1000; y++) {
    if (y_ptr >= 3) y_ptr -= 3;
    x_ptr = 0;
    for (x = 0; x < 1280; x++) {
      if (x_ptr >= 3) x_ptr -= 3;
      ....
      ARRAY[y_ptr + x_ptr];
      x_ptr++;
    }
    y_ptr++;
  }
          (b)

Figure 6: Examples of code (a) before NOSR and (b) after NOSR, illustrating the transformation of modulo operations.

A special case occurs when the modulo value is a power of 2. It is well known that the modulo can then simply be replaced by a mask operation (which is a primitive operation); likewise, a shift operation can be used for integer divisions. This kind of local transformation is already performed by a good optimising compiler. When the modulo value is not a power of 2, the substitution technique described earlier is applied. Figure 6 shows two pieces of code extracted from the cavity detection algorithm, one before NOSR (i.e. after ACM and ACH) and one after NOSR, where no modulo operation is present. The strength of the modulo operator is reduced by replacing a complex algorithm by a combination of conditional and linear arithmetic. The more ACM and ACH reduce the number of modulo operations, the smaller the gain brought by NOSR. As a result of our complete script, the number of executed operations approaches 0 for both modulo and integer divisions (see Table 1).
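The power-of-two special case mentioned above reduces to single primitive operations. A small check (restricted to unsigned operands; for negative signed values, C's truncating division makes the shift rewrite invalid):

```c
#include <assert.h>   /* used by the checks accompanying this sketch */

/* x % 8 becomes a mask with 8-1 = 7, a single-cycle AND. */
unsigned mod8(unsigned x) { return x & 7u; }

/* x / 8 becomes a right shift by log2(8) = 3, a single-cycle shift. */
unsigned div8(unsigned x) { return x >> 3; }
```

When the constant is not a power of two (e.g. the value 3 in the cavity detection driver), no such single-instruction rewrite exists, which is exactly where the pointer substitution of Figure 6(b) is needed.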

RESULTS

This section shows and discusses the results obtained using our approach on the cavity detection algorithm. To evaluate the platform independence of the transformations, the same code has been compiled on four platforms using their respective native compilers: Pentium, MIPS, TriMedia and HP-RISC. The final version is obtained after the application of a DTSE stage [15, 22] and the ADOPT script described here. About a factor of three improvement with respect to the initial version is observed, irrespective of the platform used (see Table 2). We have also applied the ADOPT script directly to the initial code. In this case we also gain in CPU performance (up to a factor of two); however, this is less than when applying a DTSE stage up front, because the data transfers then dominate the application.

                    initial   initial    initial   initial+DTSE   Total gain
                              + ADOPT    + DTSE    + ADOPT        factor
  Pentium-II          3.37      1.59       8.36        1.00          3.37
  MIPS                3.33      1.59       5.15        0.88          3.78
  TriMedia            8.12      4.35      44.09        2.73          2.97
  HP-RISC             1.99      1.79       7.68        0.67          2.97
  HP-RISC no FPU      4.24      2.27      41.10        1.70          2.49

Table 2: Number of Mcycles obtained when (not) removing the DTSE bottleneck up front.



Table 3 also shows the number of executed cycles (in Mcycles), but this time for each of the different steps described in the previous section. These processors all have a very different internal architecture. Still, each step brings a significant improvement in the number of cycles on each platform. This improvement is due to the reduction of executed operations at the code level, and it is to a large extent independent of the processor style. TriMedia has no floating-point unit available. Therefore, after DTSE the executed number of cycles is much larger than for the other architectures, due to the overhead introduced by the modulo operations. The same happens for the HP-RISC processor when disabling the floating-point unit. This absence of the floating-point unit is also the reason why the gain of the NOSR step is bigger on these architectures. However, after the ADOPT transformations, this difference becomes much less apparent. This clearly shows the processor-independent nature of our transformation script.

The results of Table 2 show that the overhead on addressing introduced by DTSE (a factor between 3 and 10) can be fully removed at compile time thanks to our advanced analysis and transformation approach. This overhead is distributed all over the code, hence current compilers, which only focus on local scope optimisations, cannot remove it. However, when eliminating this overhead using the ADOPT script as DTSE back-end, significant gain factors become visible.

                    DTSE        ACM      ACH     NOSR    Proc.spec.   Total gain
                    optimised                            trafos       factor
  Pentium-II          8.36      4.17     2.61    2.05      1.00          8.36
  MIPS                5.15      2.47     1.81    1.60      0.88          5.85
  TriMedia           44.09     17.73     8.90    5.02      2.73         16.15
  HP-RISC             7.68      2.74     1.60    1.36      0.67         11.46
  HP-RISC no FPU     41.10     14.55     4.56    3.04      1.70         24.18

Table 3: Number of Mcycles obtained across ADOPT's transformation steps.

By using the combination of DTSE and ADOPT, the final number of cycles is now minimal (see Table 2). This reduction of the overall number of cycles by a factor of 3 (on all platforms) is also expected to be matched by a reduction in energy, due to fewer instructions being decoded/executed [23]. ADOPT indeed removes the addressing overhead introduced by DTSE and thereby reveals the cycle improvement enabled by the DTSE stage. The memory subsystem is now well controlled and the possible side effects are avoided and/or removed.

CONCLUSION

Advanced data transfer and storage exploration for programmable processor targets allows a minimisation of the main memory bus load and a reduction in power, but introduces a significant arithmetic overhead in the addressing that reduces the CPU performance. Traditional compilers can only focus on local scope optimisations and are therefore not able to reduce this arithmetic overhead, which is present at the global scope. This paper presents the ADOPT source-to-source transformation script, extended with techniques for removing the overhead of piece-wise linear operations like integer modulo and division without affecting the benefits of DTSE. This has been illustrated on a real-life medical application kernel, showing a significant reduction in the number of executed modulo operations (up to a factor of 13) and in CPU cycles (up to a factor of 24). By first removing the data transfer bottleneck using a DTSE approach and then removing the addressing overhead, about a factor of 3 in overall CPU performance gain has been obtained with respect to the initial version, irrespective of the target platform.

References

[1] F. Balasa, F.H.M. Franssen, F. Catthoor, H. De Man, Transformation of nested loops with modulo indexing to affine recurrences. Parallel Processing Letters, Dec. 1994.
[2] K. Karagianni, T. Stouraitis, A novel hardware algorithm for residue evaluation. IEEE Workshop on Signal Processing Systems, 1999.
[3] F.H.M. Franssen, M.F.X.B. Van Swaaij, F. Catthoor, H. De Man, Modeling piece-wise linear and data dependent signal indexing for multi-dimensional signal processing. Proc. HLSW, Laguna Beach, CA, Nov. 1992.
[4] C. Gebotys, DSP address optimisation using a minimum cost circulation technique. Proc. Intl. Conf. on Computer-Aided Design, pp. 100-104, 1997.
[5] A. Sudarsanam, S. Liao, S. Devadas, Analysis and evaluation of address arithmetic capabilities in custom DSP architectures. Design Automation Conference, 1997.
[6] R. Leupers, P. Marwedel, Algorithms for address assignment in DSP code generation. Intl. Conf. on Computer-Aided Design, 1996.
[7] R. Leupers, F. David, A uniform optimisation technique for offset assignment problems. Proc. Intl. Symp. on System Synthesis, Taiwan, 1998.
[8] A. Basu, R. Leupers, P. Marwedel, Register-constrained address computation in DSP programs. Proc. Design Automation and Test in Europe Conf., pp. 929-930, March 1998.
[9] A. Sudarsanam, S. Malik, S. Tjiang, S. Liao, Paged absolute addressing mode optimisations for embedded digital signal processors using post-pass data-flow analysis. Design Automation for Embedded Systems, no. 4, pp. 41-59, 1999.
[10] Cheng et al., DSP addressing optimisation for loop execution on the auto-increment/decrement architecture. Proc. Intl. Symp. on System Synthesis, 1998.
[11] B. Wess, Minimisation of data address computation overhead in DSP programs. Design Automation for Embedded Systems, no. 4, pp. 167-185, 1999.
[12] C. Liem, P. Paulin, A. Jerraya, Address calculation for retargetable compilation and exploration of instruction-set architectures. Proc. 33rd Design Automation Conference, pp. 597-600, 1996.
[13] M. Miranda, F. Catthoor, M. Janssen, H. De Man, High-level address optimisation and synthesis techniques for data-transfer intensive applications. IEEE Trans. on VLSI Systems, vol. 6, no. 4, Dec. 1998.
[14] K. Danckaert, F. Catthoor, H. De Man, System-level memory management for weakly parallel image processing. EuroPar Conference, Aug. 1996.
[15] K. Danckaert, F. Catthoor, H. De Man, Platform independent data transfer and storage exploration illustrated on a parallel cavity detection algorithm. PDPTA (CSREA Conf. on Parallel and Distributed Processing Techniques and Applications), June 1999.
[16] A. Aho, R. Sethi, J. Ullman, Compilers: Principles, Techniques and Tools. Addison-Wesley, 1986.
[17] S. Gupta, M. Miranda, F. Catthoor, R. Gupta, Analysis of high-level address code transformations for programmable processors. Proc. IEEE Design and Test in Europe Conf., Paris, March 2000.
[18] C. Kulkarni, F. Catthoor, H. De Man, Advanced data layout optimization for multimedia applications. Workshop on Parallel and Distributed Computing in Image, Video and Multimedia Processing, May 2000.
[19] E. De Greef, F. Catthoor, H. De Man, Program transformation strategies for memory size and power reduction of pseudo-regular multimedia subsystems. IEEE Transactions on Circuits and Systems for Video Technology, vol. 8, no. 6, pp. 719-733, October 1998.
[20] J.M. Janssen, F. Catthoor, H. De Man, A specification invariant technique for operation cost minimisation in flow-graphs. Proc. 7th ACM/IEEE Intl. Symp. on High-level Synthesis, pp. 146-157, 1994.
[21] F. Catthoor, S. Wuytack, E. De Greef, F. Balasa, L. Nachtergaele, A. Vandecappelle, Custom Memory Management Methodology - Exploration of Memory Organisation for Embedded Multimedia System Design. Kluwer Academic Publishers, Boston, 1998.
[22] F. Catthoor, K. Danckaert, C. Kulkarni, T. Omnes, Data transfer and storage architecture issues and exploration in multimedia processors. In "Programmable Digital Signal Processors: Architecture, Programming, and Applications" (ed. Y.H. Hu), Marcel Dekker, New York, 2000.
[23] V. Tiwari, D. Singh, S. Rajgopal, G. Mehta, R. Patel, F. Baez, Reducing power in high-performance microprocessors. Proc. 35th ACM/IEEE Design Automation Conf., San Francisco, CA, June 1998.
[24] R. Wilhelm, D. Maurer, Compiler Design. Addison-Wesley, 1995.
[25] S. Muchnick, Advanced Compiler Design and Implementation. Morgan Kaufmann, 1997.
