Matisse: A system-on-chip design methodology emphasizing dynamic memory management

DIEDERIK VERKEST, JULIO LEAO DA SILVA JR., CHANTAL YKMAN, KRIS CROES, MIGUEL MIRANDA, SVEN WUYTACK, FRANCKY CATTHOOR
IMEC, Kapeldreef 75, B-3001 Leuven, Belgium

GJALT DE JONG
Alcatel Telecom, F. Wellesplein 1, B-2018 Antwerp, Belgium

HUGO DE MAN
IMEC, Kapeldreef 75, B-3001 Leuven, Belgium; Professor at Katholieke Universiteit Leuven, Belgium
Abstract. Matisse is a design environment intended for developing systems characterized by a tight interaction between control and data-flow behavior, intensive data storage and transfer, and stringent real-time requirements. Matisse bridges the gap from a system specification, using a concurrent object-oriented language, to an optimized embedded single-chip hardware/software implementation. Matisse supports stepwise exploration and refinement of dynamic memory management, memory architecture exploration, and gradual incorporation of timing constraints before going to traditional tools for hardware synthesis, software compilation, and inter-processor communication synthesis. With this approach, specifications of embedded systems can be written in a high-level programming language using data abstraction. Application of Matisse to telecom protocol processing systems in the ATM area shows significant improvements in area usage and power consumption.

Keywords: embedded system design, low-power, network protocols, (dynamic) memory management

1. Introduction

The complexity of modern telecommunication systems is rapidly increasing. A wide variety of services has to be transported and elaborate network management is needed. Such complex systems require a combination of hardware and software components to implement the required functionality at the desired performance level. For applications in this domain, the desired behavior is often characterized by complex algorithms that oper-
ate on large, dynamically allocated data structures (e.g. linked lists, trees, ...), resulting in intensive data transfers and large data storage. Therefore, the data storage contributes a large part of the area in this domain, and the data transfers contribute a large part of the power. Ideally, the specification should reflect the "conceptual" partitioning of the problem, which typically corresponds to abstract data types (ADTs) along with services provided on the ADTs, and algorithms for the different processes. As these conceptual entities can be readily specified in an object-oriented programming
model using data abstraction and class inheritance features, Matisse uses the C++ programming language as the basis for the behavioral specification. The Matisse language extends standard C++ with features for expressing concurrent processes and synchronization.

Behavioral hardware synthesis has been an active area of research for more than a decade (see e.g. [18]), but commercial behavioral synthesis tools offer only very limited support for complex data structures: usually only statically declared arrays and records are supported. All these synthesis environments provide scheduling and resource allocation capabilities that permit the designer to abstract from timing and hardware partitioning details. However, the designer is still largely responsible for specifying the memory hierarchy and organization. Also, manipulation and modification of stored data structures must be specified in terms of explicit memory I/O operations.

In a traditional "software run-time environment", the underlying operating system is responsible for all the background memory and storage related tasks. In addition, the memory hierarchy is usually fixed. However, for embedded system solutions, relying on software run-time support may be expensive in terms of area, performance, and power. In addition, dedicated distributed memory architectures may be used. Hence, dynamic memory management behavior must be synthesized in the embedded system implementation itself.

In this paper, we discuss Matisse, a design environment that takes care of the background memory management problem for dynamic data-structure-intensive applications by bridging the gap between the conceptual design entry specification, based on C++, and traditional behavioral synthesis. The environment addresses all the aforementioned tasks to synthesize a custom distributed memory architecture. It permits the designer to explore different architectures so that an optimal choice can be made, which is crucial, as memory bandwidth is often the main performance bottleneck in this type of application. Another important benefit of the environment is that the specification level is raised above the level currently used for behavioral synthesis. The design entry point can be a high-level program using data abstraction, where the designer is not unnecessarily burdened by all the details of the implementation of data structures in a memory architecture.

In the next section, we situate Matisse in a global context. In Section 3, we present the spec-
ification model and design flow. In subsequent sections, we elaborate on those steps in the design flow directly related to dynamic memory management. The extensive exploration feasible with the Matisse environment will be illustrated on an industrial ATM application.
2. Related work

This section summarizes related work on methodologies, design flows, and tools. At the end, Matisse is situated with respect to this work.
Methodologies, design flows, and tools can be classified along two axes: the level of abstraction at which they operate and the class of applications they can handle. According to the level of abstraction, three levels are distinguished: system synthesis, which starts from a specification independent of the final processor architecture; hardware/software codesign, which concentrates on hardware/software partitioning, process-to-processor mapping, communication synthesis, and cosimulation; and processor architecture integration, which concentrates on synthesis/compilation into one custom (HW) or instruction-set (SW) processor at a time, and particularly on operation scheduling, hardware building block allocation, and technology mapping. According to the class of applications, four types are distinguished: control-dominated applications, mostly specified using logical and control operations; data-flow applications, in which the major part is data-flow arithmetic; static data-dominated applications, mostly characterized by processing of static arrays or array streams; and dynamic data-dominated applications, mostly characterized by processing of irregular data types that are (de)allocated at run-time.
For control-dominated applications, several codesign approaches exist (e.g. Chinook [9], Polis [4], SpecSyn [12], Statemate [13], Cosmos [5]). These approaches support control specification and synthesis without any support for memory management. For data-flow applications, several system synthesis approaches are available, such as BONES/SPW from Alta/Cadence, COSSAP from Synopsys, DSP Station from Frontier, Ocapi [20], PowerPlay [17], and Ptolemy [7]. Also, a few codesign approaches exist for these applications (e.g. Grape-II [15], Ptolemy [7]). These approaches support synthesis of data-flow arithmetic without any support for memory management.
For static data-dominated applications, the system synthesis approach Acropolis [11] focuses on data transfer and storage exploration issues. At the codesign level, the Atomium methodology [8] provides data and control flow optimization and supports data reuse in a hierarchical memory context. These approaches provide memory management for regular and static data types, but no support for irregular dynamic data types is available. For dynamic data-dominated applications, such as protocol processing applications, three system synthesis approaches can be considered. The approach of Hemani et al. [22] starts from SDL and supports dynamic process synthesis and inter-process communication refinement. FORM [14] provides a simulation environment for dynamic data-dominated applications. Matisse not only provides a simulation environment for dynamic data-dominated applications at the system level, but also bridges the gap between the system level and processor architecture synthesis. The overall Matisse system is presented in Section 3.
3. Matisse system

The Matisse design flow starts from a concurrent object-oriented system specification using the Matisse model [10], and targets an optimized embedded single-chip hardware/software implementation. We first introduce the Matisse model, and then discuss the design flow.
3.1. Matisse model

Protocol processing applications are conceptually seen as sets of concurrent processes that access data (defined as sets of records). Although the target implementation of protocol processing applications is often a mixture of hardware and software components, they are best conceived at the top level from a software perspective. Concurrent object-oriented models play a central role in large-scale hardware/software system design, since they allow system specification and fast system-level simulation. The Matisse language, presented in detail in [10], is a concurrent object-oriented specification language extended from the widely used programming language C++. Minimal syntactic extensions to C++ are introduced to allow the specification of concurrent processes, inter-process communication, and synchronization. We summarize the main features of the underlying model below.

Processes and concurrency - It is possible to specify processes, called active objects, and data, called passive objects. Processes have their own local virtual memory space and default thread of control. They are only created at compile-time. Concurrency is specified at the process level, by the concurrent execution of the default threads of control of all created processes. Data may be created and destroyed in the local virtual memory space of the processes, either at compile-time or at run-time.

Communication - Within one process, communication is specified using C++ pointers. Between processes, communication is specified using global pointers. Except for their potentially higher cost of use, global pointers are used just like C++ pointers.

Synchronization - Due to concurrent computations, simultaneous accesses to data must be synchronized by using atomic functions. Whenever several threads call an atomic function, the function is executed the required number of times in a sequential order. The execution of an atomic function never interleaves with the execution of another atomic function within one process. A rough C++ analogy of this behavior is sketched below.
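The concrete Matisse syntax for these extensions is defined in [10] and is not reproduced here. The sketch below is only an analogy in standard C++: the class, its names, and the use of std::mutex and std::thread are our illustration, not Matisse constructs. It mimics the key property of atomic functions, namely that concurrent calls on the same passive object are serialized and never interleave.

    #include <mutex>
    #include <thread>

    // Standard-C++ analogy (not Matisse syntax) for a passive object whose
    // services behave like atomic functions: calls from concurrent threads
    // are executed sequentially and never interleave within one object.
    class Counter {
        std::mutex m_;   // serializes all "atomic" services of this object
        int value_ = 0;
    public:
        void increment() {                         // atomic-function analogue
            std::lock_guard<std::mutex> lock(m_);
            ++value_;
        }
        int read() {                               // atomic-function analogue
            std::lock_guard<std::mutex> lock(m_);
            return value_;
        }
    };

    int main() {
        Counter c;
        // Two "processes" (here: threads) call the same atomic function;
        // the increments execute in some sequential order, never interleaved.
        std::thread t1([&c] { for (int i = 0; i < 1000; ++i) c.increment(); });
        std::thread t2([&c] { for (int i = 0; i < 1000; ++i) c.increment(); });
        t1.join();
        t2.join();
        return c.read() == 2000 ? 0 : 1;           // always yields 2000
    }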
3.2. Design flow
The Matisse design flow is depicted in Figure 1. The input to the design flow is a system, together with its environment, specified at the algorithmic level using the Matisse language. Abstract Machine (AM) generation creates an executable specification, suitable for simulation, exploration, and refinement of the system specification. The AM consists of a set of communicating concurrent processes, an ultra-light operating system to manage the execution of these processes, and a user interface that allows the designer to refine the specification. The AM allows profiling of record accesses, inter-process communication, and virtual memory accesses. These profiling data are used to select an optimized implementation for the records, to perform process concurrency management, and to perform physical memory management, respectively.

Dynamic memory management - Protocol processing applications are often characterized by algorithms that operate on large data structures, which are dynamically allocated. The Matisse language allows the
designer to define these data structures using Abstract Data Types (ADTs), without low-level specification details. When implementing these applications on a chip, efficient organization and implementation of the ADTs is crucial [2, 16, 23], and dynamic memory allocation must be handled efficiently both in terms of time and number of memory accesses. Therefore, refinement of the specification of the ADTs (ADT refinement) and of the memory management (Virtual Memory Management) is required before proceeding with synthesis.

Process concurrency management - The goal of process concurrency management is to meet the overall real-time requirements imposed on the application. This step involves process concurrency extraction, thread scheduling, processor allocation, process-to-processor assignment, and inter-process communication refinement.

Physical memory management - Typically, protocol processing applications require large storage capacities and very high I/O bandwidth to achieve the real-time requirements. This step aims to synthesize area- and power-efficient distributed memory architectures and memory management units meeting the real-time requirements.

Finally, software compilation proceeds using traditional software compilers, hardware synthesis proceeds using high-level synthesis tools, and interface synthesis generates software device drivers for each software processor and VHDL specifications of the necessary hardware blocks allowing communication between hardware and software processors. The interface synthesis is performed using the CoWare hardware/software co-design environment [6, 1].

In the next three sections, we elaborate on the three steps that are relevant for the dynamic memory man-
Fig. 1. The Matisse design flow: from a Matisse specification, through abstract machine generation, dynamic memory management, process concurrency management, and physical memory management, to HW synthesis, HW/SW interface synthesis, and SW synthesis.
agement: ADT refinement (Section 4), virtual memory management (Section 5), and physical memory management (Section 6).

4. ADT refinement

In an implementation-independent specification, complex data structures are typically specified by means of ADTs that represent a certain functionality without imposing implementation decisions. A dictionary type, i.e. a set of records indexed by means of keys, is a typical example of an ADT occurring in transport layer network interface applications. The ADT provides a number of services (e.g., inserting, locating, or removing a record from a set) which can be used to specify the functionality of an application without knowing their implementation. A set of records accessible through one or more keys can be represented by many different data structures. All of these have different characteristics in terms of memory occupation, number of memory accesses to locate a certain record, power dissipation, etc. To allow the designer to make a motivated choice, all possible data structures have to be represented in a model such that the best solutions for a given application can be searched for.

4.1. A hierarchical ADT model

In our model there are four primitive data structures (linked lists, trees, arrays, and pointer arrays) that can be combined to create more complex data structures. A complex ADT is represented as a tree composed of primitive data structures. Each key corresponds to a layer in the tree. The bottom layer is the record layer, which has no key associated with it. The top layer (i.e., the root of the tree) represents the entire set of records. Each layer below represents a partitioning of the whole set into a number of subsets. Specifying a value for the key corresponding to a layer selects the subset of records for which the key has the specified value. This process can be applied hierarchically from the top layer down until the records are selected. Each node in the tree (except for the bottom layer) has to associate values of the corresponding key with a node on the next layer. This functionality can be implemented with a single primitive data structure.
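For concreteness, the hedged sketch below fixes one point in this search space: a two-layer structure where the first key indexes a pointer array and the second key is resolved by walking a linked list. All names and sizes are our own illustration; Matisse derives such concrete structures during ADT refinement.

    #include <array>
    #include <memory>

    // Illustrative two-layer realization of a set-of-records ADT (our sketch):
    // layer 1 is a pointer array indexed by key1, layer 2 is a linked list
    // searched by key2, and the bottom layer holds the records themselves.
    struct Record {
        unsigned key1, key2;
        int payload;
        std::unique_ptr<Record> next;   // link field for the layer-2 linked list
    };

    class TwoLayerDictionary {
        // Layer 1: each key1 value selects the subset of records with that key.
        std::array<std::unique_ptr<Record>, 256> buckets_{};
    public:
        void insert(std::unique_ptr<Record> r) {
            auto& head = buckets_[r->key1 % 256];  // key1 indexes the pointer array
            r->next = std::move(head);             // push onto the layer-2 list
            head = std::move(r);
        }
        Record* locate(unsigned key1, unsigned key2) {
            // Layer 2: walk the selected subset until key2 matches.
            for (Record* p = buckets_[key1 % 256].get(); p; p = p->next.get())
                if (p->key2 == key2) return p;
            return nullptr;
        }
    };

    int main() {
        TwoLayerDictionary d;
        d.insert(std::make_unique<Record>(Record{17, 42, 7, nullptr}));
        return d.locate(17, 42) ? 0 : 1;
    }

Swapping the linked list for a tree or an array, reordering the keys, or hashing key1 before indexing are exactly the orthogonal choices explored in this section.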
Up to this point, we have assumed that every key corresponds to one layer in the hierarchy. This is not necessary, however. Keys can also be split into sub-keys, or several keys can be combined into one super-key. This may heavily impact the implementation cost. Also, the order in which the keys are used to access the data structures heavily impacts the required memory size, the average number of memory accesses to locate a record, and the power cost. Therefore, it is important to find the optimal key ordering for the given application as well as the optimal number of layers. When the keys are not uniformly distributed, hashing can be used to improve the results (hashing applies a permutation function to a key or combination of keys). Note that hashing can be combined with any of the primitive data structures, thereby providing an orthogonal axis of freedom in the search space. Hashing is especially useful in combination with key splitting, because it reduces the (average) size of the primitive data structures associated with the sub-keys after splitting.

Many possible data structures within the model can realize a given set of records. Each one can be seen as a combination of different major options which are relatively orthogonal (Figure 2). Within each option, more detailed choices can still be made. Finding the best combination for a given application is not trivial, since it depends on the parameters in the model. Moreover, the full search space is too large to scan exhaustively. To determine the optimal data structure we have to define the number of layers in the hierarchy, the key ordering, the hashing function for each key, and the primitive data structure for each layer in the hierarchy. Experiments showed that some decisions are much more important than others, and the heuristic decision ordering indicated in Figure 2 leads to near-optimal solutions without exhaustively exploring all combinations. For a detailed description of the full optimization methodology we refer to [26].

4.2. Experiments

The set-of-records ADT in the ATM application was optimized for power using two realistic parameter sets. The first one assumed storage of the records in a memory built from 1 Mbit SRAMs, the second a memory built from 4 Mbit SRAMs. The optimal solution for the ADT data structures in both cases differs. Both are two-layer structures with two keys. The first key indexes a pointer array, whereas the primitive data structure (DS) on the second layer is a pointer array and an array of
Fig. 2. ADT refinement search space, with the heuristic decision ordering: (1) hashing (none, or a hashing function), (2) key splitting, (3) key ordering, (4) primitive data structure per layer (array, pointer array, binary tree, or linked list).
records, for the first and second solution respectively. Applying the optimal DS for one set of parameters in the context for which the other DS was optimized results in a power consumption that is more than 2.5 times above that of the optimal DS. Moreover, the entire search space spans a power range of four orders of magnitude, clearly substantiating the importance of a thorough exploration before deciding on a solution.

5. Virtual memory management

The VMM step reserves storage space for each concrete data type obtained during the ADT refinement step, by defining a virtual memory segment for each concrete data type. Subsequently, it determines a custom virtual memory manager (VMM) for each data type that is dynamically allocated in the application. A VMM takes care of allocating and recycling blocks from the virtual memory segments. Allocation is the mechanism that searches the pool of free blocks and returns a free block large enough to satisfy a given request of the application. Recycling is the mechanism that returns a block which is no longer used to the pool of free blocks for later reuse. Much literature is available about possible implementation choices for allocation mechanisms [3, 24], but none of the earlier work provides a complete search space useful for a systematic exploration.

5.1. VMM search space

As with the ADT refinement problem, systematic exploration is only feasible in practice by identifying the orthogonal decision trees in the available search space. Below we present the decision trees for allocation and recycling mechanisms; a sketch of one resulting allocator follows the list.

Keeping track of free blocks - The allocator keeps track of free blocks using either link fields within free blocks or lookup tables (Figure 3.a). Using link fields within free blocks does not introduce overhead in terms of memory usage as long as a minimum block size is
respected, while lookup tables always imply an overhead in terms of memory usage. The allocators are further differentiated based on the indexing mechanism (by size, by address, ...).

Choosing a free block - Different mechanisms exist for choosing a free block from the pool (Figure 3.b). The pool may be partitioned in sectors per size or type. The chosen block may be an exact match or an approximate match for the requested size. The allocator will try to satisfy an allocation request by returning either the first free block that is large enough (first fit) or the free block that is closest in size to the requested one (best fit).

Freeing used blocks - A block that is freed by the application has to be returned to the pool of free blocks (Figure 3.c). Obvious mechanisms which provide good performance are LIFO or FIFO schemes. A scheme that respects an index order (e.g. size) may avoid wasting memory when combined with splitting or merging techniques (see below), at the cost of a performance penalty.

Splitting the block being allocated - When the free block chosen to satisfy a request is larger than the required one, a policy for splitting the block can be implemented (Figure 3.d). The remainder of the split block is returned to the pool of free blocks. The splitting mechanisms are differentiated based on which part of the free block is used and on whether or not splitting respects an index order (e.g. size).

Merging free blocks - When adjacent blocks are free, the allocator may decide to merge the blocks in order to have more opportunities to accommodate a subsequent larger allocation request (Figure 3.e). In general it is
interesting to defer the merging in order to avoid subsequent splitting operations. Deferred merging may be implemented in different ways: wait for a fixed or variable number of allocation requests before merging, or wait for an unsatisfied allocation request before merging. The number of blocks to be merged can vary between merging all blocks and merging only enough blocks to satisfy the last request.
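To make these decision trees concrete, the hedged sketch below fixes one point in the search space: link fields within free blocks, first fit over the entire pool, splitting of oversized blocks, LIFO recycling, and no merging. All names are our own; an actual Matisse VMM is generated per data type from the decisions taken above.

    #include <cstddef>
    #include <cstdint>

    // Illustrative custom VMM for one virtual memory segment (our sketch):
    // free blocks are tracked via link fields stored inside the blocks
    // themselves, allocation is first fit with splitting, recycling is LIFO.
    class SegmentAllocator {
        struct FreeBlock { std::size_t size; FreeBlock* next; };
        alignas(FreeBlock) std::uint8_t segment_[4096];  // the segment itself
        FreeBlock* free_list_;

        static std::size_t round(std::size_t n) {
            if (n < sizeof(FreeBlock)) n = sizeof(FreeBlock); // min. block size
            return (n + alignof(FreeBlock) - 1) & ~(alignof(FreeBlock) - 1);
        }
    public:
        SegmentAllocator() {
            free_list_ = reinterpret_cast<FreeBlock*>(segment_);
            free_list_->size = sizeof(segment_);
            free_list_->next = nullptr;
        }
        void* allocate(std::size_t n) {
            n = round(n);
            for (FreeBlock** p = &free_list_; *p; p = &(*p)->next) {
                FreeBlock* b = *p;
                if (b->size < n) continue;                 // first fit
                if (b->size >= n + sizeof(FreeBlock)) {    // split the remainder
                    auto* rest = reinterpret_cast<FreeBlock*>(
                        reinterpret_cast<std::uint8_t*>(b) + n);
                    rest->size = b->size - n;
                    rest->next = b->next;
                    *p = rest;
                } else {
                    *p = b->next;                          // use the whole block
                }
                return b;
            }
            return nullptr;                                // segment exhausted
        }
        void recycle(void* q, std::size_t n) {             // LIFO, no merging
            auto* b = static_cast<FreeBlock*>(q);
            b->size = round(n);
            b->next = free_list_;
            free_list_ = b;
        }
    };

Each decision above corresponds to a few lines here: replacing the first-fit loop by a best-fit scan, or inserting into recycle in size order, moves the allocator to a different point in the search space.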
Fig. 3. Search space for VMM mechanisms: (a) free-block tracking (none, link fields within free blocks, or lookup tables indexed by address or size), (b) choosing a free block (entire pool or sectors per type/size; exact or approximate match; sequential first fit or best fit), (c) freeing used blocks (LIFO, FIFO, or indexed order), (d) block splitting (never or always; which part of the free block is used first; index order respected completely or not), (e) block merging (never, immediate, or deferred by a fixed/variable amount of requests or until an unsatisfied request; merging all blocks or only enough to satisfy the last request).

5.2. Experiments

The three data types in the ATM application that contribute most to the background memory are the Internal Packet Identifier (IPI), the Routing Record (RR), and the ATM cell. The virtual memory segments for these data types range in size from 3K to 12K words. For each virtual memory segment, a VMM mechanism has to be selected. Different choices result in power figures differing up to a factor of 5 for the IPI, 11 for the RR, and 25 for the ATM cell. In this application, the VMM with the minimal power figure is the same one for each data type. However, power is not the only parameter in the trade-off. When the amount of storage in use for two data types reaches a maximum at different moments during the lifetime of the application, it is possible to combine their virtual memory segments, at least if the VMM mechanism allows for this possibility. A second VMM mechanism, with an only slightly higher power figure for the IPI and RR data types, offers this possibility. It might therefore be possible to save area by combining the virtual memory segments for the IPI and RR data types, without affecting the power consumption. Unfortunately, in this application both the IPI and RR data types reach their maximal use in an overlapping period of time.

6. Physical memory management
Usually, for data-intensive algorithms the available cycle budget is insufficient to perform all memory accesses sequentially, so a number of accesses have to be done in parallel. Distributed memory architectures make it possible to exploit this parallelism, thus alleviating memory access bottlenecks. However, as the required memory bandwidth increases, the cycle budget available for each individual access becomes smaller, since the number of addresses that has to be generated in parallel per processed data item becomes higher, thus leading to an addressing overhead.
6.1. PMM methodology

The signals accessed in parallel have to be assigned to different memories, or they have to be accessed through different ports of a multi-port memory. Many different orderings of the memory accesses are possible for the given cycle budget. Manually exploring all ordering possibilities and memory configurations for area and power efficiency is a very tedious task. Therefore, an automated methodology [21, 25] has been developed.

Basic groups - The virtual memory segments are split into smaller groups of data which are called basic groups. Every data item belongs to exactly one basic group, so that basic groups can be assigned to physical memories independently from each other. Basic groups are kept as small as possible, to increase the freedom of assigning basic groups to physical memories and to increase the parallel accessibility of the data in a virtual memory segment.

Access ordering - The access ordering step optimizes the memory cost for the required storage bandwidth, by determining which basic groups should be made simultaneously accessible in the memory architecture such that the imposed timing constraints can be met. For this purpose, the data accesses are ordered within a given cycle budget. Whenever two accesses to two basic groups occur in the same cycle, there is an access conflict, because the basic groups cannot share the same memory port. These conflicts have to be resolved during the subsequent memory allocation and assignment step by assigning conflicting basic groups either to different memories or to a multi-port memory, such that they are simultaneously accessible.

Memory allocation and assignment - Memory allocation and assignment determines the number and type of the memories, the number and type of their ports, and an assignment of basic groups to the allocated memories in a power- and/or area-optimized memory architecture. The conflict relations between the basic groups are used to restrict the search space to memory architectures that provide enough memory bandwidth to meet the timing constraints. A sketch of a simple assignment step is given below.
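As a minimal sketch of the assignment step (our illustration, assuming single-port memories only and ignoring memory types, port counts, and cost models), conflicting basic groups can be separated with a greedy graph coloring:

    #include <vector>

    // conflicts[i][j] is true when basic groups i and j are accessed in the
    // same cycle and therefore cannot share a single-port memory.
    using ConflictGraph = std::vector<std::vector<bool>>;

    // Greedy coloring: each basic group goes to the first memory that holds
    // no conflicting group; a new memory is opened only when necessary.
    std::vector<int> assign_to_memories(const ConflictGraph& conflicts) {
        const int n = static_cast<int>(conflicts.size());
        std::vector<int> memory_of(n, -1);
        int memories_used = 0;
        for (int g = 0; g < n; ++g) {
            for (int m = 0; m <= memories_used; ++m) {
                bool clash = false;
                for (int h = 0; h < g; ++h)
                    if (memory_of[h] == m && conflicts[g][h]) { clash = true; break; }
                if (!clash) {                        // first conflict-free memory
                    memory_of[g] = m;
                    if (m == memories_used) ++memories_used;
                    break;
                }
            }
        }
        return memory_of;                            // memory index per basic group
    }

    int main() {
        // 3 basic groups; groups 0 and 1 conflict (accessed in the same cycle).
        ConflictGraph c = {{false, true,  false},
                           {true,  false, false},
                           {false, false, false}};
        auto m = assign_to_memories(c);              // yields {0, 1, 0}
        return (m[0] != m[1]) ? 0 : 1;
    }

Multi-port memories would relax the coloring constraint, since a memory with several ports can serve conflicting groups simultaneously.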
Address optimization - Address manipulation forms a crucial component of any architecture which deals with data-transfer-intensive algorithms. Efficient access to the memories within real-time constraints requires an optimized mapping of the address expressions in the algorithm onto address arithmetic optimized for
both area and power. A methodology [19] has been developed to reduce the cost overhead of address generation for both custom and instruction-set processors. This methodology includes address expression splitting/clustering, induction variable analysis, target architecture selection, and global-scope algebraic optimizations. In addition, high-level controller synthesis and optimal partitioning of the arithmetic unit are incorporated for the synthesis of custom memory management units. A sketch of the induction-variable idea follows.
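The following before/after fragment (our example; the array dimensions and constants are invented) illustrates the induction-variable part of such an optimization: the multiplication in the address expression is replaced by a running counter, removing the multiplier from the address path.

    #include <cstdint>

    constexpr int ROWS = 64, COLS = 53;
    std::uint32_t mem[ROWS * COLS];

    // Naive addressing: one multiplication per access.
    std::uint32_t sum_naive() {
        std::uint32_t s = 0;
        for (int i = 0; i < ROWS; ++i)
            for (int j = 0; j < COLS; ++j)
                s += mem[i * COLS + j];   // address i*53 + j needs a multiplier
        return s;
    }

    // After induction-variable analysis: the address becomes a running
    // counter, so only an incrementer remains in the address arithmetic.
    std::uint32_t sum_optimized() {
        std::uint32_t s = 0;
        int addr = 0;                     // induction variable for i*COLS + j
        for (int i = 0; i < ROWS; ++i)
            for (int j = 0; j < COLS; ++j)
                s += mem[addr++];         // increment only: cheap addressing
        return s;
    }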
6.2. Experiments

Several experiments have been performed on the ATM application with varying cycle budgets for the memory accesses. The virtual memory segments of the ATM application can be split into 14 basic groups. This reduces the critical path from 15 cycles to 9 cycles. The access ordering revealed 13 conflicts between the basic groups. Several memory architectures were generated that satisfy the cycle budget constraints derived from the previous steps. The best solution is a trade-off between area and power. The best solution for power is a configuration with 6 memories. A configuration with 3 memories consumes a factor of 1.98 more power, and a configuration with a single memory 6.85 times more. To show the impact of the high-level address optimization, the resulting solutions were compared to those obtained by traditional synthesis tools. The RT-VHDL description that generates a hardwired solution (every address expression mapped on a separate unit) results in an area of 1.7 mm² after synthesis with Synopsys DC. A behavioral VHDL description synthesized with high-level synthesis (Synopsys BC) results in an area of 1.48 mm², subject to the constraint of generating one address expression in every clock cycle. When using the high-level address optimization of Matisse before high-level synthesis, an area of 0.42 mm² is obtained.

7. Conclusions

In this paper, we have addressed the support for system design exploration for applications that require manipulation of a large amount of dynamically allocated data, as found in e.g. protocol processing applications used in telecom networks. Using the Matisse language, the designer is able to write a system specification which abstracts low-level details and is easily
retargetable to different embedded hardware/software realizations. The Matisse design flow assists the designer in exploring the design space at the system level for different ADT implementations and memory managers, and in exploring different memory architectures for mixed hardware/software realizations. We demonstrated the results of the system design exploration using an industrial ATM application, showing that despite the higher level of abstraction of our input with respect to e.g. high-level synthesis (HLS), we achieve more efficient implementations.
Acknowledgments

This work is partly funded by the Flemish IWT in the HASTEC project and by the European Commission in the MEDIA project. Julio Leao da Silva Junior is supported by a Brazilian Government Fellowship (CAPES). We would further like to thank Bill Lin (University of California, San Diego) and Mark Genoe (Alcatel Telecom) for many insightful discussions.

Notes

1. Related work at this level will not be mentioned in this section because it is less relevant for our focus.
2. We do not consider implicit recycling mechanisms, known as garbage collectors, in our search space.
References

1. CoWare. http://www.coware.com/.
2. A. Alles. ATM in private networking, a tutorial. In INTEROP'93, 1993.
3. G. Attardi and T. Flagella. A customisable memory management framework. In Proceedings USENIX C++ Conference. Cambridge, MA, 1994.
4. F. Balarin, M. Chiodo, P. Giusto, H. Hsieh, A. Jurecska, L. Lavagno, C. Passerone, A. Sangiovanni-Vincentelli, E. Sentovich, K. Suzuki, and B. Tabbara. Hardware-Software Co-Design of Embedded Systems: The Polis Approach. Kluwer Academic Publishers, June 1997.
5. T. Ben Ismail, K. O'Brien, and A. Jerraya. Synthesis steps and design models for codesign. IEEE Computer, Special Issue on Rapid Prototyping of Micro-Electronic Systems, pages 44–52, February 1995.
6. I. Bolsens, H. De Man, B. Lin, K. Van Rompaey, S. Vercauteren, and D. Verkest. Hardware-software codesign of telecommunication systems. Proceedings of the IEEE, Special Issue on Hardware/Software Codesign, 85(3):391–418, March 1997.
7. J. Buck, S. Ha, E. A. Lee, and D. G. Messerschmitt. Ptolemy: A framework for simulating and prototyping heterogeneous systems. Technical report, University of California, Berkeley, August 1992.
8. F. Catthoor, S. Wuytack, E. De Greef, F. Franssen, L. Nachtergaele, and H. De Man. System-level transformations for low power data transfer and storage. In B. Brodersen and A. Chandrakasan, editors, Low Power Design. IEEE Press, 1998.
9. P. Chou, R. Ortega, and G. Borriello. The Chinook hardware/software co-synthesis system. In Proceedings 8th ACM/IEEE International Symposium on System Synthesis. Cannes, France, September 1995.
10. J. Leao da Silva Jr., Ch. Ykman-Couvreur, and G. de Jong. Matisse: A concurrent and object-oriented system specification language. In Int. Conf. on VLSI, 1997.
11. K. Danckaert, F. Catthoor, and H. De Man. System level memory optimization for hardware-software co-design. In Proc. IEEE Intl. Workshop on Hardware/Software Co-design, pages 55–59. Braunschweig, Germany, March 1997.
12. D. Gajski, F. Vahid, S. Narayan, and J. Gong. Specification and Design of Embedded Systems. Prentice Hall, Englewood Cliffs, NJ, 1994.
13. D. Harel. STATEMATE: A working environment for the development of complex reactive systems. IEEE Transactions on Software Engineering, 16(4), April 1990.
14. K. Higuchi and K. Shirakawa. Innovative system-level design environment based on FORM for transport processing systems. In Proc. Design Automation and Test in Europe, pages 883–890. Paris, France, March 1998.
15. R. Lauwereins, M. Engels, M. Adé, and J. A. Peperstraete. Grape-II: A system-level prototyping environment for DSP applications. IEEE Computer, pages 35–43, February 1995.
16. J.-Y. Le Boudec. The asynchronous transfer mode: a tutorial. Computer Networks and ISDN Systems, 24:279–309, 1992.
17. D. Lidsky and J. Rabaey. Early power exploration - a World Wide Web application. In Proc. Design Automation Conference, pages 27–32. Las Vegas, NV, June 1996.
18. P. Lippens, J. van Meerbergen, W. Verhaegh, and A. van der Werf. Allocation of multiport memories for hierarchical data streams. In Proceedings of the IEEE International Conference on Computer-Aided Design, ICCAD-93. Santa Clara, CA, November 1993.
19. M. Miranda, F. Catthoor, M. Janssen, and H. De Man. ADOPT: Efficient hardware address generation in distributed memory architectures. In IEEE/ACM Proceedings of the International Symposium on System Level Synthesis, 1996.
20. P. Schaumont, S. Vernalde, L. Rijnders, M. Engels, and I. Bolsens. A programming environment for the design of complex high-speed ASICs. In Proc. Design Automation Conference. San Francisco, CA, June 1998.
21. P. Slock, S. Wuytack, F. Catthoor, and G. de Jong. Fast and extensive system-level memory exploration for ATM applications. In Proceedings of the 10th International Symposium on System Synthesis. Antwerp, Belgium, 1997.
22. B. Svantesson, S. Kumar, and A. Hemani. A methodology and algorithms for efficient interprocess communication synthesis from system descriptions in SDL. In International Conference on VLSI Design. Chennai, India, January 1998.
23. Y. Therasse, G. Petit, and M. Delvaux. VLSI architecture of a SDMS/ATM router. Annales des Télécommunications, 48(3-4), 1993.
24. P. R. Wilson, M. Johnstone, M. Neely, and D. Boles. Dynamic storage allocation: A survey and critical review. In Proceedings International Workshop on Memory Management. Kinross, Scotland, UK, September 1995.
25. S. Wuytack, F. Catthoor, G. de Jong, B. Lin, and H. De Man. Flow graph balancing for minimizing the required memory
bandwidth. In Proceedings of the International Symposium on System Synthesis, pages 127–132, November 1996.
26. S. Wuytack, F. Catthoor, and H. De Man. Transforming set data types to power optimal data structures. IEEE Transactions on Computer-Aided Design, CAD-15(6):619–629, June 1996.