A Hardware-Software Co-Simulation Environment
by Seungjun Lee
B.S. (Seoul National University) 1986
M.S. (University of California, Berkeley) 1989
A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Engineering-Electrical Engineering and Computer Science in the GRADUATE DIVISION of the UNIVERSITY of CALIFORNIA at BERKELEY

Committee in charge:
Professor Jan M. Rabaey
Professor Robert W. Brodersen
Professor Stuart E. Dreyfus
1993
The dissertation of Seungjun Lee is approved:
University of California at Berkeley 1993
Abstract
A Hardware-Software Co-Simulation Environment

by Seungjun Lee

Doctor of Philosophy in Engineering-Electrical Engineering and Computer Science

University of California at Berkeley

Professor Jan M. Rabaey, Chair

This thesis presents a simulation environment developed for the fast prototyping and evaluation of complex systems using hardware-software co-simulation. A system is represented as a set of concurrent processes communicating with each other by message passing, which provides a natural abstraction for the modeling of large real-time systems such as a speech recognition system or a distributed computer network. Hardware implementations and software simulation models are handled identically, so that they can be freely intermixed for simulation at the system level. An existing hardware board can be used directly as a simulation model, or a fast programmable DSP board can be connected to the software simulation as a simulation accelerator. The software simulation can be incrementally transformed into a physical implementation during the design process. The co-simulation environment has been developed within the Ptolemy framework, which is capable of supporting multiple models of computation simultaneously for design specification. Consequently, each process model can be further refined into a data flow model or an event-driven model. A customized multi-tasking kernel has been implemented as the simulation engine for concurrent processes, using SUN's Light Weight Process (LWP) library. Lastly, the implementation of co-simulation on a target hardware platform is described.
Chairman of the Committee
Table of Contents

INTRODUCTION .......... 1
1.1 Motivation and Objective .......... 3
1.2 Issues in Co-Simulation .......... 5
1.2.1 Level of Partitioning in Design Representation .......... 5
1.2.2 Level of Abstraction for Communication .......... 6
1.2.3 Synchronization of Simulation with Real-time Execution .......... 7
1.3 Related Work on Co-Simulation .......... 8
1.3.1 Hardware Modeling .......... 8
1.3.2 Logic Emulation .......... 10
1.3.3 Hardware Accelerator .......... 12
1.3.4 Hardware/Software Co-Simulation .......... 13
1.4 Summary .......... 14
OVERVIEW OF CO-SIMULATION ENVIRONMENT .......... 16
2.1 Research Goals .......... 17
2.2 Software Platform - Ptolemy .......... 18
2.2.1 Overview of Ptolemy .......... 18
  Genealogy of Ptolemy .......... 18
  Object-Oriented Methodology .......... 19
  System Representation in Ptolemy .......... 20
  Wormholes in Ptolemy .......... 21
2.2.2 Domains in Ptolemy .......... 23
  SDF Domain .......... 23
  DDF Domain .......... 24
  DE Domain .......... 25
  Thor Domain .......... 26
  CG Domain .......... 28
  Other Domains .......... 29
2.3 CP Domain in Ptolemy .......... 29
2.3.1 Overview of the CP Domain .......... 30
2.3.2 Co-Simulation in the CP Domain .......... 31
2.3.3 System Simulation in the CP Domain .......... 34
2.4 Summary .......... 38
HIGH LEVEL SYSTEM SPECIFICATION .......... 41
3.1 Requirements for System Specification .......... 42
3.1.1 Specification Capture Capability .......... 42
3.1.2 Specification Refinement Capability .......... 44
3.1.3 Specification Simulation Capability .......... 45
3.2 Previous Work on System Specification .......... 46
3.2.1 Petri Nets .......... 46
3.2.2 Data Flow .......... 48
3.2.3 CSP .......... 49
3.2.4 Hybrid Approach .......... 51
3.3 System Representation in the CP Domain .......... 53
3.3.1 Network of Communicating Processes .......... 53
3.3.2 Inter-Process Communication .......... 56
3.3.3 Execution Model of a Process .......... 58
3.3.4 Simultaneous Events and Non-determinism .......... 60
3.4 Process Description .......... 60
3.4.1 Structure of a CP Star .......... 60
3.4.2 Library Constructs in the CP Domain .......... 62
  setProtocol() .......... 62
  msgSend() / msgReceive() .......... 62
  TMsgSend() / TMsgReceive() .......... 62
  waitFor() .......... 63
  waitAll() / TWaitAll() .......... 63
  waitAny() / TWaitAny() / transReady() .......... 63
  waitOne() / TWaitOne() .......... 67
3.4.3 Examples .......... 67
  Producer-Consumer Model .......... 67
  Central Node in Aloha Network .......... 68
3.5 Summary .......... 69
IMPLEMENTATION OF SIMULATION KERNEL .......... 71
4.1 Software Support for Multithread .......... 72
4.1.1 Nomenclature .......... 72
4.1.2 Libraries for Multiple Threads .......... 75
4.2 Thread Library .......... 79
4.3 Thread Scheduling .......... 81
4.4 Inter-Process Communication (IPC) .......... 84
4.4.1 Rendezvous Paradigm underneath msgSend()/msgReceive() .......... 84
4.4.2 Timer Object .......... 86
4.4.3 Implementation of Queue Star .......... 88
4.5 Other Implementation Issues .......... 90
4.5.1 Wait Statements .......... 91
4.5.2 Wormhole .......... 92
4.6 Examples .......... 94
4.6.1 M/M/1 Queue .......... 94
4.6.2 InfoPad Simulation .......... 98
4.7 Summary .......... 100
INTEGRATION OF HARDWARE BOARDS .......... 101
5.1 General Methodology of Co-Simulation .......... 102
5.1.1 Two Approaches .......... 103
  Send/Receive Paradigm .......... 103
  Wrapper Paradigm .......... 105
5.1.2 Our Approach .......... 107
5.2 Target Architecture .......... 107
5.3 Interface Implementation .......... 110
5.3.1 Interface between Simulation and Single Board Computer .......... 111
5.3.2 Interface between Simulation and Custom Board .......... 114
5.3.3 Interface between Remote Processes .......... 116
5.4 Examples .......... 117
5.5 Summary .......... 122
CONCLUSIONS AND FUTURE WORKS .......... 124
6.1 Conclusions and Contributions .......... 124
6.2 Future Directions .......... 126
BIBLIOGRAPHY .......... 128
APPENDIX A: CP DOMAIN USER'S MANUAL .......... 133
A.1 Introduction .......... 133
A.2 Execution Model of a Star .......... 134
A.3 Inter-Process Communication .......... 135
A.3.1 Communication Channels .......... 135
A.3.2 Communication Protocols .......... 135
A.4 The Scheduler .......... 135
A.4.1 Process Scheduling .......... 136
A.4.2 Simultaneous Events and Non-Determinism .......... 136
A.5 Writing a Star in CP Domain .......... 137
A.6 Additional Remark .......... 141
A.6.1 Auto Forking .......... 141
A.6.2 Support of ANYTYPE .......... 141
APPENDIX B: USER'S MANUAL FOR UTILITY ROUTINES .......... 143
B.1 getRandom() .......... 144
B.2 msgReceive() .......... 145
B.3 msgSend() .......... 146
B.4 setProtocol() .......... 148
B.5 currentTime()/waitFor() .......... 149
B.6 transReady() .......... 150
B.7 waitAll()/waitAny()/waitOne() .......... 151
APPENDIX C: EXAMPLE OF INTERFACE IMPLEMENTATION FOR CO-SIMULATION .......... 152
C.1 Implementation on the Simulation Side .......... 153
C.1.1 Wrapper .......... 153
C.1.2 SockPort .......... 156
C.2 Interface Library of the Single Board Computer .......... 161
C.3 Interface Library of the Robot Board .......... 167
APPENDIX D: EXAMPLE OF BUILDING A CO-SIMULATION .......... 176
D.1 Building a Wrapper .......... 176
D.2 Co-Simulation with the Single Board Computer .......... 178
D.3 Co-Simulation with the Robot Board .......... 179
List of Figures

Figure 1-1: Personal Communication System .......... 2
Figure 1-2: Co-Simulation Environment .......... 5
Figure 1-3: Structure of Hardware Modeler .......... 9
Figure 1-4: Symbolic Representation of Logic Emulation .......... 10
Figure 1-5: Comparison of Hardware Modeling with Logic Emulation .......... 12
Figure 2-1: System Representation in Ptolemy .......... 21
Figure 2-2: Wormhole Interface with Y Domain inside X Universe .......... 22
Figure 2-3: Firing Model in SDF Domain .......... 24
Figure 2-4: Examples of DDF Stars .......... 25
Figure 2-5: Heterogeneous System Representation in CP Domain .......... 32
Figure 2-6: Top Level Representation of the Personal Communication System .......... 35
Figure 2-7: Refinement of InfoPad in System Environment .......... 36
Figure 2-8: Simulation of Wireless Link in System Environment .......... 37
Figure 2-9: Refinement of Backbone Network .......... 38
Figure 2-10: Hardware Acceleration by Co-Simulation .......... 39
Figure 3-1: Simple Petri Net Description of a Producer and Consumer .......... 47
Figure 3-2: Specification Model in CP Domain .......... 54
Figure 3-3: System Representation in the CP Domain .......... 55
Figure 3-4: Communication through Parameter Updating .......... 57
Figure 3-5: Example of a CP Star .......... 61
Figure 3-6: Use of waitAll() in Two-Input Adder .......... 64
Figure 3-7: WaitAny() and TransReady() in Two-Input Merge .......... 65
Figure 3-8: Clock Generator with Asynchronous Reset .......... 66
Figure 3-9: Output from Clock Generator .......... 67
Figure 3-10: Producer-Consumer Model .......... 68
Figure 4-1: Execution Models of Co-Routines .......... 74
Figure 4-2: Definition of Thread Class .......... 80
Figure 4-3: State Transition of Threads in the CP Domain .......... 82
Figure 4-4: Pseudo-Code for the Thread Scheduler in the CP Domain .......... 83
Figure 4-5: Implementation of Inter-Process Communication Based on Rendezvous Paradigm .......... 85
Figure 4-6: The General Usage of Timer Object .......... 86
Figure 4-7: Use of a Timer in the Implementation of msgSend() .......... 87
Figure 4-8: Implementation of waitAll() .......... 92
Figure 4-9: Implementation of waitAny() .......... 93
Figure 4-10: Implementation of a CP Wormhole .......... 95
Figure 4-11: Simulation Model and Results of M/M/1 Queue .......... 96
Figure 4-12: Simulation of InfoPad and BaseStation Communicating through Noisy Channel .......... 99
Figure 5-1: Implementation Issue in Co-Simulation .......... 102
Figure 5-2: Inter-Processor Communication in Send/Receive Paradigm .......... 103
Figure 5-3: Inter-Processor Communication through a Wrapper .......... 105
Figure 5-4: Communication Overhead in Wrapper Paradigm .......... 106
Figure 5-5: Co-Simulation Methodology Based on Wrapper Paradigm .......... 108
Figure 5-6: Layered Architecture Template .......... 109
Figure 5-7: Interface Structure with Hardware Boards .......... 111
Figure 5-8: Code for a Server with One Input and One Output .......... 112
Figure 5-9: Code for the Wrapper Block of the Server .......... 113
Figure 5-10: Efficient Communication between Remote Processes .......... 116
Figure 5-11: Example of Hardware/Software Co-Simulation .......... 117
Figure 5-12: Hardware/Software Co-Simulation of InfoPad .......... 118
Figure 5-13: Simulation Model of Decompress Block .......... 119
Figure 5-14: Decompression Routine for DSP Board .......... 120
Figure 5-15: Speed of Each Communication Link .......... 121
List of Tables

Table 5-1: Utility Routines in the Interface Library .......... 114
ACKNOWLEDGMENTS

First I would like to thank the Lord for allowing me to complete this dissertation. A large number of people have contributed in many different ways to the successful completion of this research. My greatest appreciation goes to my advisor, Professor Jan M. Rabaey, for his encouragement, guidance, and support during the course of this research. I consider it a privilege to have worked with him. His scholarship and insight were a constant source of invaluable constructive criticism that has greatly enhanced the quality and presentation of this dissertation. I would also like to thank Professors Robert W. Brodersen and Stuart E. Dreyfus for serving on my dissertation committee. I thank Professor Brodersen for sharing his deep knowledge and experience in real-time system design. Next I would like to thank Professor Edward A. Lee for his sound advice during the qualifying exams and many other discussions afterwards.

The bjgroup has been an excellent environment for research in every respect - faculty, equipment, co-researchers, and technical support. My heartfelt thanks go to all the members of bjgroup, especially the old members of jansgroup: Wook Koh, Phu Hoang, Dev Chen, Miodrag Potkonjak, Alfred Yeung, and Chi-Min Chu. A special thanks to Mani Srivastava for his invaluable help in this research. Also, I would like to thank Soonhoi Ha, Joe Buck, and Tom Parks for their endless help with Ptolemy and C++ programming.

Outside of Cory, pastor Andy Lee and all the members of Berkland Baptist Church deserve many thanks. Especially I am grateful to the members of the A5 cell group for their love and care, and their persistent prayers for the completion of this work.

Finally, I want to acknowledge three very important people. My work demanded their patience as well as mine. My mom and dad, although thousands of miles away from Berkeley, have always provided the most valuable support and cheering. Without their trust and encouragement, I may not have completed this work. This achievement is as much theirs as it is mine. My biggest debt of gratitude goes to my wife, Hyewon, who gave me so much love and support over the years. She has never complained about Bam-Cham, the midnight snack she had to prepare for me every night. She had the strength and the patience to take good care of a stressed graduate student in his last year of graduate study, struggling to meet the deadline.
CHAPTER 1
“The fear of the LORD is the beginning of knowledge ... “ (Proverbs 1:7)
INTRODUCTION
A well-defined design methodology for VLSI circuits has been established in the last decade. As a result, a major part of the design process can be automated. Silicon compilation and design synthesis tools have been implemented by many researchers and industrial companies. Lager [Brodersen92][Shung91][Rabaey85], Cathedral-II [Rabaey88], and Olympus [Micheli90] are examples of such tools. With these automatic CAD tools, fairly complex chips can be fabricated within a few man-months, starting from a high-level specification such as SDL [Brodersen92], Silage [Hilfinger89], Hardware-C [Micheli90], or VHDL.

The success of CAD for application-specific ICs motivated research on system-level design automation. Quite a number of researchers are attempting to apply the design methodologies developed for VLSI to system-level design [Srivastava91b]. The design problems are essentially the same: specification, simulation, implementation, verification, and test. Unlike a circuit, however, a system is a heterogeneous connection of general-purpose programmable boards and application-specific custom boards, combined to perform a given task. Examples of such systems are robot controllers, speech and video recognition systems, or wireless personal communication
environments (Figure 1-1) [Brodersen91]. Most of these systems can naturally be decomposed into concurrent components which communicate with one another either synchronously or asynchronously. Consequently, issues related to handling the heterogeneity and concurrency inherent in a system play important roles in system design at all levels, from specification to implementation. This leads to the need for a design environment that can represent and manipulate the heterogeneous and concurrent behavior of a system.

[Figure 1-1: Personal Communication System - wireless multimedia terminals communicating through base stations, with a high-speed fiber backbone linking compute servers, a speech recognizer, a video database, and commercial databases (news, financial information, etc.)]
1.1 Motivation and Objective

A design process begins with providing a specification of the system at the algorithmic level. The initial specification is simulated to verify functional correctness, and to optimize algorithms and related design parameters. A clear and unambiguous specification is required not only for verification, but also for driving various kinds of analysis and synthesis tools. However, a concise and complete system description is rarely
available for most designs. Even if some sort of specification is available, it is normally presented in a mixed and unorganized fashion: block diagrams, schematics, or C programs. Block diagrams and schematics are convenient for describing the structural decomposition of a system, while procedural languages such as C are very efficient for algorithmic descriptions. Also, a system can naturally be viewed as a set of concurrent modules which interact with each other either synchronously or asynchronously. Therefore, an ideal system specification should be able to represent a system as a hierarchical collection of modules which execute concurrently and communicate with each other via an interconnect network. Leaf nodes of the hierarchy tree might be very diverse and can be specified using a variety of languages.

Once the analysis at the top level is completed, the design is refined into more concrete hardware and software components, and the specification needs to be simulated more and more thoroughly. As the granularity of the design representation gets finer, the verification task needs to process increasing amounts of data, which may require days of simulation. This is especially true for many real-time systems, which require repetitive runs over a range of parameters to validate or analyze the algorithm. For instance, validating the description of a real-time speech recognition system requires the processing of a set of complete sentences against an elaborate speech database. This requires the processing of about 200,000 data samples and amounts to hundreds of giga-operations [Stölzle92]. The required simulation speeds can hardly be met by conventional hardware such as general-purpose workstations. Therefore, it is essential to integrate the concept of hardware acceleration into the simulation environment to handle computation-intensive tasks. Another desirable feature for system-level simulation is hardware modeling, which means that the device itself is used as the simulation model [Widdoes88]. This feature can simplify the modeling task for existing modules. It also allows the initial specification to evolve
incrementally into a real implementation by replacing software simulation models with actual hardware in a transparent way. This process continues until every block in the specification turns into either a hardware board or a software module. These are combined together and tested with a set of stimuli. The test result is compared with the simulation result of the initial specification to verify the correctness of the design.

[Figure 1-2: Co-Simulation Environment - software models in the system simulation are incrementally replaced by pieces of the system implementation, with co-simulation mixing the two]

Based on these observations, it is possible to identify the following issues in system-level design:

a. high-level system specification,
b. simulation acceleration, and
c. mixed hardware and software simulation.

This thesis addresses a unified environment for high-level system specification and hardware-software co-simulation. In the context of co-simulation in this thesis, software means the models of system components written either in a programming language or a hardware description language for computer simulation, and hardware is used to address
the physical implementation of system components, either in the form of application-specific devices or as a piece of code executing on a programmable processor. Co-simulation means that physical implementations of hardware and software modules co-exist with software simulation models in a system specification. They can be interconnected with each other seamlessly and simulated together. Existing hardware modules can be used directly as simulation models. When any part of a system is implemented, it can be readily combined with the software simulation of the rest of the system and tested with the same set of input stimuli as the initial simulation (Figure 1-2). Moreover, a fast programmable DSP board may be connected to the simulation and used as a simulation accelerator for computation-intensive tasks.
1.2 Issues in Co-Simulation

There are several issues to be addressed in hardware-software co-simulation. They are identified in this section. The following sections will investigate these issues more thoroughly in conjunction with related work.
1.2.1 Level of Partitioning in Design Representation

First, the level of partitioning of the initial specification should be decided. The granularity of an atomic block whose simulation model can safely be replaced by a real implementation depends on the abstraction level supported by the simulator and on the application the co-simulation is intended for. Atomic blocks for co-simulation can be digital circuit blocks, VLSI devices, dedicated custom boards, or general-purpose programmable boards. The granularity of an atomic block suitable for co-simulation may be larger than the one allowed for software simulation. For example, an AND gate is an
atomic block for a gate-level simulator, but it is not practical to replace a single AND gate with real hardware.
1.2.2 Level of Abstraction for Communication

Secondly, the level of abstraction for the communication between blocks needs to be selected appropriately. In some cases it is determined automatically by the abstraction level of the design representation. For example, it can be assumed that the blocks in a gate-level simulation communicate with each other by sending and receiving binary data through memory-less wires. However, the communication between blocks may be represented at various abstraction levels as the granularity of the primitive block gets larger. For example, the communication between two hardware boards connected through a backplane bus may be described by a set of control signals and data buses. This is the lowest level of abstraction, and modeling at this level is necessary for the implementation of interface modules [Sun91]. At the next level of abstraction, read/write operations may form the primitives upon which various synchronization mechanisms are built to implement the communication protocol between the boards. Alternatively, the communication protocol may be specified at a higher level of abstraction, such as message passing in blocking mode, which means the sender (receiver) blocks until the corresponding receiver (sender) is ready to receive (send); a sketch of such blocking primitives is given below. The level of abstraction for the communication should be high enough that it can be applied generally to co-simulation with many different hardware architectures employing various communication protocols. At the same time, it should not be so abstract that the simulation no longer produces meaningful data.
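To make the blocking abstraction concrete, the sketch below shows one way such rendezvous-style primitives could be written. It is only an illustration: the method names echo the msgSend()/msgReceive() library constructs described later in this thesis, but the BlockingChannel class itself, and the use of standard C++ threads rather than the SunOS LWP library used in this work, are assumptions made for the example.

```cpp
// Minimal sketch of blocking (rendezvous-style) message passing.
// Hypothetical class; standard C++ threads stand in for the LWP library.
#include <condition_variable>
#include <mutex>
#include <optional>

template <typename T>
class BlockingChannel {
public:
    // The sender blocks until a receiver has actually taken the message.
    void msgSend(const T& msg) {
        std::unique_lock<std::mutex> lock(mtx_);
        cv_.wait(lock, [this] { return !slot_.has_value(); }); // wait for an empty slot
        slot_ = msg;
        cv_.notify_all();                                      // wake a waiting receiver
        cv_.wait(lock, [this] { return !slot_.has_value(); }); // block until it is taken
    }

    // The receiver blocks until a sender has deposited a message.
    T msgReceive() {
        std::unique_lock<std::mutex> lock(mtx_);
        cv_.wait(lock, [this] { return slot_.has_value(); });
        T msg = *slot_;
        slot_.reset();
        cv_.notify_all();   // release the blocked sender
        return msg;
    }

private:
    std::mutex mtx_;
    std::condition_variable cv_;
    std::optional<T> slot_;  // the single rendezvous slot
};
```

Because neither side proceeds until both are ready, this protocol needs no interface buffering, which is part of what makes it attractive as a portable abstraction across hardware architectures with different communication schemes.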
1.2.3 Synchronization of Simulation with Real-time Execution

The most important issue in the integration of a real implementation with software simulation is how to synchronize the execution of both sides. The implementation executes in real time, and the software simulation in general cannot keep up with it. There are two solutions to this problem. One is to slow down the execution of the implementation, and the other is to put buffers between the implementation and the software simulation. The execution of the real implementation can be slowed down if its control is data-dependent. That is, if the execution of a certain block depends on the availability of input data, the simulation model of the block can safely be replaced by the real implementation without changing the system behavior. Software modules are usually more controllable than hardware modules in that respect, so they can be executed in a data-driven fashion. For time-driven operation, on the other hand, both the input and output data to and from the real hardware must be buffered. The inputs need to be stored in memory so that the whole sequence of inputs can be applied to the hardware in real time whenever the hardware needs to be evaluated. The outputs from the hardware are buffered so that the simulation does not lose any data due to its slower execution. Extra hardware is required for storing and managing the data. Besides, this mode of operation is feasible only for a short period of time, because the buffers cannot grow indefinitely, as the sketch below illustrates.
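The sketch below illustrates the output side of this buffering scheme. The class and its capacity are hypothetical; the point is only that the hardware fills the buffer at real-time speed while the simulation drains it more slowly, so a bounded buffer eventually overruns.

```cpp
// Illustrative bounded FIFO between real-time hardware output and the
// slower simulation; all names and types here are assumptions.
#include <cstddef>
#include <deque>
#include <stdexcept>

class HardwareOutputBuffer {
public:
    explicit HardwareOutputBuffer(std::size_t capacity) : capacity_(capacity) {}

    // Called at hardware speed. Fails once the simulation falls too far
    // behind - the reason time-driven co-simulation works only for short runs.
    void push(int sample) {
        if (fifo_.size() >= capacity_)
            throw std::overflow_error("simulation too slow: buffer overrun");
        fifo_.push_back(sample);
    }

    // Called at simulation speed; returns false when no data is pending.
    bool pop(int& sample) {
        if (fifo_.empty()) return false;
        sample = fifo_.front();
        fifo_.pop_front();
        return true;
    }

private:
    std::size_t capacity_;
    std::deque<int> fifo_;
};
```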
1.3 Related Work on Co-Simulation

Mixed hardware and software simulation appears in four different contexts. First, hardware is directly used as a simulation model. In the second approach, a logic emulator is a hardware tool that performs the functions of the real hardware at a speed close to that of the actual hardware. Thirdly, a hardware accelerator is combined with software simulators to accelerate gate-level simulation of VLSI circuits. The last approach is hardware/software co-simulation for the concurrent development of hardware and software.
1.3.1 Hardware Modeling

Hardware modeling is a simulation technique in which the device itself is used as the simulation model [Widdoes88], [George89]. Such hardware models can solve the problem of finding or generating simulation models, because 100% accurate models are available on demand as long as the devices are available. The hardware modeling approach was first introduced in 1984 [Widdoes84]. Since then many vendors have been developing hardware model libraries that can be interconnected to their proprietary simulators [Williams91][Stoll85]. Recent developments in this domain provide interfaces with multiple simulators [Kelly89].

The goal of hardware modeling systems is to readily provide accurate simulation models for standard devices and application-specific ICs. Naturally, VLSI devices are considered the atomic blocks. Their model of communication with the software simulation is to send and receive binary data through wires in a time-driven fashion. At every simulation cycle, the hardware-modeled device is fed with input stimuli from the software simulation. The output produced by the device after the evaluation is sent back to the software simulation. The structure of a typical hardware modeler is shown in Figure 1-3 [Widdoes88]. A software shell is the user-written model for the hardware-modeled device. The simulator views the software shell like any other software model, and the shell communicates with the hardware modeler using a pre-defined set of interface functions.
[Figure 1-3: Structure of Hardware Modeler - a simulator and software shells on the host computer communicating over Ethernet (UDP/IP) with the hardware modeler's interface and control electronics, pattern memory, and pin electronics for driving and sensing the device]
Input stimuli need to be buffered for the evaluation of dynamic devices that cannot maintain their internal states. When an event occurs on an input of a dynamic device, the device is reset to its initial state and driven with the entire sequence of input data produced by the simulator from the time simulation began up to the current input. The hardware modeler then completes the evaluation by presenting the current event. The current event is also stored in memory so that it can be used for the next evaluation (a sketch of this replay loop is given at the end of this subsection). This mode of operation imposes limits on simulation length and evaluation speed, since the history of input events must be stored in a memory of finite size and presented in full for each new evaluation.

Even though the hardware modeling approach has been very successful with VLSI devices, its model of computation is too restrictive to be applied to system-level design,
where communication between subblocks is likely to be modeled as asynchronous message passing.
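The replay mechanism for dynamic devices described above can be summarized in code. The following is a sketch under assumed names, not any vendor's actual interface; only the reset-replay-apply-record strategy follows the text.

```cpp
// Sketch of evaluating a dynamic (state-losing) hardware-modeled device:
// reset, replay the entire stored stimulus history, then apply and record
// the current event. All identifiers are hypothetical.
#include <vector>

struct Stimulus { int pinValues; };

class HardwareModeledDevice {
public:
    int evaluate(const Stimulus& current) {
        resetDevice();                       // a dynamic device cannot hold state
        for (const Stimulus& s : history_)   // replay the full input history
            driveAndSample(s);
        int out = driveAndSample(current);   // apply the new event
        history_.push_back(current);         // remember it for the next call
        return out;
    }

private:
    std::vector<Stimulus> history_;          // contents of the pattern memory

    void resetDevice() { /* pulse the device's reset pin */ }
    int driveAndSample(const Stimulus& s) {  // pin electronics: drive, then sense
        return s.pinValues;                  // stand-in for the real measurement
    }
};
```

Since history_ grows with every event and must be replayed in full each time, both the simulation length and the evaluation speed are bounded, as noted above.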
1.3.2 Logic Emulation

Logic emulation is an enabling technology that allows designers to verify their ASIC or custom designs in the context of the entire system. It maps complex IC designs into reusable hardware, providing a functional representation of the design (Figure 1-4). This hardware is then connected to the end product, allowing software development and system verification to proceed in parallel with IC development.
[Figure 1-4: Symbolic Representation of Logic Emulation - a logic design is mapped by the emulation kernel onto a multiple-FPGA-based logic emulator connected to the end product]
While simulation uses an algorithmic representation of a design, emulation uses real hardware wires and gates to implement the user's circuit, so that it can execute up to a thousand times faster than simulation [Mars93]. Examples of logic emulators are the Quickturn Systems' Rapid Prototyping Machine [Wolff90] and the MARS II Logic Emulation System from PiE Design Systems [Ko92].

The three key hardware elements of a logic emulator are a control processor, an emulation array containing the FPGAs, and the in-circuit interfaces that connect the emulation array to the target system. The control processor administers the operation of the system. A stimulus generator and a logic analyzer are also included in the system to provide the input vectors and to monitor the output. The emulation array consists of a collection of programmable gate arrays. It can be reprogrammed to support the emulation of many different IC designs, allowing engineers to make frequent changes during design development. The logic emulator also provides the synthesis software to map the design into the emulation array. It converts the ASIC netlists into programming information for an arrangement of programmable gate arrays, so that an instant hardware prototype is created. The prototype thus created can be plugged into the target system, and the design is exercised at speeds comparable to the speed of the actual hardware.

The operation mode of a logic emulation system is very similar to that of a hardware modeling system. An ASIC chip is considered a primitive block for emulation, and the communication between the emulated hardware and the rest of the system is made at the bit level through wires. However, the objective of logic emulation is to exercise the design in an actual hardware environment. Therefore, a logic emulator connects the emulated hardware to the real hardware environment and executes it in real time, whereas a hardware modeler connects a real hardware part to a software-simulated environment and runs it in
simulated time (Figure 1-5). All in all, logic emulation is tailored to the fast prototyping of ASIC designs, and it cannot provide much help for system-level design.

[Figure 1-5: Comparison of Hardware Modeling with Logic Emulation - hardware modeling embeds a real hardware part among software models in a software simulation, while logic emulation embeds emulated hardware in the real system environment]
1.3.3 Hardware Accelerator

As the number of devices on an integrated circuit increases, the task of logic and fault simulation poses a serious bottleneck in the design process. Various approaches to maximizing performance and accuracy in large circuit simulation have been proposed. Specialized hardware simulation engines, such as the Mach 1000 from Zycad [Zycad87] or the Ikos 2800 Hardware Simulator [Mcleod89], have been developed to speed up gate-level
simulation by exporting time-consuming operations from software to a hardware accelerator. These engines are usually coupled with more general software simulators, resulting in a heterogeneous simulation environment. Since hardware accelerators target gate-level simulation, the granularity is usually below the chip level. The model of communication is the same as that of the hardware modeler, and the abstraction level is again too low for application to system-level design. The hardware accelerator interacts with the simulator in a "client-server" relation. Consequently, the two executions are synchronized in lock-step fashion.
1.3.4 Hardware/Software Co-Simulation

The concept of hardware/software co-simulation is investigated in [Gupta92] and [Becker92]. Both are aimed at systems which consist of both hardware and software components. The target architecture consists of a general-purpose processor and other application-specific modules. Software components are mapped onto the general-purpose processor, and hardware components are implemented as application-specific devices. Within the co-simulation environment, the software components of a system are linked to the simulation of the hardware components. The software modules are either compiled into the assembly code of the target microprocessor [Gupta92], or created as separate programs that interact with the hardware simulation using Unix interprocess communication (IPC) mechanisms [Becker92]. This allows the development of hardware and software components to proceed concurrently.

The communication between hardware and software is performed using message passing, in either blocking or non-blocking mode. A blocking-mode transfer requires the sender (receiver) to block until the corresponding receiver (sender) is ready to receive (send) data. Neither the sender nor the receiver blocks in a non-blocking transfer, so
that it may overrun or starve the interface buffer. The software modules are controlled so that they execute synchronously with the software simulation of the hardware components.

This approach has moved the concept of co-simulation up to system-level design, but the architecture model is quite restrictive. It allows for only one programmable processor, while many complex systems are implemented using multiple processors. Furthermore, co-simulation is defined only between the software components and the software simulation of the hardware components.
1.4 Summary

From the discussion in the previous sections it is clear that little work has been directed towards treating software simulation and physical implementation in an integrated fashion for system-level design. The following chapters describe the approach used in this work to build a unified environment for hardware/software co-simulation, and the details of its implementation. Chapter 2 presents an overview of the Ptolemy framework, which forms the foundation of this work, and a summary of the CP domain, which is the basis for the co-simulation environment within Ptolemy. In Chapter 3, requirements for system-level specification are discussed, followed by a brief summary of related work. The description of the specification model in the CP domain fills in the rest of the chapter. Chapter 4 focuses on the implementation of the multi-thread kernel for the simulation of concurrent processes. Examples of system-level simulation are also provided. The implementation of co-simulation with real hardware boards is presented in Chapter 5. The target architecture is introduced first, and the implementation of the interface between the software simulation and the target hardware board is discussed. Finally, Chapter 6 concludes with a summary of contributions and discusses how this work can be extended to support a wider range of applications.
CHAPTER 2
OVERVIEW OF CO-SIMULATION ENVIRONMENT

"Yea doubtless, and I count all things but loss for the excellency of the knowledge of Christ Jesus my Lord ... " (Philippians 3:8)
System design efforts can be focused on several different levels of abstraction. Focus on the gate level is quite common, while the register-transfer level (RTL) is becoming more common as RTL simulation and synthesis tools gain acceptance. However, design efforts are rarely focused on higher levels of abstraction, where the functionality of the system is described using a software-like language. This lack of design effort can be attributed to a lack of proven design methodologies and a lack of tools to support the design tasks at that level. From the discussion in the previous chapter, three specific issues in system-level design were identified:

a. the lack of a high-level specification which can describe the concurrent behavior of a system,
b. support for the heterogeneous simulation of mixed hardware and software, and
c. the need for simulation acceleration.

These issues are explored in this thesis. An overview of the system design environment,
which is the result of this research, is given in the following sections. Also presented in this chapter is a description of Ptolemy which forms the software platform of this work.
2.1 Research Goals

The objective of this research is to develop a unified simulation environment for the fast prototyping and evaluation of large real-time systems using hardware-software co-simulation. The concurrent behavior of a system should be expressed naturally, in a truly heterogeneous fashion, in this environment. Physical implementations of hardware/software modules and software simulation models can be used interchangeably throughout the design process, from the initial specification to the testing phase, in a transparent fashion, i.e. with virtually no overhead for the designer. For example, an off-the-shelf component can be used directly as a simulation model in the initial specification, instead of writing a software model of that module. A programmable DSP board can be connected to the software simulation and used as a simulation accelerator for the execution of computation-intensive DSP algorithms. The environment can also serve as a hardware/software co-design environment, where software modules are loaded onto a target processor and debugged along with the simulation of the hardware modules.

The objective can be subdivided into four specific goals:

a. Develop a high-level specification model which provides a natural abstraction for the modeling of large real-time systems as well as for seamless hardware-software co-simulation
b. Establish a methodology for hardware-software co-simulation at the system level
c. Build a simulation engine for the execution of a heterogeneous system specification which consists of mixed hardware and software
d. Implement an example of co-simulation on target hardware boards

Instead of building the simulation environment from scratch, we decided to implement it
as a new domain within the Ptolemy framework. Ptolemy allows different models to co-exist in a single system description, and thus provides an excellent platform upon which to build heterogeneous specifications of large real-time systems. Existing models of computation in Ptolemy can be incorporated in the system specification at no extra cost. Furthermore, a lot of duplicate coding can be saved by sharing the internal data structures, the run-time environment, and the user interface. The following sections give a detailed description of Ptolemy.
2.2 Software Platform - Ptolemy

Ptolemy [Buck91][Ptolemy91] is an object-oriented framework developed at U.C. Berkeley to support heterogeneous system specification, simulation, and design. It provides a highly flexible foundation upon which to build simulation environments at different levels of abstraction. The key property is the ability to combine multiple different environments into a multi-paradigm simulation whenever the designer deems it necessary. Ptolemy is implemented as an object-oriented framework within which diverse models of computation can co-exist and interact. In addition to the usual use of hierarchy to manage complexity, Ptolemy uses hierarchy to mix heterogeneous models of computation.
2.2.1 Overview of Ptolemy

Genealogy of Ptolemy

Ptolemy is a third-generation software environment that supports heterogeneous system specification, simulation, and rapid prototyping. It is an outgrowth of two previous generations of design environments, Blosim [Messer84] and Gabriel [Lee89]. Blosim, which stands for "block simulator," is a general-purpose simulation program for
sampled-data systems. It is efficient for simulating systems which operate on data at regular time intervals. Gabriel is a design environment intended to manage the complete development of real-time DSP applications, from the conceptualization and experimentation stage to the deployment in real-time hardware. It performs non-real-time simulation, as well as code synthesis for a variety of DSP hardware platforms. Both environments use the data-flow model, based on block-diagram linguistics, for the description of algorithms. They have proven to be very useful and efficient tools for the development of DSP applications. To apply the design methodology to general systems beyond DSP, however, other computation models are needed, such as discrete-event scheduling, mixed compile-time and run-time scheduling, and computational models based on shared data structures. These are not supported very well by Blosim or Gabriel. Most importantly, an environment for system simulation needs to be flexible, so that it can be extended to new computational models without re-implementation of the existing system.
Object-Oriented Methodology

Ptolemy uses an object-oriented programming methodology to support heterogeneity, and is implemented in C++ [Strou86]. Data abstraction and polymorphism, two important features of object-oriented programming, allow for the abstraction of the models of computation so that their differences are not visible to the designer. The goal is to make the system non-dogmatic, in the sense that the environment itself does not impose any particular computational model, and extensible to new models by simply adding to the system without modifying what is already there. Many useful features of its two ancestors, Blosim and Gabriel, are incorporated, such as the modularity and reusability of user-programmed software modules, friendly graphical window interfaces, and code generation for different target hardware platforms.
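As a small illustration of how these two features hide type differences from blocks, consider the sketch below. Ptolemy's real Particle hierarchy is considerably richer; this simplified version only demonstrates the idea.

```cpp
// Simplified sketch of polymorphic particles: blocks handle every payload
// type through the same abstract interface. Not Ptolemy's actual classes.
#include <iostream>

class Particle {                       // abstract base: what every token can do
public:
    virtual ~Particle() {}
    virtual Particle* clone() const = 0;
    virtual void print(std::ostream& os) const = 0;
};

class IntParticle : public Particle {  // one concrete payload type among many
public:
    explicit IntParticle(int v) : value(v) {}
    Particle* clone() const override { return new IntParticle(value); }
    void print(std::ostream& os) const override { os << value; }
    int value;
};

// A block can forward particles without knowing their concrete type.
void forward(const Particle& p, std::ostream& os) {
    Particle* copy = p.clone();        // works identically for any payload
    copy->print(os);
    delete copy;
}
```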
System Representation in Ptolemy

Ptolemy is based on block-diagram linguistics. A system is represented as a set of blocks interconnected with arcs. Blocks communicate with one another by passing tokens, or particles in Ptolemy's terminology, through the arcs. A particle can carry a piece of data of an arbitrary type: an integer, a real number, a complex number, or any user-defined data structure. Particles are sent and received between blocks in the same fashion regardless of the data type they carry, thanks to the polymorphism provided by C++. The functionality of each block is described in C++.

Blocks are executed, or fired, by a scheduler following a rule determined by the model of computation. The scheduler is responsible for firing a block as soon as its firing condition is met. The firing order may be determined statically at compile time, as in the data flow model, or it has to be decided dynamically at run time, as in the event-driven model. Each model of computation in Ptolemy is called a domain. A domain consists of a scheduler which enforces its firing rule, and a set of functional blocks. An elementary functional block is called a star. Even though the basic structure of a star is shared across domains, the behavior of each star must comply with the model of computation supported by the domain it belongs to.

Ptolemy provides rich libraries of stars for each domain. Since the stars in the libraries were designed to be as generic as possible, many complicated functions can be realized by combining them. Nonetheless, Ptolemy also provides support for adding customized stars. Newly designed stars can be dynamically linked into Ptolemy when they are required, which avoids the hassle of frequent recompilation and helps maintain the integrity of the system. The block diagram representing a system can be hierarchical, with a cluster of stars being
called a galaxy. A galaxy can be manipulated as if it were a star, and it can itself contain galaxies. An entire application is called the universe, which therefore has no inputs or outputs (Figure 2-1). The cosmological terminology is inherited from Blosim, and was also adopted in Gabriel. The sketch below illustrates the star/scheduler contract just described.
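The interface below is an illustrative simplification, not Ptolemy's actual class hierarchy.

```cpp
// Minimal sketch of the star/scheduler contract: the scheduler fires a
// block once its firing condition holds. Hypothetical names throughout.
#include <vector>

class Star {
public:
    virtual ~Star() {}
    virtual bool ready() const = 0;  // firing condition (domain-specific)
    virtual void go() = 0;           // one firing: consume inputs, emit outputs
};

// A run-time scheduler, as in the event-driven model, repeatedly scans for
// runnable stars; a compile-time scheduler, as in synchronous data flow,
// would instead fix the firing order before the simulation starts.
void runDynamic(std::vector<Star*>& stars, int passes) {
    for (int i = 0; i < passes; ++i)
        for (Star* s : stars)
            if (s->ready()) s->go();
}
```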
Wormholes in Ptolemy The most important innovation in Ptolemy is the extension of the hierarchy to include objects called wormholes. A wormhole is named as such because from the outside, it looks just like a star, but inside, it looks more like a universe. The scheduler on the outside treats it exactly like a star, but internally it contains its own scheduler. The inside and outside schedulers need not abide by the same model of computation. This is the basic mechanism for supporting heterogeneity. The domain, or the model of computation, of a star is specified in the star’s definition.
[Figure 2-1: System Representation in Ptolemy - a universe containing stars and galaxies, with galaxies in turn containing stars and other galaxies]
The domain of a galaxy or universe must be explicitly specified by setting its domain property. If not otherwise specified, the domain of a galaxy is the same as that of the galaxy or universe it sits within. If the specified domain is different from that of the galaxy or universe it sits within, then the galaxy is realized as a wormhole. Once a wormhole is created, it behaves externally like a star. Internally, however, it encapsulates an entire foreign domain, invisible from the outside domain.

The design of the interface between two different domains is the key issue in the realization of a wormhole, because there are domain-specific operations to be performed at the interface. In the worst case, N² interfaces would have to be defined for N domains. To make matters worse, if a user wanted to add a new domain to Ptolemy, he/she would have to know the details of the existing domains to design the interfaces with them. The approach taken in Ptolemy is to provide a universal interface (Figure 2-2). Each domain only needs to define the interfaces to and from the universal interface, and it does not need to know about the other domains.
[Figure 2-2: Wormhole Interface with Y Domain inside X Universe - the XWormHole's XtoUniversal and XfromUniversal converters connect the XScheduler, through the universal interface, to the Y domain's YtoUniversal and YfromUniversal converters and the YScheduler]
Thus, only 2N interfaces are needed for N domains, and all domains are totally independent of one another; a sketch of this 2N scheme is given below. Even though the concept of this approach is novel and there are working examples with existing domains, the approach is still considered experimental, and the question of whether the universal interface is truly universal is yet to be answered.
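The following sketch shows why the universal interface reduces the pairwise-converter count from N² to 2N: each domain translates only to and from one neutral event format. The types are hypothetical illustrations, not Ptolemy's actual interface classes.

```cpp
// Sketch of the universal-interface idea: with N domains, each defines one
// converter pair to a neutral format (2N pieces) instead of a converter for
// every ordered pair of domains (N^2 pieces). Names are illustrative.
struct UniversalEvent {        // neutral representation at the wormhole boundary
    double timeStamp;
    int    payload;
};

class DomainInterface {        // each of the N domains implements exactly this
public:
    virtual ~DomainInterface() {}
    virtual UniversalEvent toUniversal() = 0;                 // domain -> neutral
    virtual void fromUniversal(const UniversalEvent& e) = 0;  // neutral -> domain
};

// Crossing from any domain X into any domain Y needs no pairwise knowledge:
void cross(DomainInterface& from, DomainInterface& to) {
    to.fromUniversal(from.toUniversal());
}
```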
2.2.2 Domains in Ptolemy

Many different models of computation have been implemented in Ptolemy. The domains based on the data flow and event-driven models are the most widely used. Some other domains are still experimental. The features of each domain are briefly discussed here.
SDF Domain

The synchronous data flow (SDF) domain is a data-driven, statically scheduled domain in Ptolemy. It is a direct implementation of the techniques described in [Lee87a] and [Lee87b]. The data flow principle is that any star can fire whenever input data are available on its input ports. Because the execution is controlled by the availability of data, the data flow model is said to be "data-driven." The characteristics of the data flow model will be discussed further in the following chapter. Synchronous data flow is a special case of data flow, in which the number of particles produced or consumed by each star on each firing is specified a priori (Figure 2-3). This feature makes static scheduling possible, and with a static schedule, automatic code generation for DSP hardware architectures can be done very efficiently. The major application of this domain is to describe DSP algorithms and sampled-data systems that manipulate an infinite stream of data.
Figure 2-3 : Firing Model in SDF Domain
Various digital filters and image processing algorithms are most naturally described in this domain. However, data-dependent operations and time-related behavior, which appear quite often in system level descriptions, cannot be modeled gracefully.
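The static scheduling enabled by fixed rates can be made concrete with a small sketch. The following is illustrative only, not Ptolemy's scheduler (C++17 for std::gcd): it solves the balance equations for a chain of actors, yielding the number of firings of each actor per schedule period before any simulation runs.

    #include <cstddef>
    #include <iostream>
    #include <numeric>   // std::gcd (C++17)
    #include <vector>

    // For a chain of SDF actors, arc i carries prod[i] tokens per firing
    // of actor i and cons[i] tokens per firing of actor i+1.  Balance
    // requires reps[i] * prod[i] == reps[i+1] * cons[i], which can be
    // solved once, at compile time, because the rates are fixed.
    std::vector<long> repetitions(const std::vector<long>& prod,
                                  const std::vector<long>& cons) {
        std::vector<long> reps(prod.size() + 1, 1);
        for (std::size_t i = 0; i < prod.size(); ++i) {
            long g = std::gcd(reps[i] * prod[i], cons[i]);
            long scale = cons[i] / g;   // rescale earlier repetitions
            for (std::size_t j = 0; j <= i; ++j) reps[j] *= scale;
            reps[i + 1] = reps[i] * prod[i] / cons[i];
        }
        return reps;
    }

    int main() {
        // A source (1 token out) feeding UPSAMPLE (1 in, 2 out) feeding
        // DOWNSAMPLE (2 in, 1 out), as in Figure 2-3.
        std::vector<long> prod{1, 2}, cons{1, 2};
        for (long r : repetitions(prod, cons)) std::cout << r << ' ';
        std::cout << '\n';   // prints: 1 1 1
    }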
DDF Domain
The dynamic data flow (DDF) domain in Ptolemy provides a general environment for data flow computation and modeling of systems. It is a superset of the SDF domain. In this domain, a star may consume and produce a varying number of particles on each execution. Therefore, static scheduling is no longer applicable. Instead, the scheduler must detect at run time which stars are runnable, i.e. have enough data to be fired, and fire them one by one. The lower run-time efficiency of dynamic scheduling is the cost to be paid for the enhanced modeling power. Since the DDF domain is a superset of the SDF domain, all SDF stars can be used in the DDF domain. Besides the SDF stars, the DDF domain has some DDF-specific stars which can model dynamic constructs such as conditionals, data-dependent iterations, and recursions. Figure 2-4 shows a couple of DDF stars. SWITCH sends incoming data to
one of two outputs depending on the control input, and SELECT chooses one of its inputs depending on the control input and routes the token to the output.

Figure 2-4 : Examples of DDF Stars
DE Domain
The discrete event (DE) domain provides a general environment for time-related simulation of systems such as queueing systems, communication networks, and computer architectures. It implements the event-driven model: stars are fired whenever an event happens at one of their input ports. In this domain, a particle represents an event. A change in the system state causes a star to generate particles which reflect the change. Each particle carries an associated time stamp, generated by the star which produces the particle from the time stamps of the input particles and the latency of the star. Once generated, the particles, or events, are placed in the global event queue maintained by the DE scheduler, where they are processed in chronological order.
A DE star can be viewed as an event-processing unit, which receives events from the outside, processes them, and generates output events after some latency. There is a special class of stars called self-scheduling stars. A self-scheduling star generates an event to itself so that it can be fired again after a certain period of time without any stimulus from outside. Examples of this class are event sources such as a clock generator or a random number generator. A DE star executes as a subroutine. That means it always executes from the beginning of its functional body when it is fired. The consequence is that it loses its run-time state when it returns after the execution, unless it saves that state explicitly. This model of execution is quite useful for describing the behavior of small blocks whose functionality is rather simple. However, it is very cumbersome to model the behavior of a large block which has many inputs and makes complex state transitions during execution. Consequently, a large block with complex behavior is usually built by combining small blocks together instead of writing one big star. In other words, a system is modeled in bottom-up fashion. As already pointed out in the previous chapter, a system design is usually carried out in top-down fashion, from the initial specification at an abstract level down to more refined and concrete modules. And for the top-level description, a system is most naturally viewed as a set of concurrent processes. Even though the DE domain has many advantages for system modeling with fine granularity, it cannot serve as a system specification environment by itself, due to its restrictive execution model and lack of support for the process-oriented view.
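The scheduling mechanism described above can be summarized in a few lines. The sketch below is illustrative, not the DE scheduler's actual interface: time-stamped events live in a priority queue and are fired in chronological order, and a self-scheduling clock star reposts an event to itself.

    #include <functional>
    #include <iostream>
    #include <queue>
    #include <vector>

    // Each event carries a time stamp; the scheduler pops events in
    // chronological order and fires the destination star, which may
    // post new events at now + latency.
    struct Event {
        double time;
        std::function<void(double)> fire;   // action taken by the star
        bool operator>(const Event& o) const { return time > o.time; }
    };

    std::priority_queue<Event, std::vector<Event>, std::greater<Event>> eq;

    int main() {
        // A self-scheduling clock star: fires, then posts an event to
        // itself one time unit later, up to t = 3.
        std::function<void(double)> tick = [&](double now) {
            std::cout << "tick at " << now << '\n';
            if (now < 3.0) eq.push({now + 1.0, tick});
        };
        eq.push({0.0, tick});

        while (!eq.empty()) {               // chronological processing
            Event e = eq.top(); eq.pop();
            e.fire(e.time);                 // global clock jumps to e.time
        }
    }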
Thor Domain
The Thor domain is an event-driven domain. It implements the Thor simulator written by the VLSI/CAD group at Stanford University [Thor87]. Thor was developed to provide
hardware designers with an interactive and efficient tool for functional simulation. It supports the simulation of circuits at abstraction levels from the gate level to the behavioral level. All the features required for functional simulation are provided, such as generation of input stimuli, monitoring of output results, modeling capability, and interactive simulation control. The Thor scheduler is a conventional event-driven scheduler. It keeps a time-wheel which works as a global event queue. The time-wheel consists of uniformly spaced time slots in which future events are scheduled. That means all events are supposed to occur at discrete times, and the time stamps associated with the events must always be integer multiples of the unit time. The time-wheel mechanism makes model-writing simpler and also increases run-time efficiency. The disadvantage is insufficient support for modeling time-critical behavior such as clock skew or critical path analysis. The Thor domain in Ptolemy is a direct implementation of the Thor simulator. Most features of the simulator are directly available - modeling capability, generation of input stimuli, and monitoring of output results. Furthermore, Thor wormholes are provided for mixed-domain simulation so that hardware/software co-design can be done in Ptolemy [Kalavade91]. Functional models of the Thor simulator are encapsulated in stars of the Thor domain. A preprocessor is implemented which generates Thor stars automatically from models written in CHDL, a hardware description language based on the C programming language with added features for hardware modeling. Stars in the Thor domain are functional, as in the DE domain. That imposes the same restriction on using this domain for high-level system modeling, even though Thor provides some library support for high-level description. Furthermore, the communication channels between blocks can carry only binary data. That is the result of the Thor
simulator being originally designed for hardware verification, and it makes system level simulation extremely inefficient in this domain.
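A compact sketch of the time-wheel idea (illustrative; not Thor's actual data structures) shows both the O(1) scheduling that integer time stamps buy and the constraint that every delay be a whole number of unit steps.

    #include <functional>
    #include <iostream>
    #include <list>
    #include <vector>

    // A circular array of slots, one per unit time step.  Scheduling an
    // event is O(1): the slot index is just (now + delay) % size.  The
    // cost is that delays must be integer multiples of the unit time.
    class TimeWheel {
        std::vector<std::list<std::function<void()>>> slots;
        long now = 0;
    public:
        explicit TimeWheel(std::size_t horizon) : slots(horizon) {}

        void schedule(long delay, std::function<void()> ev) {
            slots[(now + delay) % slots.size()].push_back(std::move(ev));
        }

        // Advance one unit step, firing everything in the current slot.
        void step() {
            auto pending = std::move(slots[now % slots.size()]);
            slots[now % slots.size()].clear();
            for (auto& ev : pending) ev();
            ++now;
        }
        long time() const { return now; }
    };

    int main() {
        TimeWheel wheel(16);   // horizon must exceed the largest delay
        // A gate model with 2 units of latency: an input change at t=0
        // produces an output event at t=2.
        wheel.schedule(0, [&] {
            std::cout << "input changes at t=" << wheel.time() << '\n';
            wheel.schedule(2, [&] {
                std::cout << "output settles at t=" << wheel.time() << '\n';
            });
        });
        for (int i = 0; i < 4; ++i) wheel.step();
    }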
CG Domain
The code generation (CG) domain is used to generate code rather than to run simulations. It has derivative domains depending on the language generated, such as CGC (the C programming language), CG56 (DSP56000 assembly code), and CG96 (DSP96000 assembly code). The CG domain is intended as a template, and only the derivative domains are of practical use for generating code. All the code generation domains derived from CG obey SDF semantics, and can thus be scheduled at compile time. Each code generation star contains a codeblock, which is a pseudo-language specification of a code segment. Pseudo-language means the block of code is written in the target language with interspersed macros. Once the program graph is scheduled, the codeblocks are combined together to generate the code in the target language. A key feature of the code generation domains is the notion of a target architecture. Multiple target architectures can be supported in the CG domains, including a multiprocessor target. Every application should have a user-specified target architecture, selected from a set of targets supported by the user-selected domain. Operations such as scheduling, compiling, assembling, and downloading the code are determined according to the target architecture. The code generation domains are experimental and still under construction. Even though many novel ideas have been incorporated to implement these domains, several issues remain, such as the handling of data-dependent execution and the implementation of the wormhole interface.
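The codeblock mechanism can be sketched as simple textual macro expansion. The fragment below is illustrative only; the macro syntax $ref(name), the symbol table, and the gain star are assumptions for this sketch, not the CG domain's actual machinery. Once the schedule fixes buffer locations, each macro is resolved to a target-specific symbol and the expanded blocks are emitted in schedule order.

    #include <iostream>
    #include <map>
    #include <string>

    // Replace every $ref(name) in a codeblock with the symbol the
    // scheduler assigned to "name", leaving the rest of the target-
    // language text untouched.
    std::string expand(const std::string& codeblock,
                       const std::map<std::string, std::string>& symbols) {
        std::string out;
        for (std::size_t i = 0; i < codeblock.size();) {
            if (codeblock.compare(i, 5, "$ref(") == 0) {
                std::size_t close = codeblock.find(')', i);
                out += symbols.at(codeblock.substr(i + 5, close - i - 5));
                i = close + 1;
            } else {
                out += codeblock[i++];
            }
        }
        return out;
    }

    int main() {
        // Codeblock of a hypothetical gain star targeting C.
        std::string gain = "$ref(out) = $ref(in) * $ref(gain);\n";
        std::map<std::string, std::string> syms{
            {"in", "buf[3]"}, {"out", "buf[4]"}, {"gain", "0.5"}};
        std::cout << expand(gain, syms);
        // prints: buf[4] = buf[3] * 0.5;
    }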
Other Domains
Several other domains are currently being implemented in Ptolemy. These include the Silage domain and the VHDL domain, which incorporate a Silage simulator and a VHDL simulator into Ptolemy respectively, and the Boolean data flow (BDF) domain, which implements the token flow model, an analytical model for the behavior of data flow graphs with data-dependent control flow [Buck93].
2.3 CP Domain in Ptolemy
Many different specification paradigms are necessary in a system level description. The heterogeneity supported in Ptolemy allows for a unified and integrated approach to bring them together. Several different representation models have been implemented to aid design specification at various abstraction levels, and they can be freely mixed together in a single application to exercise a multi-level, multi-paradigm simulation. As was pointed out in the previous chapter, concurrency is another important element in the system level description. The data flow domains in Ptolemy, such as the SDF domain and the DDF domain, can describe concurrency in sampled digital systems to exploit the inherent parallelism in many DSP algorithms. Furthermore, the code generation domains support schedulers for multi-processor architectures, which can map a data flow graph onto a target hardware with multiple processors to generate code for parallel execution. However, data flow actors are functional blocks which don't preserve their run-time states between successive firings, and they can only have a single entry point at each firing. Consequently, the data flow models support only fine-grain concurrency in the system representation, where each block shows a simple and regular behavior, so that they don't
provide a general environment for representing the concurrent behavior of an arbitrary system. Therefore, a new domain needs to be incorporated in Ptolemy which will form the highest modeling level and bring the different computational elements together by providing a process-oriented view of the system representation.
2.3.1 Overview of the CP Domain
The communicating processes (CP) domain adds to Ptolemy the capability of modeling the concurrent behavior of large systems. It has been implemented as a unified environment for high level system specification and hardware-software co-simulation. The process model provided in this domain enables the complex behavior of a large block to be described easily and naturally. A hardware module can be substituted for its simulation model to realize hardware/software co-simulation in a transparent way. The process-oriented view of this domain provides a natural platform to combine physical implementation with software simulation seamlessly. As each process runs independently of the other processes except for inter-process communication, it doesn't affect the overall system behavior whether a process executes as a software simulation or as a real implementation, as long as both follow the same communication protocol. In the CP domain, a system is modeled as a set of concurrent processes interacting with one another by message-passing. Stars represent autonomous processes. Each star is converted into an independent thread once the simulation begins, and it keeps running until the simulation ends instead of being fired repeatedly. The CP scheduler has been built as a multi-tasking kernel to simulate parallel processes. It supervises the execution of multiple threads when they compete for the CPU. The major application of this domain should be system modeling at an early design stage, where the task of a system needs to be conceptualized and the specification usually
consists of large subblocks at a very abstract level. Consequently, a CP star tends to show a complicated and application-specific behavior, and the user is expected to write his/her own models for the application. A library of constructs is provided to describe various inter-process communication protocols and process synchronization. Once the analysis at the top level is completed, each block needs to be refined into more concrete subblocks. This refinement process may continue within the CP domain, or it may require the incorporation of other domains, such as SDF for the description of digital signal processing algorithms, or Thor for the analysis of a hardware block at the gate level. In any case the CP domain works as the highest modeling domain and maintains the top level view of the system representation throughout the different abstraction levels. Figure 2-5 shows the heterogeneous representation of a system in the CP domain, which includes a hardware board wrapped in a special CP star as well as wormholes in the SDF and the Thor domains.
2.3.2 Co-Simulation in the CP Domain
Hardware implementation can be intermixed with software simulation to make the CP domain a unified environment for hardware/software co-simulation as well as high level specification. A special class of stars called wrapper stars is provided to serve that purpose. A wrapper star looks like an ordinary CP star from the outside, but internally it works as a gateway to actual hardware. It communicates with the other stars by packet-passing, and it transfers the packets to and from the target hardware. The implementation of the data transfer between a wrapper star and the target hardware may vary depending on the hardware/software architecture of the target. A set of library routines has been implemented for the interface between the wrapper stars and target hardware.
Figure 2-5 : Heterogeneous System Representation in CP Domain
The systems of interest in this thesis are large real-time systems which usually consist of a set of general purpose programmable boards and dedicated custom boards interconnected through backplane busses or local area networks (LANs). Naturally, a hardware board is considered an atomic block for co-simulation in the CP domain. It may replace its software model in the system specification and communicate with the simulation of the rest of the system. If the granularity of an atomic block goes down to the chip level, a programmable interface board would be needed which can be configured to the pin
assignment of each target hardware and manage the bit-wise communication between the hardware and the software simulation. This model of execution is similar to that of the hardware modeler presented in the previous chapter. Hardware boards communicate with one another through a backplane shared bus, a dedicated private bus, or a LAN. The implementation of the communication channels varies depending on the desired bandwidth and the available hardware resources. The communication link may be modeled, at the lowest level, as a set of wires carrying binary data and control signals. This is generally called the physical layer of the communication model [Walrand91]. The next layer manages the transmission of packets of data on a given link. The actual implementation of the transmission mechanism will be tailored to the hardware configuration and the specific needs of the application. On top of this layer are the application layers, which construct communication services for user applications. The CP domain provides the capability of describing the communication scheme at any layer. However, the incorporation of a hardware board requires the definition of a certain model of communication which can be generally applied to various hardware architectures. For co-simulation in the CP domain, a hardware board is assumed to communicate with the rest of the system by sending and receiving packets of data in a blocking mode. That is, the hardware board should be able to stall when it is waiting for incoming data, or when it is trying to send data and the receiver is not ready yet. Every subsystem that runs as a server in the “client-server” configuration falls in this category; the speech recognizer or compute server in the personal communication system are examples (Figure 2-6). Presumably a packet can be of arbitrary length, but the maximum size of a packet may be limited by hardware constraints. A board which is supposed to transfer data to and from the rest of the system in real time
is not suitable for co-simulation. For example, the front-end board in the speech recognition system [Stölzle92], which receives voice input and produces sampled data in real time, cannot be directly connected to the software simulation. Real-time data can be handled by using a dedicated interface board which stores the incoming data in real time and feeds them to the software simulation sequentially. This technique was also used in the hardware modeler to manage dynamic devices, but it has limitations on the simulation length and the efficiency of the execution, as was pointed out in the previous discussion.
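The blocking packet exchange assumed above can be sketched as a bounded queue on which both sides stall. This is a minimal illustration, with modern C++ threads standing in for whatever bus or driver actually carries the packets; all names are hypothetical.

    #include <condition_variable>
    #include <deque>
    #include <iostream>
    #include <mutex>
    #include <thread>
    #include <vector>

    using Packet = std::vector<unsigned char>;

    // Both sides stall exactly as the text requires: the sender blocks
    // when the receiver is not ready (queue full), and the receiver
    // blocks when no data has arrived (queue empty).
    class BlockingChannel {
        std::deque<Packet> q;
        std::size_t capacity;
        std::mutex m;
        std::condition_variable notFull, notEmpty;
    public:
        explicit BlockingChannel(std::size_t cap) : capacity(cap) {}

        void send(Packet p) {                       // blocks on full
            std::unique_lock<std::mutex> lk(m);
            notFull.wait(lk, [&] { return q.size() < capacity; });
            q.push_back(std::move(p));
            notEmpty.notify_one();
        }

        Packet receive() {                          // blocks on empty
            std::unique_lock<std::mutex> lk(m);
            notEmpty.wait(lk, [&] { return !q.empty(); });
            Packet p = std::move(q.front());
            q.pop_front();
            notFull.notify_one();
            return p;
        }
    };

    int main() {
        BlockingChannel ch(4);   // capacity limited by hardware constraints
        std::thread board([&] { ch.send(Packet{1, 2, 3}); });  // "hardware"
        Packet p = ch.receive();                               // simulation
        board.join();
        std::cout << "received " << p.size() << " bytes\n";
    }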
2.3.3 System Simulation in the CP Domain
The methodology of system simulation in the CP domain is summarized here with the use of an example. Figure 2-6 shows the top level representation of the wireless personal communication system mentioned in the previous chapter (Figure 1-1). The wireless multimedia terminal is called an “InfoPad.” The base station serves as a gateway between InfoPads and the computer network. It is connected to a high speed backbone network, and communicates with InfoPads through wireless links. Various servers connected to the backbone network provide different network services to the InfoPads, such as the recognition of voice input from an InfoPad and real-time access to audio and video databases. The major purpose of simulation at this level is verification and analysis. The system is decomposed into a few interconnected components. The concepts modeled at this level are often statistical. This type of modeling helps determine key architectural parameters, such as the required bandwidth at each communication link in light of an estimated workload. Furthermore, this topmost description works as a system environment throughout the design process. As the design progresses, the components are decomposed into smaller functional blocks.
Figure 2-6 : Top Level Representation of the Personal Communication System

The behavior of each may be described again in the CP domain, or it may be modeled in other domains such as SDF or DE, depending on the specific needs of each block. The initial block in the top-level description is replaced by the refined design, which consists of a set of smaller subblocks, and simulated again. In this way the design of each component can be recursively refined and simulated along with the rest of the system coherently, without exponential growth of complexity, as long as its model of communication with the rest of the system remains the same. Designers of an InfoPad may decompose the InfoPad into several subblocks maintaining
the process view of the system, and simulate it in the CP domain (Figure 2-7). Various digital communication schemes and the channel characteristics of the wireless link between InfoPads and a base station can best be analyzed by modeling the link in the SDF domain. By inserting an SDF galaxy which models the wireless channel between the base station and an InfoPad, we can examine the effect of channel noise on video images by comparing them with those transmitted through an ideal link (Figure 2-8). The backbone network is required to meet real-time constraints as well as to provide enormous bandwidth. Investigation of real-time network protocols, sensitivity to packet loss, and simulation of network traffic need to be performed, and the DE domain may be most suitable for that task. The block
Figure 2-7 : Refinement of InfoPad in System Environment

Figure 2-8 : Simulation of Wireless Link in System Environment
representing the backbone network is replaced by a DE galaxy, and the network traffic and the rate of packet loss can be analyzed within the entire system environment (Figure 2-9). As was mentioned in the previous chapter, speech recognition is a very complicated task and requires an enormous amount of computation to exercise the algorithm. Consequently, the speech recognizer block may become a bottleneck for the simulation of the entire system and tremendously slow down the speed of the simulation. To relieve the bottleneck and speed up the simulation, a hardware accelerator can be incorporated so that the time-consuming speech recognition algorithm is executed in separate hardware dedicated to the execution of digital signal processing algorithms (Figure 2-10).

Figure 2-9 : Refinement of Backbone Network

When the
customized speech recognition system is implemented, it can be directly used as a hardware model instead of the hardware accelerator, and tested in the same system environment. In that way, the specification and simulation environment can also be used as a testing environment, and much of the effort of generating test vectors and writing test programs can be saved. Any other blocks available in the form of an actual implementation can also be integrated into the system simulation without much overhead for the user. This process can be continued in the CP domain until the entire system specification turns into a real implementation.
Figure 2-10 : Hardware Acceleration by Co-Simulation

2.4 Summary
The goal of this research has been to provide an environment for system level simulation. Representing and simulating the concurrent and heterogeneous behavior of a system in a consistent way is the most important issue. An overview of Ptolemy was presented, which provides a powerful environment for heterogeneous system specification and simulation. However, large-grain concurrency is not modeled gracefully, and the models of computation currently supported are not sufficient for representing the heterogeneity present in system level design. The CP domain has been implemented to fill those gaps. It allows a process-oriented representation of a system, and also provides an environment for the co-simulation of real hardware and software models. Thus it forms the highest modeling level and works as a unified system simulation environment by bringing the different computational elements, including actual hardware, together. There are three major issues to be addressed in the implementation of the CP domain:
a. generate a high-level specification model which incorporates the process-oriented view, and provide the facilities for high-level modeling
b. build a customized multi-tasking kernel to simulate concurrent processes
c. define the structure of the interface with hardware boards, and provide the facilities to build an interface with actual hardware

The following chapters elaborate these issues one by one.
CHAPTER 3

“If ye continue in My word, then are ye My disciples indeed. And ye shall know the truth, and the truth shall make you free.” (John 8:31-32)

HIGH LEVEL SYSTEM SPECIFICATION
The specification of the system is the first step in the system design process. A clear and unambiguous specification is essential in carrying out a large design project with teams of designers. A good specification environment can also help each individual designer think more clearly about the target system and its environment. This chapter focuses on system specification in the CP domain. The requirements for system specification are discussed first, followed by a brief review of existing methods for system specification. The chapter concludes with an explanation of the specification model adopted in the CP domain, using a couple of examples.
3.1 Requirements for System Specification
The objective of a system specification is to describe what the system does and how well it does it, in such a way that the behavior of the system can be understood clearly, completely, and without ambiguity. Therefore a specification language should have the capability to capture and depict precisely the designer's conceptual view of the system with minimal designer effort. Once the designer has captured the specification of a system, he/she needs a specification refinement capability. Also, the specification needs to be verified by simulation at every refinement step.
3.1.1 Specification Capture Capability
A specification language is used by a designer to transform his/her conceptual view of the system into a concise and unambiguous description. It should, therefore, serve the designer as the most natural medium, so that he/she can make the transformation easily and transparently without losing any of his/her perception of the expected behavior of the system. Two fundamentally different kinds of information need to be described in a system specification: structural information and behavioral information [Hill87].
Most systems can be naturally modeled as a set of interconnected modules communicating with one another. The specification of such systems should describe the structural decomposition of the system and the interconnection between the components, which is the structural aspect, and the behavior of each component. The structure can be expressed in a series of declarations with little or no order dependency. However, each behavioral operation within a component has side effects that can change the state of the world around the component. Another important kind of information is the communication mechanism between modules. The interacting behavior of a module with other modules needs to be specified, and it should be described with the topology of the system in mind. In other words, a complete description of the system is formed by providing an appropriate model of communication at each connection, gluing the structural information of the system and the behavioral description of each component together. Thus, a specification language should be able to describe various models of communication. For many designers, procedural languages such as FORTRAN and C have been the favorite tools for the initial description of a system's behavior. They have the advantage of being readily available and easily simulatable. They are very efficient when it comes to algorithmic descriptions which have to run on a single data stream processor. They also provide various data types, a wide range of control syntax, and powerful debugging tools. However, they were originally designed as programming languages with a single data stream processor in mind. Consequently they allow only a single thread of control, which makes it hard to describe the modularized and concurrent behavior of a system in these languages. There are extended versions of C available, such as Concurrent C [Gehani90] and C++, which alleviate those problems by allowing more modularity. In particular, the object-oriented nature of C++ has made it widely used for the implementation of system
specification and simulation environments such as Ptolemy. Block diagram and equation types of notation may be the natural medium for a system designer. Therefore, data flow type descriptions tend to be favored for the description of high performance sub-systems and for the description of algorithms at a high level of abstraction. Examples of such languages are Silage [Hilfinger89] and SA/RT [Ward85]. Most data flow languages come with graphical representations, which have the advantage that the introduction of constraints or the partitioning of the design is much easier than in a procedural language. On the other hand, procedural descriptions and control constructs are not easily handled in such an environment, and the model of communication is rather restricted. Also, data flow descriptions are not executable by themselves. They need to be translated into an executable format, as was done for SA/RT [Pulli88], or a customized execution engine needs to be built, as for the SDF domain [Ptolemy91] in Ptolemy.
3.1.2 Specification Refinement Capability
Most system design projects are carried out as a co-operative effort among many designers. The initial description of the system is partitioned into subblocks, and the design of each subblock is performed in parallel. During this process the specification of a subblock may be modified, extended, and refined. This should be done in such a way that the changes made within a subblock don't affect the rest of the system, as long as the behavior at the interface remains the same. Therefore the specification environment should guarantee that the description of each subblock can be refined independently of the others. It is also highly desirable that different types of description, or languages, be allowed within a single system specification, so that each subblock in the system can be described in the language most suitable to it. This requirement becomes apparent when we
consider the heterogeneous nature of real-time computer systems. Depending on the functionality and the structure of the algorithm, the availability of synthesis tools, and his/her own experience, a designer may choose to use a procedural language, a data flow specification, etc. Naturally the complete description of a system may consist of a variety of languages. Thus, it is desirable that the specification environment allow different types of descriptions to co-exist by providing well-defined interfaces between them. As the design progresses, the structure of each component gets refined and a more detailed description of the behavior is required. The goal of the specification at each design stage varies from architectural and algorithmic issues to structural and timing-related behavior. The common practice has been to use different description languages and simulators at each level of abstraction, such as the behavioral level, functional level, gate level, and so on. The descriptions are made separately and there is no interaction between them. The problems of this approach were pointed out early on by D. Hill [Hill80] and many other researchers. Many tools have been developed to tackle the problems by allowing multi-level specification and simulation in a single environment. ADLIB [Hill87], Verilog [Verilog87], and VHDL [Lipsett89] are some of the most popular description languages in which systems can be described at multiple levels of abstraction. On the other hand, using different languages has the advantage that each part of a system can be described with semantics tuned to that particular level, and each simulator is optimized in terms of run-time organization for that particular task. From the above observations, the optimum specification and simulation environment can be formulated by merging the advantages of each approach: a specification environment which allows multi-level descriptions in a heterogeneous fashion. Within this environment, the specification of a system gets refined incrementally, in such a way that some part of the system can be described at a lower level of abstraction in a different language tuned to that
particular level and simulated together with the rest of the system description.
3.1.3 Specification Simulation Capability
System specifications are potentially erroneous themselves, and it is unlikely that designers will have confidence in them until they are verified by simulation. Simulation also helps the optimization of algorithms and related design parameters in the system specification through performance analysis. Thus, a system specification is not practically useful unless it can be simulated. Programming languages such as C and FORTRAN have the advantage of being readily simulatable. Otherwise, a dedicated simulation engine for the description language must be provided. A system specification generally shows concurrent behavior, and it is a complicated task to build an execution engine that can emulate parallel execution.
3.2 Previous Work on System Specification
Many models and languages have been proposed over the years for system level specification. Some of the popular models are described here, including Petri nets, data flow, CSP, and hybrid approaches such as SpecCharts.
3.2.1 Petri Nets
Petri nets are an example of the class of visual modeling paradigms for representing systems. A Petri net is defined as a directed graph with a set of vertices connected by a bag of arcs.1 A vertex represents either a place or a transition, and each arc in a Petri net either connects a place to a transition, or a transition to a place. In addition, places may
1. A bag is distinguished from a set in that a given element can be included n times in a bag, so that the membership function is integer-valued rather than Boolean-valued.
contain some number of tokens.

Figure 3-1 : Simple Petri Net Description of a Producer and Consumer
Using these definitions, a system can be graphically modeled using circles for places, bars for transitions, arcs for dependencies, and black dots for tokens. Since its first introduction in 1962 by C. A. Petri in his doctoral thesis, the Petri net model has been extended, modified, and reformulated by many researchers [Molloy89]. As a result, almost all graphical models of computation can be formulated as either a special case of, or in other cases a generalization of, Petri net models. There are several classes of Petri nets. For example, the condition/event net is a model similar to the pure and simple Petri net with a different interpretation. It attaches semantics to the places and transitions: the places represent conditions, which hold when a token resides in the place, and the transitions represent events, which require certain conditions to hold before they occur and change the conditions when they do occur (Figure 3-1) [Woo92]. Directed arcs between conditions and events describe the movement of tokens through the net.
Execution rules for the net specify how tokens are placed into and taken from conditions. An event is “active” when all of its input conditions are valid, i.e. contain a token, and all of its output conditions are not valid. An event “fires” when active by removing a token from each input condition and placing a token in each output condition. A maximum of one token is allowed in each condition and, by definition, an event occurs in zero time. The simple graphical representation of the Petri net model and the ease with which Petri nets can represent parallel and asynchronous operations have made the model successful in the formal specification of systems such as communication protocols [Merlin79] and computer architecture designs [Noe71]. It is also possible to represent Petri nets by matrices, upon which various mathematical techniques can be applied for verification purposes. A special class of Petri net models called stochastic Petri nets is widely used for performance modeling of parallel and distributed computer architectures [Marsan84] and software systems [Dugan89]. Even though the application of Petri net models has proved useful in system specification, it is usually limited to specific areas such as formal verification and performance analysis of stochastic systems. Petri nets represent systems at the uninterpreted level, which is characterized by a lack of data types and undefined transformational blocks. Tokens are used instead of actual data to mark the flow of information through a network of states, and the processing performed on the tokens is not defined. This model can be used at preliminary design stages to explore issues such as design styles, resource allocation and coordination, data flow, and component utilization. It has limitations, however, when applied to system specification across the full range of abstraction, where the specification needs to be refined and different models of computation are required for the description of each component. Furthermore, Petri net models are not executable by themselves. They need to be translated into an executable form using languages such as PASCAL [Baldassari88] or
VHDL [Woo92], and the translation is not an easy task even for simple Petri net models [Woo92].
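The condition/event execution rules above are simple enough to state in code. The following is a minimal, illustrative interpreter; the data layout and the producer/consumer encoding are assumptions made for this sketch, not drawn from [Woo92].

    #include <iostream>
    #include <vector>

    // Conditions hold at most one token.  An event is active when every
    // input condition holds a token and every output condition does not;
    // firing moves one token across each arc in zero time.
    struct Net {
        std::vector<bool> token;               // one flag per condition
        struct Event { std::vector<int> in, out; };
        std::vector<Event> events;

        bool active(const Event& e) const {
            for (int c : e.in)  if (!token[c]) return false;
            for (int c : e.out) if (token[c])  return false;
            return true;
        }
        bool fireOne() {                       // fire any one active event
            for (const Event& e : events)
                if (active(e)) {
                    for (int c : e.in)  token[c] = false;
                    for (int c : e.out) token[c] = true;
                    return true;
                }
            return false;
        }
    };

    int main() {
        // Producer/consumer in the spirit of Figure 3-1: conditions 0/1
        // are the producer's ready/busy, 2 is the channel, 3/4 the
        // consumer's ready/busy.
        Net n;
        n.token  = {true, false, false, true, false};
        n.events = {
            {{0}, {1}},        // producer starts producing
            {{1}, {0, 2}},     // producer deposits a token in the channel
            {{2, 3}, {4}},     // consumer picks it up and goes busy
            {{4}, {3}},        // consumer becomes ready again
        };
        for (int i = 0; i < 8; ++i)            // two full cycles
            if (!n.fireOne()) break;
        std::cout << (n.token[3] ? "consumer ready\n" : "consumer busy\n");
    }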
3.2.2 Data Flow
The data flow model, introduced by Dennis [Dennis75], is a natural paradigm for the description of DSP applications. A system is modeled as a directed graph where each node represents a function and each arc represents a signal path. The communication channel, represented by an arc, allows only a single writer and a single reader, and each channel is implicitly assumed to have a FIFO buffer of infinite depth. Two important properties of the basic data flow model are data-driven control and the functional dependence of output values on input values [Dennis75]. Data-driven control means the execution of a node is enabled by the availability of the required inputs. Functional dependency implies that nodes are functional, so that they generate a set of outputs which are completely and uniquely determined by the set of inputs. The consequence is that nodes don't have side-effects and can be fired in parallel as long as their firing rules are satisfied. Also, there is no notion of time. The time at which a node is fired and the execution delay of a node cannot be specified in the data flow model. Arcs in a data flow graph specify control precedences which impose an order of execution. However, the order is relative and doesn't provide any information about timing. Nodes in a data flow graph fire repeatedly, constrained only by the availability of data on their inputs. This feature makes the data flow model attractive for signal processing and for the modeling of physical systems such as digital circuits. However, as has already been pointed out in the previous chapter, the nodes in a data flow graph are functional, so that they have only a single entry point at each firing and don't preserve their run-time
states between successive firings. Also, they can only represent deterministic behavior. All this makes it extremely difficult and inefficient to use the data flow model to describe the complex behavior of large real-time systems, which is often concurrent and non-deterministic. Furthermore, the lack of timing information, poor handling of asynchronous signals, and the fixed model of communication make it less attractive for system level description.
3.2.3 CSP
In the Communicating Sequential Processes (CSP) notation, input and output are considered basic programming primitives, and the parallel composition of communicating sequential processes is a fundamental program structuring method [Hoare78]. A system is represented as a parallel command which specifies the concurrent execution of its constituent sequential commands, or processes. All the processes start simultaneously and execute in parallel. They may not communicate with each other by updating global variables. Simple forms of input and output commands are provided instead, which are used for communication between concurrent processes. Such communication occurs when one process names another as the destination for output and the second process names the first as the source for input. There is no automatic buffering: the input or output command is delayed until the other process is ready with the corresponding output or input. Consequently all communications are synchronous. Another important feature of CSP is that Dijkstra's guarded commands [Dijkstra75] are adopted as the means of introducing and controlling non-determinism. The basic ideas of CSP bear much resemblance to those of the coroutines proposed in [Kahn74] [Kahn78], where the system is modeled as a set of independent processes which coexist and interact with one another. The differences lie in the support of non-
determinism and the model of communication between processes. They are summarized by the following:
a. Coroutines are strictly deterministic, whereas CSP allows non-determinism.
b. In coroutines the output commands are automatically buffered to any required degree, whereas no buffering is assumed in the communications of CSP.
c. In coroutines the output of one process can be fanned out to any number of processes, whereas CSP allows only a single receiver.

The CSP notation was an initial attempt to provide the basis for the development of programming languages for concurrent programming. It has been the major influence on the implementation of Occam [Jones85] and Ada [Gehani84]. Several other concurrent programming languages have been developed based on the concepts of CSP, such as CSPS [Patnaik84], CSP/80 [Hull86], and CSP-i [Wrench88]. Though CSP was originally proposed as a programming language paradigm, its usefulness for system specification and simulation has been noticed and investigated by many researchers [Moore91] [Reed88] [Trees91] [Wernimont90]. The programming primitives proposed in CSP, i.e. input, output, and concurrency, can be directly applied to the modeling of a system which consists of concurrent blocks communicating with each other by message passing. However, the original proposal of CSP has no notion of time, and it doesn't support the description of the asynchronous communications between processes which are required quite often in system level specification.
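The unbuffered, synchronous communication that distinguishes CSP from the coroutine model can be sketched with a rendezvous object. This is a minimal illustration using modern C++ threads (C++17; a single producer and a single consumer are assumed), not any of the CSP implementations cited above.

    #include <condition_variable>
    #include <iostream>
    #include <mutex>
    #include <optional>
    #include <thread>

    // No buffering: the first party to arrive blocks until its partner
    // is ready, so communication and synchronization coincide.
    class Rendezvous {
        std::mutex m;
        std::condition_variable cv;
        std::optional<int> slot;    // value in flight, if any
        bool taken = false;
    public:
        void output(int v) {        // CSP output command: receiver ! v
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return !slot.has_value(); });
            slot = v;
            taken = false;
            cv.notify_all();
            cv.wait(lk, [&] { return taken; });   // wait for the input side
        }
        int input() {               // CSP input command: sender ? x
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return slot.has_value(); });
            int v = *slot;
            slot.reset();
            taken = true;
            cv.notify_all();
            return v;
        }
    };

    int main() {
        Rendezvous ch;
        std::thread producer([&] { for (int i = 0; i < 3; ++i) ch.output(i); });
        for (int i = 0; i < 3; ++i) std::cout << ch.input() << '\n';
        producer.join();
    }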
3.2.4 Hybrid Approach
There have also been approaches which combine different models to form a hybrid specification environment. VHDL and SpecCharts are examples of this approach.
SpecCharts is a language designed for system level specification by researchers at U.C. Irvine [Vahid90] [Narayan92]. It is a combination of Statecharts [Harel87] and VHDL. In creating the language, they exploited the elegant graphical representation of the hierarchical state diagrams in Statecharts and VHDL's advantages of wide acceptance and availability of simulators, by developing a representation of hierarchical and concurrent state diagrams on top of VHDL. The basic object in SpecCharts is a behavior. A behavior is expressed in one of three ways:
a. concurrent subbehaviors: subbehaviors activate simultaneously when the parent behavior is activated. Concurrent behaviors, often called processes, can exist at any level of the hierarchy.
b. sequential subbehaviors: this mode of expression corresponds somewhat to traditional state diagrams. When a parent behavior is activated, it executes its subbehaviors sequentially in the order specified by the arcs between the subbehaviors.
c. leaf behaviors: program code using VHDL sequential statements. When activated, they begin executing their VHDL code.

Two constructs, the channel and the protocol, are provided to describe the communication between concurrent behaviors. Two concurrent behaviors can communicate over a channel that has an associated protocol. A channel and a protocol are extensions of a port and a resolution function in VHDL. A channel may consist of multiple ports, and a protocol defines the input/output behavior through the ports. The most distinguishing feature of the SpecCharts language is that it provides a very flexible and powerful way of describing a system at multiple levels of hierarchy, either concurrent or sequential. It also allows comprehensive control over the execution of any subbehavior in the hierarchy, so that asynchronous signal handling, for example, is
described rather easily. However, the leaf behaviors and the communication protocols between concurrent behaviors are to be described in VHDL. As previously pointed out by other researchers [Hubbard90] [Srivastava91], the VHDL signal, which was designed with electric wires in mind, is not a suitable inter-process communication (IPC) primitive for high-level modeling, where message channels, events, and shared data objects are the commonly used IPC and synchronization primitives. Besides, the lack of a random number generator and the poor handling of non-deterministic behavior in the case of simultaneous events are other weaknesses of VHDL for high level modeling which are also present in SpecCharts.
3.3 System Representation in the CP Domain
Our choice of a high-level specification model was driven by the following requirements. First, it should be able to naturally express the systems we are most interested in, i.e. systems that consist of general-purpose programmable boards and application-specific custom boards combined together; such a system may also span multiple computers and a number of card cages interconnected over a network. This serves the first requirement for system specification, the specification capture capability. Secondly, it should be suitable for hardware-software co-simulation. It should provide a platform where software simulation models and real hardware implementations coexist and are intermixed within a single specification. This requirement is related to the specification refinement capability. Lastly, it should be simulatable. A specification is not particularly useful unless it can be simulated. The implementation of a simulation kernel is the topic of the next chapter.
3.3.1 Network of Communicating Processes
As has been observed over the years by many researchers [Kahn78] [Hoare78] [Schwetman86] [Srivastava91], concurrency is an important aspect of system representation. Most systems are naturally expressed at the top level as a set of processes that operate concurrently and communicate with each other. The validity of this observation has already been established by the discussion in the first chapter. The previous work on system specification, reviewed in the previous section, has provided models to describe the concurrent behavior of systems and their intercommunication in one way or another. In the CP domain, a system is viewed as a static, hierarchical network of processes that execute concurrently. Processes interact using a well-defined communication mechanism
based on channels which can be either FIFO buffered or unbuffered (Figure 3-2). A channel connects an output port to an input port. A process communicates with other processes by sending and receiving messages through the ports. The communication mechanism between processes is determined by the combination of the channel type and the protocol specified on the ports. Another attractive feature of the process network model is that it does not make any distinction between hardware and software, which is essential for the implementation of hardware-software co-simulation. This model is largely based on the model proposed in [Srivastava91].

Figure 3-2 : Specification Model in CP Domain

Figure 3-3 shows a model of the Aloha communication network [Walrand91] in the CP domain. The VEM schematic editor [Harrison89] is used as a graphical interface to manage the structural information. The behavior of each block is described in the C++ language. The code shown within boxes is the model of a leaf node in the Aloha network described in the CP domain.
3.3.2 Inter-Process Communication
Processes communicate with one another by passing messages through ports. Each star may have several input ports and output ports. A channel connects an output port to an input port. Only one-to-one connections are allowed. When broadcasting is needed, a fork star must be defined which receives a message from its input port and repeats it through multiple output ports. Channels can be buffered or unbuffered. A buffered channel contains a FIFO queue, and the capacity of the queue can be either finite or infinite (Figure 3-2). The default action of connecting an output port to an input port is to create an unbuffered channel. A queue object is available from the library to implement a buffered channel. The capacity of the queue can be specified at run time.
Figure 3-3 : System Representation in the CP Domain
A capacity of -1 is interpreted as an infinite buffer. In that case the queue can grow without limit, and the sender to this queue will never block. A port is characterized by the data type it carries and a port protocol. The port protocol specifies the behavior when a channel is full or empty. Three different protocols are supported for each input and output port. An output port can block on full, overwrite on full, or ignore on full. Likewise, an input port may block on empty, previous on empty, or ignore on empty. Block on full/empty means the process blocks if it tries to send/receive a message through an output/input port and the channel connected to the port is full/empty, or the receiver/sender is not ready in the case of an unbuffered channel. This protocol is usually used to describe a process with data-driven control, such as the Remote Procedure Call (RPC) mechanism. Overwrite/previous means the process overwrites the message at the tail of the buffer when it is full, or reads the previous message once again when the buffer is empty. This protocol is meaningful only with buffered channels. For example, a process, P1, gets an input from a user and updates a parameter that is used by another process. The other process, P2, may be a process controlling a motor. It reads the parameter regularly and adjusts the motor speed accordingly. This type of communication can be modeled as two processes communicating through a channel with unit depth (Figure 3-4). The output protocol of P1 is overwrite on full, and the input protocol of P2 is previous on empty. The parameter stored in the channel is updated atomically and asynchronously, so that it always holds the latest value. Ignore means the message is ignored, or lost, when the channel is full, and the process reads no data when the channel is empty. This protocol is useful for modeling wireless broadcasting, or polling mechanisms for communication through shared memory.
The port protocol is fixed and cannot change dynamically. The default protocol is block on full/empty, but it can be manually overridden.
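The three protocols can be summarized in a small sketch of a single buffered channel. This is illustrative only; the real constructs are the msgSend()/msgReceive() calls described in Section 3.4.2, and the blocking branch is elided here because it needs the multi-tasking kernel of the next chapter.

    #include <deque>

    enum class OutProtocol { BLOCK, OVERWRITE, IGNORE_FULL };
    enum class InProtocol  { BLOCK, PREVIOUS, IGNORE_EMPTY };

    struct Channel {
        std::deque<double> q;
        std::size_t capacity = 1;   // unit-depth buffer as in Figure 3-4
        double previous = 0;        // last value read, for PREVIOUS on empty

        // Returns false when the calling process would have to block.
        bool send(double v, OutProtocol p) {
            if (q.size() < capacity) { q.push_back(v); return true; }
            switch (p) {
                case OutProtocol::OVERWRITE:   q.back() = v; return true;
                case OutProtocol::IGNORE_FULL:               return true;
                default:                                     return false;
            }
        }
        bool receive(double& v, InProtocol p) {
            if (!q.empty()) {
                v = previous = q.front(); q.pop_front(); return true;
            }
            switch (p) {
                case InProtocol::PREVIOUS:      v = previous; return true;
                case InProtocol::IGNORE_EMPTY:                return true;
                default:                                      return false;
            }
        }
    };

    int main() {
        Channel c;                             // parameter-updating example
        c.send(1.0, OutProtocol::OVERWRITE);   // P1 publishes a parameter
        c.send(2.0, OutProtocol::OVERWRITE);   // full: newest value wins
        double v;
        c.receive(v, InProtocol::PREVIOUS);    // P2 reads 2.0
        c.receive(v, InProtocol::PREVIOUS);    // empty: re-reads 2.0
    }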
3.3.3 Execution Model of a Process
All the processes are assumed to run concurrently. A process keeps running until any one of the following occurs:
a. a process blocks when it tries to communicate with another process and the communication cannot be made immediately
b. a process blocks while it waits upon a certain set of inputs and outputs
c. a process suspends itself for a certain amount of time
d. a process terminates

A blocked process resumes as soon as the condition which caused the blocking of the process is resolved, for example when the communication channel becomes available. A suspended process resumes after the given time period has elapsed.
P1:  while (TRUE) { p = getInput(); msgSend(p, out); }     // overwrite on full
P2:  while (TRUE) { msgReceive(p, in); compute(p); }       // previous on empty
     (channel depth = 1)

Figure 3-4 : Communication through Parameter Updating
A process is assumed to execute in zero time. That is, the global clock doesn't advance while a process is performing its operations. The execution must be suspended explicitly to model execution delay, which allows the global clock to advance. This is explained by looking at a simple server example. It waits for input data, processes it, and sends the results through the output port. The following code shows the behavior of the server described in the CP domain:

go {
    while (TRUE) {
        msgReceive(data, input);
        result = processData(data);
        waitFor(DELAY);
        msgSend(result, output);
    }
}
The functionality of a star in Ptolemy is described within the go{} block. In the above example, the go{} block consists of a functional block within an infinite loop, which is typical for a CP star. In msgReceive() the server receives data from the input port. MsgReceive() doesn't return until data is received, i.e. the process blocks at that point and waits for incoming data if the data is not immediately available. As soon as the data arrives at the input port, msgReceive() returns. The execution then proceeds to the next line and the server begins processing the data in processData(). The execution of processData() may take some time, but it is assumed to be done in zero time, and the global clock doesn't advance. To take the execution time into account, waitFor() is called in the next line. WaitFor(DELAY) suspends the server for the given DELAY. When the global clock has advanced by DELAY, the server wakes up and resumes execution. It sends out the result through the output port. Again, msgSend() returns only after the data is sent successfully. And the whole process repeats.
While the server is suspended by waitFor(), it remains insensitive to all inputs and cannot respond to any input stimuli. If the server needs to stay sensitive to a certain input, such as an asynchronous reset, TWaitAny() should be used instead of waitFor(). TWaitAny(DELAY, reset) suspends a process for DELAY, but the process remains sensitive to the reset port. It returns immediately if data arrives at the reset port; otherwise, it returns after the given DELAY has passed. MsgReceive(), msgSend(), waitFor(), and TWaitAny() are standard constructs provided in the CP domain, along with many other utility routines. A full description of the utility routines is given in the next section. Manual pages are also available in the appendix. ProcessData() in the above example may contain arbitrary functionality of the server.
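For illustration, the server's go{} block could be made reset-sensitive as follows. This is one plausible way to combine TWaitAny() with the transReady() test used later in this chapter; the reset handling shown and the initialize() routine are assumptions for this sketch, not part of the documented constructs.

go {
    while (TRUE) {
        msgReceive(data, input);
        result = processData(data);
        TWaitAny(DELAY, reset);      // suspend, but stay sensitive to reset
        if (transReady(reset)) {     // a reset arrived during the delay
            msgReceive(cmd, reset);  // consume the reset message
            initialize();            // hypothetical recovery routine
            continue;
        }
        msgSend(result, output);
    }
}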
3.3.4 Simultaneous Events and Non-determinism
It has been pointed out in [Dijkstra75] [Hoare78] that allowing non-deterministic program components makes many programming problems easy and simple which would otherwise be very cumbersome. Not only in software programming but also in many concurrent systems, non-determinism occurs frequently as a consequence of concurrent computation [Reed91]. The CP domain provides a standard construct for the description of non-deterministic behavior caused by simultaneous events. A process may wait on a set of input and output ports. If several ports are ready at the same time, only one is selected and the others have no effect. The choice is made arbitrarily, which allows for potential non-determinism.
3.4 Process Description

3.4.1 Structure of a CP Star
Figure 3-5 shows a complete description of the server example in the CP domain. The bold letters represent keywords which are generally used across domains.

defstar {
    name { Server }
    domain { CP }
    desc { Receive an input, process it, and return it }
    author { Seungjun Lee }
    defstate {
        name { DELAY }
        type { float }
        default { 1.0 }
    }
    input {
        name { input }
        type { float }
    }
    output {
        name { output }
        type { float }
    }
    setup { }
    go {
        float data, result;
        while (TRUE) {
            msgReceive(data, input);
            result = processData(data);
            waitFor(DELAY);
            msgSend(result, output);
        }
    }
    wrapup { }
}

Figure 3-5 : Example of a CP Star
A Ptolemy preprocessor generates the C++ code from this description. The definition of a star consists of declaration sections and action code. The declaration sections contain definitions of input and output ports, state variables, and other descriptions for documentation purposes. The action code consists of three parts - setup{}, go{}, and wrapup{} - and it defines the behavior of the star. Setup{} and wrapup{} are provided for simulation purposes. Setting parameters and initializing the state variables are usually described in setup{}. Wrapup{} is usually used to display simulation results. The go{} block describes the main functionality of a star. All C++ constructs are allowed. There is no implicit loop assumed around go{} as in the case of the process statement in VHDL. Therefore, an infinite loop must be stated explicitly to represent a continuing process. A library of constructs is also provided to express inter-process communication and process synchronization.
3.4.2 Library Constructs in the CP Domain
Many utility routines are available which make high-level modeling simple and easy. Function overloading, a key feature of C++, is used extensively in these routines to handle different data types. They are explained in this section. Precise syntax and more detailed descriptions are given in the appendix.
setProtocol()

The communication protocol on ports is set to block on full/empty by default. SetProtocol(port, protocol) can be used to override the default. For an output port, the protocol can be one of BLOCK (block on full), OVERWRITE (overwrite on full), and IGNORE (ignore on full). For an input port, it can be one of BLOCK (block on empty), PREVIOUS (reuse the previous value on empty), and IGNORE (ignore on empty).
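As a minimal sketch of its use, and assuming a star whose output port is named out, the default could be overridden in setup{} so that a slow receiver never stalls the star:

setup {
    // assumption for illustration: when the receiver's buffer is
    // full, the newest value simply replaces the oldest one
    setProtocol(out, OVERWRITE);
}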
msgSend() / msgReceive()

MsgSend(data, outPort) is used to send data through outPort, and msgReceive(data, inPort) receives data through inPort. If the data transaction cannot be made immediately, the process may either block or proceed, depending on the protocol specified on the port. MsgSend()/msgReceive() suspends the process if the protocol is BLOCK; it returns immediately in the other cases, i.e. OVERWRITE, PREVIOUS, or IGNORE. Data can be an integer number, a real number, or any arbitrary user-defined data type.
TMsgSend() / TMsgReceive()

TMsgSend(data, outPort, time-out) and TMsgReceive(data, inPort, time-out) can be used to describe the behavior “block on full/empty with time-out”. They behave the same as msgSend()/msgReceive(), except that they return after the time-out even if the communication has not succeeded. They are only meaningful when called on ports defined with the BLOCK protocol.
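For instance, a sender that must not stall indefinitely on a blocking port might be sketched as follows (produce() and TIMEOUT are hypothetical names used only for illustration):

go {
    float data;
    while (TRUE) {
        data = produce();    // hypothetical helper
        // block on the full buffer, but give up after TIMEOUT
        // time units if the receiver never becomes ready
        TMsgSend(data, output, TIMEOUT);
        waitFor(DELAY);
    }
}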
waitFor()

waitFor(timePeriod) suspends the process until timePeriod passes. It is usually used to model execution delay. The process is reactivated after timePeriod has elapsed, and it cannot respond to any external stimuli during that period.
waitAll() / TWaitAll()

WaitAll(in1, in2, ...) waits on a set of input or output ports, and returns only when all of them become ready for communication. It takes an arbitrary number of ports as its arguments, and returns the number of ports available for data transaction. It can be used to describe synchronization with multiple inputs. WaitAll() can also be used to
emulate the data flow model. The following example shows the usage of waitAll() in the description of a two-input adder (Figure 3-6). The adder waits until input data arrives at both input ports before executing; this essentially represents the data flow model. WaitAll() only examines the availability of data transactions, and it doesn't receive or send any data itself.
defstar {
    name { Add }
    domain { CP }
    desc { adder with two inputs }
    input { name { in1 } type { float } }
    input { name { in2 } type { float } }
    output { name { out } type { float } }
    defstate {
        name { DELAY }
        type { float }
        default { 0.1 }
        desc { amount of execution delay }
    }
    go {
        float x1, x2;
        while (TRUE) {
            waitAll(2, &in1, &in2);
            msgReceive(x1, in1);
            msgReceive(x2, in2);
            waitFor(double(DELAY));
            msgSend(x1 + x2, out);
        }
    }
}
Figure 3-6 : Use of waitAll() in Two-Input Adder
Hence, it should be followed by msgReceive() or msgSend() for the actual data transaction, as is shown in the above example. TWaitAll(time-out, in1, in2, ...) has the same functionality as waitAll() except that it takes one more argument as the time-out value. It will return when the time limit expires even if not all the ports are ready.

defstar {
    name { Merge }
    domain { CP }
    desc { Merge input events }
    input { name { in1 } type { float } }
    input { name { in2 } type { float } }
    output { name { out } type { float } }
    go {
        float data;
        while (TRUE) {
            waitAny(2, &in1, &in2);
            if (transReady(in1)){
                msgReceive(data, in1);
                msgSend(data, out);
            }
            if (transReady(in2)){
                msgReceive(data, in2);
                msgSend(data, out);
            }
        }
    }
}
Figure 3-7 : WaitAny() and TransReady() in Two-Input Merge
waitAny() / TWaitAny() / transReady()

WaitAny(in1, in2, ...) waits on a set of input or output ports, and returns if any of them becomes ready for communication. It takes an arbitrary number of ports as its arguments, and returns the number of ports available for data transaction. TransReady() usually follows after waitAny() returns, to determine which portholes are ready for transaction; transReady(port) returns TRUE if the data transaction can be made through the port. WaitAny() is similar to wait on signal-list in VHDL, and it is useful for event-driven modeling. Figure 3-7 shows the description of the Merge star, which receives data from two inputs and merges them into a single output. In this example, the in1 port has priority over the in2 port when data arrives at both ports simultaneously. TWaitAny(time-out, in1, in2, ...) is waitAny() with time-out. It takes the time-out value as an argument, and it returns when the time limit expires even if no port is ready for transaction. It is very useful for modeling an asynchronous signal such as a reset or an interrupt. Figure 3-8 shows the usage of TWaitAny() in a clock generator with asynchronous reset. The clock generates pulses with a given period. After generating a pulse, it waits for a certain interval. If a reset arrives during the waiting period, the clock restarts from the moment the reset signal is asserted (Figure 3-9).
waitOne() / TWaitOne()

WaitOne() is the sole means of introducing non-determinism. It describes non-deterministic behavior with simultaneous events. It waits on a set of input and output ports, and returns a port which is ready for transaction. When more than one port is ready at the same time, one is selected at random and returned. TWaitOne() takes one more argument as the time-out value, and returns when the time limit expires even if there is no ready port.
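As a minimal sketch, a non-deterministic variant of the Merge star of Figure 3-7 might use waitOne() as follows; unlike the waitAny() version, neither input has priority when both are ready:

go {
    float data;
    while (TRUE) {
        // one ready port is chosen at random among those ready
        InCPPort* p = waitOne(2, &in1, &in2);
        msgReceive(data, *p);
        msgSend(data, out);
    }
}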
defstar {
    name { Clock }
    domain { CP }
    desc { Generate pulses with given interval }
    input { name { reset } type { int } }
    output { name { out } type { float } }
    defstate {
        name { MAGNITUDE }
        type { float }
        default { 1.0 }
        desc { magnitude of out-going pulse }
    }
    defstate {
        name { INTERVAL }
        type { float }
        default { 1.0 }
        desc { clock period }
    }
    go {
        int data;
        while (TRUE) {
            msgSend(float(MAGNITUDE), out);
            if (TWaitAny(float(INTERVAL), 1, &reset))
                msgReceive(data, reset);
        }
    }
}
Figure 3-8 : Clock Generator with Asynchronous Reset
Figure 3-9 : Output from Clock Generator (pulses of height MAGNITUDE every INTERVAL, restarting when RESET arrives)
OutPort out(int, BLOCK);
Producer(){
    while (TRUE){
        produce(data);
        msgSend(data, out);
        waitFor(delay);
    }
}

InPort in(int, BLOCK);
Consumer(){
    while (TRUE){
        msgReceive(data, in);
        consume(data);
        waitFor(delay);
    }
}

Figure 3-10 : Producer-Consumer Model (out and in are connected through a queue of size N)
3.4.3 Examples

Producer-Consumer Model

Figure 3-10 shows a simple example of two models, Producer and Consumer, connected by means of a finite-size queue. The protocol on the output port of the producer is
BLOCK. If the producer executes faster than the consumer, the queue fills after some time, and the execution of the producer blocks at msgSend(). As soon as the consumer fetches data from the queue, the producer resumes its execution. The next chapter will present more substantial examples.
Central Node in Aloha Network

The next example shows a fairly complex behavior: the model of the central node in an Aloha network. The central node listens to packets transmitted by leaf nodes at one frequency and retransmits these packets at another frequency. When two transmissions occur simultaneously, they garble each other; in such an instance, the packets are said to collide. The central node acknowledges the correct packets it receives. When a leaf node does not get an acknowledgment within a specific time-out, it assumes that its packet collided. The behavior of the central node is described below, making extensive use of the library routines provided in the CP domain:

defstar {
    name { Central }
    domain { CP }
    desc { central node in ALOHA network }
    inmulti { name { inputs } type { int } }
    outmulti { name { ack } type { int } }
    output { name { transmit } type { int } }
    defstate {
        name { transTime }
        type { float }
        default { 1.0 }
        desc { time to transmit a packet }
    }
    go {
        int data;
        while (TRUE) {
            // accept request for transmission
            InCPPort* p = waitOne(1, &inputs);
            msgReceive(data, *p);
            // detect collision
            if (TWaitAny(transTime, 1, &inputs)){
                // collision detected!
                // clear the collided packets
                msgReceive(inputs);
                // detect more collisions
                while (TWaitAny(transTime, 1, &inputs))
                    msgReceive(inputs);
            } else {
                // correct packet received
                // acknowledge the senders
                msgSend(ACKNOWLEDGE, ack);
                msgSend(data, transmit);
            }
        }
    }
}
3.5 Summary

Specification capture, refinement, and simulation are the important requirements for system specification. Several important models and languages for system specification were discussed, including Petri nets, data flow, CSP, and SpecCharts; each supports the description of concurrency in one way or another. In the CP domain a system is represented as a static set of concurrent processes communicating with each other by sending and receiving messages via FIFO queues. This model most naturally describes large real-time systems, which can be characterized as combinations of heterogeneous and concurrent components interacting with each other.
Furthermore, the process-oriented view in the CP domain makes it easy to introduce different models for the description of the components and to combine them into a heterogeneous representation of a system. The heterogeneity supported in this environment is mature enough to incorporate physical hardware as well as software models in different domains. Thus, the CP domain forms a unified environment for high-level system specification and hardware-software co-simulation. The next chapter presents the implementation of a multithreaded kernel for simulation of the process-oriented specification in the CP domain.
CHAPTER 4

IMPLEMENTATION OF SIMULATION KERNEL

“And we know that all things work together for good to them that love God, to them who are the called according to His Purpose.” (Romans 8:28)
This chapter presents the implementation of the simulation kernel in the CP domain. The kernel must have the multithreading capability necessary to simulate the concurrent behavior of a system specification in a uniprocessor environment. A multithreading kernel is also needed for a multiprocessor architecture, because several processes are in many cases mapped onto a single processor and expected to execute in parallel. The kernel of the CP domain is currently implemented on top of SUN's Light-weight Process (LWP) library, which provides a mechanism to support multiple threads of control within a single user process [Kepecs85] [LWP90]. Each star in a system specification turns into a separate thread and executes in parallel with other stars. The concurrent execution of the system is simulated by the quasi-parallel execution of the multiple threads supervised by a customized process scheduler.
4.1 Software Support for Multithreading

Spurred by the growing interest in parallel computation and concurrent algorithms, many software tools and libraries have been developed on top of the UNIX operating system to support a variety of concurrency work, such as the development and testing of concurrent algorithms and the process-oriented simulation of distributed systems [Cormack88] [Buhr90] [Grunwald91] [Gorlen90] [Stroustrup87] [Kepecs85]. Furthermore, some UNIX-based operating systems have recently begun providing facilities to support multiple threads within a single process [Posix90] [Powell91].
4.1.1 Nomenclature

Co-routine, thread, and task are terms commonly used to describe concurrency in the execution of a software program. Even though their exact meanings are
not the same, they are often used interchangeably in the literature. Lightweight process is another term, introduced more recently, for the easy and efficient management of multiple threads. The definitions of these terms are clarified in this section. The initial definition of a co-routine is “an autonomous program which communicates with adjacent modules as if they were input or output subroutines. Thus, coroutines are subroutines all at the same level, each acting as if it were the master program when in fact there is no master program.” [Conway63] This view of co-routines as mutual subroutines has remained the most common view of how co-routines can be used. From the above definition two fundamental characteristics of a co-routine can be derived [Marlin80]:
a. the values of data local to a co-routine persist between successive occasions on which control enters it, and
b. the execution of a co-routine is suspended as control leaves it, only to carry on where it left off when control re-enters the co-routine at some later stage.
These characteristics describe a mechanism which allows a set of co-routines to call each other in symmetric fashion, and to pass control back and forth between each other. The importance of co-routines comes from the possibility of their parallel execution. If co-routines communicate only via FIFO queues and there is no explicit transfer of control between the co-routines, the flow of control is dynamically determined by the data dependencies in the program. Under this condition, a program which consists of a set of co-routines can be executed in parallel. Kahn's process network [Kahn78] and Dennis' data flow representation exploited this model. Alternatively, the control can be transferred explicitly from one co-routine to another, causing the currently executing co-routine to become suspended and a target co-routine to resume execution. In this case, only one co-routine is ever executing at any given time, and parallel execution is not possible.
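The explicit-transfer style can be made concrete with a small sketch. The following stand-alone program, which uses the POSIX ucontext primitives rather than anything from the thesis environment, shows two co-routines passing control back and forth; local state persists across suspensions and execution resumes where it left off, matching characteristics (a) and (b) above:

#include <stdio.h>
#include <ucontext.h>

/* Because every transfer is an explicit swapcontext(), there is
   only a single thread of control, as in Figure 4-1(a). */
static ucontext_t mainCtx, aCtx, bCtx;

static void coroutineA() {
    printf("A: start, resume B\n");
    swapcontext(&aCtx, &bCtx);        /* explicit transfer to B */
    printf("A: resumed, back to main\n");
    swapcontext(&aCtx, &mainCtx);
}

static void coroutineB() {
    printf("B: start, resume A\n");
    swapcontext(&bCtx, &aCtx);        /* explicit transfer back to A */
}

static char stackA[16384], stackB[16384];

int main() {
    getcontext(&aCtx);
    aCtx.uc_stack.ss_sp = stackA;
    aCtx.uc_stack.ss_size = sizeof(stackA);
    aCtx.uc_link = &mainCtx;
    makecontext(&aCtx, coroutineA, 0);

    getcontext(&bCtx);
    bCtx.uc_stack.ss_sp = stackB;
    bCtx.uc_stack.ss_size = sizeof(stackB);
    bCtx.uc_link = &mainCtx;
    makecontext(&bCtx, coroutineB, 0);

    swapcontext(&mainCtx, &aCtx);     /* start co-routine A */
    printf("main: done\n");
    return 0;
}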
The term thread of control, or more simply thread, is not universally agreed upon. The common definition of a thread is a sequence of computing instructions that comprises a program or a process [Powell91]. A thread has a program counter (PC) and a stack to keep track of local variables and return addresses. While co-routines are used to describe the structure of a program, threads define how a program is executed. Therefore, the co-routine concept is often used in the computer language and programming methodology community, while threads are usually mentioned along with computer architectures and operating systems. A program which consists of co-routines may have a single thread of control, or multiple threads. When the control is transferred explicitly from one co-routine to another, there is a single thread of control (Figure 4-1(a)). Otherwise, the co-routines run in multiple threads (Figure 4-1(b)).
Figure 4-1 : Execution Models of Co-Routines - (a) Single Thread, (b) Multiple Threads
A task is a program component with its own thread of control. Because a traditional UNIX
process has a single thread of control, a task has often been used as a synonym for a process. Now that the concept of co-routines is gaining popularity and a single user process may execute as co-routines in multiple threads, the meaning of a task is getting closer to “a co-routine with its own thread of control”, as in [Hansen90] [Buhr90]. A lightweight process (lwp) is “a process that maintains little associated state so that context switches and process creation are relatively inexpensive” [Kepecs85]. It provides a general-purpose facility for supporting more than one thread of control within a single process. The traditional UNIX process model can support multiple threads of control by using one UNIX process per thread. However, UNIX processes have a high run-time cost and space overhead for creation and execution, and the interprocess communication mechanisms are limited and expensive. On the contrary, lwp's exist within a single UNIX process and share the address space of the process. As an lwp does not have its own memory context, it can switch context much faster than a full UNIX process and communicate with other lwp's more efficiently. Because they are “lighter”, lwp's are an attractive means to create and control multiple threads. Thus, threads are often referred to as “lightweight processes” [Powell91].
4.1.2 Libraries for Multiple Threads

Many software tools and libraries have been developed to support concurrency by allowing multiple threads of control. There are two approaches to providing multithreading capability. One is to provide user-level library routines on top of an existing operating system such as UNIX that can be used to create and control multiple threads within a single user process. AWESIME [Grunwald91], μSystem [Buhr90], the NIH Class Library [Gorlen90], the task library [Stroustrup87], and the LWP library [LWP90] fall into this category. The other approach is for the operating system itself to provide facilities that support multiple threads
of control. Solaris 2.0 [Powell91] and POSIX.4a [Gallmeiser91] are examples. μSystem is a library of C routines that provides lightweight concurrency on uniprocessor and multiprocessor computers running the UNIX operating system [Buhr90]. It provides primitives with which concurrent operations are explicitly specified and executed on a virtual multiprocessor architecture. There are two levels of concurrency, co-routines and tasks. A co-routine is a program component whose execution can be suspended and resumed. A task consists of co-routines and has its own thread of control. There is a single thread of control in the co-routines belonging to the same task, and the control is transferred explicitly from one co-routine to another. Tasks are implemented as lightweight processes and execute within a single shared memory. The communication between tasks is done by message passing in a blocking mode using send/receive/reply primitives. Two other important entities, virtual processors and clusters, form a virtual multiprocessor engine on which multiple tasks are executed concurrently. A virtual processor is a ‘software processor’ that executes tasks. It is implemented as a UNIX process that is subsequently scheduled for execution on the actual processor(s) by the underlying operating system. A cluster is a collection of tasks and virtual processors; it provides a run-time environment for their execution. It uses a single-queue multi-server queueing model for the scheduling of its collection of tasks on virtual processors. AWESIME is an object-oriented library written in C++ for parallel programming and process-oriented simulation on computers with a shared address space [Grunwald91]. It is structured as groups of C++ class definitions that manipulate or specify different aspects of a parallel programming environment. An AWESIME application is composed of one or more kinds of threads; each kind of thread may have multiple instances. The
threads are managed by a cpu_multiplexor that multiplexes the CPU on behalf of the threads. The cpu_multiplexor encapsulates different scheduling policies depending on the hardware configuration. Special classes of cpu_multiplexors are provided to build process-oriented simulations. They impose a global simulated-time order over threads, and multiplex threads accordingly. The time order is specified in terms specific to a simulation. Threads can wait for a specified time, or can be blocked on resources and released at a later point in time. Many other facilities are provided to manage the advancement of the global simulated time and events, where events represent threads that are supposed to execute at a particular time. The NIH Class Library (NIHCL) implements abstract data types that have been designed to simplify object-oriented programming using C++ [Gorlen90]. It contains over 60 general-purpose classes, among which are the classes Process, Scheduler, Semaphore, and SharedQueue. They are used for multiprogramming with co-routines, and they support lightweight processes (LWPs) in order to simulate concurrent computation on a serial processor. An instance of a class derived from class Process is created as an LWP, and the class Scheduler performs the scheduling of the LWPs. The scheduling is non-preemptive; that is, the Scheduler cannot intervene in the execution of an LWP, which will continue to be active until it explicitly releases control of the processor. The classes Semaphore and SharedQueue are used for synchronization and communication between the LWPs. The C++ task library is a set of C++ classes from AT&T for co-routine style programming [Stroustrup87]. A task is an object with an associated co-routine. The task library includes a scheduler that enables each task to execute just when it has work to do, and to wait when necessary for whatever is needed.
Programming with tasks is appropriate for simulations, real-time process control, and other applications which are naturally represented as sets of concurrent activities. The task system is particularly useful for writing process-oriented simulations. Tasks execute in a simulated time frame represented by the variable clock, and objects of class timer provide a convenient and efficient facility for using the clock. Concurrent execution of multiple tasks can be simulated by using the simulated time to make the quasi-parallel execution of the co-routines appear actually parallel. The lightweight process (LWP) library from SUN Microsystems provides a mechanism to support multiple threads of control that share a single address space [LWP90]. Lightweight processes are distinguished from ordinary, or heavy-weight, UNIX processes by the fact that all the light-weight processes inside a heavy-weight process share the same address space and system resources such as file descriptors. As a result, communication, synchronization, and context switching are much cheaper for light-weight processes than for heavy-weight ones. The LWP library provides primitives for manipulating threads, as well as for controlling all events (interrupts and traps) on a processor. Scheduling is, by default, priority-based and non-preemptive within a priority. However, sufficient primitives are available that it is possible to write a customized scheduler. The availability of LWPs provides an abstraction well suited to writing programs which react to asynchronous events, such as servers. LWPs are especially useful for simulation programs which model concurrent situations. The availability of multiprocessor hardware and the emphasis on application concurrency have motivated some UNIX-based operating systems to provide facilities for the support of multiple threads within single processes. Solaris 2.0, the recently introduced operating
system from SUN Microsystems, provides a symmetric multiprocessing environment with a multithreaded kernel [Powell91]. It extends the UNIX Application Programming Interface for a multithreaded environment. The threads are intended to be sufficiently lightweight that there can be thousands present and that synchronization and context switching can be accomplished rapidly without entering the kernel. Also, POSIX.4a, the proposed IEEE standard for a “Threads Extension”, specifies facilities for the support of multiple threads of execution within a single process [Posix90]. It proposes facilities for the control of threads, for mutual exclusion and condition variables, and for thread-specific data and thread scheduling. It is expected to become an international standard, and support for it will rapidly become mandatory for commercial systems.
4.2 Thread Library

The current implementation of the multithreaded kernel in the CP domain is based on the SUN lightweight process (LWP) library. The choice was made for the following reasons:
a. the LWP library provides sufficient primitives to create and destroy multiple threads and to build a customized scheduler for controlling them, and
b. it is readily available on SUN workstations, the computing environment for this research.
AWESIME and μSystem are large systems by themselves and carry excessive baggage which is not really needed for our application. NIHCL and the task library from AT&T are not compatible with g++, the C++ compiler from the Free Software Foundation, which is the compiler used in Ptolemy. Furthermore, every multithread library contains machine-dependent code to provide multithreading capability on top of the existing operating system, so that it needs to be ported explicitly to each supported compiler/processor platform.
A drawback of the current implementation is that it is not portable to other machine architectures. Another problem is that the threads currently lack kernel support, so that system calls such as I/O serialize the thread activity. For example, whenever one of the light-weight processes attempts an I/O operation outside the UNIX process, all the other light-weight processes also block. This problem will be solved when the OS kernel provides support for multithreading, as Solaris 2.0 does. The LWP library is a set of C routines, and it is not directly compatible with C++. To resolve the incompatibility, a Thread library has been implemented in Ptolemy [Parks91]. It defines several C++ class objects such as Thread, Condition, and Monitor, which have been built on top of the primitives of the LWP library. Figure 4-2 shows the definition of the Thread class using the LWP library primitives, which are shown in bold letters. A new instance of the Thread class creates a lightweight process which executes the given functionality in its own thread of control. In the rest of this chapter, the term thread will be used in place of lightweight process. The multithreaded kernel of the CP domain has been implemented using this library. The library is also used in several other multi-threaded domains in Ptolemy.
4.3 Thread Scheduling

In the CP domain, all the stars are presumably running concurrently. But the threads do not actually execute concurrently, because they are running in a single-processor environment. A thread continues to execute until it suspends or terminates itself; when the executing thread stops executing, another thread generally resumes execution. Therefore, a scheduler is needed to simulate the concurrent execution of stars using sequentially executing threads. It makes the quasi-parallel execution of multiple threads appear actually parallel by scheduling the execution of the threads in accordance with simulated time.
class Thread : public thread_t {
public:
    Thread(int priority, void (*func)());
    Thread(int priority, void (*func)(void*), void* arg);
    Thread(int priority, void (*func)(void*,void*), void* arg1, void* arg2);
    static Thread& init(int maxPriority, int stackSize, int numStacks = 1);
    ~Thread();
protected:
};

Thread::Thread(int priority, void (*f)()){
    lwp_create(this, f, priority, 0, lwp_newstk(), 0);
}

Thread::Thread(int priority, void (*f)(void*), void* arg){
    lwp_create(this, f, priority, 0, lwp_newstk(), 1, arg);
}

Thread::Thread(int priority, void (*f)(void*,void*), void* arg1, void* arg2){
    lwp_create(this, f, priority, 0, lwp_newstk(), 2, arg1, arg2);
}

Thread::~Thread(){
    lwp_destroy(*this);
}

Thread& Thread::init(int maxPriority, int stackSize, int numStacks){
    pod_setmaxpri(maxPriority);
    lwp_setstkcache(stackSize, numStacks);
    lwp_self(&mainThread);
    lwp_setpri(mainThread, maxPriority);
    return (Thread&)mainThread;
}
Figure 4-2 : Definition of Thread Class
A distributed simulation in a multiprocessing environment, although attractive, has not been investigated in this work. In the LWP library, each thread is created with a certain priority. The scheduling is, by
default, non-preemptive within a priority, and within a priority threads enter the run queue on a FIFO basis. Thus, a thread continues to run until it voluntarily relinquishes control or an event occurs that enables a higher-priority thread. When several threads are created with the same priority, they are queued for execution in the order of creation. This order may not be preserved as threads yield and block within a priority. There is no ancestry among threads: the creator of a thread has no special relation to the threads it created. A customized process scheduler has been built on top of the LWP scheduler. The scheduler is just another process, with higher priority than the stars. It implements a concept of simulated time: a unit of simulated time can represent any amount of real time, and it is possible to compute without consuming any simulated time. The global clock in the scheduler keeps the simulated time. The scheduler maintains the list of threads, and manages the execution of all the threads proceeding in parallel according to the global clock. At the beginning of the simulation the scheduler converts each star into a thread. All the threads are created with the same priority. After that, the scheduler voluntarily relinquishes the CPU and goes to sleep. A thread can be in one of four states: RUNNABLE, RUNNING, BLOCKED, and SUSPENDED (Figure 4-3). As soon as threads are created, they are in the RUNNABLE state. When the scheduler relinquishes the CPU, the RUNNABLE threads execute one at a time until the execution is either BLOCKED or SUSPENDED. The RUNNING thread is always in control, and it cannot be preempted by any other thread. While threads are executing, the scheduler sleeps and the global clock doesn't advance. Hence, the execution of threads doesn't consume any simulated time, and it is considered to happen instantaneously. In that way all the threads appear to execute in parallel with respect to the simulated time.
When there are no more RUNNABLE threads available, the scheduler wakes up and regains the CPU. It fetches the earliest SUSPENDED thread from the waiting list and advances the global clock accordingly. If the global clock is past the final time, the scheduler returns and the simulation is finished. Otherwise, the scheduler resumes all the SUSPENDED threads scheduled at that time slot and goes back to sleep (Figure 4-4).
4.4 Inter-Process Communication (IPC)

4.4.1 Rendezvous Paradigm underneath msgSend()/msgReceive()

CP stars communicate with each other using the msgSend() and msgReceive() primitives. The communication mechanism has been implemented following the rendezvous paradigm: one thread issues a msgSend() and another thread issues the matching msgReceive().
[State-transition diagram: a thread starts RUNNABLE (linked on the scheduler run list) and becomes RUNNING when it gets the CPU; msgSend()/msgReceive() on a full/empty buffer moves it to BLOCKED until the buffer is ready, and waitFor() moves it to SUSPENDED (linked on the scheduler wait list) until its time-out expires.]
Figure 4-3 : State Transition of Threads in the CP domain
Whichever thread gets to the corresponding primitive first waits for the other; hence the term rendezvous. When the rendezvous takes place, both the sender and the receiver get unblocked. The rendezvous paradigm applies not only to unbuffered channels but also to buffered ones: the buffered channel has been implemented as another star which communicates with the sender and the receiver using msgSend() and msgReceive() as well. The LWP library provides primitives such as msg_send(), msg_receive(), and msg_reply() to support the rendezvous paradigm. However, they can only describe the BLOCKing mode of communication; the other communication protocols in the CP domain cannot be expressed elegantly. Furthermore, those primitives are specific to SUN's LWP library and
CPScheduler(stopTime){
    while (currentTime < stopTime){
        // scheduler releases the control
        setSelfPriority(SLEEP);
        // threads start execution
        // ...
        // scheduler regains the control
        setSelfPriority(RUNNING);
        // advance the global clock
        currentTime = getNextTime();
        // resume threads in the waiting list
        for (threads_scheduled_at_"currentTime")
            resumeProcess();
    }
    return;
}
Figure 4-4 : Pseudo-Code for the Thread Scheduler in the CP Domain
they may not be supported in other multithread libraries. This would cause problems when the CP domain is revised to use a more portable thread library, which is planned in the near future to increase portability across architectures. Therefore, the rendezvous paradigm is implemented by explicit transfer of control using suspend() and resume(), which are generally supported primitives for co-routine style programming. Figure 4-5 shows the internal data structure and the pseudo-code which implement the rendezvous paradigm. Because of the symmetric nature of rendezvous, the code for the sender and the receiver is equivalent. The behavior of a sender or a receiver depends on the state of the far side port. A port can be in one of three states: READY, WAIT, or EMPTY. READY means the thread has been blocked on that port since it called either msgSend() or msgReceive(). WAIT means the thread has been waiting on that port by calling waitAll() or waitAny(). Otherwise, the state of the port remains EMPTY. Rendezvous only occurs when the far side port is READY. Then the data is transferred from the source port to the destination, the far side thread gets resumed, and the current thread continues. Otherwise, the current thread sets the port READY and suspends itself; it also resumes the far side thread if it has been WAITing. Because the scheduling is non-preemptive and the control is transferred explicitly by the threads, mutual exclusion mechanisms such as semaphores or monitors are not needed for access to CPGeodesic, the critical section shared by the sender and the receiver.
4.4.2 Timer Object

The rendezvous paradigm described above implements the communication in BLOCK mode, but it also forms the basis upon which to build the other protocols. The Timer object has been introduced for the implementation of the other communication protocols. It is also used to implement other time-related primitives such as TMsgSend() and TWaitAny().
[Data-structure diagram: the Sender's OutCPPort and the Receiver's InCPPort are connected through a CPGeodesic, which keeps sourcePort() and destPort() pointers; each port carries a state field and provides far(), geo(), and parent() accessors.]
OutCPPort::sendData(){
    state = READY;
    farState = far()->getState();
    switch(farState){
    case READY:
        geo()->rendezvous();
        far()->setState(EMPTY);
        far()->parent()->resume();
        break;
    case WAIT:
        far()->parent()->resume();
    default:    // farState == EMPTY
        parent()->suspend();
        break;
    }
    state = EMPTY;
}

InCPPort::receiveData(){
    OutCPPort::sendData();
}

CPGeodesic::rendezvous(){
    srcP = sourcePort()->particle;
    destP = destPort()->particle;
    *destP = *srcP;
}
Figure 4-5 : Implementation of Inter-Process Communication Based on Rendezvous Paradigm
A Timer has a pointer to a thread and the time at which the thread is to be resumed. It is created and added to the scheduler's waiting list when a thread may get blocked but must be resumed after some time limit expires. The Timer should be deleted if the thread gets resumed before it runs out (Figure 4-6). Figure 4-7 shows the implementation of msgSend() using the Timer object. MsgSend() should return immediately when the protocol is OVERWRITE or IGNORE, even when the data transfer cannot be made. A naive implementation might be that msgSend() just returns when the receiving port is not READY.

CPStar::TFoo(double timeout, ...){
    Timer* timer = startTimer(timeout);
    Foo(...);
    if (timeOut == FALSE) deleteTimer(timer);
}
Figure 4-6 : The General Usage of Timer Object
CPStar::msgSend(int msg, OutCPPort* out){
    if (out->protocol == BLOCK)
        out->sendData();
    else {    // OVERWRITE, IGNORE, or PREVIOUS
        Timer* timer = startTimer(0);
        out->sendData();
        if (notTimeOut) deleteTimer(timer);
    }
}
Figure 4-7 : Use of a Timer in the Implementation of msgSend()
There are situations, however, where the naive solution fails. Suppose the receiver is currently RUNNABLE and waiting for the CPU, and it is supposed to execute msgReceive() as soon as it gets RUNNING. Conceptually, msgSend() and msgReceive() happen at the same time, so the communication should succeed. However, the receiving port never gets a chance to become READY until the sender releases the CPU, and the sender returns from msgSend() unsuccessfully after it finds the receiving port not READY. This is a side effect of the quasi-parallel execution of concurrent processes. The use of a Timer provides a solution for this case. MsgSend() first creates a Timer at “currentTime + 0”; the Timer gets scheduled at the head of the waiting list in the scheduler. Then the sender gets blocked, instead of returning immediately, when the receiver is not READY. The result is that the receiver gets the CPU and starts RUNNING. It calls msgReceive(), and the communication succeeds this time. It resumes the sender and keeps on RUNNING until it gets BLOCKED or SUSPENDED. Then the sender retakes control. As it was resumed by the receiver and not by the scheduler, it unschedules the Timer and returns from msgSend() with the communication succeeded. If the receiver is not RUNNABLE at the currentTime, the scheduler wakes up instead. It fetches the Timer from the head of the list, and resumes the thread that initiated the Timer, which is the sender in this case. As the Timer was scheduled at the currentTime, the global clock does not advance. Thus, the sender returns from msgSend() virtually in no time, and the communication is unsuccessful in this case.
4.4.3 Implementation of Queue Star

The buffered channel has been implemented as a separate star which contains a FIFO queue. Its basic functionality is simple: it waits on both its input and output ports. If data arrives at the input port, it is stored into the queue. When there is a data request at the output port, the data is fetched from the head of the queue and sent through the output port. However, the implementation of the queue star is not straightforward, because it should handle the protocols such as OVERWRITE and PREVIOUS as well as the BLOCK mode. It should also handle arbitrary data types, and the size of the queue may be either finite or infinite. The result is a big chunk of code which contains lots of conditionals to handle every possible case:

defstar {
    name { Queue }
    domain { CP }
    defstate {
        name { capacity }
        type { int }
        default { "-1" }
    }
    input { name { in } type { anytype } }
    output { name { out } type { =in } }
    protected {
        Queue buffer;
        int overWrite;
    }
    setup {
        buffer.initialize();
        if (in.far()->protocol == OVERWRITE) overWrite = TRUE;
        else overWrite = FALSE;
    }
    go {
        CPPortHole* farIn = in.far();
        CPPortHole* farOut = out.far();
        while (TRUE){
            int size = buffer.length();
            // receive data if buffer is empty
            if (size == 0){
                in.receiveData();
                buffer.put(...);
                continue;
            }
            // ... conditionals handling the OVERWRITE and PREVIOUS
            // protocols, a finite capacity, and the data request at
            // the output port ...
        }
    }
}

4.5 Other Implementation Issues

4.5.1 waitAll() / waitAny()

int CPStar::waitAll(ports){
    int numPorts;
    while (TRUE){
        numPorts = 0;
        for (each_port){
            farP = port->far();
            if (farP->state == READY){
                numPorts++;
                continue;
            }
            port->state = WAIT;    // wait only on the first not-ready port
            break;
        }
        if (numPorts == number_of_ports) break;
        else suspend();
    }
    return numPorts;
}
Figure 4-8 : Implementation of waitAll()
The implementation of waitAll() shown in Figure 4-8 does not actually wait on all the ports, but only on the last port visited. That is because waitAll() will never return as long as at least one port remains not ready. The implementation of waitAny() is a little more complicated (Figure 4-9). It consists of two similar loops. In the first loop, the ports are visited one by one and marked WAIT while the state of the far side port is checked. If there is no READY port, the thread is suspended. The second loop is executed only when there are some READY ports: either they were identified in the first loop, or they became READY later so that the thread got resumed. The readiness of the far side ports is examined again, and the state of the ports is reset to EMPTY.
int CPStar::waitAny(ports){
    int numPorts = 0;
    for (each_port){
        farP = port->far();
        if (farP->state == READY){
            numPorts++;
            continue;
        }
        port->state = WAIT;
    }
    if (numPorts == 0) suspend();
    // count the ready ports again and reset the port states
    numPorts = 0;
    for (each_port){
        farP = port->far();
        if (farP->state == READY) numPorts++;
        port->state = EMPTY;
    }
    return numPorts;
}
Figure 4-9 : Implementation of waitAny()
4.5.2 Wormhole

The implementation of a wormhole is the toughest part whenever a new domain is introduced into Ptolemy. The basic functionality of a wormhole is that it receives particles at its input ports, transfers them to the internal galaxy, executes the galaxy up to the current time, and sends the results from the galaxy through its output ports. Even though the concept is simple, the actual implementation is quite involved, in order to
make sure that the behavior at the interface complies with the different models of computation. A CP wormhole is similar to other CP stars, and it turns into a separate thread once the simulation begins. Its behavior is described within an infinite loop. First, it waits for inputs. If any inputs arrive, it fires the internal galaxy. When the execution of the internal galaxy is finished, the produced data is sent out through the output ports. If there is a future event inside the galaxy, it is reflected in the scheduling of the wormhole. The tricky part is that the outputs may not be sent immediately if the receiving stars are not ready, in which case the execution of the wormhole would block there, which is not the desired behavior. The technique used here is to send data only to the READY ports; the other ports are just added to a blocked-port list. After all the output data are taken care of in one way or another, the thread goes back to the beginning and waits. But this time it waits not only on the input ports but also on the list of blocked ports. Therefore, the thread resumes if one of the following happens:
a. data arrives at any one of its inputs,
b. the scheduler wakes it up because there is a scheduled event in the internal galaxy, or
c. any output port in the blocked list becomes READY.
In the first two cases, the thread goes through the regular execution. When the thread is resumed due to the output ports, it sends data through the now-READY ports and simply goes back to waiting without executing the internal galaxy (Figure 4-10).
4.6 Examples

The CP domain is designed to describe systems consisting of a fixed number of components, where the topology of the system is statically determined. The biggest merit of the CP domain is that the modeling of a system follows the most natural way the system behaves, so that the description is easy for the designer to write and understandable to others.
void CPWormhole::go(){
    blockedPorts = NULL;
    internalEvent = NULL;
    while (TRUE){
        waitAny(inputPorts, internalEvent, blockedPorts);
        if (output_became_READY){
            for (each_port_in_blockedPorts){
                if (port->far()->state == READY){
                    msgSend(port);
                    deleteFromBlockedPorts(port);
                }
            }
            continue;
        }
        run(galaxy);
        for (each_output_port)
            if (port->newData())
                if (port->far()->state == READY) msgSend(port);
                else blockedPorts->add(port);
        if (galaxy_has_future_event)
            internalEvent->add(futureTime);
    }    // END OF WHILE
}    // END OF GO
Figure 4-10 : Implementation of a CP Wormhole

The following examples are intended to illustrate how a range of practical problems can make use of this environment.
4.6.1 M/M/1 Queue

The first example is the simulation of an M/M/1 queue. Customers arrive at a counter with negative-exponential interarrival times, and the service time for each customer also
follows the same distribution. The simulation measures the average queue delay and the server utilization; it also shows the variation of the queue length over time. Figure 4-11 shows the representation of an M/M/1 queue and the results of a simulation for 100 time units when the mean interarrival time of customers is equal to one and the mean service time is 0.5. The model consists of three blocks - a customer, a queue, and a server - and each block runs as a separate thread. The queue is a standard block provided in the library, so the user provides the models for the customer and the server. The following code for the customer demonstrates the simplicity of modeling in the CP domain. NegativeExpntl in this example is a built-in class in the GNU library for exponentially distributed random numbers:

defstar {
    name { Customer }
    ...
    start {
        expntl = new NegativeExpntl(MEAN);
    }
    go {
        entry = 0;
        while (TRUE){
            waitFor((*expntl)());
            msgSend(currentTime(), out);
            entry++;
        }
    }
    wrapup {
        // display the measured statistics
    }
}
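As a check on the measured results, standard M/M/1 queueing formulas give the values these statistics should approach. With arrival rate $\lambda = 1$ (mean interarrival time 1) and service rate $\mu = 2$ (mean service time 0.5):

\rho = \frac{\lambda}{\mu} = 0.5, \qquad W_q = \frac{\lambda}{\mu(\mu - \lambda)} = \frac{1}{2(2-1)} = 0.5

so the server utilization should approach 0.5 and the average queue delay should approach 0.5 time units over a long simulation.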
/* returns current time in seconds */
double currentTime(){
    ULONG currTime;
    currTime = tickGet();
    return ((double)(currTime / 60.0));
}

/* sleep for given delay */
void waitFor(delay)
int delay;
{
    taskDelay(delay * 60);
}

/* send message */
int msgSend(data, port, size)
char* data;
Port* port;
int size;
{
    int length;
    int reply;

    length = readn(port->sock, (char*)&reply, BYTES_PER_INT);
    if (length < 0){
        perror("recv failed in msgSend\n");
        return (-1);
    }
    if (reply != READY){
        perror("protocol error in msgSend\n");
        return (-1);
    }
    length = writen(port->sock, data, size);
    if (length < 0){
        perror("send failed in msgSend\n");
        return (-1);
    }
    else return length;
}

/* receive message */
int msgReceive(data, port, size)
char* data;
Port* port;
int size;
{
    int length;
    int reply = READY;

    length = writen(port->sock, (char*)&reply, BYTES_PER_INT);
    if (length < 0){
        perror("send failed in msgReceive\n");
        return (-1);
    }
    length = readn(port->sock, data, size);
    if (length < 0){
        perror("recv failed in msgReceive\n");
        return (-1);
    }
    else return length;
}

/* close the socket */
void wrapup(port)
Port* port;
{
    close(port->sock);
    port->sock = NULL;
}
/*
 * read "n" bytes from a descriptor
 */
int readn(fd, ptr, nbytes)
int fd;
char* ptr;
int nbytes;
{
    int nleft, nread;

    nleft = nbytes;
    while (nleft > 0){
        if (nleft > WRTLENG) nread = read(fd, ptr, WRTLENG);
        else nread = read(fd, ptr, nleft);
        if (nread < 0)          /* error, return < 0 */
            return (nread);
        else if (nread == 0)    /* EOF */
            break;
        nleft -= nread;
        ptr += nread;
    }
    return (nbytes - nleft);
}

/*
 * write "n" bytes to a descriptor
 */
int writen(fd, ptr, nbytes)
int fd;
char* ptr;
int nbytes;
{
    int nleft, nwritten;

    nleft = nbytes;
    while (nleft > 0){
        if (nleft > WRTLENG) nwritten = write(fd, ptr, WRTLENG);
        else nwritten = write(fd, ptr, nleft);
        if (nwritten <= 0)      /* error */
            return (nwritten);
        nleft -= nwritten;
        ptr += nwritten;
    }
    return (nbytes - nleft);
}

int open(path, flags, mode)
char* path;
int flags, mode;
{
    IF_Frame frame;
    PH_Msg* msg;

    IF_new(&frame);
    msg = (PH_Msg *)frame.msg.items;
    msg->cmd = PH_COPEN;
    msg->args.open.path = path;
    msg->args.open.flags = flags;
    msg->args.open.mode = mode;
    IO_send(FS_stdHost, &frame);
    errno = msg->status.errno;
    return(msg->status.value);
}

int close(fd)
int fd;
{
    IF_Frame frame;
    PH_Msg* msg;

    IF_new(&frame);
    msg = (PH_Msg *)frame.msg.items;
    msg->cmd = PH_CCLOSE;
    msg->args.close.fd = fd;
    IO_send(FS_stdHost, &frame);
    errno = msg->status.errno;
    return(msg->status.value);
}

int read(fd, buf, nbyte)
int fd;
char* buf;
int nbyte;
{
    IF_Frame frame;
    PH_Msg* msg;
    int i;
    char *tmpBuf, *tmpChan;

    IF_new(&frame);
    msg = (PH_Msg *)frame.msg.items;
    msg->cmd = PH_CREAD;
    msg->args.read.fd = fd;
    msg->args.read.buf = (Ptr)inChannel;
    msg->args.read.nbytes = nbyte;
    frame.buf.maxsize = 512;
    frame.buf.size = nbyte;
    frame.buf.addr = (Ptr)inChannel;
    IO_send(FS_stdHost, &frame);
    tmpBuf = buf;
    tmpChan = inChannel;
    for (i = 0; i < msg->status.value; i++)
        *tmpBuf++ = *tmpChan++;
    errno = msg->status.errno;
    return(msg->status.value);
}

int write(fd, buf, nbyte)
int fd;
char* buf;
int nbyte;
{
    IF_Frame frame;
    PH_Msg* msg;
    int i;
    char *tmpBuf, *tmpChan;
    IF_new(&frame);
    msg = (PH_Msg *)frame.msg.items;
    tmpBuf = buf;
    tmpChan = outChannel;
    for (i = 0; i < nbyte; i++)
        *tmpChan++ = *tmpBuf++;
    msg->cmd = PH_CWRITE;
    msg->args.write.fd = fd;
    msg->args.write.buf = (Ptr)outChannel;
    msg->args.write.nbytes = nbyte;
    frame.buf.maxsize = 512;
    frame.buf.size = nbyte;
    frame.buf.addr = (Ptr)outChannel;
    IO_send(FS_stdHost, &frame);
    errno = msg->status.errno;
    return(msg->status.value);
}

/* utility functions for IPC */

/* initialize a port */
void init(port, dir, server, servPort)
Port* port;
PortDir dir;
unsigned long server;
int servPort;
{
    IF_Frame frame;
    PH_Msg* msg;

    IF_new(&frame);
    msg = (PH_Msg *)frame.msg.items;
    port->dir = dir;
    msg->cmd = PH_CAPP;
    msg->args.app.args[0] = PT_INIT;
    msg->args.app.args[1] = (Int)server;
    msg->args.app.args[3] = (Int)servPort;
    IO_send(FS_stdHost, &frame);
    errno = msg->status.errno;
    port->sock = msg->status.value;
    return;
}
/* sleep for given seconds */
void waitFor(delay)
int delay;
{
    KT_sleep(1000 * delay);
}

/* return current time in seconds */
double currentTime(){
    IF_Frame frame;
    PH_Msg* msg;

    IF_new(&frame);
    msg = (PH_Msg *)frame.msg.items;
    msg->cmd = PH_CAPP;
    msg->args.app.args[0] = PT_TIME;
    IO_send(FS_stdHost, &frame);
    errno = msg->status.errno;
    return((double)msg->status.value);
}

/* close the connection */
void wrapup(port)
Port* port;
{
    IF_Frame frame;
    PH_Msg* msg;

    IF_new(&frame);
    msg = (PH_Msg *)frame.msg.items;
    msg->cmd = PH_CAPP;
    msg->args.app.args[0] = PT_WRAP;
    msg->args.app.args[1] = port->sock;
    IO_send(FS_stdHost, &frame);
    errno = msg->status.errno;
    return;
}
int msgSend(data, port, size)
char* data;
Port* port;
int size;
{
    int length, ready;
    char* reply;

    reply = (char*)malloc(BYTES_PER_INT);
    length = readn(port->sock, reply, BYTES_PER_INT);
    if (length < 0){
        printf("readn failed in msgSend\n");
        return -1;
    }
    str2int(reply, &ready);
    if (ready != READY){
        printf("protocol error in msgSend\n");
        return -1;
    }
    length = writen(port->sock, data, size);
    if (length < 0){
        printf("writen failed in msgSend\n");
        return -1;
    }
    return length;
}
int msgReceive(data, port, size)
char* data;
Port* port;
int size;
{
    int length;
    int ready = READY;
    char* reply;

    reply = (char*)malloc(BYTES_PER_INT);
    int2str(&ready, reply);
    length = writen(port->sock, reply, BYTES_PER_INT);
    if (length < 0){
        printf("writen failed in msgReceive\n");
        return -1;
    }
    length = readn(port->sock, data, size);
    if (length < 0){
        printf("readn failed in msgReceive\n");
        return -1;
    }
    return length;
}

void int2str(src, dest)
int* src;
char* dest;
{
    char* tmp;
    int tmp2, i;

    tmp = dest;
    for (i = 0; i < 4; i++){
        tmp2 = *src;
        *tmp++ = 0x00ff & (tmp2 >> (8 * (3 - i)));
    }
}

void str2int(src, dest)
char* src;
int* dest;
{
    char* tmp;
    int i;

    *dest = 0x0;
    tmp = src;
    for (i = 0; i < 4; i++){
        *dest = (*dest << 8) | (0x00ff & *tmp++);
    }
}

/*
 * read "n" bytes from a descriptor
 */
int readn(fd, ptr, nbytes)
int fd;
char* ptr;
int nbytes;
{
    int nleft, nread;

    nleft = nbytes;
    while (nleft > 0){
        if (nleft > 512) nread = read(fd, ptr, 512);
        else nread = read(fd, ptr, nleft);
        if (nread < 0)          /* error, return < 0 */
            return (nread);
        else if (nread == 0)    /* EOF */
            break;
        nleft -= nread;
        ptr += nread;
    }
    return (nbytes - nleft);
}

/*
 * write "n" bytes to a descriptor
 */
int writen(fd, ptr, nbytes)
int fd;
char* ptr;
int nbytes;
{
    int nleft, nwritten;

    nleft = nbytes;
    while (nleft > 0){
        if (nleft > 512) nwritten = write(fd, ptr, 512);
        else nwritten = write(fd, ptr, nleft);
        if (nwritten <= 0)      /* error */
            return (nwritten);
        nleft -= nwritten;
        ptr += nwritten;
    }
    return (nbytes - nleft);
}