2001 Society for Design and Process Science Printed in the United States of America

PERFORMANCE PREDICTION AND VERIFICATION ENVIRONMENT FOR SUPER-INTEGRATED DATA-DRIVEN PROCESSORS: RESCUE

Yasuhiro Wabiko and Hiroaki Nishikawa
Institute of Information Sciences and Electronics, University of Tsukuba, Tsukuba Science City, Japan

In order for a multi-media networking environment to meet future infrastructure needs, it must satisfy not only its functional specification but also performance specifications such as turnaround time and data flow rate. The authors have focused on the fact that the data-driven schema (DDS), a natural notation for describing parallel processing, has been broadly utilized as a specification tool while also serving as a representation of executable programs on data-driven processors. By adopting DDS, an integrated development environment can be realized that supports the description of both functional and performance specifications and produces executable programs that take engineering constraints into account. This is possible because DDS can represent engineering constraints such as pipeline configuration; furthermore, the turnaround time of a data-driven program can be shown directly and explicitly on the DDS representing the program. Since verification of a performance specification presumes that the functional specification has already been verified, the authors first studied a reuse-oriented specification environment to support side-effect detection for programs generated from functional specifications described in DDS. Based on this environment, the authors are currently developing a performance prediction and verification environment named RESCUE: the Realtime Execution System for CUE (Coordinating Users' requirements and Engineering constraints)-series data-driven processors. This paper first describes DDS-based specification of both user's requirements and engineering constraints. The user's requirements include algorithm, data structure, turnaround time, and data flow rate; engineering constraints include pipeline configuration, the instruction set supported by the processor, and so on. Implementations for performance prediction and verification follow. The paper then presents an interactive optimization scheme for program allocation to multi-processors, as a tuning method for when the user's requirement for turnaround time is not met. Incorporation of audio media into the future user/system interface of RESCUE is also discussed, since visual description and display of schemas alone are not sufficient, and audio media is essential to realize a truly interactive environment.

Transactions of the SDPS, DECEMBER 2001, Vol. 5, No. 4, pp. 55-76

1. Introduction

Much research aimed at parallel/distributed systems for building multi-media networking environments as future infrastructures has been carried out (Mampaey, 2000; Wade and Lewis, 1999; Nishikawa, 1997). The authors have carried out the CUE project to study effective software/hardware realization of a multi-media networking environment. The CUE project focuses particularly on super-integrated dynamic data-driven processors to achieve the effective multi-processing system that is essential in realizing a multi-media networking environment. The single-chip super-integrated dynamic data-driven processors CUE-p (CUE-prototype) and CUE-v1 (CUE-version 1) have been developed (Nishikawa and Miyata, 1998). One of the authors has studied data-driven implementation of protocol handling on CUE processors, including protocols such as TCP/IP (Transmission Control Protocol / Internet Protocol) and IIOP/GIOP (Internet Inter-ORB Protocol / General Inter-ORB Protocol). It has been reported that the protocol handling program can execute ideal multi-processing without any runtime overhead such as priority control, where each stream is inputted at the maximum effective throughput of OC-3 ATM (Optical Carrier 3 Asynchronous Transfer Mode) (Nishikawa and Aoki, 1998). One of the authors has also been studying data-driven implementation of media processing such as audio compression and moving-picture compression.

In a software lifetime, reuse is more dominant than new development from scratch. With this in mind, the authors have studied a reuse-oriented specification environment (Nishikawa and Wabiko, 1998) that assists the user in detecting unexpected side-effects when software components are reused with modifications (see Section 2.1 for the definition of side-effects). Reports by the authors indicate that this environment is effective for verification (Hayashi and Igarashi, 1976; Wallace et al., 1989). However, considering that multi-media processing will be a dominant target application area for information systems, reusing software should avoid side-effects not only on well-behavedness (Dennis, 1972) but also on performance requirements, including realtime constraints.
At the same time, user/system interaction is crucial in an interactive environment. To realize a truly interactive environment, audio, including voice messages, is essential, since we rely not only on visual information but also on audio in everyday communication.

This paper proposes a performance prediction and verification environment to support developing realtime applications such as protocol handling and media processing on CUE processors. The authors first show that reducing the critical path in a program is essential for satisfying time-constraint requirements on CUE processors, as long as hardware resources are not a problem. The term "critical path" in this paper means the path (or paths) in a DDS that takes the longest execution time, consisting of processing time in primitive operations and communication time between two primitive operations. A method of specification covering both user's requirements and engineering constraints is then proposed. User's requirements include not only the functional specification but also performance specifications such as turnaround time and data flow rate. Engineering constraints include pipeline organization: at its lowest level, pipeline organization describes the connections between pipeline stages in the processor; at its highest level, it describes inter-chip connections in a multi-chip configuration, or inter-board connections in a multi-board system. The implementation of performance prediction and verification is then described, together with an algorithm for finding the critical path in programs. The prototype of our interactive optimization facility for turnaround time is then described, followed by preliminary evaluation results which demonstrate the ease of tuning that RESCUE has to offer. Lastly, user/system interaction via audio is discussed, based on a multi-modal user interface whose realization by a pattern-matching method is described.

2. Data-Driven Realtime Processing System and its Development Environment

2.1. Specification and Prototyping Environment for Data-Driven Processor Systems

The authors have developed a prototype of a reuse-oriented data-driven specification environment to support detection of unexpected side-effects in reusing software components. The authors argue that side-effect-free execution of a program should be defined in terms of the program specification, not as the existence of determinism in executing the program (Nishikawa and Wabiko, 1998). In this paper, side-effect-free execution is defined as follows:

Def. Let S be a DDS which has m input arcs and n output arcs, and let M0 be a marking of S whose internal initial tokens and memory satisfy the specification. Let S' be S with input and output arcs added. Execution of schema S is side-effect free for M0 and its input tokens if either of the following two cases holds: 1) after execution of a finite sequence of S' starting from M0, the state of S returns to a marking M1 which satisfies the specification; or 2) each infinite execution sequence itself satisfies the specification.

The authors have pointed out that the difficulties in reusing software components derive from the difficulty of detecting side-effects in programs from their functional specifications, which is caused by the gap between a declarative specification and the corresponding procedural program executed on conventional sequential processors. To fundamentally resolve this problem, a single representation for both specifications and programs is needed. The authors have focused on the fact that the data-driven schema (DDS) (Dennis, 1972), which is the program representation on data-driven processors, has been widely utilized as a specification tool; SADT (Ross, 1977) is one such example. In this study, DDS has been adopted as both the specification tool and the program representation. To directly reflect side-effects in the generated program back onto its specification, consistency between the specification and the program is maintained with respect to data-dependence. This is made possible by directly generating programs from a specification via the data-dependence that is required in both specifications and programs; unlike most conventional programming languages, there is no compilation in generating the program. The specification environment supports a prototyping method which explicitly shows the result of prototyping on the specification whenever a change is made in a generated program. The prototyping method is based on symbolic execution (Tamai and Fukunaga, 1982) to provide comprehensive testing over the domain of the data. Furthermore, the specification environment supports restructuring a hierarchical specification, both to maintain the understandability of the specification and to free users from conventional top-down or bottom-up constraints in building the hierarchy. The specification environment has been applied to actual software development, and its ability to support side-effect detection was evaluated (Nishikawa and Wabiko, 1998).

As a result, some side-effects that prevented the program from executing were detected. Furthermore, incorrectness in the sense that the specification did not correctly reflect the user's requirements was also detected, despite the fact that the generated program was executable; conventionally, such incorrectness has been considered difficult to detect. This evaluation showed that the specification environment has not only side-effect detection capability but also verification capability (Hayashi and Igarashi, 1976; Wallace et al., 1989).

2.2. Super-Integrated Data-Driven Processors: CUE

In super-integrating multiple processors onto a limited chip area, heterogeneous processors have an advantage over homogeneous ones, because a multi-processor system comprised of heterogeneous processors can achieve higher overall functionality than one consisting only of homogeneous processors. In particular, the authors have focused on VLSI super-integration of heterogeneous processing elements (PEs). A super-integrated dynamic data-driven processor, CUE-p (CUE-prototype) (Nishikawa and Miyata, 1998), has been realized. As shown in Fig.1, CUE-p consists of 2 super-integrated data-driven processors (DDPs). 4 heterogeneous PEs (named INT, GNT, TBL and MUL) and a router are super-integrated in each DDP. INT (Integer & Logical Operation), GNT (Generation Manipulation), TBL (Lookup Table Reference),

and MUL (Multiplier & Static Accumulator) each have a different instruction set. Thus, CUE-p has 8 heterogeneous PEs and 2 routers in total. Each PE consists of 5 functional modules: Joint (J) for merging 4 input ports, Firing Control (FC) for data-driven execution, Functional Processor (FP) for executing operations, Program Storage (PS), and Branch (B) for distributing packets to 4 output ports.

Fig.1 Super-integrated data-driven processor: CUE-p. (a) Overview of the CUE-p board; (b) block diagram of CUE-p (DDP: super-integrated Data-Driven Processor; INT: Integer & Logical Operation; GNT: Generation Manipulation; TBL: Lookup Table Reference; MUL: Multiplier & Static Accumulator); (c) block diagram of a PE, a circular elastic pipeline (J: Joint; FC: Firing Control; FP: Functional Processor; B: Branch; PS: Program Storage).


Fig. 2 Elastic pipeline (latch, combinatorial logic circuit, self-timed transfer control circuit, handshaking; Clk: clock).

As shown in Fig.2, the elastic pipeline makes up each PE and executes self-timed data transfer by local handshaking (Nishikawa, Terada et al., 1987); i.e., each data packet in the pipeline flows autonomously. Unlike conventional processors, the circular elastic pipeline does not depend on a central system clock. The self-timed data transfer mechanism makes it possible to join two elastic pipelines, and the elastic pipeline also realizes the routers that connect PEs. Furthermore, inter-chip communication is likewise realized by local handshaking between two routers. Thus, a CUE-p multi-processor system is organized from multiple pipelines while fully maintaining their elastic nature as a whole. The instruction set of the CUE processors is designed to make them stream-oriented. It has been reported that a multi-processor system comprised of DDPs has sufficient performance to decode MUSE, the Japanese high-definition standard for moving pictures, in realtime (Yoshida et al., 1995). Furthermore, CUE processors have multi-processing capability without any context-switching overhead, which derives from the elastic pipeline and the data-driven principle: there is no interference among independent processes running concurrently on a CUE processor as long as hardware resource requirements are satisfied. It has already been reported that CUE-p can realize TCP/IP (Transmission Control Protocol / Internet Protocol) protocol handling at 135.6 Mbit/s, the maximum data flow rate of OC-3 ATM (Optical Carrier 3 Asynchronous Transfer Mode) (Nishikawa and Aoki, 1998). Based on these results, CUE-v1 (CUE-version 1) has been developed, in which 12 PEs and 4 routers are super-integrated. One of the authors has realized a data-driven implementation of the IIOP (Internet Inter-ORB Protocol) / GIOP (General Inter-ORB Protocol) (Object Management Group, 1998) protocol handling used in an ORB (Object Request Broker) on CUE-v1.
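The self-timed transfer described above can be caricatured in software. The following is a minimal sketch, assuming a simple list-of-latches model of our own rather than the actual CUE circuitry: a stage transfers its packet exactly when the downstream latch is empty, with no global clock.

```python
# A minimal behavioural sketch (our own model, not the CUE hardware) of a
# self-timed "elastic" pipeline: each stage forwards its packet only when
# the downstream latch is empty, a purely local handshake with no global
# clock.  Stage count and packet labels are illustrative.

def settle(stages):
    """One settling pass: scanning from the output end lets every packet
    whose downstream latch is free advance by one stage (a wave of
    simultaneous local handshakes)."""
    for i in range(len(stages) - 1, 0, -1):
        if stages[i] is None and stages[i - 1] is not None:
            stages[i], stages[i - 1] = stages[i - 1], None  # handshake transfer

def run(packets, n_stages):
    """Feed packets into stage 0 whenever its latch is free; collect the
    packets that leave the output end, in order."""
    stages = [None] * n_stages
    pending, out = list(packets), []
    while pending or any(s is not None for s in stages):
        if stages[-1] is not None:         # packet leaves the pipeline
            out.append(stages[-1])
            stages[-1] = None
        settle(stages)
        if pending and stages[0] is None:  # accept the next input packet
            stages[0] = pending.pop(0)
    return out

print(run(["p0", "p1", "p2"], 4))  # ['p0', 'p1', 'p2']
```

The point of the sketch is that packet movement depends only on the neighbouring latch, which is why two such pipelines, or a pipeline and a router, can be joined without any global coordination.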
One of the authors is currently studying realtime media processing such as audio compression and moving-picture compression.

2.3. Performance Prediction and Verification Facility for Data-Driven Realtime Systems

In realtime processing, execution free from side-effects is not enough; it is essential to assure performance requirements such as data flow rate and turnaround time. In multi-processor systems, the performance of an executing program depends on how the program is allocated to the multi-processors. The optimization problem of program allocation to a multi-processor system is known to be NP-complete or NP-hard; in general, no algorithm can solve the problem at practical computation cost. The authors have therefore proposed a method in which a user interactively optimizes an initial allocation generated by a genetic algorithm (GA). The GA-based allocation method has been realized by one of the authors and applied to actual software (Nishikawa, Ishii et al., 1998). As a result, it has been shown that the GA-based method could achieve throughput as high as the user-optimized program, but could not achieve a turnaround time shorter than those optimized by the user.


In a CUE processor system, because of the CUE processor's multi-processing capability without context-switching overhead, the critical path of a program and its corresponding pipeline stages determine the turnaround time of a process executing that program, assuming hardware resource requirements are met. In other words, the turnaround time of a process running on a CUE processor can be accurately predicted by adding up the packet-transfer delay at every pipeline stage that packets on the critical path pass through, regardless of the number of independent processes running simultaneously. To verify that hardware resources are not a problem in executing a program on a CUE system, pipeline-level emulation that can reproduce the handshaking between pipeline stages is required. But this emulation is impossible if the user specifies a requirement excessive for the CUE system under consideration; thus, a facility for predicting processor load is crucial to ensure that the user's input requirement is not so excessive as to cause the pipelines in the processors to overflow. The load of a CUE processor is determined by its pipeline occupancy. The prototyping facility in RESCUE therefore consists of three levels: (1) symbolic execution to avoid unexpected side-effects as well as incorrectness in the specification itself; (2) performance prediction of the turnaround time and pipeline occupancy of the generated program; and (3) performance verification using pipeline-level emulation to ensure that hardware resources are sufficient to execute the program. The side-effect detection support using symbolic execution and the pipeline-level emulator have already been discussed in previous papers (Nishikawa and Wabiko, 1998; Urata and Nishikawa, 1999). This paper therefore focuses on the performance prediction and verification facilities.

3. Implementation of Performance Prediction/Verification in RESCUE

3.1. Overview of RESCUE

Fig.3 and Fig.4 show the schematic overview of RESCUE. As shown in Fig.3, RESCUE consists of 4 facilities: symbolic execution, performance prediction, performance verification and pipeline-level emulation. The user describes a specification that consists of data-dependence, data structure, program allocation, data flow rate and pipeline configuration. Whenever a part of the program generated from the corresponding specification changes in response to the user's operation, RESCUE shows the user the result of symbolic execution. At the same time, RESCUE generates emulation information from the specification and shows the result of performance prediction based on that information. RESCUE also sends the information to the pipeline-level CUE processor emulator and shows the result of performance verification using the trace log of the emulation. Fig.4 shows the system organization of RESCUE. The specification environment is currently implemented on PCs, because no GUI (Graphical User Interface) toolkit is available on CUE processors; it is implemented in the interpreted language Perl with the Perl/Tk GUI toolkit. The pipeline-level CUE processor emulator is implemented on CUE-v1, because fine-grained parallel processing is required to emulate CUE processors effectively (Urata and Nishikawa, 1999). At the logical level, the specification environment and the pipeline-level CUE processor emulator communicate with each other via the TCP/IP protocol: on the PC side, the TCP/IP protocol stack of a conventional UNIX OS is used; on the CUE-v1 side, the data-driven implementation of TCP/IP studied by one of the authors, Hiroaki Nishikawa (Nishikawa and Aoki, 1998), is used. At the physical level, the specification environment and CUE-v1 are connected by Fast Ethernet or ATM (Asynchronous Transfer Mode).


Fig.3 Schematic overview of RESCUE. The specification environment (on a PC) receives from users a specification consisting of data-dependence, data structure, program allocation, data flow rate and pipeline configuration, and returns the results of symbolic execution, performance prediction and performance verification. It sends emulation information (input stream, packet-flow control information, pipeline configuration) over the network to the pipeline-level processor emulator (on CUE-v1), which returns a trace log.

Fig.4 System organization of RESCUE. The specification environment on a PC and the pipeline-level emulator on the CUE-v1 system (a CUE-v1 board plus a network I/F board) exchange requests for emulation and trace logs via TCP/IP over an OC-3 ATM / Fast Ethernet network.


Fig. 5 Elementary notations in DDS: (a) block (name, data-driven sub-schema, input/output ports); (b) node; (c) source; (d) sink; (e) arc.

3.2. Performance Specification for Turnaround Time and Data Flow Rate

Fig.5 shows the elementary notations in DDS. DDS is very similar to a dataflow diagram (Dennis, 1972). DDS consists of "blocks" (Fig.5 (a)) that describe a specification or a part of one. Each block contains "nodes" (Fig.5 (b)) describing functional elements, and each node has "input/output ports". Each block also has one or more "sources" (Fig.5 (c)) and "sinks" (Fig.5 (d)) to represent input from and output to the outside. An "arc" (Fig.5 (e)) represents data-dependence between a source/output port and a sink/input port, respectively. A hierarchical specification is described by assigning the input/output ports of a node to the sources/sinks of a block that contains a more detailed specification of that node. A node in a block at the lowest level of the hierarchy corresponds to a primitive instruction that realizes the functionality of the node. Fig.6 (b) shows an example: this specification describes part of the receive module of the IP protocol. The block splits an "IP datagram" into an "IP Header" and an "IP Body", and then splits the "IP Header" into the first 8 bytes ("IP Header #0-#7") and the rest ("IP Header #8-#19"). "Header Length" and "Split Position" are constants, described as 20 and 8, respectively. Fig.6 also shows an example of a specification with requirements for turnaround time and data flow rate. Turnaround time is the interval between the time at which a pair of input tokens is inputted and the time at which a pair of output tokens is generated. Usually what the user wants is to obtain an output stream at (or by) a particular time after the input stream is fed to a block. Thus the most direct way of specifying turnaround time is to describe the turnaround time itself on the sink, as the delay from the time at which the input token that generates the output stream was inputted to the corresponding source.
In the same way, since the data flow rate of an input/output stream should be defined on the stream itself, RESCUE enables users to specify it directly. Fig.6 (a) shows a specification example for the hierarchical data structure of a stream and its data flow rate, described as 135 [Mbit/s]. Fig.6 (b) shows that the turnaround time of this block is required to be 2000 [nsec].

3.3. Specification for Engineering Constraints

RESCUE supports specification of pipeline organization as an engineering constraint. This specification can be detailed down to the inter-stage connection level. Fig.7 (a)-(c) show an example: Fig.7 (a) shows the inter-stage connection of the functional module Joint, Fig.7 (b) shows the specification of a PE consisting of several functional modules, and Fig.7 (c) shows the specification of CUE-v1, which super-integrates 12 PEs and 4 routers. VM (Video Memory & Complex Arithmetic Logical Operation) has external or internal video memory and complex arithmetic-logical instructions that utilize that memory. SUM (Sum Operation) has sum instructions.
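To make the structure of such a specification concrete, here is a hypothetical sketch of how blocks, nodes, arcs and the performance annotations of Fig.6 might be represented; every class and field name is our own invention for illustration, not RESCUE's actual data model.

```python
# Hypothetical sketch of a DDS specification with performance annotations
# (Section 3.2).  All class and field names are ours, not RESCUE's.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    node_id: str
    primitive: Optional[str] = None       # lowest-level nodes map to an instruction
    sub_block: Optional["Block"] = None   # or to a more detailed block

@dataclass
class Arc:
    arc_id: str
    src: str                                  # sourceID or outputportID
    dst: str                                  # sinkID or inputportID
    flow_rate_mbit_s: Optional[float] = None  # data flow rate of a stream

@dataclass
class Block:
    name: str
    nodes: list = field(default_factory=list)
    arcs: list = field(default_factory=list)
    turnaround_ns: Optional[float] = None     # required turnaround time at the sink

# The IP-header example of Fig.6, roughly:
ip_recv = Block(
    name="IP receive",
    nodes=[Node("n_split_hdr", primitive="extract_ip_header"),
           Node("n_split_8b", primitive="extract_first_8_bytes")],
    arcs=[Arc("a_in", "src_datagram", "n_split_hdr", flow_rate_mbit_s=135.0),
          Arc("a_hdr", "n_split_hdr", "n_split_8b")],
    turnaround_ns=2000.0,
)
print(ip_recv.turnaround_ns)  # 2000.0
```

The design mirrors the text: flow rate is attached to the stream (arc), while the turnaround-time requirement is attached to the block whose sink produces the output.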




Fig.6 Performance specification with turnaround time and data flow rate. (a) A screen shot of the specification of data structure and data flow rate; (b) a screen shot of the specification of data-dependence and time constraints (turnaround time).

3.4. Implementation of Performance Prediction

3.4.1. Predicting Turnaround Time and Critical Path

To explicitly show communication overheads on the specification, processing time and communication time are defined in RESCUE as follows.


Fig.7 Specification for engineering constraints. (a) Specification of a functional module, Joint: inter-stage connections (J: Joint; B: Branch; FC: Firing Control; FP: Functional Processor; PS: Program Storage); (b) specification of a PE, INT, as a set of functional modules; (c) specification of the chip CUE-v1 as PEs, routers and functional modules (INT: Integer & Logical Operation; GNT: Generation Manipulation; TBL: Lookup Table Reference; MUL: Multiplier & Static Accumulator; VM: Video Memory & Complex Arithmetic Logical Operation; SUM: Sum Operation).

(Fig. 8 shows processing time, communication time and turnaround time along the pipeline stages: (b) the PE to which the "Extract IP Header" node is allocated; (c) the PE to which the "Extract First 8 Bytes" node is allocated. J: Joint; B: Branch; FC: Firing Control; FP: Functional Processor; PS: Program Storage.)
Fig. 8 Predicting turnaround time.

As shown in Fig.8, the communication time between two nodes in the specification is derived from the path traveled from the previous PS (Program Storage) to the next FC (Firing Control). The authors define the communication time on an arc as the difference between the time at which a packet reaches the first stage of Branch and the time at which the packet reaches the first stage of FC. In the same way, the authors define the processing time in a node as the difference between the time at which a packet reaches the first stage of FC and the time at which the packet reaches the first stage of Branch. Both the processing time and the communication time are predicted as the sum of the average times a packet spends passing through each pipeline stage. RESCUE eagerly predicts the processing time of nodes and the communication time of arcs, and displays this information on the specification. To help users reduce the turnaround time of the critical path in the program, RESCUE also visualizes the critical path and the turnaround time on the specification. Fig.8 shows such an example. The nodes named "Extract IP Header" and "Extract First 8 Bytes" are each realized by a primitive


instruction, and are allocated to different PEs. In these PEs of CUE-v1, the numbers of pipeline stages in Joint, FC, FP, PS, and Branch are 4, 11, 8, 1, and 2, respectively. The average transfer time per pipeline stage in these PEs is approximately 4.17 [nsec]. Therefore the processing time of each node is (11 + 8 + 1) × 4.17 ≈ 83 [nsec]. Similarly, the "IP Header" arc corresponds to 33 pipeline stages (2 in Branch, 27 in inter-PE communication, and 4 in Joint), so its communication time is 33 × 4.17 ≈ 137 [nsec]. As a result, the actual turnaround time is estimated at 1209 [nsec]. Since the turnaround time required by the user is 2000 [nsec], it is predicted that the program will satisfy the user's requirement. The critical path of this block is visualized by bold arcs. RESCUE also displays the ideal turnaround time, defined as the total processing time alone. Since no program allocation can result in a turnaround time shorter than the ideal turnaround time as long as the processor architecture is unchanged, the ideal turnaround time is used to check that the user's requirement is not excessive for the current processor architecture. In Fig.8, the ideal turnaround time for "IP Header #0-#7" is estimated at 166 [nsec]. The next section describes how to find the critical path using these prediction results.
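The per-stage arithmetic above can be reproduced directly; the stage counts and the roughly 4.17 ns per-stage transfer time are the figures quoted in the text, while the helper name is our own.

```python
# Reproducing the per-stage arithmetic of Section 3.4.1: a predicted time
# is (number of pipeline stages traversed) x (average transfer time per
# stage, ~4.17 ns in these CUE-v1 PEs).  Helper name is ours.
T_STAGE_NS = 4.17

def predicted_time_ns(n_stages):
    return n_stages * T_STAGE_NS

# Processing time of a node: FC (11) + FP (8) + PS (1) = 20 stages.
proc = predicted_time_ns(11 + 8 + 1)
# Communication time of the "IP Header" arc:
# Branch (2) + inter-PE path (27) + Joint (4) = 33 stages.
comm = predicted_time_ns(2 + 27 + 4)
print(int(proc), int(comm))  # 83 137  (truncated, matching the text)
```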

An Algorithm for Finding the Critical Path

This algorithm finds a critical path of a block. It assumes that:

• All turnaround times of nodes and arcs in the block have already been predicted.

In RESCUE, a default primitive (currently a binomial addition operator) is assigned to any node to which neither a primitive nor a block is assigned, and all nodes that have not been allocated to a PE are allocated to a default PE. Therefore the assumption always holds, since the turnaround times of nodes and arcs can always be predicted. Every entity in a DDS (sources, sinks, nodes, input/output ports and arcs) has a unique identifier: sourceID, sinkID, nodeID, inputportID, outputportID, and arcID. A route is a path from a sink to a source; the critical path is one (or more) of all the routes in the block. In searching for the critical path, each route keeps the following information:

• routeID : unique identifier of this route.
• nextArc : arcID to be visited next.
• tat : current turnaround time.
• idealTat : current ideal turnaround time.
• path : a linear list containing a sinkID, zero or more arcIDs and a sourceID, representing this route.

(Step 0) First, some variables are initialized as follows:

• maxTat : current maximum turnaround time in this block. Initially 0.
• maxIdealTat : current maximum ideal turnaround time in this block. Initially 0.
• crPath : routeID of the route which is currently the critical path of this block. Initially null.
• toDo : a FIFO of routeIDs of routes whose path is incomplete.

Then, for each arc connected to a sink in the block:

• Create a new routeID.
• Initialize the information for that routeID:
  - Let nextArc be the arcID of the arc.
  - Let tat be 0.
  - Let idealTat be 0.
  - Let path be a linear list containing the sinkID of the sink and the arcID of the arc.


• Append the routeID to toDo.

(Step 1)

• If toDo is not empty, take the first routeID from toDo and let curRoute (the current route) be it. If toDo is empty, go to Step 3.

(Step 2)

• Add the predicted turnaround time of nextArc of curRoute to tat if:
  - nextArc is connected to neither a source nor a sink, or
  - nextArc is connected to a source or a sink and lies in the topmost block of the hierarchical specification.



• If nextArc is connected to a source, the route is complete. If tat > maxTat, do the following:
  - Let maxTat be tat.
  - Let maxIdealTat be idealTat.
  - Let crPath be curRoute.
  In either case, go to Step 1.

• Add the predicted turnaround time of the node at which nextArc starts to tat (and to idealTat, since the ideal turnaround time counts processing time only).
• Let arcList be the list of arcIDs of the arcs connected to the input ports of that node. If there are two or more arcs, take all arcs but one out of arcList and, for each such arcID:
  - Create a new routeID.
  - Copy tat, idealTat and path of curRoute to the new routeID.
  - Let its nextArc be the arcID.
  - Append the routeID to toDo.
• Let nextArc of curRoute be the remaining arcID in arcList. Go to Step 2.

(Step 3) Finished. The current crPath is the critical path of this block, and the current maxTat and maxIdealTat are the actual and ideal turnaround times of the critical path.

An example is shown in Fig.9, where the predicted processing times of nodes and communication times of arcs are omitted for simplicity. In the example program shown in Fig.9 (a), s0-s4, k0-k3, n0-n3, i0-i7, o0-o6 and a0-a11 are sourceIDs, sinkIDs, nodeIDs, inputportIDs, outputportIDs and arcIDs, respectively. First, by (Step 0), toDo, routeID, path and nextArc are initialized as shown in Fig.9 (b)(1). Then (Step 1) lets curRoute be r0. The node from which nextArc a6 of curRoute r0 departs has two input ports; therefore, in (Step 2), a new route r4 is created, and its path and nextArc are generated as shown in Fig.9 (b)(2). The search for the complete path of route r4 remains pending. By the next (Step 2), route r0 reaches source s3 and its path is completed, as shown in Fig.9 (b)(3). Then the next (Step 1) lets curRoute be r1. In this way, the algorithm continues until all paths are determined.

3.4.2. Predicting Pipeline Occupancy of Processors

This prediction scheme for pipeline occupancy assumes a constant data flow rate and a constant number of input packets for every input port of every primitive in the program. The data flow rate and number of packets are given in the specification as a typical case.
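The route-splitting search of Steps 0-3 can be condensed into the following sketch. The graph encoding is our own; the hierarchy rule of Step 2 and the idealTat bookkeeping are omitted for brevity, and routes are re-queued on the FIFO instead of looping back to Step 2, which visits the same arcs in a different order but finds the same maximum.

```python
# A condensed sketch of the critical-path search (Steps 0-3), under our
# own graph encoding; hierarchy handling and idealTat are omitted.
from collections import deque

def find_critical_path(sink_arcs, arcs, nodes, sources):
    """sink_arcs: list of (sinkID, arcID of the arc feeding that sink)
    arcs:  arcID -> (tailID, communication time)  [tail: source or node]
    nodes: nodeID -> (processing time, [input arcIDs])
    sources: set of sourceIDs
    Returns (maxTat, path) over all sink-to-source routes."""
    max_tat, cr_path = 0, None
    to_do = deque((0, [k, a], a) for k, a in sink_arcs)  # Step 0
    while to_do:                                         # Step 1
        tat, path, next_arc = to_do.popleft()
        tail, comm = arcs[next_arc]                      # Step 2
        tat += comm
        if tail in sources:                              # route complete
            if tat > max_tat:
                max_tat, cr_path = tat, path + [tail]
            continue
        proc, in_arcs = nodes[tail]
        tat += proc
        for a in in_arcs:    # fork one route per input arc of the node
            to_do.append((tat, path + [a], a))
    return max_tat, cr_path

# A small example (our own numbers): node n0 (83 ns) is fed by s0 via
# a0 (137 ns) and by s1 via a1 (50 ns); its output reaches k0 via a2 (20 ns).
sink_arcs = [("k0", "a2")]
arcs = {"a2": ("n0", 20), "a0": ("s0", 137), "a1": ("s1", 50)}
nodes = {"n0": (83, ["a0", "a1"])}
sources = {"s0", "s1"}
print(find_critical_path(sink_arcs, arcs, nodes, sources))
# (240, ['k0', 'a2', 'a0', 's0'])
```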

Transactions of the SDPS

DECEMBER 2001, Vol. 5, No. 4, 67

(a) An Example of Program [figure: an example data-driven program; s0-s4 : sourceID, k0-k3 : sinkID, n0-n3 : nodeID, i0-i7 : inputportID, o0-o6 : outputportID, a0-a11 : arcID]

(b) Examples of Information of Each Route (r0-r4 : routeID)

(1) After Step 0. toDo = (r0, r1, r2, r3)
    routeID  path              nextArc
    r0       (k0, a6)          a6
    r1       (k1, a10)         a10
    r2       (k2, a11)         a11
    r3       (k3, a9)          a9

(2) After Step 1 & 2. toDo = (r1, r2, r3, r4)
    routeID  path              nextArc
    r0       (k0, a6, a4)      a4
    r1       (k1, a10)         a10
    r2       (k2, a11)         a11
    r3       (k3, a9)          a9
    r4       (k0, a6, a5)      a5

(3) After Step 2. toDo = (r1, r2, r3, r4)
    routeID  path              nextArc
    r0       (k0, a6, a4, s3)  a4
    r1       (k1, a10)         a10
    r2       (k2, a11)         a11
    r3       (k3, a9)          a9
    r4       (k0, a6, a5)      a5

Fig. 9 Example of finding critical path in a program.

As shown in Fig. 10, the pipeline occupancy of a PE executing a primitive forms a trapezoid consisting of three phases: an input phase, a steady phase and an output phase. The start of the input phase is delayed by I0, the turnaround time of the preceding part of the program, so the input phase begins at t0 (= I0). During the input phase, since the data flow rate is constant, the pipeline occupancy increases linearly until the first packet has gone around the circular pipeline of the PE. I1, the length of this phase, is the total time a packet needs to go around the pipeline of the PE; thus I1 = tSend [nsec] x nStages [stages], where tSend is the average transfer time of each stage and nStages is the number of stages in the PE. The steady phase then begins at t1 (= t0 + I1). In the steady phase the pipeline occupancy remains constant, because the number of packets leaving the PE equals the number of packets entering it. σ, the pipeline occupancy of this phase, is given by σ = dRate [MPackets/sec] x tSend [nsec] x 10^-3 x 10^2 [%], where dRate is the data flow rate. The steady phase lasts for (I2 - I1), where I2 = nPkt / dRate x 10^-6 [sec] and nPkt is the number of input packets of the primitive.
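The break points of the trapezoid follow directly from these formulas. The sketch below computes them in nanoseconds; the function name and the sample parameter values are assumptions, not taken from the paper.

```python
def occupancy_trapezoid(i0, t_send, n_stages, d_rate, n_pkt):
    """Break points and plateau of the trapezoidal occupancy model.

    i0       -- turnaround time of the preceding part of the program [nsec]
    t_send   -- average transfer time of each pipeline stage [nsec]
    n_stages -- number of stages in the PE
    d_rate   -- data flow rate [MPackets/sec]
    n_pkt    -- number of input packets of the primitive
    """
    i1 = t_send * n_stages                # time to go once around the pipeline [nsec]
    i2 = n_pkt / d_rate * 1e3             # nPkt / dRate [usec], converted to [nsec]
    sigma = d_rate * t_send * 1e-3 * 1e2  # steady-phase occupancy [%]
    t0 = i0                               # input phase begins
    t1 = t0 + i1                          # steady phase begins
    t2 = t0 + i2                          # output phase begins
    t3 = t2 + i1                          # all packets have left the PE
    return t0, t1, t2, t3, sigma
```

For instance, with i0 = 100 [nsec], tSend = 4 [nsec], 25 stages, dRate = 10 [MPackets/sec] and 20 packets, the break points are (100, 200, 2100, 2200) [nsec] with a 4.0 [%] plateau.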


Fig. 10 Prediction model of pipeline occupancy. [figure: occupancy versus time; after an initial delay I0, the curve rises for I1 starting at t0, stays flat until t2 = t0 + I2, and falls for I1 until t3]

The graph then decreases until all packets have left the PE; this is the output phase. The output phase begins at t2 (= t0 + I2) and obviously lasts I1, so the graph ends at t3 (= t2 + I1). Fig. 11 shows an example. The graph in Fig. 11(b) begins at 362 [nsec] and increases linearly until 446 [nsec]. It then stays at 7 [%] from 446 [nsec] to 2132 [nsec], and decreases from 7 [%] to 0 [%] between 2132 [nsec] and 2215 [nsec]. As shown in Fig. 11(d), the pipeline occupancy of a PE executing the sub-program allocated to it is predicted by summing the graphs of all nodes allocated to the PE (Kudo, Wabiko et al., 2000).

3.5. Implementation for Performance Verification

As described in Section 3.1, if a node is realized by a primitive, the processing time of the node is verified as the difference between the time at which a packet exits the PS and the time at which the packet entered the FC. If the node is instead realized by a block containing several primitives, its processing time is verified as the difference between the time at which a packet exits the last PS on the critical path of the block and the time at which the packet entered the first FC on that path. The pipeline-level emulator generates a packet-information stream containing packet identifiers and timestamps of transfer events between pipeline stages. The specification environment picks out only the events for packets leaving a PS or entering an FC and from them calculates the processing times of nodes, the communication times of arcs, and the turnaround times of blocks.

Verification of pipeline occupancy is implemented in the same way as for turnaround time. The pipeline occupancy of a PE at a given time is calculated by dividing the number of packets in the PE at that time by the number of pipeline stages of the PE. An example is shown in Fig. 12.

In this example, the graph with the simpler shape is the prediction result, and the one with the complex saw-tooth shape is the verification result obtained with the pipeline-level CUE processor emulator. The saw-tooth shape arises because, in actual processing, a packet leaving the PE is not necessarily followed immediately by another packet entering it. A typical cause is that the time a packet spends traversing the stages of a PE is not a multiple of the input packet interval (the time between one packet and the next) of an input/output stream. For example, if a stream at a constant data flow rate of one packet per 66 [nsec] enters a pipeline of 26 stages, the number of packets in the pipeline repeatedly alternates: 5 packets for 22 [nsec], 4 packets for the next 44 [nsec], 5 packets for the next 22 [nsec], and so on.
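The alternating packet count described above can be reproduced with a small counting sketch. The parameter values are assumptions consistent with the example: a 26-stage pipeline with an assumed 11 [nsec] stage transfer time, so a packet resides 286 [nsec] against a 66 [nsec] input interval.

```python
def packets_in_pe(t, interval=66, residence=26 * 11):
    """Number of packets inside the PE at time t [nsec].

    Packet k enters at k*interval and leaves at k*interval + residence;
    a packet is counted from its entry up to (but excluding) its exit.
    """
    entered = t // interval + 1  # packets whose entry time is <= t
    exited = max(0, (t - residence) // interval + 1) if t >= residence else 0
    return entered - exited
```

In steady state this repeats 5 packets for 22 [nsec] and then 4 packets for the next 44 [nsec], which is exactly the saw-tooth seen in the emulation result of Fig. 12.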


Fig. 11 An example of predicting pipeline occupancy of a PE. [figure: (a) a screen shot of the specification for a part of TCP/IP protocol handling; (b) and (c) the pipeline occupancy of a PE executing a node Sum, each a trapezoid peaking at 7 [%] ((b) rises from 362 to 446 [nsec] and falls from 2132 to 2215 [nsec]; (c) rises from 583 to 667 [nsec] and falls from 1763 to 1846 [nsec]); (d) the total pipeline occupancy of the PE executing this block, peaking at 14 [%]]


Fig. 12 An example of verification results for pipeline occupancy. [figure: the prediction result (smooth trapezoid) overlaid with the emulation result (saw-toothed curve)]

3.6. Optimizing Program Allocation

In realizing the interactive optimization of program allocation to reduce the turnaround time of the critical path, it is essential to explicitly visualize inter-PE and inter-chip communications. Sequence charts have been broadly utilized for this purpose (Tsai, 1996, Yamazaki et al., 1993), so the authors are currently developing an interactive optimization facility based on sequence charts. Fig. 13 shows the interactive optimization windows currently under development. Fig. 13(a) shows the interactive optimization operation window with the critical path viewer; the x-axis represents PEs and the y-axis represents data-dependence. Inter-processor communication is represented as a series of connected line segments between two nodes allocated to different processors, and small circles on such a series mean that the data merely passes through the PE. Fig. 13(b) shows an example configuration of a multi-processor network. Users can reduce the time spent in inter-PE communications on the critical path by moving nodes to another PE with a drag&drop operation. RESCUE recomputes the turnaround time of the program and shows the result whenever the user changes the program allocation.

As a preliminary evaluation, the authors applied their method to the message transmission part of a data-driven implementation of IIOP (Internet Inter-ORB Protocol) / GIOP (General Inter-ORB Protocol) as adopted in CORBA (Common Object Request Broker Architecture) (Object Management Group, 1998). The executable program consisted of about 160 nodes. The ideal turnaround time, ignoring hardware resource constraints, was about 10.6 [msec]. When a GA initially allocated the program to a multi-processor system of 36 PEs, the turnaround time was estimated at about 12.6 [msec], and the critical path comprised about 25 [%] of the whole program. The allocations of nodes corresponding to about 5 [%] of the entire program were then optimized using the interactive optimization window.

[figure: (a) the reallocation operation window with critical path viewer, showing PEs, primitives, the critical path, and nodes just passing through a PE; (b) the processor network configuration window]

Fig. 13 Graphical user interface for the interactive optimization scheme on program allocation.

The turnaround time of the re-allocated program was improved to about 11.3 [msec]; that is, most of the overhead introduced by the initial GA allocation was removed by our interactive method. This evaluation shows the easy tuning capability of our data-driven system.
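The recomputation performed after each drag&drop can be sketched as follows. This is a deliberate simplification under assumed names: it uses uniform intra-PE and inter-PE communication costs, whereas the actual environment uses per-arc predicted communication times.

```python
def critical_path_time(path, alloc, proc_time, t_intra=1, t_inter=10):
    """Turnaround of a critical path under a given node-to-PE allocation.

    path      -- nodes on the critical path, in data-dependence order
    alloc     -- mapping node -> PE
    proc_time -- predicted processing time of each node [nsec]
    t_intra / t_inter -- communication time within / between PEs [nsec]
    """
    total = sum(proc_time[n] for n in path)
    for a, b in zip(path, path[1:]):
        total += t_intra if alloc[a] == alloc[b] else t_inter
    return total

# moving a node off the critical path's PE adds inter-PE communication
alloc = {"n0": 0, "n1": 0, "n2": 0}
times = {"n0": 5, "n1": 5, "n2": 5}
before = critical_path_time(["n0", "n1", "n2"], alloc, times)  # 17
alloc["n1"] = 1
after = critical_path_time(["n0", "n1", "n2"], alloc, times)   # 35
```

Conversely, pulling nodes of the critical path onto the same PE removes the inter-PE terms, which is exactly the effect the drag&drop optimization exploits.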

4. User/System Interaction via Audio

4.1. Audio for User/System Interaction

From our research on RESCUE, the authors have obtained a perspective toward a development environment that enables users to reuse software components with prototyping via user/system interaction. The kind of user/system interface provided therefore becomes an increasingly crucial issue for RESCUE. Considering that human beings use several media in communication, it is too strong a constraint that the available input media are limited to the mouse and keyboard and the only output medium is the screen. Based on such discussions, multi-modal user interfaces have been broadly studied (Martin, 1997, Takeda et al., 1997).

For example, in reusing a software component, a user may want to change the names described in the component. To change a name represented by a character string, the user conventionally first specifies which name is to be changed by clicking it on the screen and then types the new name on the keyboard. However, it may be easier to say aloud, "Rename A with B." Similarly, to embed a block into a node, the user conventionally chooses the block from a menu, but saying "Embed A in B." may be easier. As for output media, considering the sirens at railroad crossings in real life, audio is effective for warnings about side-effects or for guidance toward completing a specification. This paper therefore focuses upon these kinds of operations, warnings and guidance via audio as a future user/system interface in RESCUE.

4.2. Implementation of User/System Interaction via Audio

Approaches to realizing an audio input interface can be roughly classified into two kinds: natural language processing, such as morphological analysis, and keyword extraction using pattern matching. The former targets flexible recognition of ordinary spoken language, e.g. in artificial intelligence (A.I.); the latter suits particular situations where the available sentence patterns are limited, e.g. a database search system or a flight reservation system. For RESCUE, because users' operations are limited and flexible recognition of ordinary spoken language is unnecessary, the latter approach has been adopted.
In subsection a. below, the available operations are first listed for a typical case, and the regular expression used for pattern matching is then shown for an example. As for audio as an output medium, approaches can likewise be classified into two: audio synthesis and playback of pre-recorded sentences. The former can read any sentence aloud, while the latter can read only prepared sentences. In RESCUE, the sentence patterns of warnings and guidance could be kept limited by phrasing messages like "A side-effect possibility has been detected at this node.", so that the playback approach could be adopted. However, one of the most important merits of audio is that the user does not have to look at the display if enough information has already been obtained from audio, as in "A side-effect possibility has been detected at 'Compute Checksum' node." (The user is, of course, still free to look at the display.) To preserve this merit, the audio synthesis approach has been adopted in our study. A more detailed realization is described in subsection b. below.

a. Operation via Audio

In RESCUE, the user's operations in reusing a component are as follows: first, if the desired software component is not yet described as a block, the user extracts the component as a block by restructuring the hierarchical specification. The user then binds the component as the implementation of a node. Source names or port names in the component may be changed so that they are easily understood by the user. For example, the audio operation form for restructuring is as follows (D indicates the name of an input or output datum; D+ indicates k Ds in the form "D0, D1, ···, Dk-2 and Dk-1"):

a) "Focus on following data."
b) "D+ as input."
c) "D+ as output."
d) "Restructure."
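The keyword-extraction approach for a form such as "D+ as input." (b above) can be sketched with a Python regular expression. The exact pattern and the splitting of the name list are assumptions in the spirit of the paper's pattern matching, not its actual implementation.

```python
import re

# recognize "D0, D1, ..., Dk-2 and Dk-1 as input." and extract the names;
# the pattern mirrors the paper's "(.+)((,.+)*(and.+ ))? as input." form
INPUT_FORM = re.compile(r"^(.+?)((?:,\s*.+?)*\s+and\s+.+?)?\s+as input\.$")

def parse_as_input(sentence):
    """Return the list of data names D+ from an 'as input' command."""
    m = INPUT_FORM.match(sentence)
    if m is None:
        return None
    names = m.group(1) + (m.group(2) or "")
    # split "D0, D1, ..., Dk-2 and Dk-1" at commas and the word "and"
    return [n.strip() for n in re.split(r",|\band\b", names) if n.strip()]
```

For example, parse_as_input("A, B and C as input.") yields ["A", "B", "C"], and a single-name command such as "Checksum as input." yields ["Checksum"].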


In the same way, the audio operation form for interactive optimization of program allocation is as follows (R means the name of a primitive, and X means the name/number of a processor):

e) "Move R to Processor X."

As mentioned at the beginning of this chapter, pattern matching has been adopted to recognize which operation the user has specified. For example, the sentence pattern "D+ as input." is represented by the regular expression

"(.+)((,.+)*(and.+ ))? as input."

where "." matches any one character, "+" means one or more repetitions, "*" means zero or more repetitions, "?" means zero or one occurrence, and "( )" groups subexpressions.

b. Warnings and Guidance via Audio

The following are representative examples of warning and guidance message forms (N means the name of a node; P means the name of a port):

1) Warnings
a) "A side-effect possibility has been detected at the node N."
b) "Phantom tokens have been detected at the input port P."
2) Guidance
a) "Specify the implementation of the node N."
b) "Specify the data-dependence of the input port P."

RESCUE generates these sentences and transfers them to an audio synthesizer. The first version of the prototype audio operation facility is currently under development.

5. Conclusion

In this paper, the authors proposed a specification and prototyping environment that supports not only well-behaved execution but also realtime execution. The paper first showed that reducing the critical path is crucial to satisfying realtime constraints. A specification method that incorporates the user's requirements on turnaround time as well as data flow rate was described, together with a specification method that includes pipeline organization as an engineering constraint. The implementation of the performance prediction and verification facilities in RESCUE was then discussed, and an algorithm for finding the critical path was shown.
Moreover, the interactive optimization facility for turnaround time was described, with a preliminary evaluation result that shows the potential of RESCUE as an interactive optimization method. An implementation of user/system interaction via audio was also discussed toward realizing a multi-modal user interface: the authors showed that an audio input interface can be realized with pattern matching based on regular expressions, and that an audio output interface can be realized with an audio synthesizer.

Future work includes prediction and verification support for the virtual instructions used in tuning the processor architecture. Specification and prototyping facilities for more detailed hardware-specific parameters, such as IC design rules and IC die size, are also needed. To predict and verify

DECEMBER 2001, Vol. 5, No. 4, 74

performance in practical situations, input/output stream specifications covering sudden fluctuation and congestion at runtime also need to be supported in RESCUE. A prototype multi-media networking environment will be constructed, and the authors will apply RESCUE to meet two kinds of requirements: those of end users, such as low cost and flexibility, and those of network service providers, such as high performance and high reliability. The effectiveness of RESCUE will also need to be further demonstrated.

6. Acknowledgements

Although it is impossible to give credit individually to all those who organize and support the CUE project, the authors would like to express their sincere appreciation to all the colleagues in the project. One of the authors, H. Nishikawa, is very grateful to Prof. Arvind and Prof. Jack B. Dennis of the Laboratory for Computer Science, MIT, for their fruitful discussions on the data-driven paradigm. The authors would also like to thank R. T. Shichiku for his efficient proof-reading in the preparation of this paper. This research is partially supported by the Japan Society for the Promotion of Science.

7. References

Mampaey, M., 2000, "TINA for Services and Advanced Signaling and Control in Next-Generation Networks," IEEE Communications Magazine, pp. 104-110.
Wade, V. P. and Lewis, D., 1999, "Three Keys to Developing and Integrating Telecommunications Service Management System," IEEE Communications Magazine, pp. 140-146.
Nishikawa, H., 1997, "Towards Hyper-Distributed Systems Environment," Anritsu News, Vol. 16, No. 84, pp. 2-7.
Nishikawa, H. and Miyata, S., 1998, "Design Philosophy of Super-Integrated Data-Driven Processors: CUE," Proceedings of the 1998 International Conference on Parallel and Distributed Processing Techniques and Applications, pp. 415-422.
Nishikawa, H. and Wabiko, Y., 1998, "A Reuse-Oriented Specification Environment Based on Novel Data-Driven Paradigm," Proceedings of the 3rd World Conference on Integrated Design and Process Technology.
Hayashi, T. and Igarashi, S., 1976, "Theory of Program Verification," Journal of Information Processing, Vol. 17, No. 5, pp. 437-447 (in Japanese).
Wallace, D. R. and Fujii, R. U., 1989, "Software Verification and Validation: An Overview," IEEE Software, Vol. 6, No. 5, pp. 10-17.
Dennis, J. B., 1972, "Dataflow Schemas," Project MAC, pp. 187-216, M.I.T.
Ross, D. T., 1977, "Structured Analysis (SA): A Language for Communicating Ideas," IEEE Transactions on Software Engineering, Vol. SE-3, No. 1, pp. 16-34.
Tamai, T. and Fukunaga, K., 1982, "Symbolic Execution System," Journal of Information Processing, Vol. 23, No. 1, pp. 18-28.
Nishikawa, H. and Wabiko, Y., 1998, "Prototype Data-Driven Specification Environment and Its Evaluations," Proceedings of the 1998 International Conference on Parallel and Distributed Processing Techniques and Applications.
Nishikawa, H., Terada, H., Komatsu, K., Yoshida, S., Okamoto, T., Tsuji, Y., Takakura, S., Tokura, T., Nishikawa, Y., Hara, S. and Meichi, M., 1987, "Architecture of a One-Chip Data-Driven Processor: Qp," Proceedings of the 16th International Conference on Parallel Processing, IEEE, pp. 319-326.
Yoshida, S., Shichiku, R. T., Matuura, Y., Okamoto, T. and Miyata, S., 1995, "Video Signal Processing Oriented Data-Driven Processor," Technical Report of IEICE, ICD95-158, pp. 39-46 (in Japanese).


Nishikawa, H. and Aoki, K., 1998, "Data-Driven Implementation of Protocol Handling," Proceedings of the 1998 International Conference on Parallel and Distributed Processing Techniques and Applications, pp. 430-437.
Object Management Group, 1998, "The Common Object Request Broker: Architecture and Specification, v2.2."
Nishikawa, H., Ishii, H., Kurebayashi, R., Aoki, K. and Komatsu, N., 1998, "Data-Driven TCP/IP Multi-Processor Implementation with Optimized Program Allocation," Proceedings of the International Conference on Communication Systems.
Kudo, R., Wabiko, Y. and Nishikawa, H., 2000, "Performance Verification Scheme for Data-Driven Realtime Processing," Proceedings of the 2000 International Conference on Parallel and Distributed Processing Techniques and Applications, pp. 1977-1983.
Tsai, J. J. P., 1996, "Debugging for Timing-Constraint Violations," IEEE Software, Vol. 13, No. 2, pp. 89-99.
Yamazaki, S., Kajihara, K., Ito, M. and Yasuhara, R., 1993, "Object-Oriented Design of Telecommunication Software," IEEE Software, Vol. 10, No. 1, pp. 81-87.
The Object Management Group, 1998, "CORBA/IIOP 2.2 Specification," Object Management Group, Inc. Publications.
Martin, J. C., 1997, "Towards 'Intelligent' Cooperation between Modalities. The Example of a System Enabling Multimodal Interaction with a Map," Proceedings of the International Joint Conference on Artificial Intelligence Workshop on Intelligent Multimodal Systems.
Takeda, H., Kobayashi, N., Matsubara, Y. and Nishida, T., 1997, "Towards Ubiquitous Human-Robot Interaction," International Joint Conference on Artificial Intelligence Workshop on Intelligent Multimodal Systems, pp. 1-8.

