Developing Monitoring and Debugging Tools for the AP1000 Array Multiprocessor

C W Johnson, P B Thistlewaite, D Walsh, M Zellner
Department of Computer Science, Australian National University

Abstract

The LERP project aims to assist the ap1000 programmer with essential monitoring and debugging tools. LERP is based on off-line analysis and replay of an event trace. A common event trace format includes both low level system events and user-defined events. In this paper we address the issues of instrumentation of ap1000 programs with minimal interference, and presentation of traced program behaviour to the user. We describe an interface to the existing display tools ParaGraph and Upshot, and a new compound event analysis tool suited to machines of the ap1000 scale.

1 Introduction

The ap1000 is a pioneering member of a class of MIMD wormhole message-passing machines with many hundreds of processors (each of which is a general purpose microprocessor with significant speed and memory in its own right), connected by three communications networks. We may term this class the "kilo-processor" machines.(1) It provides significant challenges to programmers who need to develop correct and efficient multiprocessor parallel programs. As it is a new machine, it lacks extensive software debugging aids for the programmer. The LERP project aims to assist the programmer by producing the monitoring and debugging tools for the ap1000 that are an essential part of the programmer's toolkit.

Traditional debugging methods that are used on sequential machine programs are less effective on multiple processes and distributed memory. The usual methods for parallel debugging fall into three classes: playback, breakpoint, and static analysis [14]. LERP is based on off-line analysis and replay of an event trace. Preliminary stages in its design are described in previous papers [11] [12]. In this paper we consider the software engineering aspects of implementing event trace collection and the interface to both existing and new toolsets, and our initial stages of research into better tools for this scale of machine.

We provide a common event trace format for both low level system events and user-defined events. The event trace makes no assumptions about the style of programming: each event is associated with a particular task and cell. We have implemented event trace collection by modifying the ap1000 cell operating system, both in the kernel and in the library code. LERP presents traces to the user through a growing set of both graphical and text-based tools, controlled by a common graphical driver. Selective focusing on subranges of time and of processes will be by user-selected filters on the trace. The tools include both static analysis of the execution that is recorded in the trace, and some animated tools that allow the user to visualise the flow of events, message traffic, and computation load, over time. The static tools include virtual channel analysis of message traffic and compound event recognition to reduce the trace to a humanly manageable volume of meaningful information. This volume reduction can be achieved for traces both within individual tasks, by replacing sequences of atomic events with derived simple regular expressions, and within a collection of tasks, by imposing appropriate equivalence relations on the collection. The dynamic replay tools include our own replay debugger.

A number of event trace analysis tools are available for other parallel computing systems. In the case of two existing tools, Picl/ParaGraph and Upshot, the program model is close enough to the ap1000 that it is simple and worthwhile to interface them to the ap1000 event trace format.

The structure of this paper is: a description of instrumentation and trace collection (section 2); the interfacing of existing tools (section 3); the replay debugger (section 4); a new tool for analysis of compound events (section 5); and discussion of other related work (section 6) and future requirements for kilo-processor support tools (section 7).

(0) This work is supported by the ANU-Fujitsu Joint Research Agreement.
(1) The relationship to the term "killer-micros", used to describe these machines, is not accidental.

2 Instrumenting the AP1000 Operating System Software

2.1 Standard LERP Format

We have implemented two kinds of instrumentation for the ap1000. The simpler kind is a snapshot timer for each cell and task, which divides time into usage in user tasks, system, idle, and busy-waiting. The more important kind is a continuous event trace of all system call events in each cell.

2.1.1 Usage of Time

We have added the usage() call to the ap1000 CellOS. This function is analogous to Unix's time() call, supplying four classes of timing:

- user: compute time per task,
- communication: time spent by the kernel performing message communication,
- busy-wait: time spent waiting in the xy-library communication routines, and
- idle: idle time for the cell.

The distinction between idle and busy-waiting time is important for the ap1000 architecture. These times can guide the user's choice between synchronous and asynchronous message-sending operations, other things being equal.

The implementation must be careful in its own use of the system timer functions, since these invoke an operating system trap if the processor is not already in kernel mode. Most cases pose no problem, since most system call events are logged when the cell is already in kernel mode to handle that call. This is not the case for the synchronous xy routines, which are very efficient, do not switch to kernel mode, and do not make system calls to instigate message communication. These routines use busy-waiting in user mode during synchronisation and message transmission. The time used is measured by incrementing a simple counter in the busy-wait loop. The counter value contributes to the timing (according to previous loop calibration measurements) when the user makes the call to usage(). Early experiments have shown that this implementation has a negligible performance cost and more than 95 per cent accuracy for the xy routines with fixed-size data.(2)
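To make the interface concrete, the following is a minimal C sketch of how such a timing call might look from the application's side. The structure, field names and calibration constant are assumptions invented for illustration; they do not reproduce the actual CellOS declarations.

```c
/* Illustrative sketch only: the real CellOS usage() interface is not
 * reproduced here; the structure, field names and calibration constant
 * are assumptions made for the example.                                 */
#include <stdio.h>

struct cell_usage {
    unsigned long user;      /* compute time per task, in microseconds      */
    unsigned long comm;      /* kernel time spent on message communication  */
    unsigned long busywait;  /* time busy-waiting in the xy-library         */
    unsigned long idle;      /* idle time for the cell                      */
};

/* Hypothetical calibration: busy-wait loop iterations per microsecond,
 * measured once per cell as described in the text.                         */
static const double loop_iters_per_usec = 12.5;

/* Incremented inside the xy-library busy-wait loop (stand-in here).        */
static unsigned long busywait_loop_count;

/* Stand-in for the usage() call: the busy-wait class is derived from the
 * loop counter and the calibration; the other classes would come from
 * kernel accounting and are simply zeroed in this sketch.                  */
static void usage_sketch(struct cell_usage *u)
{
    u->busywait = (unsigned long)(busywait_loop_count / loop_iters_per_usec);
    u->user = u->comm = u->idle = 0;
}

int main(void)
{
    struct cell_usage u;
    busywait_loop_count = 125000;    /* pretend the xy routines spun this often */
    usage_sketch(&u);
    printf("busy-wait time: %lu usec\n", u.busywait);
    return 0;
}
```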

2.1.2 Event Traces

A canonical LERP event history trace format has been defined and implemented on the ap1000. It satisfies both the LERP replay debugger (section 4) and other post-mortem debugging and monitoring tools (Upshot, ParaGraph - see section 3). Every event logged in the trace includes a cell-based timestamp, the type of event (send, receive, broadcast, configure, synchronise, user-event etc.), and in some cases the identifier of a second cell that participates in the event (e.g. a receive event also records the sender). An event is logged for every operating system call, and for system library calls controlling external communications. An example of the size of a typical event trace is that for the CAP example heat flow program [6]: a complete trace including message content requires 4.2 MBytes for 100 iterations of computation plus border-exchange of data, for 64 cells (1 second simulated problem time).

Events in the log have a three word header containing process ID, event code and timestamp. For some simple events (e.g. EXIT) the header is all that is recorded, while others (SEND, RECV, CONFIG) have extra data. Two variants of LERP trace format allow complete traces (including message contents, allowing process replay and detailed debugging) and abbreviated traces (for analysis and animated event display). For the complete trace the contents of messages are logged after SEND and REQUEST events. Events may be classified into four categories:

1. Process control and configuration: at process initialisation and completion, with the timestamp at startup for timing normalisation, configuration of the cells, task exit times and status.
2. Message send and receive: with message contents for all send-class events, and a key that ties each message to the corresponding receive events. The receipt of a message by the cell (RECEIVE and GOT REPLY) is distinct from the task's (potentially blocking) request for the message (RECV event).
3. Event completion and return status: ACK events marking the return from a call with its return value and whether the task was blocked.
4. Status and synchronisation: events recorded for all tasks participating in the sync() or pstat() operations.
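As an illustration of this record layout, the following C sketch shows one plausible shape for a log record. The field widths and the particular list of event codes are assumptions made for the example, not the actual on-disk format.

```c
/* Sketch of a LERP log record following the description above: a three-word
 * header (process ID, event code, timestamp), optionally followed by extra
 * data for send-class events.  Field widths and the event-code list are
 * assumptions for illustration, not the actual format.                      */
#include <stdio.h>
#include <stdint.h>

enum lerp_event {
    LERP_SEND, LERP_RECV, LERP_RECEIVED, LERP_GOT_REPLY, LERP_ACK,
    LERP_CUE, LERP_CONFIG, LERP_SYNC, LERP_PSTAT, LERP_EXIT, LERP_USER
};

struct lerp_header {                 /* three-word header                   */
    uint32_t pid;                    /* process (task) identifier           */
    uint32_t event;                  /* one of the enum lerp_event codes    */
    uint32_t timestamp;              /* cell-based timestamp                */
};

struct lerp_send_extra {             /* extra data, complete-trace variant  */
    uint32_t partner_cell;           /* e.g. the receiver of a SEND         */
    uint32_t message_key;            /* ties the send to its receive events */
    uint32_t length;                 /* bytes of message content following  */
    /* message contents follow in the log when content logging is enabled   */
};

int main(void)
{
    printf("header: %zu bytes, send extra: %zu bytes\n",
           sizeof(struct lerp_header), sizeof(struct lerp_send_extra));
    return 0;
}
```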

2.2 Event Collection and Instrumentation Error

Events are logged into cell memory for selected operating system traps. The accumulated event log is sent to the host for collection into a combined disk file, either at the end of processing or at synchronised regular intervals during execution. Event collection is controlled by the user by a switch on program execution. When logging is selected, a LERP buffer is allocated in each cell's memory. Each event is logged by calling a common routine that records the event into the LERP buffer on the cell. The buffer is periodically downloaded to the host where buffers from all cells are interleaved into a single file in order of arrival.

(2) Alternatively, allowing direct user-mode access to timer0 would obviate the problem, and increase performance of the gettime routines.

There are several issues involved in minimising instrumentation error (probe effect). Three main areas of concern are discussed below: the efficient logging of events, the memory requirements of the LERP buffer, and event buffer download to the host.

2.2.1 Event logging

Event logging is of most concern. Currently, ap1000 message transmission time for 40 byte messages is 130 microseconds (event logging disabled), rising to 215 microseconds (event logging without message content) and to 320 microseconds (with the lot). Most of the cost is in the copying of header and message information into the LERP buffer. (For these measurements, note that five events are logged in transmitting a message between the two processes with la send and crecv, namely SEND, ACK, CRECV, RECEIVED, and ACK.) Careful programming of logging is one strategy used to limit the cost. We use many macros rather than functions, minimising processor register window movement and possible time-consuming overflow. Even the smallest instrumentation effect can change the behaviour of certain time-sensitive applications.

The user can choose between tracing with message content (more probe effect, useful in debugging semantic errors in algorithms) or without (less probe effect, sufficient information for debugging for performance or for synchronisation and deadlocking problems). It is fortunate that the latter class of debugging problem, which is more sensitive to instrumentation error, can be served by the less interfering trace. We note that using off-line perturbation analysis techniques, for example timed Petri-nets [1], could significantly reduce the effect. However, the variable nature of individual events in the ap1000 architecture will make this technique less effective for the ap1000 than for other formats.
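The following sketch shows, under stated assumptions, how macro-based logging into a per-cell buffer might be coded. The names (lerp_buf, LOG_EVENT, LOG_MESSAGE) and the buffer layout are invented for the example; the real CellOS instrumentation differs in detail.

```c
/* Sketch of macro-based logging into a per-cell buffer, in the spirit of the
 * strategy described above.  Names and buffer layout are invented here.     */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define LERP_BUF_WORDS (64 * 1024)

static uint32_t lerp_buf[LERP_BUF_WORDS];
static unsigned lerp_pos;                      /* next free word            */

/* Stand-in for a cheap timer read; the real code reads a hardware counter. */
static uint32_t cell_timestamp(void) { static uint32_t t; return ++t; }

/* A macro rather than a function avoids a SPARC register-window shift (and
 * a possible window-overflow trap) on every logged event.                   */
#define LOG_EVENT(pid, code)                               \
    do {                                                   \
        if (lerp_pos + 3 <= LERP_BUF_WORDS) {              \
            lerp_buf[lerp_pos++] = (pid);                  \
            lerp_buf[lerp_pos++] = (code);                 \
            lerp_buf[lerp_pos++] = cell_timestamp();       \
        }                                                  \
    } while (0)

/* Message-content logging, used only for the complete trace variant.       */
#define LOG_MESSAGE(ptr, nbytes)                           \
    do {                                                   \
        unsigned words_ = ((nbytes) + 3) / 4;              \
        if (lerp_pos + 1 + words_ <= LERP_BUF_WORDS) {     \
            lerp_buf[lerp_pos++] = (nbytes);               \
            memcpy(&lerp_buf[lerp_pos], (ptr), (nbytes));  \
            lerp_pos += words_;                            \
        }                                                  \
    } while (0)

int main(void)
{
    LOG_EVENT(1, 0 /* e.g. SEND */);
    LOG_MESSAGE("hello", 5);
    printf("%u words logged\n", lerp_pos);     /* 3 header + 1 + 2 data */
    return 0;
}
```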

2.2.2 Downloading trace information

Downloading traces from the cells to the host is an issue for both cell memory and communications bandwidth. Downloading once, at the completion of processing, is ideal in having no effect on the program's communication behaviour, but of course requires a large event buffer that may conflict with the application's memory requirements. More importantly, if a task hangs then the trace will never be downloaded. To overcome these problems, two additional modes of trace download are provided: on cell demand, and on timer. With demand downloading a smaller buffer is allocated in the cells, and all cells download whenever any cell's buffer is full. A filled cell communicates with the host, which then sends stop-and-download control messages to interrupt all cells. Thus downloading is synchronised on all cells to avoid extreme probe effect. The third mode guarantees the most complete trace file for a program that fails to terminate (by deadlocking, for instance). It downloads cells at regular intervals in response to the timer1 interrupt, similar to the way monitoring information is sent to the host in the original CellOS.
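A small sketch of the three download policies, expressed as a decision taken after each event is appended; the enum values and helper routine are illustrative stand-ins rather than the actual CellOS interface.

```c
/* Sketch of the three download policies described above.  The enum values
 * and helper function are illustrative stand-ins only.                      */
#include <stdio.h>

enum download_mode {
    DOWNLOAD_AT_EXIT,     /* once, when the program completes                */
    DOWNLOAD_ON_DEMAND,   /* whenever any cell's buffer fills                */
    DOWNLOAD_ON_TIMER     /* at regular intervals, from the timer1 interrupt */
};

/* Stand-in for the cell-to-host notification that triggers the host to send
 * stop-and-download control messages to every cell.                         */
static void notify_host_buffer_full(void)
{
    printf("cell: buffer full, asking host to stop and download all cells\n");
}

/* Called after each event is appended to the cell's LERP buffer.            */
static void maybe_download(enum download_mode mode,
                           unsigned used_words, unsigned capacity_words)
{
    if (mode == DOWNLOAD_ON_DEMAND && used_words >= capacity_words) {
        /* Downloading is synchronised across all cells so that the probe
         * effect is shared rather than concentrated on the filled cell.     */
        notify_host_buffer_full();
    }
    /* DOWNLOAD_AT_EXIT and DOWNLOAD_ON_TIMER are driven elsewhere: at task
     * exit and from the timer1 interrupt handler respectively.              */
}

int main(void)
{
    maybe_download(DOWNLOAD_ON_DEMAND, 65536, 65536);    /* toy trigger */
    return 0;
}
```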

3 Using existing analysis and display tools

A large number of tools exist for parallel debugging and performance monitoring on different parallel machines. Some of these have a program model that is sufficiently compatible with the ap1000 to be useful. We include the ability to use these existing tools by providing filters to transform the LERP trace format to the desired formats. Two implemented examples are the ParaGraph tools, originally for programs written with the PICL portable programming system, and Upshot, for those with the Argonne P4 macros.

3.1 Interface to Upshot

Upshot is an X-based graphics tool for viewing log files produced by parallel programs written with the P4 macros [9]. Upshot has been ported to a number of parallel systems, including Intel's Touchstone Delta. Unlike PICL, which supports a multitude of views, Upshot supports a single view, in which events are aligned on the parallel time lines of individual processes. However, its advantages include:

- the view can be scrolled, forwards and backwards in time, and zoomed in or out
- data associated with individual events is easily viewable via select-and-click pop-ups
- the user can define states, represented by colour bars on a timeline, by defining entry and exit event types for each state.

We have developed a translator from LERP to Upshot trace format which fully supports all Upshot capabilities. The translator automatically defines a set of Upshot states corresponding to the set of compound events that we derive from the LERP log (section 5). This work has pin-pointed several technical implementation constraints on the current version of Upshot: the number of primitive events for the ap1000, and the richer associated data in the LERP trace, exceed Upshot's designed display abilities. We are presently working with Argonne to remedy these technical deficiencies. More substantial problems with the display and interface model are shown in attempting to investigate traces from the large number of ap1000 processes, without first selectively filtering the trace to a few "interesting" processes and events (see discussion in section 7 below).

3.2 The ParaGraph Toolkit

The ParaGraph performance debugger and visualisation tool is being developed by Heath and Etheridge as part of the "Performance Characterisation project" [8]. The ParaGraph performance visualiser is used to visualise the performance of parallel programs written using the PICL subroutine library, which produces trace files to be debugged using ParaGraph in a post-processing manner. PICL is a portable parallel programming subroutine library also developed at Oak Ridge for the same project. The philosophy of ParaGraph is to provide a portable, useful medium technology tool to find performance bottlenecks in parallel programs and to visualise many different computational and communication aspects of a program's behaviour. ParaGraph is mainly used for performance debugging rather than correctness debugging, but its views contain many ideas of how the actual behaviour of parallel programs can be visualised.

The design of ParaGraph incorporates many different views (over twenty-five) ranging from simple statistical views (message queue length and CPU usage for all cells) to more complex views such as Gantt charts, Kiviat diagrams and Spacetime diagrams. The plethora of views is designed to give the programmer as many different views as possible to choose from, with the expectation that many of the views will be appropriate for the application being debugged. ParaGraph is also extensible, allowing new application-specific views to be added. ParaGraph attempts to provide a simulation of the program, in the sense that all of the views show the same excerpt of time, and all views progress in the computation at the same rate.

The simulation can only move forward in time through the trace file, which restricts exploratory interaction. Although the plethora of views is useful, too many simultaneous views can become cluttered with no connecting structure except the point in simulation time. With increases in the number of processors, the views themselves (such as the Space-time view) can become indecipherable due to low graphics resolution, and human information overload.

3.3 lerptopicl

We have implemented a conversion filter called lerptopicl which converts trace files in the LERP format into trace files acceptable to ParaGraph. Since the underlying models and the implementations of CellOS and PICL are different, some translation must occur, as follows:

- Since the PICL model has only one process per processor, whereas the ap1000 may have more than one task per cell, each CellOS task is mapped into a separate virtual PICL node.
- Timestamps in the trace are normalised to the minimum starting time of the cells, since CellOS has no explicit calls to normalise the processor clocks.
- The host is ignored, since other users and operating system delays make its timings unreliable.
- The numerous ways of sending a message in CellOS (including broadcast, though excluding xy send) are mapped into the single PICL send0 primitive. CellOS request and reply primitives become a send0 with a receive0 on the replying node, and a send0 from the replying node to a receive0 on the requesting node.
- The various types of qrecv calls are not mapped into trace records for the PICL probe primitive but produce a statistics record with an incremented probe count.
- The LERP trace file format produces a CUE event to acknowledge the receipt of a message that satisfied a blocking receive. The corresponding PICL recv waking event must contain the type and size of the message, which is not contained in the CUE event. The translator includes a unique key with each message to allow a postprocessing phase to match sends to receives.
- The more esoteric CellOS subroutines such as scatter and gather and the xy calls are not yet handled.
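The following sketch suggests the overall shape such a dispatch might take, based only on the rules listed above. The PICL trace-record syntax is deliberately not reproduced: emit_send0() and emit_recv0() are hypothetical stand-ins for whatever formatting the real lerptopicl filter performs.

```c
/* Shape of a lerptopicl-style dispatch implied by the rules above.  The
 * PICL record syntax is not reproduced; the emit_* helpers are stand-ins,
 * and the LERP event names are those used elsewhere in this paper.         */
#include <stdio.h>

enum lerp_code { L_FSEND, L_CSEND, L_BCAST, L_REQUEST, L_QRECV, L_CUE };

static long probe_count;   /* qrecv calls only bump a statistics counter    */

static void emit_send0(int node, long t) { printf("send0 node=%d t=%ld\n", node, t); }
static void emit_recv0(int node, long t) { printf("recv0 node=%d t=%ld\n", node, t); }

/* node/partner are virtual PICL nodes (one per CellOS task); t is a
 * timestamp already normalised to the minimum cell starting time.          */
static void translate(enum lerp_code code, int node, int partner, long t)
{
    switch (code) {
    case L_FSEND: case L_CSEND: case L_BCAST:
        emit_send0(node, t);            /* all send variants map to send0    */
        break;
    case L_REQUEST:                     /* request/reply becomes send0+recv0 */
        emit_send0(node, t);            /* on the requesting node ...        */
        emit_recv0(partner, t);         /* ... received by the replying node */
        break;
    case L_QRECV:
        probe_count++;                  /* statistics record, no probe event */
        break;
    case L_CUE:
        /* The recv waking record needs the message type and size, which the
         * CUE event lacks; a post-processing phase matches the unique key
         * carried with each message to recover them.                        */
        break;
    }
}

int main(void)
{
    translate(L_FSEND, 0, 1, 100);      /* toy example */
    translate(L_QRECV, 0, 1, 120);
    return 0;
}
```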

The translator has been used to demonstrate the ParaGraph tools on sample ap1000 event traces from the standard ap1000 example Heat Flow program [6]. The nearest neighbour communication and synchronisation points of the heat flow program are clearly evident from the Spacetime views, and the Kiviat diagram indicated that the program was mostly computation, and well load balanced, giving an assurance of the efficiency of the heat flow program.

4 Replay Debugger

The replay debugger is a tool written explicitly for the ap1000. It accepts an event log recorded in the extended trace format, which includes a copy of the contents of messages sent. The replay debugger allows the programmer to investigate the behaviour of individual tasks in detail at the stop-and-display-variables level of debugging, while simulating apparent interaction with other tasks. By working from an event trace the debugger allows the user to follow the same succession of execution states across the combined set of processors, without the potential non-determinism of successive real executions.

Only selected tasks are actually rerun. This allows replay in a reasonable time off-line from the ap1000, on a conventional Sun workstation which has the same Sparc processor type as the ap1000 cells: no recompilation is necessary. The user controls each replaying task through a dbx process which allows setting breakpoints, displaying variables, and relating execution to source code. The state of the non-executing tasks (as viewed by the selected replaying tasks) is maintained by an event reader process which reads the event trace. At present each replaying task has its own event reader, and tasks are unsynchronised. In future, synchronisation between replaying tasks, to the granularity of recorded events, will be obtained by having each share a single event reader containing multiple, related read positions.

The event trace includes the results of system calls as well as the contents of messages. These are used to feed results to the simulated system calls during replay. Where a replaying task executes a sending call the contents of the corresponding message recorded in the trace are expected to match. In some cases they will not do so (if, for example, the data includes pointer values which will differ because of the different memory mapping models in CellOS and SunOS). A mismatch of message data is reported as a warning. On the other hand, a mismatch in the sequence of events between trace and replay is regarded as a fatal error. (We note that some variations in sequencing should be allowed even for one task's event stream, which we shall investigate in later work.) The implementation is a modification of the Casim simulator for the ap1000 [12].
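The following is a minimal sketch of the replay-time checks described above, assuming a simple representation for a traced event; the types and function names are illustrative, not taken from the implementation.

```c
/* Sketch of the replay-time checks: when a replaying task performs a send,
 * the simulated call compares what the task produced with what the trace
 * recorded.  Types and names are illustrative only.                         */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct traced_event {
    int code;                         /* event code recorded in the trace    */
    int length;                       /* length of the recorded message      */
    const unsigned char *data;        /* recorded message contents           */
};

static void check_send(const struct traced_event *logged, int replay_code,
                       const unsigned char *replay_data, int replay_len)
{
    if (replay_code != logged->code) {
        /* Divergence in the event sequence itself is treated as fatal.      */
        fprintf(stderr, "fatal: replay produced event %d, trace has %d\n",
                replay_code, logged->code);
        exit(1);
    }
    if (replay_len != logged->length ||
        memcmp(replay_data, logged->data, (size_t)replay_len) != 0) {
        /* Message data may legitimately differ (e.g. pointer values under a
         * different memory mapping), so this is only a warning.             */
        fprintf(stderr, "warning: message contents differ from trace\n");
    }
}

int main(void)
{
    const unsigned char recorded[] = { 1, 2, 3, 4 };
    const unsigned char replayed[] = { 1, 2, 9, 4 };   /* e.g. a pointer value */
    struct traced_event ev = { 7 /* SEND */, 4, recorded };
    check_send(&ev, 7, replayed, 4);                   /* warns, then continues */
    return 0;
}
```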

5 Compound Event Trace Analysis

5.1 Motivation

Existing tools such as ParaGraph and Upshot are ineffective for viewing or analysing logs of parallel programs with a large number of tasks. Indeed, many existing tools have soft-wired limits to the number of processes that can be viewed (for example, 16). Moreover, an extensive search of the literature together with discussions with a number of other researchers in the area of parallel program debugging and performance monitoring indicates that few researchers have considered the problems associated with massively parallel architectures where hundreds or even thousands of processors might be involved - the kilo-processors.

When the number of processors exceeds some small number the user of a parallel debugging or performance monitoring tool faces the problem of "bandwidth", or of "not being able to see the forest for the trees" or vice versa. The mass of information makes it easy to miss important individual events (e.g. anomalous or pathological events), and makes it hard to see an overall execution pattern (e.g. topology) amongst the processes that make up the program. Symbolic debuggers are especially useful for understanding the exact nature of an anomaly, and in the case of exception-induced process termination (e.g. fatal errors), the approximate location of an anomaly. But in the case of parallel systems, especially message-passing architectures, non-fatal or indirectly fatal errors can reside in processes other than the one that terminates with an error condition. Animation tools provide approximate, impressionistic depictions of "patterns" within and between event streams, but do not provide pattern descriptions which exactly describe event streams. Human visual processing can detect only a few of these patterns; the human must be assisted by automatic pattern recognition.

Such exact descriptions can be used for the deduction of similarities between processes and parts of a single process, and these can lead, in turn, to the detection of anomalous individual events within a stream, the analysis of the topology of a program, and the selection of a manageable number of "representative" tasks for closer examination.

5.2 Compound Event Recognition Algorithm

In a LERP log the events that constitute a particular task event stream are interleaved throughout the log. However, for the purposes of compound event recognition, we will separate individual task event streams, and we will order events within a particular stream according to time. Assume then that we have logged a program using p processors, each having t tasks per processor, giving a total of n = p × t separate event streams. Each event stream <e_i^1, ..., e_i^k> (i = 1..n) is a time-ordered sequence of events. We can now define a compound event as a complete subsequence of an event stream, <e_i^c, ..., e_i^(c+d)>. A compound event is a structure having the following fields:

- a type-identifier, which maps to the sequence of primitive event types that constitute that subsequence
- a start-time, being the timestamp of the event e_i^c
- an end-time, being the timestamp of the event e_i^(c+d)
- a count, the use and meaning of which will be explained later.

Note that in our terminology a compound event does not span or relate events from different tasks (e.g. a SEND from task A and its matching RECEIVE in task B). However, we do not want just any arbitrary subsequence of an event stream to constitute a compound event. Ideally, we want to select those subsequences that correspond, in some meaningful sense, with actual "modules" or "blocks" within a task; for example, a SEND followed by an ACK followed by a CUE. In the absence of clues provided by the user, we can use two criteria for determining which subsequences form natural blocks:

- the subsequence recurs frequently within a particular event stream, or between different event streams
- the subsequence is bounded by another compound event, or lies exactly between other compound events, or occurs repeatedly so as to indicate a loop.

In addition, we would like to recognise larger sequences as compound events before smaller sequences, everything else being equal. These considerations lead to assigning a value to every subsequence of every event stream, determined by the formula:

    Value = (length of subsequence / repeat count) × log10(total frequency of occurrence)

where the repeat count is the number of times the subsequence is immediately followed in an event stream by an instance of itself.

Compound event recognition proceeds by identifying subsequences of events in each task's event stream, within specified length limits, and computing adjacency counts for repeated adjacent instances of subsequences. The highest Value subsequence is chosen as a new compound event, given the start- and end-times of its first and last constituent events, and that compound event is substituted for every instance of the constituent subsequence. Repeated adjacent instances of the compound event are replaced by one instance with the corresponding count and appropriate start- and end-times. This process is called compaction. This algorithm is repeatedly applied to successively refined event stream sequences, and a new highest-valued subsequence is selected as the next compound event for substitution and compaction within the event stream sequences. The repetition terminates when all adjacency values are non-zero.

At termination, we will have recognised a small number of compound event types. These compound event types represent frequent patterns of adjacent event subsequences. Users can control the efficiency and power of this process by setting parameters such as minimum and maximum subsequence length, whether compound events can themselves be compounded, or can only be a compound of primitive event types, and whether to terminate short of exhaustion when a given number of the best compound events have been found. An example of two primitive event traces from tasks in the one program and the compound events that are recognised by this algorithm is in figure 1.
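To make the substitution-and-compaction step concrete, here is a minimal sketch of one round of compaction on a single event stream. The representation (integers for event types, a parallel array of counts) is an assumption made for the example; the real recogniser also tracks start- and end-times and the Value computation.

```c
/* Sketch of substitution and compaction: every instance of a chosen
 * subsequence is replaced by a compound-event symbol, and immediately
 * adjacent instances are merged into one record carrying a count.        */
#include <stdio.h>

/* Replace occurrences of pat[0..plen-1] in stream[0..*len-1] with the
 * symbol `ce`, merging adjacent occurrences; counts[] records how many
 * adjacent instances each compound event stands for.                     */
static void substitute_and_compact(int *stream, int *counts, int *len,
                                   const int *pat, int plen, int ce)
{
    int out = 0;
    for (int i = 0; i < *len; ) {
        int match = (i + plen <= *len);
        for (int k = 0; match && k < plen; k++)
            if (stream[i + k] != pat[k]) match = 0;

        if (match) {
            if (out > 0 && stream[out - 1] == ce) {
                counts[out - 1]++;          /* compact: extend previous CE */
            } else {
                stream[out] = ce;
                counts[out] = 1;
                out++;
            }
            i += plen;
        } else {
            stream[out] = stream[i];        /* copy untouched event        */
            counts[out] = 1;
            out++;
            i++;
        }
    }
    *len = out;
}

int main(void)
{
    /* Toy stream: 1=Fsend 2=Ack 3=Recvs 4=Recvd (cf. figure 1). */
    int stream[] = { 1,2,3,4,2, 1,2,3,4,2, 1,2,3,4,2, 9 };
    int counts[16];
    int len = 16;
    int ce1[] = { 1, 2, 3, 4, 2 };           /* the chosen subsequence     */

    substitute_and_compact(stream, counts, &len, ce1, 5, 100 /* CE1 */);
    for (int i = 0; i < len; i++)
        printf("%d(x%d) ", stream[i], counts[i]);
    printf("\n");                            /* prints: 100(x3) 9(x1)      */
    return 0;
}
```

Running this on the toy stream replaces the three adjacent instances of the chosen subsequence with a single compound-event record carrying a count of 3, mirroring how CE1(3) arises for cell 2 in figure 1.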

5.2.1 Partitioning Tasks into Equivalence Classes

After recognising compound events, we have reduced the event stream for each task to a more compact, yet exact, description of the original, by substituting compound events for large subsequences of other events, and compacting adjacent instances of a compound event into the one record. If we ignore the time information in each event stream while preserving the local ordering, and explicitly tag each occurrence of a compound event with its count, the resulting event stream description is a regular expression representing the sequence of event types of events in the original event stream. Let us call such a description the signature of an event stream for a given task. Signatures for the two tasks are shown in figure 1.

Now, we can partition the set of tasks associated with a particular program using equivalence relations defined in terms of task signatures. Where two tasks have identical signatures, we call them identical tasks. We can also ignore the count tag in the signatures, thereby getting a description which we call the reduced signature of a task. Where two tasks have identical reduced signatures, we call them cognate tasks. Clearly, these identities give rise to equivalence classes. In the example there are two equivalence classes of identical tasks among the whole group of tasks (the signatures of the other tasks were not shown in the figure):

SIGNATURE for cells 0..5, 7..13, 15..21, 23..29, 31..37, 39..45, 47..53, 55..61, 63:
    CE2 CE1(3) CE3

SIGNATURE for cells 6, 14, 22, 30, 38, 46, 54, 62:
    CE2 CE1 CE4 CE1 CE3
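A minimal sketch of the two equivalence relations, operating directly on signature strings in the format shown above; the helper names are invented for the example.

```c
/* Sketch of the identical/cognate relations on signature strings such as
 * "CE2 CE1(3) CE3".  The string format follows figure 1; helper names are
 * invented for the example.                                                */
#include <stdio.h>
#include <string.h>

/* Reduced signature: drop the "(count)" tags,
 * e.g. "CE2 CE1(3) CE3" -> "CE2 CE1 CE3".                                  */
static void reduce_signature(const char *sig, char *out, size_t outsz)
{
    size_t j = 0;
    for (size_t i = 0; sig[i] && j + 1 < outsz; i++) {
        if (sig[i] == '(') {                  /* skip the "(...)" count tag */
            while (sig[i] && sig[i] != ')') i++;
            if (!sig[i]) break;
        } else {
            out[j++] = sig[i];
        }
    }
    out[j] = '\0';
}

static int identical_tasks(const char *a, const char *b)
{
    return strcmp(a, b) == 0;                 /* same signature             */
}

static int cognate_tasks(const char *a, const char *b)
{
    char ra[256], rb[256];                    /* same reduced signature     */
    reduce_signature(a, ra, sizeof ra);
    reduce_signature(b, rb, sizeof rb);
    return strcmp(ra, rb) == 0;
}

int main(void)
{
    const char *cell2 = "CE2 CE1(3) CE3";         /* from figure 1 */
    const char *cell6 = "CE2 CE1 CE4 CE1 CE3";
    printf("identical: %d, cognate: %d\n",
           identical_tasks(cell2, cell6), cognate_tasks(cell2, cell6));
    return 0;
}
```

For the two signatures taken from figure 1 the tasks are neither identical nor cognate, which is exactly what separates the two equivalence classes listed above.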

Having automatically partitioned all tasks into equivalence classes, the user can:

- determine whether the topology induced by the set of equivalence classes corresponds to the intended topology
- identify tasks which may have anomalous events by examining singleton equivalence classes
- examine the signature for a task or set of tasks to see whether the "block" structure appears to reflect the intended substructuring of the task
- select a subset of tasks, typically one or two from each equivalence class, for viewing using available tools (such as Upshot or those provided via ParaGraph), or for examining using a symbolic debugger.

In our example the topology of the identical tasks corresponds to a single column of an 8 × 8 cell array, as shown below.

    0   8  16  24  32  40  48  56
    1   9  17  25  33  41  49  57
    2  10  18  26  34  42  50  58
    3  11  19  27  35  43  51  59
    4  12  20  28  36  44  52  60
    5  13  21  29  37  45  53  61
    6  14  22  30  38  46  54  62  *
    7  15  23  31  39  47  55  63

The topology of processor classes is clearly that the column of cells 6, 14, 22, etc. behaves differently from the rest. We can offer no explanation in this case.

5.2.2 Determining Phases of a Process

Reduced signatures, to the extent that they each reflect a natural, abstracted "block" structure within each task, also reflect the computational "phases" of that task, with phase boundaries corresponding to abstracted compound event boundaries. With cognate tasks, these phases will be common. We can display the phases of a class of cognate tasks (or, indeed, of all tasks) using Upshot, by representing a particular phase as an Upshot state, and inserting dummy events into the log to mark the boundaries of each phase.

5.3 Further Work

Determining equivalence classes of tasks, and the phases of tasks, using signatures and reduced signatures, may be generalised by taking relations other than identity. For example, it may be possible, by using the techniques associated with simulated annealing, to (i) allow for gaps within signatures, and (ii) treat the occurrence of different primitive LERP event types, such as ISEND and FSEND, as being the same within a particular context. We are currently exploring these possibilities.

6 Relation to other work

Many papers have been written about debugging parallel programs, but not many have focused on general problems such as dealing with the nondeterminacy inherent in parallel programs, and the need to develop debugging tools that can deal with thousands of processors, running possibly different programs. Not many of the claimed "debugging" tools assist in actually finding and correcting erroneous program behaviour, but rather allow program behaviour to be viewed and tuned for performance after the correctness of the program has been established.

Our work on compound events is similar to the EDL [4] model of event abstraction. The language EDL was originally developed to specify the behaviour of parallel programs in terms of a sequence of primitive or higher level program events. Primitive events (such as sends and receives) could then be grouped into higher level events using a regular expression like syntax. This grouping of events is termed behavioural abstraction since the primitive event stream can be simplified into higher level events giving a more abstract view of the behaviour of the program. The EBBA set of tools has been constructed based on the EDL model that allows dynamic matching of high level event definitions to the lower level primitive event stream [2] [3]. The EBBA toolset is used in distributed systems to determine whether the actual program behaviour matches the behaviour specified by the programmer of the distributed system in terms of EDL higher level events, and involves pattern matching using sophisticated shuffle automata. We feel that the EDL model and its implementation in the EBBA toolset are a good basis for future research into extremely large parallel systems, since unlike other debugging systems, they allow the programmer to debug by comparing expected behaviour with actual program behaviour in an abstract way, that can be automatically checked.

The Belvedere debugger [10] is an implementation of a graphical user interface to the information supplied by the EBBA toolset. The debugger allows the programmer to see directly the difference between actual and expected program behaviour. All papers on Belvedere seem to indicate that it is limited visually (like other debuggers of its class) to a limited number of processors.

Few debugging systems allow a message passing program run on a distributed memory machine to be debugged in the context of the source code of the program, rather than just the external state of the program such as message traffic. Systems such as PIE [17] provide a visual programming environment (albeit for shared memory architectures) that allows an integrated view of the program as a whole, including views of performance data, and of the text of the source program itself. Such a tool for a distributed memory machine would be invaluable in allowing either online or post mortem debugging of a message passing parallel program, and a step towards this goal has been taken by both the replay server we have implemented for the ap1000 [12] and the online version of the gdb debugger supplied with the ap1000 system software.

Our work on instrumenting the ap1000 kernel has previously been done for many other parallel machines and their operating systems, with the effect of having a plethora of operating systems and trace file formats. Work has been carried out on attempting to standardise the trace file formats, and also on providing a generic operating system for message passing distributed memory machines. The discussion in papers such as [1] [15] [16] focuses on producing a wish list for the format and semantics of trace files from sequential and parallel systems, and puts forward the attributes that performance monitoring and visualisation tools should possess. Portable parallel programming environments such as PICL [7] and p4 [5] attempt to provide a generic portable operating system for distributed memory multicomputers. PICL has its tracing logic embedded in the system, requiring minimal effort by the programmer. Tools such as p4 provide a separate tracing package which must be used to instrument the program explicitly. We have found that the trace formats for these systems in particular are sufficiently structurally similar to make translation from our format easy, and hence connection to the related tools engineered readily.

7 Discussion and Conclusion

We have described LERP as it is: a debugging and monitoring system still under development. We have implemented the framework for a toolkit, based around the common event trace and the ideas of selective focusing and automatic analysis of behavioural patterns. The tools are no more than serviceable at present, and we expect further refinement and experience with their use in the kilo-processor environment to inform further developments. In the paragraphs that follow we discuss some aspects of debugging for kilo-processors and some consequences of using event traces, followed by an outline of our anticipated future directions.

7.1 Approach to debugging

Tools for debugging and performance monitoring and analysis exist to aid programmers in the task of forming and testing hypotheses about the program's behaviour. Kilo-processor machines present the specific problem of compressing the information that is available from the machine to a humanly usable quantity. By attaching existing conventional debugging and monitoring tools to the ap1000 we are able to support part of the debugging activity, but at the same time we demonstrate these tools' inadequacy for large numbers of processes. Debugging on such machines requires extending the abstractive power of the tools further along the spectrum from detailed internal state examination (dbx-like), and beyond internal and external state transition and message flow (like ParaGraph and Upshot). The LERP compound event recognition tool, and the associated ability to cluster process behaviours, are an advance along this spectrum that provides the user with a more abstract gestalt view of the program's behaviour.

The problem is not only in the large number of processes but in the shift in viewpoint that is needed, from considering internal program states alone in sequential debugging, to seeing both internal and external states in distributed parallel programs, up to broader pictures of state changes and behaviour. Conventional debugging systems are unable to handle the larger sequential programs of today, such as a complete X-windows application, or a layered OSI application, through all their layers, multi-parameter interfaces, and semi-hidden internal states. It is not surprising that they are also unable to cope with the kilo-processor. Our answer is to provide two kinds of selective focus to the user:

- focus in time and space: taking subsets of the execution timeline, and of the set of tasks;
- defocusing on detail: abstracting whole program behaviour to reduce the information load.

Programmers use several layers of abstraction in constructing their programs, both explicitly represented in higher-level language constructs and implicitly contained in their coding. The compound event recogniser is the first stage in attempts to capture some abstract patterns in program behaviour and present them in ways that the users can relate to their designs. The alternative of requiring programmers to specify formal event descriptions ahead of time is unproductive: programmers rarely formulate their designs precisely enough to be able to give exact specifications of the expected behaviour. By viewing behavioural abstractions extracted from actual behaviour, programmers can be alerted to discrepancies between processors, and can gain new understanding of their programs to help them refine their algorithms and implementations.

7.2 Event traces

We base our debugging on event tracing because traces provide reproducible behaviour during the user's exploration. With the ap1000's large local cell memory and high bandwidth communications the probe effect during trace collection is acceptably small, and it is practical to collect traces for reasonably long program runs of reasonable complexity. Modern workstations have enough computing and graphics power to make off-line debugging and analysis attractive. Trace files can also be used to support regression testing, ensuring that events occur in the same partial order after small changes are made to algorithms.

A current issue in capturing event traces is the problem of handling large traces. Two mechanisms to reduce log file sizes are being considered, namely event trace compression, and selective tracing of chosen task/cell combinations. Data compression has been shown to have very good results on event traces, using run-length encoding or modified Lempel-Ziv algorithms [13]. The main consideration with trace compression is deciding when to perform it. The computational cost can be partly balanced by the smaller amount of storage and transmission time for the data.

- Compressing when events are generated will increase the time overhead and thus increase the probe effect. However, all processes are affected and the net effect must be determined.
- Compression during buffer download to the host will have an effect depending on the mode of buffer download. End-of-processing and synchronised modes will be relatively transparent, whereas regular downloads will be heavily affected. It should be recognised that this mode would be the mode of choice for examining deadlocking problems, which would be most sensitive to probe effect.
- Post-processing compression is already an alternative, provided that sufficient temporary file storage is available for the intermediate full log file. The gain is only in reducing long-term disk storage.

Selective tracing of individual or groups of processes is a possible source of instrumentation error, because it introduces uneven overheads on processors. By comparison, when all tasks are traced the probe effect is minimal because tracing costs are almost uniform across all cells. Note that the trace created by this option will not be useful for animation purposes where instrumentation error will be visible, but remains useful in conjunction with the rerun server for debugging those tasks that have been traced.

7.3 Directions of further work

Future directions for this work will extend the approach that we have taken in providing a multi-layer, multi-view toolset, with automatic behavioural analysis, guiding manual selective focusing. We identify the need to include the whole program view in debugging and analysis: reductionism will not be successful in kilo-processing. Extensions planned in the near future include:

- extending the automatic recognition of compound events to include events from more than one task
- recognising geometric patterns of decomposition onto the processors
- recognising and highlighting clustering in patterns of decomposition and event behaviour
- raising the recognition of behavioural patterns to the level of the user program, not only the implementation of the programming abstraction, by using specifications of higher-level behavioural patterns from implementors of higher level programming models such as data parallel paradigms, Linda, etc.

We await further experimental evaluation of the debugging and analysis toolset by a variety of users with a variety of programming problems as the implementation develops.

References

1. M. S. Andersland and T. L. Casavant, Recovering Uncorrupted Event Traces from Corrupted Event Traces in Parallel/Distributed Computing Systems, in Proceedings 1991 International Conference on Parallel Processing, H. D. Schwetman, ed., CRC Press Inc, Boca Raton, August 1991, II-108-112.
2. P. Bates, Debugging Heterogeneous Distributed Systems using Event-Based Models of Behavior, Proceedings of the ACM SIGPLAN/SIGOPS Workshop on Parallel and Distributed Debugging, published in ACM SIGPLAN Notices 24 (January 1989), 11-22.
3. P. Bates and J. C. Wileden, An Approach to High-Level Debugging of Distributed Systems, Proceedings of the ACM SIGSOFT/SIGPLAN Software Engineering Symposium on High-Level Debugging, published in ACM SIGPLAN Notices 18 (August 1983), 107-111.
4. P. C. Bates and J. C. Wileden, EDL: A Basis for Distributed System Debugging Tools, in Proceedings of the Fifteenth Hawaii International Conference on System Sciences, January 1982, 86-93.
5. J. Boyle, R. Butler, T. Disz, B. Glickfield, E. Lusk, R. Overbeek, J. Patterson and R. Stevens, Portable Programs for Parallel Processors, Holt, Rinehart and Winston, 1987.
6. Fujitsu Laboratories Ltd, CAP-II Program Development Guide C-language Interface, 2nd edition, March 1990.
7. G. A. Geist, M. T. Heath, B. W. Peyton and P. H. Worley, PICL: a portable instrumented communication library, C reference manual, Oak Ridge National Laboratory, Oak Ridge, Tennessee, ORNL/TM-11130, July 1990, Technical report.
8. M. T. Heath and J. A. Etheridge, Visualizing performance of parallel programs, Oak Ridge National Laboratory, Oak Ridge, Tennessee, ORNL/TM-11813, May 1991.
9. V. Herrarte and E. Lusk, Studying program behavior with upshot, Mathematics and Computer Science Division, Argonne National Laboratory, Illinois, 1991.
10. A. A. Hough and J. E. Cuny, Belvedere: Prototype of a Pattern-Oriented Debugger for Highly Parallel Computation, in Proceedings of the 1987 International Conference on Parallel Processing, University Park, Pennsylvania, 1987, 735-738.
11. C. W. Johnson and P. Mackerras, Architecture of an Extensible Parallel Debugger, in Proceedings 1991 International Conference on Parallel Processing vol 2, H. D. Schwetman, ed., CRC Press Inc, Boca Raton, August 1991, II-262-263.
12. C. W. Johnson and P. Mackerras, Design of a Replay Debugger for a Large Cellular Array Processor, in Proceedings of Australian Software Engineering Conference, 1991.
13. C. Z. Loboz, An analysis of program execution: issues for computer architecture, PhD Thesis, Australian National University, Canberra, Australia, 1990.
14. C. E. McDowell and D. P. Helmbold, Debugging Concurrent Programs, ACM Computing Surveys 21 (December 1989), 593-622.
15. C. Pancake, D. Gannon and S. Utter, Supercomputing 90 BOF session on Standardising Parallel Trace Formats, New York, 15 November 1990.
16. D. A. Reed and The Picasso Group, Scalable, Open Performance Environments for Parallel Systems, Department of Computer Science, University of Illinois, Urbana, Illinois, April 1991.
17. Z. Segall and L. Rudolph, PIE: A Programming and Instrumentation Environment for Parallel Processing, IEEE Software 2 (November 1985), 22-37.

Primitive Event Sequence - Cell 2 (Event, Time):
Iinfo 0001, Cue 0004, Recvs 0006, Recvd 0012, Cue 0015, Sync 0030, Cue 0035, Fsend 0100, Ack 0110, Recvs 0134, Recvd 0178, Ack 0213, Fsend 0311, Ack 0320, Recvs 0349, Recvd 0401, Ack 0429, Fsend 0498, Ack 0523, Recvs 0538, Recvd 0556, Ack 0791, Pstat 0923, Ack 0932, Csend 0955, Ack 1020, Exit 1093

Primitive Event Sequence - Cell 6 (Event, Time):
Iinfo 0002, Cue 0004, Recvs 0006, Recvd 0012, Cue 0015, Sync 0030, Cue 0038, Fsend 0105, Ack 0110, Recvs 0134, Recvd 0178, Ack 0218, Fsend 0317, Ack 0320, Recvs 0349, Recvd 0401, Cue 0428, Fsend 0493, Ack 0523, Recvs 0538, Recvd 0556, Ack 0797, Pstat 0923, Ack 0932, Csend 0955, Ack 1020, Exit 1094

Compound Event Sequence - Cell 2:
    CE2 = (Iinfo, Cue, Recvs, Recvd, Cue, Sync, Cue)   Start: 0001  End: 0035  Count: 1
    CE1 = (Fsend, Ack, Recvs, Recvd, Ack)              Start: 0100  End: 0791  Count: 3
    CE3 = (Pstat, Ack, Csend, Ack, Exit)               Start: 0923  End: 1093  Count: 1
    Signature: CE2, CE1(3), CE3

Compound Event Sequence - Cell 6:
    CE2 = (Iinfo, Cue, Recvs, Recvd, Cue, Sync, Cue)   Start: 0002  End: 0038  Count: 1
    CE1 = (Fsend, Ack, Recvs, Recvd, Ack)              Start: 0105  End: 0218  Count: 1
    CE4 = (Fsend, Ack, Recvs, Recvd, Cue)              Start: 0317  End: 0428  Count: 1
    CE1 = (Fsend, Ack, Recvs, Recvd, Ack)              Start: 0493  End: 0797  Count: 1
    CE3 = (Pstat, Ack, Csend, Ack, Exit)               Start: 0923  End: 1094  Count: 1
    Signature: CE2, CE1, CE4, CE1, CE3

Figure 1: Corresponding Primitive and Compound Events (idealised timestamps)