Techniques for Debugging Parallel Programs with Flowback Analysis

JONG-DEOK CHOI
IBM Thomas J. Watson Research Center
and
BARTON P. MILLER and ROBERT H. B. NETZER
University of Wisconsin — Madison

Flowback analysis is a powerful technique for debugging programs. It allows the programmer to examine dynamic dependences in a program's execution history without having to reexecute the program. The goal is to present to the programmer a graphical view of the dynamic program dependences. We are building a system, called PPD, that performs flowback analysis while keeping the execution-time overhead low. We also extend the semantics of flowback analysis to parallel programs. This paper describes details of the graphs and algorithms needed to implement efficient flowback analysis for parallel programs. Execution-time overhead is kept low by recording only a small amount of trace during a program's execution. We use semantic analysis and a technique called incremental tracing to keep the time and space overhead low. As part of the semantic analysis, PPD uses a static program dependence graph structure that reduces the amount of work done at compile time and takes advantage of the dynamic information produced during execution time. Parallel programs have been accommodated in two ways. First, the flowback dependences can span process boundaries; that is, the most recent modification to a variable might be traced to a different process than the one that contains the current reference. The dynamic program dependence graphs of the individual processes are tied together with synchronization and data dependence information to form complete graphs that represent the entire program. Second, our algorithms will detect potential data-race conditions in the accesses to shared variables. The programmer can be directed to the cause of the race condition. PPD is currently being implemented for the C programming language on a Sequent Symmetry shared-memory multiprocessor.

Categories and Subject Descriptors: D.2.5 [Software Engineering]: Testing and Debugging—debugging aids, monitors, tracing; D.3.3 [Programming Languages]: Language Constructs and Features—concurrent programming structures, control structures, procedures, functions, and subroutines; D.3.4 [Programming Languages]: Processors—code generation, compilers

General Terms: Design, Languages, Measurement

This research was supported in part by National Science Foundation grants CCR-8703373 and CCR-8815928, Office of Naval Research contract N00014-89-J-1222, and a Digital Equipment Corporation External Research grant. Authors' addresses: J.-D. Choi, IBM Thomas J. Watson Research Center, Yorktown Heights, NY 10598; B. P. Miller and R. H. B. Netzer, Computer Sciences Department, University of Wisconsin—Madison, Madison, WI 53706. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. © 1991 ACM 0164-0925/91/1000-0491 $01.50. ACM Transactions on Programming Languages and Systems, Vol. 13, No. 4, October 1991, Pages 491-530.


Additional Key Words and Phrases: Debugging, flowback analysis, incremental tracing, parallel program, program dependence graph, semantic analysis

“In solving a problem of this sort, the grand thing is to be able to reason backward. This is a very useful accomplishment, and a very easy one, but people do not practise it much. In the everyday affairs of life it is more useful to reason forward, and so the other comes to be neglected. There are fifty who can reason synthetically for one who can reason analytically.

“Let me see if I can make it clearer. Most people, if you describe a train of events to them, will tell you what the result would be. They can put those events together in their minds, and argue from them that something will come to pass. There are few people, however, who, if you told them a result, would be able to evolve from their own inner consciousness what the steps were which led up to that result. This power is what I mean when I talk of reasoning backward, or analytically.”

Sherlock Holmes in “A Study in Scarlet,” Arthur Conan Doyle

1. INTRODUCTION

Debugging is a major step in developing a program, since it is rare that a program initially behaves the way that the programmer intends. Whereas most programmers have developed satisfactory strategies for debugging sequential programs, experience has proved debugging parallel programs to be more difficult. The Parallel Program Debugger (PPD) [31] is a debugging system for parallel programs running on shared-memory multiprocessors (hereafter, called "multiprocessors"). PPD efficiently implements a technique called flowback analysis [7], which provides information on the data flow and control information between events in a program's execution. PPD provides this information while keeping both the execution-time and debug-time overhead low. By using a method called incremental tracing, only a small amount of trace is generated during a program's execution, and the detailed debugging information is supplemented by reexecuting only selected parts of the program. PPD is also capable of performing flowback analysis on parallel programs and detecting data races in the interactions between processes.

This paper describes the mechanisms used by PPD to implement efficient flowback analysis for parallel programs. These mechanisms include program dependence graphs and semantic analysis techniques such as interprocedural analysis [2, 12] and data-flow analysis [22].

The goal of PPD is to aid debugging by displaying dynamic program dependences. These dependences should guide the programmer from manifestations of erroneous program behavior (the failure) to the corresponding erroneous program state (the error) to the cause of the problem (the bug).

Debugging is a difficult job because the programmer has little guidance in locating bugs. To locate a bug that caused an error, the programmer must reason about the causal relationships between events in the program's execution. There is usually an interval between when a bug first affects the program behavior and when the programmer notices an error caused by the bug. This interval makes it difficult to locate the bug precisely. The usual method for locating a bug is to execute the program repeatedly, each time placing breakpoints closer to the location of the bug. An easier way to locate a bug is to track the events backward from the error to the point at which the bug caused the error. Flowback analysis tracks events in such a way. Using flowback analysis, the programmer sees, either forward or backward, how information flowed through the program to produce the events of interest. With this information, the programmer can more easily locate the bugs that led to the errors.

Parallel programming offers challenges beyond sequential programming that complicate the problem of debugging. First, it is difficult to order events occurring in parallel programs. The ordering of the events during program execution is crucial for seeing causal relationships between the events (and, therefore, the cause of errors). Second, parallel programs are often nondeterministic. Such nondeterminism often makes it difficult to reexecute the program for debugging purposes. Third, interactions between cooperating processes in a multiprocessor system are frequent, and these accesses to shared variables can occur without the proper synchronization. PPD not only performs efficient flowback analysis for sequential programs, but also helps address the problems of debugging parallel programs.

In this paper we address parallel programs that use explicit synchronization primitives (such as semaphores, monitors, or Ada rendezvous) and explicit (and dynamic) process creation. We are not addressing automatic parallelism, although many of our techniques should generalize to other classes of parallel programs. Our current algorithms assume that the underlying machine architecture has a sequentially consistent memory system [28] (as is the case on the Sequent Symmetry). The techniques in this paper are described in terms of the C programming language [23], but they might be extended to other imperative languages. We address a large part of the C language, including primitives for synchronization. We also discuss a simple approach to pointer variables, but this is a topic that needs further investigation.

This paper is organized as follows: Section 2 presents an overview of the design of PPD. Sections 3 and 4 describe the graph structures and tools used by PPD to perform flowback analysis. Section 3 describes the static program dependence graph, built at compile time, which shows potential dependences between events in the program's execution. Section 4 describes the dynamic program dependence graph, built at debug time, which shows the actual dependences between events in the execution. Section 4 also describes how dynamic graphs are built by augmenting the static graphs with traces generated during execution and debugging. Section 5 presents the details of incremental tracing. Section 6 describes how flowback analysis is extended to parallel programs and how data races are detected. Section 7 presents some initial performance overhead results. We conclude with Section 8.

2. STRUCTURAL AND FUNCTIONAL OVERVIEW

Flowback analysis would be straightforward if we were to trace every event during the execution of a program. However, doing so is expensive in time and space. The user needs traces for only those events that may lead to the detected error.


The problem is that there is no way to know, before an error is detected, which traces the user will ask about during the debugging session; a debugger that generates traces during execution would have to trace every event, because anything might lead to the error that is finally detected. Tracing every event in this way is impractical: it would generate traces that most often will not be used, and for parallel programs the cost of generating the traces would introduce unacceptable execution-time overhead and distortions in the interaction pattern between processes. Reexecuting the program, to either reproduce the errors or incrementally produce traces, is also impractical, because parallel programs often lack reproducibility; the errors detected during one execution may not appear when the program is reexecuted.

Incremental tracing is the method we use to overcome these difficulties. The main idea of incremental tracing is to generate a small amount of coarse-grained trace, called the log, during the program's execution, and then, during the interactive debugging session, to incrementally fill the gap between the information contained in the log and the fine-grained information needed to do flowback analysis. The fine traces are produced by reexecuting selected parts of the program. At compile time, we use semantic analyses, such as interprocedural analysis and data-flow analysis, to help reduce the amount of information that must be generated during execution. In this way, we amortize the cost of generating traces over the compile time, the execution time, and the interactive debugging time.

We divide debugging with PPD into three phases: (1) the preparatory phase, (2) the execution phase, and (3) the debugging phase. There are two major components in our PPD system: the Compiler/Linker and the PPD Controller. The Compiler/Linker produces the object code and the files used in the following phases, and the PPD Controller oversees the execution and debugging phases, responding to the programmer's debugging requests. The execution phase ends, and the debugging phase begins, when the program halts due to either an error or user intervention.

2.1 Preparatory Phase

Figure 1 shows the preparatory phase. During this phase, a modified Compiler/Linker produces, along with the object code, the following: the static program dependence graph, which shows the (possible) data and control dependences among the components of the program; the program database, which contains information on the program text, such as the places where a variable is defined or used; and the emulation package, which is used during the debugging phase to generate the fine traces needed for flowback analysis. The object code itself contains the instrumentation that will generate the log during the execution phase.

2.2 Execution Phase

The object code generated during the preparatory phase plays the major role in the execution phase, during which it generates the normal program output and the log. Figure 2 shows the execution phase.

Fig. 1. Preparatory phase.

Fig. 2. Execution phase.

The log contains the information needed to record the state of the program at each logging point, such as the values of the variables that might be read or changed next. The log entries, called prelogs and postlogs, are generated during execution and are used, along with the emulation package, to generate the fine traces during the debugging phase. The log and the logging points are described in more detail in Section 5.

2.3 Debugging Phase

Figure 3 shows the debugging phase. The goal of the debugging phase is to build the dynamic program dependence graph, which records the detailed dependence information about the program's execution needed for flowback analysis. The PPD Controller oversees the debugging phase: it responds to the programmer's requests and assembles the dynamic graph from the static program dependence graph, the program database, the log generated during the execution phase, and the fine traces generated by the emulation package. The emulation package produces the fine traces by reexecuting parts of the program, starting from the values recorded in the log entries. Since the programmer might ask about only a few points in the execution, the dynamic graph is built incrementally, executing parts of the program and generating fine traces only as they are needed for locating the bug.

3. STATIC PROGRAM DEPENDENCE GRAPH

The static program dependence graph (static graph) shows the potential dependences between the components of a program, such as data dependences [24]

Fig. 3. Debugging phase.

and branch dependences (similar to control dependences [15]). The static graph is a variation of the program dependence graph introduced by Kuck et al. [25]. Program dependence graphs have been used as an intermediate program representation in numerous applications [15, 24-26, 36], and the variations of the program dependence graph can be categorized into two classes according to their applications. First, the program dependence graph is used as an intermediate representation for optimizing, vectorizing, and parallelizing transformations; the main concern of this class is to decide whether there exists any dependence between two sets of statements. Second, the program dependence graph is used to extract program slices [21]. A slice of a program with respect to variable v and execution point p is the set of all the statements that might affect the value of v at point p [38]. Such slices can be used for debugging and for integrating program variants [15, 34, 37, 38]. However, the common attribute of these two classes is that they use only information obtained from static analysis of the program; they do not use dynamic information obtained during execution. In PPD, we augment the static program dependence graph with the dynamic information obtained during execution and debugging to build the dynamic program dependence graph. The dynamic graph can be viewed as a dynamic program slice based on the actual statements executed during the execution.

Accordingly, the static graph in PPD differs in several ways from those of previous systems. The structure of the static graph is motivated by the following observations: First, the static graph should contain enough information to build the dynamic graph from a small amount of trace generated at execution time, because debugging efficiency should not be compromised by computing, at debug time, information that is easily determined at compile time. Second, dependence vectors [39], which effectively identify the data dependences in loops, are approximate characterizations of dependence computed at compile time. Although such information is essential for automatic loop parallelization [3], it is unnecessary in PPD, because the dynamic graph unrolls all loops. Since we have precise execution-time information (the actual paths of control flow, obtained from the trace), we need not approximate the direction of dependences at compile time. We, therefore, do not construct a static control flow graph. Finally, for each subroutine, we want to identify the sets of variables that might be used or defined by the execution of that subroutine. Such identification allows us to decide whether to show or to skip the execution details of a subroutine, depending on the details requested by the user.

In this section we describe a static graph consisting of two layers. The outer layer, called the branch dependence graph, shows the branch dependences, and the inner layer, called the data dependence graph, shows the data dependences within the blocks of the branch dependence graph. We discuss the two layers in detail. Interprocedural analysis is used in building the data dependence graph. With separate compilation, interprocedural analysis also allows us to avoid rebuilding the entire static graph from scratch when one or more modules of the program are modified. The separate compilation issue is described in detail in Section 3.5, where we describe how we use interprocedural analysis in building the static graphs.

3.1 Branch Dependence Graph

The outer layer of the static graph is the (static) branch dependence graph, which is developed from the control dependence graph [15]. This graph is always a tree and is built with syntactic analysis (i.e., at parse time). In Section 4.3 we compare the static branch dependence graph with the control dependence graph. The branch dependence graph consists of nodes called control blocks and branch dependence edges between these nodes. Figure 4 shows an example branch dependence graph. Such a graph is constructed for each subroutine in the program. A control block is identical to a basic block, except that labels (which are potential targets of branching statements such as goto) always delimit the start of a new control block. For example, to handle switch statements in C, we also treat a case statement as a label, since an implicit branch occurs when a case does not end with a break and is allowed to fall through to the following case. A leaf control block represents a block of statements in which the flow of control always enters at the beginning and exits at the end, and that is devoid of conditional or loop control statements.
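For example, the following fragment (our own illustration, not part of the paper's example program; cmd, x, and y are hypothetical variables) shows how case statements and labels delimit control blocks:

    switch (cmd) {
    case 0:                 /* a case label starts a new control block       */
        x = x + 1;          /* no break: an implicit branch (fall-through)   */
                            /* occurs into the next case                     */
    case 1:                 /* so this case must also start its own block    */
        y = x * 2;
        break;
    default:
        y = 0;
    }
L:                          /* an explicit label is assumed to be the target */
    x = y - 1;              /* of some goto, so it also delimits the start   */
    if (x > 0)              /* of a new control block                        */
        goto L;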

extern int g1, g2, g3;
extern int A[100], B[100];

wolf()
{
    int i, j, k, l, a, b;

S1:     if (g1 > 0) {
S2:         a = Add2(g1, g2);
S3:         g1 = 0;
        } else {
S4:         SubY();
S5:         a = g1;
        }
L1:
S6:     a = (A[i] = b);
S7:     b = A[j];
S8:     A[k] = B[l] + a + b;
L2:
S9:     a = g1;
S10:    g2 = a;
S11:    g1 = g2;
S12:    SubX(a);
S13:    g2 = g2 + g3;
S14:    while (a > 1) {
S15:        a = a - 1;
S16:        g2 = g2 * a;
        }
}

Fig. 4. Sample static graph. (The original figure pairs this source listing with its branch dependence graph: nonleaf blocks A, B, and D and leaf blocks C, E, F, G, and H, where each control block carries its IUSE, IMOD, USE, and MOD sets and a pointer to its data dependence graph.)

For programs without labels that are potential targets of branching statements such as goto, the branch dependence graph is identical to the abstract syntax tree [1] of the program, with the basic blocks being the leaf nodes of the tree. Thus, the graphs can be built at compile time without control-flow analysis. Since we do not perform control-flow analysis of the program, we simply assume at compile time that every label will be a target of at least one branching statement. This assumption sometimes results in overly fine-grained basic blocks, such as blocks F, G, and H in Figure 4. However, the benefit from not performing control-flow analysis easily offsets the small, additional overhead incurred by such pessimistic assumptions. Branching statements, such as goto, can affect the structure of dynamic graphs. In Section 4.3 we describe how the simple structure of the branch dependence graphs, combined with run-time traces, can handle these branching statements.

There are four nonleaf block types needed for C programs. The first type represents conditional statements, such as if or switch statements. In the absence of gotos (including implicit gotos, which occur when one case of a switch statement falls through to the following case), only one child of a conditional block in the static graph will execute. Block A in Figure 4 is of this type: during execution, either block C or block D will be executed. The second nonleaf block type represents loop control statements such as while or for. Execution of the descendant blocks may be repeated zero or more times depending on the loop control statement. Block B in Figure 4 is of this second type.

The third and fourth nonleaf block types do not correspond to any statement. The third type acts as a summarizing block for its descendant blocks and is used when its descendants constitute an e-block; an e-block is the unit of incremental tracing during debugging (described in Section 5.1). All of the descendants of a summarizing block execute in left-to-right order. The root of a static graph is a summarizing block, even if the flow of control can potentially exit out of the subroutine. The fourth type of nonleaf block is a dummy block. This block exists only as a descendant of a conditional block, to group together the blocks (if there are more than one) dependent on the conditional; the dummy block satisfies the condition that only one of the descendants of a conditional block will be executed. All of the descendants of a dummy block also execute in left-to-right order. Block D in Figure 4 is a dummy block. Leaf blocks G and H are defined because of the labels "L1" and "L2"; the flow of control can potentially enter at these points. (We introduced the labels to show how labels affect the static graph, although there is no goto statement in the example program.)

Associated with each control block (except dummy blocks) are four sets of variables—the IUSE, IMOD, USE, and MOD sets—and a data dependence graph. The IUSE set is the set of variables that might be referenced before they are defined by a statement in this block; it is the set of upward-exposed used variables [1] of this block. The IMOD set is the set of variables whose values might be defined by statements in this block. The USE set is the set of variables that might be used before they are defined in this block or in a subroutine called from this block (following the transitive closure of calls). The MOD set is similarly the set of variables that might be defined in this block or in a subroutine called from this block. Whereas the IUSE and IMOD sets are determined locally by inspecting the statements belonging to a block, the USE and MOD sets can only be determined by interprocedural analysis.¹ The USE and MOD sets are described in more detail in Section 3.5, on interprocedural analysis.

The branch dependence graph for a subroutine can have several summarizing blocks, one for each e-block in the subroutine. The four sets (the IUSE, IMOD, USE, and MOD sets) for a summarizing block are the unions of the same sets of all of the descendant blocks that constitute the e-block.

¹In previous papers [31] we used different terminology for these sets as follows: IMOD was previously referred to as DEFINED, IUSE as USED, MOD as GDEFINED, and USE as GUSED. We now use terminology from Banning [8].

However, they do not contain variables that cannot be accessed outside the corresponding e-block, except for upward-exposed static variables. For example, those four sets for a subroutine do not contain variables local to the subroutine, although static variables are treated the same way as global variables. The program database [31] contains additional scope information on each variable, telling whether a given variable is a global variable, a static variable local to a subroutine, or a formal parameter of a subroutine. It also tells whether a given variable is a shared variable of a parallel program. (Sequent C has two additional key words to support parallel programming [35]: shared and private.) The variables in the IUSE and IMOD sets of a summarizing block are the variables that will be written to the log (described in Section 5) at execution time.

The structure of the branch dependence graph and the four sets of used and defined variables allows for easy identification of the sets of variables that might be used and defined during the execution of an e-block. It also allows for easy identification of which e-blocks might use or modify a given variable. Section 5 discusses how these data structures work together with the log and incremental tracing.

incremental 3.2

of which

data

might

structures

use or modify

a given

work

with

together

control

block

Graph

(except

for

summarizing

and

dummy

blocks)

has a data

dependence graph that shows only the dependence between belonging to that block. Data dependence between different resolved at debug time and appear in the dynamic graph. Figure sample

control

block

and its data

dependence

ence graph has two node types: singular node represents an assignment statement, such as an if or a switch, constant

variable.

the log and

tracing.

Data Dependence

Each

e-blocks

how these

of the sets of variables that of an e-block. They also allow

used

on the

or a branch

right-hand

side

graph.

The (static)

statements blocks are 5 shows a data

depend-

and subgraph nodes. The singular a control predicate in a statement

statement

such as goto

of a statement,

or exit.

we create

For a

a constant

node, which is a subtype of the singular node. The subgraph node represents the call site of a subroutine and is a way of encapsulating the inside details of such subroutines. There is one static graph for each subroutine. Each node of the data dependence graph is labeled with the statement number and either an identifier or an expression. The data dependence graph has three edge types: data dependence, flow, and

linking

edges.

[4, 24]. (A statement uses output represented

The

data

dependence

Sz has a true

edge represents

dependence

of S1. ) A flow edge from by n~ immediately follows

on another

a true

dependence

statement

when n, to nJ is defined the event represented by

SI if S’z the event n, during

execution; it shows the control flow of the program. The linking edge helps resolve the dependence that can only be determined at execution time, for example, deciding which array element is actually accessed when the array index is a variable. Linking edges are described in more detail in Section 3.4. The top of the control block shows the variables in the IUSE set of the block, block. ACM

and the bottom of the block shows the variables in the IMOD set of the The IUSE set of a block is the set of upward-exposed used variables of Transactions

on Programming

Languages

and Systems,

Vol.

13 No. 4, October

1991

Fig. 5. Basic block and its data dependence graph (control block E in Figure 4).

A data dependence edge from the IUSE entry for a variable into a node N shows a dangling data dependence in this block, meaning that the value of the variable has not been defined in this block before the statement represented by node N. A data dependence edge into the IMOD entry for a variable shows the last statement in the block that modifies the variable.

All of the nodes corresponding to the statements in a control block are totally ordered, because the statements belonging to a block are sequential. This total ordering shows the execution order of the events represented by the nodes and is represented by the flow edges connecting the nodes, so we can say that a node in a data dependence graph is after or before another node in the same control block. We will not explicitly show the flow edges in the figures in this section. Interblock dependences (dependences between two statements belonging to different blocks) are not resolved at compile time; they are resolved during debugging and are recorded in the dynamic graph (described in Section 4). Ordering the events belonging to different processes is important in debugging parallel programs and is described in Section 6.

3.3 Parameters to Subroutines

To map between formal parameters and actual parameters of a subroutine call during debugging, we create a parameter node (a variant of the singular node) for each actual parameter passed to a subroutine. Each parameter node is labeled with "%" followed by the parameter position (%0 represents a function return value). Figure 6 shows the static graph of control block C in Figure 4 and shows how actual parameters are mapped to the formal parameters of a called subroutine.

subroutine.

Arrays and Linking Edges index

to identify

values

are usually

the array

elements

ACM

Transactions

unknown that

on Programming

will

at compile actually

Languages

time,

so it is not possible

be accessed. and Systems,

Vol.

Our

apprc)ach

13 No. 4, October

is

1991.

502

J.-D. Choi et al.

.

IU

S2:

a

=

Add2(gl,

gl

cj2

E

g2);

%1

%2

s2: Add2

RJSE

III:

lM: IMOD

%0

O

: singular node s2: a

8

I Fig.

6.

I

Data

to supply type, select nodes data

dependence

enough

dependence

E

: subgmph node

graphs

for parameter

information

can be quickly

in

the

mapping

static

determined

(control

graph

at debug

block

C in Figure

so that time.

array

4)

reference

We use a new

edge

the linking edge, and two variants of the singular node, the index and nodes. Index nodes show the indexes used in array accesses, and select potential represent read-accesses of an array. Linking edges represent dependence and are used during debugging to locate the actual

dependence quickly. To represent an

assignment

to

an

array

created. Nodes “ s6:A” and “ s8:A” in Figure with assignments to scalar variables, this edges from of the

the

nodes

assignment.

representing

However,

for

element,

the variables array

a singular

7 are examples node contains

node

used in the right-hand

assignments,

is

of such nodes. As data dependence

a linking

edge

side is then

added, from the most recent node in the control block that writes the same array, to the assignment node. If there are no previous writes to the same array in the control block, then a special IUSE set entry is made for the array, and a linking edge is added from this entry. Finally, an index node is created for each array index and is labeled with “70” followed by the index position (similar to a parameter node). A data dependence edge is added from each index node to the assignment node. For example, node “ s6:A” in Figure 7 contains three incoming value, one data dependence

edges: one data dependence edge for the edge for the variable used in the right-hand

index side

of the assignment, and a linking edge from the IUSE set entry for the array being modified (since there were no previous modifications of array “A” in the control block). Because a definition of an array element is a preserving definition [1], which fails to prevent any uses reached by the definition from being upwardexposed,

ACM

a use of an array

Transactions

on Programming

element

Languages

always

creates

and Systems,

Vol.

an entry

in the IUSE

13 No. 4, October

1991

set of

Debugging Parallel Programs with Flowback Analysis

ru

s6:

a -

(A[i]

S7:

b -

A[j];

s8:

A[k]

-

i

A

b

L

i

1

a

b

A

j

1

B

.

503

k

b);

-B[l]+a+b;

[u: IUSE IM: IMOD ~ ------------

: data dependence edge ●

: linking edge

IM Fig. 7.

Data dependence graph with array and linking

the control the control

edges (control

block Gin Figure

block. We also insert an entry if the first reference block is a definition of an element, in anticipation

use of the array. A read from an array element node is created to represent the node “ S7 :B”

in Figure

side of statement from the index

is handled read. For

7 represents

s7. This node and

the array

4).

to the array in of a subsequent

identically except that a select example, the select node above access “Aljl”

on the right-hand

select node has an incoming data dependence an incoming linking edge from node “s6:A”,

edge the

most recent modification of array “A” in the control block. The above mechanisms are similar to the ideas used for array-related dependence in 1341. The actual data dependence for each array read are determined during debugging and are reflected in the dynamic graph. Once the fine traces for the e-block containing an array read are generated, the index values of all array accesses in that e-block will be known. The linking edges are followed backward, from the select node, until an assignment to the same array Iocation is found. A data dependence edge can then be added in the dynamic graph from this assignment to the select node. If no such assignment is found

ACM

Transactions

on Programming

Languages

and Systems,

Vol.

13 No. 4, October

1991.

504

J.-D. Choi et al.



(the IUSE set entry for the array was reached), then a dangling dependence exists for the array read. The dangling dependence can then be resolved as described in Section 5.

3.5 Interprocedural Analysis and Data Dependence Graph

The USE and MOD sets, computed by interprocedural analysis, allow us to identify more precise (potential) dependence information than the worst-case assumption that every global variable in the program is possibly used and defined by each call to a subroutine. In this section we describe the use of the USE and MOD sets.

Building the data dependence graphs with interprocedural analysis is done in two steps. The first step is done at compile time without interprocedural information, building the pregraph form of the data dependence graphs. The graphs in Figures 5-7 are all pregraphs. The second step is done at link time, producing the postgraphs by modifying (if necessary) the pregraphs with interprocedural summary information. When several modules of a program are recompiled with separate compilation, we need to rebuild only the pregraphs of the recompiled modules. Only those postgraphs that contain calls to subroutines whose USE or MOD set has changed need to be built again.

Figure 8 shows the pregraph and the postgraph for control block H in Figure 4. We outline how to build the pregraph and the postgraph in this section. Detailed algorithms (heuristics) for building these graphs appear in [10]. Our approach to including interprocedural information in building the pregraph is as follows: When we meet a subroutine call in building the pregraph of a control block, we assume that all of the global variables written so far in the control block might be written by the subroutine. Then, we create a linking edge for each such global variable, out of the most recent node that wrote the variable and into the subgraph node representing the subroutine call. We use the linking edges to identify the parts of the pregraph that might need to be modified to produce the corresponding postgraph. Our approach is certainly more pessimistic than approaches that use the MOD set of a procedure, computed interprocedurally, as the basis of determining what might be modified by a subroutine call [2]. However, our approach is simple to implement and not overly pessimistic, in that we do not assume that all of the global variables, but only those that are used or defined in a control block, might be modified during the execution of a subroutine called in the control block. We need more experiments with the working prototype under construction before we can evaluate the effectiveness and efficiency of this approach.

When building the postgraph, the interprocedural summary information is reflected in each subgraph node in the following ways: First, we create a data dependence edge into the subgraph node for each global variable that is in the USE set of the subgraph node. Second, we create a data dependence edge out of the subgraph node for each global variable that is in the MOD set of the subgraph node. Finally, we create a linking edge into the subgraph node for each global variable that is in the MOD set but is not in the USE set of the subgraph node.

Fig. 8. Data dependence graph before and after interprocedural analysis (control block H in Figure 4). The interprocedural summaries are USE(SubX) = {g2} and MOD(SubX) = {g1, g3}.

The linking edge is needed because USE and MOD are sets of variables that might be accessed during the procedure call. For example, if during debugging we discover that "SubX" (see Figure 8) does not actually modify "g1", we need to locate the most recent node before "SubX" that modifies (or might modify) "g1", which in this example is "s11:g1". The linking edge from "s11:g1" to "s12:SubX" serves this purpose. (Note that the linking edge was similarly used for arrays in the previous subsection.)

Figure 8 shows how the postgraph is constructed from the pregraph and the information from interprocedural analysis. First, the linking edge of "g2" into the subgraph node in the pregraph is changed into a data dependence edge, because "g2" is in USE(SubX). Second, the data dependence edge of "g2" out of the subgraph node into the node "s13:g2" is disconnected from the subgraph node and reconnected into the node "s10:g2", because "g2" is not in MOD(SubX). The reconnection is done by following the "g2" dependence through the subgraph node. Third, there are two additional edges out of the subgraph node: one into the MOD entry for "g3" and the other into node "s13:g2". These edges are added because "g3" turned out to be in MOD(SubX). Last, the data dependence edge from the IUSE entry for "g3" is deleted. The linking edge out of '

Fig. 13. Log intervals of subroutine Sub4, showing log generation and debugging-time activities.

5.2 E-Blocks

We now describe how to divide the program into e-blocks. The only condition for several consecutive lines of code to form an e-block is that there is a single entry point. Whenever control is transferred from one e-block to another, the prelog is made at the entry point of the e-block, where the control is transferred in, and the postlog is made at the exit point, where the control is transferred out of the e-block. One natural candidate for an e-block is the subroutine, since the entry point and the exit points of a subroutine are well defined. (Actually, an e-block could be any node of the control dependence graph in PDG [15], since the entry and exit points of each node in PDG are well defined.)

The size of e-blocks is crucial to the performance of the system during the execution and debugging phases. In general, if we make the size of the e-blocks large in favor of the execution phase, the debugging phase performance will suffer. On the other hand, if we make the size of the e-blocks small in favor of the debugging phase, execution phase performance will suffer. The number of logging points should be small enough so as not to introduce unacceptable performance degradation during the execution phase, but it should also be large enough so as not to introduce unacceptable time delay in generating fine traces during the debugging phase. Consider, for example, the case in which the size of a subroutine is very large. Though the size of a subroutine has no direct relationship to the time needed to execute it, we can act conservatively and construct several e-blocks out of such a large subroutine. Loop constructs, even though small in size, may require long execution time and, thus, introduce unacceptable time delay in generating fine traces. Currently, the PPD compiler constructs one e-block from each loop. However, the compiler constructs only one e-block for loops from the outermost loop of multiply nested loops. Defining e-blocks for loops allows the debugging phase to proceed without excessive time spent in reexecuting the loops. Still, if the user is interested in the execution details inside such loops, we can reexecute the e-blocks corresponding to the loops.

Three elements can affect the program behavior of an e-block: the initial state as recorded by the prelog, the code of the e-block, and input statements

in the e-block. We need to accommodate input statements in an e-block to make the behavior of the e-block during debugging the same as that during execution. We can make each input statement an e-block whose IMOD set consists of the variables affected by the input statement.
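As an illustration of the logging around an e-block, consider a sketch of the instrumented form of the example subroutine wolf (this is our own sketch, not PPD's generated code; prelog_write and postlog_write are hypothetical run-time routines, and the variable lists are the IUSE and IMOD sets we read off Figure 4):

    extern int g1, g2, g3, A[100], B[100];

    void wolf_instrumented(void)
    {
        /* prelog: record the upward-exposed (IUSE) variables on entry,
           so the e-block can later be reexecuted from this state */
        prelog_write("wolf", &g1, &g2, &g3, A, B);

        wolf();        /* original e-block body */

        /* postlog: record the possibly modified (IMOD) variables on exit */
        postlog_write("wolf", &g1, &g2, A);
    }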

5.3 Log Optimization

Small and frequently called subroutines can be a problem. If we make an e-block out of each small subroutine, the amount of logging done during the execution phase may be large enough to introduce unacceptable performance degradation. To avoid this problem, it may be better not to make e-blocks out of subroutines that do not contain subroutine calls (i.e., subroutines that correspond to leaf nodes in the subroutine call graph). If an e-block is not formed from such a subroutine, then the subroutine itself does not perform any logging. Instead, the e-blocks that call this routine (its parent e-blocks in the call graph) perform its logging. However, a subroutine that either contains a loop or contains accesses to a static variable (in the C language) is not eligible for such optimization. This process can be applied recursively to the parent e-blocks and can continue any number of levels (as specified by the user) up the call graph until an e-block that is ineligible for optimization is reached.

5.4 Locating Log Intervals for Incremental Tracing

When the debugging phase starts, we generate fine debugging-time traces for the last log interval, that is, the log interval that contained the last statement executed. (The last log interval usually lacks the postlog when the execution halted due to an error or to user intervention.) This allows the initial dynamic graph to be constructed. From then on, there are three cases when we need to generate fine traces for a new log interval: (1) when the user wants to know the details of the dependence of a parameter passed from a calling subroutine, (2) when the user wants to know the details of a hidden dependence edge, that is, a dependence edge that either terminates into or comes out of a subgraph node, or (3) when the user wants to know the details of a dangling dependence (a dependence for a variable that is read in an e-block before it is written).

When the user wants to know the detailed dependence of a parameter, we can easily locate the log interval needed to generate fine traces; the log intervals are nested as in Figure 14, and the caller's log interval is the one enclosing the current log interval. When the user wants to know the detailed dependence of a function return value, we can also easily locate the needed log interval; the callee's log interval is one of those intervals nested in the current log interval, and log intervals at the same nesting level are generated in the execution order of the called subroutines.

When the user wants to know the details of a hidden or dangling dependence, we need to identify the log interval needed to generate the fine traces that show the details. To facilitate identifying such log intervals, we obtain the IMOD set of each e-block at compile time and keep it as part of the program database [31]. We also keep in the program database, for each variable that

J -D. Choi et al.

.

LOG

PRELOG(l)

extern

— e4

g2,

g3;

k 1

POSTLOG(3) — e2 PRELOG(4) — el

,.X

‘.. ._

k

POSTLOG(4) — el

,.

PRELOG(5)

— d

PRELOG(6)

— e2

PRELOG(7)

— e3

TABLE

H

PRELOG(3) — e2

,, ‘.. ..

gl

el, e2, e3

g2

el, e2, e4

g3

el, e3, e4

-1

,., -

..-* ,,, > ,’ ,, ,. * I ,< 4.. .‘, .. ~. --.. . . ... . . . -. -. -. ,,,

E-BLOCK

k 1

POSTLOG(2) — el

,,,

glr

PRELOG(2) — el

,.~ ‘.. .-

.,

int

POSI’LOG(7)

— e3

POSTLOG(6)

— e2

k -1 1 -+-1

k --–-–

~

~

-–-–

;

-t-+-;-7

E-POINTERS Fig.

14,

LOG

might be accessed by more than one e-block, the list of e-blocks that contain the variable in their IMOD sets. We call the list the e-block table. The e-block table in Figure 14 shows an example for three variables: "g1", "g2", and "g3".

Figure 14 also shows an example log file. Log entries generated by the same e-block form a linked list; each postlog has two pointers: one pointing to its corresponding prelog, and the other pointing to the most recent postlog previously made by the same e-block. The e-pointers are an array of pointers, one for each e-block, to the last log entry made by that e-block.

To locate the log interval that contains the most recent modification of a variable, in either the case of a hidden dependence or the case of a dangling dependence, we first retrieve, from the e-block table, the list of e-blocks that potentially modify the variable. We then locate the most recent log entries of these e-blocks by using the e-pointers, and we follow the linked lists of log entries backward to find the log interval in which the most recent modification of the variable finally occurred; the updated value of the variable is stored in the postlog of that interval. Because the IMOD sets record only potential modifications, the e-block that produced a log interval may not have actually modified the variable during that interval; in that case the emulation package must generate fine traces for the interval, and this process may need to be repeated before the modification is located [10].
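The log organization of Figure 14 can be sketched with the following structures (our own illustration; the field and array names are hypothetical):

    #define N_EBLOCKS 8                 /* hypothetical number of e-blocks */

    struct log_entry {
        int               e_block;      /* e-block that made this entry    */
        int               is_postlog;   /* prelog or postlog               */
        struct log_entry *prelog;       /* postlog only: matching prelog   */
        struct log_entry *prev_same;    /* postlog only: previous postlog  */
                                        /* made by the same e-block        */
        /* ... recorded variable values follow ... */
    };

    /* e-pointers: for each e-block, the most recent log entry it made */
    struct log_entry *e_pointer[N_EBLOCKS];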

When we construct more than one e-block out of a subroutine because of debugging-time efficiency considerations, we sometimes need to locate an e-block that might write a local variable. Unlike global variables, a variable local to a subroutine has an instance for each execution instance of the subroutine, and we should not use log entries generated by different execution instances of the subroutine for the detailed dependence of a local variable.

5.5 Arrays and the Log

For an e-block with array accesses, it is not possible to compute IUSE and IMOD sets that contain only those array elements that are actually accessed in the e-block. One approach is to generate a log entry for the entire array even if only a few array elements are accessed. A second approach is simply to trace every array access. However, both approaches can potentially generate a large amount of traces during execution. Our solution to this problem is as follows: We distinguish two types of array accesses: systematic accesses and random accesses. We say there is a systematic access to an array if the array is accessed in a loop and the array index has a (possibly transitive) data dependence on the loop control variable. With a systematic access, we regard the entire array as accessed and generate a log entry (as usual) for the entire array. We regard all of the other types of accesses to arrays as random accesses and generate a special log entry for the array element accessed: the array index and the value (read or updated value) of the array element at the time the access is made.
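For example (our own illustration of the two kinds of accesses):

    /* Systematic access: the index i depends on the loop control
       variable, so the entire array A is regarded as accessed and one
       ordinary log entry is generated for the whole array. */
    for (i = 0; i < 100; i++)
        sum = sum + A[i];

    /* Random access: k has no dependence on a loop control variable,
       so a special log entry records only the index k and the value
       written to A[k]. */
    A[k] = sum;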

6. PARALLEL PROGRAMS AND FLOWBACK ANALYSIS

The discussion so far has described mechanisms to implement efficient flowback analysis for sequential programs. In this section we discuss the mechanisms for extending flowback analysis to parallel programs. For parallel programs, data dependences may exist across process boundaries. Locating such data dependences involves constructing an abstraction of the dynamic graph that contains the events belonging to all processes and then ordering the events in this graph. With additional logging of shared variables, the incremental tracing scheme described in Section 5 can then be used to establish the dependences between processes. In addition, potential data races in the program execution can be detected.

6.1 Parallel Dynamic Graph and Ordering Concurrent Events

To apply flowback analysis to parallel programs, we construct an abstraction of the dynamic graph, called the parallel dynamic graph, that contains the events belonging to all processes in the program execution. To this graph we add edges that allow us to determine the order in which these events executed. From this ordering, data dependences can be established across process boundaries, and data races can be detected. We now describe how to construct the parallel dynamic graph and what run-time information must be recorded to do so. We then show how this graph orders the events belonging to different processes, and how this ordering allows data dependences, and data races, to be detected.

6.1.1 Parallel Dynamic Graph. The parallel dynamic program dependence graph (or parallel dynamic graph) is an abstraction of the dynamic graph that shows the interactions between processes while hiding the detailed dependences of local events internal to each process. This graph contains only one node type, the synchronization node, and two edge types, the synchronization edge and the internal edge (Figure 15 shows an example of a parallel dynamic graph). A synchronization node is constructed for each synchronization operation in the program execution. A synchronization edge from one node to another indicates that the first synchronization operation executed before the second. An internal edge abstracts out all events (belonging to the same process) that executed between the synchronization operations connected by the edge. For example, in Figure 15 all of the events of process P1 that executed before event n1,1 also executed before all those events of process P2 that executed after event n2,2. The synchronization edge between n1,1 and n2,2 can be viewed as a generalized flow edge that spans the two processes.

Fig. 15. Example parallel dynamic graph.

We now describe how to construct synchronization edges for programs that use semaphores. Other synchronization primitives (such as messages, rendezvous, etc.) can also be handled [10]. In general, we construct a synchronization edge between two nodes if we can identify the temporal ordering between them. We say that the source node of an edge is the node connected to the tail of the edge, and the sink node of an edge is the node connected to the head of the edge. Semaphore operations, such as P and V, are used in controlling accesses to shared resources by either acquiring resources (through a P operation) or releasing resources (through a V operation). We construct a synchronization edge from each node representing a V operation to the node representing the corresponding P operation on the same semaphore. Each V operation, which releases resources, is paired with the P operation that acquires those released resources. There are two cases to be considered. The first case is where the second process tries to acquire the resources before the first process releases them; the second process thus blocks on the P operation until the V operation of the first process. The second case is where the first process releases the resources before the second process tries to acquire them; the second process does not block on the P operation in this case. In both cases, we define a source node for the V operation and a sink node for the corresponding P operation.

The operations on a semaphore variable are serialized by the system that actually implements the semaphore operations, and identifying a pair of related semaphore operations is done by matching the nth V operation to the (n + i)th P operation on the same semaphore variable, where i (≥ 0) is the initial value of the semaphore variable. Additional logging is necessary to record the information required to determine this semaphore pairing. Each semaphore operation generates a log entry (for the process it belongs to) containing a counter that indicates how many operations on the given semaphore have previously been issued. The synchronization edges can then easily be paired and constructed from these log entries.

6.1.2 Ordering Events. In the parallel dynamic graph, each internal edge represents the set of events bounded by the surrounding synchronization operations. The order in which two events executed can be determined if there is a path between the two internal edges that represent those events (if no such path exists, then the actual execution order cannot always be determined). We partially order the nodes and edges of the parallel dynamic graph by defining the happened-before relation [27], →, as follows:

(1) For any two nodes n1 and n2 of the parallel dynamic graph, n1 → n2 is true if n2 is reachable from n1 by following any sequence of internal and synchronization edges.

(2) For two edges e1 and e2, e1 → e2 is true if n1 → n2 is true, where n1 is the sink node of the edge e1, and n2 is the source node of the edge e2.

There are several approaches to ordering events in a parallel program execution [10, 14, 16, 17, 27, 32]. Although the ordering between two events can be determined by searching for a path in the graph, a more efficient representation of the happened-before relation can be constructed that allows the order between any two events to be determined in constant time. Such a representation is constructed by scanning the graph and computing, for each node, vectors that show the latest (or earliest) nodes in all processes that happened before (or after) that node [10].

520

6.1.3 have

J.-D. Choi et al.

. Data

been

Races.

ordered,

Once the events flowback

in the execution

analysis

can

of a parallel

be performed.

program

Dependence

that

span process boundaries can be successfully located when the execution is data-race free. Once these dependence are located, the incremental tracing scheme described in Section 5 can be extended to reexecute e-blocks belonging to different processes, allowing the dynamic graph to be constructed. We now

show

how

subsequent processes When

to determine

subsections,

when we

and how to extend the

user

the

show

execution

how

incremental

requests

to

see the

to

contains

locate

tracing

data

races.

dependence

to parallel

dependence

for

that

In the span

programs.

a read

of a shared

variable, “SV”, we must locate the event that assigned the value to “SV’> that was read. Locating this event involves finding all events that wrote “SV” and determining their order relative to the read event. However, when two events both access “SV” and are unordered by the happened-before relation, not enough information is available to determine which access occurred first. If at least one of the accesses is a write, A data race is usually a program race is said to exist.

then bug,

two events

both

at least

and either

did execute

Definition A ~ (e2 +

access a common

6.1.

shared

concurrently

Two

edges

variable

(that

or had the potential

el and

ez are

a potential and exists

one modifies)

of doing

simultaneous

edges

data

when

so. if ~ ( el -

e2)

cl).

Definition 6.2. READ-SET (e,) is the set of the shared variables read in e,. WRITE_SET( e,) is the set of the shared variables written in edge e,.

edge

Definition

6.3.

the following

Two simultaneous

conditions

(1) WRITE_

SET(el)

fl WRITE_

(2) WRITE.

SET(el)

(l READ.

(3) READ_

SET(el)

Definition

6.4.

of simultaneous To

(1 WRITE A program

when

el and

SET(e,)

ez are

data-race

free

if all

the

= @.

SET(e,)

= @.

_SET(e,)

= @.

execution

edges in the execution

determine

edges

are true:

program

is said to be data-race are data-race execution

free

if all pairs

free.

is data-race

free,

additional

tracing must be performed to record the shared variables that are read and written by the events represented by each internal edge. For this purpose we maintain bit-vectors (representing basic blocks) during execution and set a bit every time execution enters a basic block [5]. The size of these bit-vectors is computed at described in the tors, the sets of determined. To trace each array arrays (described flowback analysis,

compile time (by inspecting the simplified static graph, next subsection). From the run-time trace of these bit-vecscalar shared variables that were read and written can be determine which shared array elements are accessed, we access. Our mechanism for logging randomly accessed in Section 5.5), which must be employed anyway for will provide the necessary information. However, for

ACM Transactions on Programming Languages and Systems,Vol. 13 No 4, October 1991

Debugging Parallel Programs with Flowback Analysis systematic

accesses,

we must

Since only the access type value,

optimizations

overhead.

For

read

instead

trace

or write)

can be performed

example,

regions of the array that single record, resulting fraction

additionally

(either

to reduce

of writing

each

must

shared

. array

be recorded,

the associated

a trace

521 access.

and not the

execution-time

record

for

each

access,

were accessed can sometimes be summarized [61 by a in a trace whose length is proportional to a small

of the number

of array

elements

The execution can be analyzed ways. Either the entire execution

accessed.

for the presence can be checked

of data races in one of two for data races at one time,

or data races can be detected only when the user follows back dependence. In either case, we can only detect when a program execution is data-race free. When an execution is not data-race free, a set of potential data races (between edges that are not data-race free) is reported. Only potential data races

are

reported

necessarily

mean

because that

when

two

edges

all of the events

are

comprising

simultaneous

it

does not

the edges executed

concur-

rently or had the potential of doing so. Rather, it means that the program’s explicit synchronization did not prevent the events from executing concurrently;

accidental

still

prevent

always

synchronization

them

detects

from

data

(through

executing

races when

they

the

use of shared

concurrently.

However,

exist

reports

and only

variables) this

a data

at least one occurs [32, 331. This data-race detection scheme is similar to other methods 33], with the exception of pairing the P and V operations.

6.1.4

Data

Dependence

for Parallel

Programs.

When

see a dependence for a read of a shared variable, most recent modification to that variable. This finding

the event

is the one that the

that

wrote

happened-before

assigned “SV”

the value

that

relation.

is most

To locate

this

race when

[5, 13, 18, 32,

the user requests

“SV”, we must locate dependence is located

to “SV” recently

can

approach

that

was read.

ordered

write

event,

This

before the

to the by

event

the read

latest

by

edge

in

each process that happened before the edge containing the read is located. These edges give a boundary beyond which all events executed concurrently with or after the read event. Each process in the parallel graph is then scanned backward from this boundary to find an edge that modified “SV”. The

ordering

executed

last.

the

event.

read

of all A data

such

write

events

dependence

A unique

write

is examined

can then event

to determine

be drawn

is guaranteed

from

this

which last

to be found

is read

by

process

p~ and

is modified

by

processes

PI

to

if no data

races involving “SV” exist (unless, of course, “SV” was uninitialized). Figure 16 shows an example of a parallel graph in which a shared “SV”

one

event

and

variable Pz.

establish the data dependence edge for “SV”, the most recent modification “SV” that occurred before the read must be located. If events belonging edges ez ~ (the edge emanating from node n2, 1) and el, o (the topmost edge then a data dependence process ‘pl) are the only modifications of “SV”,

TO of to of is

established between the event ez, ~ that modified “SV” and the event in e3, 1 that read “SV>’. If, for example, there exists another event that modified “SV” in any of the edges el, ~, el, z, ez,2, e2, ~, or ez,A (i.e., edges simultaneous ACM Transactions on Programming Languages and Systems,Vol. 13 No. 4, October 1991.

522

J.-D. Choi et al.

.

PI

P3

P2

el,o e 2,o e3,0

ez,l

Fig.

16.

Dependence

that

span

‘?

el,l

pro-

SV=b ;

cesses.

e2,2

n2,3 e2,3

e3,1 > n 2,4

nl,z I

e 2,4 el.2

to ea, ~), then data 6.2

we cannot

race is reported Incremental

tell

which

Tracing

for Parallel

Our implementation of incremental the reproducibility of the debugged mental tracing to shared-memory ity.

Our

subset

solution

of the

nization

operations

additional 6.2.1

fied

static

logging Simplified

static

graph,

event

actually

modified

“SV”

last,

and a

to the user.

uses a graph graph

that

between

Static

tracing described in Section 5 relied on program. We now discuss applying increparallel programs that lack reproducibil-

called abstracts

processes.

is required

consider

Programs

Graph.

to support

the

simplified out

From

this

graph, except

graph

incremental

To motivate

the example

static

everything

we

is a

synchro-

determine

what

tracing.

the construction

shovvn in Figure

which the

of the simpli-

17, which

contains

a

subroutine that accesses a global variable named “SV”. The subroutine also constitutes an e-block. The statement indicated by the arrow is the first statement that accesses the variable “SV’> in this subroutine. In the case of a sequential program, we construct a prelog that saves the value of “SV” at the beginning of the subroutine. The value of “SV’> will not be changed until it is first accessed in the statement indicated by the arrow. Hence, one prelog and one postlog are sufficient to obtain reproducible behavior when reexecuting parts of sequential programs during debugging. However, now consider the case of a parallel program. If “SV)’ is a shared variable, we cannot guarantee that the value of “SV” saved in the prelog at the beginning of the subroutine will be the same as when “SV” is first read; other processes may have changed the value of “SV>’ between these two moments. Reexecution of this e-block may therefore perform a different ACM Transactions on Programming Languages and Systems,VO1 13 No. 4, October 1991

Debugging Parallel Programs with Flowback Analysis

SubB

.

523

ENTRY

()

{ int

a,

if

(p==. if

b,

c,

p,

..)

{

?

q;

..)

(q==. . .

}

llse . . .

{

Subc }/*

q

*/

() ;

L?6

q*/

else +

/’

{

/*

P

*/

SV=a+b+SV; SubA()

;

. . . )

}/* /*

p*/ SubB

*/

EXIT



: branching node

O

:non-branching

7

Thisnode

o

This nodecomesponds Fig. 17.

node

corresponds tothesubroutine

Subroutine

computation than was originally more run-time information must parallel

programs.

Such

totiesubroutine

callof

SubA

cdlof

SubC

and its simplified

static graph

performed during execution. In general, be recorded to ensure reproducibility of

additional

information

is used

to restore

the

pro-

gram state for read-accessed shared variables. The simplified static graph allows us to determine which shared variables must be recorded and where in the program they should be logged. In our examples we only consider semaphore

operations;

synchronization The simplified

however,

primitives. static graph

this

approach

is a subset

only flow edges and nodes that (such as if or case statements)

can be generalized

of the

static

graph

represent either possible or semaphore operations

that

to other contains

control transfers (Figure 17 also

shows the simplified static graph for subroutine SubB). Any subgraph node representing a subroutine that may perform a semaphore operation during its execution (or during the execution of any subroutine that may be transitively called by it) is treated as a semaphore operation. The simplified static graph therefore contains only branching nodes, which represent possible and non branching nodes, which represent possible control’ transfers, semaphore operations. 6.2.2 Synchronization the additional logging

for

Units and Additional shared variables, the

Logging. simplified

To static

generate graph is

ACM Transactions on Programming Languages and Systems,Vol. 13 No, 4, October 1991

524

J.-D. Choi et al.

.

partitioned to record

into synchronization units, which identify which and where in the program they should be logged.

Definition

6.5.

reachable

A

from

without

passing

synchronization

unit

a given

nonbranching

through

another

consists

node

nonbranching

The sets {el, e2, e3, e5, efj, ‘a, eg } , {e., each constitute a synchronization unit. The object code generates an additional synchronization accessed

unit

inside

the

for

those

generated

for the write-accessed as the

unit,

shared

synchronization

tion

regular

of all

the

e,},

and

e,,

at the that

There

static

e,}

are

graph

in Figure

beginning

are

17

of each

potentially

read-

is no corresponding

variables

generated

edges that

simplified

{e,,

prelog

unit.

the

variables

node.

variables

shared

logs

in

shared

postlog

at the end of a synchroniza-

at the

beginning

and

end

of the

e-block contain the values of both shared and nonshared variables. The additional prelog of the read-accessed shared variables is used to ensure repeatable reexecution of the events in the synchronization unit. As long as there were no data races during for ensuring repeatable execution 7. PERFORMANCE This

section

presents

measurements

time

the

code generated

Sequent time

of application

Symmetry

trace

size.

the additional prelog during debugging.

will

suffice

MEASUREMENTS

execution object

execution, behavior

by the

C Compiler. There

of the

programs. PPD

overhead

compiler

with

We also present

is a trade-off

caused

We compare

between

the that

by PPD

execution generated

measurements

the

amount

on the time

of

by the

of execution-

of trace

generated

during execution time and the amount generated during debug time. The trade-off is based on selecting the size and location of e-blocks. Our current heuristics for making this selection are quite simple, so the performance numbers give only an initial indication of the cost of using PPD. We present SH.PATH

measurement

- 1, SH-PATH-2,

results

of five

and CLASS.

test

SORT

programs:

SORT,

sorts a vector

MATRIX,

of 100 integers

using an Insertion Sort algorithm, whose time complexity is 0( nz ). MATRIX multiplies two square matrices of integers into a third matrix. The size of each matrix, for our tests, is 100 by 100. MATRIX uses a subroutine in multiplying two scalar elements of the two matrices. The subroutine does not contain a loop or accesses to a static variable, making that subroutine a target of log optimization (see Section 5.3). SH.PATH. 1 computes the shortest paths from a city to 99 other cities using an algorithm described by Horowitz and Sahni [19]. SH–PATH-2 is the same as SH_PATH - 1 except that it computes the shortest paths from all of the 100 cities to all of the other cities. CLASS is a program that emulates course such as registering for courses and dropping from run as an interactive program. 7.1

Execution

The

goal

unduly

registration for courses. CLASS

students, also can

Time

of the

burdening

PPD the

design other

is to minimize phases

execution-time

of program

execution.

overhead Table

without

I shows

ACM Transactions on Programming Languages and Systems,Vol. 13 No 4, October 1991.

the

Debugging Parallel Programs with Flowback Analysis Table I.

Test Program

Execution

Time Measurements PPD

Sequent

without

compiler SORT CPU

Elapsed

MATRIX CPU

Elapsed

525

(time in seconds)

compiler

PPD compiler

log optimization

(overhead

.

with

in%)

log optimization

(overhead

in%)

5.5

5.7

(3.6%)

5.7

(3.6%)

5.6

6.1

(8.9%)

6.1

(8.9%)

12.7

52.5

(313.4%)

13.4

(5.5%)

12.7

54.7

(330.7%)

13.7

(7.9%)

SH.PATH.1

1.1

1.8

(63.6%)

1.8

(63.6%)

CPU

1.3

2.2

(69.2%)

2.2

(69.2%)

Elapsed

sH_PATH_2

107.0

105.5

CPU

107.0

107.3

(2.8%)

0.3

0.4

(33.3%)

0.4

(33.3%)

0.4

0.7

(75.0%)

0.7

(75.0%)

Elapsed

CLASS CPU

Elapsed

105.5

(-2.4%)

(–2,4%)

107.3

(2.8%)

execution-time overhead of the tested programs. Execution-time overhead ranges from O to 330 percent for object code that is not log-optimized and from O to 75 percent for object code that is log-optimized. MATRIX has the largest performance improvement from log head of MATRIX is reduced from subroutine

that

is called

subroutine. Without prelog-postlog pair, one million loop

becomes

one million

(100

The execution-time percent. MATRIX

by 100 by 100) times

log optimization, each call to this resulting in a large execution-time

prelog–postlog

or accesses

optimization. 330.7 to 7.9

pairs).

to static

a non-eblock

However,

variables;

subroutine,

with

this log

subroutine

Log optimization

might

non-eblock

in the program; additional invoked.

parent

produce

is never

becomes

the parent

e-blocks

a higher

invoked

of these

does not have this

and the caller

actually

subroutine

by another generates a (due to the

optimization,

The non-eblock subroutine does not generate log smaller execution time, Accordingly, log optimization have a large reduction in the size of execution-time if the

subroutine overhead

a

subroutine e-block.

entries, yielding a much also causes MATRIX to traces. execution-time

due to conditional

non-eblock

overhas a

subroutines

overhead statements

may generate

log information for the non-eblock subroutines that are never However, we expect that such cases of losing by log optimization

should be rare. We also see that the beginning

copying

the contents

of an entire

or at the end of a loop is inexpensive

overhead if most is the case with

of the array elements program SH–PATH–2.

are actually However,

array in terms

(for a log entry)

at

of execution-time

accessed in the loop. Such if only a fraction of the

array elements are accessed in a loop, dumping out an entire array can be expensive, as seen in test program SH–PATH– 1. One possible way to reduce this overhead is to generate a smaller log entry containing only the particular row (or other part) of the matrix that is actually accessed by employing techniques for succinctly summarizing data accesses in arrays [6]. ACM

Transactions

on Programming

Languages

and Systems,

Vol.

13 No

4, October

1991.

526

J.-D. Choi et al.

.

Array

logging

also

can

cause

interesting

some

Notice that test program SH.PATH-2 time (the sum of user and system time) compiler. loop that This

anomalies.

The PPD compiler generates logging code immediately before the accesses a large array; the logging code accesses the entire array.

extra

tecture

performance

shows a slight improvement in CPU with the code generated by the PPD

access seems to affect

level)

of the

the paging

program,

currently investigating Program CLASS can

behavior

resulting

less

in

this anomaly. also run as an interactive

@ossibly

at the

execution

time.

program,

archi-

We

Whereas

are

there

is

a 33 percent

increase in CPU time and a 75 percent increase in elapsed time file, there was no noticeable difference in CLASS ran using an input

when

the response 7.2

times

Execution-Time

when

CLASS

Trace

Size

ran

interactively.

Table II shows the sizes of execution-time traces (log) generated by the test programs. As described before, pro~am MATRIX has a substantial decrease in trace size from log optimization. Program CLASS has a slight increase in trace

from

size

7.3

Trade-off

As

described

execution

log optimization Between

of the reason

Run Time and Debug

Section

in

because

3, there

and response

is

during

time

previously.

Time

trade-off

a

described

between

debugging.

efficiency

If we construct

during e-block

an

in

favor of the execution phase, debugging phase performance will suffer. On an e-block in favor of the debugging phase, the other hand, if we construct execution phase performance will suffer. Table III shows the reexecution times and debug-time trace sizes of various e-blocks of the tested programs. nested loop that sorts the list

The e-block from SORT consists of a singly of numbers once. The e-block of MATRIX is

made of a triply nested loop. By constructing a single e-block out of the triply phase overnested loop of MATRIX, we were able to reduce the execution head, but with a large debug-time overhead: 166 seconds in reexecution time and

about

58

Mbytes

time

of MATRIX

0.12

Mbytes,

is out

of the

execute nated These

than

generate tion

time

decision tion

of a singly

one (such

as to

of an e-block

an

that

out

one

of

full.

it

made another

be made

shortest

might

that

sometimes loop.

out

of

One for

dynamically

from

about the

100 be

100

of

with

a nested

loop).

In

pair

during

SH_

of trace.

to

construct

might

e-block

to

termi-

e-block

better

cities

5 seconds

Mbytes

alternative

at execution

paths

PATH–2

an

prelog-postlog

is III

is constructed

of SH–

than

size

Table

shortest

paths

time,

more

pair

the

- 1 took

e-block

At

with

a nested

e-block

on Programming

the the

prelog-postlog

generate

could

was

computes

1 in

of SH-PATH-2

of SH-PATH

7 minutes

suggest

than

whether

ACM Transactions

system than

that

execution

trace

of SH-PATH.

e-block

while

the

execution-time

e-block

computes

e-block

a comparison,

and

loop the

of trace,

e-block

more

that

The

file

more

results

while

loop

Mbytes

the

nested

cities,

cities.

1.3

lasted two

more

The

nested

because

PATH-2

optimization.

other

with

For

log

to 99 other

of a doubly

to all

trace. 13 seconds,

out

a city

debug-time

is about

with

constructed

from

of

itself

long this

be

case, the

time.

Languages and Systems, Vol. 13 No. 4, October 1991.

to

executhe

execu-

Debugging Parallel Programs with Flowback Analysis Table

II.

Execution-Time

Trace

Size Measurements

With log optimization

log optimization SORT

18,209

18,209

40,120,221

SH.PATH.

1

120,217

825,517

825,517

sH_PATH.2

417,129

417,129

CLASS

104,508

104,892

Table

III.

Reexecution

Times

527

(sizes in bytes)

Without

MATRIX

.

and Trace

Sizes (time

in seconds)

Debug-time Reexecution

Original

e-block 1 (SORT) e-block 2 (MATRIX) e-block 3 (SH_PATH_ 1) e-block 4 (SH_PATH_2)

7.4

Summary

In this

Elapsed

0.1 160.5 3.8 >364.8

1.7 165.5 4.8 >422.8

section of PPD.

vary

significantly

increases

in

techniques more

we

execution

for

and

the

The

test

these have

used

in

programs

size.

will

the

approach

for

the

debugging

of the the

in

time

However,

larger

access

overhead

complex

various

execution

that

this

accesses

in

that

balance during

the

up

small

show

is

only to

part

employ

arrays

[6].

With

a

objects,

we

expect

a

we

think

proportionally measurements

feasibility parallel

(less

than

1 Mbyte

we

need

more

between

the

trace

in

all

experiments size

during

debugging.

performance

However,

scale

programs

such

generally

better time

performance

demonstrated

data

in

overhead. are

a

response

programs small

general,

1.24 >117,79

programs.

reduce for

measurements

achieve

the

test

test

to

analysis

sizes

the

from

way

execution-time

to

and

come

increases

the

summarizing

trace

research

general,

0.37 57.76

measurements

that

among

possible

dependence

However,

execution

show

time

One

succinctly

Execution-time cases).

performance

(O to 86 percent)

loops.

in

provided

measurements

sophisticated

reduction

In

have

The

in the

arrays

0.1 10.5

size

(Mbytes)

of Measurements

parts

of