[Joy79] or calls to monitoring routines [Unix]. The. ¢ounter increment overhead is low, and is suitable for profiling statements. A call of the monitoring routine has ...
£prof: a Call G r a p h E x e c u t i o n P r o f i l e r i by S u s a n L. Graham P e t e r B. K e s s l e r Marshall K McKusiclc C o m p u t e r S c i e n c e Division Electrical Engineering and Computer Science Department U n i v e r s i t y of California, B e r k e l e y B e r k e l e y , C a l i f o r n i a 94720
Abstract
Large c o m p l e x p r o g r a m s are c o m p o s e d of m a n y s m a l l r o u t i n e s t h a t i m p l e m e n t a b s t r a c t i o n s for the
r o u t i n e s t h a t call t h e m . To be useful, a n e x e c u t i o n profiler m u s t a t t r i b u t e e x e c u t i o n t i m e in a way t h a t is significant for the logical s t r u c t u r e of a p r o g r a m as well as for its t e x t u a l d e c o m p o s i t i o n . This d a t a m u s t t h e n be displayed to the u s e r in a c o n v e n i e n t a n d i n f o r m a t i v e way. The g p r o f profiler a c c o u n t s •for the r u n n i n g t i m e of called r o u t i n e s in the r u n ning t i m e of t h e r o u t i n e s t h a t call t h e m . The d e s i g n a n d use of this profiler is d e s c r i b e d . 1. P r o g r a m s
to be Profiled
Software research environments normally i n c l u d e m a n y large p r o g r a m s b o t h for p r o d u c t i o n use a n d for e x p e r i m e n t a l i n v e s t i g a t i o n . These prog r a m s are typically m o d u l a r , in a c c o r d a n c e with g e n e r a l l y a c c e p t e d p r i n c i p l e s of good p r o g r a m design. Often t h e y c o n s i s t of n u m e r o u s s m a l l rout i n e s t h a t i m p l e m e n t v a r i o u s a b s t r a c t i o n s . Somet i m e s s u c h large p r o g r a m s are w r i t t e n by one prog r a m m e r who has u n d e r s t o o d the r e q u i r e m e n t s for t h e s e a b s t r a c t i o n s , a n d has p r o g r a m m e d t h e m a p p r o p r i a t e l y . More f r e q u e n t l y t h e p r o g r a m has had m u l t i p l e a u t h o r s a n d has evolved over t i m e , c h a n g i n g the d e m a n d s p l a c e d on t h e i m p l e m e n t a t i o n of t h e a b s t r a c t i o n s w i t h o u t c h a n g i n g t h e i m p l e m e n t a t i o n itself. Finally, t h e p r o g r a m m a y be a s s e m b l e d f r o m a l i b r a r y of a b s t r a c t i o n i m p l e m e n t a t i o n s u n e x a m i n e d by t h e p r o g r a m m e r . Once a large p r o g r a m is e x e c u t a b l e , it is o f t e n d e s i r a b l e to i n c r e a s e its speed, especially if s m a l l p o r t i o n s of t h e p r o g r a m are found to d o m i n a t e its
e x e c u t i o n t i m e . The p u r p o s e of t h e gprof profiling tool is to help the u s e r e v a l u a t e a l t e r n a t i v e i m p l e m e n t a t i o n s of a b s t r a c t i o n s . We d e v e l o p e d this tool in r e s p o n s e to our efforts to i m p r o v e a code g e n e r a t o r we were writing [Graham82]. The gprof d e s i g n t a k e s a d v a n t a g e of the f a c t t h a t the p r o g r a m s to be m e a s u r e d are large, s t r u c t u r e d a n d h i e r a r c h i c a l . We provide a profile in which t h e e x e c u t i o n t i m e for a set of r o u t i n e s t h a t i m p l e m e n t an a b s t r a c t i o n is c o l l e c t e d a n d c h a r g e d to t h a t a b s t r a c t i o n . The profile c a n be used to comp a r e and assess the costs of v a r i o u s i m p l e m e n t a tions. The profiler c a n be l i n k e d into a p r o g r a m w i t h o u t s p e c i a l p l a n n i n g by the p r o g r a m m e r . The o v e r h e a d for using gprof is low; b o t h in t e r m s of a d d e d e x e c u t i o n t i m e and in the v o l u m e of profiling information recorded. 2. Types of P r o f i l i n g There are s e v e r a l d i f f e r e n t uses for p r o g r a m profiles, a n d e a c h m a y r e q u i r e different i n f o r m a t i o n f r o m the profiles, or d i f f e r e n t p r e s e n t a t i o n of the i n f o r m a t i o n . We d i s t i n g u i s h two b r o a d c a t e g o r i e s of profiles: those t h a t p r e s e n t c o u n t s of s t a t e m e n t or r o u t i n e i n v o c a t i o n s , a n d those t h a t display t i m i n g i n f o r m a t i o n a b o u t s t a t e m e n t s or r o u t i n e s . Counts are typically p r e s e n t e d in t a b u l a r form, o f t e n i n p a r a l l e l with a listing of t h e s o u r c e code. Timing i n f o r m a t i o n could be s i m i l a r l y p r e s e n t e d ; b u t m o r e t h a n one m e a s u r e of t i m e m i g h t be a s s o c i a t e d with e a c h s t a t e m e n t or r o u t i n e . For e x a m p l e , in the f r a m e w o r k u s e d by gprof e a c h profiled s e g m e n t would display two t i m e s : one for the t i m e u s e d by t h e s e g m e n t itself, a n d a n o t h e r for the t i m e i n h e r ited f r o m code s e g m e n t s it invokes. E x e c u t i o n c o u n t s are u s e d in m a n y d i f f e r e n t c o n t e x t s . The e x a c t n u m b e r of t i m e s a r o u t i n e or s t a t e m e n t is a c t i v a t e d c a n be u s e d to d e t e r m i n e if an a l g o r i t h m is p e r f o r m i n g as e x p e c t e d . Cursory i n s p e c t i o n of s u c h c o u n t e r s m a y show a l g o r i t h m s whose c o m p l e x i t y is u n s u i t e d to the t a s k a t h a n d . Careful i n t e r p r e t a t i o n of c o u n t e r s c a n often s u g g e s t i m p r o v e m e n t s to a c c e p t a b l e a l g o r i t h m s . P r e c i s e e x a m i n a t i o n c a n u n c o v e r s u b t l e e r r o r s in a n
ZThil work was supported by grant MCS80-05144 from the National S c i e n ~ Foundation.
Permissionto copy without fee all or part of this material is granted provided that the copies are not made or distributedfor direct commercial advantage, the ACM copyright notice and the title of the publicationand its date appear, and notice is giventhat copying is by permissionof the Associationfor ComputingMachinery.To copy otherwise, or to republish,rcquircs a fcc and/or specific permission.
©
1982 A C M 0 - 8 9 7 9 1 - 0 7 4 - 5 / 8 2 / 0 0 6 / 0 1 2 0 $ 0 0 . 7 5
120
algorithm. At this level, profiling counters are similar to debugging statements whose purpose is to show the n u m b e r of times a piece of code is executed. Another view of such counters is as boolean values. One m a y be interested that a portion of code has executed at all, for exhaustive testing, or to check that one implementation of an abstraction completely replaces a previous one.
the monitorin~ routine solution has certain advantages. Whatever counters are needed by the monitoring routine can be m a n a g e d by the monitoring routine itself, rather than being distributed around the code. In particular, a monitoring routine can easily be called from separately compiled programs. In addition, different monitoring routines can be linked into the program being measured to assemble different profiling data without having to change the compiler or recompile the program. We have exploited this approach; our compilers for C, Fortran77, and Pascal can insert calls to a monitoring routine in the prologue for each routine. Use of the monitoring routine requires no planning on part of a p r o g r a m m e r other than to request that augmented routine prologues be produced during compilation.
Execution counts are not necessarily proportional to the a m o u n t of time required to execute the routine or statement. Further, the execution time of a routine will not be the same for all calls on the routine. The criteria for establishing execution time must be decided. If a routine implements an abstraction by invoking other abstractions, the time spent in the routine will not accurately reflect the time required by the abstraction it implements. Similarly, if an abstraction is implemented by several routines the time required by the abstraction will be distributed across those routines.
We are interested in gathering three pieces of information during program execution: call counts and execution times for each profiled routine, and the arcs of the dynamic call graph traversed by this execution of the program. By post-processing of this data we can build the dynamic call graph for this execution of the program and propagate times along the edges of this graph to attribute times for routines to the routines that invoke them.
Given the execution time of individual routines,
gprof accounts to each routine the time spent for it by the routines it invokes. This accounting is done by a s s e m b l i n g a call graph with n o d e s t h a t are the r o u t i n e s of the p r o g r a m a n d d i r e c t e d arcs t h a t r e p r e s e n t calls f r o m call sites to r o u t i n e s . We dist i n g u i s h a m o n g t h r e e different call g r a p h s f o r a prog r a m . The complete call graph i n c o r p o r a t e s all rout i n e s a n d all p o t e n t i a l arcs, i n c l u d i n g arcs t h a t r e p r e s e n t calls to f u n c t i o n a l p a r a m e t e r s or f u n c t i o n a l variables. This g r a p h c o n t a i n s the o t h e r two g r a p h s as s u b g r a p h s . The static call graph i n c l u d e s all r o u t i n e s a n d all possible a r c s t h a t are n o t calls to f u n c t i o n a l p a r a m e t e r s or variables. The dynamic call graph i n c l u d e s only t h o s e r o u t i n e s a n d arcs t r a v e r s e d by the profiled e x e c u t i o n of the p r o g r a m . This g r a p h n e e d n o t i n c l u d e all r o u t i n e s , n o r n e e d it i n c l u d e all p o t e n t i a l arcs b e t w e e n the r o u t i n e s it covers. It may, however, i n c l u d e arcs to f u n c t i o n a l p a r a m e t e r s or v a r i a b l e s t h a t the s t a t i c call g r a p h m a y omit. The static call g r a p h c a n be d e t e r m i n e d f r o m the (static) p r o g r a m text. The d y n a m i c call g r a p h is d e t e r m i n e d only by profiling an e x e c u t i o n of the p r o g r a m . The c o m p l e t e call g r a p h for a m o n o l i t h i c p r o g r a m could be d e t e r m i n e d by d a t a flow analysis t e c h n i q u e s . The c o m p l e t e call g r a p h for p r o g r a m s t h a t c h a n g e d u r i n g e x e c u t i o n , by modifying t h e m s e l v e s or d y n a m i c a l l y loading or overlaying code, m a y n e v e r be d e t e r m i n a b l e . Both t h e s t a t i c call g r a p h and t h e d y n a m i c call g r a p h are used by gprof, but it does not search for the complete call graph.
Gathering of the profiling information .should not greatly interfere with the running of the program. Thus, the monitoring routine must not produce trace output each time it is invoked. The volume of data thus produced would be unmanageably large, and the time required to record it would overwhelm the running time of most programs. Similarly, the monitoring routine can not do the analysis of the profiling data (e.g. assembling the call graph, propagating times around it, discovering cycles, etc.) during program execution. Our solution is to gather profiling data in m e m o r y during program execution and to condense it to a file as the profiled program exits. This file is then processed by a separate program to produce the listing of the profile data. An advantage of this approach is that the profile data for several executions of a prog r a m can be combined by the post-processing to provide a profile of m a n y executions. The execution time monitoring consists of three parts. The first part allocates and initializes the runtime monitoring data structures before the prog r a m begins execution. The second part is the monitorlng routine invoked from the prologue of each profiled routine. The third part condenses the data structures and writes t h e m to a file as the program terminates. The monitoring routine is discussed in detail in the following sections.
3. Gatherin/~ Profile Data
3. I. K m e c u U o n Counts The gprof monitoring routine counts the n u m b e r of times each profiled routine is called. The monitoring routine also records the arc in the call graph that activated the profiled routine. The count ~ ~ssooi~ted with t h e arc i n t h e call g r a p h r a t h e r t h a n with t h e r o u t i n e . Call c o u n t s for r o u t i n e s c a n t h e n be d e t e r m i n e d by s u m m i n g the c o u n t s on arcs d i r e c t e d i n t o t h a t r o u t i n e . In a m a c h i n e - d e p e n d e n t
Routine calls or statement executions can be measured bY having a compiler augment the code at strategic points. The additions can be inline increments to counters [KnuthTl] [Satterthwaite72] [Joy79] or calls to monitoring routines [Unix]. The ¢ounter increment overhead is low, and is suitable for profiling statements. A call of the monitoring routine has an overhead comparable With a call of a regular routine, and is therefore only suited to profiling on a routine by routine basis. However,
121
the e x e c u t i o n t i m e of a r o u t i n e by m e a s u r i n g the e l a p s e d t i m e f r o m r o u t i n e e n t r y to r o u t i n e exit. U n f o r t u n a t e l y , t i m e m e a s u r e m e n t is c o m p l i c a t e d on t i m e - s h a r i n g s y s t e m s by the t i m e - s l i c i n g of the p r o g r a m . A s e c o n d m e t h o d s a m p l e s t h e value of t h e p r o g r a m c o u n t e r at some i n t e r v a l , and i n f e r s e x e c u t i o n t i m e f r o m t h e d i s t r i b u t i o n of the s a m p l e s within the p r o g r a m . This t e c h n i q u e is p a r t i c u l a r l y s u i t e d to t i m e - s h a r i n g s y s t e m s , where the t i m e slicing c a n serve as t h e basis for s a m p l i n g the program counter. Notice t h a t , w h e r e a s the first m e t h o d could provide e x a c t t i m i n g s , the s e c o n d is inherently a statistical approximation. The s a m p l i n g m e t h o d n e e d n o t r e q u i r e s u p p o r t f r o m t h e o p e r a t i n g s y s t e m : all t h a t is n e e d e d is the ability to set a n d r e s p o n d to " a l a r m c l o c k " i n t e r r u p t s t h a t r u n r e l a t i v e to p r o g r a m t i m e . It is i m p e r a t i v e t h a t the i n t e r v a l s be u n i f o r m since the sahapling of t h e p r o g r a m c o u n t e r r a t h e r t h a n the d u r a t i o n of the i n t e r v a l is the b a s i s of the d i s t r i b u tion. If s a m p l i n g is done too often, t h e i n t e r r u p tions to s a m p l e the p r o g r a m c o u n t e r will overwhelm t h e r u n n i n g of the profiled p r o g r a m . On the o t h e r h a n d , t h e p r o g r a m m u s t r u n for e n o u g h s a m p l e d i n t e r v a l s t h a t the d i s t r i b u t i o n of t h e s a m p l e s a c c u r a t e l y r e p r e s e n t s the d i s t r i b u t i o n of t i m e for the e x e c u t i o n of the p r o g r a m . As with r o u t i n e call t r a c ing, the m o n i t o r i n g r o u t i n e c a n n o t afford to o u t p u t i n f o r m a t i o n for e a c h p r o g r a m c o u n t e r s a m p l e . In our c o m p u t i n g e n v i r o n m e n t , the o p e r a t i n g s y s t e m c a n provide a h i s t o g r a m of t h e l o c a t i o n of the prog r a m c o u n t e r at the e n d of e a c h clock tick ( 1 / 6 0 t h of a s e c o n d ) in which a p r o g r a m r u n s . The histog r a m is a s s e m b l e d in m e m o r y as t h e p r o g r a m r u n s . This facility is e n a b l e d by our m o n i t o r i n g r o u t i n e . We have a d j u s t e d the g r a n u l a r i t y of the h i s t o g r a m so t h a t p r o g r a m c o u n t e r values m a p o n e - t o - o n e onto the h i s t o g r a m . We m a k e t h e simplifying a s s u m p t i o n t h a t all calls to a specific r o u t i n e r e q u i r e the s a m e a m o u n t of t i m e to e x e c u t e . This a s s u m p t i o n m a y disguise t h a t some calls (or worse, some call sites) always invoke a r o u t i n e s u c h t h a t its e x e c u t i o n is f a s t e r (or slower) t h a n t h e a v e r a g e t i m e for t h a t r o u t i n e .
fashion, the monitoring routine notes its o w n return address. This address is in the prologue of s o m e profiled routine that is the destination of an arc in the dynamic call graph. The monitoring routine also discovers the return address for that routine, thus identifying the call site, or source of the arc. The source of the arc is in the caller, and the destination is in the ca/lee. For example, if a routine A calls a routine B, A is the caller, and B is the callee. The prologue of B will include a call to the monitoring routine that will note the arc from A to B and either initialize or increment a counter for that arc. One can not afford to have the monitoring routine output tracing information as each arc is identified. Therefore, the monitoring routine maintains a table of all the arcs discovered, with counts of the n u m b e r s of times each is traversed during execution. This table is accessed once per routine call. Access to it m u s t be as fast as possible so as not to overwhelm the time required to execute the program. Our solution is to access the table through a hash table. W e use the call site as the primary key with the callee address being the secondary key. Since each call site typically calls only one callee, we can reduce (usually to one) the n u m b e r of minor lookups based on the callee. Another alternative would use the callee as the primary key and the call site as the secondary key. Such an organization has the advantage of associating callers with callees, at the expense of longer lookups in the monitoring routine. We are fortunate to be running in a virtual m e m o r y environment, and (for the sake of speed) were able to allocate enough space for the primary hash table to allow a one-to-one m a p p i n g from call site addresses to the primary hash table. Thus our hash function is trivial to calculate and collisions occur only for call sites that call multiple destinations (e.g. functional parameters and functional variables). A one level hash function using both call site and callee would result in an unreasonably large hash table. Further, the n u m b e r of dynamic call sites and callees is not k n o w n during execution of the profiled program. Not all callers and callees can be identified by the monitoring routine. Routines that were c o m piled without the profiling augmentations will not call the monitoring routine as part of their prologue, and thus no arcs will be recorded whose destinations are in these routines. One need not profile all the routines in a program. Routines that are not profiled run at full speed. Certain routines, notably exception handlers, are invoked by non-standard calling sequences. Thus the monitoring routine m a y k n o w the destination of an arc (the callee), but find it difficult or impossible to determine the source of the arc (the caller). Often in these cases the apparent source of the arc is not a call site at all. Such anomalous invocations are declared "spontaneous".
When the profiled p r o g r a m t e r m i n a t e s , the arc t a b l e and t h e h i s t o g r a m of p r o g r a m c o u n t e r s a m ples are w r i t t e n to a file. The arc t a b l e is c o n d e n s e d to c o n s i s t of t h e s o u r c e a n d d e s t i n a t i o n a d d r e s s e s of the arc a n d the c o u n t of the n u m b e r of t i m e s the arc was t r a v e r s e d by this e x e c u t i o n of the p r o g r a m . The r e c o r d e d h i s t o g r a m c o n s i s t s of c o u n t e r s of the n u m b e r of t i m e s the p r o g r a m c o u n t e r was f o u n d to be in e a c h of the r a n g e s c o v e r e d b y t h e h i s t o g r a m . The r a n g e s t h e m s e l v e s are s u m m a r i z e d as a lower a n d u p p e r b o u n d and a step size. 4. Post Processing
Having gathered the arcs of the call graph and timing information for an execution of the program, we are interested in attributing the time for each routine to the routines that call it. W e build a dynamic call graph with arcs from caller to callee, and propagate time from descendants to ancestors by topologically sorting the call graph. Time
3.2. E x e c u t i o n T i m e s The e x e c u t i o n t i m e s for r o u t i n e s can be gathe r e d in at least two ways. One m e t h o d measures
122
equation for time propagation does not work within strongly connected components. Time is not propagated from one m e m b e r of a cycle to another, since, by definition, this involves propagating time from a routine to itself. In addition, children of one m e m b e r of a cycle m u s t be considered children of all m e m b e r s of the cycle. Similarly, parents of one m e m b e r of the cycle m u s t inherit all m e m b e r s of the cycle as descendants. It is for these reasons that we collapse connected components. Our solution collects all m e m b e r s of a cycle together, summing the time and call counts for all m e m b e r s . All calls into the cycle are m a d e to share the total time of the cycle, and all descendants of the cycle propagate time into the cycle as a whole. Calls a m o n g the m e m b e r s of the cycle do not propagate any time, though they are listed in the call graph profile.
p r o p a g a t i o n is p e r f o r m e d f r o m the leaves of the call g r a p h toward the roots, a c c o r d i n g to the o r d e r assigned by a topological n u m b e r i n g algorithm. The topological n u m b e r i n g e n s u r e s t h a t all edges in the g r a p h go f r o m higher n u m b e r e d n o d e s to lower n u m b e r e d nodes. An e x a m p l e is given in Figure 1. If we p r o p a g a t e t i m e f r o m nodes in the o r d e r assigned by the algorithm, e x e c u t i o n t i m e c a n be p r o p a g a t e d f r o m d e s c e n d a n t s to a n c e s t o r s after a single t r a v e r s a l of each arc in the call graph. Each p a r e n t r e c e i v e s s o m e f r a c t i o n of a child's t i m e . Thus t i m e is c h a r g e d to the caller in a d d i t i o n to being c h a r g e d to the callee. Let C, be the n u m b e r of calls to some r o u t i n e , e, a n d C~, be the n u m b e r of calls from a caller r to a callee e. Since we are a s s u m i n g e a c h call to a routine t a k e s the average a m o u n t of t i m e for all calls to t h a t r o u t i n e , the caller is a c c o u n t a b l e for C~e/C, of the t i m e s p e n t by the callee. Let the S e be the self time of a r o u t i n e , e. The selftime of a r o u t i n e c a n be d e t e r m i n e d from the t i m i n g i n f o r m a t i o n g a t h e r e d d u r i n g profiled p r o g r a m e x e c u t i o n . The t o t a l time, Tr, we wish to a c c o u n t to a r o u t i n e r , is t h e n given by the r e c u r r e n c e equation:
TT=S,+
Z
Figure 2 shows a modified version of the call graph of Figure i, in which the nodes labelled 3 and 7 in Figure 1 are mutually recursive. The topologically sorted graph after the cycle is collapsed is given in Figure 3. Since the technique described above only collects the dynamic call graph, and the p r o g r a m typically does not call every routine on each execution, different executions can introduce different cycles in the dynamic call graph. Since cycles often have a significant effect on time propagation, it is desirable to incorporate the static call graph so that cycles will have the s a m e m e m b e r s regardless of h o w the p r o g r a m runs.
r,x
r CALLS e Ci where r CALLS e is a relation showing all routines e called by a routine r. This relation is easily available from the call graph. However, if the execution contains recursive calls, the call graph has cycles that cannot be topologically sorted. In these cases, we discover strongly-connected components in the call graph, treat each such c o m p o n e n t as a single node, and then sort the resulting graph. W e use a variation of Tarjan's strongly-connected components M g o r i t h m that discovers strongly-connected components as it is assigning topological order n u m b e r s [Tarjan72]. Time propagation within strongly connected components is a problem. For example, a selfrecursive routine (a trivial cycle in the call graph) is accountable for all the time it uses in all its recursive instantiations. In our scheme, this time should be shared a m o n g it: call graph parents. The arcs from a routine tv itself are of interest, but do not participate iz~ time propagation. Thus the simple
Cycle to be collapsed. F i g u r e 2.
Topological numbering after cycle collapsing. Figure 3.
Topological ordering Figure I.
123
searching through the output.
The static call graph can be constructed from the source text of the program. However, discover-
The m a j o r e n t r i e s of t h e call g r a p h profile a r e t h e e n t r i e s f r o m t h e fiat profile, a u g m e n t e d b y the t i m e p r o p a g a t e d to e a c h r o u t i n e f r o m its d e s c e n d a n t s . This profile is s o r t e d b y t h e s u m of t h e t i m e for t h e r o u t i n e itself p l u s t h e t i m e i n h e r i t e d f r o m its d e s c e n d a n t s . The profile shows which of t h e h i g h e r level r o u t i n e s s p e n d l a r g e p o r t i o n s of t h e t o t a l e x e c u t i o n t i m e in t h e r o u t i n e s t h a t t h e y call. F o r e a c h r o u t i n e , we show t h e a m o u n t of t i m e p a s s e d by e a c h child t o t h e r o u t i n e , which i n c l u d e s t i m e for t h e child itself a n d for t h e d e s c e n d a n t s of t h e child ( a n d t h u s t h e d e s c e n d a n t s of t h e r o u t i n e ) . We also show t h e p e r c e n t a g e t h e s e t i m e s r e p r e s e n t of t h e t o t a l t i m e a c c o u n t e d to t h e child. S i m i l a r l y , t h e p a r e n t s of e a c h r o u t i n e a r e l i s t e d , along with t i m e , a n d p e r c e n t a g e of t o t a l r o u t i n e t i m e , p r o p a g a t e d t o e a c h one.
ing the static call graph from the source text would require two moderately difficult steps: finding the source text for the program (which m a y not be available), and scanning and parsing that text, which m a y be in any one of several languages. In our programming system, the static calling information is also contained in the executable version of the program, which we already have available, and which is in language-independent form. One can examine the instructions in the object program, looking for calls to routines, and note which routines can be called. This technique allows us to add arcs to those already in the dynamic call graph. If a statically discovered arc already exists in the dynamic call graph, no action is required. Statically discovered arcs that do not exist in the dynamic call graph are added to the graph with a traversal count of zero. Thus they are never responsible for any time propagation. However, they m a y affect the structure of the graph. Since they m a y complete strongly connected components, the static call graph construction is done before topological ordering.
Cycles a r e h a n d l e d as single e n t i t i e s . The c y c l e as a whole is shown as t h o u g h it w e r e a single r o u tine, e x c e p t t h a t m e m b e r s of t h e c y c l e a r e l i s t e d in p l a c e of t h e c h i l d r e n . A l t h o u g h t h e n u m b e r of calls of e a c h m e m b e r f r o m within t h e c y c l e a r e shown, t h e y do n o t a f f e c t t i m e p r o p a g a t i o n . When a child is a m e m b e r of a c y c l e , t h e t i m e shown is t h e a p p r o p r i a t e f r a c t i o n of t h e t i m e for t h e whole cycle. S e l f - r e c u r s i v e r o u t i n e s h a v e t h e i r calls b r o k e n down into calls f r o m t h e o u t s i d e a n d s e l f - r e c u r s i v e calls. Only t h e o u t s i d e c a l l s a f f e c t t h e p r o p a g a t i o n of time. The following e x a m p l e is a t y p i c a l f r a g m e n t of a call g r a p h .
5. Data Presentation The data is presented to the user in two different formats. The first presentation simply lists the routines without regard to the a m o u n t of time their descendants use. The second presentation incorporates the call graph of the program.
5.1. The Flat Profile The fiat profile c o n s i s t s of a list of all t h e r o u t i n e s t h a t a r e c a l l e d d u r i n g e x e c u t i o n of t h e p r o g r a m , with t h e c o u n t of t h e n u m b e r of t i m e s t h e y a r e c a l l e d a n d t h e n u m b e r of s e c o n d s of e x e c u t i o n t i m e for w h i c h t h e y a r e t h e m s e l v e s a c c o u n t a b l e . The r o u t i n e s a r e l i s t e d in d e c r e a s i n g o r d e r of e x e c u t i o n t i m e . A list of t h e r o u t i n e s t h a t a r e n e v e r c a l l e d d u r i n g e x e c u t i o n of t h e p r o g r a m is also a v a i l a b l e t o verify t h a t n o t h i n g i m p o r t a n t is o m i t t e d b y t h i s e x e c u t i o n . The fiat profile gives a quick overview of t h e r o u t i n e s t h a t a r e u s e d , and shows t h e r o u t i n e s t h a t a r e t h e m s e l v e s r e s p o n s i b l e for l a r g e f r a c t i o n s of t h e e x e c u t i o n t i m e . In p r a c t i c e , t h i s profile u s u a l l y shows t h a t no single f u n c t i o n is o v e r w h e l m i n g l y r e s p o n s i b l e for t h e t o t a l t i m e 'of t h e p r o g r a m . Notice t h a t for t h i s profile, t h e i n d i v i d u a l t i m e s s u m to t h e t o t a l e x e c u t i o n t i m e .
The en'try in t h e call g r a p h profile listing for t h i s e x a m p l e is shown in F i g u r e 4. The e n t r y is for r o u t i n e EXAMPLE, which has t h e Caller r o u t i n e s as i t s p a r e n t s , a n d t h e Sub r o u t i n e s as its c h i l d r e n . The r e a d e r s h o u l d k e e p in m i n d t h a t all i n f o r m a t i o n is g i v e n w i t h r e s p e c t to EXAMPLE. The i n d e x in t h e first c o l u m n shows t h a t EXAMPLE is t h e s e c o n d e n t r y in t h e profile listing. The EXAMPLE r o u t i n e is Called t e n t i m e s , f o u r t i m e s b y CALLER1, a n d six t i m e s b y CALLER2. C o n s e q u e n t l y 4 0 ~ of EXAmPLE's t i m e is p r o p a g a t e d t o CALLER1, a n d 60~ of EXAMPLE'S t i m e is p r d p a g a t e d %o CALLER2. The self 'and d e s c e n d a n t fields o'f t h e p a r e n t s show the a m o u n t o'f self a n d d e s c e n d a n t t i m e EXAMPLE p r o p a g a t e s to ' t h e m '(but n o t t h e 'time u s e d by t h e p a r e n t s d i r e c t l y ) . Note t h a t EXAMPLE calls i~tself r e c u i ' s i v e l y four t i m e s . The r o u t i n e EXAMPLE calls r o u t i n e SUB1 t w e n t y t i m e s , SUB2 once, a n d n e v e r calls SUB3. S i n c e sUB2 ~s c a l l e d a 'total of five t i m e s , 20~ of its self a n d d e s c e n d a n t 'time is p r o p a g a t e d to EXAMPLE's d e s c e n d a n t t i m e field. B e c a u s e SUB1 is a
5.'b-. The Call Graph Profile Ideally, we would like to p r i n t t h e call g r a p h o f b u t we a r e l i m i t e d b y t h e twod i m e n s i o n a l n a t u r e of o u r o u t p u t d e v i c e s . We c a n n o t a s s u m e t h a t a call g r a p h is p l a n a r , a n d even if i t is, t h a t we c a n p r i n t a p l a n a r v e r s i o n - o f it. I n s t e a d , we c h o o s e to l i s t e a c h r o u t i n e , t o g e t h e r With infor'mation about the routines that are its direct p a r e n t s a n d c h i l d r e n . This listing p r e s e n t s a window into t h e c a l l g r a p h . B a s e d o n Our e x p e r i e n c e , b o t h p a r e n t i n f o r m a t i o n and c h i l d i n i o r m a t i 0 n is important, and should be a v a i l a b l e w i t h o u t
the p r o g r a m ,
124
index
[2]
~time
41.5
self
descendants
0.20 0.30
1.20 I.B0
0.50
3.00
1.50 o.oo 0.00
1.00 o.so 0.00
called/total parents called+self name eaUed/toted children 4/10 CALLER1 0/I0 CALLER2 104-4 E'XAMPLE 20/40 SUB1 1/5 SUB2 O/S SUB3
index
[7] [1] [2] L4]
[9] [11]
Profile entry for EXAMPLE. Figure 4. m e m b e r of cycle 1, the self and descendant times and call count fraction are those for the cycle as a whole. Since cycle i is called a total of forty times (not counting calls a m o n g m e m b e r s of the cycle), it propagates 5DZ of the cycle's self and descendant time to EXAMPLE's descendant time field. Finally each n a m e is followed by an index that shows where on the listing to find the entry for that routine.
6. Using the Profiles The profiler is a useful tool for improving a set of routines that implement an abstraction. It can be helpful in identifying poorly coded routines, and in evaluating the new algorithms and code that replace them. Taking full advantage of the profiler requires a careful examination of the call graph profile, and a thorough knowledge of the abstractions underlying the program. The easiest optimization that can be performed is a small change to a control construct or data structure that improves the running time of the program. An obvious starting point is a routine that is called m a n y times. For example, suppose an output routine is the only parent of a routine that formats the data. If this format routine is expanded inline in the output routine, the overhead of a function ca]] and return can be saved for each d a t u m that needs to be formatted. The drawback to inline expansion is that the data abstractions in the program m a y b e c o m e less parameterized, hence less clearly defined. The profiling will also b e c o m e less useful since the loss of routines will m a k e its output m o r e granular. For example, if the symbol table functions "lookup", "insert", and "delete" are all m e r g e d into a single parameterized routine, it will be impossible to determine the costs of any one of these individual functions from the profile. Further potential for optimization lies in routines that implement data abstractions whose total execution time is long. For example, a lookup routine might be called only a few times, but use an inefficient linear search algorithm, that might be replaced with a binary search. Alternately, the discovery that a rehashing function is being called excessively, can lead to a different hash function or a larger hash table. If the data abstraction function cannot easily be speeded up, it m a y be advantageous to cache its results, and eliminate the need to rerun it for identical inputs. These and other ideas for program improvement are discussed in [Bentley81 ].
This tool is best used in an iterative approach: profiling the program, eliminating one bottleneck, then finding some other part of the program that begins to dominate execution time. For instance, we have used gprof on itself; eliminating, rewriting, and inline expanding routines, until reading data files (hardly a target for optimization!) represents the dominating factor in its execution time. Certain types of programs are not easily analyzed by gprof. They are typified by programs that exhibit a large degree of recursion, such as recursive descent compilers. The problem is that most of the major routines are grouped into a single monolithic cycle. As in the symbol table abstraction that is placed in one routine, it is impossible to distinguish which m e m b e r s of the cycle are respon.sible for the execution time. Unfortunately there are no easy modifications to these programs that m a k e t h e m amenable to analysis. A completely different use of the profiler is to analyze the control flow of an unfamiliar program. If you reeeive a program from another user that you need to modify in some small way, it is often unclear where the changes need to be made. By running the program on an example and then using gprof, you can get a view of the structure of the program. Consider an example in which you need to change the output format of the program. For purposes of this example suppose that the call graph of the output portion of the program has the following structure:
Initially you look through the gprof output for the system call "WRITE". The format routine you will need to change is probably a m o n g the parents o( the "WRITE" procedure. The next step is tc look at the profile entry for each of parents of "WRITE", in this example either "FORMATI" or "FORMAT2", to determine which one to change. Each format routine will have one or m o r e parents, in this example "CALCI", "CALC2", and "CALC3". By inspecting the source code ~or each of these routines you can
125
d e t e r m i n e which f o r m a t r o u t i n e g e n e r a t e s t h e outp u t t h a t you wish to modify. Since t h e g p r o f e n t r y shows all t h e p o t e n t i a l calls to t h e f o r m a t r o u t i n e you i n t e n d to c h a n g e , you c a n d e t e r m i n e if y o u r m o d i f i c a t i o n s will affect o u t p u t t h a t s h o u l d be l e f t alone. If you d e s i r e to c h a n g e t h e o u t p u t of
8. R e f e r e n c e s [Bentley8 I]
Bentley, J. L., "Writing Efficient Code", D e p a r t m e n t of C o m p u t e r S c i e n c e , Carnegie-Mellon U n i v e r s i t y , P i t t s b u r g h , P e n n s y l v a n i a , CMU-CS81-116, 1981. [GrahamB2] G r a h a m , S. L., Henry, R. R., S e h u l m a n , R. A., "An E x p e r i m e n t in Table Driven Code G e n e r a t i o n " , SIGPLAN '82 S y m p o s i u m on C o m p i l e r C o n s t r u c t i o n , June, 1982. [Joy79] Joy, W. N., G r a h a m , S. L., Haley, C. B. " B e r k e l e y P a s c a l U s e r ' s Manual", V e r s i o n 1.1, C o m p u t e r S c i e n c e Division U n i v e r s i t y of California, B e r k e ley, CA. April 1979. [Knuth71] Knuth, D. E. "An e m p i r i c a l s t u d y of FORTRAN programs", Software - Practice and Experience, 1, 105-133. 1971 [Satterthwaite72] S a t t e r t h w a i t e , E. " D e b u g g i n g Tools for High Level L a n g u a g e s " , S o f t w a r e - P r a c t i c e a n d E x p e r i e n c e , 2, 197-217, 1972 [Tar jan72] Tarjan, R. E., " D e p t h first s e a r c h a n d l i n e a r g r a p h a l g o r i t h m , " SIAM J. Comp'a~ing 1:2, 146160, 1972.
"CALC2", but not "CALC3", then formatting routine "FORMAT2" needs to be split into two separate routines, one of which implements the new format. Y o u can then retarget just the call by "CALC2" that needs the new format. It should be noted that the static call information is particularly useful here since the test case you run probably will not exercise t h e e n t i r e p r o g r a m . 7. C o n c l u s i o n s
We have c r e a t e d a p r o f i l e r t h a t aids in t h e e v a l u a t i o n of m o d u l a r p r o g r a m s . F o r e a c h r o u t i n e in t h e p r o g r a m , t h e profile shows t h e e x t e n t to which t h a t r o u t i n e h e l p s s u p p o r t v a r i o u s a b s t r a c tions, a n d how t h a t r o u t i n e u s e s o t h e r a b s t r a c t i o n s . The profile a c c u r a t e l y a s s e s s e s t h e c o s t of r o u t i n e s a t all levels of t h e p r o g r a m d e c o m p o s i t i o n . The p r o f i l e r is easily u s e d , and c a n b e c o m p i l e d into t h e p r o g r a m w i t h o u t any p r i o r p l a n n i n g by t h e p r o g r a m m e r . It a d d s only five to t h i r t y p e r c e n t e x e c u t i o n o v e r h e a d to t h e p r o g r a m b e i n g profiled, p r o d u c e s no a d d i t i o n a l o u t p u t u n t i l a f t e r t h e p r o g r a m finishes, a n d allows t h e p r o g r a m to b e m e a s u r e d in its a c t u a l e n v i r o n m e n t . Finally, t h e p r o f i l e r r u n s on a t i m e - s h a r i n g s y s t e m using only t h e n o r m a l s e r vices provided by the operating system and compilers.
[Umx] Unix P r o g r a m m e r ' s Manual, " p r o f c o m m a n d " , s e c t i o n 1, Bell L a b o r a t o r i e s , M u r r a y Hill, NJ. J a n u a r y 1979.
126