Automated Software Engineering, 8, 311–354, 2001
© 2001 Kluwer Academic Publishers. Manufactured in The Netherlands.

An Approach for Recovering Distributed System Architectures

NABOR C. MENDONÇA [email protected]
Departamento de Computação, Universidade Federal do Ceará, Campus do Pici, Bloco 910, 60455-760 Fortaleza, CE, Brazil

JEFF KRAMER [email protected]
Department of Computing, Imperial College of Science, Technology and Medicine, 180 Queen’s Gate, London SW7 2BZ, UK

Abstract. Reasoning about software systems at the architectural level is key to effective software development, management, evolution and reuse. All too often, though, the lack of appropriate documentation leads to a situation where architectural design information has to be recovered directly from implemented software artifacts. This is a very demanding process, particularly when it involves recovery of runtime abstractions (clients, servers, interaction protocols, etc.) that are typical of the design of distributed software systems. This paper presents an exploratory reverse engineering approach, called X-ray, to aid programmers in recovering architectural runtime information from a distributed system’s existing software artifacts. X-ray comprises three domain-based static analysis techniques, namely component module classification, syntactic pattern matching, and structural reachability analysis. These complementary techniques can facilitate the task of identifying a distributed system’s implemented executable components and their potential runtime interconnections. The component module classification technique automatically distinguishes source code modules according to the executable components they implement. The syntactic pattern matching technique in turn helps to recognise specific code fragments that may implement typical component interaction features. Finally, the structural reachability analysis technique aids in the association of those features with the code specific to each executable component. The paper describes and illustrates the main concepts underlying each technique, reports on their implementation as a suite of new and off-the-shelf tools, and, to give evidence of the utility of the approach, provides a detailed account of a successful application of the three techniques to help recover a static approximation of the runtime architecture for Field, a publicly-available distributed programming environment.
Keywords: architecture recovery, static analysis, distributed software

1. Introduction

The principled study of software architectures has been an important contribution to software engineering research in the last decade. It is now widely recognised that reasoning about software at the architectural level has the potential to improve many aspects of the software development process, from early detection of inconsistencies and undesired properties, prior to implementation, to better management, evolution, and reuse (Garlan and Perry, 1995). In practice, however, this potential has thus far gone largely unfulfilled (Bass et al., 1998). One reason is simply that current programming technologies are still too limited in their support for architecture-centric software development (Shaw and Garlan, 1996). The main problem, though, lies in the fact that software development is ever more constrained by

312

MENDONÇA AND KRAMER

existing systems, whose architectures are seldom documented properly. All too often, up-to-date architectural information about an existing system has to be recovered directly from the system’s implemented artifacts—a very demanding process commonly referred to as architecture recovery (Harris et al., 1995; Mendonça and Kramer, 1996; Gall et al., 1996). Effective architecture recovery approaches are thus expected to play a key part in bringing the promises of software architecture research into practice.

Technically, the process of recovering architectural abstractions from existing code artifacts can be seen as an attempt to reverse the process that might have been used to derive those artifacts from a suitable architectural specification. However, any attempt to bridge the gap between the code level and the architectural specification level is hampered by the limitations of traditional programming languages in capturing high-level design concepts (Shaw and Garlan, 1996). In particular, those languages provide no explicit construct to implement a distributed system’s runtime abstractions (e.g., servers, clients, interaction protocols), which then have to be encoded using the primitive constructs available. As a consequence, recognising those abstractions in a distributed system implementation can be exceedingly difficult if based on plain source code information alone (Holtzblatt et al., 1997).

Another difficulty stems from the fact that a single architectural description may be insufficient to represent the design of a large distributed system. Many researchers and practitioners advocate that the design of a complex software system should be described from multiple architectural views, with each view emphasising a specific set of concerns and including its own types of design elements and rationale (Kruchten, 1995; Soni et al., 1995).
For example, Kruchten describes an industry-based architectural model where the main views are a logical view (focusing on problem domain elements and their functionality), a process view (focusing on runtime elements and their concurrency and synchronisation properties), a development view (focusing on modules and their aggregation into subsystems, layers and libraries), and a physical view (focusing on the allocation of components onto hardware devices) (Kruchten, 1995). The challenge that multiple architectural views pose to architecture recovery is that one must not only be able to recognise abstractions from different views, but also to understand how the abstractions recognised for one view may relate to the abstractions recognised for the other views (Waters and Abowd, 1999).

Most reverse engineering tools are limited in supporting architecture recovery because they tend to neglect aspects of system distribution and runtime organisation when extracting design information from a system’s implemented artifacts. The exceptions are dynamic analysis tools (Kunz and Black, 1995; Sefika et al., 1996a; Jerding and Rugaber, 1998) and domain-based pattern matching tools (Fiutem et al., 1996; Holtzblatt et al., 1997). Dynamic analysis tools, though precise in capturing a system’s runtime organisation, are generally limited in revealing how runtime abstractions may relate to, and are realised in terms of, the various elements of the source code. In addition, those tools tend to require costly event-tracing techniques, such as code instrumentation and symbolic execution, which may be impractical for time-critical systems or systems whose original development environment is no longer (or only partially) available. Domain-based pattern matching tools in turn trade off precision against a better mapping of potential runtime events to their executing code


fragments, but are still limited in revealing how those fragments may be reused across the code for different executable components.

This paper claims that combining traditional and more sophisticated static analysis techniques can be a cost-effective way of recovering a distributed system’s “as-built” architectural design. To support this claim, the paper presents X-ray, a static multi-technique architecture recovery approach for distributed systems.1 X-ray comprises three domain-based static analysis techniques, each developed to facilitate the identification of implemented executable components and their potential runtime interconnections.

The first technique—component module classification (Mendonça and Kramer, 1999)—is a new variation on existing subsystem classification techniques. It is used to identify which compilation modules of the source code constitute the code for each executable component. Moreover, the technique explicitly distinguishes modules that are exclusive to a single component from modules that are shared by multiple components. The second technique—syntactic pattern matching—is a flexible pattern-matching-based technique that enables specification and execution of queries over a syntactic source code representation. It is used both to capture and to search for the stereotyped implementation of typical component interaction features. The third and last technique—structural reachability analysis—computes transitive closures of the activation relation over nodes of an activation graph system model. It is used to support the task of assigning component interaction features recognised in shared modules to individual components. In so doing, the technique also helps to reveal the different paths of computation, across both shared and component-exclusive code, that may lead to the execution of a potential runtime event.
These three techniques are supported by a proof-of-concept architecture recovery prototype that integrates several new and off-the-shelf tools. By considering different types of domain information at multiple levels of abstraction, X-ray also helps to relate implemented design elements across multiple architectural views. For example, following Kruchten’s model (Kruchten, 1995), the module classification technique maps elements of the development view (source code modules and subsystems) to elements of the process view (executable components). The pattern matching and reachability analysis techniques further enrich this mapping by associating a module’s specific code fragments with potential runtime interaction events. These two techniques are also useful in revealing aspects of the logical and physical views. For instance, recognising that two executable components may interact at runtime through a pipe indicates that these components are likely to play the architectural role of filters (“logical view”), and that they can only be executed concurrently under the same communications domain (“physical view”). Similarly, recognising that two components may interact through a TCP/IP-based socket indicates that these components may play the roles of server and client, respectively, and that, in principle, they can be executed on any compatible machine connected to the Internet.

Like most domain-oriented architecture recovery approaches (Harris et al., 1995; Gall et al., 1996; Fiutem et al., 1996), X-ray is based on a hybrid reverse engineering process, in which bottom-up recovery strategies provide the basis for top-down exploration of architectural hypotheses. For example, the module classification technique automatically reveals how executable components may be implemented in a distributed system’s source code.
This information, combined with architectural expectations that the programmer or software maintainer may have about the system, is then used to guide the application of


the pattern matching and reachability analysis techniques. Typical sources of architectural expectations are existing documentation, knowledge of the application domain (such as standards and domain-specific architectures), catalogs of architectural styles and patterns, and the programmer’s or maintainer’s own development experience.

A preliminary investigation of the effectiveness of X-ray has been carried out through a number of case studies involving distributed software of varying sizes and application domains (Mendonça, 1999). In particular, X-ray has been successfully used to help extract architectural runtime information from the source code of two moderate-size, publicly-available distributed systems, namely Samba (Tridgell, 1994), a distributed software suite that provides file and print services across Unix and non-Unix based platforms, and Field (Reiss, 1990), a distributed programming environment. These experiments build confidence that combining module classification, syntactic pattern matching, and structural reachability analysis can be a cost-effective way of recovering up-to-date architectural information from existing distributed system artifacts.

The rest of the paper is organised as follows. Sections 2–4 introduce X-ray by describing and illustrating the main concepts that underlie each of its constituent techniques. Section 5 presents the approach’s support prototype and describes some aspects of its implementation. Section 6 provides an in-depth account of the method and results of the Field case study. Section 7 covers related work. Finally, Section 8 concludes the paper by summarising the research and giving directions for future work.

2. Component module classification

The ability to understand the architecture of an existing distributed system relies to a great extent on the ability to recognise the system’s implemented executable components. Traditionally, due to the complexity of the source code, information on component implementation is obtained via manual examination of non-code artifacts, such as documentation, directory hierarchies, and, more commonly, configuration files (e.g., those used by Make (Feldman, 1979)). This approach has a number of clear disadvantages. First, because the code is the primary—and often the only—locus of maintenance, documentation artifacts are seldom in full sync with the implemented system. Second, many distributed systems have a flat directory structure, or a directory structure that does not reflect the implementation of executable components. Third, configuration files, despite being used primarily to automate the process of building executables, are neither the ideal nor the most accurate source of information on the implementation of executable components. In this respect, configuration files have at least three shortcomings:

1. Large systems tend to require large configuration and installation procedures (with the number of configuration commands at least proportional to the number of compilation modules in the source code). In addition, configuration information can be spread throughout the entire source code instead of localised in a single configuration file. As a consequence, extracting information pertaining to the implementation of executable components from configuration files may not be a trivial task.

2. Poorly maintained systems can also have poorly maintained configuration files. In the worst scenario, configuration files may contain information that is out-of-date with


respect to the source code evidence. For example, a configuration file may describe a module as being part of the build dependencies for an executable even when the executable implementation no longer needs resources provided by that module.

3. Most remarkably, traditional configuration file organisations fail to distinguish which modules are used exclusively to build a particular executable and which are shared by multiple executables. This distinction is important in that it helps to understand the unique functionalities associated with each executable and how different executables may depend upon common implementation resources.

The component module classification technique is aimed at avoiding these limitations. The technique classifies compilation modules as either exclusive to a single executable or shared by multiple executables. It does so based on the analysis of module dependency information extracted automatically from the source code. The main concepts underlying the technique are described below.

2.1. Entry and library modules

In most programming languages, the code for an executable component comprises an entry compilation module and, optionally, one or more library modules. An entry module is a module containing a language-designated execution entry point. In Pascal, for example, an entry point corresponds to a procedure defined by the reserved word PROGRAM; in C and C++, to a main function; in Java, to either a main class method (in the case of Java applications) or a class constructor derived from one of the constructors of the Applet class (in the case of Java applets); and so on. Library modules in turn are compilation modules that do not contain an execution entry point. A library module is thus not confined to a single executable program and hence may be reused to build different executables.

Identifying entry modules and the library modules they depend upon is key to recognising what, how many, and where executable components are implemented in the source code of an unfamiliar distributed system. To identify entry modules and their dependencies, the component module classification technique relies on the module dependency graph system model.
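To make the idea concrete, the heuristic of treating any module that defines a language-designated entry point as an entry module can be sketched as follows. This is only a minimal illustration for C sources; the module names, contents, and the regular expression are assumptions of this sketch, not part of X-ray itself, whose extractors work from parsed source models.

```python
import re

# Crude heuristic: a C compilation module is an entry module if it
# defines a main() function. (A real extractor parses the code; this
# regex sketch misses pre-processor tricks and unusual declarations.)
MAIN_DEF = re.compile(r"\b(?:int|void)\s+main\s*\(")

def entry_modules(modules):
    """modules: mapping of module name -> C source text.
    Returns the names of modules containing an execution entry point."""
    return {name for name, src in modules.items() if MAIN_DEF.search(src)}

# Hypothetical two-module system: one entry module, one library module.
sources = {
    "inter.c":   "int main(int argc, char **argv) { return 0; }",
    "message.c": "void send_msg(int fd) { /* ... */ }",
}
print(entry_modules(sources))  # {'inter.c'}
```

Modules left over after this step are the library modules; the next subsection shows how their dependencies are modelled.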

2.2. Module dependency graph

A module dependency graph (“MDG”) is an abstract graph model of the compilation modules of a system source code and their “depends-on” relationships. In an MDG, each node corresponds to a distinct compilation module. A directed edge between two nodes, say p and q, represents the fact that module p depends on (i.e., uses resources defined and exported by) module q. Formally, an MDG can be defined as a particular kind of flow graph (Hecht, 1977) as follows:

Definition 1 (MDG).

A module dependency graph is a triple G = (M, D, r ), where

– M is a finite set of modules;


– D ⊂ (M × M) is the module-dependency relation, i.e., a tuple of the form ⟨p, q⟩ ∈ D exists if and only if p depends on q; and
– there is a path in G connecting the root module, r ∈ M, to every module.

The above definition captures two typical properties of flow graphs: there is a specific node at which to begin, and every node is accessible from this initial node. These properties can be guaranteed during an MDG construction process as follows. If the initial MDG, constructed based on extracted source code information, contains multiple root modules (multiple root “entry” modules, in the case of an MDG for a distributed software system), then a single “super-root” module can be added with an edge to each root module of the initial graph. Any subgraph inaccessible from the root can either be removed, if the subgraph is found to contain only “dead” modules (i.e., modules no longer used), or be reintegrated into the graph, if the dependency relationships required to reconnect the subgraph are found to be missing due to limitations of the particular source model extraction tool used.

Note that the definition does not constrain an MDG to be acyclic. This means that care must be taken to prevent cycles from being formed during an MDG traversal operation, so as to guarantee that the operation terminates properly. Finally, it is important to emphasise that an MDG only captures implementation dependencies between compilation modules that are part of the system source. Information on the inclusion of header files and the use of external library modules is completely omitted from the graph. The advantage of filtering out this kind of information is that it reduces the library saturation problem during model inspection and visualisation.

Two important relations over MDG nodes, useful for reasoning about a number of interesting MDG properties, are the relations of reachability and dominance (Hecht, 1977). They are defined formally as follows:

Definition 2 (Reachability). If x and y are two (not necessarily distinct) modules in an MDG G, then x reaches y if and only if there is a path in G connecting x to y. For convenience, let REACH(x) = {y | x reaches y} define the reachability-set of every module x in G.

The reachability relation captures the notion of transitive dependency among modules: if p reaches q, then p depends directly or indirectly on q.

Definition 3 (Dominance). If x and y are two (not necessarily distinct) modules in an MDG G, then x dominates y if and only if every path in G connecting its root module to y contains x. For convenience, let DOM(x) = {y | x dominates y} define the dominance-set of every module x in G.

By definition, DOM(x) ⊆ REACH(x) for every module x in G. The dominance relation captures the notion of exclusive transitive dependency among modules: if p dominates q, then every module not dominated by p that depends on q does so only indirectly via p. The reachability and dominance relations are well known and have been used to reason about a variety of flow-graph based program models, including call graphs (Cimitile and Visaggio, 1995; Burd and Munro, 1998) and object aliasing graphs (Clarke et al., 1998).


It is easy to show that both relations are reflexive, antisymmetric and transitive, and that the graph of the reflexive and transitive reduction of the dominance relation is a tree—there is a path from p to q in the tree if and only if p dominates q (Hecht, 1977).
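As an illustration, REACH and DOM can be computed directly from their definitions on a toy MDG. The graph below is hypothetical, and the dominance computation is a deliberately naive sketch (delete x and see what the root can no longer reach); a production tool would use an efficient dominator algorithm such as Lengauer–Tarjan.

```python
# Toy MDG as an adjacency map; "root" is the super-root whose
# successors are the entry modules (hypothetical example graph).
MDG = {
    "root":    ["inter", "control"],
    "inter":   ["parser", "message"],
    "control": ["message"],
    "parser":  [],
    "message": [],
}

def reach(g, x):
    """REACH(x): every module transitively depended on by x (incl. x)."""
    seen, stack = set(), [x]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(g[n])
    return seen

def dom(g, root, x):
    """DOM(x): modules y such that every root-to-y path contains x.
    Computed by deleting x's incoming edges and checking what the
    root can no longer reach."""
    if x == root:
        return reach(g, root)
    pruned = {n: [m for m in succ if m != x] for n, succ in g.items()}
    still_reachable = reach(pruned, root)
    return {y for y in reach(g, root) if y not in still_reachable} | {x}

print(dom(MDG, "root", "inter"))  # inter dominates itself and parser
print(reach(MDG, "inter"))        # inter also reaches the shared module message
```

Note how the example bears out the inclusion DOM(x) ⊆ REACH(x): inter dominates only parser (and itself), while it reaches message as well, since message is also reachable via control.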

2.3. Module categories

The notion of entry modules, coupled with the reachability and dominance MDG relations, provides a basis for the formal definition of three module categories: relevant modules, exclusive modules, and shared modules. The former two categories are defined relative to the implementation of a particular executable component, while the latter is defined relative to the source code as a whole.

2.3.1. Relevant modules. The modules relevant to a component are all the modules that implement the component’s respective executable program, that is, the program’s entry module plus all the library modules upon which the entry module depends. More specifically, a module p is relevant to a certain component if and only if the entry module of that component “reaches” p in the MDG. The set of relevant modules of a component is defined in terms of the reachability relation as follows:

Definition 4 (Relevant modules). Let (M, D, r) be the MDG for a distributed system; let C be a component of that system; and let Centry ∈ M be the entry module of C. The modules relevant to C are all the modules in REACH(Centry).

2.3.2. Exclusive modules. The modules exclusive to a component are the modules that are unique to the component’s respective executable program. This set includes the program’s entry module, of course, and, depending on the way that the program is implemented, may include none, part, or even all of its library modules. More specifically, a module p is exclusive to a certain component if and only if the entry module of that component “dominates” p in the MDG. The set of exclusive modules of a component is defined in terms of the dominance relation as follows:

Definition 5 (Exclusive modules). Let (M, D, r) be the MDG for a distributed system, let C be a component of that system, and let Centry ∈ M be the entry module of C. The modules exclusive to C are all the modules in DOM(Centry).

2.3.3. Shared modules.
The modules shared by multiple components are library modules that are used across multiple executable programs. More specifically, a module p is shared if and only if p is not dominated by any component entry module in the MDG. Shared modules can be defined in terms of either the reachability or the dominance relation:

Definition 6 (Shared modules). Let G = (M, D, r) be the MDG for a distributed system, and let P ⊆ M be the set of all component entry modules in G. The shared modules in G are all the modules in the set S ⊂ M such that ∀x ∈ S, ∃y, z ∈ P, y ≠ z | (x ∈ REACH(y) ∧ x ∈ REACH(z)) or, alternatively, ∀x ∈ S, ¬∃y ∈ P | x ∈ DOM(y). Note that, by definition, S ∩ P = ∅ and ∀x ∈ P, (REACH(x) − DOM(x)) ⊆ S.
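Under the super-root construction, in which the root’s successors are exactly the entry modules, the three categories follow directly from reachability counts: a module reached by two or more entry modules is shared, and the remainder of each entry module’s reachability-set is exclusive to it. The following self-contained sketch illustrates this; the graph and module names are hypothetical, loosely echoing the ANIMADO example discussed later.

```python
from collections import Counter

def reach(g, x):
    """REACH(x): x plus every module it transitively depends on."""
    seen, stack = set(), [x]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(g[n])
    return seen

def classify(g, entries):
    """Split modules into per-entry exclusive sets and a shared set,
    assuming the MDG root's successors are exactly the entry modules
    (so that DOM(e) = REACH(e) - S, per the note after Definition 6)."""
    reach_sets = {e: reach(g, e) for e in entries}
    counts = Counter(m for s in reach_sets.values() for m in s)
    shared = {m for m, c in counts.items() if c > 1}
    exclusive = {e: reach_sets[e] - shared for e in entries}
    return exclusive, shared

# Hypothetical MDG fragment with two entry modules, inter and control.
mdg = {
    "inter":   ["parser", "message"],
    "parser":  ["table", "stack"],
    "table":   [], "stack": [], "message": [],
    "control": ["message"],
}
exclusive, shared = classify(mdg, ["inter", "control"])
# message is shared; parser, table and stack are exclusive to inter
```

The sketch recovers exactly the categorisation the technique aims for: the library modules dominated by a single entry module end up in that component’s exclusive set, and the ones reached from several entry modules end up shared.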

2.4. A sample application

Applying the component module classification technique to a distributed system’s source code requires three basic steps. Here these steps are illustrated through a sample application of the technique to the source code modules of ANIMADO (Rodrigues, 1993), a socket-based client/server computer animation prototype.2

First, the system’s MDG, constructed based on extracted module dependencies, is built and extended so that it complies with the definition of a flow graph (figure 1). This extension consists of adding a single super-root module to the MDG with an edge to each root module of the original graph (those identified as entry modules). Second, module dominance relationships are made explicit by constructing the MDG’s dominance tree (figure 2). The dominance tree makes it easy to visualise which modules belong exclusively to a component and which are shared: modules exclusive to a component form a subtree rooted by an entry module, whereas shared modules

Figure 1. ANIMADO’s module dependency graph. Each module of the graph except the super-root module corresponds to a distinct .c file of the source. Entry modules are highlighted in gray.

Figure 2. ANIMADO’s module dominance tree.


form subtrees rooted by library modules. For instance, it is clear from figure 2 that parser, table, and stack are all exclusive to the same component (since they are all dominated by the entry module inter), and that message is a shared module (since it is a library module which is not dominated by any entry module). Finally, the exclusive and shared module categories are made explicit by clustering together MDG modules that are dominated by the same component entry module (figure 3).

The results in figure 3 are enough to provide a number of interesting insights into the way that ANIMADO was designed and implemented that would otherwise be difficult (or at least more demanding) to grasp:

– The system comprises five different executable components, with entry modules inter, control, collision, fred, and egesp, respectively.
– The component with entry module inter has three other exclusive modules; the components with entry modules collision and egesp have one each; and the components with entry modules fred and control have none.
– The module message is the single shared module amongst the 11 modules of the source; it provides resources to at least one exclusive module of each of the five components identified.
– Since message is documented as the implementation module for the system’s message exchange mechanism, the finding that it provides resources to all recognised components suggests that these components may all engage in some sort of message-based interaction at runtime.

Figure 3. ANIMADO’s module dependency graph after classification by the component module classification technique. Component exclusive modules are clustered together according to their dominant entry modules. The added super-root module was suppressed for clarity.


– Since fred, egesp, and matrix are documented as closely related modules that implement the system’s required set of mathematical methods, the finding that they are part of two different components suggests that these two components are likely to exchange some sort of mathematical information at runtime.

Of course, the module classification technique reveals very little as to the architectural roles that the identified components may play at runtime (“which components are clients and which are servers?”), their patterns of interaction (“which components interact with which other components?”), and their underlying communication mechanism (“do components interact via sockets, RPCs, or otherwise?”). Recovering this kind of information, which is obviously unavailable at the level of module dependencies, is the aim of the other two techniques provided by X-ray, namely syntactic pattern matching and structural reachability analysis. The syntactic pattern matching technique is described below.

3. Syntactic pattern matching

Identifying implemented executable components provides only part of the information necessary to understand a distributed system’s potential runtime organisation. Also important is the recognition of the mechanisms or features through which these components are expected to interact at runtime. This is not a simple task, though, as it depends on external domain knowledge pertaining to both the architectural style and the development platform of the system (Holtzblatt et al., 1997). More specifically, recognising implemented interaction features requires information on the kinds of interaction features to expect (style knowledge), and on the language constructs through which the expected interaction features are most likely to be implemented in the source code (platform knowledge).

The syntactic pattern matching technique is aimed at facilitating the recognition of code constructs that may implement runtime interaction features. The technique comprises a notation for the definition of syntactic program patterns, and an associated pattern-matching mechanism for the automatic search of program patterns in a syntactic source code representation. The underlying idea is to use program patterns as the means to represent the stereotyped implementation of typical interaction features under a certain platform of interest. In this way, the interaction features of a distributed system developed under that platform could be automatically recognised by matching the appropriate program patterns against an abstract representation of the system’s source code.

3.1. Program patterns

A program pattern is an abstract representation embodying program knowledge at the lexical, syntactic or algorithmic levels (Rich and Wills, 1990; Harandi and Ning, 1990; Paul and Prakash, 1994; Griswold et al., 1996). A program pattern can be used to represent a wide range of program concepts, from simple programming constructs and techniques (e.g., subroutine calls, variable swaps, iteration loops, and recursion), to more general algorithms (e.g., sort and search). Typically, a program pattern contains both fixed and varying parts; it may also include constraints that restrict the varying parts (Rich and Wills, 1990).


For instance, a pattern for a typical imperative matrix multiplication algorithm inevitably contains three nested iteration loops; however, the names, types and ranges of the iteration variables may vary. Such a pattern may also include constraints to enforce that iteration variables are used as an index to the main matrix data structure.
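The fixed part of such a pattern (the three nested loops) is straightforward to check over a syntax tree. X-ray expresses this in Prolog over C ASTs; purely as an illustrative analogue, the sketch below checks the nesting-depth constraint on a Python AST using the standard ast module (the helper name and sample code are assumptions of this sketch).

```python
import ast

def max_for_depth(node):
    """Maximum nesting depth of for-loops beneath (and including) node."""
    depth = max((max_for_depth(c) for c in ast.iter_child_nodes(node)), default=0)
    return depth + 1 if isinstance(node, ast.For) else depth

# A candidate match for the matrix-multiplication pattern: the loop
# variable names may vary (the varying parts), but the three nested
# loops are fixed.
MATMUL = """
for i in range(n):
    for j in range(n):
        for k in range(n):
            c[i][j] = c[i][j] + a[i][k] * b[k][j]
"""
print(max_for_depth(ast.parse(MATMUL)))  # 3
```

A full matcher would add the constraint that the loop variables index the matrix data structure, which corresponds to the constrained varying parts of the pattern.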

3.2. Defining program patterns

X-ray allows the definition of syntactic program patterns through a library of special Prolog predicates. These predicates can be used both to specify and to execute a variety of matching operations on an abstract syntax tree (“AST”) representation of a program. An X-ray program pattern is then any valid Prolog clause defined in terms of one or more of the special predicates provided. Predicates used specifically to match AST nodes and node attributes constitute the fixed parts of a pattern, while their arguments constitute the pattern’s varying parts. The varying parts can be further constrained by expressing equivalence and structural relations among them. Equivalence relations are enforced by specifying arguments with the same Prolog name. Structural relations in turn are enforced by means of structural predicates that match AST nodes according to their relative position within the overall AST structure. This approach to program pattern specification is similar to the one used in the ART system (Fiutem et al., 1996), which in turn is based on the formalism described in Kozaczynski et al. (1992). A full description of all AST-matching predicates provided by X-ray is given in Appendix A.

Using a Prolog environment for the definition and matching of program patterns offers a number of benefits. First, a Prolog-based program pattern is, at the same time, the formal description of a program concept and the “executable” engine to match potential implementations of that concept against any AST structure represented in the form of a Prolog knowledge base. Second, the ability to specify predicate arguments with either instantiated or uninstantiated values leads to a highly flexible matching mechanism. Finally, the set of all possible matches for a given program pattern can be easily manipulated iteratively, via successive backtracking, or altogether, via any standard “all-solution” Prolog predicate, for example, findall/3, bagof/3 and setof/3 (Sterling and Shapiro, 1994).3

To illustrate how a program pattern can be defined using this notation, consider the special Prolog predicate below:

    ast_assign(AssignId,Lhs,Rhs,(File,Line))
Finally, the set of all possible matches for a given program pattern can be easily manipulated iteratively, via successive backtracking, or altogether, via any standard "all-solutions" Prolog predicate, for example, findall/3, bagof/3 and setof/3 (Sterling and Shapiro, 1994).3

To illustrate how a program pattern can be defined using this notation, consider the special Prolog predicate below:

ast_assign(AssignId,Lhs,Rhs,(File,Line))

This predicate succeeds if it matches a node of type assignment expression in a depth-first visit of the AST. Its four arguments match, respectively, the identifier of the main node (AssignId), the node identifiers for the assignment's left-hand-side ("LHS") and right-hand-side ("RHS") expressions (Lhs and Rhs), and the source code location of the assignment in the form of a (File,Line) pair. Here is another example:

ast_call(CallId,CallName,ParList,Loc)


This predicate succeeds if it matches an AST node of type subroutine-call, with its four arguments matching the identifier of the matched node (CallId), a string with the callee's name (CallName), a list with the node identifiers of each parameter expression of the call (ParList), and the source code location pair of the call (Loc).

The following composite clause illustrates how more elaborate patterns can be defined taking advantage of Prolog's standard library predicates:

member(File,["init.c", "lib.c"]),
ast_assign(AssignId,LHS,RHS,(File,AssignLine)),
ast_call(CallId,CallName,ParList,(File,CallLine)),
ast_parent(RHS,CallId)

This clause (or pattern) succeeds if it matches an assignment and a subroutine-call node pair, with the equivalence constraint (expressed through a common predicate argument instantiated via Prolog's standard list library predicate, member/2) that both nodes must occur in AST branches generated from the same file. Note that the list ["init.c", "lib.c"], passed as an argument to the member/2 predicate, restricts the name of the file to be either init.c or lib.c. In addition, the pattern also includes the structural constraint (expressed by means of the notation's structural predicate ast_parent/2) that the node corresponding to the assignment's RHS expression must be the parent, in the AST, of the node corresponding to the subroutine call. In other words, the above pattern will match any assignment construct, in either init.c or lib.c, whose RHS expression (perhaps an arithmetical or type-casting expression) contains a subroutine call.

Figure 4 shows further examples of program patterns, but now specific to the C programming language. The pattern in figure 4(a) takes advantage of Prolog's unification and matching capabilities, which allow alternative definitions for the same predicate, to represent the concept of a "generic-assignment" statement in C.
This pattern succeeds if it matches an assignment expression implemented either via C's primitive assignment construct, "=", or via any of the three "variable-copy" C library routines strcpy(), bcopy() and memcpy().

Figure 4. Examples of program patterns specific to the C language. (a) represents a “generic assignment” statement, and (b) represents the “variable swap” concept.


The pattern in figure 4(b) in turn relies on the generic-assignment pattern shown in figure 4(a) to represent the concept of a variable swap.4 This pattern succeeds if it matches three generic-assignment statements constrained by the use of common Prolog names and the notation's structural predicates ast_before/2 and ast_sameident/2. The structural predicate ast_before/2 is used to match a pair of AST nodes where the first node always appears before the second node in a depth-first visit of the AST. In the variable-swap pattern, ast_before/2 is used to enforce the required precedence order over the three assignment statements. The predicate ast_sameident/2 in turn is used to match a pair of AST nodes whose descendants refer to a common program identifier node. In the variable-swap pattern, ast_sameident/2 is used to enforce syntactic equivalence among program identifiers that may appear embedded in type-casting expressions.

The following section illustrates how the Prolog-based pattern notation of X-ray can be used to represent the stereotyped implementation of a typical client/server interaction mechanism, namely creation and connection of sockets.

3.3. Socket creation patterns

Sockets are bidirectional communication channels used as the primary component interaction mechanism in many Unix-based client/server systems (Stevens, 1990). Socket-based communication between two components requires one component to open a server-side socket and the other to open a client-side socket. In the C language, a descriptor to an open socket can be obtained by issuing a socket() library call, whose arguments define the domain, type, and protocol parameters of the new socket.

Connecting a client-side socket to a server-side socket involves the use of a different set of library calls at each side. At the server side, an open socket must be explicitly associated with a server address in the communications domain—basically a host name and a port service number—by issuing a bind() call. The server can then wait for client connections by issuing a listen() and an accept() call (in the case of stream sockets, which are based on the TCP/IP protocol), or by issuing a recvfrom() call (in the case of datagram sockets, which are based on the UDP/IP protocol) (Stevens, 1990). At the client side, a connection to a server-side socket can be established explicitly by passing a local socket descriptor and a valid server address as arguments to a connect() call (in the case of stream sockets), or directly to a sendto() call (in the case of datagram sockets).

Figure 5 shows two program patterns representing creation of a server-side socket and a client-side socket, respectively. Both patterns are defined using the same set of arguments: an expression corresponding to the matched socket descriptor; an expression list with the socket's domain, type, protocol and address parameters; and the likely location of the match in the source code. Note how the client-side socket creation pattern, shown in figure 5(b), uses the auxiliary predicate connection_call/2 to allow matches containing either connect() or sendto() as the main socket connection call.
Using these patterns, three conditions must be satisfied for one to be able to recognise a potential implementation for a socket-based interaction between two distributed system components. First, each pattern has to be matched in a section of code associated with a distinct executable component. Second, sockets created by the two matching code fragments


Figure 5. Two program patterns representing socket creation in C. (a) represents creation of a server-side socket, and (b) represents creation of a client-side socket.

must be of compatible type, domain, and protocol. Finally, the values of the server address parameters (particularly its port number field) used at the client side must refer to the corresponding parameter values defined at the server side.

For systems developed following a traditional single-server/multiple-client style, it is possible to recognise potential socket-based interactions even when the server address parameters used to establish the connections are not statically defined in the code. This is because all clients in a single-server/multiple-client system are expected to communicate with the same server component at runtime. For systems of the multiple-server/multiple-client style this would not be possible, though, as the only way to determine which server a client is expected to communicate with would be via analysis of the server address parameters defined at both sides.

Samples of program patterns for four other types of C/Unix interaction mechanisms (namely spawning of a child component, shell-based and process overlay invocation of an external component, and creation and connection of a pipe) are shown in figure 6. Although many other types of component interaction mechanisms are available under the C/Unix domain, these four types of interaction mechanisms, together with client/server sockets, constitute a representative subset of the types of architectural connectors that have been catalogued by the software architecture research community (Shaw and Garlan, 1996).

3.4. Limitations of the pattern matching technique

The set of pattern-matching predicates provided by X-ray, together with the unification and matching mechanisms of Prolog, suffices to represent the implementation of a wide diversity of program concepts in general, and component interaction features in particular. However, the high syntactic variability of some interaction feature implementations can lead to patterns with a large number of alternative descriptions, as is the case with the auxiliary predicate that handles testing of the result of the fork() call. In addition, equivalence constraints can be enforced among nodes corresponding to variable and parameter expressions


Figure 6. Program patterns representing four different types of C/Unix component interaction mechanisms. (a) represents spawning of a child component via a fork( ) call; (b) represents shell-based and process overlay invocation of an external component via, respectively, a system( ) call and any variant of the exec( ) family of calls; and (c) represents creation and connection of a pipe. The definition of auxiliary predicates has been suppressed for clarity.

but not among the actual values of these expressions. As a consequence, interaction features that may be implemented in somewhat non-conventional ways can be difficult to represent and sometimes impossible to match. This is especially true for interaction features whose implementation may be delocalised (Letowsky and Soloway, 1986), interleaved (Rugaber et al., 1996), or may involve non-trivial programming concepts such as multi-valued constant propagation (Merlo et al., 1995) and pointer aliasing (Zhang et al., 1996).

An effective way of improving the expressiveness of the pattern notation—and, consequently, the effectiveness of the matching mechanism—is to augment pattern descriptions with flow information (Tonella et al., 1996). By allowing constraints to be expressed in terms of data and control flow relations, it is possible to reduce the number of alternative descriptions needed to represent a complex concept implementation; to enforce equivalence among values of variable and parameter expressions; and to match delocalised concepts.


Flow information may also be helpful in tracking down the values of parameters that are defined far apart from the locations where they are used. Currently, X-ray provides no pattern-matching Prolog predicate aimed at expressing flow-dependence constraints. Ultimately, accommodating predicates of this kind will require implementing an inter-procedural flow analysis tool—or perhaps reusing an existing flow analysis tool—so as to annotate the AST representation with the necessary control and data flow information.

In the particular case of delocalised concepts, though, one way of circumventing the lack of flow-based constraint predicates is to define patterns with an alternative description for each potential delocalised part. This would allow matching of each part independently of the others. However, such an artifice is not without its drawbacks. The main one is the impossibility of expressing constraints involving predicate arguments used in different alternative descriptions for the same pattern. As a consequence, if multiple parts of a pattern are matched, it is not possible to distinguish, automatically, which parts belong together and in which order.

Further research is also required into the scalability of the pattern matching mechanism. Although the general "comparison" problem between program patterns (represented by syntactic schemas and constraints) and code (represented by annotated ASTs) is known to be NP-hard (Woods and Yang, 1996), empirical studies indicate that some heuristic strategies, such as constraint reordering, may drastically improve matching efficiency (Woods and Quilici, 1996). Investigating how some of these strategies may affect the performance of the Prolog-based matching mechanism of X-ray is an interesting topic for future work.

3.5. A sample application

To investigate how effective the patterns in figure 5 are in representing real socket implementations, they were both executed against a Prolog-based AST representation derived from the source code for ANIMADO. The results were one successful match for each pattern. As expected, both matches occurred in the AST derived from the module message. The matching fragments are shown in figure 7. Even though these results confirm the utility of the pattern notation, at least for representing and matching socket creation features, syntactically-based pattern matching alone may

Figure 7. ANIMADO’s code fragments matching socket creation patterns. (a) matches creation of a client-side socket, and (b) matches creation of a server-side socket.


still be limited in revealing which executable components actually execute the matched code fragments. For example, because message is a shared module that provides resources to all five components identified within the ANIMADO source, it is not possible to promptly associate the two socket creation code fragments matched therein with any executable component in particular. Hence, at this point, it is unclear which ANIMADO component may play the role of a server or a client at runtime. To facilitate the association of interaction features recognised in shared modules with individual components, X-ray provides the structural reachability analysis technique.

4. Structural reachability analysis

The encapsulation of runtime features in shared modules is a common practice in the development of many distributed systems. Amongst other benefits, it reduces code redundancy, helps to abstract from "low-level" feature implementation details, and has the potential to ease system maintenance and evolution. On the other hand, as not all components use shared modules in exactly the same way, it may be difficult to determine, based solely on a syntactic approach, which components are involved with a runtime feature whose implementation is part of a shared module. The structural reachability analysis technique is aimed at supporting exactly this task. The technique computes transitive closures over the activation relations between program units so as to determine which components may "reach" the implementation of a particular architectural feature. To this end, it relies on the activation graph structural system model and the notion of feature reachability. These two concepts are defined below.

4.1. Activation graph

An activation graph (AG) is an abstract graph model of the well-defined activation units (procedures, functions, methods, etc.) of a software system and the "activates" (or calls) relationships between them. In an AG, each node corresponds to a distinct activation unit. A directed edge between two nodes, say p and q, represents the fact that p directly activates q. Formally, an AG can be defined as another kind of flow graph:

Definition 7 (Activation graph). An activation graph for a system is a triple G = (U, A, e), where

– U is a finite set of well-defined activation units;
– A ⊂ (U × U) is the activation relation, i.e., a tuple of the form ⟨p, q⟩ ∈ A exists if and only if p activates q in some execution of the system; and
– there is a path in G connecting the entry activation unit, e ∈ U, to every activation unit.

In practice, an AG constructed from extracted source code information can have multiple entry units and can also be disconnected. As is the case with other types of flow-graph based source models, a flow-graph compliant AG can be obtained by adding a single "super-entry" unit with an edge to each entry unit of the largest strongly-connected subgraph of


the original graph. The concepts of reachability and reachability sets over AG nodes can be defined exactly as those for an MDG (see Section 2.2).

4.2. Feature reachability

An executable component is said to "reach" a certain interaction feature if its execution entry point directly or indirectly activates the minimal set of program units that are commonly associated with a potential implementation of that feature. In C, for example, an executable component can be said to reach the feature representing creation of a server-side socket if its main() function directly or indirectly calls the library routines socket() and bind(). Similarly, the component is said to reach a client-side socket creation feature if its main() function directly or indirectly calls socket() and connect(). Formally, the concept of feature reachability is defined as follows:

Definition 8 (Feature reachability). Let G = (U, A, e) be an AG for a distributed system; let C be a component of that system; let C_entry ∈ U be the entry activation unit of C; and let IMP(F) ⊂ U be the minimal set of activation units necessary to implement an interaction feature, F, under the system's underlying platform. C reaches F in G if and only if IMP(F) ⊆ REACH(C_entry), where REACH(C_entry) denotes the reachability set of node C_entry.

In other words, if an executable component reaches a potential implementation of a certain interaction feature, then the code for that component is expected to execute the specific program units that implement that feature at runtime. Note that this definition is language and platform dependent, as the resources needed to implement a particular architectural feature may vary across different development environments.

4.3. A sample application

To show how the structural reachability analysis technique may contribute to the overall architecture recovery process, the server-side and client-side socket creation library routines, socket() and bind(), and socket() and connect(), respectively, were used as the targets for reachability analysis in an AG derived from the ANIMADO source. The results are shown in figure 8.

It is clear from figure 8 that all five components identified in the ANIMADO source are somehow involved with socket creation at the implementation level. In particular, figure 8(a) reveals that the only component that reaches a server-side socket creation feature is the one of entry module inter. Figure 8(b) in turn reveals that a client-side socket creation feature is reached from all the other four components' entry modules, namely control, egesp, fred and collision. Therefore, the component of entry module inter is bound to be the system's single server component, while the others are all expected to play the role of clients. In short, these results show that ANIMADO was developed following a traditional single-server/multiple-client architectural style.


Figure 8. Reachability analysis of server-side (a) and client-side (b) socket creation routines in ANIMADO's activation graph. Activation units correspond to C functions (depicted as ellipses) and appear clustered inside their defining compilation modules (depicted as rectangles). Reaching entry modules are highlighted in gray.

5. Tool support

X-ray's support prototype is currently targeted towards distributed systems developed under C/Unix platforms. The choice of C/Unix is justified by its widespread use in the development of distributed applications, and by the fact that it offers a rich set of architectural interaction mechanisms. The prototype integrates several new and "off-the-shelf" tools, which provide automated support for a number of common architecture recovery activities:

1. Extraction of source code information.
2. Manipulation (e.g., filtering and querying) of the extracted information.
3. Abstraction from the extracted information.
4. Graphical visualisation.

For extraction of source code information, the prototype provides a preprocessor and a customised code analyser. The preprocessor extracts only file-inclusion information, while the code analyser extracts both structural (module dependencies, name references, etc.) and syntactic (AST) program representations. For storage and manipulation of the extracted information, the prototype relies on a Prolog database (or source base). To improve query efficiency, the source base is implemented as two separate databases: the structural source model database (SSMdb), for coarse-grained structural information, and the AST database (ASTdb), for AST nodes and their relationships. For abstraction from the source base, the prototype provides a set of source analysis tools. These include a reference resolver, used to resolve references to externally declared names, and three tools (namely a module


classifier, a reachability analyser, and a pattern matcher) that implement each of the three X-ray techniques. Finally, for graphical visualisation, the prototype uses a layout filter, an automatic layout tool, and standard display tools. The layout filter formats the results produced by the source analysis tools according to the language accepted by the layout tool, which in turn automatically generates layouts (in image formats such as GIF and Postscript) that can be visualised using appropriate display tools.

The prototype's organisation follows a layered style, as shown in figure 9. Tools of the extraction layer and most of the tools of the visualisation layer were reused off-the-shelf. In particular, the code analyser was generated using Prem Devanbu's GEN++, a code analysis framework for C++ (Devanbu and Eaves, 1994). The layout tool in turn is a wrapper for AT&T's graph drawing tool, DOT (Gansner et al., 1993). All tools of the source analysis layer, as well as the layout filter of the visualisation layer, were implemented in Prolog from scratch. Using Prolog proved well-suited to this task, as its facilities for declarative programming, rapid prototyping, and execution of recursive queries over a relational database model

Figure 9. The X-ray architecture recovery prototype.


have long been recognised as a valuable aid to the production of reverse engineering tools (Canfora et al., 1992).

6. A case study

This section reports on the method and results of a case study in which X-ray was used to help recover architectural runtime information from the source code for the Field distributed programming environment (Reiss, 1990).

6.1. Field overview

Field was developed at Brown University in the early nineties to support programming teaching and learning over a Unix-based local area network. It integrates a number of tools, including text and annotation editors, data structure visualisers, a customised graphical interface for existing program debuggers, and a cross reference facility. These tools communicate at runtime via a selective broadcast mechanism provided by Field's message server component. Each Field tool registers with the message server a set of string patterns that describe the messages it is interested in. Any tool can send a message to the message server, which rebroadcasts it to all the tools that have registered a pattern matching the message (Reiss, 1990).

In addition to the code for the above tools, the Field source also includes a number of software packages that provide library support for services such as string matching, window management, and network communication. This decomposition in terms of tools and packages is also reflected in the physical organisation of the source code. The code for each individual tool or package is placed in a separate subdirectory. Moreover, all source code elements defined or contained in a particular subdirectory share the same name prefix. For example, all files in the annotation editor subdirectory share the name prefix annot, all files in the debugger graphical interface subdirectory share the name prefix dbg, and so on.

In total, the source includes 533 files—416 compilation modules (.c) and 117 include files (.h)—comprising over 160,000 non-blank lines of code. To give an idea of the relative scale and complexity of Field, a complete installation of the system requires more than 1,500 compilation and file manipulation commands. These commands originate from the execution of more than 40 general and tool-specific configuration (i.e., Make) files, which are spread throughout the entire source directory hierarchy.

6.2. Method

The case study was carried out in five successive stages. First, the MDG, AST, and AG system models were built using the appropriate extraction and source analysis tools. Second, the module classifier was applied to identify component entry modules in the MDG, and to distinguish exclusive modules from shared modules. Third, the pattern matcher was used to search for a potential implementation of each component interaction mechanism

Table 1. C/Unix component interaction mechanisms considered in the case study.

– Process spawning (fork()/vfork()): creates a new child component with a process image virtually identical to that of its parent.
– Process overlay invocation (any variant of exec()): overlays the process image of the calling component with the process image for the component invoked.
– Shell-based invocation (system()): invokes a new component via a call to the standard Unix command interpreter.
– Pipe (pipe(), fork()/vfork()): creates a new child component attached to its parent via a pipe.
– Server-side socket (socket(), bind()): creates and binds a socket at the server side.
– Client-side socket (socket(), connect()/sendto()): creates and connects a socket at the client side.

represented as a program pattern in Section 3. A description of these mechanisms, along with their main C/Unix library routines, is given in Table 1. Fourth, the reachability analyser was used to help associate code fragments matched in shared modules with each individual component. Finally, to facilitate visualisation and dissemination of the case study's results, the recovered abstractions were represented graphically using a notation based on existing architecture description languages (or ADLs) (Medvidovic and Taylor, 2000).

To avoid cluttering the exposition while still providing an in-depth account of some of the main case study results, this paper focuses on a reduced yet representative subset of the Field source. The source code components selected for this subset are listed in Table 2. Also listed in the table are the components' common name prefixes, and their sizes in both number of modules and number of non-blank lines of code.5 To build confidence in the validity of the experiment, the case study results were also analysed with respect to other sources of evidence, such as code comments, the accompanying documentation, related results published elsewhere, and consultation with one of the system's original developers.

Table 2. Field source code components selected for presentation.

Component                   Name prefix   #Modules      #LoC
Control panel               field                5     2,890
Debugger interface          dbg                  3     2,031
Cross reference facility    xr                  12     7,703
Message facility            msg                  8     4,152
Communication package       comm                 1     1,172
Total                                           29    17,948

6.3. Results

6.3.1. Component module classification. The MDG derived from the selected Field subsystem is shown in figure 10. The results of applying the module classification tool to this MDG are shown in figure 11. By correlating the results shown in figure 11 with the logical module classification imposed by the source code organisation, it is possible to gain a number of interesting insights into how the concepts of "executable components" and "Field tools" are related in the implementation of this particular Field subsystem:

1. At the code level, the relationship between the concept of a tool (as defined in the documentation) and the set of implemented executable components is more subtle than the source code organisation may indicate. For example, the system debugger graphical interface was initially thought of as an independent executable component that would act as a "wrapper" for an existing debugger. However, the results in figure 11 show that no component entry module is found among the debugger interface modules (those with

Figure 10. Module dependency graph derived from the selected Field subsystem.


Figure 11. The Field subsystem’s modules after classification by the component module classification tool. Modules exclusive to the same component are clustered together with their dominant entry module (highlighted in gray). Modules not included in any cluster are shared by at least two components.

name prefix dbg). In fact, these modules were classified as being exclusive to the component of entry module fieldmain, which corresponds to the control panel. Some of the modules of the cross reference facility (those with name prefix xref) were also classified this way. After a manual examination of the code, it turned out that these two sets of modules implement two of the several windows that constitute the control panel’s graphical interface: files with name prefix dbg implement a window for interacting with the local debugger, while files with name prefix xref implement a window for interacting with the cross reference facility. Interestingly, the Field documentation describes these two window interfaces, for the debugger and for the cross reference facility, as two separate “tools”, each with its own executable program. However, a more careful examination of the executable programs generated for these tools revealed that they are in fact symbolic links to the control panel executable. A further examination of the control panel’s entry module revealed that the control panel executable may create different startup windows, depending on the shell command string used to invoke it. Hence, even though a Field user sees the control panel and the debugger and cross reference window interfaces as three separate tools, these are in fact implemented by the same executable program.


2. Some tool-specific module subsets turned out to contain entry and library modules for more than one executable program. For example, two program entry modules, xrdbmain and xrfsmain, were recognised among the modules for the cross reference facility. Similarly, three other program entry modules, msgsend, msglisten and msgserver, were recognised among the modules for the message facility. These results show that, contrary to what one might intuitively expect, the set of "tools" in Field does not have a one-to-one correspondence with the set of executable components implemented in its source code.

3. Based solely on the number of exclusive modules identified for each executable program, the Field components appear to vary significantly in terms of functional complexity. At one end of the spectrum, the three executable programs recognised as part of the message facility each need only a single exclusive module (i.e., their own entry module). At the other end, the control panel executable requires exactly 13 exclusive modules. The components of entry modules msgsend and msglisten, in particular, appear to implement no functionality other than, respectively, sending invocation arguments to the message server and receiving any message that the message server rebroadcasts. They seem to be available in Field exclusively for monitoring purposes.

4. Among all executable components identified in the source for this particular Field subsystem, only the cross reference database (of entry module xrdbmain) does not depend on a message facility or socket communication module. This may indicate that the cross reference database does not directly rely on any message-passing mechanism to communicate with other Field tools.

The finding that the Field source is not organised so as to reflect executable component implementation is of particular interest.
This implies that one should not reason with much confidence about Field's runtime components if component modules are identified strictly on the basis of physical file attributes such as name prefixes and directory location. To investigate this claim, the module classification results shown in figure 11 were correlated with another component-based model extracted from the Field source that has been reported elsewhere (Murphy and Notkin, 1996). In that work, Murphy and Notkin considered as the code for each Field tool all source files contained in a directory which had a /bin sub-directory available.6 They then used their lexical source model extraction tool to extract tool event names from the relevant set of event-registration and event-announcement calls contained in those files. By matching the event names that each tool registers and announces, Murphy and Notkin were able to form a static approximation of the implicit invocations that take place between Field tools at runtime.

A curious aspect of Murphy and Notkin's Field implicit-invocation model is that it contains several self-relationships. In other words, it appears from their model that some Field tools may implicitly invoke themselves at runtime (a fact that would be counterintuitive given the communication overhead of contacting the message server in order to perform an implicit invocation). This is the case, for example, for the control panel and the cross reference facility. Of course, as their model is only a static approximation of the actual implicit invocations that occur at runtime, it may be that the self-relationships are simply false positives, i.e., spurious relationships incorrectly included in the model due to the


limitations of the static source model extraction approach used. However, a correlation with the module classification results shown in figure 11 offers evidence for another hypothesis. Even though a careful examination of the source reveals that there really are false positives in Murphy and Notkin's model, several of the self-relationships they extracted are in fact potential implicit invocations between different executable programs whose code happens to be located under the same "tool" sub-directory. For example, the self-relationship involving the cross reference facility is in fact a potential implicit invocation between the control panel's cross reference graphical window interface (whose component-exclusive modules share the name prefix xref) and the cross reference server (whose component-exclusive modules share the name prefix xrfs). Since these two tools have their sets of exclusive modules located under the same sub-directory, in Murphy and Notkin's analysis their modules were treated as the exclusive modules of a single Field tool, i.e., the cross reference facility.

This finding is important in that it shows the perils of identifying executable component modules based solely on source code attributes, such as directory hierarchies and file names. In this regard, the benefits of identifying and classifying executable component modules based on extracted module dependencies, as advocated by X-ray, are evident.

6.3.2. Syntactic pattern matching.

Program patterns representing the implementation of each component interaction mechanism described in Table 1 were searched for in every module of the selected Field subsystem. The results are presented in Table 3; samples of the matched code fragments are shown in figures 12 and 13. The process spawning match in msgserver, shown in figure 12(a), means that the message server may spawn a subprocess at runtime in order to detach from its controlling terminal.
This characterises the message server as a typical Unix daemon process. The four process invocation matches in turn indicate that the proper execution of some of the tools in Field is likely to depend on the execution of other external components. In particular, the match in fieldmain (figure 12(c)) suggests that the control panel may invoke sh, a Unix shell command interpreter, and the match in xrdbdata (figure 12(e)) suggests that the cross reference database may invoke compress, a Unix file compression tool. The match in msgclient, shown in figure 12(b), could not be promptly associated with any particular Field tool, since msgclient is classified as a shared module. This match suggests that more than one Field tool may need to invoke the message server executable at runtime (note the use of the environment variable MSG_SERVER to build one of the arguments for the execv() call).

Table 3. Interaction feature patterns matched in the selected Field subsystem.

    Pattern              Containing module    Containing component
    Spawning             msgserver            message server
    Overlay invocation   msgclient            (shared)
    Overlay invocation   fieldmain            control panel
    Overlay invocation   msgutil              (shared)
    Shell invocation     xrdbdata             cross reference database
    Pipe                 xrfsrun              cross reference server
    Server-side socket   commcode             (shared)
    Client-side socket   commcode             (shared)

Figure 12. Code fragments matching several types of component interaction feature patterns in the selected Field subsystem. (a) shows a process spawning match in module msgserver; (b), (c) and (d) show process overlay invocation matches in modules msgclient, fieldmain and msgutil, respectively; (e) shows a shell invocation match in module xrdbdata; finally, (f) shows a pipe creation match in module xrfsrun.

The pipe creation match in xrfsrun (figure 12(f)) indicates that the cross reference server may first invoke and then establish a pipe-based connection to the cross reference database (note the use of the environment variable XRFSDB to build one of the arguments for the execl() call). In fact, a more careful examination of the matched code fragment revealed that the cross reference server may create not one but two pipes, and that it may use this pair of pipes to establish a stream-based bidirectional communication channel with the cross reference database. The first pipe would be used to send data from the cross reference server to the cross reference database, and the second pipe to send data from the cross reference database back to the cross reference server.

Figure 13. Two Field code fragments matching socket creation patterns in the shared module commcode. (a) matches socket creation at the server side, and (b) matches socket creation at the client side.

Finally, because commcode is also a shared module, the two socket creation matches found therein (figure 13) also could not be promptly associated with any particular Field tool. A
detailed examination of those two matches revealed that sockets of compatible type, domain and protocol may be created. In addition, the port number and host name parameters defined for a newly created socket at the server side may be subsequently written out to a special system lock-file. Values representing port number and host name pairs stored in this file may then be used to establish socket connections at the client side.

6.3.3. Structural reachability analysis.

The structural reachability analyser was used to help determine which Field tool may indirectly execute (or "reach") the code fragments matched in shared modules. This was done by analysing the C library routines included in the reachability set computed for each program entry point (i.e., each C main() function) identified in the selected Field subsystem. The results were as follows.

The library routines responsible for socket creation at the server side, socket() and bind(), are reachable only from the message server entry module, msgserver, as shown in figure 14. This finding was somewhat of a surprise since, according to the available documentation, Field includes several "server" components, the message server and the cross reference server being just two of them. The client-side socket creation library routines, socket() and connect(), in turn are reachable from the entry modules for the control panel (fieldmain), the cross reference server (xrfsmain), and the programs msglisten and msgsend, as shown in figure 15.

There are two main conclusions to be drawn from the above results:

1. The selected Field subsystem follows a traditional single-server/multiple-client architectural style, with the message server being the only component to play the role of a "real" server; the control panel, the cross reference server, and the msglisten and msgsend programs all play the role of clients.

2.
Because the cross reference database creates neither a client-side socket nor a server-side socket, it is neither a server nor a client within the subsystem's client/server architecture. In fact, the cross reference database's only communication peer is the cross reference server, through the double-pipe mechanism described in the previous section. This may suggest that the actual role of the cross reference server is to serve as a kind of "network-aware" interface to the cross reference database, thus enabling the other Field tools to use the message server's selective broadcast mechanism to request and receive cross reference information of interest.

Figure 14. Reachability analysis results for the server-side socket creation library routines, socket() and bind(), in the activation graph derived from the selected Field subsystem. The only reaching entry module is msgserver (highlighted in gray).

Figure 15. Reachability analysis results for the client-side socket creation library routines, socket() and connect(). The reaching entry modules are fieldmain, xrfsmain, msglisten and msgsend.

Concerning the two process overlay invocation library routines matched in the shared module msgclient, vfork() and execv(), which were verified to implement an invocation of the message server, the analysis showed that they are reachable from the same set of components identified as playing the role of clients. Therefore, all client components in the selected Field subsystem may invoke the message server at runtime. This suggests that it may not be necessary to start up the message server explicitly at the beginning of a Field session, as this may be part of the initialisation procedure for whichever client tool the user invokes first.

Finally, with regard to the two process overlay invocation library routines matched in the shared module msgutil, vfork() and execvp(), the analysis showed that they are reachable only from the entry module for the control panel, that is, fieldmain. A subsequent examination of fieldmain revealed that the command name argument passed to the execvp() call is built, at initialisation time, from a list of component names stored in an auxiliary data file. Moreover, among the components included as part of the selected Field subsystem, only the cross reference server has its executable name listed in that data file. Hence, the control panel is likely to invoke the cross reference server as part of its initialisation procedure.

6.3.4. ADL-based graphical representation.

A better way to convey most of the architectural information recovered with the help of X-ray is through a graphical notation similar to those provided by architecture description languages (ADLs). Figure 16 shows an ADL-based graphical representation of all runtime abstractions recognised in the selected Field subsystem. This representation uses the graphical notation provided by Darwin (Magee et al., 1995; Magee and Kramer, 1996), an ADL aimed specifically at describing distributed system architectures. In Darwin, a rectangular box represents a potential runtime component, with the borders of the box representing the component's externally visible interface. A circle in the interface of a component in turn represents an interaction feature (or service) that the component either requires or provides (requirements and provisions of services are represented as empty and filled circles, respectively). An edge connecting a service requirement of a component to a service provision of another component (or vice versa) represents a communication channel (or binding) that may be established between the two components at runtime. Finally, an arrow connecting a service requirement of a component directly to the interface of another component represents the fact that the former may dynamically create or invoke the latter. Because existing distributed applications may present a richer set of architectural abstractions than that currently supported by Darwin, new graphical elements had to be added to the notation so as to better represent all types of components and component interaction mechanisms that can be recovered with X-ray. The banner at the bottom of figure 16 explains the meaning of all graphical elements used in the description of the Field subsystem's recovered architecture.
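At the code level, the socket-based bindings in this representation reduce to the standard BSD sockets sequence matched in commcode: socket() and bind() (plus listen()) on the server side, and socket() and connect() on the client side, with an advertised host/port pair taking the place of Field's lock-file. A self-contained loopback sketch (illustrative, not Field's code):

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Server side binds and listens on an ephemeral loopback port; the client
 * side connects to the advertised address.  Returns 0 on a successful
 * round trip, -1 on any failure. */
int demo_connect(void)
{
    /* server side: socket() + bind() + listen() */
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                        /* let the kernel pick a port */
    if (bind(srv, (struct sockaddr *)&addr, sizeof addr) < 0) return -1;
    socklen_t len = sizeof addr;
    getsockname(srv, (struct sockaddr *)&addr, &len);  /* port to advertise */
    if (listen(srv, 1) < 0) return -1;

    /* client side: socket() + connect() using the advertised address */
    int cli = socket(AF_INET, SOCK_STREAM, 0);
    if (connect(cli, (struct sockaddr *)&addr, sizeof addr) < 0) return -1;

    int peer = accept(srv, NULL, NULL);
    char c = 'x', r = 0;
    write(cli, &c, 1);
    read(peer, &r, 1);
    close(cli); close(peer); close(srv);
    return r == 'x' ? 0 : -1;
}
```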


Figure 16. ADL-based graphical representation of the runtime abstractions recovered from the selected Field subsystem.

The description in figure 16 alone reveals a number of interesting aspects of the Field subsystem's implemented architecture. For example, it shows clearly that the control panel, the cross reference server, and the msgsend and msglisten programs may all invoke, and establish a socket-based connection to, the message server. Also evident are the relative isolation of the cross reference database with respect to client/server connections, and the restriction that the cross reference server and the cross reference database must both execute concurrently within the same communications domain (due to their need to interact via a pair of pipe-based connections).
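The pipe-pair channel mentioned above follows the standard Unix idiom of creating two pipes before a fork(), one per direction; a minimal request/reply sketch (illustrative; the real xrfsrun fragment would exec the database image via execl() rather than inlining the child's code):

```c
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Two pipes around a fork() give parent and child a bidirectional,
 * stream-based channel: to_child carries requests, to_parent replies.
 * Returns 0 if a "ping" request yields a "pong" reply. */
int demo_double_pipe(void)
{
    int to_child[2], to_parent[2];
    if (pipe(to_child) < 0 || pipe(to_parent) < 0) return -1;

    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {                          /* child: the "database" end */
        close(to_child[1]); close(to_parent[0]);
        char req[8] = {0};
        read(to_child[0], req, sizeof req - 1);       /* receive request */
        const char *rep = strcmp(req, "ping") == 0 ? "pong" : "err";
        write(to_parent[1], rep, strlen(rep));        /* send reply back */
        _exit(0);
    }
    /* parent: the "server" end */
    close(to_child[0]); close(to_parent[1]);
    write(to_child[1], "ping", 4);
    char rep[8] = {0};
    read(to_parent[0], rep, sizeof rep - 1);
    waitpid(pid, NULL, 0);
    return strcmp(rep, "pong") == 0 ? 0 : -1;
}
```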

6.4. Discussion

The results of the case study provide evidence of the utility and applicability of X-ray. However, the approach's intended scope and static nature limit its ability to support the recovery of more precise, semantically richer architectural information. Some of these limitations are discussed below.

Component functionality. Understanding the functional behaviour of a distributed component requires detailed knowledge of the component's implementation as well as of the environment in which the component will be executed. Clearly, the kinds of architectural information extracted with the help of X-ray are insufficient to explain the full set of functionalities that might be implemented by a particular distributed system component. For example, recognising that Field's message server component provides a selective broadcast mechanism would require a deeper understanding of its implementation as well as of the ways in which it interacts with the other tools. On the other hand, by explicitly identifying component-exclusive modules and interaction features, X-ray can help maintainers


in focusing their natural program understanding efforts on the source code locations most relevant to the maintenance task at hand.

Dynamic reconfiguration. Most distributed systems are designed to support dynamic reconfiguration. This means that their number of executing components, as well as the connections between those components, may vary at runtime. For example, Field can be said to be dynamically reconfigurable since it supports the execution and connection of a varying number of tools. X-ray is limited with respect to recognising dynamic reconfiguration information in that it reveals only a static approximation of the potential set of configurations that a distributed system may undergo at runtime. Nonetheless, given the undecidability of identifying all possible execution scenarios for a program, and the fact that dynamic analysis is always restricted to revealing only those possibilities exercised by the chosen execution scenarios, static approximations are often the most cost-effective way of reasoning about a system's runtime properties.

Interaction semantics. Syntactically identifying potential interaction channels between runtime components falls far short of revealing why and how those components may interact. For example, identifying that most client tools in Field may establish a socket-based connection to the message server is insufficient to reveal how those tools use the message server to implicitly invoke each other. Ultimately, determining the precise semantics of an interaction between two components requires a deep understanding of the functionalities and runtime behaviour implemented by each component. Hence, recovering this kind of architectural information is beyond the purely syntactic capabilities of X-ray.

6.5. Developer feedback

With the intent of better evaluating the strengths and limitations of X-ray, the results of the Field case study, along with a detailed description of the tools and method used to generate them, were made available to Field's chief developer. The developer then reported back with a number of insightful observations on various aspects of the results. The following is an "annotated" transcription of some of the developer's main comments.

Initially, the developer provides some positive feedback on the overall results of the case study:

Overall it is a pretty accurate assessment of a small part of Field.7

The developer then elaborates on several aspects that he thought needed clarification. On the functionality of the control panel, he adds:

The control panel is more than that, it also holds a small set of common routines that are used by all [Field] modules that have a user interface.

Indeed, identifying the exclusive compilation modules and potential interaction code fragments of a runtime component, in this case the control panel, is far from revealing the component's full functionality. As discussed in the previous subsection, the idea is to


use this information as a means of complementing a programmer's natural understanding capabilities.

On the fact that modules of multiple Field tools were classified as part of a single executable program, the developer explains:

Field was originally structured as individual binaries. This proved very costly on early Sun systems in terms of memory usage since there was no sharing of memory across the binaries (this was before shared libraries were available at the user level). Hence, to make things practical, we combined all the executables into one binary and the startup (what you call the control panel) actually determines the appropriate entry point for the module based on argv[0] from the command line.

This observation provides further evidence of the benefits of using the module classification technique to identify a distributed system's set of implemented executable components. Concerning the finding that there is only one server component implemented in the subsystem, the developer observes:

You might note that xrefserver (through xrfsmain.c) runs as a server as well when it is spawned. It makes itself into a daemon process through its call to COMMserver_process.

This is an important clarification, as a "server" for the developer means a daemon process running detached from any controlling terminal. From the point of view of client/server interactions, the cross reference server can still be seen as one of the clients of the message server. As regards the semantics behind the interactions between the cross reference server and the cross reference database, the developer reveals:

The xrefserver utility does more than forward requests to the database. It is responsible for keeping track of what databases are currently active, what systems they are active for, and what is currently being done with each. When a request comes in, it determines what database to use for that request, starts up the database if necessary, and then queues the request for that database. Requests are queued since the message-based requests can come in at any time (and can overlap) while the underlying database can only process one request at a time.

This is another example of the type of architectural information that requires detailed knowledge of the application domain to be recovered, and whose extraction therefore is not yet properly supported by X-ray. Finally, the developer clarifies that Field was implemented to support multiple instances of the message server:

The code (both the message server and all the clients) actually supports multiple message servers, both logically and physically. Logical message servers come from the notion of server groups, while physical servers depend on different host/port files. The information


needed to start and connect to a message server is actually stored in environment variables by Field so that it can be passed to the appropriate child processes.

As previously discussed, X-ray only supports recovery of a static approximation of a distributed system's potential set of runtime organisations. In the particular case of Field, however, the possibility of having multiple message servers is already hinted at by the group name prefix in a number of source code elements used by the message facility, such as in the module msggroup. Curiously, this prefix was overlooked during the case study because it was thought to refer to entities supporting group communication, instead of communication using groups of message servers, as turned out to be the case.

7. Related work

This section reviews several tools (from research prototypes to fully-fledged commercial products) that are representative of the wide spectrum of technologies currently available to support software comprehension and design recovery (Chikofsky and Cross, 1990; Gannod and Cheng, 1999; Bellay and Gall, 1997). For convenience, those tools are grouped into five broad categories: conformance checking tools, query and visualisation tools, clustering tools, domain-based architecture recovery tools, and dynamic architectural analysis tools. Tools in each category are analysed with respect to their ability to reveal implemented distributed system architectures and, where pertinent, how they compare or contrast with X-ray.

7.1. Conformance checking tools

Documented architectural models can be a useful resource when reasoning about the design of an existing software system. However, these models also carry the risk that they may not accurately represent the actual system implementation. Conformance checking tools help to reduce this risk by showing whether and how a given architectural model differs from the source code evidence. Tools in this category include RMTool (Murphy et al., 1995), Pattern-Lint (Sefika et al., 1996b), and the tool described in (Fiutem and Antoniol, 1998).

Basically, the user provides a conformance checking tool with a set of components and relationships that are part of an idealised architectural model, along with an explicit mapping of source code entities to the components of that model. The tool then extracts the actual relationships of the architectural model by reflecting the relationships between source code entities onto the idealised components. The similarities and differences between the idealised and extracted relationships highlight the discrepancies in the architectural model (Murphy et al., 1995).

Conformance checking tools contrast with X-ray and other design recovery tools in that they depend on an externally-provided architectural model whose elements must be explicitly mapped to elements of the source code. This makes it difficult or even impossible to apply a conformance checking tool in situations where little design information is available, as is the case with many legacy systems. Another limitation is that conformance checking is generally restricted to architectural relationships that are directly defined in the


source code (such as definition/use and subtype relationships between program elements) or that can easily be derived from it (such as include and compilation dependencies between source files). Consequently, it may be difficult to reason about architectural runtime components and interactions using current conformance checking technology.

7.2. Query and visualisation tools

Another way to support software architectural comprehension is to extract static source models of interest into a software repository, and then allow users to interactively abstract from those models using the repository's query and visualisation facilities. Tools that implement this scheme include CIA (Chen et al., 1990) and its more sophisticated descendants Ciao (Chen et al., 1995) and Acacia (Chen et al., 1998); Graphlog (Consens et al., 1992) and PQL (Jarzabek, 1995), which are based on a Prolog database and environment; and the Software Bookshelf (Finnigan et al., 1997). Tools that provide facilities for three-dimensional visualisation and navigation of program information, such as FileVis (Young and Munro, 1998) and Imagix4D (Imagix Corporation, 1999), can also be included in this category.

Query and visualisation tools may be helpful in revealing several aspects of a distributed system's implemented architecture. Those that support recursive queries over the transitive closure of definition/use relations (such as Acacia, Graphlog and PQL) may be particularly useful for associating individual source code elements with each of the system's implemented executable components. However, the static source models used by those tools are still limited to conveying design information from a single architectural perspective, namely the source code development organisation. Without considering external domain knowledge, the process of querying and visualising those models is not in itself sufficient to reveal aspects of runtime distribution and physical organisation.

7.3. Clustering tools

Automatic design recovery has been the aim of many reverse engineering tools. In general, those tools attempt to reconstruct a system's implemented design by automatically or semi-automatically clustering source code elements into a hierarchy of logical subsystems (Lakhotia, 1997). Clustering tools have been proposed to support a variety of design recovery tasks, for example, recovery of object-oriented concepts from procedural code (Yeh et al., 1995; Canfora et al., 1996; Girard et al., 1997), recovery of design patterns from object-oriented code (Krämer and Prechelt, 1996; Antoniol et al., 1998; Keller et al., 1999), and classification of several types of source code elements into higher-level abstractions such as subsystems and layers (Müller et al., 1993; Mancoridis et al., 1998; Kazman and Carrière, 1999).

Even though clustering tools have been successfully used to support many real-world design recovery activities, their recovered design descriptions are still limited to representing architectural relationships as explicit definition/use relationships between well-defined source code entities. In X-ray, clustering is used not as the final product of the architecture recovery process, but rather as a means through which to identify and classify executable


component modules. The resulting module classification is then complemented with, and provides a basis for, the application of more sophisticated static analysis techniques.

7.4. Domain-based architecture recovery tools

The architecture recovery tools most closely related to X-ray are ManSART (Harris et al., 1996) and ART (Fiutem et al., 1996). Both rely on the definition of a generic architectural model which captures important domain and language knowledge pertaining to the implementation of several architectural styles (Shaw, 1995) (client/server, pipe-and-filter, process-spawning, and so on). Each style in their architectural model is associated with a specific set of recognisers, which are programs written in an AST-based source query language provided by their underlying code management system. In general, a recogniser contains detailed syntactic knowledge of the stereotyped implementation of a specific design element. The recogniser then uses this knowledge to search for instances of the design element in the code. Recognisers thus provide bottom-up reverse engineering strategies to support top-down exploration of architectural hypotheses. In practice, this means that the recovery process is entirely driven by the user of the tool, who selects and applies the appropriate recognisers according to the expectations that she may have about the style implemented by the system in question. Some limited support for the recovery of style-based architectural views is also available in more recent industrial reverse engineering tools, such as NORTEL's inSight (Rajala et al., 1999).

The main similarity between X-ray and ManSART and ART is the use of predefined syntactic queries to recognise style-specific architectural features. Although X-ray does not define an architectural model explicitly, style elements are indirectly represented in the types of interaction features captured via program patterns (e.g., elements of the client/server and pipe-and-filter styles are indirectly represented in the socket and pipe program patterns, respectively).
The decision against the explicit classification of program patterns into separate styles reflects the fact that, as demonstrated in the Field case study, styles seldom occur in their pure forms. ManSART addresses this potential "hybridisation" of styles by providing mechanisms for the manipulation and combination of multiple style-specific architectural views (Yeh et al., 1997). In X-ray, hybrid styles are natively supported, as elements of different styles are represented using different graphical attributes in a single architectural view. On the other hand, ManSART supports view combination based on containment analysis among elements of different views. This allows representation of style elements across both the system execution and source code levels, while in X-ray these two levels are represented as separate views.

Another salient characteristic that distinguishes X-ray from ManSART and ART is that the latter two identify executable component modules based solely on information available in configuration files. As discussed in Section 2, configuration files may be inaccurate with respect to the actual system implementation. In particular, if a module, say M, that is no longer used by a component, say C, is incorrectly described by the configuration file as a build dependency of C, then any architectural feature implemented in M will be incorrectly recognised by ManSART and ART as a potential feature of C.

Finally, X-ray also differs from ManSART and ART in that it explicitly distinguishes between shared and component-exclusive compilation modules. Besides helping to better


understand which implemented functionalities are unique to each component and which are reused across components, this distinction further contributes to a more detailed mapping of style (runtime) elements to development (source code) elements. For instance, ManSART and ART both perform entry function reachability analysis indirectly when searching for the implemented architectural features of a component. However, when a shared feature is recognised and associated with the entry functions of multiple components, it is difficult, using those tools, to determine which set of functions involved in the implementation of that feature is actually unique to a component and which is shared. Detailed knowledge of the parts of a feature implementation that are shared by multiple components is important in that it helps to assess and perform changes to that feature more effectively.

7.5. Dynamic architectural analysis tools

An alternative way of recognising runtime features in a distributed system's source code is to analyse dynamic data gathered during actual system execution. Basically, this involves running an instrumented version of the system in order to generate an event trace, i.e., a log of all the system's dynamic activities. Analysis of event trace data provides the basis for a number of dynamic architectural analysis tools, such as those described in (Kunz and Black, 1995; Sefika et al., 1996a; Jerding and Rugaber, 1998). Those tools enable an engineer to accurately identify (and, in some cases, visualise) a system's complete set of runtime components and their evolving patterns of interaction. Despite their many benefits, dynamic analysis tools may pose serious challenges to the process of understanding an existing distributed system's architecture. First, there is the need to determine which parts of the system to instrument, a task that requires reasonable knowledge of the system's implementation model as well as of its underlying development platform. Second, for systems developed using traditional compiled languages, with no shared library facilities, instrumentation implies that some parts of the system will have to be rebuilt; this may be a problem if the system's original development environment and libraries are no longer fully available. Third, event traces are inevitably sensitive to the system's selected input data and execution scenarios; as a consequence, important runtime events may be missed during dynamic analysis if the parts of the system relevant to the task at hand are not properly exercised. Fourth, dynamic detection of a runtime event in a component does not in itself reveal whether this event is unique to that component.
Knowing which events are unique to a component and which may be shared by multiple components is important in that it helps to better understand how the implementation of some events may be reused across the code for different components. Finally, strict performance constraints may preclude dynamic analysis altogether. For example, a recent architecture recovery experiment involving a family of industrial distributed embedded systems revealed that events could not be traced without impeding proper system execution, due to performance violations (Bratthall and Runeson, 1999). In contrast to dynamic architectural analysis tools, X-ray does not require a system's source code to be in a “buildable” state (although it does require the code to be in a state amenable to parsing). Therefore, X-ray can be particularly attractive in situations where dynamic analysis techniques prove unfeasible or too costly. Alternatively, it can be

348

MENDONÇA AND KRAMER

used as a source of approximate information on the system's potential physical and runtime organisation before one goes through the burden of instrumenting the code, generating input data, and selecting the execution scenarios typically required by dynamic analysis tools.

8. Conclusion

The ability to recover architectural information from existing software artifacts may play an important role in bringing the promises of software architecture research into practice. Although architecture recovery benefits greatly from current reverse engineering techniques and tools, many issues remain to be properly addressed, especially regarding the recovery of runtime abstractions (clients, servers, communication protocols, etc.) that are typical of the design of distributed software systems. This paper presented a static reverse engineering approach, called X-ray, that can be helpful in recovering architectural runtime information from existing distributed software artifacts. The driving force behind the development of X-ray has been the assumption that much information on the potential runtime architecture of a distributed system is available across different aspects of its implementation. Hence, to recover at least part of this information, it is necessary to analyse the system source code from different perspectives and across multiple levels of abstraction. In X-ray, this is achieved through the use of three complementary, domain-based static analysis techniques, namely module classification, syntactic pattern matching, and structural reachability analysis. The module classification technique helps to identify how the system's executable components share, and are implemented in terms of, compilation modules. The syntactic pattern matching technique in turn helps to recognise the stereotyped implementation of typical component interaction features inside each compilation module. The structural reachability analysis technique then helps to assign interaction features recognised in shared modules to individual components.
Practical evidence of the utility of X-ray was provided both illustratively, through sample applications to the source code of a small yet non-trivial distributed computer animation tool, and in the form of a larger case study involving a representative subsystem of a publicly-available distributed programming environment. Moreover, positive feedback received from the programming environment's chief developer offered additional evidence of the validity of the approach, and helped to better evaluate its strengths and limitations. As regards future work, some interesting topics include improving the usability and scalability of the X-ray tool set, especially its Prolog-based tools; conducting a more systematic evaluation of the approach based on experiments involving industrial-scale distributed systems from varying application domains; and investigating how to extend the current capabilities of X-ray so as to facilitate future integration with architectural design environments (Garlan et al., 1994; Shaw et al., 1995; Ng et al., 1996; Medvidovic et al., 1999) and also architecture transformation tools (Krikhaar et al., 1999; Carrière et al., 1999). The use of an ADL-based notation to represent recovered architectural abstractions may be a first step in this direction.

Appendix A. Pattern definition predicates

This appendix presents the set of Prolog predicates available for the definition and matching of syntactic program patterns in X-ray. For convenience, the predicates are described grouped into two general categories: syntactic predicates, which match AST nodes of specific types and attributes; and structural predicates, which match nodes according to their relative position within the AST.

A.1. Syntactic predicates

A.1.1. Declarations and definitions

ast_objdec(NId,Name,Type,Scope,Loc) Succeeds if NId is an object-declaration node; Name, Type and Scope are respectively its name, type and scope slots; and Loc is the file-line location of the declaration in the source code.

ast_objdef(NId,Name,Type,Scope,Init,Loc) Succeeds if NId is an object-definition node; Name, Type, Scope and Init are respectively its name, type, scope and initialisation expression slots; and Loc is the file-line location of the definition in the source code.

ast_funcdec(NId,Name,Args,Type,RtnType,Loc) Succeeds if NId is a function-declaration node; Name, Args, Type and RtnType are respectively its name, argument list, type and return type slots; and Loc is the file-line location of the declaration in the source code.

ast_funcdef(NId,Name,Args,Type,RtnType,Block,Loc) Succeeds if NId is a function-definition node; Name, Args, Type, RtnType and Block are respectively its name, argument list, type, return type and definition block slots; and Loc is the file-line location of the definition in the source code.

A.1.2. Statements

ast_ifstm(NId,Cond,Tbranch,Fbranch,Loc) Succeeds if NId is an if-statement node; Cond, Tbranch and Fbranch are respectively its conditional expression, true-branch and false-branch slots; and Loc is the file-line location of the statement in the source code.

ast_for(NId,Init,Inc,Cond,Body,Loc) Succeeds if NId is a for-statement node; Init, Inc, Cond and Body are respectively its initialisation, increment, conditional expression and statement body slots; and Loc is the file-line location of the statement in the source code.

ast_while(NId,Cond,Body,Loc) Succeeds if NId is a while-statement node; Cond and Body are respectively its conditional expression and statement body slots; and Loc is the file-line location of the statement in the source code.

ast_do(NId,Cond,Body,End,Loc) Succeeds if NId is a do-statement node; Cond, Body and End are respectively its conditional expression, statement body and end-of-statement slots; and Loc is the file-line location of the statement in the source code.

ast_switch(NId,Cond,Body,Cases,Default,Loc) Succeeds if NId is a switch-statement node; Cond, Body, Cases and Default are respectively its conditional expression, statement body, case list and default case slots; and Loc is the file-line location of the statement in the source code.


A.1.3. Expressions

ast_assign(NId,LHS,RHS,Loc) Succeeds if NId is an assignment-expression node; LHS and RHS are respectively its lhs and rhs slots; and Loc is the file-line location of the expression in the source code.

ast_call(NId,CallName,ParList,Loc) Succeeds if NId is a function-call expression node; CallName and ParList are respectively its call name and parameter expression list slots; and Loc is the file-line location of the expression in the source code.

ast_notexp(NId,Exp,Loc) Succeeds if NId is a !-type unary expression node; Exp is its expression argument slot; and Loc is the file-line location of the expression in the source code.

ast_eqexp(NId,LHS,RHS,Loc) Succeeds if NId is a ==-type binary expression node; LHS and RHS are respectively its be_lhs and be_rhs slots; and Loc is the file-line location of the expression in the source code.

ast_noteqexp(NId,LHS,RHS,Loc) Succeeds if NId is a !=-type binary expression node; LHS and RHS are respectively its be_lhs and be_rhs slots; and Loc is the file-line location of the expression in the source code.

ast_lessexp(NId,LHS,RHS,Loc) Succeeds if NId is a <-type binary expression node; LHS and RHS are respectively its be_lhs and be_rhs slots; and Loc is the file-line location of the expression in the source code.

A.2. Structural predicates

ast_parent(A,B) Succeeds if node A is directly connected to node B via a child-type edge in the AST.

ast_ancestor(A,B) Succeeds if node A is directly or indirectly connected to node B via one or more child-type edges in the AST.

ast_same_or_ancestor(A,B) Succeeds if node A unifies with node B, or A is an ancestor of B in the AST.

ast_before(A,B) Succeeds if node A is encountered before node B in a depth-first visit of the AST.

ast_next(A,B) Succeeds if both node A and node B are children of the same parent, and A is encountered before B in a breadth-first visit of the AST.

ast_sameident(A,B) Succeeds if both node A and node B are ancestors of a node that is connected to the same object-identifier node via a link-type edge in the AST.


Acknowledgments

The authors are grateful to Gail C. Murphy, for providing detailed data on the implicit-invocation model of Field, and to Steven P. Reiss, for his help and collaboration on the Field case study. This work was partially supported by the National Council for Scientific and Technological Development of Brazil (CNPq) under grant No. 200603/94-9.

Notes

1. The acronym X-ray is a convenient reference to its medical counterpart, as architecture recovery can be likened to the process of exposing the software “skeleton” (Kramer, 1994). A description of an earlier version of the approach appeared in (Mendonça and Kramer, 1998). A more in-depth discussion of its underlying principles and tools can be found in (Mendonça, 1999).
2. ANIMADO will be used throughout the paper to illustrate the main concepts underlying each of the three X-ray techniques. It was selected for illustrative purposes due to its small yet non-trivial size (roughly 3,000 non-blank lines of C/Unix code), and the fact that one of its original developers was readily available.
3. In Prolog terminology, a term of the form Name/N denotes a predicate of name Name with N arguments.
4. Variable-swap is a programming strategy for exchanging (or “swapping”) the values of two program variables by using a third variable for temporary storage.
5. The number of modules listed for the cross reference facility does not include source files that depend on external parser-generator tools (such as lex and yacc) to be compiled.
6. Gail C. Murphy, personal e-mail communication.
7. Steven P. Reiss, personal e-mail communication.

References

Antoniol, G., Fiutem, R., and Cristoforetti, L. 1998. Design pattern recovery in object-oriented software. In Proceedings of the 6th Workshop on Program Comprehension, Ischia, Italy, IEEE CS Press, pp. 153–160.
Bass, L., Clements, P., and Kazman, R. 1998. Software Architecture in Practice. Addison-Wesley, Reading, Massachusetts.
Bellay, B. and Gall, H. 1997. A comparison of four reverse engineering tools. In Proceedings of the 4th Working Conference on Reverse Engineering, Amsterdam, Holland, IEEE CS Press.
Bratthall, L. and Runeson, P. 1999. Architecture design recovery of a family of embedded software systems. In Proceedings of the 1st IFIP Working Conference on Software Architecture, San Antonio, Texas, USA, pp. 3–14.
Burd, E.L. and Munro, M. 1999. Evaluating the use of dominance trees for C and COBOL. In Proceedings of the International Conference on Software Maintenance, Oxford, England, IEEE CS Press, pp. 401–410.
Canfora, G., Cimitile, A., and de Carlini, U. 1992. A logic-based approach to reverse engineering tools production. IEEE Transactions on Software Engineering, 18(12):1053–1064.
Canfora, G., Cimitile, A., and Munro, M. 1996. An improved algorithm for identifying objects in code. Software: Practice and Experience, 26(1):25–48.
Carrière, S.J., Woods, S., and Kazman, R. 1999. Software architectural transformation. In Proceedings of the 6th Working Conference on Reverse Engineering, IEEE CS Press.
Chen, Y., Nishimoto, M.Y., and Ramamoorthy, C.V. 1990. The C information abstraction system. IEEE Transactions on Software Engineering, SE-16(3):325–334.
Chen, Y.-F., Gansner, E.R., and Koutsofios, E. 1998. A C++ data model supporting reachability analysis and dead code detection. IEEE Transactions on Software Engineering, 24(9).
Chen, Y.-F.R., et al. 1995. Ciao: A graphical navigator for software and document repositories. In Proceedings of the International Conference on Software Maintenance, IEEE CS Press, pp. 66–75.


Chikofsky, E.J. and Cross, J.H., II. 1990. Reverse engineering and design recovery: A taxonomy. IEEE Software, 7(1):13–17.
Cimitile, A. and Visaggio, G. 1995. Software salvaging and the call dominance tree. Journal of Systems and Software, 28:117–127.
Clarke, D.G., Potter, J.M., and Noble, J. 1998. Ownership types for flexible alias protection. In Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications '98, Vancouver, B.C., Canada, ACM Press, pp. 48–64.
Consens, M., Mendelzon, A., and Ryman, A. 1992. Visualizing and querying software structures. In Proceedings of the 14th International Conference on Software Engineering, IEEE CS Press.
Devanbu, P. and Eaves, L. 1994. Gen++—An analyzer generator for C++ programs. Technical report, AT&T Bell Laboratories, Murray Hill, New Jersey.
Feldman, S.I. 1979. Make—A program for maintaining computer programs. Software: Practice and Experience, 9:255–265.
Finnigan, P., Holt, R.C., Kalas, I., Kerr, S., Kontogiannis, K., Müller, H., Mylopoulos, J., Perelgut, S., Stanley, M., and Wong, K. 1997. The software bookshelf. IBM Systems Journal, 36:564–593.
Fiutem, R. and Antoniol, G. 1998. Identifying design-code inconsistencies in object-oriented software. In Proceedings of the International Conference on Software Maintenance, IEEE CS Press, pp. 94–102.
Fiutem, R., Tonella, P., Antoniol, G., and Merlo, E. 1996. A cliché-based environment to support architectural reverse engineering. In Proceedings of the International Conference on Software Maintenance, IEEE CS Press, pp. 319–328.
Gall, H., Jazayeri, M., Klösch, R., Lugmayr, W., and Trausmuth, G. 1996. Architecture recovery in ARES. In Proceedings of the 2nd ACM SIGSOFT International Software Architecture Workshop, San Francisco, USA, ACM Press, pp. 111–115.
Gannod, G.C. and Cheng, B.H.C. 1999. A framework for classifying and comparing software reverse engineering and design recovery tools. In Proceedings of the 6th Working Conference on Reverse Engineering, IEEE CS Press.
Gansner, E.R., Koutsofios, E., North, S.C., and Vo, K.-P. 1993. A technique for drawing directed graphs. IEEE Transactions on Software Engineering, 19(3):214–230.
Garlan, D., Allen, R., and Ockerbloom, J. 1994. Exploiting style in architectural design environments. In Proceedings of the 2nd ACM SIGSOFT Symposium on the Foundations of Software Engineering, pp. 175–188.
Garlan, D. and Perry, D.E. 1995. Introduction to the special issue on software architecture. IEEE Transactions on Software Engineering, 21(4):269–274.
Girard, J.-F., Koschke, R., and Schied, G. 1997. Comparison of abstract data type and abstract state encapsulation detection techniques for architectural understanding. In Proceedings of the 4th Working Conference on Reverse Engineering, Amsterdam, Holland, IEEE CS Press.
Griswold, W.G., Atkinson, D.C., and McCurdy, C. 1996. Fast, flexible syntactic pattern matching and processing. In Proceedings of the 4th Workshop on Program Comprehension, IEEE CS Press, pp. 144–153.
Harandi, M.T. and Ning, J.Q. 1990. Knowledge-based program analysis. IEEE Software, 7(1):74–81.
Harris, D.R., Reubenstein, H.B., and Yeh, A.S. 1995. Reverse engineering to the architectural level. In Proceedings of the 17th International Conference on Software Engineering, IEEE CS Press, pp. 186–195.
Harris, D.R., Yeh, A.S., and Reubenstein, H.B. 1996. Extracting architectural features from source code. Automated Software Engineering Journal, 3(1/2):109–138.
Hecht, M.S. 1977. Flow Analysis of Computer Programs. Elsevier North-Holland, Inc.
Holtzblatt, L.J., Piazza, R.L., Reubenstein, H.B., Roberts, S.N., and Harris, D.R. 1997. Design recovery for distributed systems. IEEE Transactions on Software Engineering, 23(7):461–472.
Imagix Corporation. 1999. The Imagix 4D C/C++ reverse-engineering and documentation system. Online system description. Available at http://www.Imagix.com.
Jarzabek, S. 1995. PQL: A language for specifying abstract program views. In Proceedings of the 5th European Software Engineering Conference, pp. 324–342.
Jerding, D. and Rugaber, S. 1998. Extraction of architectural connections from event traces. In Proceedings of the ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering, ACM Press.
Kazman, R. and Carrière, S.J. 1999. Playing detective: Reconstructing software architecture from available evidence. Automated Software Engineering Journal, 6(2):107–138.


Keller, R.K., Schauer, R., Robitaille, S., and Pagé, P. 1999. Pattern-based reverse-engineering of design components. In Proceedings of the 21st International Conference on Software Engineering, Los Angeles, CA, USA, ACM Press, pp. 226–235.
Kozaczynski, W., Ning, J.Q., and Engberts, A. 1992. Program concept recognition and transformation. IEEE Transactions on Software Engineering, 18(12):1065–1075.
Krämer, C. and Prechelt, L. 1996. Design recovery by automated search for structural design patterns in object-oriented software. In Proceedings of the 3rd Working Conference on Reverse Engineering, IEEE CS Press, pp. 208–215.
Kramer, J. 1994. Exoskeletal software. In Proceedings of the 16th International Conference on Software Engineering, IEEE CS Press (invited panel presentation).
Krikhaar, R., Postma, A., Sellink, A., Stroucken, M., and Verhoef, C. 1999. A two-phase process for software architecture improvement. In Proceedings of the International Conference on Software Maintenance, Oxford, England, IEEE CS Press, pp. 371–380.
Kruchten, P.B. 1995. The 4 + 1 view model of architecture. IEEE Software, 12(6):42–50.
Kunz, T. and Black, J.P. 1995. Using automatic process clustering for design recovery and distributed debugging. IEEE Transactions on Software Engineering, 21(6):515–527.
Lakhotia, A. 1997. A unified framework for expressing software subsystem classification techniques. Journal of Systems and Software, 36(3):211–231.
Letovsky, S. and Soloway, E. 1986. Delocalized plans and program comprehension. IEEE Software, 3(3):41–49.
Magee, J., Dulay, N., Eisenbach, S., and Kramer, J. 1995. Specifying distributed software architectures. In Proceedings of the 5th European Software Engineering Conference, Vol. 989 of Lecture Notes in Computer Science, Springer-Verlag, pp. 137–153.
Magee, J. and Kramer, J. 1996. Dynamic structure in software architectures. In Proceedings of the 4th ACM SIGSOFT Symposium on the Foundations of Software Engineering, ACM Press, pp. 3–14.
Mancoridis, S., Mitchell, B.S., Rorres, C., Chen, Y., and Gansner, E.R. 1998. Using automatic clustering to produce high-level system organizations of source code. In Proceedings of the 6th Workshop on Program Comprehension, Ischia, Italy, IEEE CS Press, pp. 45–52.
Medvidovic, N., Rosenblum, D.S., and Taylor, R.N. 1999. A language and environment for architecture-based software development and evolution. In Proceedings of the 21st International Conference on Software Engineering, Los Angeles, CA, USA, ACM Press, pp. 44–53.
Medvidovic, N. and Taylor, R.N. 2000. A classification and comparison framework for software architecture description languages. IEEE Transactions on Software Engineering, 26(1).
Mendonça, N.C. 1999. Software architecture recovery for distributed systems. Ph.D. thesis, University of London, Imperial College of Science, Technology and Medicine.
Mendonça, N.C. and Kramer, J. 1996. Requirements for an effective architecture recovery framework. In Proceedings of the 2nd ACM SIGSOFT International Software Architecture Workshop, San Francisco, USA, ACM Press, pp. 101–105.
Mendonça, N.C. and Kramer, J. 1998. Developing an approach for the recovery of distributed software architectures. In Proceedings of the 6th Workshop on Program Comprehension, Ischia, Italy, IEEE CS Press, pp. 28–36.
Mendonça, N.C. and Kramer, J. 1999. Component module classification for distributed software understanding. In Proceedings of the International Conference on Software Maintenance, Oxford, England, IEEE CS Press, pp. 119–127.
Merlo, E., Girard, J.F., Hendren, L., and Mori, R.D. 1995. Multi-valued constant propagation analysis for user interface reengineering. International Journal of Software Engineering and Knowledge Engineering, 5(1):2–23.
Müller, H.A., Orgun, M.A., Tilley, S.R., and Uhl, J.S. 1993. A reverse-engineering approach to subsystem structure identification. Journal of Software Maintenance: Research and Practice, 5(4):181–204.
Murphy, G.C. and Notkin, D. 1996. Lightweight lexical source model extraction. ACM Transactions on Software Engineering and Methodology, 5(3):262–292.
Murphy, G.C., Notkin, D., and Sullivan, K. 1995. Software reflexion models: Bridging the gap between source and higher-level models. In Proceedings of the 3rd ACM SIGSOFT Symposium on the Foundations of Software Engineering, pp. 18–28.


Ng, K., Kramer, J., and Magee, J. 1996. A CASE tool for software architecture design. Automated Software Engineering Journal, 3(3/4):261–284.
Paul, S. and Prakash, A. 1994. A framework for source code search using program patterns. IEEE Transactions on Software Engineering, 20(6):463–475.
Rajala, N., Campara, D., and Mansurov, N. 1999. inSight—reverse engineering CASE tool. In Proceedings of the 21st International Conference on Software Engineering, Los Angeles, CA, USA, ACM Press, pp. 630–633.
Reiss, S.P. 1990. Connecting tools using message passing in the Field environment. IEEE Software, 7(4):57–66.
Rich, C. and Wills, L.M. 1990. Recognizing a program's design: A graph-parsing approach. IEEE Software, 7(1):82–89.
Rodrigues, M.A.F. 1993. ANIMADO: An animation system prototype using dynamics. Master's thesis, State University of Campinas, Brazil, Department of Computing and Automation, Faculty of Electrical Engineering. In Portuguese.
Rugaber, S., Stirewalt, K., and Wills, L.M. 1996. Understanding interleaved code. Automated Software Engineering Journal, 3(1/2):47–76.
Sefika, M., Sane, A., and Campbell, R.H. 1996a. Architecture-oriented visualisation. In Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, ACM Press, pp. 389–405.
Sefika, M., Sane, A., and Campbell, R.H. 1996b. Monitoring compliance of a software system with its high-level design models. In Proceedings of the 18th International Conference on Software Engineering, IEEE CS Press, pp. 387–396.
Shaw, M. 1995. Comparing architectural design styles. IEEE Software, 12(6):27–41.
Shaw, M., DeLine, R., Klein, D.V., Ross, T.L., Young, D.M., and Zelesnik, G. 1995. Abstractions for software architecture and tools to support them. IEEE Transactions on Software Engineering, 21(4):314–335.
Shaw, M. and Garlan, D. 1996. Software Architecture—Perspectives on an Emerging Discipline. Upper Saddle River, New Jersey: Prentice-Hall.
Soni, D., Nord, R.L., and Hofmeister, C. 1995. Software architecture in industrial applications. In Proceedings of the 17th International Conference on Software Engineering, ACM Press, pp. 196–207.
Sterling, L. and Shapiro, E. 1994. The Art of Prolog: Advanced Programming Techniques. 2nd edition, Cambridge, USA: The MIT Press.
Stevens, W.R. 1990. Unix Network Programming. Upper Saddle River, New Jersey: Prentice Hall.
Tonella, P., Fiutem, R., Antoniol, G., and Merlo, E. 1996. Augmenting pattern-based architectural recovery with flow analysis: Mosaic—a case study. In Proceedings of the 3rd Working Conference on Reverse Engineering, IEEE CS Press.
Tridgell, A. 1994. SAMBA: Unix talking with PCs. Linux Journal, Vol. 4. Available online at http://linuxjournal.com/cgi-bin/frames.pl/lj-issues/issue7/samba.html.
Waters, B. and Abowd, G. 1999. Architectural synthesis: Integrating multiple architectural perspectives. In Proceedings of the 6th Working Conference on Reverse Engineering, IEEE CS Press.
Woods, S. and Quilici, A. 1996. Some experiments toward understanding how program plan recognition algorithms scale. In Proceedings of the 3rd Working Conference on Reverse Engineering, IEEE CS Press, pp. 21–30.
Woods, S. and Yang, Q. 1996. Approaching the program understanding problem: Analysis and a heuristic solution. In Proceedings of the 18th International Conference on Software Engineering, IEEE CS Press.
Yeh, A.S., Harris, D.R., and Chase, M.P. 1997. Manipulating recovered software architecture views. In Proceedings of the 19th International Conference on Software Engineering, ACM Press, pp. 184–194.
Yeh, A.S., Harris, D.R., and Reubenstein, H.B. 1995. Recovering abstract data types and object instances from a conventional procedural language. In Proceedings of the 2nd Working Conference on Reverse Engineering, IEEE CS Press, pp. 227–236.
Young, P. and Munro, M. 1998. Visualising software in virtual reality. In Proceedings of the 6th Workshop on Program Comprehension, Ischia, Italy, IEEE CS Press, pp. 19–26.
Zhang, S., Ryder, B.G., and Landi, W. 1996. Program decomposition for pointer aliasing: A step toward practical analyses. In Proceedings of the 4th ACM SIGSOFT Symposium on the Foundations of Software Engineering, ACM Press, pp. 81–92.
