Searching Program Source Code with a Structured Text Retrieval System

Charles Clarke
Electrical and Computer Engineering, University of Toronto, M5S 3G4 Canada

Anthony Cox
Computer Science, University of Waterloo, N2L 3G1 Canada

Susan Sim
Computer Science, University of Toronto, M5S 3G4 Canada

Abstract

Software repositories are often based on object-oriented or relational databases, usually with extensions to accommodate the special requirements of software. Here, we discuss a software repository based on a structured text retrieval system, which avoids some of the limitations of previous approaches, including language dependence and poor scalability.

1 Introduction

A tool for source code searching can assist a variety of software engineering tasks. As an aid to program comprehension, a search tool might help a program maintainer discover sections of the code that may need to be examined and modified. Given an error message, a search can find the statement where the message was generated. Given the name of a variable, a search can determine where the variable is defined and used. Given two modules, a search can identify the functions that are defined in one and used in the other.

If a large repository of existing software is available, source code searching may expedite software development. A search for a function containing calls to both fork() and exec() can provide a developer with an example of how these functions are used in conjunction. A search may also locate an existing class or package using the terms appearing in the comments or documentation, facilitating software reuse.

We have applied the MultiText structured text retrieval system to the problem of storing and searching program source code. The MultiText system is designed to handle heterogeneous collections of structured text in a variety of formats, including HTML, SGML, and standard email and word processor formats. Distinguishing features of MultiText include its ability to model document structure independent of a fixed schema and to query on both the structure and content of documents [3]. In addition, MultiText supports a unique relevance ranking function [4] that can be used to rank arbitrary document components, such as paragraphs or pages, where the components to be ranked are specified by an arbitrary query.

We view program source code as just another form of structured text. By taking this approach, we have avoided many of the problems encountered by members of the software engineering community in their attempts to develop software repositories using relational and object-oriented databases. In addition, the approach allows user manuals and design documentation to be stored in the system along with the source code. Links between related portions of the code and documentation can be represented and queried.

2 Searching Source Code

Software repositories are typically used for tasks in which the ability to query on the structure (syntax) of software is both necessary and useful. For example, a typical query might be: Locate all definitions for the variable z. When using our approach it is necessary to identify and describe the structural content of the source code. To answer the query above, we must be able to distinguish variable names and definitions from other parts of the code. To illustrate our approach, we will provide several examples based on the simple C source code fragment below.

    int z = 0;
    int zero () { return (z); }

In MultiText, data is viewed as a sequence of tokens, with solutions to queries being sets of contiguous token sequences. Markup can be invisibly placed around portions of the data to indicate structural elements. For example, a start tag is placed just before the definition of the zero function and a matching end tag is placed just after its end. Markup can be referenced in queries, but is not returned as part of any solution. In program source code, we use markup to label the syntactic elements, which are identified using an auxiliary parsing tool. Figure 1 presents a marked-up version of our example code fragment, with subscripts indicating positions in the token sequence.

    int₁ z₂ = 0₃; int₄ zero₅ () { return₆ (z₇); }

Figure 1: C Source Code Fragment with Markup

GCL, the MultiText query language, is used to pose queries. GCL queries are formed by combining descriptions of the textual contents desired in the solution sequence with descriptions of the context in which the text occurs. For example, the following query returns program statements that contain both of the strings "return" and "z":

    (""..."") > ("return" and "z")
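The token positions shown in Figure 1 can be reproduced with a few lines of Python. This is only an illustrative sketch, not MultiText's tokenizer: it assumes that a token is a maximal run of letters, digits, or underscores and that punctuation is not indexed, which is consistent with the seven subscripts in the figure. The function name `tokenize` is hypothetical.

```python
import re

SOURCE = "int z = 0; int zero () { return (z); }"

def tokenize(text):
    """Return (position, token) pairs, numbering tokens from 1.

    Assumption: a token is a run of letters, digits, or underscores;
    punctuation and whitespace are skipped and receive no position.
    """
    words = re.findall(r"[A-Za-z0-9_]+", text)
    return list(enumerate(words, start=1))

tokens = tokenize(SOURCE)
for position, token in tokens:
    print(position, token)
```

Under this assumption the fragment yields seven tokens, with z at positions 2 and 7 and return at position 6.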

Basic queries in GCL are quoted strings, such as "z", which match sequences of tokens and markup. More complicated queries are formed by combining queries with GCL's binary operators. The ordering operator, ... is used to express ordering relationships between elements identified by its subqueries. Thus, the query ""..."" returns all source code statements. Containment relationships can be described using one of four containment operators: > (containing), /> (not containing), < (contained in), and /< (not contained in). Finally, the Boolean operators and and or are supported. MultiText also provides an indirection operator to allow queries to follow named references between two parts of the database. A typical example of a query using indirection would be:

    defined@(""..."")
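The ordering, containment, and Boolean operators can be illustrated with a small brute-force model in Python. This is a sketch under stated assumptions, not MultiText's implementation: every match is an extent (start, end) over token positions, each operator keeps only the shortest extents, and markup is modelled at half positions between tokens so that it never coincides with a token. The function names and the STMT_START and STMT_END lists are placeholders chosen for this example.

```python
def contains(outer, inner):
    """True if extent `outer` encloses extent `inner` (both are (start, end))."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def minimal(extents):
    """Drop every extent that encloses a different extent from the same set."""
    unique = set(extents)
    return sorted(e for e in unique
                  if not any(o != e and contains(e, o) for o in unique))

def followed_by(a, b):
    """A ... B: shortest extents that start with an A match and end with a later B match."""
    return minimal((s1, e2) for (s1, e1) in a for (s2, e2) in b if e1 < s2)

def both_of(a, b):
    """A and B: shortest extents that cover one match of A and one match of B."""
    return minimal((min(s1, s2), max(e1, e2)) for (s1, e1) in a for (s2, e2) in b)

def one_of(a, b):
    """A or B: matches of either operand."""
    return minimal(list(a) + list(b))

def containing(a, b):        # A > B
    return [x for x in a if any(contains(x, y) for y in b)]

def not_containing(a, b):    # A /> B
    return [x for x in a if not any(contains(x, y) for y in b)]

def contained_in(a, b):      # A < B
    return [x for x in a if any(contains(y, x) for y in b)]

def not_contained_in(a, b):  # A /< B
    return [x for x in a if not any(contains(y, x) for y in b)]

# Token positions from the example fragment: int=1 z=2 0=3 int=4 zero=5 return=6 z=7.
RETURN = [(6, 6)]
Z = [(2, 2), (7, 7)]
# Hypothetical statement markup, placed at half positions between tokens.
STMT_START = [(0.5, 0.5), (5.5, 5.5)]
STMT_END = [(3.5, 3.5), (7.5, 7.5)]

statements = followed_by(STMT_START, STMT_END)
result = containing(statements, both_of(RETURN, Z))
print(result)
```

With these inputs, `result` holds the single extent from 5.5 to 7.5, which encloses tokens 6 and 7, that is, the return statement. The other statement is rejected because it contains "z" but not "return".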

This query, which locates the definition of all variables referenced in the database, returns the substrings associated with occurrences of the markup item defined in a variable reference. Since the code fragment contains the item contained in the desired context, the sequence of tokens 1 through 3 in the database (i.e. int z = 0) is returned as a solution.
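One way to picture indirection is as a lookup from a named markup item to the extent it points at. The Python sketch below is illustrative only: the `Reference` record, the `indirect` function, and the sample data are assumptions made for this example, and nothing here describes how MultiText actually stores or resolves references.

```python
from typing import List, NamedTuple, Tuple

Extent = Tuple[int, int]

class Reference(NamedTuple):
    name: str       # name of the markup item, for example "defined"
    at: int         # token position the item is attached to
    target: Extent  # extent that the item points to

def indirect(name: str, context: List[Extent], refs: List[Reference]) -> List[Extent]:
    """name@(context): targets of references called `name` located inside a context extent."""
    targets = {r.target for r in refs
               if r.name == name and any(s <= r.at <= e for (s, e) in context)}
    return sorted(targets)

TOKENS = {1: "int", 2: "z", 3: "0", 4: "int", 5: "zero", 6: "return", 7: "z"}

# Hypothetical data: the use of z at token 7 refers to its definition, tokens 1-3.
refs = [Reference("defined", 7, (1, 3))]
variable_references = [(7, 7)]

result = indirect("defined", variable_references, refs)
for start, end in result:
    print([TOKENS[p] for p in range(start, end + 1)])
```

Here the only variable reference is the use of z at token 7, so the lookup yields the extent covering tokens 1 through 3, matching the solution described above.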

3 Discussion and Related Work

Current approaches to source code repositories, such as CIA [2], decompose the source code into component entities specified by a predefined model, resulting in a loss of context. By using MultiText, and treating the source as a contiguous sequence of tokens, we have avoided any loss of context.

Existing approaches have other limitations that we believe our system avoids: poor scalability, language dependence, and restrictions imposed by the use of a fixed schema. MultiText has been used effectively with 100GB text collections [4], and so we do not anticipate any problems with scalability. When considering a document as a series of tokens, interleaved with markup, the language from which the tokens were taken loses some of its importance. Source files from many languages can be combined in the one database and intermixed with other documents such as Unix man pages, texinfo files, etc., resulting in a repository that is language independent. Each document can have its own schema, since the markup is added by an auxiliary tool as part of the process of entering the data into the system. It does not matter that the markup schema may be different for each document, provided that there exists some method of identifying the particular schema used for each document.

The idea of using information retrieval tools as the basis of a software repository is not new. For example, SMART [1] has been used to identify repository components for reuse. However, the system was used to query a set of reusability-related attributes describing the source, and not the actual source itself. The use of markup to describe characteristics of source code was recently investigated in CHIME [5]. In CHIME, HTML hyperlinks are used to describe relationships between code elements, permitting existing browsing tools to be used to explore the code. The indirection supported by MultiText provides a slightly more general facility, allowing a variety of different relationships to be represented.

References

[1] Chang, Y. F., and Eastman, C. An information retrieval system for reusable software. Information Processing and Management 29, 5 (1993), 601-614.

[2] Chen, Y.-F., Nishimoto, M., and Ramamoorthy, C. V. The C information abstraction system. IEEE Transactions on Software Engineering 16, 3 (March 1990), 325-334.

[3] Clarke, C., Cormack, G., and Burkowski, F. An algebra for structured text search and a framework for its implementation. The Computer Journal 38, 1 (1995), 43-56.

[4] Cormack, G. V., Palmer, C. R., Van Biesbrouck, M., and Clarke, C. L. A. Deriving very short queries for high precision and recall. In Seventh Text REtrieval Conference (TREC-7) (Gaithersburg, Maryland, November 1998).

[5] Devanbu, P., Chen, Y.-F., Ganser, E., Muller, H., and Martin, J. CHIME: Customizable hyperlink insertion and maintenance engine for software engineering environments. In 21st International Conference on Software Engineering (Los Angeles, May 1999). To appear.
