Programs are Knowledge Bases

Daniel Ratiu and Florian Deissenboeck
Institut für Informatik, Technische Universität München
Boltzmannstr. 3, D-85748 Garching b. München, Germany
{ratiu|deissenb}@in.tum.de

Abstract

Gaining an overview of the concepts represented in large programs is very demanding as multiple dimensions of knowledge appear at different abstraction levels throughout the source code. To reduce the overall comprehension effort it is therefore desirable to make the knowledge once gained explicit and shareable. We tackle this problem by establishing a mapping between source code and conceptualizations shared as ontologies. To achieve this we regard programs themselves as knowledge bases built on the programs' identifiers and their relations implied by the programming language. Making these mappings explicit allows sharing knowledge about the concepts represented in programs. We present our approach together with a methodology, exemplify it on the Java programming language and the WordNet ontology, and report on our experience with two open source case studies.

1. Understanding Large Programs

Program comprehension accounts for about 50% of the expenses spent during software maintenance [12], which itself consumes the bulk of the lifecycle costs of successful, long-lived software systems [21]. Most programmers would agree that understanding the high-level coherences of a program (as opposed to understanding localized code fragments like a single method) is one of the most challenging tasks in program comprehension. This can best be exemplified by a (fictional) experiment: compare understanding an unknown program without any prior information to understanding the program after one of its developers gave you a 10-minute introduction. Certainly the latter task is considerably easier than the former. This raises the question about the nature of the knowledge conveyed to you in the short introduction. As previous research on program comprehension suggests, this is mainly knowledge about the basic concepts

represented in the program and their interrelations. In this context, concepts do not necessarily have to be concepts of the application domain, e.g. an account number. They can just as well be technical concepts like a stack or a sorting algorithm, or a part thereof. Unfortunately, the conceptual information contained in the source code is often of implicit nature, as these concepts are scattered over various locations and different abstraction levels within the source code. Thus the tasks of locating known concepts in the program and of extracting concepts from it are very demanding. Up to now, neither specialized comprehension tools nor prose documentation (with the well-known problem of code and documentation drifting apart) really satisfies developers' demands. To circumvent this problem, most development organizations (consciously or unconsciously) use an informal knowledge management method which is usually built on a small group of gurus who know everything about the system someone might want to know. Although highly efficient in small organizations, this method becomes suboptimal in larger organizations (with large software systems) and is doomed in situations with high personnel turnover. This informal knowledge management method has the additional drawback of not being repeatable: comprehension tasks performed by one developer (with guru support) do not directly aid other developers' comprehension tasks. The guru's key to mastering his job is his broad knowledge about the concepts of the program and their representation in code, together with his capability of reasoning as an intelligent human being. Even if it is impossible to completely replace the guru by a machine, we propose a method to support non-guru developers (and gurus) by making explicit part of the knowledge a guru has. We achieve this by mapping the source code to ontologies that represent shared conceptualizations of specific domains.
These mappings enable a conceptual decomposition of the program along the ontology, and can be used by developers to locate concepts in the source code.

Contribution

We define an abstract view over programs which is based on identifiers and the relations between their corresponding program elements (Section 4). Using this representation, we describe a new approach for extracting concepts from code based on mapping the hierarchies between program identifiers to ontologies (Section 5). As a result, we explicitly link the sources with the semantics contained in ontologies. We present an incremental methodology for eliminating noise and improving the results (Section 6). We instantiate the approach using, on the one hand, the relations between the names of program elements induced by the type and module system of Java and, on the other hand, relations from the WordNet lexical hierarchy as a reference ontology (Section 7). We report on our experiences gained on two open-source case studies (Section 8).

2. Related Work

Knowledge Required for Understanding Programs There is highly valuable work on program comprehension processes [24] and the knowledge required for understanding programs [5, 8, 23]. Although the details of program comprehension and the knowledge involved are not fully understood, it is generally accepted that program comprehension requires multidimensional knowledge. Example knowledge dimensions are knowledge about programming languages, algorithms, libraries, and design patterns (computer science knowledge), knowledge about business functionality and business processes (domain knowledge), and general knowledge. Typically a program's structure does not follow a conceptual modularization. Hence, the knowledge required for understanding programs is scattered over different abstraction levels and various locations within the program.

Identifiers To a large extent this knowledge is encoded in programs only in the names used for their entities. Evidence for the strong influence of these names (identifiers) on comprehension processes is provided by a strategy commonly applied by code obfuscators: in order to protect the code, they substitute the identifiers by meaningless character sequences (identifier scrambling). This transformation alone is sufficient to make comprehension a cumbersome to impossible task. [3, 1, 14] highlight the importance of proper identifier naming and define naming rules. [6] advances on this by providing a formal naming model and treating program identifiers as controlled vocabularies.

Concept Assignment One of the most important comprehension activities is the assignment of human-oriented concepts to their implementation in the source code [2]. This task is usually referred to as concept location (or concept extraction, in the other direction). Due to the paramount importance of this task, a lot of work has been devoted to this topic. While there are investigations of complex concept location strategies like the dynamic search method [25] or control and data flow analysis [4], most work agrees that identifier-based concept location strategies are the most intuitive [22] if the identifiers are chosen properly.

Latent Semantic Indexing Latent Semantic Indexing (LSI), a standard technique in information retrieval from large bases of text documents, has been used to locate concepts in source code [16, 17, 18]. Recently, [15] advanced on this by using LSI to analyze object-oriented systems on different structural levels. However, due to the very general nature of LSI, using it for concept location is similar to using web search engines to find information on the web: to be effective, the software engineer still needs to make an educated guess. When LSI is used for concept extraction, it requires manual interpretation of the generated representation. Thus, the observations made by one developer do not aid other developers' comprehension activities.

Knowledge-based Approaches The LASSIE system [7] uses a knowledge base for intelligently indexing reusable components. The approach makes a distinction between the domain model and the code model. Although the code model is populated automatically, the domain model and its relation to the code model must be maintained manually. Such a system proved to support comprehension tasks, but the overhead of manually synchronizing the models reduced the overall benefit.

3. Knowledge Sharing through Ontologies

To support sharing and reuse of knowledge of a particular domain, one needs to explicitly represent it in a formal manner. The first step in formally representing a body of knowledge is to decide on a conceptualization of the domain. A conceptualization is an abstract, simplified view of an area which is to be described for a specified purpose. It contains the set of objects and concepts together with their properties and interrelationships [9]. An ontology is defined to be an explicit specification of a conceptualization [10] and is used for sharing the knowledge about a domain by making explicit the concepts and relations within it. The term "specification" implies that this conceptualization is defined in a rigorous manner. Figure 1 shows an example of an ontology that describes some of the members of a family and their relations. The classes have a set of properties and constraints defined over them. Through arrows we represent strict subclass relations between classes.

[Figure 1: classes Person, Parent, Mother, Woman, Daughter and Child, and the Gender enumeration with the instances Male and Female; Person has the properties hasSex : Gender, hasParent : Parent, hasMother : Woman, hasChild : Person; the constraints include hasChild >= 1, hasSex = Female and hasParent >= 1; arrows denote strict subclass relations. Legend: Class, Instance, attribute : Type, Constraint.]

Figure 1. Part of the Family Ontology (from http://protege.stanford.edu/plugins/owl/owl-library/)

There is a wide spectrum along which ontologies can be seen from the point of view of their specification detail [19]. At the lowest level of detail are controlled vocabularies, which are nothing else than lists of terms. The next level of specification are glossaries, which are expressed as lists of terms with associated meanings presented as glossary entries in natural language. Thesauri provide additional relations between their terms (e.g., synonymy) without assuming any explicit hierarchy between them. Many scientists prefer to have some hierarchy included before a specification can be considered an ontology. The most important hierarchical relation in ontologies is the "is-a" relation. At the next stage are strict subclass hierarchies, which allow the exploitation of inheritance (i.e., the transitive application of the "is-a" relation). More expressive specifications include classes attributed with properties which, when specified at a general level, can be inherited by the subclasses. A more detailed specification level implies value restrictions for properties (the ontology presented in Figure 1 is at this level of specification). The most expressive ontologies allow the specification of arbitrary logical constraints between their members.

In the present work we use an informal meaning of the term "ontology", which we regard to comprise only concepts and relations between them. We do not require any consistency checking, restrictions on properties or logical inference to be defined, or that the terms obey that inference. In order to represent an ontology we use a graph language similar to RDF graphs [13]. Entities within the ontology are the nodes of the graph. If there is a relation in the ontology between two entities, then in the graph there is an edge between the nodes that denote the entities, annotated with the name of the relation in the ontology.

4. Knowledge Representation in Programs

In order to act like a guru within a software project, one needs both the knowledge about the programming language(s), paradigm(s), algorithms, libraries, program design and domain, and the knowledge of how all of this is reflected in the code. We imagine a mental experiment of obfuscating the names of program entities and presenting the resulting code to the guru. Even if the guru's knowledge about what is represented in the code remains, when faced with the obfuscated code he cannot identify where this knowledge is represented in the code. Of course, he could still identify all the structural characteristics of the program. With some effort he could even find that a certain arithmetic expression might cause a division-by-zero error or that some part of the program performs a sorting. However, this represents only an extensional, mathematical view of the program, without revealing the purpose of these parts of the code. Questions like "what is this method doing?" (extensional) might be answered by the guru using mathematical terms (e.g., operational semantics), but questions like "why is it doing this / what is the purpose of this code in the larger context of the program / what is the link with the modeled domain?" (intensional), which initially would represent no challenge for the guru, are now almost impossible to answer. We thus start from the assumption that an important part of the knowledge that a guru needs is manifested in the names of identifiers.

From a formal point of view, names are simple labels attached to program elements. Among these program elements there is a rich set of well-defined and automatically checkable relations, which are transmitted to their names. While there is no enforceable consistency between the names and the external world (e.g., there are no means to check whether a particular class that is named "Child" really models a child), there are many ways in which an internal consistency check is enforced (e.g., if "Child" is a subclass of "Person" then this relation holds in the whole program). From an informal point of view, the names are at the center of program understanding by humans, as they introduce a higher-level, not formally checked, conceptual layer in a program. Let's take a simple code example in which all variables are integers: position = initialPosition + distance. This example would be considered acceptable both by the type checker and, from a conceptual point of view, by a human. If we modify the example a bit and write position = initialPosition + temperature, we assume that everybody would consider this piece of code to be at least awkward if not wrong, even if it conforms to the programming language semantics. The additional semantics in this case is given by the names, which are at a higher conceptual level than the level of checkable relations (e.g., types).
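The two statements discussed here can be checked directly; a minimal sketch (the class name and the concrete values are ours, for illustration only):

```java
// Both assignments below type-check: the compiler sees only ints.
// Only the identifier names reveal that the second one is conceptually
// awkward -- names form a semantic layer the type system cannot check.
public class NamesVsTypes {
    public static void main(String[] args) {
        int initialPosition = 10;
        int distance = 5;
        int temperature = 21;

        int position = initialPosition + distance;    // conceptually fine
        int confused = initialPosition + temperature; // accepted, but meaningless

        System.out.println(position + " " + confused); // prints "15 31"
    }
}
```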

The programming language lies at the basis of building the program vocabulary. We start from Steele's metaphor [11] that a programming language can be seen as a core set of words (e.g., keywords and the core library) and rules (e.g., module system, type system) that allow other words to be defined. Through every declaration, programmers define new words by making use of the already existing ones. For example, in the code snippet in Figure 5 we defined the word Parent in terms of the word Person. Once a word is defined, it enters the vocabulary of the program and can subsequently be used (e.g., for defining the word mother). The relation between a defined word and the ones used in its definition is (many times) similar to the relation between the concepts represented by these words in the real world.

We regard programs as knowledge bases where the knowledge representation language contains (a subset of) the programming language used. The knowledge itself is expressed in this language through the names of the identifiers. The identifier names are used to lexicalize concepts, which can contain information belonging to different dimensions, and to attach them to program entities. Our view on programs focuses on concepts that are lexicalized within the code. The concepts that are not named, even if they exist in the code due to the operational manner in which programs are described, are not considered by our approach.

5. From Sources to Ontological Entities

Understanding a program implies recovering the concepts that are present in the sources. As we presented in Section 3, ontologies are used for sharing conceptualizations. We start from the assumption that the concepts that we need to identify are represented by entities within an ontology. Thus, we identify concepts within a program by creating mappings between parts of it and parts of an ontology (i.e., by identifying ontological entities within a program). As illustrated in Figure 2, we use ontology mapping techniques for recovering information from programs. The input is twofold: on the one hand the names of program elements and their relations, and on the other hand the reference ontologies. The output is given by ontological entities which can be related to program elements. Ontologies are represented in different languages and formats. In order to overcome the syntactic heterogeneity between the representation of ontologies and programs, we choose to represent both the sources and the ontologies as labeled multigraphs (see Section 3).

[Figure 2: the source code, with the program hierarchy between names represented as a graph (relations RsPO, RsMP, RsPN, RsMN, RsNO), is matched against domain, design and library knowledge shared as ontologies, also represented as graphs (relations RoAC, RoCB, RoCE, RoBE, RoDE); identifying concepts amounts to graph matching, e.g. l_match(M, A) ∧ l_match(N, C) ∧ r_match(RsMN, RoAC).]

Figure 2. Proposed Approach

We restrict our approach for identifying the concepts within the code to simply mapping parts of the multigraphs that represent the program to parts of the multigraphs that represent ontologies. This is in fact the classical problem of finding a sub-graph homomorphism between multigraphs. Our mapping function has two components:

• linguistic matching function (l_match) - takes two graphs and returns the set of pairs which contain nodes whose names are compatible. The compatibility relation between two names is checked with respect to a specified criterion. In Section 7.3 we define an instance of the lexical compatibility function (l_compatible).

• relations matching (r_match) - intuitively, r_match takes as input two pairs of names, each consisting of an identifier name and an entity name from an ontology, and checks if there is a compatible path between them in the program graph and the ontology graph. In Section 7.3 we define an instance of the path compatibility function (r_compatible).

The l_match used alone can identify only matchings between strings, not between the concepts they stand for. Due to polysemy, a certain word can represent several concepts. Furthermore, the context in which a word is used can substantially change its meaning. Our approach proposes to solve these problems by taking into consideration not only words in isolation but also how these words are interrelated. By making explicit the mapping between the code and the ontology, we obtain a new representation of the program centered on the identified ontological entities. This representation contains a higher level of knowledge than the code alone, namely the relations of the identified ontological entities with other concepts from the ontology.
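The two matching components can be sketched as follows, under strong simplifying assumptions: the graphs are flattened to maps from a relation label to its set of "source->target" edges, and case-insensitive equality stands in for the real lexical criterion of Section 7.3. All names in the sketch are ours, not the paper's.

```java
import java.util.*;

// Minimal sketch of the two matching components of the mapping function.
public class GraphMatch {

    // l_match reduced to a single node pair: are the two names compatible?
    // (Case-insensitive equality is a stand-in for the real criterion.)
    static boolean lMatch(String identifierName, String ontologyName) {
        return identifierName.equalsIgnoreCase(ontologyName);
    }

    // r_match reduced to single edges: is there a program edge between the
    // matched identifiers AND a compatible ontology edge between the
    // corresponding ontological entities?
    static boolean rMatch(Map<String, Set<String>> programEdges, String programRelation,
                          String src, String tgt,
                          Map<String, Set<String>> ontologyEdges, String ontologyRelation,
                          String oSrc, String oTgt) {
        return programEdges.getOrDefault(programRelation, Set.of()).contains(src + "->" + tgt)
            && ontologyEdges.getOrDefault(ontologyRelation, Set.of()).contains(oSrc + "->" + oTgt);
    }

    public static void main(String[] args) {
        Map<String, Set<String>> program = Map.of("subTypeOf", Set.of("Parent->Person"));
        Map<String, Set<String>> ontology = Map.of("hypernym", Set.of("parent->person"));

        boolean conceptFound = lMatch("Parent", "parent")
            && lMatch("Person", "person")
            && rMatch(program, "subTypeOf", "Parent", "Person",
                      ontology, "hypernym", "parent", "person");
        System.out.println(conceptFound); // prints "true"
    }
}
```

The example pairs the subTypeOf relation with hypernymy, anticipating the compatibility choice made in Section 7.3.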

6. Noise and Incomplete Knowledge Bases

Our approach can be compared to a new programmer who joins the team and wants to understand the software system by browsing through the sources in a speculative manner. Taking a trial-and-error approach, he looks at relations between names and compares this information with his previous knowledge. If the comparison is successful, he considers some concepts identified in the code until they are invalidated by other code details; as new details are found, already identified concepts can be re-validated or invalidated. By browsing the code extensively, and based on the identifier names, the context in which they appear and the relations defined between them, but without following the control or data flows, he tries to deduce what the program is about. Even if at first he uses only his general knowledge (e.g., programming, computer science) to speculatively grasp some meaning from the sources, he will certainly understand from the start a part of the concepts dealt with in the sources. However, the parts of the sources that are specific to the implementation or to the modeled domain, about which our subject has no previous knowledge, will remain unknown to him until he makes use of additional sources of information (e.g., asks the guru) or starts to investigate the operational details of the program.

A fully automatic concept extraction would require the knowledge bases to contain enough information to cover all the aspects defined in the code, expressed in a similar manner, and nothing else. An additional problem is that programs are written in many coding styles and the knowledge used within them varies greatly. Thus, our approach would not be realistic without human intervention in an iterative manner. Following the above analogy, the initial knowledge that a programmer possesses is represented by the initial knowledge bases that are available before the first iteration of our approach. The false hypotheses that the programmer makes based only on speculatively looking at names are similar to the false positives generated by the mapping functions. The additional knowledge that the programmer needs (from the guru) in order to increase the accuracy of his understanding corresponds to the knowledge added to our knowledge bases in the subsequent iterations.

In Figure 3 we present a single iteration of our approach. The available dimensions of knowledge are shared as several ontologies encapsulated in a set of knowledge bases. The chain of extractors is responsible for matching the identifier names and their relations to particular ontologies. An extractor, which is composed of l_match and r_match functions, contains the knowledge about specific mappings between the code and an ontology. The inputs of an extractor are, on the one hand, the graphs to be matched and, on the other hand, the noise that needs to be eliminated (given in our figure through noise managers). Each extractor should produce not only new concepts but should also allow a human to check the results by providing the reasons that generated them. Within an iteration, the chain of extractors extracts the concepts from the code by considering the available ontologies and noise information.

[Figure 3: the source code model is processed by a chain of extractors (Extractor 1 ... Extractor N), each with an associated noise manager (Noise Manager 1 ... Noise Manager N); a feedback loop reduces noise and adds new knowledge to the knowledge bases (Knowledge Base 1 ... Knowledge Base M), which hold the dimensions of knowledge expressed as ontologies.]

Figure 3. An iterative approach

The end of an iteration is similar to a short discussion between the programmer and the guru in which the programmer validates or invalidates his results and obtains new knowledge which will help him perform the next steps. At the end of each iteration, the user should validate the partial results automatically obtained by the tool. This validation is done in two directions:

• elimination of false positives - the false positives are entities that belong to an ontology and were mapped to the code but do not actually occur in the code. This noise originates from the fact that we use a partial mapping between the ontology and the source code graphs. In order to eliminate this problem, the user should explicitly instruct the system that certain entities were wrongly identified so that the system ignores them in subsequent iterations.

• elimination of false negatives - the false negatives are concepts that appear in the code but could not be identified by the tool due to the lack of knowledge in the knowledge bases. No matter how many knowledge bases we have, there will always be project-specific knowledge that is not contained in them. In this case, the user should add new knowledge to the available knowledge bases.

The most important variation point between a fully automatic and a fully manual approach is the number of iterations that one needs to perform in order to obtain satisfying results. Our experiments show that even after a small number of iterations one can eliminate the false positives and obtain a satisfactory coverage of the code (see Section 8), even if only one knowledge base is used.
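One iteration of this feedback loop can be sketched as follows. The sketch is ours and deliberately naive: real extractors match graphs, not bare name sets, and the interface and method names below are hypothetical.

```java
import java.util.*;

// Sketch of one iteration: an extractor proposes concepts, and names the
// user has previously flagged as false positives (the "noise") are removed.
public class IterativeExtraction {

    interface Extractor {
        Set<String> extract(Set<String> codeNames, Set<String> ontologyNames);
    }

    static Set<String> iteration(Set<String> codeNames, Set<String> ontologyNames,
                                 Set<String> noise, Extractor extractor) {
        Set<String> concepts = new HashSet<>(extractor.extract(codeNames, ontologyNames));
        concepts.removeAll(noise); // drop entities the user marked as wrongly identified
        return concepts;
    }

    public static void main(String[] args) {
        // Naive extractor: a concept is any identifier name found in the ontology.
        Extractor byName = (code, ontology) -> {
            Set<String> hits = new HashSet<>(code);
            hits.retainAll(ontology);
            return hits;
        };
        Set<String> code = Set.of("parent", "mother", "pitcher");
        Set<String> ontology = Set.of("parent", "mother", "pitcher", "family");
        // After a first pass the user decided "pitcher" (a baseball sense) is noise.
        Set<String> concepts = iteration(code, ontology, Set.of("pitcher"), byName);
        System.out.println(concepts.containsAll(Set.of("parent", "mother"))
                           && !concepts.contains("pitcher")); // prints "true"
    }
}
```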

7. Instantiating Our Approach

We present an instance of our generic approach for extracting ontological entities from programs, choosing the WordNet [20] dictionary as the knowledge base and a particular set of program relations from Java. To exemplify our approach, we use a small example containing basic concepts of family members and their relations. We present how these concepts are expressed in WordNet, how they can be expressed in Java code, and how they can be recovered by mapping the code hierarchy to WordNet.

7.1. WordNet: A Knowledge-base Example

[Figure 4: the WordNet entry for the noun "parent" (a father or mother; one who begets or one who gives...), with its direct hyponyms (subclasses) adoptive parent / adopter, empty nester, father / male parent / begetter, filicide, mother / female parent, and stepparent; its direct member holonym (contained by) family / family unit; and its direct hypernym (superclass) genitor (a natural father or mother).]

Figure 4. Example of WordNet Entry

WordNet is an online dictionary of English inspired by psycholinguistic theories of human lexical memory. Instead of organizing words in terms of their form, as the majority of other dictionaries do, WordNet organizes words by their meanings, in sets of synonyms (synsets). WordNet 2.0 contains over 150,000 words grouped in over 115,000 sets of synonyms, of which more than 70% are nouns. Due to polysemy, every word can express several lexicalized concepts, and due to synonymy, every lexicalized concept can be represented through several words. The synsets are organized hierarchically along the hyponymy/hypernymy (i.e., "is a kind of") relation. Every word definition consists of its immediate hypernym followed by distinguishing features. In the case of nouns, the distinguishing features that are explicitly encoded in WordNet are the meronyms (i.e., "part of"). Meronyms are features that can be inherited by hyponyms. There is a big difference between the number of hypernym and meronym relations in WordNet: there is a hypernymy relation for every word, but far fewer words also have an associated meronym. Figure 4 contains as an example the entry of the noun "parent" in the WordNet dictionary and shows how the relations of this synset with other entries are expressed. A graph-based representation of this entry is given in the left part of Figure 5.

7.2. Relations between Java Program Elements

In order to map Java program entities to the WordNet ontology, we first need to identify program relations that are similar to the relations within WordNet. We classify the language-defined relations between program elements according to two criteria: relations generated by the module system and relations generated by the type system.

Module system induced relations The module system enables a structural decomposition of programs. It is also a means to model structural relations from the modeled domain. Thus, relations between modules and their constituents are good candidates to be similar to meronymy relations from WordNet. We consider that the most important modules in Java are packages and classes, and they determine the following relations:

• memberOfPackage - this relation holds between a package and all the classes or interfaces it contains.

• memberOfClass - this relation holds between a class and all the attributes it contains.

Type system induced relations The type system represents a static approximation of the behavior of programs at run-time. Due to the fact that in Java the type system is name-based, a well-typed program enforces a certain level of consistency between names (i.e., a certain name in a certain naming context can be used only for certain actions). With the help of typing rules, a wide variety of relations between program elements is enforced. We enumerate below the most important relations:

• subTypeOf - the relation between two types, used as the main mechanism in object-oriented languages to model "is a kind of" relations between real-world entities. This relation holds between a class and all its superclasses and interfaces.

• hasType - holds between a variable and its declared type. Every variable in a program is explicitly declared to have a certain type, and the variable can be used only in contexts where its type can be used.

• assignedTo - holds between a named member of an expression on the right side of an assignment and the assigned variable.

• boundWith - holds between a member of an expression in the place of an actual parameter and the formal parameter. Whenever an actual parameter is bound to a formal parameter, we have a dependency similar to an assignment in which the left side is the formal parameter and the right side is the expression for the actual parameter.
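These relations can be observed in a minimal snippet in the spirit of the paper's family example (the original Figure 5 code is not reproduced in this extraction; only the words Parent, Person and mother are taken from the text, the remaining declarations are our assumptions):

```java
// The declarations below induce, among others, the relations
//   subTypeOf(Parent, Person), subTypeOf(Woman, Person),
//   hasType(mother, Woman), memberOfClass(mother, Person).
class Person {
    Woman mother;               // hasType: mother -> Woman
}

class Woman extends Person { }  // subTypeOf: Woman -> Person

class Parent extends Person {   // subTypeOf: Parent -> Person
    Person child;               // hasType: child -> Person
}

public class FamilyDemo {
    public static void main(String[] args) {
        Person p = new Parent(); // subTypeOf lets a Parent stand in for a Person
        p.mother = new Woman();
        System.out.println(p instanceof Parent); // prints "true"
    }
}
```

Note how each relation mirrors a WordNet relation: subTypeOf parallels hypernymy, while memberOfClass parallels meronymy.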

dotted lines represent the links that were identified between program identifiers and concepts within the ontology. For clarity’s sake not all links are shown. Through these links we make explicit not only the concepts but also the location in the code where they appeared. This is similar to our guru who has both the knowledge about the concepts and how they are represented in the code.

Based on these relations we represent Java programs as multigraphs. The nodes of the graph are the names of the identifiers and the edges are the above defined relations between their corresponding program elements. In the right part of Figure 5 is illustrated the graph representation of a small code example that represents relations within a family.

7.3. Mapping WordNet to Java Programs

In the previous subsections we discussed the relations within the WordNet knowledge base and the similar relations in Java programs generated by the module and type systems. Based on the current configuration (WordNet and Java), we define instances of the functions l compatible and r compatible that ensure the lexical and relational compatibility required by the matching functions.

An instance of l compatible: The function l compatible takes two strings: the name of a WordNet word and a program identifier. The identifier is split into words according to the Java naming conventions for compound identifiers, and l compatible returns true if the WordNet word is morphologically equivalent to one of the words obtained by splitting the identifier. For example, l compatible("paper", "currentPapers") = true because "paper" is morphologically equivalent to "papers" and "papers" ∈ {"current", "papers"}. l compatible("house", "child") is false because these words are not morphologically equivalent.

An instance of r compatible: The function r compatible takes two paths, one from the WordNet ontology graph and the other from the graph representation of the program. To assess the compatibility of these two paths, we consider the hypernymy relations to be compatible with the type-checker-enforced relations and the meronymy relations to be compatible with the module-system-enforced relations. Thus, a path in the WordNet graph is a sequence of "hypernym" or "meronym" labels, and a path in the program graph is a sequence of the relations defined in the previous section. We consider two paths to be compatible if the relations along them are pairwise compatible. Figure 5 presents a hands-on example of our approach: the left side shows the words representing members of a family and their relations as given in WordNet; accordingly, the right side shows the identifiers that appear in our code fragment and their relations.

8. Case Studies

We present the experimental results of our approach, obtained by mapping the concepts present in the WordNet ontology to the sources of two open-source Java programs. As tool support we used the Insider2 and Bridge3 tools. We present the steps we used to deal with false positives. In these experiments we used only one knowledge base (WordNet) and thus did not tackle the problem of false negatives (which would require adding a new knowledge base).

• Step 1: extract ontological entities from the sources by mapping the code to the available ontologies. This step requires no human intervention. In the tables presenting the results we refer to the fully automatically extracted entities as "Raw Results".

• Step 2: bring to a normal morphological form the program words or abbreviations that could not be identified in WordNet. For example, instruct the program that the word "cloneable" is a morphological derivative of "clone" and that "millis" is an abbreviation of "millisecond". This step is semi-automatic: although stemming algorithms can resolve some morphological variations of words, manual inspection is still needed for the others and for the abbreviations.

• Step 3: inspect the concepts identified from WordNet and their relations and instruct the noise managers about the false positives. This step is completely manual and consists of classifying the WordNet entities identified in the program into three categories: those we verified to refer to the analyzed program, those that obviously do not refer to the program (e.g., specific knowledge about baseball), and those about which we still do not have enough information to decide (due to the semantic heterogeneity between WordNet and arbitrary programs, they might be similar but not exactly identical).

2 Insider is a reverse engineering platform developed at the LOOSE Research Group, Technical University of Timişoara, Romania (www.loose.utt.ro)
3 Bridge is a tool for the extraction of concepts from code developed at Technische Universität München, Germany
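The identifier splitting and lexical matching performed by l compatible can be sketched as follows; the method names and the crude plural-stripping stemmer are our own stand-ins, whereas the actual tool relies on WordNet's morphological analysis:

```java
import java.util.*;

// A hedged sketch of the l_compatible check: split a Java camelCase
// identifier into its constituent words and test whether a WordNet word
// is morphologically equivalent to one of them.
public class LexicalMatch {
    // Split "currentPapers" into ["current", "Papers"] at case boundaries.
    public static List<String> splitCamelCase(String identifier) {
        return Arrays.asList(identifier.split("(?<=[a-z0-9])(?=[A-Z])"));
    }

    // Crude stand-in for morphological equivalence: equality after
    // lowercasing and stripping a trailing plural "s".
    static String stem(String word) {
        String w = word.toLowerCase();
        return w.endsWith("s") ? w.substring(0, w.length() - 1) : w;
    }

    public static boolean lCompatible(String wordNetWord, String identifier) {
        for (String part : splitCamelCase(identifier))
            if (stem(part).equals(stem(wordNetWord))) return true;
        return false;
    }

    public static void main(String[] args) {
        System.out.println(lCompatible("paper", "currentPapers")); // true
        System.out.println(lCompatible("house", "child"));         // false
    }
}
```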


[Figure 5. An Example of Concepts Identification. The left side ("WordNet World") shows family concepts (Person, Parent, Mother, Child, Offspring) connected by hyponym and meronym relations. The right side ("Program World") shows the identifiers of the code fragment below (classes Parent and Child, field mother, parameter aMother, local variables mama and currentChild, package edu.tum.cs.family) connected by the relations subclassOf, hasType, memberOfClass, memberOfPackage, assignedTo, and boundWith. The legend distinguishes WordNet entities, local variables, classes, parameters, fields, and packages, as well as hypernymy-like and meronymy-like relations. The code fragment reads:

    package edu.tum.cs.family;
    class Parent extends Person { … }
    class Child extends Person {
        Parent mother;
        Child(Parent aMother) { mother = aMother; }
    }
    …
    Parent mama = new Parent();
    Child currentChild = new Child(mama);
    …
    offspring = currentChild; ]

• Step 4: extract ontological entities considering the resolved noise. We refer to these entities in the following as Refined(i), where "i" is the number of times Step 3 has been applied. If false positives remain, go back to Step 3; otherwise stop.
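The path check performed by r compatible can be sketched as follows; the exact sets of type-like and module-like program relations, and all method names, are our assumptions based on the grouping described in Section 7.3:

```java
import java.util.*;

// A hedged sketch of the r_compatible check: a WordNet path (a sequence
// of "hypernym"/"meronym" labels) is compatible with a program path when
// the relations at each position are pairwise compatible.
public class RelationalMatch {
    // Assumed grouping: hypernymy matches type-checker-enforced relations,
    // meronymy matches module-system-enforced relations.
    static final Set<String> TYPE_LIKE =
        Set.of("subclassOf", "hasType", "assignedTo", "boundWith");
    static final Set<String> MODULE_LIKE =
        Set.of("memberOfClass", "memberOfPackage");

    static boolean compatible(String wordNetRel, String programRel) {
        if (wordNetRel.equals("hypernym")) return TYPE_LIKE.contains(programRel);
        if (wordNetRel.equals("meronym"))  return MODULE_LIKE.contains(programRel);
        return false;
    }

    public static boolean rCompatible(List<String> wordNetPath,
                                      List<String> programPath) {
        if (wordNetPath.size() != programPath.size()) return false;
        for (int i = 0; i < wordNetPath.size(); i++)
            if (!compatible(wordNetPath.get(i), programPath.get(i)))
                return false;
        return true;
    }

    public static void main(String[] args) {
        System.out.println(rCompatible(
            List.of("hypernym", "meronym"),
            List.of("subclassOf", "memberOfPackage"))); // true
    }
}
```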

Case-Studies Overview: JFreeChart4 is an open-source Java library that supports the drawing of charts; Compiere5 is a fully integrated business solution for small-to-medium companies. Table (a) in Figure 6 gives a quantitative overview of these systems. For lack of space, we exemplify the results in detail mostly on JFreeChart; for Compiere we restrict ourselves to the quantitative results.

(a) Overview:
                          JFreeChart   Compiere
    Classes                      738       1178
    Identifiers                 7220      16627
    Words                       1570       3125
    kLOC                         211        315

(b) Knowledge and its Impact:
                          JFreeChart   Compiere
    Ontological Entities         172        190
    Affected identifiers        1503       5343
    Relations in Code            825        514
    Low level Concepts            65         48

(c) Relations in Code:
                          JFreeChart   Compiere
    memberOfPackage                0          5
    memberOfClass                  1         37
    subclassOf                    70         60
    hasType                      282         68
    assignedTo                   436        319
    boundWith                     36         25

(d) Iteratively Refined Results:
                          Raw   Refined(1)   Refined(2)
    JFreeChart            344          229          172
    Compiere              406          311          190

Figure 6. Case Study Results

Discussion of the Results: In the introduction we claimed that our approach can recover part of the knowledge present in programs and make it explicit. We discuss the results obtained with our tool, presented in Tables (b), (c), and (d) of Figure 6.

Results Examples: In Figure 7 we illustrate some examples of ontological entities extracted from JFreeChart. The figure presents the concepts' names and the relations between them: the "is a" relation is drawn as a plain line between two concepts, the "part of" relation as a line with an attached label. Java code fragments illustrate how the identified entities and their relations appear in the code. As the figure shows, several words denote different concepts in the code. One example is the word "line", which is used to denote both a shape and a piece of text; another is the word "day", which is used to refer to both a period of time and a point in time.

The number of ontological entities identified in the code represents the knowledge the tool extracted from the system: 172 concepts for JFreeChart and 190 for Compiere. Since we cannot know how many concepts the example systems actually contain, we use a number of simple metrics. To evaluate coverage, we counted the words identified as WordNet entities and related them to the total number of words. In both systems, ignoring words that appear only once, about 10% of the non-compound words could be identified. Interestingly, these 10% of non-compound words appear in over 30% of the programs' identifiers, which indicates that these words are central to the analyzed systems.
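The coverage figures above can be sketched on toy data as follows; the word frequencies, set contents, and method names are hypothetical and serve only to show the computation:

```java
import java.util.*;

// A sketch of the word-coverage metric: the fraction of program words
// (ignoring words that appear only once) that were identified as
// WordNet entities.
public class Coverage {
    public static double wordCoverage(Map<String, Integer> wordFrequencies,
                                      Set<String> identifiedWords) {
        long candidates = wordFrequencies.entrySet().stream()
            .filter(e -> e.getValue() > 1)   // drop words appearing only once
            .count();
        long found = wordFrequencies.entrySet().stream()
            .filter(e -> e.getValue() > 1)
            .filter(e -> identifiedWords.contains(e.getKey()))
            .count();
        return candidates == 0 ? 0.0 : (double) found / candidates;
    }

    public static void main(String[] args) {
        // Hypothetical frequencies: "foo" occurs once and is ignored.
        Map<String, Integer> freq =
            Map.of("chart", 12, "axis", 9, "foo", 1, "render", 4);
        System.out.println(wordCoverage(freq, Set.of("chart", "axis")));
    }
}
```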

4 http://www.jfree.org/index.php 5 http://www.compiere.org


The "Relations in Code" row in Table (b) shows the exact number of places in the code where these concepts were identified. Table (c) presents a detailed view of the individual types of relations that appear in the code. The high number of relations at the level of assignments and parameter bindings is worth emphasizing, as these capture conceptual coherences in operational contexts. An example is: position = initialPosition + distance. The last row of Table (b) shows the number of concepts identified at low levels of abstraction, i.e., whose names are not part of any class or package name. These concepts are the hardest to identify manually because they are hidden deep in the code and thus require exhaustive search. They are usually not at the core of the modeled domain but rather represent marginal properties (e.g., the colors green and orange). As we discussed in Section 6, the difference between an automatic approach and a manual one is determined by the number of iterations needed to eliminate false positives. Table (d) presents the results obtained after three iterations, starting with the number of ontological entities extracted fully automatically (Raw). After the first run, we only eliminated concepts that obviously did not belong to our case studies (e.g., chemical elements) but did not inspect the code itself. This simple measure reduced the number of false positives by about 50% (Refined(1)). In the last step (Refined(2)) we manually inspected the relations identified in the code and thereby eliminated the remaining false positives.

9. Conclusions and Future Work

We presented a novel approach for extracting concepts from source code by mapping the sources onto conceptualizations shared as ontologies. The core of our approach is that programs can themselves be regarded as knowledge bases that carry information in the names of program entities (identifiers), which are related through programming-language-specific relations. By establishing a mapping from a part of the program to an ontology, we identify the part of the conceptualization that is expressed in that program fragment. The concepts we extract are made explicit (part of a commonly accepted conceptualization) and shareable (through the conceptualization itself) and thus help to reduce further comprehension effort. We have defined a methodology that provides incremental means for understanding the code. We instantiated our general approach on Java programs and WordNet and reported on our experience from analyzing two case studies.

9.1. Variation points of our approach

The number and size of the available knowledge bases: The results of our approach strongly depend on the quantity and relevance of the knowledge contained in the available knowledge bases. Our view that source code contains multiple dimensions of knowledge requires a relevant knowledge base for every dimension. The relevance of a knowledge base can be characterized along the following directions:

• coverage - a knowledge base might cover portions of the world other than those expressed in the analyzed program. In our instantiation, WordNet covers only a part of general human knowledge and is not concerned with any domain-specific vocabulary. For example, the concept "zoom"6, although used in JFreeChart, is not contained in WordNet, so there is no way for our tool to identify it.

• granularity - a knowledge base might provide a more or less detailed description of the same entities that are represented in the program. In WordNet, common concepts are represented at a lower granularity than highly specific ones. Furthermore, the lexicalized concepts are represented following a differential theory of lexical semantics, whose emphasis is on distinguishing between the meanings of words rather than on constructing the concepts they represent (the constructive theory of lexical semantics). Thus the granularity of WordNet is not ideal for our purposes.

• perspective - the knowledge base can offer a different perspective on the modeled domain than the one implemented in the program (e.g., inverting right and left when characterizing directions).

The quality of identifier names: Programmers use a wide variety of names and naming conventions in their programs. The naming scheme is not automatically enforced at the semantic level. It is common to find morphological variations of a name denoting the same concept, acronyms, or even words in different languages. Furthermore, by composing names, programmers lexicalize new concepts that cannot be present in the available knowledge bases. Such variations raise no problem for a human who tries to understand the program but pose severe problems for an automatic tool.

6 zoom - "to focus a camera or microscope on an object using a zoom lens so that the object's apparent distance from the observer changes - often used with in or out" (Merriam-Webster Online)
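The normalization of morphological variants and abbreviations discussed above (Step 2 of our methodology) can be sketched as a hand-maintained dictionary consulted before WordNet lookup; the class name and API are our illustration, and the two entries mirror the paper's own examples:

```java
import java.util.*;

// A sketch of semi-automatic identifier-word normalization: known
// abbreviations and morphological derivatives are mapped to a base
// form; unknown words pass through unchanged.
public class Normalizer {
    private final Map<String, String> dictionary = new HashMap<>();

    // Record that 'variant' should be normalized to 'baseForm'.
    public void learn(String variant, String baseForm) {
        dictionary.put(variant.toLowerCase(), baseForm);
    }

    // Return the base form if known, otherwise the word itself.
    public String normalize(String word) {
        return dictionary.getOrDefault(word.toLowerCase(), word);
    }

    public static void main(String[] args) {
        Normalizer n = new Normalizer();
        n.learn("cloneable", "clone");    // morphological derivative
        n.learn("millis", "millisecond"); // abbreviation
        System.out.println(n.normalize("millis")); // millisecond
    }
}
```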


[Figure 7. Examples of Identified Concepts. The figure shows concept clusters extracted from JFreeChart (and one example from Compiere): time concepts (Period, Time, Year, Week, Month, Quarter, Day, Minute, Instant, Today, Tomorrow, Yesterday, and the month names Jan, Feb, Nov, Dec); colors (White, Green, Magenta, Grey, Black, Orange, Blue, Yellow, Cyan, Pink); events (Event, Progress, Action, Change, Motion); locations (Point, Center, Source, End, Hotspot, Position, Left, Right, Bottom, Axis, Location); text elements (Text, Title, Subtitle, Legend, Line); shapes (Shape, Arc, Line, Circle, Polygon, Rectangle, Box); names (Name, Filename); and a Table with hasPart relations to Row and Column. Code excerpts anchoring these concepts include:

    Minute.java 73: public class Minute extends RegularTimePeriod…
    TimeSeriesCollectionTest.java 165: Day today;
    MTable.java (Compiere) 85: private int m_rowCount;
                           130: private int m_indexKeyColumn;
    StandardXYItemRenderer.java 530: this.legendLine = line;
    ServletUtilities.java 346: String filename = name; ]

9.2. Future Work

We believe that what we presented here is only the first step in the extraction of a conceptual decomposition from programs. To advance on this, we envision two main directions for our future work. The first direction focuses on enhancing the current approach by using more knowledge bases and more sophisticated mapping techniques. The second direction aims at using the conceptual decomposition to support maintenance activities. Two items that we are currently working on are documentation generation and an application of our approach to detect polysemy and synonymy in the identifier naming of large programs.

References

[1] N. Anquetil and T. Lethbridge. Assessing the relevance of identifier names in a legacy software system. In CASCON '98, page 4. IBM Press, 1998.
[2] T. J. Biggerstaff, B. G. Mitbander, and D. Webster. The concept assignment problem in program understanding. In ICSE '93. IEEE CS Press, 1993.
[3] B. Carter. On choosing identifiers. SIGPLAN Not., 17(5):54–59, 1982.
[4] K. Chen and V. Rajlich. Case study of feature location using dependence graph. In IWPC '00, page 241. IEEE CS, 2000.
[5] R. Clayton, S. Rugaber, and L. Wills. On the knowledge required to understand a program. In WCRE '98, page 69, Washington, DC, USA, 1998. IEEE Computer Society.
[6] F. Deissenboeck and M. Pizka. Concise and consistent naming. In IWPC '05, pages 97–106, Washington, DC, USA, 2005. IEEE Computer Society.
[7] P. T. Devanbu, R. J. Brachman, P. G. Selfridge, and B. W. Ballard. LaSSIE: a knowledge-based software information system. In ICSE '90, pages 249–261, Los Alamitos, CA, USA, 1990. IEEE Computer Society Press.
[8] M. B. Dias, N. Anquetil, and K. de Oliveira. Organizing the knowledge used in software maintenance. Journal of Universal Computer Science, 9(7):641–658, 2003.
[9] M. R. Genesereth and N. J. Nilsson. Logical Foundations of Artificial Intelligence. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1987.
[10] T. R. Gruber. Toward principles for the design of ontologies used for knowledge sharing. Int. J. Hum.-Comput. Stud., 43(5-6):907–928, 1995.
[11] G. L. Steele Jr. Growing a language. In OOPSLA '98 Addendum, New York, NY, USA, 1998. ACM Press.
[12] C. S. Hartzman and C. F. Austin. Maintenance productivity: Observations based on an experience in a large system environment. In CASCON '93. IBM Press, 1993.
[13] P. Hayes. RDF semantics. Technical report, W3C Recommendation, 2004.
[14] D. Keller. A guide to natural naming. SIGPLAN Not., 25(5):95–102, 1990.
[15] A. Kuhn, S. Ducasse, and T. Gîrba. Enriching reverse engineering with semantic clustering. In WCRE '05, pages 113–122, Los Alamitos, CA, 2005. IEEE Computer Society Press.
[16] J. I. Maletic and A. Marcus. Supporting program comprehension using semantic and structural information. In ICSE '01, pages 103–112, Washington, DC, USA, 2001. IEEE Computer Society.
[17] A. Marcus and J. I. Maletic. Recovering documentation-to-source-code traceability links using latent semantic indexing. In ICSE '03, pages 125–135, Washington, DC, USA, 2003. IEEE Computer Society.
[18] A. Marcus, A. Sergeyev, V. Rajlich, and J. I. Maletic. An information retrieval approach to concept location in source code. In WCRE '04, pages 214–223, 2004.
[19] D. L. McGuinness. Ontologies come of age. In Spinning the Semantic Web, pages 171–194, 2003.
[20] G. A. Miller. WordNet: a lexical database for English. Commun. ACM, 38(11):39–41, 1995.
[21] T. M. Pigoski. Practical Software Maintenance. Wiley Computer Publishing, 1996.
[22] V. Rajlich and N. Wilde. The role of concepts in program comprehension. In IWPC '02, page 271. IEEE CS, 2002.
[23] M. F. N. Ramal, R. de Moura Meneses, and N. Anquetil. A disturbing result on the knowledge used during software maintenance. In WCRE '02, pages 277–, 2002.
[24] A. von Mayrhauser and A. M. Vans. Program comprehension during software maintenance and evolution. Computer, 28(8):44–55, 1995.
[25] N. Wilde and M. C. Scully. Software reconnaissance: mapping program features to code. Journal of Software Maintenance, 7(1):49–62, 1995.

