A System for the Automatic Detection of Variable Roles


Petri Gerdt

December 20th, 2006
University of Joensuu, Finland
Department of Computer Science and Statistics
Licentiate Thesis in Computer Science

Abstract

The roles of variables are a concept developed by Sajaniemi in 2002, when he was looking for a way to convey programming knowledge to novice programmers. The roles have been successfully utilized in teaching programming to novices, and they can provide experts with an approach to analysing and processing large-scale programs. In this thesis we present the ADVR system, a machine learning application that automatically detects a subset of the roles. An essential feature of the ADVR system is its use of a set of flow characteristics (FCs), which are algorithmically defined descriptions of variable behavior. The ADVR system is based on a dual machine learning process. In the learning mode the system is given a set of programs along with role information for the variables in those programs; it detects which FCs the variables have, replaces each variable with its role, and thus creates role-FC profiles, which it stores in a role-FC database. In the matching mode the system examines which FCs apply to each variable of the programs it gets as input. This FC detection process produces variable-FC profiles, i.e., the sets of FCs that apply to each variable. The system does not receive variable role information as input in the matching mode; instead, it consults its role-FC database and finds roles for the variables by comparing role-FC profiles with variable-FC profiles. We have evaluated the ADVR system by making it detect roles for a set of programs from three programming textbooks, with the correct roles provided by researchers. The ADVR system detected 93.3% of the roles correctly. From these promising evaluation results we conclude that the FC-based automatic detection of variable roles is feasible and that the set of FCs seems to be adequate for the task.

Preface

I would like to thank my supervisors, Professor Jorma Sajaniemi and Senior Lecturer Marja Kuittinen, for their valuable guidance through these years, without which the writing of this thesis would not have been possible; my colleagues Ph.Lic. Pauli Byckling and Ph.Lic. Seppo Nevalainen for the collaboration in different research projects and for innumerable discussions on various topics; the people at the department office and at the department library for making all practical matters much easier; the East Finland Graduate School in Computer Science and Engineering (ECSE) for funding my studies in 2002–2006; my parents for a thorough upbringing; and Mia for her love and support.

Abbreviations

Abbreviation  Description                              Defined in
ABB           Abbreviation                             3.1
ADVR          Automatic Detection of Variable Roles    3.1
APC           Automatic Program Comprehension          2
ASE           Arbitrary Sequence                       3.1
AST           Abstract Syntax Tree                     3.1
CEP           Conditional Expression Participation     3.1
CFG           Control Flow Graph                       3.1
COA           Conditional Assignment                   3.1
DFA           Data Flow Analysis                       3.1
DSE           Defined Sequence                         3.1
DU            Definition-Use                           5.1.3
DUA           Definition-Use in Assignments            5.1.4
FC            Flow Characteristic                      3.1
FIX           Fixed Value                              1.1
FOL           Follower                                 3.1, 1.1
FRD           Forward Reaching Definitions             5.1.2
GAT           Gatherer                                 1.1
IVA           Initial Value                            3.1
IVS           Incremental Value Sequence               3.1
LEP           Loop Expression Participation            3.1
LOA           Loop Assignment                          3.1
MRH           Most-Recent Holder                       1.1
MWH           Most-Wanted Holder                       1.1
ONE           One-Way Flag                             1.1
PFO           Partial Following                        3.1
PR            Program Representation                   4.1
RD            Reaching Definitions                     5.1.1
SAS           Single Assignment                        3.1
SPA           Singlepass                               3.1
STP           Stepper                                  1.1
TMP           Temporary                                1.1

Contents

1 Introduction
  1.1 Roles of Variables
  1.2 Contribution of the Thesis
  1.3 Structure of the Thesis
2 Related Research
  2.1 Human Program Comprehension
    2.1.1 Shneiderman’s Model
    2.1.2 Brooks’ Model
    2.1.3 The Model of Soloway and Ehrlich
    2.1.4 Letovsky’s Model
    2.1.5 Pennington’s Model
    2.1.6 Overview of Models
  2.2 Automated Program Comprehension
    2.2.1 The UNPROG Program Understander
    2.2.2 The RECOGNIZE Method
    2.2.3 The Method of Quilici
    2.2.4 The GRASPR System
    2.2.5 PARE
    2.2.6 Overview of Automatic Program Comprehension
  2.3 Assessing Roles of Variables with the RoleChecker
  2.4 Automatic Detection of Design Patterns
    2.4.1 The Method of Heuzeroth, Holl, Hogstrom, and Lowe
    2.4.2 IDEA
    2.4.3 SPQR
  2.5 Reverse Engineering and Refactoring
    2.5.1 Function Clone Detection with Metrics
    2.5.2 RefactoringCrawler
  2.6 Static Program Analysis Based Software Visualization
    2.6.1 The University of Washington Illustrating Compiler (UWIC)
    2.6.2 Algorithm Animation with Shape Analysis
  2.7 Software Metrics Based Automatic Assessment
3 Automatic Detection of Variable Roles
  3.1 Overview of the Detection Process
  3.2 Comparison with Literature
4 Program Representation
  4.1 Data Structures
  4.2 Interface to the Program Representation
    4.2.1 Class ProgramRepresentation
    4.2.2 Class Statement
    4.2.3 Class LoopStatement
    4.2.4 Class AssignmentStatement
    4.2.5 Class Variable
5 Flow Characteristics Analysis
  5.1 Data Flow Analysis
    5.1.1 Reaching Definitions Analysis
    5.1.2 Forward Reaching Definitions Analysis
    5.1.3 Definition-Use Chains
    5.1.4 Definition-Use in Assignments
    5.1.5 Interface to Data Flow Analysis Results
  5.2 Initial Value (IVA)
  5.3 Conditional Assignment (COA)
  5.4 Loop Assignment (LOA)
  5.5 Conditional Expression Participation (CEP)
  5.6 Loop Expression Participation (LEP)
  5.7 Interrelated Value Sequence (IVS)
  5.8 Single Assignment (SAS)
  5.9 Arbitrary Sequence (ASE) and Defined Sequence (DSE)
  5.10 Singlepass (SPA)
  5.11 Abbreviation (ABB)
  5.12 Following (FOL) and Partial Following (PFO)
  5.13 Summary
6 Machine Learning
  6.1 Machine Learning and the ID3 Classifier Algorithm
  6.2 Machine Learning in the ADVR System
7 Empirical Evaluation of the ADVR System and the FC Set
  7.1 Source Materials
  7.2 FC Distributions for the Roles
  7.3 The ID3 Classification Tree
  7.4 Leave-One-Out Cross-Validation
  7.5 Comparison With Computer Science Educators
  7.6 Summary of Evaluation
8 Conclusion and Future Work
References

1 Introduction

The roles of variables are names for stereotypical behaviors of variables [Sajaniemi, 2002a, 2006]. The concept was developed by Sajaniemi when he was looking for a way to convey programming knowledge to novice programmers. The roles have been successfully utilized in teaching programming to novices, and they can provide experts with an approach to analysing and processing large-scale programs. Concrete examples of these uses of the role concept are role-based program animation and role information in UML diagrams.

The goal of this thesis is to find a way to automatically detect the roles of variables. Our hypothesis is that the roles, i.e., stereotypic variable behaviors, can be defined as sets of flow characteristics, which in turn describe different behavioral characteristics of variables. In this thesis we present a set of flow characteristics and use them as the basis of a machine learning application that can be used to assign roles to variables in arbitrary programs. We then evaluate the application empirically to seek evidence that supports our hypothesis.

The rest of this Chapter is structured as follows: in Section 1.1 we present the role concept in detail and give an overview of research related to the roles of variables. The topic of this thesis, the automatic detection of the roles of variables, is a part of the ongoing research dealing with the roles of variables concept. We outline the key contributions of this thesis in Section 1.2. Finally, we present the structure of this thesis in Section 1.3.

1.1 Roles of Variables

The roles of variables are a new concept that describes the stereotypic behaviors of variables [Sajaniemi, 2002a, 2006]. The roles of variables are a simple and compact way to represent higher-level programming knowledge; in fact, 11 roles cover 99% of all variables in elementary programming books [Sajaniemi, 2002a]. The role concept is based on the notion of variable plans, which represent stereotypic uses of variables [Ehrlich and Soloway, 1984, Rist, 1991].

Consider the variable month in the program distance of Figure 1. The variable is initially assigned the value 3 on line 9, and then its value is repeatedly increased in the loop of lines 10–17. The variable month goes through a defined sequence of values: 3, 4, 5, ..., 13.

Thus, the role of month is stepper: it steps through a predictable sequence of values. The variable current, on the other hand, gets its values from user input on lines 7 and 13. In fact, current stores the most recent value that the user of the program has entered. Because of this, the role of current is most-recent holder. The variable largestDiff is used to store the largest difference between two consecutive values that the user enters. We say that the role of largestDiff is most-wanted holder, because it stores the “best” value encountered so far according to the criterion of the largest difference.

( 1)  program distance;
( 2)  var month, current, previous, largestDiff: integer;
( 3)  begin
( 4)    write('Enter 1. value: ');
( 5)    readln(previous);
( 6)    write('Enter 2. value: ');
( 7)    readln(current);
( 8)    largestDiff := current - previous;
( 9)    month := 3;
(10)    while month <= 12 do begin
(11)      previous := current;
(12)      write('Enter next value: ');
(13)      readln(current);
(14)      if current - previous > largestDiff then
(15)        largestDiff := current - previous;
(16)      month := month + 1
(17)    end;
(18)    writeln('Largest difference was ', largestDiff)
(19)  end.

Figure 1: The Pascal program distance.

Table 1 presents the names of the roles, abbreviations for the role names, and short informal descriptions of the roles. The first eight roles apply to all variables; the last three, marked with asterisks (*), apply to data structures.

Table 1: Roles for object-oriented, procedural, and functional programming [Sajaniemi et al., 2006a].

Role                Abbreviation  Informal Description
Fixed value         FIX           A data item that does not get a new proper value after its initialization.
Follower            FOL           A data item that gets its new value always from the old value of some other data item.
Gatherer            GAT           A data item accumulating the effect of individual values.
Most-recent holder  MRH           A data item holding the latest value encountered in going through a succession of unpredictable values, or simply the latest value obtained as input.
Most-wanted holder  MWH           A data item holding the best or otherwise most appropriate value encountered so far.
One-way flag        ONE           A two-valued data item that cannot get its initial value once the value has been changed.
Stepper             STP           A data item stepping through a systematic, predictable succession of values.
Temporary           TMP           A data item holding some value for a very short time only.
Organizer *         ORG           A data structure storing elements that can be rearranged.
Container *         CNT           A data structure storing elements that can be added and removed.
Walker *            WLK           A data item traversing in a data structure.

The role of a variable may change during its lifetime [Sajaniemi, 2002a]. Role changes are of two basic types: in a proper role change the final value of the variable in the first role is used as the initial value for the next role, whereas in a sporadic role change the variable is re-initialized with a totally new value at the beginning of the new role phase.
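As a concrete illustration of the two kinds of role change, consider the following short Python fragment. It is our own illustrative example rather than one taken from the thesis materials, and the role readings given in the comments are one reasonable interpretation, not definitive assignments.

    values = [4, 8, 15, 16, 23, 42]

    # Proper role change: 'n' is first a stepper (it steps through the
    # indices of the list); after the loop its final value is used, as such,
    # as the initial value of its next role phase, in which it acts as a
    # fixed value recording how many items were printed.
    n = 0
    while n < len(values):
        print(values[n])
        n = n + 1
    print('number of values printed:', n)

    # Sporadic role change: 'total' is first a gatherer that accumulates a
    # sum, and is then re-initialized with a completely unrelated value,
    # starting a new role phase as a most-recent holder of user input.
    total = 0
    for v in values:
        total = total + v
    print('sum of values:', total)
    total = int(input('Enter a value: '))
    print('you entered:', total)

In the first case the stepper’s final value carries over into the new phase; in the second, the old value of total is simply discarded.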


The roles of variables represent tacit expert knowledge by capturing the behavior of variables. Sajaniemi and Navarro Prieto [2005] found that the roles are indeed a part of expert programmers’ tacit knowledge; in other words, roles are names for concepts that had previously been tacit. This result is supported by the findings reported in Sajaniemi et al. [2006a]: the role concept is easy to understand, and in experiments programmers have learned to use it in less than one hour. One explanation for the quick learning is that the programmers did not have to learn an entirely new concept; instead, the roles made their tacit knowledge explicit and indexed it with the role names. The roles of variables have been successfully applied to the procedural, object-oriented, and functional programming paradigms [Sajaniemi et al., 2006a]. Thus, the role concept appears to be higher-level programming knowledge that transcends even paradigms: the roles seem to be present whenever problems are solved through programming.

The fact that the role concept is a concise presentation of higher-order programming knowledge that programmers learn easily makes the roles of variables an interesting concept in educational settings. In a classroom experiment, teaching the roles of variables in an elementary programming course had several beneficial effects. The use of role-based program animation resulted in considerably better program writing skills [Byckling and Sajaniemi, 2006]. The use of roles in teaching improved program comprehension and resulted in better mental models of programs: the roles provide a new conceptual framework that enables students to mentally process program information in a way that demonstrates good programming skills [Sajaniemi and Kuittinen, 2005]. These results have encouraged several educational institutions to include the role concept in novice-level programming courses. Another observation from the pedagogical use of the role concept is that the roles of variables provide a vocabulary that can be used in class to explain examples and, for example, frequently occurring errors that novices make.

As mentioned above, role information has been used in program animation. PlanAni is a role-based program animator that features role-based visualizations of variables and role-based animations of the operations that affect the variables’ values [Sajaniemi, 2002b, Sajaniemi and Kuittinen, 2003, 2004, 2005]. The current version of the PlanAni animator animates short C, Java, Python, and Pascal programs. Figure 2 is a screenshot of the PlanAni user interface. The left pane shows the animated program with a color enhancement highlighting the current action. The upper part of the right pane is reserved for variables depicted by metaphoric role images, and below it is the input/output area, consisting of a paper for output and a plate for input. The currently active action in the program pane on the left is connected with an arrow to the corresponding variables on the right.


Frequent pop-ups explain what is going on in the program, e.g., “creating a gatherer called sum”. Users can change the animation speed and the font used in the panes. Animation can proceed continuously (with pauses, because the frequent pop-ups require clicking the “Ok” button or pressing “Enter”) or stepwise. Animation can be restarted at any time, but backward animation is not possible.

Figure 2: A screenshot of the PlanAni user interface.

The architecture of PlanAni consists of four levels [Sajaniemi and Kuittinen, 2003]. The first level takes care of primitive graphics and animation and implements the user interface. The second level knows how to animate the smallest actions that are meaningful to viewers of the animation; this level is language independent and can be used to animate programs written in various languages. The third level takes as input a program to be animated, annotated with the roles of variables and possible role changes, and animates the program automatically. Finally, the fourth level does not need role information because it finds roles automatically. The fourth level can be used in any application that needs to automatically detect and analyze role information.


The metaphors used for the roles in the PlanAni system have also been used in the development of metaphor-based animation of object-oriented programs [Sajaniemi et al., 2006b], which is intended to be used to teach novices object-oriented programming concepts. Role information has furthermore been included in UML diagrams to increase their comprehensibility [Byckling et al., 2006]. Role information may provide implementation-related information through quite small graphical additions to the diagrams: for example, a role name associated with an attribute conveys information about how the attribute behaves at run time. The alternative way to present this information would be to insert a code snippet or a free-form text comment into the diagram, which could make the diagram too complicated to use. Role information does not take much space in the diagrams, and the roles are explicit and well defined.

In summary, there are many possible uses for the role concept. Roles have been found to be good pedagogical tools for teaching programming to novices. The roles provide a vocabulary for explaining errors in novice programs, and they may be useful in the documentation of larger-scale programs as well. UML diagrams with role information are an example of role-enhanced larger-scale program documentation. Program visualization aims at making programs more comprehensible, and the role concept can be used to achieve this goal, especially if the target audience is familiar with the role concept. The possible uses of the roles, such as automatic role-based program animation, automatically adding role information to UML diagrams, or the analysis of large-scale programs, benefit from or even require the automatic detection of variable roles, whose purpose is to automatically assign role information to the variables of arbitrary programs. This thesis presents a system that automatically detects variable roles, excluding the three roles for data structures and role changes. The main approach and basic ideas have been previously presented in Gerdt and Sajaniemi [2004, 2006].

1.2 Contribution of the Thesis

The automatic detection of variable roles requires a way to describe variable behavior. Thus, the first contribution of this thesis is a set of thirteen characteristics of variables that can be automatically detected.


We call these characteristics “flow characteristics”, as they describe different properties of the data flow that either affects a variable or goes through it. We present the flow characteristics through the static program analysis algorithms that detect them.

The roles of variables are a cognitive concept; they cannot be given an explicit technical definition. To meet this challenge we have created a dual machine learning process, the second contribution of this thesis. In the learning mode the automatic role detector learns what kind of variable behavior each of the roles indicates, and stores this information as the sets of flow characteristics that apply to each of the roles. In the matching mode the automatic role detector is given an arbitrary program and detects the flow characteristics of its variables. It then compares the variables’ flow characteristics with the role information that it stored in the learning mode: if similar flow characteristics apply to a variable and to a role, the variable is assigned that role.

The third contribution of this thesis is the evaluation of the automatic role detector application. The goal of the evaluation is to assess the expressive power of the flow characteristics: we are interested in finding out whether it is possible to describe roles with flow characteristics. We also compare the performance of the automatic role detector against human role assigners.
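To make the dual process easier to picture, the following Python fragment gives a deliberately simplified sketch of the learning and matching modes. It is not the ADVR implementation: the FC combinations in the example data are chosen arbitrarily, profiles are reduced to plain sets of strings, and matching is done by exact profile lookup with a largest-overlap fallback, whereas the real system uses the ID3-based machine learning described in Chapter 6.

    # Simplified sketch of the learning/matching idea; the helper names and
    # the FC combinations below are illustrative assumptions, not ADVR data.

    def learn(role_fc_db, training_variables):
        """Learning mode: record which FC profiles have been seen for which role."""
        for var in training_variables:
            profile = frozenset(var['fcs'])              # variable-FC profile
            role_fc_db.setdefault(profile, var['role'])  # stored as a role-FC profile

    def match(role_fc_db, variable_fcs):
        """Matching mode: pick the role whose stored FC profile best fits the variable."""
        profile = frozenset(variable_fcs)
        if profile in role_fc_db:                        # exact role-FC profile match
            return role_fc_db[profile]
        best = max(role_fc_db, key=lambda stored: len(stored & profile))
        return role_fc_db[best]                          # closest stored profile

    db = {}
    learn(db, [
        {'role': 'stepper',            'fcs': ['IVA', 'IVS', 'LOA', 'LEP']},
        {'role': 'most-recent holder', 'fcs': ['IVA', 'LOA', 'ASE']},
    ])
    print(match(db, ['IVA', 'IVS', 'LOA', 'LEP']))       # prints 'stepper'

In the actual system the comparison is not a set lookup but a classification with an ID3 decision tree trained on the role-FC profiles (Chapters 6 and 7).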

1.3 Structure of the Thesis

Chapter 2 covers related research from several fields, including human program comprehension, different approaches to automatic program comprehension, and applications of automatic program comprehension. Chapter 3 gives an overview of the automatic detection of variable roles, which is the topic of the thesis. The purpose of the Chapter is to give the reader a general view of the topic; later Chapters present the details of the different facets involved. The Chapter includes a comparison of our approach with the research discussed in the literature survey of Chapter 2. In Chapter 4 we present the program representation that the automatic role detection application creates when it analyzes programs. Chapter 4 also includes a description of the interface to program information that is used in the more detailed descriptions of the application in later Chapters.


Chapter 5 begins with a brief overview of data flow analysis and then proceeds to present our detection algorithms for the flow characteristics (FCs). The algorithms use the interface to the program representation that is described in Chapter 4. Chapter 6 discusses the topic of machine learning in general, and ID3 classification trees in particular. In Chapter 7 we evaluate our implementation of an automatic detector of variable roles, which follows the process described in Chapter 3, uses the FCs of Chapter 5, and relies on the machine learning approach discussed in Chapter 6. In Chapter 8 we give a summary of our results, discuss the lessons learned while doing the research for this thesis, and outline possible avenues of future work.


2 Related Research

In this Chapter we survey research related to the automatic analysis of programs. The Chapter does not include a discussion of data flow analysis and machine learning techniques, which are covered in Chapters 5 and 6.

Human program comprehension models describe scholars’ views of what humans do when they try to understand what a program might be about. The models seek to explain the behavior that occurs when people ponder what code does. We begin our literature survey by examining different theories of human program comprehension in Section 2.1. These models are mirrored in automated program comprehension and its derivatives: automatic program comprehension systems are, after all, programs written by humans whose purpose is to understand programs as humans do.

The rest of the Sections in this Chapter have one thing in common: all of them describe computer software whose purpose is to generate information about programs. An obvious audience for this information are those who work with code: programmers, software engineers, and so on. Another target audience are those who would like to work with code but cannot yet program: students.

Automatic program comprehension (APC) systems share the basic components of a program representation, a knowledge repository, and an algorithm that compares the two. These components resemble the constructs found in human program comprehension models: the mental model, the programming knowledge of long-term memory, and the comprehension strategies. We will present some approaches to automating program comprehension in Section 2.2.

Next we will examine an APC approach that is based on a theory about programming knowledge. Section 2.3 presents a tool for assessing the roles of variables that students have assigned to variables. This is an educational application of the role concept discussed in Section 1.1, and it is based on static program analysis. The research presented in Section 2.3 is independent of our work despite the similarities with this thesis. The automatic design pattern detectors discussed in Section 2.4 are quite similar to the APC approaches and share many of their characteristics. The feature that sets them apart is that they have a common view of how programming knowledge is expressed on an abstract level: as design patterns.


The last three Sections of this Chapter are more distantly related to the topic of this thesis. They have nevertheless been included because they describe interesting analysis methods and present use scenarios where the automatic detection of roles might be of use. In reverse engineering and refactoring, programs must be comprehended to some extent before they can be improved; thus, automatic program comprehension is needed to automate the process. Section 2.5 touches on this topic briefly. The purpose of program visualization is to present programs in a way that makes them easier to understand. Visualizations can be used in diverse situations: they may, for example, help students to learn programming or developers to debug large systems. The creation of useful visualizations typically requires a lot of time and effort if they are made manually, which obviously makes program visualization less useful. A solution to this problem is to create visualizations automatically through the combination of automatic program comprehension and program visualization, which is discussed in Section 2.6. Section 2.7 ends the literature survey by examining research on automatic assessment based on software metrics. Metrics deal with programs in a black-box manner: they provide numbers that can then be interpreted as indicators of some characteristic. This approach is very different from automatic program comprehension, where the syntactic and semantic properties of programs are examined. Nevertheless, the metrics-based approach provides an alternative way of automatically “understanding” the properties of a program. The reader is advised to compare the automatic assessment approach of Section 2.7 with the more software-engineering-oriented metrics approach described in Subsection 2.5.1.

2.1 Human Program Comprehension

Researchers in the psychology of programming try to model programmers’ cognitive models and their creation in order to understand how programmers create and use knowledge in different programming tasks. The psychology of programming is interested, for example, in the programming strategies, mental processes, and mental representations used in the programming context. The theoretical framework behind this kind of research is based on ideas that can be traced back to Aristotelian views of reasoning.


The categorical syllogism is the basis of Aristotelian logic, a logic of quantified deduction in which premises are expressed with the help of the quantifiers some, all, no, and some not [Anderson, 2000, pp. 325–327]. Categorical syllogism provides a sound theory for reasoning with quantifiers, and it is used as the basis of the mental model theory, which seeks to explain how subjects’ cognitive processes work. The mental model theory states that a subject builds a mental model of the world that satisfies the premises of a syllogism and then inspects the model to see if some conclusion is satisfied [Anderson, 2000, p. 330]. Within this context, understanding is the activity of building a mental model and verifying it.

Von Mayrhauser and Vans [1994] have surveyed the mental model theories used in the program comprehension research literature. They refer to mental model theories of program comprehension as cognition models. The purpose of a cognition model is to explain the cognitive processes of a programmer engaged in a task requiring code cognition. Different cognition models differ in their details, but they include some common elements. Program comprehension is a process where existing knowledge is combined with new knowledge. Knowledge can be divided into external and internal knowledge. External knowledge refers to any materials available to the comprehender that aid the comprehension process, such as the program code and documentation. Internal knowledge resides within the long-term memory of the comprehender and includes, for example, information about programming in general, programming languages, and the program domain. Internal knowledge is organized as hierarchies of mental constructs that support the development and validation of expectations, interpretations, and inferencing. These constructs are sometimes represented by schemas, which represent categorical knowledge with typed slots, much like data structures in computer science [Anderson, 2000, pp. 154–155]. Hereafter we will use the term knowledge to mean internal knowledge.

During the comprehension process the comprehender creates a mental model that describes the program to be understood. The mental model of the program resides within the working memory of the comprehender. The comprehension process is directed by a comprehension strategy, which directs the acquisition of knowledge and the verification of the mental model. We use the term acquisition process as a synonym for comprehension strategy; it stresses the process nature of comprehension. Most experiments within the psychology of programming compare novice programmers and expert programmers, and most program cognition models describe what expertise or expert characteristics are in the context of the model.


Typically experts have different knowledge acquisition strategies as well as different internal knowledge representations when compared to novices.

Different models of program comprehension describe the process of building the mental model differently, but most involve chunking and cross-referencing in some form. Chunking means the creation of higher-abstraction-level structures from chunks of lower-level structures [Miller, 1956]. When groups of structures are recognized, labels replace the lower-level chunks. Lower-level structures are thus chunked into larger structures at a higher level of abstraction [von Mayrhauser and Vans, 1994]. Cross-referencing means relating different levels of abstraction, building a mental model that spans all levels of abstraction in internal knowledge. Cross-referencing is an active process, which interacts with the mental model and the knowledge stored in the long-term memory of the program comprehender.

Two comprehension strategies appear repeatedly in the program comprehension models. In the top-down approach the structures of the mental model are first formulated as general overviews of the target. The structures are then refined recursively until the mental model is detailed enough to be validated. The top-down approach starts with general knowledge, which is refined into more specific, detailed knowledge. The bottom-up approach works vice versa, i.e., the construction of the mental model is started by formulating specific structures representing details of the target being examined. The specific structures are then linked together to form larger structures that can be validated. The bottom-up approach starts with specific knowledge, from which more general knowledge is constructed. The two approaches are not mutually exclusive; some models of program comprehension expect both top-down and bottom-up processing to happen.

In the rest of this Section we will review five models of human program comprehension. The models are presented in publication order; the first model that we cover is Shneiderman’s model (Subsection 2.1.1), which was first published in 1977, and the last is Pennington’s model (Subsection 2.1.5) from 1987. The models appear to build upon each other: the more recent models are more detailed than the older ones. In addition to explanations of human cognition, this development gives an interesting view of how the scientific community has processed the topic. The last Subsection of this Section, 2.1.6, gives an overview of the models.


2.1.1 Shneiderman’s Model

In Shneiderman’s model [1977, 1986] the program comprehender creates an internal semantic representation of the program code. The representation resides in the working memory and consists of multiple abstraction levels, such as high-level program goals and low-level implementational details. When the program has been comprehended, the representation is a language-independent model, and the comprehender can express it in other programming languages or as other external representations. The comprehender uses syntactic and semantic knowledge stored in the long-term memory as he or she seeks to understand a program. Syntactic knowledge includes information about the syntax of different programming languages and the use of programming environments (programming language knowledge). Semantic knowledge is information about what effects different syntactic constructs have, and about how to implement different algorithms and data structures on a general, language-independent level (programming knowledge). In Shneiderman’s model the goal of the comprehension process is to build the internal semantic representation. The program comprehender uses the semantic and syntactic knowledge to build the different levels of the internal representation by recognizing familiar structures and ideas. It is possible for the comprehender to understand the program better at some levels of abstraction than at others: the comprehender may understand what the statements do, but he or she might not have a clue about the goals of the program.

2.1.2 Brooks’ Model

According to Brooks [1983, p. 544], programming involves the creation of mappings from a problem domain into the programming domain. There may be intermediate domains between the two. Program comprehension involves the reconstruction of a part or all of these mappings. Hypotheses play an important part in program comprehension: the reconstruction process is driven by expectations that originate in the creation, confirmation, and refinement of hypotheses. Brooks does not explicitly specify how knowledge is organized. In the acquisition or understanding process hypotheses are successively refined [Brooks, 1983, p. 545]. The hypotheses relate to a specific knowledge domain or to the connection between the domains.


The hypotheses are created and refined in a top-down fashion, starting from the goals of the program and proceeding to more specific hypotheses that refine the previous hypotheses. Because of this feature, von Mayrhauser and Vans [1994] call Brooks’ model a hypothesis-driven comprehension model. When the refinement reaches a certain level, the comprehender starts to verify the hypotheses against program code and other external knowledge representations. Sets of features in program code that indicate the occurrence of a certain program construct are beacons for that structure [Brooks, 1983, p. 548]. Beacons help the comprehender to associate a hypothesis with the source code; more generally, beacons are cues that index into knowledge [von Mayrhauser and Vans, 1994]. In the terminology of Soloway and Ehrlich [1984], beacons would be indexes or indicators to programming plans. The mental model of the comprehender consists of hypotheses, which form a tree-like hierarchy in which the most general hypothesis forms the topmost element. The verification of hypotheses may lead to the creation of more hypotheses, as beacons may trigger the creation of a hypothesis. The verification is hypothesis driven: when a hypothesis is created, the external representations may be searched for beacons. Experts differ from novices in that they are better at organizing information about a program. This leads to more effective hypothesis verification, as an expert can utilize his or her programming knowledge more effectively when verifying hypotheses against code and programming structures [Brooks, 1983, p. 553].

2.1.3 The Model of Soloway and Ehrlich

Soloway and Ehrlich [1984] divide the knowledge that an expert program comprehender has into two categories. Programming plans are stereotypic programming elements or action sequences that can be used to solve different programming problems. The plans are related to each other, and some plans may contain or imply other plans. There are three different kinds of plans [Soloway et al., 1982, pp. 34–35]. Strategic plans represent global strategies for solving programming problems, such as algorithms. Tactical plans are local strategies that can be used to specialize a more global solution to a certain context. Implementational plans are language-dependent techniques for implementing the language-independent strategic and tactical plans. Soloway and Ehrlich base their program comprehension model on text comprehension research, as program code can be viewed as readable text.


The notion of plans corresponds to the notion of schemas. The second knowledge category is the rules of programming discourse, which specify programming conventions. Similarly to speech discourse rules, the programming discourse rules specify the proper ways to express plans as code. The rules of discourse direct the decomposition of general programming goals into strategic, tactical, and, finally, implementational plans. Soloway and Ehrlich claim that the two knowledge categories define an expert, i.e., experts have plan knowledge and understand the rules of programming discourse, whereas novices do not.

Soloway and Ehrlich [1984, pp. 596–597] use the terms plan-like and unplan-like to indicate the nature of a program’s plan structure. A program that has been composed according to the rules of programming discourse is called plan-like. A plan-like program can be dissected into code fragments that match expert programming plans. An unplan-like program violates the rules of discourse and is hard or impossible to understand, as it does not match expert programming plans.

The mental model of the program comprehender consists of a linked hierarchy of goals and plans, where high-level goals are divided or decomposed into plans at lower levels of abstraction. The knowledge acquisition process utilizes a top-down strategy that is driven by the rules of programming discourse. The comprehension process starts with the creation of high-level program goals that are then partitioned into sub-goals, which in turn are replaced with strategic, tactical, and implementational plans according to the rules of discourse. The process utilizes chunking: a goal is implemented by different plans, which in turn include goals that are resolved through other plans.

2.1.4 Letovsky’s Model

Letovsky [1986, p. 68] introduces the concept of a knowledge-based program understander, which consists of three parts or components: a knowledge base, a mental model, and an assimilation process. The knowledge base represents the knowledge that the program comprehender uses in the program comprehension task. The knowledge base borrows its elements of knowledge from Ehrlich and Soloway’s model [Ehrlich and Soloway, 1984]: it includes both plan knowledge and rules of programming discourse. In addition, Letovsky mentions goals, which are knowledge about recurring computational goals.


Efficacy knowledge is information about how to implement efficient programming solutions and how to detect inefficient ones. Domain knowledge provides the program comprehender with a body of knowledge about the possible goals and purposes that the examined program is supposed to cover.

The mental model consists of three layers, which represent different levels of abstraction [Letovsky, 1986, p. 69]. The most abstract layer is the specification layer, which contains a description of the program goals. The least abstract layer is the implementation layer, which contains entities like data structures and algorithms. The third layer, the annotation layer, ties the other two layers together by connecting goals and their implementation(s). The mental model represents the internal knowledge.

In Letovsky’s model the comprehension or assimilation process is driven by conjectures, which resemble Brooks’s idea of hypothesis-driven program comprehension. The comprehender creates three kinds of conjectures: why, how, and what [Letovsky, 1986, p. 73]. Why conjectures are hypotheses about the purpose of some part of the program. What conjectures are hypotheses about what constructs the program contains. How conjectures relate to the methods that are used to reach a goal. In the process, the external program representations, such as documentation and source code, are compared with the knowledge base, and any matches are stored in some of the layers of the mental model. The assimilation process can be either top-down or bottom-up or a mix of the two, whichever suits the understander better [Letovsky, 1986, p. 69]. Humans are opportunistic in nature and will use whatever strategy seems to yield the best results. Letovsky [1986, pp. 71–72] suggests that the what and why questions are associated with the bottom-up strategy, as they are asked when unfamiliar program constructs are encountered in code.

2.1.5 Pennington’s Model

Pennington’s [1987] model of program comprehension is based on text comprehension theories, just like the model of Soloway and Ehrlich [1984]. The basis of the model is a mental model, which consists of two distinct but interlinked parts: the program model and the situation model. The program model is a mental representation of the procedural program relations, a control flow abstraction of the program.


The program model is created bottom-up by examining the program code for control structures, or control primes. The control primes work as beacons that refer to the text structure knowledge and the programming knowledge that the comprehender has. The program model creation process involves the chunking of control structures into a complete view of the control flow in the examined program. The situation model is based on the program model. It describes the program in terms of real-world objects that are stored as domain plan knowledge, e.g., how a certain statement is interpreted from the point of view of domain knowledge. The construction of the situation model incorporates information about how data flows in the program, as well as other semantic information. The situation model is constructed in the same manner as the program model, by chunking bottom-up; beacons serve as indexes into domain plan knowledge. The two parts of the mental representation are cross-referenced: program parts are connected to domain functions [Pennington, 1987, p. 102].

Pennington [1987] conducted an empirical study of programmers and found that expert comprehenders differ from poor comprehenders in their cross-referencing strategy. Experts alternated between studying the program, translating it into domain terms, and verifying their mental model in program terms. Poor comprehenders tend to focus on either domain or program issues without making serious efforts to cross-reference them. The non-cross-referencing comprehenders fall into two categories: comprehenders who focus on program issues understand how the program works but do not know why it does what it does, whereas a singular interest in domain concepts leads to an understanding of the purpose of the program but no real idea of how the program works.

2.1.6 Overview of Models

The models of program comprehension presented in this Section have four common components: a representation of internal knowledge, a mental model, an acquisition process that builds up the mental model, and a definition of expert characteristics. We will next summarize the models presented in this Section with respect to these four components.

The program comprehender has an internal knowledge representation, which is a long-term repository of knowledge.


Table 2 summarizes how the different models explain how knowledge is stored. Most of the models describe how knowledge is partitioned into different kinds of knowledge, which are used in program comprehension. Most notably, Soloway and Ehrlich [1984] and Letovsky [1986] give quite detailed descriptions. All models seem to share at least two kinds of knowledge that a program comprehender uses: programming language knowledge and programming knowledge. For example, Shneiderman [1986] calls these syntactic and semantic knowledge.

Table 2: Internal knowledge representation in selected models of human program comprehension.

Source               Internal knowledge representation
Shneiderman          Knowledge is divided into syntactic and semantic knowledge.
Brooks               Knowledge is arranged and accessed through index-like beacons.
Soloway and Ehrlich  Knowledge is divided into programming plans and rules of programming discourse. Three kinds of plans: strategic, tactical, and implementational.
Letovsky             Like Soloway and Ehrlich; in addition: knowledge about recurring programming goals, efficacy knowledge, and domain knowledge.
Pennington           Knowledge is divided into text structure knowledge and programming knowledge.

Comprehension involves the creation of a mental model that represents the program under examination. Table 3 summarizes key characteristics of the mental models of the comprehension models presented in this Section. All the mental models in this Section share two features: they have multiple levels of abstraction, and at least one of the levels is language independent, i.e., the program comprehender creates a mental model of the program that is not based on any programming language syntax. In addition, the mental models are assumed to be hierarchical: the most abstract mental representation is at the top of the hierarchy, and elements lower in the hierarchy are less abstract.


Table 3: Mental model characteristics.

Source               Mental model characteristics
Shneiderman          Language independent semantic representation of the program code with multiple levels of abstraction.
Brooks               A tree of hypotheses, where the most general hypothesis is the root.
Soloway and Ehrlich  A linked hierarchy of goals and plans.
Letovsky             Three layers, the most abstract first: specification layer, annotation layer, and implementational layer.
Pennington           Two distinct but interlinked parts: program model and domain model.

The acquisition process explains how the program code and the internal knowledge representation are used when the comprehender builds the mental model. The process descriptions summarized in Table 4 may be divided into three categories according to the direction of the process, i.e., the comprehension strategy. Some models build the mental model from the abstract to the concrete, starting with the overall goal or purpose of the program and proceeding to implementational details (see, for example, Soloway and Ehrlich [1984]). This is called top-down processing. Bottom-up processing works vice versa: the comprehension process starts by examining the concrete details of the program, such as individual statements in code, and then proceeds to more abstract issues, for example, the purpose of algorithms. Pennington [1987] uses a bottom-up process. Finally, Letovsky [1986] provides a third option, the combination of both bottom-up and top-down, where the comprehender uses whatever comprehension strategy yields the best results. All models share the feature of cross-referencing external and internal knowledge with the mental model, and of creating links between the different knowledge sources.

Most models describe how experts comprehend programs, which is reflected in their expert characteristics. A definition of expert characteristics makes a model more useful for research: to verify a program comprehension model, a researcher can compare how novice and expert programmers differ. Thus, an idea of expert characteristics is crucial for the empirical evaluation of the models: the expert characteristics serve as a hypothesis to be verified.


Table 4: Acquisition process characteristics.

Source               Acquisition process characteristics
Shneiderman          Building a mental model that covers multiple levels of abstraction.
Brooks               Hypothesis driven: reconstruction of mappings between the problem and programming domains through refinement of hypotheses.
Soloway and Ehrlich  Top-down processing of goals and plans driven by the rules of discourse: goals are chunked into sub-goals, which in turn are replaced by plans.
Letovsky             Conjecture driven, three kinds of conjectures: why, what, and how. May be top-down or bottom-up or both.
Pennington           Bottom-up processing of program code: the program model is constructed by examining control structures, and the domain model is constructed by examining data flow and other semantic information.

Table 5 lists a short summary of the expert characteristics of the comprehension models. Typically an expert’s comprehension process works in the way the comprehension model describes; novices differ by making errors in the process or by being unable to create some of the mental constructs needed for program comprehension.

Table 5: Expert characteristics.

Source               Expert characteristics
Shneiderman          -
Brooks               Effective hypothesis verification ability.
Soloway and Ehrlich  Possesses programming plans and is able to use them.
Letovsky             A good ability to build and verify conjectures.
Pennington           Experts are able to cross-reference between domain and program knowledge.


2.2 Automated Program Comprehension

Quilici [1994, p. 84] defines automated program comprehension (APC) as the process of automatically extracting programming knowledge from source code. We will use the term automatic program comprehension as a synonym for the alternative notions used in the literature: program recognition [Wills, 1996] and program understanding [Hartman, 1991]. The basis of automated program comprehension lies in the belief that the programming knowledge used by the programmer of a program can be recognized and recovered by examining the program [Hartman, 1991, p. 62]. Even though publications about APC rarely cite research on human program comprehension, this belief can be assumed to rest on the success of explaining programmers’ cognition through the different models of human program comprehension (see Section 2.1). Cognitive science and artificial intelligence research have a long history of interaction, where cognitive theories have been implemented as computer models of reasoning (see Bechtel and Graham [1998] for a thorough introduction to the topic). These implementations in turn have given feedback, in the form of new ideas and feasibility studies, to the cognitive science community. Automated program comprehension in its many forms continues the tradition of implementing systems based on notions of human reasoning: most implementations use data structures and algorithms that are almost directly comparable to various program comprehension models. In Section 2.1 we outlined the common elements of human program comprehension models, which included four parts: knowledge, a mental model, an acquisition process, and expert characteristics. Next we will examine how these four constituents are mirrored in APC.

Human program comprehenders have programming and domain knowledge stored in their long-term memory, and this knowledge is used in the comprehension process. Similarly, an APC application needs a permanently stored programming knowledge repository of suitably encoded knowledge. Most APC systems do not deal with domain knowledge, as this kind of knowledge is very specialized and the purpose of most APCs is to analyze arbitrary programs. Programming knowledge, such as high-level language-independent programming plans, is suitable for computerization. The encoding of programming knowledge differs in its level of abstraction and granularity, but all representations can be seen as encoded programming plans. We will use the term plan as a general concept when referring to knowledge items stored in a programming knowledge repository.


Other names for knowledge items include, for example, knowledge atoms [Clayton et al., 1998], standard implementation plans (STIMPs) [Hartman, 1991], abstract program concepts [Kozaczynski et al., 1992], common stereotypical computational structures or clichés [Wills, 1996], and templates [Woods and Yang, 1996]. The plans of the repository are encoded as various data structures, such as flow graphs, source code templates, or logical constraints, which are typically organized as an interrelated or even recursive hierarchy. The hierarchies are typically organized as forests of tree structures that have goals as their root nodes and instructions as leaves. Knowledge repositories are typically constructed manually by humans, by codifying the plans found in sample programs.

The mental model that humans build and process when they are examining a program corresponds to the program representation that an APC constructs during the automatic analysis. The program representation is typically a complex data structure that can be queried and otherwise manipulated. Abstract syntax trees (ASTs), constructed during the scanning and parsing of the code of a program, are a popular choice for the foundation of program representations [Hartman, 1991, Kozaczynski et al., 1992, Quilici, 1994]. Data flow graphs are another program representation type that is used, and some program representations combine both data structures along with other types of information. An important feature of the program representation is that it must be somehow comparable with the programming knowledge repository of an APC. In some cases the data structure used in the program representation is also used to implement the programming knowledge repository. For example, a plan might be represented by a partial abstract syntax tree that could be matched against the abstract syntax tree representation of a source program with common tree operations.

The acquisition process of a human comprehender corresponds to the algorithm that searches the program representation for instances of the plans contained in the plan repository. There are two recurring general approaches to implementing the matching algorithms, and they are similar to the human comprehension strategies. In the top-down approach the system starts by determining the goals that a program might achieve and then seeks the plans that would implement these goals [Quilici, 1994, p. 86]; the instructions of these plans are then matched to program instructions. The bottom-up approach starts from the opposite direction: it begins by examining the instructions of the input program and then tries to find plans that contain these instructions, after which the algorithm tries to infer the goals that the found plans might define.


Quilici lists some problems with the two algorithmic approaches. The top-down method needs information about the goals that the program achieves, and this information is not readily available in real-world applications. The top-down approach is also ill-suited to partial plan recognition, as it only knows how to deal with plans that are connected to goals. The bottom-up method suffers from the possibility of a combinatorial explosion: each instruction may be a part of several plans, which in turn may be parts of other plans. This limits the size of the analyzed programs and of the plan hierarchies that can be used. Quilici suggests that methods that limit the search space may help to circumvent this problem. A weakness shared by both methods is the lack of domain- or program-specific plans: real-world applications are bound to contain very domain-specific plans, so the general plans stored in the plan library may not apply to them.

APC systems have many different application areas. The restructuring of programs involves understanding what a program does and then restructuring it into another representation that is more desirable in some respect, e.g., easier to maintain or more efficient. A maintenance-related objective of APC is the generation of documentation: the plan structure of a program may, in fact, serve as a documentational aid to developers. APC techniques may also help to locate reusable code, thus providing opportunities for code reuse (see Section 2.4). The translation of a program from one language to another can likewise be supported by an APC system; for example, Quilici [1994] presents an approach in which procedural C programs are translated to object-oriented C++. Finally, the tutoring of novice programmers may be automated to some extent with tools that use APC (see Sections 2.6 and 2.7 for some examples).

In the rest of this Section we survey five APC approaches. The first four are presented in order of publication: the UNPROG Program Understander from 1991 is the oldest (Subsection 2.2.1) and the GRASPR system from 1996 (Subsection 2.2.4) the most recent. Not surprisingly, the newer APCs are more developed and complex than the older ones, and the newer approaches appear to be at least partially based on their predecessors. Subsection 2.2.5 covers PARE, an APC approach that differs somewhat from the others: it is a pure research system, intended to provide researchers with information about programmers' mental models. Lastly, Subsection 2.2.6 gives an overview of the topic of this Section.

2.2.1 The UNPROG Program Understander

Hartman [1991] describes an APC system, the UNPROG Program Understander, which is intended as a general-purpose APC to be used in CASE tools, in empirical studies in the psychology of programming, and as a program comprehension aid in general. UNPROG uses a program representation called the hierarchical program model (HMODEL) [Hartman, 1991, pp. 3–4]. The HMODEL is an abstraction of control and data flow graphs, and it is constructed with a process labeled proper decomposition. In proper decomposition the control and data flow graph structures of the examined program are partitioned into multiple small sub-HMODELs, which are grouped together into a tree-like structure, the HMODEL.

In the UNPROG system programming knowledge is expressed by standard implementation plans (STIMPs), which are stored in a library. The STIMPs are specified as labeled sets of sub-HMODELs. The use of STIMPs is based on the notion of planfullness: "programming is stereotyped, making frequent use of standard implementations" [Hartman, 1991, p. 65]. The notion of planfullness resembles the term plan-like, which Soloway and Ehrlich [1984] used to denote program code that matches expert programming plans (see Section 2.1).

The algorithm of UNPROG is called HMATCH. HMATCH compares the sub-HMODEL representation of the examined program to the sub-HMODELs of the STIMPs; the comparison is based on matching control and data flow similarities and on evaluating a set of boolean tests. The time complexity of HMATCH is linear with respect to the number of sub-HMODELs in the STIMP and in the examined program. UNPROG supports partial recognition, which is defined as the problem of recognizing plan instances despite non-plan material. This is made possible by the implementation of the sub-HMODELs, which hide implementational details and side computations. Hartman reports an experiment in which UNPROG correctly recognized 34 of 35 instances of a bounded linear search algorithm in 20 input programs. The input programs were in various languages: Pascal, C, PL/I, Lisp, and even pseudocode. The STIMP library used in the experiment contained 9 STIMPs.

2.2.2 The RECOGNIZE Method

Kozaczynski et al. [1992] describe a program concept recognition method called RECOGNIZE, which involves the formulation of abstract program concepts that are then detected automatically. The RECOGNIZE method is used in a Cobol software renovation environment designed to improve large Cobol systems.

The authors use the term abstract program concept instead of speaking of programming plans [Kozaczynski et al., 1992, pp. 216–217]. Abstract concepts represent language-independent ideas of problem solving and computation, and they may be divided into three categories. Programming concepts include data structures and algorithms, and their applications. Architectural concepts relate to the architectural structure of a program, such as interfaces, modules, and databases. Domain concepts include the implementations of application or business logic. Abstract concepts are stored in a knowledge repository called the concept model, which, in addition to the concepts, includes concept recognition rules; these consist of information about the components of a concept's sub-concepts, as well as constraints on and among the sub-concepts.

RECOGNIZE creates a program representation by parsing the source program into an AST and by performing various semantic analyses, such as definition-use chain and control-dependency relation analysis. The abstract concept recognition algorithm builds an abstract program concept representation on top of the AST of the analyzed program in a top-down manner [Kozaczynski et al., 1992, pp. 221–222]. Each AST node may trigger the evaluation of some abstract concept stored in the concept model. The abstract concept is then compared to the triggering part of the AST according to the concept recognition rules associated with it. The result of the recognition process is the abstract concept mapping of the source program. The user can browse the results of the analysis through a user interface that supports querying and browsing the concepts contained in the analyzed program.

The implementation of RECOGNIZE described by Kozaczynski et al. [1992, pp. 222–223] processes short programs of about 100–200 lines. The system needs an extensive collection of abstract concepts in order to be useful. The authors present some possibilities for improving performance.
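The trigger-and-check structure of such recognition rules can be illustrated with a short sketch. The node kinds, concept names, and constraints below are invented for illustration and do not reproduce the actual concept model of RECOGNIZE.

    # Sketch of trigger-based concept recognition: AST nodes trigger the
    # evaluation of candidate concepts whose constraints are then checked.
    from dataclasses import dataclass, field
    from typing import Callable, List

    @dataclass
    class Node:
        kind: str
        children: List["Node"] = field(default_factory=list)
        text: str = ""
        concepts: List[str] = field(default_factory=list)   # recognized concepts

    @dataclass
    class ConceptRule:
        name: str
        trigger: str                        # node kind that triggers evaluation
        constraint: Callable[[Node], bool]  # must hold for the concept to apply

    RULES = [
        ConceptRule("counting-loop", "for",
                    lambda n: any(c.kind == "assign" for c in n.children)),
        ConceptRule("guarded-update", "if",
                    lambda n: any(c.kind == "assign" for c in n.children)),
    ]

    def recognize(node: Node) -> None:
        """Walk the AST; each node may trigger the evaluation of concept rules."""
        for rule in RULES:
            if node.kind == rule.trigger and rule.constraint(node):
                node.concepts.append(rule.name)
        for child in node.children:
            recognize(child)

    tree = Node("if", [Node("compare", text="x > 0"),
                       Node("assign", text="max := x")])
    recognize(tree)
    print(tree.concepts)        # ['guarded-update']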

2.2.3 The Method of Quilici

The approach of Quilici [1994] uses APC to find common operations and objects in C code in order to replace them with object-oriented C++ code from existing libraries.

The programming knowledge repository described by Quilici [1994, pp. 89–91] consists of two parts that describe programming plans. Plan definitions are lists of the attributes that plan instances have, and they are represented by frames, which are conceptually close to the notion of a schema. A plan recognition rule lists the components of a plan and describes constraints on the components. The plans are interlinked to indicate different relationships: indexes, specialization constraints, and implication information. Plans are indexed by program instructions and by other plans. This means that during detection a certain program construct may trigger a plan, which is then checked against the examined part of the program; the detection of a plan may in turn trigger another plan that has been indexed as related to the detected plan. The purpose of indexing is to limit the search space by focusing the search within the plan library. For example, a partial plan match may cause the search to proceed among the plans indexed as related to the partially matched plan, making it easier to find the plan that matches best. The plans in the library may also be presented as specializations of other plans, rather like object-oriented specialization or inheritance. The plans include specialization information, which allows the analyzer to first try a more general plan and then try to match a more specialized version of it. Each plan also has a list of implied plans: the recognition of a plan may imply the presence of other plans even if they have not yet been recognized during the analysis.

The program representation is an abstract syntax tree whose nodes are frames, the same structure that is used to represent plans in the knowledge repository [Quilici, 1994, p. 92]. A frame can represent any program object that a translator can recognize: a primitive operation like addition, or a more complex structure like a loop or a procedure.

Quilici has created the APC algorithm and its supporting data structures by first observing and then mimicking how novice programmers comprehend short programs. The result is a hybrid top-down, bottom-up method, similar to the human program comprehension strategy formulated by Letovsky [1986] (see Section 2.1). The recognition algorithm of Quilici [1994, p. 92] is based on following the specialization links in a depth-first manner, where program components or plans that await processing are stored in a worklist. Quilici has tested the APC system with student programs and a relatively small library of approximately 100 plans. The publication does not report the performance of these tests.
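The use of specialization links can be sketched as follows: a general plan is tried first, and when it matches, its more specialized variants are added to a worklist. The plan names and the string-based tests are hypothetical simplifications of our own; Quilici's system works on a frame-based AST, not on raw program text.

    # Sketch of specialization-driven matching with a worklist.
    from collections import deque

    PLANS = {
        "traverse-array": {"test": lambda code: "for" in code,
                           "specializations": ["search-array", "sum-array"]},
        "search-array":   {"test": lambda code: "if" in code and "break" in code,
                           "specializations": []},
        "sum-array":      {"test": lambda code: "+=" in code,
                           "specializations": []},
    }

    def recognize(code: str, start: str = "traverse-array"):
        """Try a general plan; on success, queue its specializations."""
        matched, worklist = [], deque([start])
        while worklist:
            plan = worklist.popleft()
            if PLANS[plan]["test"](code):
                matched.append(plan)
                worklist.extend(PLANS[plan]["specializations"])
        return matched

    snippet = "for x in data:\n    total += x"
    print(recognize(snippet))   # ['traverse-array', 'sum-array']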

2.2.4 The GRASPR System

The GRASPR (GRAph-based System for Program Recognition) system described by Wills [1996] transforms the program comprehension process into a graph computation problem. Another novel feature of GRASPR is the use of a graph grammar to represent recognizable program constructs. GRASPR represents the examined program as an annotated flow graph, which forms a directed acyclic graph. The system uses a knowledge repository that contains clichés, defined as common stereotypical computational structures. Clichés are encoded with an attributed graph grammar, which describes program properties such as control dependency and recursion. Wills constructed the cliché library manually by extracting clichés from examples in computer science textbooks and on the basis of discussions with developers.

The clichés are recognized by parsing a flow graph of the source program in accordance with the graph grammar. The use of a flow graph creates an abstraction layer that makes it easier to define and find abstract clichés. A novel flow graph parsing algorithm manages partial cliché recognition with a chart parsing algorithm. In the GRASPR system the program recognition task is thus transformed into a graph parsing problem, which is a well-researched topic within computer science. The graph parsing problem is NP-complete in theory, but the constraints of the clichés make the task computable in practice. GRASPR can deal with recursion, aggregate data structures, and some side effects of variable assignments. The output of the recognition process is a hierarchy of the clichés that have been identified in the source program, along with their relationships to each other.

Fahmy and Blostein [1992, p. 298] report some problems with the use of graph grammars. First, graph grammars are computationally expensive: subgraph isomorphism computation, which is a central part of graph parsing, is NP-complete. Secondly, large graph grammars are very complicated and hard to read and manage. Thirdly, it is hard to model noisy and uncertain data with graph grammars.

Woods and Yang [1996] prove that automatic program comprehension, in which plans in a knowledge repository are compared with a program representation, is an NP-hard problem.


Woods and Yang formally define the simple program understanding problem (SPU) [Woods and Yang, 1996, pp. 8–9], in which the knowledge repository is a graph of templates representing programming plans and the program is represented by a graph. Program comprehension is done by comparing the subgraphs of the knowledge repository with the graph structures of the program representation. Woods and Yang [1996, p. 9] then prove that the SPU problem is NP-hard by reduction from the subgraph isomorphism problem, which is known to be NP-hard. In addition to showing that SPU is intractable, the authors prove that cliché-, template-, or STIMP-based matching is NP-hard as well, since the subgraph isomorphism problem can be reduced to it too. The approaches presented in this Section deal with the NP-hardness by using various heuristics, such as constraints that reduce the search space. Woods and Yang [1996] call these attempts heuristic tricks and propose a formal definition of a constraint-based model for solving the SPU problem. The constraint-based model reformulates the SPU problem into a constraint satisfaction problem.

2.2.5 PARE

Rist [1989, 1994] introduces the PARE (Plan Analysis by Reverse Engineering) system, which automatically detects the plan structure of an arbitrary Pascal program given as input. PARE is a research system, intended to provide researchers with information about programmers' mental models.

Rist defines a plan as a series of actions that achieve a goal. A program may have many plans, which together build up the plan structure of the program. PARE represents plans as sequences of linked lines of code. The different link types that may occur between lines of code are presented in Table 6. The links use and make are symmetric, as are obey and control. The use and make links describe data flow within a program, while the control and obey links model the control structures, or control dependencies, between lines.

A program plan is created by processing the lines of the analyzed program in reverse order. The analysis of a plan starts from an output line, which is called the goal in Rist's plan theory. The analysis proceeds by tracing data and control links backwards, until a terminal line is found. The terminal line, called an action in the plan theory, is the plan focus. The resulting trace through the lines of a program constitutes a plan.


Table 6: Link types within a plan structure of a program [Rist, 1989].

Type     Explanation
Use      A line may use 0-n data values. For example, a prompt statement may be implemented without data usage; an assignment, on the other hand, uses 1-n data values.
Make     A line may make 0-n data values. A control structure can be implemented in a way that does not make any data values; a calculation makes a data value, and a procedure may make multiple data values.
Obey     A line may obey 0-n other lines. A line may, for example, be inside a control structure, thus obeying the line that defines the control structure.
Control  A line may control 0-n other lines. A control structure controls any lines that occur within its scope.

The plan structure of a program is built by examining each output line. The plan structure forms a directed acyclic graph with multiple roots. The plans that PARE detects correspond to the notion of program slices, and the process of building them is an application of backward slicing.

The plan structure representation as defined by Rist [1994] does not describe in what order the plans are executed in the program. The plan structure represents the deep structure of the program, whereas the code represents its surface structure. The order of lines or actions in a plan can vary between programmers, resulting in different representations of the same solution to a programming problem.
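The backward tracing of data and control links can be illustrated with a small sketch. The line table below is a toy example of our own; for brevity it also ignores definitions that reach a line from later loop iterations, which a real slicer would take into account.

    # Sketch of PARE-style backward tracing: starting from an output line (a
    # goal), follow data (make/use) and control (control/obey) links backwards.
    lines = {
        1: {"makes": {"x"},   "uses": set(),         "obeys": set()},  # read(x)
        2: {"makes": {"sum"}, "uses": set(),         "obeys": set()},  # sum := 0
        3: {"makes": {"i"},   "uses": set(),         "obeys": set()},  # i := 1
        4: {"makes": set(),   "uses": {"i"},         "obeys": set()},  # while i <= 10 do
        5: {"makes": {"sum"}, "uses": {"sum", "i"},  "obeys": {4}},    #   sum := sum + i
        6: {"makes": {"i"},   "uses": {"i"},         "obeys": {4}},    #   i := i + 1
        7: {"makes": set(),   "uses": {"sum"},       "obeys": set()},  # writeln(sum)
        8: {"makes": set(),   "uses": {"x"},         "obeys": set()},  # writeln(x)
    }

    def makers_of(var, before):
        """Lines earlier in the text that make (define) the variable."""
        return {k for k, v in lines.items() if k < before and var in v["makes"]}

    def plan_for(goal_line):
        """Trace data and control links backwards from an output line."""
        plan, stack = set(), [goal_line]
        while stack:
            line = stack.pop()
            if line in plan:
                continue
            plan.add(line)
            for var in lines[line]["uses"]:
                stack.extend(makers_of(var, line) - plan)
            stack.extend(lines[line]["obeys"] - plan)
        return sorted(plan)

    print(plan_for(7))   # [2, 3, 4, 5, 7] -- the plan (backward slice) for sum
    print(plan_for(8))   # [1, 8]          -- a separate plan for x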

2.2.6 Overview of Automatic Program Comprehension

This subsection summarizes the features of the APC approaches examined in Subsections 2.2.1 through 2.2.5. The APC approaches are implementations of human program comprehension: they seek to automate program comprehension in order to achieve some programming-related goal. Table 7 collects the purposes and goals of the APC systems and approaches discussed in this Section. Three broad categories emerge. The first is general-purpose APC systems that can be "plugged in" to any application that needs in-depth information about a program; UNPROG [Hartman, 1991] is an example of this. The second category is research systems, which are either prototypes or proof-of-concept implementations of APC methods; this category includes GRASPR [Wills, 1996] and PARE [Rist, 1994]. The third category includes APC systems that are used to analyze and improve existing software, such as RECOGNIZE [Kozaczynski et al., 1992] and Quilici's method [Quilici, 1994].


Table 7: The purposes and goals of selected APC systems.

System            Purpose
UNPROG            A general-purpose APC that can be used in CASE tools, research, etc.
RECOGNIZE         Used in a Cobol software renovation environment to improve large Cobol systems.
Quilici's method  Finding common operations and objects in C code in order to translate them into C++.
GRASPR            A research system for plan-graph-based APC.
PARE              A research system for finding the plan structure of a program.

The APC systems mirror human program comprehenders: the long-term knowledge stored in the human brain corresponds to a programming knowledge repository, our term for the long-term storage of programming knowledge in APC systems. Table 8 presents a summary of the programming knowledge repositories of the APC systems discussed in this Section. The APC systems represent knowledge as data structures, which form a tree-like hierarchy of interrelated programming plans. The programming plans are presented as constraints that describe which kinds of programming constructs might be instances of a plan.

The program representation of an APC system can be viewed as a concept similar to the mental model of human program comprehension models. The program representations of all the APC systems share the feature of being related to the knowledge repository through some interface. The reason is obvious: there must be a way to compare the knowledge in the repository to the information that is gathered about the examined program. The reader may study the summary of the program representations in Table 9 by cross-examining it with Table 8, which contains information about knowledge repositories.


Table 8: Programming knowledge repositories of selected APC systems.

System            Overview of Knowledge Repository
UNPROG            Standard implementation plans (STIMPs), represented by the hierarchical program model (HMODEL), an abstraction of control and data flow that is organized as a tree.
RECOGNIZE         Abstract concepts stored in a concept model; includes concept recognition rules, i.e., information about the components of a concept's sub-concepts and constraints on and among the sub-concepts.
Quilici's method  Plan definitions encoded as frames, listing plan attributes; plan recognition rules listing components and constraints.
GRASPR            Clichés; program properties encoded in a graph grammar.
PARE              No explicit knowledge repository.

Table 9: Program representations of selected APC systems.

System            Overview of Program Representation
UNPROG            A tree-like hierarchical program model (HMODEL).
RECOGNIZE         An AST onto which programming concepts are added.
Quilici's method  An AST whose nodes are frames.
GRASPR            An annotated flow graph.
PARE              A tree of backward slices.

The human knowledge acquisition process is represented in the APC systems by the program comprehension algorithms. The algorithms summarized in Table 10 can be categorized according to the way they process the program representation: a top-down algorithm processes an AST from the root to the leaves, and a bottom-up algorithm in the opposite direction. The approach used in GRASPR [Wills, 1996] is notably different: it is based on the parsing of graph grammars.


Table 10: Overviews of the algorithms of selected APC systems.

System            Overview of Algorithm
UNPROG            HMATCH: compares STIMPs with the program representation top-down.
RECOGNIZE         Top-down traversal of the AST; nodes may trigger the evaluation of a concept in the concept model.
Quilici's method  A hybrid top-down, bottom-up algorithm.
GRASPR            A graph grammar parsing based algorithm.
PARE              A rule-based bottom-up algorithm.

2.3 Assessing Roles of Variables with the RoleChecker

In an independent study, Bishop and Johnson [2005] present the RoleChecker, a system that automatically assesses the role assignments that students have made for the variables in a program. The RoleChecker is implemented as a plugin for the BlueJ programming environment, which is intended for novice programmers who are learning to program [Kölling et al., 2003].

The RoleChecker uses a static program analysis technique called program slicing, which was pioneered by Weiser [1981]. In program slicing the purpose is to create a partial program, a program slice, which is a reduced program that includes a part of the functionality or behavior of the original program. The slice can be created, for example, by selecting all statements that affect the value of a certain variable. According to Weiser a program slice has two important characteristics: a slice is obtained through statement deletion, and the slice must be an executable program whose behavior matches that of the original program. In other words, slicing may not alter the behavior of the code; it only removes code that is not essential.

The RoleChecker produces a program slice for each variable. The system then performs data flow analysis on the slices to find the data flow properties of the variables. Bishop and Johnson do not explain what kind of properties they are looking for, or what kind of data flow analyses they use.

Bishop and Johnson [2005] mention that they processed the analysis data with a tool that summarized how the variables of example programs were assigned and used.


The summary was then used to create analyses that suggest variable roles. Bishop and Johnson have created 21 rules, boolean-valued predicates that make different assertions about variables, e.g., the variable is assigned in a loop, the variable is of type array, or the variable is directly toggled within a loop. The 21 rules are used to define a failure condition for each of the roles. The failure condition of a role indicates when the role does not apply. A failure condition is built from the rules, which are connected by Boolean logic operators; in other words, a failure condition is a set of rule conditions that, when met, indicate that the role does not apply. In this context Bishop and Johnson call the rules subconditions.

The failure conditions are used to produce feedback for students assigning roles to variables. Each of the subconditions is associated with a role-specific error message. If a failure condition is met, the messages associated with each of the failed subconditions are passed to the student. The messages are prioritized, as some failure conditions are regarded as more important than others, and the messages associated with the more important subconditions are passed in favor of other messages. Bishop and Johnson do not give further details of the prioritization of the messages or the hierarchy of the subconditions.

The RoleChecker processes the procedural parts of Java programs, and according to Bishop and Johnson [2005] it assigns roles correctly in 17 unspecified programs available from the Roles of Variables Homepage [Sajaniemi, 2006]. Bishop and Johnson do not specify which programs they used and do not present any analysis of the performance of their tool. The tool is announced to be available for download on the web [Bishop and Johnson, 2005], but at the time of writing this thesis (December 2006) it has not been available at the published address.
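Although Bishop and Johnson do not publish their rule set, the general mechanism of combining boolean subconditions into a failure condition can be sketched as follows. The variable facts, the subconditions, and the feedback messages below are invented examples of our own, not the RoleChecker's actual rules.

    # Sketch of rule-based role checking: boolean subconditions about a
    # variable are combined into a failure condition for one role.
    from dataclasses import dataclass

    @dataclass
    class VariableFacts:
        assigned_in_loop: bool
        assignment_count: int

    def fixed_value_fails(v: VariableFacts) -> list:
        """Messages of the subconditions that rule out a 'fixed value' answer."""
        messages = []
        if v.assignment_count > 1:
            messages.append("a fixed value should be assigned only once")
        if v.assigned_in_loop:
            messages.append("a fixed value should not be reassigned in a loop")
        return messages

    counter = VariableFacts(assigned_in_loop=True, assignment_count=3)
    print(fixed_value_fails(counter))   # two messages explaining the mismatch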

2.4 Automatic Detection of Design Patterns

Design patterns are standardized descriptions of commonly occurring software engineering problems and their solutions [Gamma et al., 1995]. A design pattern is a template that describes a design problem and a general, abstracted solution that can be used to solve the problem. On a conceptual level design patterns resemble the notion of programming plans that represent abstract programming knowledge, like the strategic programming plans of Soloway and Ehrlich [1984] (see Subsection 2.1.3). Design patterns are well documented, and thus they are easily transformed into a suitable knowledge repository that an automatic program analyzer might use.


Design patterns are a relatively recent innovation in software engineering; their emergence is related to the acceptance of object-oriented programming in industry during the 1990s. Against this background it is not surprising that at least some APC research efforts seem to have shifted towards using design patterns to create the knowledge libraries of automated tools that aid designers in their work. Bergenti and Poggi [2000] envision several ways in which an automatic assistant could help a software designer:

• Automatically finding pattern realizations in source code.
• Proposing pattern-specific critiques to enforce design rules or to improve the design.
• Suggesting alternative realizations of patterns to improve the design.
• Proposing patterns for solving design problems.
• Finding recurring design solutions that can perhaps be formulated as new patterns.

In this Section we will briefly examine three approaches that use the concept of design patterns in conjunction with automatic program analysis: a method for analyzing the design and architecture of a system (Subsection 2.4.1), a critiquing system (Subsection 2.4.2), and a general-purpose design pattern finder (Subsection 2.4.3).

2.4.1 The Method of Heuzeroth, Holl, Hogstrom, and Lowe

Heuzeroth et al. [2003] present an approach for detecting design patterns with static and dynamic program analysis, and they outline the implementation of an unnamed analysis tool. The authors' purpose is to create a method for uncovering the design and architecture of a program through the automatic detection of patterns [Heuzeroth et al., 2003, p. 95]. Because of this goal the method focuses on patterns that communicate information between the components of a program.

The static analysis of the patterns is based on algorithms that examine the program representation, i.e., each pattern that can be detected is represented by an algorithm that sets constraints and defines sets of applicable properties. In the terminology of Section 2.2, these algorithms constitute the knowledge repository.


The detection process starts with a static analysis that collects a set of candidate patterns [Heuzeroth et al., 2003, p. 96]. The candidate patterns are found by iterating through all classes and their methods in the program. The analysis defines candidate patterns with predicates that a collection of classes must satisfy in order to become a candidate pattern. The method then uses dynamic analysis to narrow the candidate pattern set: the program is executed and the run-time behavior of the classes and methods of the candidate patterns is observed. The authors call this checking protocol conformance. Run-time behavior that supports the findings of the static analysis turns a candidate pattern into a matched pattern [Heuzeroth et al., 2003, pp. 97–98].

The authors present static analysis algorithms and dynamic analysis outlines for five patterns: Observer, Composite, Mediator, Chain of Responsibility, and Visitor. They report empirical results showing that the patterns the method finds are true matches, but many of the candidate patterns are left unconfirmed by the dynamic analysis [Heuzeroth et al., 2003, pp. 99–102]. The authors propose that data flow analysis methods might help in narrowing the candidate pattern set.
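The static phase can be illustrated with a sketch in which a structural predicate over classes and their methods collects candidate Observer patterns. The class and method names are invented, the predicate is a deliberately crude simplification, and the dynamic protocol-conformance check is omitted.

    # Sketch of the static phase of pattern detection: a predicate over classes
    # and their methods collects candidate (subject, observer) pairs.
    classes = {
        "NewsFeed":  {"methods": {"attach", "detach", "notify_all"}},
        "Reader":    {"methods": {"update"}},
        "Formatter": {"methods": {"format"}},
    }

    def observer_candidates(classes):
        """Pairs of classes that satisfy a crude structural Observer predicate."""
        candidates = []
        for subject, s in classes.items():
            if {"attach", "notify_all"} <= s["methods"]:
                for observer, o in classes.items():
                    if observer != subject and "update" in o["methods"]:
                        candidates.append((subject, observer))
        return candidates

    print(observer_candidates(classes))   # [('NewsFeed', 'Reader')]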

2.4.2 IDEA

Bergenti and Poggi [2000] present IDEA (Interactive Design Assistant), a tool for automatically finding patterns in UML diagrams and producing critiques of the found patterns. The purpose of IDEA is to improve UML designs, that is, to automate the refactoring of UML diagrams to some extent.

The knowledge repository of IDEA consists of a two-part definition for each pattern: a structure template and a collaboration template. The templates are defined as Prolog rules and are connected to design rules, which are stored in a knowledge base.

IDEA starts the pattern detection process by searching UML class diagrams for groups of classes that form potential patterns, comparing the design to the structure templates in its knowledge repository. It then looks for pattern-specific object interaction within the potential patterns by examining UML collaboration diagrams with the help of the collaboration templates. IDEA proceeds by searching for pattern realizations, or patterns derived from those already detected. The final step in the process is to present the results to the user.


IDEA provides the designer with a list of patterns and a to-do list. The to-do list includes the critiques that IDEA finds appropriate for the detected patterns, ordered so that the most important messages are displayed on top. Critiques are based on rule sets for patterns; the rules outline issues of good design and architecture. When IDEA detects a pattern it checks whether the pattern follows the rules and suggests possibilities for improvement if some rule is violated. The critiques can, for example, concern naming, scopes of operations, missing operations, reusability issues, and design issues like pattern co-operation. Bergenti and Poggi [2000] do not present any evaluation or use experiences of the IDEA system, which they state to be at a prototype stage.

2.4.3 SPQR

Smith and Stotts [2003] have created a design pattern recognition method that does not rely on strict pattern definitions but uses a more abstract way of describing patterns. They define elemental design patterns (EDPs) [Smith and Stotts, 2003, p. 215], which embody the basic concepts that patterns are built of. The EDPs are defined formally with rho calculus, which encapsulates the semantics of program constructs. The rho calculus is the computational basis for a theorem prover, which forms the core of the System for Pattern Query and Recognition (SPQR). The EDPs are encoded in rho calculus and stored in a knowledge repository [Smith and Stotts, 2003, pp. 219–221].

SPQR builds a rho calculus representation of the examined source code, and the theorem prover then compares the EDPs to this program representation. If a part of a program's rho calculus representation can be reduced to an EDP, the theorem prover has found a pattern instance. Different variations of code whose rho calculus representations are reducible to EDPs are called EDP isotopes. The authors argue that rho calculus is a good way to represent patterns, as it is concise and new patterns can easily be added to the knowledge repository.

SPQR is intended to be used as a part of different programming and refactoring tools. The output of the system is an XML file that different tools can parse and use as they see fit. The architecture is also modular, with independent parts. The authors report that their performance criterion for SPQR is that it should process a program at the same speed at which a compiler would compile the code.


Smith and Stotts [2003] do not report any use experience or evaluation of SPQR, but they outline future work that would expand the application of SPQR and develop a refactoring framework based on the SPQR approach.

2.5 Reverse Engineering and Refactoring

Chikofsky and Cross [1990, p. 15] define reverse engineering as the process of analyzing a subject system to identify the system's components and their interrelationships, and to create representations of the system in another form or at a higher level of abstraction. Reverse engineering is a process of examination; its main purpose is to gain insight into how the system under examination works in some respect. This definition is similar to the definition of automatic program understanding, even if the point of view of reverse engineering is more technically oriented. This may be because reverse engineering originates in software engineering, whereas the roots of automatic program understanding seem to lie in psychology.

Fowler [2000, p. 13] defines refactoring as "a change made to the internal structure of software to make it easier to understand and cheaper to modify without changing the observable behavior of the software". Furthermore, Fowler defines the verb to refactor as "to restructure software by applying a series of refactorings without changing the observable behavior of the software". The purpose of refactoring may be, for example, the improvement of maintainability, readability, speed of execution, or memory consumption. Fowler presents 72 refactorings in the manner of a catalogue. It should be noted that the definition of restructuring [Chikofsky and Cross, 1990, p. 15] is very close to Fowler's definition of refactoring: restructuring is the transformation of a system from one representation into another at the same relative abstraction level, while preserving the system's external behavior, functionality, and semantics.

Philipps and Rumpe [2001] point out that compiler optimizations are, in fact, a form of low-level refactoring that is in widespread use. Automatic refactoring of legacy software, on the other hand, is not in use, even though the automation of many refactorings might be achievable. Some research has been done on automatic refactoring and on providing semi-automatic tools that make refactoring easier.


In the rest of this Section we examine two approaches. First, Subsection 2.5.1 describes a reverse engineering approach that uses software metrics to find redundant code. Then, Subsection 2.5.2 introduces RefactoringCrawler, a system that helps a programmer to improve a program by looking for refactoring opportunities.

2.5.1 Function Clone Detection with Metrics

Mayrand et al. [1996] use a set of 21 software metrics to automatically find function clones in large software systems. Function clones are exact copies or mutations of other functions in the same system; a typical clone is created by copying a function, modifying it slightly, and perhaps renaming it. Clone detection can be used in the development of large systems to find redundant code or opportunities for refactoring. Other uses include plagiarism detection in student programming projects.

The general idea of clone detection with metrics is simple: calculate a metrics-based signature for each function in the examined code and compare the signatures; similar signatures indicate various levels of cloning. The metrics are calculated with a source code analysis tool, which computes the metrics in a language-independent way by transforming the source program into a language-independent representation based on an abstract syntax tree and a separate intermediate representation language.

The clone detector uses four points of comparison: the name of the function, layout metrics, expression metrics, and control flow metrics [Mayrand et al., 1996, pp. 247–248]. The name of a function usually reflects its implementation and operation, so similarly named functions may be clones of each other. Layout metrics produce data about the visual layout of the source code, such as indentation, variable names, and the amount of comments. Expression metrics deal with the amount, complexity, and nature of the expressions in a function. Control flow metrics describe the properties of the control flow in a function: the number of loops, the average nesting level, and the number of decisions.

There are three levels of similarity between the metrics calculated from two functions. Equal indicates that the metrics of the two functions match exactly. Each metric has a delta value, which indicates the absolute difference that is still considered equal between two metric values. Similar indicates that the absolute differences of the metrics are within the delta parameters assigned to the metrics. Two functions are distinct if at least one measured metric difference is bigger than the delta value assigned to it.


Clones are classified according to eight strategies, to each of which one or more of the points of comparison apply [Mayrand et al., 1996, pp. 248–249]. The strategies are defined by specifying which points of comparison are used and how their levels of similarity are interpreted. For example, the DistinctLayout strategy does not compare the names of the functions, expects that at least one layout metric is outside its delta parameter, and requires that the expression and control flow metrics have equal values.

The authors report that their method detects clones well when all of the metrics are used to compare two functions. The detection process becomes more unreliable, producing false accusations, when not all of the points of comparison are included in the detection. The time consumption of the method is polynomial. The authors plan to optimize the method by experimenting with relative delta values.
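The equal/similar/distinct comparison of metric signatures can be sketched as follows; the metrics and delta values are invented placeholders, not the 21 metrics of Mayrand et al.

    # Sketch of metrics-based clone detection: each function gets a signature of
    # metric values, compared metric by metric against delta parameters.
    DELTAS = {"statements": 2, "decisions": 1, "nesting": 0}

    def compare(sig_a, sig_b):
        """Classify two metric signatures as 'equal', 'similar' or 'distinct'."""
        if sig_a == sig_b:
            return "equal"
        if all(abs(sig_a[m] - sig_b[m]) <= DELTAS[m] for m in DELTAS):
            return "similar"
        return "distinct"

    original = {"statements": 12, "decisions": 3, "nesting": 2}
    renamed  = {"statements": 12, "decisions": 3, "nesting": 2}
    tweaked  = {"statements": 13, "decisions": 3, "nesting": 2}
    other    = {"statements": 40, "decisions": 9, "nesting": 4}

    print(compare(original, renamed))   # equal
    print(compare(original, tweaked))   # similar
    print(compare(original, other))     # distinct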

2.5.2 RefactoringCrawler

In record-and-replay refactoring, developers' refactoring activities are recorded, and the recording is replayed by an automated refactoring tool that migrates a deployed version of the software into an updated, refactored version. This approach has a shortcoming: not all developers use refactoring tools that record their activities. Dig and Johnson [2006b] propose a tool that automatically gathers refactorings and applies them, much in the same manner as the record-and-replay method: the tool gathers a log of refactorings and then applies them. The purpose of automatic refactoring is to make the automatic upgrade of component-based software possible; in modern component-based open-source software development, different components evolve at their own pace.

Dig and Johnson [2006b] outline an algorithm that finds the refactorings between two versions of a component. The algorithm is implemented in RefactoringCrawler, an Eclipse plugin. The detection of refactorings is a two-phase process. The first phase is a syntactic analysis, which finds possible refactorings by comparing the shingles of two parts of the examined programs. Shingles are a method of encoding strings into identifying fingerprints. The difference between the shingle encodings of two different character sequences can be compared: the magnitude of the difference indicates how much the two strings differ.


The comparison behaves linearly: if two shingle encodings represent similar strings, the value representing their difference is smaller than if the strings are distinct. The second phase is a semantic analysis, which uses seven different strategies to find refactorings. The seven strategies were selected on the basis of a case study in which Dig and Johnson [2006a] analyzed how programs are refactored by examining several industrial-size programs. The most frequently encountered refactorings were RenameMethod, RenameClass, RenamePackage, MoveMethod, ChangeMethodSignature, PullUpMethod, and PushDownMethod. In the semantic analysis, different versions of the code are compared with the help of seven strategies that reflect the properties of these seven refactorings. The strategies are implemented as algorithms that examine the program representation. If the versions differ by more than a user-specified threshold, the system marks the new version as a refactoring. The algorithm that performs the second phase is a fixed-point algorithm: it reruns all strategies as long as new matches occur, since the strategies affect each other.

The authors manually charted the refactorings of three legacy systems and then ran RefactoringCrawler on the same material [Dig and Johnson, 2006b]. The automatic detector found over 85% of the refactorings that the human detectors had found, and the system reported less than 10% false negatives.
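The shingle-based comparison of the first phase can be illustrated with a sketch that encodes a string as the set of its k-grams and measures their overlap. This shows the general shingles idea only; RefactoringCrawler's actual encoding, thresholds, and granularity differ in detail.

    # Sketch of shingle-based similarity between two pieces of code text.
    def shingles(text: str, k: int = 4) -> set:
        tokens = "".join(text.split())           # ignore whitespace differences
        return {tokens[i:i + k] for i in range(len(tokens) - k + 1)}

    def similarity(a: str, b: str) -> float:
        """Jaccard overlap of the two shingle sets (1.0 = identical)."""
        sa, sb = shingles(a), shingles(b)
        return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

    old = "public void printReport(Report r) { r.render(out); }"
    new = "public void printReport(Report report) { report.render(out); }"
    unrelated = "private int max(int a, int b) { return a > b ? a : b; }"

    print(round(similarity(old, new), 2))        # high: probably the same method
    print(round(similarity(old, unrelated), 2))  # low: unrelated code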

2.6 Static Program Analysis Based Software Visualization

Price et al. [1993] define software visualization as "the use of the crafts of typography, graphic design, animation, and cinematography with modern human-computer interaction technology to facilitate both the human understanding and effective use of computer software". This definition covers various aspects of software visualization, such as algorithm animation. Software visualization systems work by looking for events in the source program that are then reflected to the user as some visual representation [Henry et al., 1990, p. 223]. The events are marked with hookpoints that the visualization system knows how to visualize.

Intrusive visualization systems rely on human intervention to find the events of source programs. Control-intrusive hookpoints are procedure calls or other commands embedded within the source program; they are added by hand and obviously disrupt the original source code. Data-intrusive hookpoints are added by changing the libraries that the source code uses when it is run. This approach works especially well in object-oriented systems, where visualized operations may be overridden by methods that have embedded visualization functionality.


The PlanAni program animator [Sajaniemi and Kuittinen, 2004] is an example of control-intrusive hookpointing, and the Jeliot program animator family [Lahtinen et al., 1998] is based on data-intrusive hookpoints.

Non-intrusive software visualization uses syntactic and even semantic analysis of the source program to find hookpoints. A common example of non-intrusive visualization is pretty-printing of source code: formatting the code in such a way that it is easier to read and understand. This formatting may include indentation and the positioning of braces, as well as syntax highlighting; for example, reserved words may be shown in blue and comments in red. Pretty-printing can be done by performing a syntax analysis of the program and providing the displaying software with coloring instructions. Software visualization that uses automatic program understanding (see Section 2.2) to find events and hookpoints is more complicated. In the software visualization taxonomy of Price, Baecker, and Small [1993], visualization systems that build some kind of understanding of the visualized program are considered to have a high level of AI capacity; the authors report that such systems are quite rare.

The rest of this Section briefly introduces two non-intrusive software visualization approaches. Subsection 2.6.1 describes UWIC, a visualization system that relies on traditional static program analysis methods. Subsection 2.6.2 presents a visualization approach that uses shape analysis, an emerging program analysis technique.

2.6.1 The University of Washington Illustrating Compiler (UWIC)

The University of Washington Illustrating Compiler (UWIC) is a non-intrusive software visualization system that is based on automatic program understanding [Henry et al., 1990]. UWIC visualizes the data structures in short Pascal programs by automatically processing the declarations and operations in them, and it relies on static program analysis to recognize program patterns that it knows how to visualize. UWIC processes a subset of Pascal that does not include pointers, type declarations, records, or Booleans [Henry et al., 1990, p. 224]. In addition, UWIC assumes that a procedure encapsulates the program region to be visualized; it does not process procedure calls or deal with parameters.

The architecture of UWIC consists of several parts. The inferencer performs a static analysis on the source program and inserts hookpoints.


The layout strategist decides which parts of the program are included in the visualization, and the data illustrator draws them on the user interface. The interpreter interprets the program and provides the data illustrator and a source code illustrator with drawing instructions.

The inferencer uses an attributed abstract syntax tree as an intermediate representation of the source program [Henry et al., 1990, p. 226]. The inferencer has a data flow analyzer component that calculates liveness, definition-use, and use-definition information (see Subsection 5.1 for more information about data flow analysis). The analysis results are used to gather information about variables and to separate the events of a variable's lifetime in order to find suitable episodes to visualize. The inferencer also includes a statement pattern matcher that detects idioms, which are program constructs defined by extendable rules [Henry et al., 1990, pp. 226–227]. A typical idiom consists of 1–3 statements. As an example, the inter-statement swap idiom would detect

tmp := a[i]; a[i] := a[j]; a[j] := tmp;

and the intra-statement incr idiom would detect the statement

x := x + 1;

The inferencer also has a component called the subrange finder, which tries to find the range of values that a variable can have. The CDT (concrete data structure) to ADT (abstract data structure) converter assigns an ordered set of ADTs to each CDT on the basis of the collected information [Henry et al., 1990, p. 228]; the most plausible ADT is the foremost candidate. The ADT definitions are included in the rule base of the UWIC system along with the idiom definition rules.

Mukherjea and Stasko [1993, p. 457] point out that building a visualization system that animates arbitrary programs "black-box" style seems to be impossible because of the limitations of automatic program understanding. The approach of UWIC, which is to have a set of visualizable abstract data structure patterns, works around the limitations of program understanding: the system knows how to visualize most data structures but may encounter some that it cannot visualize. In fact, Price et al. [1993] consider the approach of UWIC a high-level AI solution, which according to them is quite uncommon among visualization software.
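The idiom detection described above can be sketched with simple statement patterns. UWIC matches idioms on an attributed abstract syntax tree; the regular expressions below are used only to keep the illustration short, and they cover exactly the two example idioms shown above.

    # Sketch of detecting the incr and swap idioms on Pascal-like statements.
    import re

    INCR = re.compile(r"^\s*(\w+)\s*:=\s*\1\s*\+\s*1\s*$")   # x := x + 1

    def is_incr(statement: str):
        """Return the incremented variable name, or None."""
        m = INCR.match(statement)
        return m.group(1) if m else None

    def is_swap(s1: str, s2: str, s3: str):
        """tmp := a; a := b; b := tmp  (a three-statement inter-statement idiom)."""
        m1 = re.match(r"^\s*(\w+)\s*:=\s*(.+?)\s*$", s1)
        m2 = re.match(r"^\s*(.+?)\s*:=\s*(.+?)\s*$", s2)
        m3 = re.match(r"^\s*(.+?)\s*:=\s*(\w+)\s*$", s3)
        return (m1 and m2 and m3
                and m1.group(2) == m2.group(1)      # tmp receives a, a is redefined
                and m2.group(2) == m3.group(1)      # a receives b, b is redefined
                and m3.group(2) == m1.group(1))     # b receives tmp

    print(is_incr("i := i + 1"))                                  # 'i'
    print(is_swap("tmp := a[i]", "a[i] := a[j]", "a[j] := tmp"))  # True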

2.6.2 Algorithm Animation with Shape Analysis

Johannes et al. [2005] outline an algorithm visualization approach that is based on the abstract execution of an algorithm. The approach differs from mainstream program visualization in that it does not visualize concrete executions of programs but abstract ones. The static program analysis technique that underlies the approach is shape analysis, which analyzes the possible heap states that may occur during program execution [Nielson et al., 1998, pp. 102–107]. The analysis produces shape graphs, labeled directed graphs in which nodes represent heap cells and edges represent references to them. A shape graph is a generalization of the heap state at a certain point of execution of a program. A shape analysis produces many shape graphs for a program: a set of graphs represents each point of execution, where a point of execution can be located, for example, after the processing of each statement. Johannes et al. [2005, pp. 21–23] explain a method that partitions and transforms the set of shape graphs into a set of graphs more suitable for visualization purposes. The process involves finding and merging similar shape graphs into classes of graphs, and examining the control flow to tie the classes together.

Johannes et al. [2005] use an analysis tool called TVLA (Three-Valued-Logic Analyzer) to produce the shape graphs. To produce a shape graph, TVLA must be given a specification of the analysis, which states what kinds of things are included and how events related to them are visualized. Johannes et al. propose that the analysis should be specified to be as explanatory as possible, i.e., shape graphs illustrating the heap structure of tree data structures should look like trees.

Alexa is a program animator that displays code and a corresponding shape graph side by side [Bieber, 2001]. Alexa gets the shape graphs from the TVLA analysis tool and generates the animation from the graphs. Alexa is capable of animating singly linked lists, and it both visualizes the heap state as a graph and animates the transitions between program points [Johannes et al., 2005, p. 25]. Johannes et al. state that work on a more advanced program animator capable of animating complex data structures is underway.


2.7 Software Metrics Based Automatic Assessment

In large CS courses the teaching staff may be overwhelmed by the sheer number of programs that students submit for evaluation. Instant feedback is, however, important for pedagogical reasons, so the automation of the assessment of students' programs is an appealing idea. In this Section we take a look at two approaches in which software metrics are used to analyze programs. Software metrics are measurable program properties, such as the number of statements or the number of functions and procedures. A more complex example is McCabe's cyclomatic complexity, which measures the number of linearly independent paths through a program's source code [McCabe and Butler, 1989]. Our point of view is to see metrics as a way to automatically discover which properties the analyzed programs have. The metrics approach has much in common with the automatic program understanding of Section 2.2 that is based on the recognition of programming plans: both are static program analysis methods that define program patterns and specify a way to search programs for these patterns. The presence or absence of the patterns is then interpreted to indicate something about the program.

Mengel and Yerramilli [1999] report a pilot study in which a metrics-based static analysis of programs written by freshmen is compared to the opinion of expert programmers with respect to quality. The authors use a commercial analysis tool to produce five metric values from the students' programs; the values represent different attributes of the programs. They then calculate a linear regression with quality as the dependent variable and each of the metrics as independent variables. The metrics used in the study are McCabe cyclomatic complexity, the number of functions, the number of statements, the number of nesting levels, and the number of comment characters per statement. The purpose of the analysis approach is to generate metrics-based estimates of the quality of students' programs using a relatively simple procedure and easily available commercial tools. In addition to helping the teacher analyze single assignments, the approach can be used to follow the progress of students' programming abilities: if students write increasingly complex programs, then future assignments can be designed to rectify this tendency. Another possibility is to use the metrics collected during a semester to provide a database of metrics that students and teachers can use for comparison purposes: a student may compare her program's attributes to those of her peers, and teachers may use the collected metrics to get an overview of student progress.


The authors call the results of their analysis encouraging and envision the standardization of metrics and their use in an automated grading system.

Truong et al. [2004] present a system that analyzes students' Java programs to provide semi-automatic assessment and tutoring. The purpose of the tool is to provide intelligent assistance both to students and to teaching staff. The system examines the short compilable programs that students write with respect to structure and quality. The structural similarity analysis compares the structure of student code to that of model solutions: student code and the model solutions are transformed into abstract pseudo code that is comparable. The program representation is an XML-based abstract syntax tree, which represents the abstract algorithmic structure of the programs. The authors note that this approach only works for simple introductory programs. If the system does not find a match when it compares a student program to the model solutions, it sends the student code to an instructor, who then does the comparison manually. If the manually checked answer is correct, it too can be added to the set of model solutions, providing a larger base for similarity analysis. The structural analysis can be configured from exact matching to relative matching, where the level of abstraction is higher.

The quality of the students' submitted code is examined through a set of software complexity metrics in order to find poor programming practices and logic errors. The quality metrics are calculated from the program representation. The authors seek to detect a set of poor programming practices that they have identified through the literature and experimentation. Some of the metrics they use are cyclomatic complexity, unused parameters, unused variables, characters per line, and redundant logic expressions.

To use the tool, the teacher specifies fill-in-the-gap exercises, which in this case consist of small semi-independent code fragments. The teacher specifies the metrics and structural properties that the correct solution has by creating a model solution and by optionally specifying which metrics are used in the qualitative analysis of the answer. The teacher may specify answers that are sent to students according to the analysis results, or alternatively the results of the analysis may be sent to the teacher, who can then write a response. The authors name four limitations of the system: it only analyzes small code snippets, there are additional restrictions on the form of the code, there is no semantic analysis, and the tool only analyzes syntactically correct programs.
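The metrics-based approach can be illustrated with a sketch that counts a few crude metrics from source text and fits a one-variable least-squares line of quality scores against one of them. The metrics, the code snippets, and the scores are toy data of our own and do not correspond to either of the studies above, which use richer metric sets and real student submissions.

    # Sketch of metrics extraction and a simple quality regression.
    def metrics(source: str) -> dict:
        statements = source.count(";")
        decisions = sum(source.count(k) for k in ("if ", "while ", "for "))
        return {"statements": statements,
                "decisions": decisions,
                "cyclomatic": decisions + 1}   # McCabe number for one routine

    def least_squares(xs, ys):
        """Slope and intercept of a one-variable least-squares fit."""
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
                 / sum((x - mx) ** 2 for x in xs))
        return slope, my - slope * mx

    submissions = [
        ("if x > 0 then y := 1; z := 2;", 4.5),
        ("while i < n do begin if a[i] > m then m := a[i]; i := i + 1; end;", 4.0),
        ("if a then if b then if c then d := 1; e := 2; f := 3;", 2.5),
    ]

    xs = [metrics(src)["cyclomatic"] for src, _ in submissions]
    ys = [grade for _, grade in submissions]
    print(xs, least_squares(xs, ys))   # [2, 3, 4] and a negative slope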


3 Automatic Detection of Variable Roles

In this Chapter we first present an outline of our system for automatically detecting the roles of variables (Section 3.1). Then, in Section 3.2, we compare our approach to the ideas, concepts, and systems presented in the literature survey of Chapter 2.

3.1 Overview of the Detection Process

The purpose of the automatic detection of variable roles is to find the roles of variables (see Section 1.1) in arbitrary programs that include no role information. We restrict ourselves to detecting only the eight roles that apply to all variables and do not try to detect the three roles that apply to data structures. Furthermore, we do not detect role changes.

Roles are cognitive constructs and represent human programming knowledge. Different people may think of roles differently, and the exact definition of what behavior constitutes a certain role is a matter of opinion and point of view. Consider a program that, for some reason, calculates both the Fibonacci sequence and the decimals of pi. The variable fib goes through the Fibonacci sequence, and the variable pi goes through the decimals of pi. A mathematician thinks that the role of both fib and pi is stepper: they are progressing along a predictable, well-defined, and at least partially known sequence of values. A computer scientist thinks that fib is a stepper; he knows how the sequence is calculated, i.e., it is quite predictable from his point of view. The computer scientist is aware of the formula for pi but remembers from lectures that the decimals of pi have not all been calculated, maybe just a zillion or so. Thus, he thinks that pi is a most-recent holder: for him the variable holds the most recently calculated decimal of the pi sequence. The humanist wondering what the roles of fib and pi are has forgotten her high-school maths. For her the role of both is most-recent holder, or maybe pi is a temporary (a single number can't be a sequence, can it). The ambiguous nature of the role concept is one of its strengths, as programmers can discuss variables through a vocabulary that does not restrict the discussion too much.

The vague nature of cognitive constructs is reflected in the fact that the roles do not have technical definitions. The lack of definitions is the cause of a major challenge in the automatic detection of variable roles: the analysis tries to find traces of the cognitive structures of the programmer in program code.

The fact that human cognition is not exact, whereas program code is, makes this a challenging task.

From our point of view, the APC systems discussed in Section 2.2 mirror human program comprehension, even if they are not explicitly designed to follow any particular human program comprehension model. The automatic detection of variable roles resembles a human trying to assign roles to variables in code. If we think of the roles of variables as programming plans that describe different stereotypical uses of variables, then the automatic detection of roles is plan recognition. How, then, does a human look for plans in code? One characteristic of human program comprehension models is the use of beacons, which are mentioned in most of the models of Section 2.1. To recap: according to Brooks [1983], beacons are program features that indicate the presence of certain program constructs. Beacons help the comprehender to associate a hypothesis about the program with program code [von Mayrhauser and Vans, 1994]. Soloway and Ehrlich [1984] regard beacons as indexes or indicators of programming plans. A human might thus look for code features affecting variables that would serve as beacons into the role plans. In our approach we have included 13 such beacons, which are based on our personal experience in manually assigning roles to variables. We call these beacons flow characteristics (FCs), as they are assertions about how a variable behaves in the data flow of a program. An obvious requirement for the FCs is that they must be automatically detectable: it must be possible to technically specify a way to find each of the FCs. The beacons that human comprehenders use are no doubt more complicated and based on intuition and complex heuristics that are at least partially implicit knowledge.

Next, we outline our technically defined framework for detecting the roles of variables with the help of the FCs. We start the definition by examining the nature of variables. A variable has a lifespan: the set of statements that either refer to or define the variable. The term definition is used here in the customary way of the data flow analysis literature: it refers to the assignment of a new value to a variable, not to the declaration of the variable. The lifespan starts with the first assignment to the variable and ends at the last reference to the variable. Variables that are referred to but never assigned values are not of interest; they do not have roles. A variable goes through a sequence of values during its lifespan. The value sequence may include one or more distinct values, which can have a relation, such as being in arithmetic or geometric progression, be ordered in some other way, or be totally unrelated.
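The notion of a lifespan can be made concrete with a small sketch that extracts, for each variable, the span from its first definition to the last statement that defines or refers to it. The event-list format is an invented illustration; in practice such information is derived from static program analysis, as discussed in Chapter 5.

    # Sketch of lifespan extraction from a per-statement table of definitions
    # and references.
    def lifespans(events):
        """events: a list of (statement_no, variable, 'def' or 'use') tuples."""
        spans = {}
        for stmt, var, kind in events:
            first_def, last_stmt = spans.get(var, (None, None))
            if first_def is None and kind == "def":
                first_def = stmt          # the lifespan starts at the first assignment
            if first_def is not None:
                last_stmt = stmt          # ... and extends to the last def or use
            spans[var] = (first_def, last_stmt)
        # variables that are never assigned a value get no lifespan (and no role)
        return {v: span for v, span in spans.items() if span[0] is not None}

    events = [(1, "n", "use"),     # writeln(n)  -- n is never assigned here
              (2, "i", "def"),     # i := 1
              (3, "i", "use"),     # while i <= 10 do
              (4, "sum", "def"),   #   sum := sum + i
              (4, "sum", "use"),
              (4, "i", "use"),
              (5, "i", "def"),     #   i := i + 1
              (5, "i", "use"),
              (6, "sum", "use")]   # writeln(sum)

    print(lifespans(events))   # {'i': (2, 5), 'sum': (4, 6)}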

totally unrelated. The lifespan of a variable may be divided into episodes, sub-sequences of the statements of the lifespan that form independent value sequences. There are three kinds of episodes. An initial valued loop episode starts with an assignment of a value to a variable that is first referred to and then overwritten in a loop structure. Thus, the value sequence of the episode consists of a pre-loop assigned value, a reference to the value, and the values assigned within the loop body. A loop episode is similar to the initial valued loop episode, but without the initial value: a variable is repeatedly assigned a value in a loop without referring to the value that the variable had before entering the loop. Figure 3 illustrates the two episodes. A trivial episode is simply a set of assignments that are not done within a loop structure. Variable roles and episodes are related: from the point of view of automatic detection, roles are names for episodes.

program episodeExample (input);
var input: integer;     /* loop episode */
    end:   boolean;     /* initial valued loop episode */
begin
  end := false;
  while not end do
    begin
      writeln('Enter a number greater than zero: ');
      read(input);
      if input > 0 then
        end := input > 0
    end
end.

Figure 3: A program where the variable input has a loop episode and the variable end has an initial valued loop episode.

We can now restate the essence of the FCs: they describe the properties of a variable's value sequence within an episode. Each FC is related to a statement in the lifespan of a variable. Table 11 presents the FCs with short descriptions. Each FC is detected with a detection algorithm. The algorithms use various static program analysis results, and some are more complicated than others in terms of computational complexity. The FCs are quite abstract in order to separate them from language-dependent issues: the FCs describe variable properties that commonly exist in procedural and object-oriented programming. See Chapter 5 for thorough descriptions of the FCs and their detection algorithms.

Table 11: Summary of flow characteristics

  Conditional assignment (COA): The variable has an assignment that is done in a branch of a conditional structure.
  Loop assignment (LOA): The variable has an assignment that is done within a loop structure.
  Conditional expression participation (CEP): The variable participates in a conditional expression that may affect its value sequence.
  Loop expression participation (LEP): The variable participates in a loop expression that affects the value sequence of the variable.
  Interrelated value sequence (IVS): The right-hand side of the assignment includes the previous value of the variable being defined.
  Single assignment (SAS): The examined program includes only one assignment statement for the variable.
  Defined sequence (DSE): The variable has a sequence of values that are statically defined.
  Arbitrary sequence (ASE): The variable goes through values that are resolved dynamically at run time.
  Singlepass (SPA): The variable is defined and referred to during a single pass of a loop, and the definition is not referred to in subsequent passes of the loop.
  Abbreviation (ABB): The definitions used to define the variable reach the use of the definition of the variable.
  Following (FOL): The variable's value sequence follows another variable's or variables' value sequence(s).
  Partial following (PFO): The variable's value sequence may follow another variable's or variables' value sequence(s).
  Initial value (IVA): The definition of a variable is done before a loop, where the variable is redefined.
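The FC names above form a small, fixed vocabulary, and a variable-FC profile is simply the subset of that vocabulary detected for one variable. The following sketch is illustrative only (the ADVR system itself is implemented in Tcl, and the example profile is invented, not taken from the role-FC database):

from enum import Enum, auto

class FC(Enum):
    """The thirteen flow characteristics of Table 11."""
    COA = auto()   # Conditional assignment
    LOA = auto()   # Loop assignment
    CEP = auto()   # Conditional expression participation
    LEP = auto()   # Loop expression participation
    IVS = auto()   # Interrelated value sequence
    SAS = auto()   # Single assignment
    DSE = auto()   # Defined sequence
    ASE = auto()   # Arbitrary sequence
    SPA = auto()   # Singlepass
    ABB = auto()   # Abbreviation
    FOL = auto()   # Following
    PFO = auto()   # Partial following
    IVA = auto()   # Initial value

# A variable-FC profile is a set of FCs; the particular combination below is
# invented purely to show the data shape, not a profile learned by the system.
example_profile = frozenset({FC.LOA, FC.LEP, FC.IVS, FC.IVA})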

Our hypothesis is that each variable role has a unique set of FCs, i.e., that the role of a variable can be specified as a set of FCs. This belief is rooted in the fact that roles label variable behavior, which in turn is described by the FCs. Our hypothesis is based

on the results of a validation in which the set of FCs was found to be adequate; see Chapter 7 for details.

We have designed, implemented, and validated a system that performs the automatic detection of variable roles. We will henceforth call it the ADVR system, where ADVR is an acronym for Automatic Detection of Variable Roles. The ADVR system analyzes the value sequences of variables through static program analysis, a collection of compile-time techniques that can be used to make estimations of the run-time behavior of a program [Nielson et al., 1998]. See Chapter 4 for a description of the program representation that the ADVR system uses, and Section 5.1 for information about the static program analysis methods and techniques used in the ADVR system.

There are no exact technical definitions for the roles; indeed, different programmers might give the same variable different roles. On the other hand, there is no telling what kind of variable behavior a human may program; an all-embracing static set of role definitions is not possible. Thus, the ADVR system must recognize the possibility that the role definitions may differ. We have solved this problem by making the ADVR system a machine learning application that can be run in two different modes. In the learning mode, it is provided with role-annotated programs, which it uses to learn what kind of run-time behavior of a variable accounts for each role. In the matching mode, it assigns roles to the variables of a program that has no role annotations. See Chapter 6 for a description of the machine learning features of the ADVR system.

Figure 4 displays the steps (represented by gray rectangles) of the process in both modes. An arrow indicates the flow of information from one step to the next one; a dashed arrow denotes information flow that is present in the learning mode only. The first three steps are identical in both learning and matching; only the final step is different between the two modes.

The first step in the detection process is the scanning and parsing of the program, which results in an abstract representation of the program. During the first step the program is first transformed into an abstract syntax tree (AST), which is a tree representation of the program derived from the parse tree [Aho et al., 1988, p. 49]. Then the ADVR system transforms the AST into a more suitable program representation that is based on a control flow graph (CFG), a directed graph showing the control flow within the program. The CFG represents abstract execution of the program: paths in the CFG correspond to possible execution paths. In addition to the CFG, the program representation used in the ADVR system stores other program information as well; see Chapter 4 for details.

[Figure 4 diagram: steps 1: Scanning and parsing, 2: Data flow analysis, 3: Flow characteristics detection, 4a: Learning, 4b: Matching; data items: Input program, Role information, Program representation, Data flow information, Variable-FC profiles, Role-FC database, Variable-role information.]

Figure 4: An overview of the automatic detection of roles.

The purpose of the program representation is to provide a language-independent interface between the scanning and parsing step and the rest of the process. In fact, the first step is the only language-dependent part of the ADVR process: the system can analyze any imperative language that can be represented as the CFG that is described in detail in Chapter 4.

The second step in the process is data flow analysis (DFA), a static program analysis method that uses equations to examine different phenomena in programs [Aho et al., 1988, Kam and Ullman, 1976, Nielson et al., 1998]. The second step performs the data flow analysis on the program representation that the first step creates; the second step is thus not dependent on any specific programming language. The ADVR process relies on a classical data flow analysis of the examined program: reaching definitions analysis. The second step also includes the interpretation of the results by creating definition-use chains. In addition to these two well-known analyses, the ADVR system uses two customized variants of them: forward propagated reaching definitions and definition-use in assignments chains. See Section 5.1 for more information on data flow analysis and the custom analysis variants used by the ADVR system.

The third step in the process of Figure 4, flow characteristics detection, assigns a set of FCs to each variable in the program by analyzing the program representation and DFA information. The third step requires both the program representation and the data flow information that were created in the first and second steps, respectively. The outcome of the third step is a set of variable-FC profiles, each of which lists what FCs apply to


each variable in the examined program. The algorithms that the ADVR system uses are presented in Chapter 5.

The fourth step is different in the learning and matching modes. In the learning mode ("4a: Learning" in Figure 4), the variable-FC profiles are transformed into role-FC descriptions by substituting each variable with its annotated, i.e., user-provided, role. The role-FC descriptions are then inserted into the role-FC database, which accumulates descriptions of the roles as sets of FCs. In the matching mode the fourth step ("4b: Matching") starts with the generation of a classification tree, a data structure that resembles a decision tree. The ADVR system creates the tree by using the role-FC descriptions stored in the role-FC database as input to the ID3 classification tree generation algorithm. The classification tree represents the role-FC database as a binary tree data structure that can be used to determine the role of a variable-FC profile. A path from the root to a leaf of the tree represents a role-FC profile, i.e., defines what FCs apply to a role. The tree may contain several profiles for a single role, which reflects the fact that there may be several different valid definitions for a certain role. The outcome of the fourth step in the matching mode is a set of variable-role pairs, which the ADVR system finds by comparing the variable-FC profiles to the role-FC profiles contained in the classification tree data structure. The role information of the variables in the processed program can then be used by other applications. See Chapter 6 for more information about classification trees and their use in the ADVR system.

To assess the quality of the automatic role detection process, role assignments suggested by the ADVR system were compared with those of human role assigners. The results are presented in detail in Chapter 7. The overall accuracy of the ADVR system was 93%. This can be compared with the accuracy of computer science educators who assigned roles to variables after an introduction to the role concept [Sajaniemi et al., 2006a]; their accuracy with short Pascal programs was 85%. The same programs that the educators analyzed were also given to the ADVR system (using this time both the learning material and the matching material of the above validation as a large learning material). The ADVR system achieved 95% correctness. Thus, the reliability of the ADVR system seems to be comparable to that of computer science educators.

The ADVR system is implemented in Tcl. It is built with the help of Yeti [Pilhofer, 2004], a Yacc-like compiler-compiler [Aho et al., 1988].
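Returning to the two modes of Figure 4, their division of labour can be illustrated with a small sketch. This is not the ADVR implementation (which is written in Tcl and matches profiles through an ID3-generated classification tree); it only shows, with invented names and a plain dictionary lookup in place of the tree, how role-FC profiles gathered in the learning mode could be used in the matching mode:

from collections import defaultdict

# The role-FC database: for each role, the set of FC profiles observed for it.
role_fc_db = defaultdict(set)

def learn(variable_fc_profiles, annotated_roles):
    """Learning mode: replace each variable by its user-provided role."""
    for variable, fcs in variable_fc_profiles.items():
        role_fc_db[annotated_roles[variable]].add(frozenset(fcs))

def match(variable_fc_profiles):
    """Matching mode: look each detected FC set up in the role-FC database."""
    roles = {}
    for variable, fcs in variable_fc_profiles.items():
        candidates = [role for role, profiles in role_fc_db.items()
                      if frozenset(fcs) in profiles]
        roles[variable] = candidates[0] if candidates else None  # None: no role found
    return roles

An exact lookup like this cannot handle a profile that was never seen during learning, which is presumably one reason the real system generalizes the database into a classification tree instead.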


3.2 Comparison with Literature

The ADVR system may be viewed as an APC system: its purpose is to automatically generate information about program code. From the APC point of view, the knowledge repository of the ADVR system is the role-FC database that the learning phase creates. The algorithm of the ADVR system does not directly compare the knowledge repository with the program representation. Instead it extracts information about variables from the program representation, the variable-FC profiles, and uses that information in the comparison.

The major difference between the APC approaches of Section 2.2 and the ADVR system is that the ADVR system uses a well-researched and well-specified definition of programming knowledge, the roles of variables, which it is able to learn. This is a feature that the ADVR system shares with the automatic design pattern matchers, which have a well-documented conception of what programming knowledge is. Another feature that sets the ADVR system apart from the APC approaches is that its knowledge repository and the examined program, i.e., role-FC profiles and variable-FC profiles, can be compared with an existing and well-documented machine learning method instead of a complicated and tailor-made algorithm.

In the terminology of Wills [1996], FCs are names for sets of constraints on control flow and data flow properties of the variables. Detecting each FC with its own algorithm is a more fine-grained approach than, for example, the method of Heuzeroth et al. [2003], which detects pattern candidates with pattern-specific algorithms (see Subsection 2.4.1). The beacon-like approach has been used in automatically detecting design patterns: the SPQR system (see Subsection 2.4.3) uses elemental design patterns (EDPs), which define basic concepts that design patterns are built of. The FCs differ from EDPs in that they are more basic, programming-related concepts when compared with patterns.

The RoleChecker [Bishop and Johnson, 2005, Johnson, 2006] is independent work which resembles our ADVR system. The 21 rules or subconditions that the RoleChecker uses to define failure conditions for role assignments are conceptually close to the FC concept used in the ADVR system. Exact details of the definition of roles differ in the two systems: Bishop and Johnson [2005] mention that they have used some kind of tool to perform an analysis of what rules apply to each role, whereas


the ADVR system uses machine learning as described in Section 3.1. Unfortunately the publications presenting the RoleChecker do not give further details about the implementation of the system. Furthermore, the RoleChecker implementation cannot be found at the location given in [Bishop and Johnson, 2005]; thus a comparative evaluation with the ADVR system has not been possible.

The ADVR system could be used to similar ends as the automatic design pattern analyzers discussed in the introduction of Section 2.4, some of which the IDEA tool realizes (see Subsection 2.4.2). Reverse engineering (see Section 2.5) is another interesting and possible application area of the ADVR system. A variable which has two different roles may, for example, be replaced with two variables to clarify their purpose and to make the program easier to maintain. The RefactoringCrawler refactoring tool of Subsection 2.5.2 looks for bad design. The ADVR system could be used to detect unorthodox variable uses, which could be sources of error or cause confusion for the program comprehender. A concrete example of role information that may suggest bad design is a variable for which the automatic role analysis does not find a role. In such cases it may be assumed that the use of the variable is complex and unorthodox. The variables that could not be assigned roles could be pointed out for possible refactoring.

The function clone detector of Mayrand et al. [1996] (see Subsection 2.5.1) uses metrics to describe a function and then proceeds to see if two functions share the same metrics signature. Similar metrics suggest similarities in function data flow, control flow or textual representation. From this point of view, the ADVR learning and matching process tries to find variables with similar FC signatures: the learning process defines what the signature of a variable with a certain role is like, and the matching process tries to find "clones" with a similar FC signature.

The roles of variables concept has been used in the PlanAni program animator with empirically demonstrated positive effects on learning to program [Sajaniemi and Kuittinen, 2005, Byckling and Sajaniemi, 2006]. The ADVR system makes non-intrusive role-based program animation possible: the ADVR system could annotate programs with their roles, which the PlanAni system could then animate. Indeed, the research that this thesis documents was originally begun to create a component for


the PlanAni system. When compared to the non-intrusive UWIC visualization system (Subsection 2.6.1), which recognizes certain program patterns that it knows how to visualize, the ADVR-enhanced PlanAni would know how to animate recurring variable behavior. As the roles of variables seem to cover most variables that might be encountered in programs, the applicability of the ADVR-based approach would be better than that of the UWIC.

ADVR-based visualization shares the basic idea of the shape-analysis-based visualization described in Subsection 2.6.2. Both approaches perform an abstract execution of the program to be animated. The shape-based approach uses the TVLA analyzer to produce shape graphs to be used as the basis of animations, and the ADVR system could provide role information that PlanAni could animate. The difference between the two approaches is most evident at the interface between the program analysis system and the animator system. The roles of variables is a well-specified, publicly available concept that has been developed for educational purposes, whereas shape graphs are technical results of a static program analysis, which must be further processed for visualization purposes. Another source of difference between role- and shape-based animation is the amount of preparation required to create an animation. A shape-based animation requires a specification of what is to be analyzed. A role-based animation can be run on an arbitrary program without preprocessing, assuming that the ADVR system has been trained to detect roles. From a program analysis point of view, automatically produced role information is, in fact, estimated information about the run-time behavior of variables that is categorized into 11 categories, i.e., the roles of variables. As a side note, the future development of the ADVR system could benefit from the use of shape analysis, as it is well suited for the analysis of pointer structures.

Assessment of student code is based on some understanding of what the code does. The approach of Subsection 2.7 outlines one solution: to define the assessment through different metrics that in turn can be automatically computed. The ADVR system could be used as a source for such metrics: a teacher may define what roles might be needed to solve a programming problem, and automated role detection could provide information about whether the student has implemented code that realizes those roles.
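As a sketch of this last idea (the function and the role lists below are hypothetical; no such assessment component exists in the ADVR system), an automated check could simply compare the roles a teacher expects against the roles detected in a student's submission:

from collections import Counter

def missing_roles(expected, detected):
    """Roles the teacher expected but the role detector did not find."""
    return list((Counter(expected) - Counter(detected)).elements())

# Hypothetical exercise: the model solution uses two steppers and a fixed value.
expected = ["stepper", "stepper", "fixed value"]
detected = ["stepper", "most-recent holder"]
print(missing_roles(expected, detected))   # ['stepper', 'fixed value']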


4 Program Representation

In order to make the analysis efficient, static program analysis studies abstractions of programs. A program representation is an abstract description of the program created from source code, which typically includes both syntactic and semantic information about the program. Finiteness is an important property of a program representation: it makes the analysis of the represented program computationally feasible.

The first Section of this Chapter, 4.1, presents the two crucial data structures that form the basis of the program representation that the ADVR system creates from the programs it gets as input: the symbol table and the control flow graph. The purpose of the Section is to explain the general principles of the program representation, not to explain it in technical detail. Thus, the Section omits the descriptions of some properties of the program representation, and does not describe how the data structures are created. We believe that this lighter approach will provide the reader with enough information to understand the later Chapters. Section 4.2 describes an interface to the program representation that the FC detection algorithms described in Chapter 5 use. We recommend that the reader glance through that Section briefly and return to it later, if needed, when examining the FC detection algorithms.

4.1 Data Structures

The program representation (PR) that the ADVR system uses is based on an annotated CFG: the nodes of the CFG include information about statements, and the edges between the nodes represent possible flows of control in the program. We will use the abbreviation ADVR PR to denote the ADVR system program representation. There may be several separate CFGs that represent the analyzed program within the ADVR PR. In addition to the CFG, the ADVR PR includes a symbol table, which records data about variables and constants, among other things. In the CFG, variables and constants are represented by references to the symbol table. In this Chapter we use variableN and constantN, where N is an integer, as references to the symbol table.

The purpose of the program representation is to make the analysis language independent, which makes it easier to extend the capabilities of the ADVR system. A new language can be added to those that the ADVR system can analyze simply by creating a front-end that transforms the program into the ADVR system program representation.

The ADVR PR has been designed to facilitate the examination of variable behavior. Thus, some things that usually are of interest are disregarded: for example, function return values are not represented in the ADVR PR. The PR seeks to faithfully preserve variable behavior. We will use object-oriented terminology when discussing the program representation that the ADVR system uses, as the program representation is implemented in an object-oriented fashion.

In the ADVR program representation each node of the CFG represents a statement. Sometimes a statement of a programming language may be divided into several nodes, or several statements, in the CFG. We will use the term statement as a synonym for the term node when speaking of the ADVR program representation. All of the statements have a common abstract superclass: Statement. The superclass defines three properties that its subclasses may have: assigning a value to a variable, referring to variables, and containing an expression. The superclass Statement has four concrete subclasses that represent the different kinds of statements recognized by the ADVR PR: AssignmentStatement, ConditionalStatement, LoopStatement, and UseStatement.

The class AssignmentStatement represents the assignment of a value to a single variable. It has the attribute assign, which stores a reference to a variable in the symbol table. The assigned value is stored in the attribute expression, which is defined in the ADVR PR as a list that may contain references to variables and constants. In addition, an expression may contain two special markers that are not references to instances: input and function. The marker input represents the fact that the value is obtained as input to the program. A concrete example where the input marker is used would be the Pascal statement readln(x), which is an assignment of user input to the variable x. The function marker indicates that the expression contains a call to a function; for example, the right-hand side of the statement x := ord(y) would be represented by a variable reference for y and the marker function.

Figure 5 shows a short program, which includes three assignment statements on lines 5, 6, and 7. The symbol table for the program, displayed in Table 12, contains five entries, one on each row.


The column labeled "Reference Id" shows the reference that is used to refer to the entry; for example, variable1 refers to the variable x. The "Type" column indicates whether the symbol table entry is a variable or a constant, and the "Datatype" column stores the datatype of the constant or variable. The "Name" column stores the name of the entry that is used in the program code, if the entry has a name. The first three rows of the symbol table have names, but the last two do not. The "Scoped Name" column stores the name with scope information. For example, the scoped name of variable1 (x) is assignmentexample-x. The scoped name is used when the results of the automatic role analysis are presented to a user. Constants do not have scoped names as they are not assigned roles. The last column of the symbol table, "Value", indicates the value of a constant or the initial value of a variable. The variables in Table 12 do not have initial values.

Table 12: The partial symbol table for the program assignmentexample of Figure 5.

  Reference Id   Type       Datatype   Name   Scoped Name           Value
  variable1      variable   integer    x      assignmentexample-x   -
  variable2      variable   integer    y      assignmentexample-y   -
  constant1      constant   integer    MIN    -                     100
  constant2      constant   integer    -      -                     1
  constant3      constant   integer    -      -                     10

The ADVR PR CFG of the program assignmentexample of Figure 5 is shown in Figure 6. The rectangles in Figure 6 represent nodes in the CFG, and the arrows represent the edges between the nodes that indicate possible control flow. In the CFG of Figure 6 there is only one possible sequence in which the statements can be executed. Each of the rectangles represents an instance of one of the subclasses of class Statement. The uppermost partition of a rectangle stores the instance name, shows the class of the instance, and indicates the line number of the statement that this node has been created from. For example, the notation asgn1 : AssignmentStatement (5) means that the name of the instance is asgn1 and it is an instance of the AssignmentStatement class. The number in parentheses, (5) in this case, indicates that asgn1 has been created from statement 5 in Figure 5. The lower area of an instance rectangle includes attribute information; in the case of Figure 6 the three instances all have two attributes, assign and expression. Only those attributes that have values and are relevant to the example are shown, for simplicity.
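Figures 5 and 6 below give the author's concrete example. As a rough companion sketch (illustrative only; the field names are invented and the real ADVR PR is implemented in Tcl), the same symbol-table entries and AssignmentStatement nodes could be modelled like this:

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SymbolTableEntry:
    kind: str                       # "variable" or "constant"
    datatype: str
    name: Optional[str] = None
    scoped_name: Optional[str] = None
    value: Optional[object] = None

@dataclass
class AssignmentStatement:
    line: int
    assign: str                     # symbol-table reference, e.g. "variable1"
    expression: list                # references plus the markers "input"/"function"
    successors: list = field(default_factory=list)   # outgoing CFG edges

# A few entries of Table 12 (abridged; constant1 and constant3 omitted).
symbols = {
    "variable1": SymbolTableEntry("variable", "integer", "x", "assignmentexample-x"),
    "variable2": SymbolTableEntry("variable", "integer", "y", "assignmentexample-y"),
    "constant2": SymbolTableEntry("constant", "integer", value=1),
}

# The three CFG nodes of Figure 6, chained in execution order.
asgn3 = AssignmentStatement(7, "variable2", ["variable1", "constant3"])
asgn2 = AssignmentStatement(6, "variable1", ["input"], successors=[asgn3])
asgn1 = AssignmentStatement(5, "variable1", ["constant1", "constant2"], successors=[asgn2])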

(1) program assignmentexample (input);
(2) var x, y: integer;
(3) const MIN = 0;
(4) begin;
(5)   x := MIN + 1;
(6)   readln(x);
(7)   y := 10 * x
(8) end.

Figure 5: The program assignmentexample.

  asgn1 : AssignmentStatement (5)
      assign = variable1
      expression = { constant1, constant2 }

  asgn2 : AssignmentStatement (6)
      assign = variable1
      expression = { input }

  asgn3 : AssignmentStatement (7)
      assign = variable2
      expression = { variable1, constant3 }

Figure 6: The ADVR PR CFG of the program assignmentexample of Figure 5.

The AssignmentStatement instances have the attribute assign, which indicates which variable the instance assigns a value to. In the CFG of Figure 6 the instances asgn1 and asgn2 assign a value to variable1 (x in the program code), and asgn3 defines the value of variable2 (y in the program code). In the AssignmentStatement instances the expression attribute contains the participants of the right-hand side expression, not the actual expression itself. The instance asgn3 in Figure 6 has an expression that contains references to a variable and a constant: variable1 and constant3. The

readln-statement on line 6 of the program in Figure 5 is represented by asgn2; its expression attribute contains the value input, which indicates that the assigned value is provided at runtime, i.e., during the execution of the program.

Conditional statements are represented with the ConditionalStatement class. This class is used to model both if-then-else type conditionals and case statements, which are also known as switch structures in some languages, such as Java. The ConditionalStatement class encapsulates the expression that is evaluated when the program is determining which path to take. In the CFG a ConditionalStatement node has at least two successors, and there is no upper limit to the number of successors. Consider the short program conditionalexample of Figure 7, whose symbol table is in Table 13. The if-then structure of lines 5 and 6 is represented by the program representation construct labeled "(a)" in Figure 8. The PR of the if-then-else structure of lines 7–10 is slightly more complicated; it is labeled "(b)" in Figure 8.


Table 13: The partial symbol table for the program conditionalexample of Figure 7.

  Reference Id   Type       Datatype   Name   Scoped Name            Value
  variable1      variable   integer    x      conditionalexample-x   -
  variable2      variable   integer    y      conditionalexample-y   -
  constant1      constant   integer    MIN    -                      0
  constant2      constant   integer    -      -                      1

( 1) program conditionalexample (input);
( 2) var x, y: integer;
( 3) const MIN = 0;
( 4) begin;
( 5)   if x < MIN then
( 6)     x := 1;
( 7)   if x < y then
( 8)     x := y + 1
( 9)   else
(10)     x := x + 1
(11) end.

Figure 7: A short code passage with two Pascal conditional structures.


In Figure 8 the lines 5 and 7 of the program in Figure 7 are represented by the ConditionalStatement instances cond1 and cond2. Both have the attribute expression, which lists all participants of the expressions that control the conditional structures. In the case of "(a)" in Figure 8 there are two alternative paths through the CFG: after cond1 the control flow can either go to asgn1 or bypass it. The CFG "(b)" has two alternative paths as well: after cond2 the control flow goes to either asgn2 or asgn3.
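A minimal sketch of the branching just described (invented class and field names, mirroring case (b) of Figure 8 below):

from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    successors: list = field(default_factory=list)

# Case (b): after cond2 control flows to either asgn2 (then-branch) or asgn3 (else-branch).
asgn2 = Node("asgn2")
asgn3 = Node("asgn3")
cond2 = Node("cond2", successors=[asgn2, asgn3])

# A ConditionalStatement node always has at least two successors in the CFG.
assert len(cond2.successors) >= 2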

  (a)
  cond1 : ConditionalStatement (5)
      expression = { variable1, constant1 }
  asgn1 : AssignmentStatement (6)
      assign = variable1
      expression = { constant2 }

  (b)
  cond2 : ConditionalStatement (7)
      expression = { variable1, variable2 }
  asgn2 : AssignmentStatement (8)
      assign = variable1
      expression = { variable2, constant2 }
  asgn3 : AssignmentStatement (10)
      assign = variable1
      expression = { variable1, constant2 }

Figure 8: A CFG for (a) the if-then structure and (b) the if-then-else structure.

The LoopStatement class is used to represent the header of a loop. The LoopStatement class is similar to the ConditionalStatement class: it encapsulates the expression that determines whether the loop body is entered. A LoopStatement instance always has as a successor the first statement of the sequence of statements that are repeated within the loop; this sequence of statements is collectively called the loop body. If the program does not end with the last statement of the loop body, then the LoopStatement instance also has an edge to the first statement that comes after the last statement of the loop

body. A LoopStatement instance always has at least two predecessors: the statement before the loop header is connected to the LoopStatement instance, and the last statement of the loop body is connected to it as well. The edge between the last statement of the loop body and the LoopStatement instance is a backedge, because the edge transfers the flow of control to a statement that has already been evaluated at least once. In the ADVR PR the only structure that can be the target of a backedge is a LoopStatement instance.

The ADVR PR represents all kinds of loops with the same structure. For example, the two loops on lines 4–5 and 6–8 of the program loopexample in Figure 9 are both represented with an identical structure, which is labeled "(a)" in Figure 10. The symbol table of loopexample is shown in Table 14. A for-loop is transformed into a while-loop before it is transformed into an ADVR PR structure. For example, the for-loop on line 9 of Figure 9 is transformed into the construct of lines 10–14, which includes a while-loop. The program representation for both the loop of line 9 and the code of lines 10–14 of Figure 9 is shown in Figure 10, labeled with "(b)".

Table 14: The partial symbol table for the program loopexample of Figure 9.

  Reference Id   Type       Datatype   Name   Scoped Name     Value
  variable1      variable   integer    i      loopexample-i   -
  constant1      constant   integer    -      -               0
  constant2      constant   integer    -      -               1
  constant3      constant   integer    -      -               10
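To make the loop shape concrete, here is a tiny sketch of the CFG edges for lines 4–5 of loopexample, case (a) of Figure 10 (the adjacency table is invented for illustration and is not the ADVR PR itself):

# Successor edges of the CFG for lines 4-5 of loopexample: while (i < 0) do readln(i);
successors = {
    "loop1": ["asgn1", "after"],  # LoopStatement: enter the body, or leave the loop
    "asgn1": ["loop1"],           # readln(i): its edge back to loop1 is the backedge
    "after": [],                  # first statement after the loop body
}

# In the ADVR PR only a LoopStatement instance can be the target of a backedge.
assert "loop1" in successors["asgn1"]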

The ADVR PR of Figure 10 includes two backedges, which are shown as thick arrows. The backedge of case "(a)" goes from asgn1 to loop1, and in "(b)" the backedge transfers the control flow from asgn2 to loop2.

Procedures and functions are treated in two alternative ways. If a function or a procedure has only call-by-value parameters and does not access global variables, then it is represented by a CFG of its own. The roles of variables are defined in the context in which the variables exist, and a subroutine with only call-by-value parameters does not affect variables outside itself. Thus, such a subroutine can be represented as a separate CFG. Consider the procedure printBigger on lines 3–8 of the program biggestoftwo in Figure 11. The procedure has two call-by-value parameters, the two integers number1 and number2, and it prints the bigger of the two numbers. The ADVR PR of the procedure

( 1) program loopexample (input);
( 2) var i: integer;
( 3) begin
( 4)   while (i < 0) do
( 5)     readln(i);
( 6)   repeat
( 7)     readln(i)
( 8)   until i > 0;
( 9)   for i := 1 to 10 do writeln(i);
(10)   i := 1;

(11)   while (i <= 10) do begin
(12)     writeln(i);
(13)     i := i + 1
(14)   end
(15) end.

Figure 9: The program loopexample.

( 1) program biggestoftwo (input);
( 3) procedure printBigger(number1, number2: integer);
( 4) begin
( 5)   if number1 > number2 then
( 6)     write(number1)
( 7)   else write(number2)
( 8) end;
( 9) procedure readInput(var x);
(10) begin
(11)   x := minimum;
(12)   while (x

( 5)     readln(input);
( 6)   writeln('Random number ', i, ': ', random(input));
( 7)   i := i + 1
( 8) end

Figure 16: Example code where the variables i and input have the IVA FC.

The variable input is used to read the values that the user enters. The first assignment to the variable is done at statement 3, and possible subsequent assignments are done within a loop, on line 5. The two assignments on lines 3 and 5 create a value sequence of


arbitrary length that starts with the value assigned at statement 3. Thus, input has the IVA FC at statement 3. The variable input illustrates an important property of the IVA FC: it can be caused by an assignment statement that is itself within a loop. If this is the case, as with input on lines 3–5, then the episode that the assignment with the IVA property starts is repeated. For example, we know that input goes through the same episode three times as the loop of lines 2–8 executes.

Algorithm. The algorithm of Figure 17, which detects IVA FC instances, gets a reference to the program representation as input. We use the term definition to refer to the definition of a value of a variable, i.e., the term definition means the assignment of a value. The algorithm uses three phases for finding the initial definitions, which it stores in the set initial_defs. Before the phases are performed, initial_defs is set to be an empty set on line 3. On line 4 the algorithm makes a call to the method getLoopStatements(), which returns all loop statements of the source program. These are then examined in turn on lines 4–10. The variable loopstmt holds a reference to the currently processed loop statement.

( 1) procedure detectIVA (pr: ProgramRepresentation)
( 2) begin
( 3)   initial_defs := ∅;
( 4)   for each loopstmt in pr.getLoopStatements() do begin
         /* 1. Collect definitions killed in loop identified by loopstmt */
( 5)     killed_defs := ∅;
( 6)     for each stmt in loopstmt.getStatements() do
( 7)       killed_defs := killed_defs ∪ stmt.getRDKill();
         /* 2. Collect all definitions that enter the loop */
( 8)     entering_defs := loopstmt.getFRDEntry();
         /* 3. Calculate initial definitions */
( 9)     initial_defs := initial_defs ∪ (entering_defs ∩ killed_defs);
(10)   end
(11)   for each asgnstmt in initial_defs do begin
(12)     variable := asgnstmt.getAssignedVariable();
(13)     add IVA FC to variable at asgnstmt
(14)   end
(15) end

Figure 17: An algorithm that detects the IVA FC.
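The algorithm's three phases, described in the text below, lean on the data flow results behind getRDKill() and getFRDEntry(). The following sketch replays them on plain Python sets; the definition pairs are invented to mirror Figure 16 (input assigned at statements 3 and 5), not produced by the ADVR system:

# A definition is a (variable, statement) pair.
entering_defs = {("input", 3), ("i", 1)}   # definitions reaching the loop header (cf. getFRDEntry)
killed_defs   = {("input", 3)}             # definitions overwritten inside the loop body (cf. getRDKill):
                                           # the assignment at statement 5 kills the definition made at 3

# Phase 3: a definition that both enters the loop and is killed inside it is an
# initial value, so its variable gets the IVA FC at the defining statement.
initial_defs = entering_defs & killed_defs
print(initial_defs)   # {('input', 3)}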


The first phase collects the definitions that are killed, or overwritten, within loopstmt. Lines 5–7 collect all definitions that the statements within the body of loopstmt kill into the set killed_defs. This is done by processing each statement in turn and repeatedly calling the method stmt.getRDKill(), which returns the definitions killed in the statement identified by the reference stmt. In the second phase the algorithm collects all definitions that enter the loop loopstmt. This is done by calling the method loopstmt.getFRDEntry(), which returns all definitions that enter loopstmt via forward edges, in other words, via data flow from the predecessors of loopstmt. The third phase detects the initial definitions that enter the loop loopstmt by comparing the definitions that enter it with the definitions that are killed in it: the intersection of the sets entering_defs and killed_defs, calculated on line 9 of the algorithm, is added to the set initial_defs along with previously calculated results. If a definition enters the loop and is killed within it, then the variable defined in the definition has the IVA FC at the statement that does the assignment.

When all loop statements of the source program have been processed, the algorithm has all initial definitions stored in the set initial_defs. The loop of lines 11–14 iterates through the definitions contained in the set and adds the IVA FC to each of the variables in the definitions at the location of the definition.

Time complexity. The time complexity of the algorithm in Figure 17 is bounded by the time complexity of the two nested loops: the first is located on lines 4–10 and contains the second, which spans lines 6–7. The innermost loop iterates through the statements within a loop body, which can happen at most n − 1 times, where n is the number of statements, if all statements are within the loop body save for the loop header itself. The outermost loop starting on line 4 iterates through the loop headers. A program can include n loop headers, where n is again the number of statements in the program; this assumes that a loop can be expressed in a single statement and that each statement of the program is a loop. From these worst-case time complexities of the loops we conclude that the worst-case time complexity of the algorithm is O(n²), where n is the number of statements in the source program. Consider the example code of Figure 18. The outermost loop starting on line 4 is executed 5 times as there are 5 loop statements


(1) for (i:=1; i
