Simplicity: A Key Engineering Concept for Program Understanding

Yang Li and Hongji Yang
Computer Science Department, De Montfort University, England
E-mail: yangli, [email protected]

Abstract

One of the most significant problems for existing program comprehension methods is their scalability. In this paper, we introduce a new technique to make such methods scalable. In particular, we advocate the concept of "simplicity" for program understanding. We first propose a simplified semantic network as domain knowledge representation; we then introduce a linear and domain-oriented program partitioning method which can partition a huge program into self-contained program modules, so that the recovery of domain knowledge can be carried out within a smaller program space; we also introduce a set of rules for recovering domain knowledge from C code, followed by a theoretical analysis of these algorithms. A case study on the programming style based program partitioning method is given. Finally, comparisons with others' work are made and conclusions are drawn.

Keywords: program understanding, knowledge recovery, programming styles, program partitioning
1 Introduction

Program understanding has established itself as a crucial prerequisite for many software engineering tasks, especially the evolution of legacy software systems [12, 17]. Legacy software refers to software systems which were developed some time ago, which have undergone considerable modification and which are still indispensable today. Two main features usually characterise a legacy software system: (1) a huge volume of software code, and (2) deteriorated software documentation. Driven by urgent deadlines, programmers hardly had time to write software documentation, which brought no immediate profit. The same holds for software evolution at the later stages of the software life cycle, where programmers failed to keep the documentation up to date. All these facts suggest that software documentation in a legacy system is no longer an existent or reliable map of the source code, and that such a map needs to be regenerated directly from the source code.

It is a daunting job to manually browse through millions of lines of legacy source code to re-establish the linkage between source code and high level domain knowledge, and therefore extensive tool support is needed. Among these tools, program plan recognition [13] and human concept assignment [5] play an important role. These two kinds of methods are based on Knowledge Engineering methodology, where source code is matched against a knowledge base of pre-defined program plans or human concepts in order to recognise a programming plan or a high level concept in a program. Since the matching can be automated by a computer program, programmers are relieved of the hard work of manual matching.

From a knowledge engineering viewpoint, the matching mentioned above involves two parties: the knowledge base and the source code. The computational complexity of matching is determined by both the size of the knowledge base and the size of the source code. Even the fastest computer is very sensitive to such computational complexity. So far, few existing methods in this area have addressed the "size" issue, which is crucial to scaling these methods up to real world applications. The potential for reducing the size of both the knowledge base and the source code is huge. For example, a program may consist of several different components, each containing domain knowledge which is irrelevant to the other components. The computational complexity of matching can be significantly reduced if the matching is carried out within the space of an individual component rather than the whole program space. Similarly, the size of the knowledge base can be reduced by cutting off less useful knowledge. In this paper, we introduce a new technique to address these concerns.
In particular, we introduce a simplified semantic network as domain knowledge representation; we also propose a linear program partitioning method which is based on human programming styles. Furthermore, we introduce a set of rules for recovering domain knowledge from C code.

The remaining sections are organised as follows. Section 2 introduces the simplified semantic network. Section 3 describes our linear program partitioning method. The process of generating linkage between source code and domain knowledge is given in Section 4. In Section 5, a theoretical analysis of the benefit of our approach in terms of computational complexity is given. In Section 6, a case study on the programming style based program partitioning method is described. A comparison between our work and related methods can be found in Section 7. Finally, we reach our conclusion and propose future work.
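The size argument made earlier in this section can be illustrated with a rough worked estimate. The cost model here is an assumption made only for illustration and is not taken from the analysis in Section 5: suppose the matching cost grows with the product of the knowledge base size K and the code size N, and that the program splits into m equally sized components, each of which needs only K/m of the knowledge base.

```latex
\begin{align*}
T_{\text{whole}} &\propto K \cdot N, \\
T_{\text{partitioned}} &\propto \sum_{i=1}^{m} \frac{K}{m} \cdot \frac{N}{m}
  = \frac{K \cdot N}{m}.
\end{align*}
```

Under these assumptions, partitioning into m = 10 components would reduce the matching work by roughly a factor of ten. Real components are rarely equal in size or fully independent, so the figure should be read as an upper bound on the gain.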
2 Domain Knowledge Partitioning

It is well established that software evolution often results from the adding, changing and deleting of services in the application domain [20]. This is in contrast with other software maintenance tasks such as corrective maintenance, optimising maintenance, adaptive maintenance and preventive maintenance, where the domain knowledge embedded in the software normally remains unchanged. Having evolved over many years, domain knowledge in code has reached a certain degree of "saturation". This means that it is now possible, at least in some specified domains, to build up a stronger domain knowledge base which covers frequently used domain knowledge [21]. The domain knowledge base can therefore be used, in conjunction with domain knowledge extraction rules, to recover domain knowledge from source code.

The domain knowledge embedded in source code often has an ambiguous appearance, which makes recovering complete and comprehensive domain knowledge rather difficult. Traditional domain knowledge analysis methods often use a semantic network as knowledge representation [5]. A semantic network is composed of concepts and the relationships among these concepts. As mentioned in the previous section, the recovery of domain knowledge involves matching between the domain knowledge base and the source code. If parts of the semantic network are not successfully matched during a domain knowledge recovery process, it is hard to decide whether the whole semantic network should be acknowledged.

To minimise the effect of unsuccessful matching between the domain knowledge base and the source code, we introduce a new concept called the knowledge slice into domain knowledge representation. A domain knowledge slice is defined as a set of strongly related domain concepts linked by a set of relationships among these concepts. Multiple domain knowledge slices can exist for a single set of domain concepts, depending on the number of different groups of relationships among these concepts. Domain knowledge is therefore regarded as a collection of domain knowledge slices which are linked with each other through common concepts.

In order to accommodate this idea, we change the classic semantic network into a two-layer network, consisting of a concrete semantic network and an abstract semantic network. The concrete semantic network contains detailed information on concepts and the relationships among them. Each concrete semantic sub-net is associated with a single knowledge slice. The abstract semantic network contains only domain concepts and links to the corresponding concrete semantic sub-nets. The concrete semantic networks are used as templates for the recovery of knowledge slices from source code, whereas the abstract semantic network facilitates the analysis of the impact of change at domain level.

Figure 1 illustrates the simplification of monolithic knowledge (upper part) into loosely coupled knowledge slices (lower part), where concepts Ci connected by strong relationships SRi are grouped into a single knowledge slice and weak relationships WRi are removed. The partitioning of monolithic knowledge into knowledge slices can either be subjective or be based on certain partitioning criteria, which is not the main topic of this paper. In the lower part of Figure 1, circled Ci connected by solid lines denote a concrete semantic network. Uncircled Ci, together with dotted lines, denote the abstract semantic network.

Software is normally developed to fulfil a certain operational functionality, and verb-noun pairs are the basic elements for describing operational domain knowledge. We therefore classify the concepts in the abstract and concrete semantic networks into two categories, namely, objects and actions. An object represents a class, an instance, a feature, etc., whereas an action represents an operation or an event which occurs among several objects. The relationships in the concrete semantic network can therefore be classified into relationships between objects and objects, objects and actions, and actions and actions. Table 1 describes relationship examples in each category.
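The two-layer representation described in this section can be sketched as a small data structure. This is a minimal illustration, not the authors' implementation: the class and function names (`Concept`, `KnowledgeSlice`, `abstract_network`, `shared_concepts`) and the example concepts (`client`, `server`, `send`, `open`) are hypothetical, while the relationship labels are those listed in Table 1.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Concept:
    """A domain concept; kind is either "object" or "action"."""
    name: str
    kind: str


@dataclass
class KnowledgeSlice:
    """One concrete semantic sub-net: concepts plus their strong relationships."""
    name: str
    concepts: set = field(default_factory=set)
    relations: list = field(default_factory=list)  # (source, label, target)

    def relate(self, source, label, target):
        # Adding a relationship also registers both concepts in the slice.
        self.concepts.update((source, target))
        self.relations.append((source, label, target))


def abstract_network(slices):
    """Abstract layer: map each concept to the names of the slices using it."""
    index = {}
    for s in slices:
        for concept in s.concepts:
            index.setdefault(concept, set()).add(s.name)
    return index


def shared_concepts(slices):
    """Concepts that link two or more slices together."""
    return {c for c, names in abstract_network(slices).items() if len(names) > 1}


# Hypothetical example data, for illustration only.
client = Concept("client", "object")
server = Concept("server", "object")
send = Concept("send", "action")
open_ = Concept("open", "action")

transmission = KnowledgeSlice("transmission")
transmission.relate(client, "sender-of", send)    # objects-actions
transmission.relate(server, "receiver-of", send)  # objects-actions

session = KnowledgeSlice("session")
session.relate(open_, "before", send)             # actions-actions

links = shared_concepts([transmission, session])
```

In this sketch each slice can be matched against source code on its own, and the abstract layer only records which slices a concept belongs to. Here `send` is the common concept that links the two slices.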
Table 1. Examples of Relationships among Nodes in Semantic Network

relationships      examples
----------------------------------------------
objects-objects    instance-of, part-of, etc.
objects-actions    receiver-of, sender-of
actions-actions    sub-plan-of, before, etc.

3 Domain-Oriented Program Partitioning

A large software program is generally co-written by a group of programmers, with each programmer being responsible for part of the whole program. Each part of the program is usually a self-contained component with relatively independent functionality. Empirical studies [22] suggest that each programmer, having a different training background and temperament, tends to use a particular code-writing style consistently. Therefore, if different programming styles in a program can be identified, this information can be used to partition the program into smaller self-contained sub-modules. In our approach, three groups of features in source code are used to distinguish different programming styles. These are style of comments, style of names and style of indentation. Some examples of each group of features are set out in the style tables that follow.
[Figure 1. Simplification of A Complex Domain Knowledge Base. Legend: SRi, strong relationship; WRj, weak relationship; Ck, concept; a separate line style marks the linkage between local concept and global concept.]
============================================================================
Style of Comments      Patterns               Code
----------------------------------------------------------------------------
Style 1: /* x */       /* X */                SC1

Style 2: /*            /*\n{ *X\n} */         SC2
          *x ...
          */

Style 3: /*            /*\n{ X\n}*/           SC3
         x ...
         */
----------------------------------------------------------------------------
Note:
1. \n denotes the control character 'RETURN'.
2. X stands for an arbitrary sequence of characters containing neither '\n' nor '*'.
3. {X} denotes that X can occur one or more times.
4. The patterns are used to match the corresponding styles in source code.
5. Style 2 reflects a rigid personality, whereas Style 3 indicates a tendency towards freedom. The person using Style 1 could never waste a penny.

============================================================================
Style of Naming             Patterns     Code
----------------------------------------------------------------------------
Style 1: Connection-Mode    W{-W}        SN1
Style 2: Connection_Mode    W{_W}        SN2
Style 3: ConnectionMode     W{W}         SN3
----------------------------------------------------------------------------
Note:
1. W stands for an atomic name. The start of an atomic name is identified by the patterns " ?X" and "XCX", where '?' stands for any character, X stands for an arbitrary sequence of characters, and C stands for a capitalised character. '?' and 'C' therefore indicate the start of an atomic name. The end of an atomic name is identified by the patterns "X? " and "X?CX", where the notations 'X', '?' and 'C' have the same meaning. '?' therefore indicates the end of an atomic name.
2. {W} stands for one or more occurrences of W.
3. The person who likes Style 1 treasures cooperation more than the person who uses Style 2, whereas the person who uses Style 2 will generally think the individual role is more important. The person who uses Style 3 must have better eyesight than others.

============================================================================
Style of Indentation    Patterns     Code
----------------------------------------------------------------------------
Style 1: bbx            bbx          SI1
Style 2: bbbbbbx        bbbbbbx      SI2
----------------------------------------------------------------------------
Note:
1. b stands for a blank space. x stands for an arbitrary sequence of characters.
2. The person who likes Style 2 could be more romantic than the person who prefers Style 1.
============================================================================
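The style patterns tabulated above can be approximated with regular expressions. The sketch below is one possible reading of the patterns, not the authors' matcher: the function name `classify` and the dictionary names are hypothetical, the shape assumed for an atomic name W (a letter followed by lowercase letters or digits) is a simplification of Note 1 of the naming table, and leading blanks before '*' in multi-line comments are tolerated as an assumption.

```python
import re

# Assumed shape of an atomic name W: one letter, then lowercase letters/digits.
W = r"[A-Za-z][a-z0-9]*"

# Comment styles: X contains neither '\n' nor '*'.
COMMENT_PATTERNS = {
    "SC1": re.compile(r"/\*[^*\n]*\*/"),                         # /* X */
    "SC2": re.compile(r"/\*\n(?:[ \t]*\*[^*\n]*\n)+[ \t]*\*/"),  # /*\n{ *X\n} */
    "SC3": re.compile(r"/\*\n(?:[^*\n]*\n)+[ \t]*\*/"),          # /*\n{ X\n}*/
}

# Naming styles: {..} in the tables means one or more repetitions.
NAMING_PATTERNS = {
    "SN1": re.compile(rf"{W}(?:-{W})+"),                         # W{-W}
    "SN2": re.compile(rf"{W}(?:_{W})+"),                         # W{_W}
    "SN3": re.compile(r"[A-Za-z][a-z0-9]*(?:[A-Z][a-z0-9]*)+"),  # W{W}
}

# Indentation styles: exactly two or exactly six leading blanks.
INDENT_PATTERNS = {
    "SI1": re.compile(r" {2}\S.*"),   # bbx
    "SI2": re.compile(r" {6}\S.*"),   # bbbbbbx
}


def classify(text, patterns):
    """Return the first style code whose pattern matches all of text, else None."""
    for code, pattern in patterns.items():
        if pattern.fullmatch(text):
            return code
    return None


comment_code = classify("/*\n * first line\n * second line\n */", COMMENT_PATTERNS)
name_code = classify("Connection_Mode", NAMING_PATTERNS)
indent_code = classify("      return 0;", INDENT_PATTERNS)
```

Each classifier takes an already isolated comment, identifier or source line. Extracting those fragments from a C file, and deciding how to treat text that fits none of the listed styles, is left open here.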
A program initially goes through a partitioning process which has three stages, namely, programming style sampling, program cutting and program re-healing. An algorithm for creating the sampling function for programming styles is given in Script 1. The abbreviations used are Programming Styles (PS), Current Program Line (CPL), Sampling Function (SF), and Sample Interval (SI).

================================================================================
Algorithm for Creating Sampling Function of Programming Styles for A Program
--------------------------------------------------------------------------------
PS