Re-Evaluating Inheritance Depth on the Maintainability of ... - CiteSeerX

0 downloads 0 Views 111KB Size Report
of Object-Oriented Software ... of the key elements that makes object-oriented programming powerful (Mey- .... had used Java for all programming exercises.
Submission (July 17, 1998) to ’Empirical Software Engineering -- An International Journal’

Re-Evaluating Inheritance Depth on the Maintainability of Object-Oriented Software Lutz Prechelt, Barbara Unger, Michael Philippsen, Walter Tichy Universit¨at Karlsruhe, Fakult¨at f¨ur Informatik (Received ..... ; Accepted in final form .....) Abstract. John Daly et al. reported on two experiments evaluating the effects of inheritance depth on program maintenance. One experiment, which was performed twice, found that maintenance was performed significantly quicker for software using 3 levels of inheritance, compared to equivalent ‘flattened’ software without inheritance (0 levels). A second experiment found that maintenance for software using 5 levels of inheritance tended to be slower than for equivalent software without inheritance. We report on similar experiments on the same question. Our results clearly contradict those mentioned above: the flattened versions clearly tended to be superior. We made several crucial changes to the setup. In particular we used a longer and more complex program, made an inheritance diagram available to the subjects, and used additional kinds of maintenance tasks. Furthermore, our experiment design compares 0-level, 3-level and 5-level inheritance directly in one experiment. Our results suggest that there is no such thing as usefulness or harmfulness of a certain inheritance depth per se. Maintenance effort depends on other factors partly related to inheritance depth. Using a post-hoc analysis, we identify the number of relevant methods to be one such factor. Key words: controlled experiment, inheritance depth, maintenance, design style

1. Inheritance and complexity Inheritance (in its common combination with subtype polymorphism) is one of the key elements that makes object-oriented programming powerful (Meyer, 1997). Many design problems can be solved elegantly using inheritance and the resulting programs are often simpler, clearer, and more flexible than could have been achieved otherwise (Gamma et al., 1995). But it is also obvious that the construct of inheritance may also introduce additional complexity if used inappropriately. Hence, it is not obvious in which cases inheritance will be beneficial and in which it will be counter-productive. Daly, Brooks, Miller, Roper, and Wood conducted research in this direction. Using interviews and then a questionnaire they found that 55% of objectoriented practitioners among 273 respondents “agreed that inheritance depth is a factor when attempting to understand object-oriented software”. 31% of the respondents “indicated that between four and six levels of inheritance depth is where the difficulties begin”. The exact definition of inheritance

2

L. PRECHELT ET AL.

depth underlying these statements is unclear, but see Section 2 for the meaning in the experiments. Since program understanding is an important cost factor, in particular during software maintenance, the authors concentrated on this one aspect of inheritance and devised controlled experiments for assessing the influence of inheritance depth on program maintainability. The experiment tasks consisted of adding a new class to a program, which was given either with inheritance (experiment group) or without inheritance, i.e., “flattened” (control group). The results of these experiments were reported in this journal (Daly et al., 1996). In the following, we will refer to these experiments and to the article as O LDEXP. The present article presents a direct follow-up to that work. We believed that certain parameters should be changed in order to increase external validity. We designed an appropriate new experiment, performed it twice, and obtained results that contradict those of O LDEXP. We will not re-iterate the argumentation why the research question is important and what findings were reported in related work and refer the reader to the well-written O LDEXP article (Daly et al., 1996) instead. In the following, we will first summarize in what respects our experiment design was different from O LDEXP and then report on our actual experiments (design, subjects, tasks, procedure) and findings. Nevertheless, we will assume that the reader has read (Daly et al., 1996). We will refer to our own experiments as N EWEXP. Due to space restrictions, we cannot provide a complete description of the programs and tasks used. However, detailed information is available in a technical report (Unger and Prechelt, 1998) that includes the complete experiment materials such as the task descriptions and source program listings. 2. Main differences between O LDEXP and N EWEXP Overall we find O LDEXP well designed and described. Initially we had but one point of criticism: We believed the subjects should be given additional documentation of the inheritance tree, preferably in graphical form. Later we also found out that the programs used were rather simple, in both size and structure. Compared to O LDEXP, the design of our own experiment exhibits the following main differences: Combined 3-level and 5-level test: O LDEXP used two completely separate experiments for first comparing 3-level inheritance to a “flat” (0-level) program (in a two-phase experiment O LDEXP-1a and -1b and again in a replication of O LDEXP-1b) and then comparing 5-level inheritance to a “flat” (0-level) program (O LDEXP-2). In contrast, we used 0-, 3-, and 5-level versions of the same program within one experiment; refer to

ese.tex; 17/07/1998; 19:01; no v.; p.2

RE-EVALUATING INHERITANCE DEPTH

3

Sections 3.3 and 3.4 for details. Rationale: This design allows to directly compare performance on the 3-level with the 5-level version in addition to comparing both to the flat version. Class diagram: The subjects of O LDEXP were equipped with the program source code only; we also handed out a printed class diagram in OMT notation (see Figure 2 for an example), which allowed studying the inheritance relations at a glance. Rationale: In a modern programming environment one should not assume that the only means for exploring the inheritance tree is program source text. It later turned out the subjects of O LDEXP-1a and -1b (but not -2!) were also given a partial overview, namely a sorted table of instance variables in every class.1 Program complexity: Our programs were significantly longer than those used in O LDEXP (see also Table I and refer to Sections 3.2 and 3.3); the classes had more kinds of relationships and also performed less obvious functions. Rationale: The O LDEXP programs are unrealistically simple. All classes belong to but one inheritance tree, there are no other relations between the classes except for inheritance, and the function and implementation of each class can almost be deduced from its name alone. See Figure 2 about the N EWEXP program in contrast. Task complexity: The O LDEXP task always consisted solely of adding a new class, whose structure was very similar to the structure of all existing classes. In contrast, our tasks require understanding of classes with different internal structure and imply less straightforward extensions. The size of the tasks was also larger; see Table II and Section 3.5 for details. Changes vs. extensions: In addition to one task requiring adding a new class (as in O LDEXP), our subjects also performed a task involving changes to multiple existing classes; refer to Section 3.5. Rationale: This is a frequent type of maintenance where different effects must be produced than when merely adding classes. Definition of inheritance depth: We define inheritance depth (ID ) as the number of edges on a longest downward path from root to leaf in the inheritance tree(s) of a program, including both the program versions before and after the maintenance task. In contrast, O LDEXP’s definition (IDold ) counts classes, not edges, on the path (i.e., always one more than ID), but considers only the program version before the maintenance. Since the O LDEXP tasks always involve adding a class at the bottom of the hierarchy, they always have ID = IDold . Given the mix of task types in N EWEXP, however, we deem ID more appropriate, because program

ese.tex; 17/07/1998; 19:01; no v.; p.3

4

L. PRECHELT ET AL.

Table I. Experiment overview of our experiments (N EWEXP-G , N EWEXPU ) and the old experiments (O LDEXP-1a, O LDEXP-1b, O LDEXP-2). —N EWEXP— G U no. of subjects no. of groups no. of datapoints

—O LDEXP— 1a 1b 2

57 3 57

58 3 58

31 2 50

29 2 27

31 2 30

0-level program or version: classes method bodies lines program files

20 158 2470 1

20 160 2465 1

3 26 273 7

4 35 370 9

8 96 1007 17

3-level program or version: classes method bodies lines program files

27 100 1344 1

27 96 1317 1

5 21 252 11

6 27 323 13

5-level program or version: classes method bodies lines program files

28 80 1200 1

28 79 1187 1

task types

2x/1x add class,

11 56 694 23 1x add class

change available materials

file, listing, inherit. diagram

files

observed variables

elapsed time, correctness

elapsed time

understanding involves the deepest classes, no matter whether these are new or previously existing. Submission procedure: N EWEXP subjects decided alone when they considered their solution complete. In contrast, O LDEXP solutions were checked by a supervisor and the subjects were told to continue if their solution was not yet correct. We do not know the exact content of this feedback and how it may have influenced subject behavior. Further differences include our programs being written in Java (as opposed to C++), coming from a different domain, and having a graphical user interface (as opposed to purely textual stream I/O).

ese.tex; 17/07/1998; 19:01; no v.; p.4

RE-EVALUATING INHERITANCE DEPTH

5

3. Description of the experiments While reading the following sections, please refer to Tables I and II and to Figures 1 and 2 for further characterization of the experiment design, the program used, and the task complexity. 3.1. H YPOTHESES For the range of inheritance depths from 0 to 5 as studied here, we used the following initial hypotheses to guide our experiment: Hypothesis 1: Programs with more levels of inheritance will require less time for maintenance, given that sufficient information about the inheritance tree is provided in compact graphical form. Hypothesis 2: Programs with more levels of inheritance will result in a higher quality of maintenance, given that sufficient information about the inheritance tree is provided in compact graphical form. These hypotheses will be modified and refined based on the observations in the experiment before they are tested quantitatively. 3.2. S UBJECTS

AND ENVIRONMENT

We performed our experiment twice, with small changes. The first time we performed it with 57 graduate Computer Science students in the mandatory final exam of a 6-week intensive Java programming course (“Java/AWT Kompaktkurs (JAKK)”, about 70 participants). Participation required achieving 75 percent of all points in the previous course exercises. We call this experiment G (for “graduate course”). We replicated the experiment with 58 undergraduate Computer Science students at the end of their first year (lecture and lab course “Informatik II”, about 160 participants). Participation was optional and would result in a small bonus on the grade of the written exam to be conducted later. The course had used Java for all programming exercises. We call this experiment U (for “undergraduate course”). On average, the previous programming experience of the G subjects hU subjectsi was 8.1 years h6.1 yearsi using 4.0 h3.9i different languages with a median largest program of 2750 LOC h2000 LOCi. Before the course, 91% h62%i of the subjects had previous experience with object-oriented programming, 47% h45%i with programming graphical user interfaces (GUIs). With respect to Java AWT GUI programming, the course concentrated on JDK1.0.2-style hJDK1.1-stylei event handling. The U course covered only little GUI programming.

ese.tex; 17/07/1998; 19:01; no v.; p.5

6

L. PRECHELT ET AL.

Figure 1. Example screen shot of the “Boerse” experiment program. A textual month display is already open (bottom window). For a graphical month display about to be opened, the stock code number is just being entered (center window).

3.3. P ROGRAM

USED

Our experiment program, called “Boerse”, was an interactive application for displaying two different kinds of stock exchange data in various ways. The data is taken from two text files and displays can be selected to have textual table form or graphical chart form and to cover different time ranges into the past (day, week, month). The user selects a certain sort of display, then enters a stock code number (if required for that display). As a result, a window pops up with the desired presentation. The process can be repeated at will. See Figure 1 for an example screen shot. Three functionally equivalent versions of this program were used in the experiments: The 5-level program represents the original program design. See Figure 2 for its inheritance diagram. The 0-level “flattened” version was created by inserting inherited attributes and methods directly into the source text of each subclass and removing the inheritance relation — actual polymorphism is not used in any of the versions. The 3-level version was built according roughly to an object-oriented design principle: Inherit only interfaces, not implementations. Therefore in our 3-level version, subclasses were derived only from abstract classes, never from concrete classes. The rationale of this

ese.tex; 17/07/1998; 19:01; no v.; p.6

7

RE-EVALUATING INHERITANCE DEPTH Comparable DayDates IntervalDates DayChartValue MonthChartValue

StockInfo FreeInfo

Chart

ChartDisplay

InfoWithCosts Table

DayChart MonthChart

DayChartDisplay MonthChartDisplay

Constants Stock Choice Date Time Price

DayTable

IntervalTable

LastDayPrices WeekTable DayGains

MonthTable

MonthDynamicTable

Figure 2. Classes and their inheritance relations for the 5-level version of the “Boerse” program.

principle is maximizing flexibility for implementation change, see (Gamma et al., 1995) for details. Thus, our three versions are not arbitrary variants of one program, but represent three different, but sensible design styles. The programs were accompanied by two input files, whose format was relevant for solving the maintenance tasks. For the U experiment these three versions were converted from their original JDK1.0.2 form (using action() methods for event handling) into a form using inner classes and JDK1.1-style event handling.

3.4. E XPERIMENT

DESIGN

The independent variable in both experiments is the maximum depth of inheritance occurring in the program. The variable has three levels, 0, 3, and 5, resulting in three experimental groups. The subjects did not know what the experimental variable was. We used a matched-between-subjects design (Christensen, 1994). The subjects were ordered by expected performance and then randomly assigned to the three groups in such a manner that one out of any three subjects with similar expected performance would be assigned to each group. For the G experiment we used the scores from the previous course assignments for sorting the subjects, for the U experiment we used the size of the largest program they had ever written. We do not claim that these criteria provide perfect matches, but we claim they resulted in groups with well-balanced average subject ability. The dependent variables were the time required for each assignment (measured in minutes) and the quality of the delivered solution (measured on a defined discrete grading scale).

ese.tex; 17/07/1998; 19:01; no v.; p.7

8 3.5. TASKS

L. PRECHELT ET AL.

AND ASSIGNMENTS

Overall there were three different tasks, which we call Task 1, Task 2a, and Task 2b. The G subjects performed all three; the U subjects performed only 2a and 2b in order to reasonably limit their work time. There were two assignments that were measured separately. For the G experiment the first assignment consisted of Task 1 and the second of Tasks 2a+2b. For the U experiment the first assignment consisted of Task 2a the second of Task 2b. Task 1 consisted of converting the program from 2-digit to 4-digit years (“solving the Year-2000-Problem”). No further information was given. Changes were required at all points where records from either of the input files were split into their individual fields. At each point of change, the character offset of the date field and all subsequent fields had to be corrected. Most subjects found out during program understanding that the points of change were exactly all occurrences of the substring() method, which could then be found using the editor. There were only half as many change points in the 5-level program than in either the 3-level or 0-level program. Task 2a required writing a new class and extending the menu handling in the main class. The task consisted of adding a new type of display (arbitrary time interval price table display), whose functionality involved parts from both, existing chart-style and table-style display classes. Most of the tablestyle functionality could be inherited in the 3-level and 5-level programs and copied in the flat program. The code for selecting an arbitrary time interval could be adapted from a chart class in all versions. The new code must also overwrite behavior inherited from superclasses. This behavior is most heavily distributed in the 5-level program. Task 2b requested to add yet another type of display, featuring arbitrary time interval selection and table format as in Task 2a, but displaying differently computed data. It turned out that in the 5-level program, the solution from 2a can be reused entirely, if it was designed well. One only needs to change the superclass from which to inherit the data computation. The flat and 3-level programs could also profit from 2a, but less much so. 3.6. P ROCEDURE

AND MEASUREMENTS

Each of the experiments was performed in a single session in June 1997. The subjects implemented their solutions using JDK1.1 on IBM RS 6000 Unix workstations running AIX. The experiment had four parts, for each of which the materials were handed out and collected individually: first a background questionnaire and a short pretest for evaluating the subjects’ knowledge of Java inheritance rules, then assignment 1, assignment 2, and finally a postmortem questionnaire.

ese.tex; 17/07/1998; 19:01; no v.; p.8

9

RE-EVALUATING INHERITANCE DEPTH

Table II. Characterization, by program version, of the typical solution effort for the tasks. Investigated methods: number of methods that must be analyzed and understood for solving the problem. Hierarchy changes (dynamic): number of times one must switch to a subclass or superclass during that method understanding process if it is performed by dynamic tracing (i.e. following the execution call sequence). Hierarchy changes (static): ditto, if understanding is performed by only reading the relevant parts of the program in textual order and exploiting knowledge from the documentation. Solution methods: number of methods that are copied from existing classes (verbatim or with subsequent changes) or modified in order to create the solution. No methods needed to be written from completely from scratch. —N EWEXP— (G &U ) 0 3 5

inheritance depth

—O LDEXP— 1a/1b 2 0 3 0 5

Task 1:

solution methods

16

16

8

“Add class” task/ Task 2a:

investigated methods hierarchy changes —dynamic —static solution methods

13

16

17

4

5

4

7

0 0 17

17 3 9

21 4 5

0 0 10

2 1 4/5

0 7 15

5 8 4

investigated methods hierarchy changes —dynamic —static solution methods

2

2

2

0 0 17

0 0 10

0 0 5

Task 2b:

For each assignment and each subject we measured the time between handing out and collecting the experiment materials. As a backup and doublecheck, we also intercepted each compilation and each program run and registered their time. For each task, we graded the results according to the degree to which they fulfilled the requirements. Since graded scales of correctness are somewhat subjective we only report the number of completely correct solutions here. 3.7. T HREATS

TO INTERNAL VALIDITY

There are two major threats. First, there may be program differences that are not directly related to inheritance depth but still influence maintenance effort. Such differences may have crept in during the conversion of the original 5level program into the flat and 3-level versions. However, the conversion process was almost mechanic and we can hardly imagine how such differences could have occured. Second, the group abilities may be unbalanced due to bad luck. In our pretest we checked for this possibility using two Java comprehension assignments. We compared the proportions of correct versus wrong answers. Neither the Fisher exact p nor the 2 -Test indicated significant differences for

ese.tex; 17/07/1998; 19:01; no v.; p.9

10

L. PRECHELT ET AL.

any of the group pairs; the smallest of the 24 p-values obtained were 0.20, 0.24, and 0.33. 3.8. T HREATS

TO EXTERNAL VALIDITY

There are two main differences between the experimental and real software maintenance situations that may limit the generalizability (external validity) of the experiments: First, in real situations there are subjects with more experience and, second, there are programs and maintenance tasks of different complexity, structure, and domain. Experience: The most frequent concern with experiments using student subjects is that the results cannot be generalized to professionals, This is certainly true for the U experiment, where the subjects were rather inexperienced. The subjects of the G experiment, on the other hand, perform quite similar to professional software engineers, in particular since our Java course has attracted predominantly individuals from the most capable third of our studentship, with an average of more than 8 years of programming experience. Structure and complexity of program and task: It is unclear how the effects observed in the experiments relate to those that may occur with other programs and other maintenance tasks. In particular the program structure and the quality of available documentation may make a large difference; for instance the number of relationships (besides inheritance) between classes, the availability of different types of design documentation, and previous familiarity with the program and its domain (which both may also be considered a kind of documentation). Furthermore, as our results below show, the type of maintenance task is an important factor. More research is needed before the relation between task type, inheritance depth, and maintenance effort will be understood.

4. Results and discussion The average times required by the groups for the various tasks is depicted in Figure 3. We report the results of one-sided, pair-wise statistical tests for identical means of these data, which we performed using Bootstrap resampling percentiles (Efron and Tibshirani, 1993). 2 We report the p-values indicating the probability that the observed differences occured by chance alone; we consider a difference significant if p < 0:1. Task 1 (Y2K-Problem, experiment G ): The subjects with the flat program version must perform a larger number of modifications than the group with the inheritance depth of 5 but they were almost as fast; the difference is not significant (p = 0:200). Also, the number of modifications is the same

ese.tex; 17/07/1998; 19:01; no v.; p.10

11

time[minutes], correctness[%] 0 50 100 150

RE-EVALUATING INHERITANCE DEPTH

0 3 5 G, task1

0 3 5 G, task 2a/2b

0 3 5 U, task 2a

0 3 5 U, task 2b

Figure 3. Average work times in minutes (as bars) and resulting solution correctness in percent (as triangles and lines) for each task and each of the three program versions.

for the flat compared to the 3-level program, yet the 3-level group was significantly slower (p = 0:075) and hence was also even slower than the 5-level group (p = 0:023). These results contradict Hypothesis 1. The correctness of the solutions was perfect in the flat group (which obtained 100% of all points), very good in the 3-level group (90%), and good in the 5-level group (80%) — contradicting Hypothesis 2 as well. These results indicate that, for this task, the flat program is simplest to change. The relation between the 3-level and the 5-level version cannot reliably be concluded from the results because speed and correctness show opposite trends. Task 2a+2b (add two displays, experiment G ): In this task, the flat version was maintained significantly faster than the other two (p = 0:038 against the 3-level version and p = 0:005 against the 5-level version); these other two took about the same time (p = 0:393). Correctness is about equal in all three groups (79%, 75%, 80%, no significant differences). Again the largest amount of new source code needed to be produced for the flat version, but apparently it is easier to copy all of this code in one step and then modify it locally instead of tracing functionality up and down the inheritance hierarchy, independent of whether that is 3 levels or 5 levels deep. However, the non-difference between the 3-level and 5-level versions is composed of two different effects which we can discriminate in the U experiment as we will see now. Task 2a (add interval table price display, experiment U ): For this class addition task there is no difference between the flat and the 3-level version (p = 0:456). However, the 5-level version takes a lot longer to change (p = 0:038 against 0-level or p = 0:055 against 3-level). The necessi-

ese.tex; 17/07/1998; 19:01; no v.; p.11

12

L. PRECHELT ET AL.

ty for functionality tracing along the hierarchy as mentioned above apparently is not harmful in the 3-level version, but becomes harmful for the deeper 5-level hierarchy. The correctness of the solutions tends to decrease with increasing inheritance depth, but all the differences are not significant (0:185 < p < 0:374). Task 2b (add interval table gain/loss display, experiment U ): As mentioned above, the 5-level group can solve 2b by just duplicating the class written for the 2a solution, changing its name and superclass, and augmenting the menu. The same was not possible for the other two versions. As a result, maintenance of the 5-level program is significantly quicker for this task compared to the 3-level program (p = 0:004) or the flat program (p = 0:006); those two are about the same (p = 0:263). The correctness of the solutions is not significantly different between any of the groups. There is a caveat for interpreting these results: some U subjects dropped out of the experiment during task 2b. By comparison to the task 2a results we find, not surprisingly, that the dropouts tended to be from the less capable half of the participants. Since the mortality was most pronounced in the 5-level group, the group abilities became unbalanced and thus the 5-level results obtained above are somewhat exaggerated. Sensitivity to outliers: All of the above time results remain qualitatively the same when possible outliers are removed by right-trimming, i.e., ignoring the largest 10% or even 20% of the time values in each sample. Subjective experience of subjects: The postmortem questionnaire asked for subjective judgements and had some interesting results. Judging whether the use of inheritance in the programs was adequate, a large majority of the flat version group (in particular the graduate students of G ) found that inheritance was used too little or much too little. The other two groups by and large found this aspect OK. Still, however, more subjects in the flat program group than in the other groups believed they had a correct solution. With respect to the clarity of program structure, the G groups prefered the 5-level program over the other two, while the U groups prefered the other two over the 5level program. Similarly, the U groups found the task difficulty lower for the flat program than for the other two. G had no clear differences. The answers to “How well could you concentrate during the task?” for the second task indicated lower concentration for the groups with deeper inheritance in both experiments. Discussion: Both hypotheses stated in section 3.1 have to be rejected: neither do deeper inheritance hierarchies generally speed up maintenance, nor do they result in superior quality of the results. However, the opposite statements are also not generally valid, although there is a trend in that direction. At least the flat program version (without inheritance) tends to be maintained faster and with fewer mistakes than the program versions with inheritance depth of either 3 or 5.

ese.tex; 17/07/1998; 19:01; no v.; p.12

13

150

RE-EVALUATING INHERITANCE DEPTH

NG5 NG3

time [min] 50 100

NG0

0

O25 O10 O20 O13

0

5 10 15 number of relevant methods

20

Figure 4. Prediction of average time taken for O LDEXP-1a (O10, O13), O LDEXP-2 (O20, O25), and N EWEXP-G ,2a+2b (NG0, NG3, NG5) as a function of the number of investigated methods, i.e. methods analyzed during understanding.

5. Comparison to O LDEXP and further analysis These results are in sharp contrast to the results of O LDEXP, which claimed that 3 levels of inheritance was better than both 0 and 5 levels. One may ask: “Which of these experiments is right?”, but we think this is the wrong question. Instead we believe that inheritance depth in itself is not an important factor for maintenance tasks and we should search for other factors that explain the results. To find such factors, one may use a model of the maintenance process. Maintenance consists of two major steps: Understanding the program and actually performing the change. In both O LDEXP and N EWEXP, the tasks were dominated by understanding; the actual changes were not difficult. It is plausible that the effort for understanding correlates with factors such as the numbers of classes or methods that are actually analyzed or the number of times one switches from one class to another — either due to explicitly extern method calls or due to inheritance — during the understanding process. See Table II for a comparison of the program versions with respect to these metrics. Using these factors, we tried building a statistical model for explaining the average time required by the groups of both experiments. We could actually determine the above factors for the O LDEXP task and the N EWEXP-2a+2b task, because these have an almost canonical solution approach. In the fol-

ese.tex; 17/07/1998; 19:01; no v.; p.13

14

L. PRECHELT ET AL.

lowing we consider only the results of O LDEXP-1a and O LDEXP-2 and the N EWEXP-G task 2a+2b. However, essentially the same discussion applies to the results of O LDEXP-1b (both the original and the internal replication). We leave out experiment N EWEXP-U because its subjects were insufficiently experienced and leave out N EWEXP-G task 1 because its structure is too different from the others and we could not identify a canonical approach towards the solution as is necessary for our analysis. It turns out that the number M of methods analyzed during the understanding process is a very good predictor of the overall time required. M and group average time have a correlation 3 of 0.98, hence no additional variables are required for explaining the average time. 4 A least squares linear model fitted to the seven group means indicates a constant effort of 24 minutes, plus 6 minutes per method; see Figure 4. This is a rather plausible result. It is interesting that despite the high quality of the model, it does not predict that the O LDEXP-1a 3-level program (O13) is better than the flat one (O10) — these two data points relate against the trend of the other five, so there must be an unknown factor in these two programs that would explain the O LDEXP results. We do not claim that the number of relevant methods is a universally good predictor of maintenance effort. Quite on the contrary: we believe that its utility is highly task-dependent. In fact, its given utility is surprising, considering the significant differences between the experiments as discussed in Section 2. But the above analysis indicates that more relevant factors than inheritance depth exist. Identifying a complete set of such predictors and their quantitative relations is important future work. It may turn out that O LDEXP and N EWEXP are actually quite similar with respect to these (yet unknown) essential qualities of programs and tasks.

6. Conclusion In our experiment setting, less inheritance tended to result in less maintenance effort, contradicting the previous results of Daly et al. However, this result is presumably task-specific. Probably the real advantages of inheritance become visible only for tasks involving complicated functionality modifications where less inheritance means a higher degree of code duplication and hence higher effort for changes and consistency checking. Yet our results suggest that the assumption underlying both, Daly et al.’s old and our new experiments, is wrong: Inheritance depth is not in itself an important factor for maintenance effort. We speculate that the actual important factors may interact with inheritance depth, but are much more complex, such as (1) the match between the program design and the particular maintenance task and (2) interactions

ese.tex; 17/07/1998; 19:01; no v.; p.14

RE-EVALUATING INHERITANCE DEPTH

15

with previous tasks. Although neither our nor the previous experiments were designed to identify or measure such factors, we could identify one of their components: for the given tasks, the number of task-relevant methods in the programs was a most powerful predictor of maintenance effort. Other plausible components, such as the degree of distribution of relevant items over the source code need to be investigated in future experiments that are designed specifically for that purpose.

Acknowledgements We thank Gerd Hillebrand for providing the Informatik II subjects and Arno Wagner for helping with mastering the technical infrastructure. We also thank all our experimental subjects for giving their best.

Notes 1

This fact is not mentioned in (Daly et al., 1996). We found it from (Daly, 1996). We did not use the t-test because of non-normalities in our data. 3 We cannot compute the correlation based on the individual subjects’ times, because the raw data of O LDEXP is not available 4 Incidentally, for these two experiments the total number of classes in the program correlates strongly with the number of relevant methods and is only negligibly less powerful as a predictor of average time. 2

References Larry B. Christensen. Experimental Methodology. Allyn and Bacon, Needham Heights, MA, 6th edition, 1994. John Daly. Replication and a Multi-Method Approach to Empirical Software Engineering Research. PhD thesis, Dept. of Computer Science, University of Strathclyde, Glasgow, Scotland, 1996. John Daly, Andrew Brooks, James Miller, Marc Roper, and Murray Wood. Evaluating inheritance depth on the maintainability of object-oriented software. Empirical Software Engineering, 1(2):109–132, 1996. Bradley Efron and Robert Tibshirani. An introduction to the Bootstrap. Monographs on statistics and applied probability 57. Chapman and Hall, New York, London, 1993. Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, Reading, MA, 1995. Bertrand Meyer. Object-Oriented Software Construction. Prentice Hall, Upper Saddle River, NJ, 2nd edition, 1997. Barbara Unger and Lutz Prechelt. The impact of inheritance depth on maintenance tasks: Detailed description and evaluation of two experiment replications. Technical Report 18/1998, Fakult¨at f¨ur Informatik, Universit¨at Karlsruhe, Germany, July 1998.

ese.tex; 17/07/1998; 19:01; no v.; p.15

16

L. PRECHELT ET AL.

Address for correspondence: Fakult¨at f¨ur Informatik, Universit¨at Karlsruhe D-76128 Karlsruhe, Germany Phone: +49/721/608-4068, Fax: +49/721/608-7343 Email: prechelt,unger,phlipp,[email protected] WWW: http://wwwipd.ira.uka.de/Tichy/

ese.tex; 17/07/1998; 19:01; no v.; p.16

Suggest Documents