A software engineering experiment in software component generation

Richard B. Kieburtz, Laura McKinney, Jeffrey M. Bell, James Hook, Alex Kotov, Jeffrey Lewis, Dino P. Oliva, Tim Sheard, Ira Smith, Lisa Walton

Abstract

This paper presents results of a software engineering experiment in which a new technology for constructing program generators from domain-specific specification languages has been compared with a reuse technology that employs sets of reusable Ada program templates. Both technologies were applied to a common problem domain, constructing message translation and validation modules for military command, control, communications and information systems (C3I). The experiment employed four subjects to conduct trials of use of the two technologies on a common set of test examples. The experiment was conducted with personnel supplied and supervised by an independent contractor. Test cases consisted of message specifications taken from Air Force C3I systems. The main results are that greater productivity was achieved and fewer errors were introduced when subjects used the program generator than when they used Ada templates to implement software modules from sets of specifications. The differences in the average performance of the subjects are statistically significant at confidence levels exceeding 99 percent.

Keywords

Software component generation, productivity, reliability, flexibility, usability.

1 Introduction

It is widely believed that software generators have the potential to improve human productivity and reduce the incidence of errors introduced in software development and maintenance. This paper reports the results of an experiment conducted to test these hypotheses by comparing the performance of subjects using a software generator with their performance in applying another state-of-the-art technology, reusable code templates, to the same programming tasks.

The software generator was constructed using the SDRR (Software Design for Reliability and Reuse) method [4, 5], an experimental software synthesis technique developed by the Pacific Software Research Center of the Oregon Graduate Institute. In the SDRR method, the interface for an application designer is based upon a domain-specific specification language in which the applications specialist can specify a software solution. A specification language is designed for each domain to capture the high-level abstractions of importance in the domain. It is intended to formalize the vernacular that domain experts use for informal description of their designs. The unique contribution of the SDRR method is in the technology provided to construct a program generator from the semantic definition of a specification language. The constructed generator uses advanced techniques of automatic program transformation and translation to convert a high-level semantic definition into reasonably efficient, compilable code in a selected target language for implementation.

A software generator developed with SDRR comprises
• a declarative, domain-specific language in which to specify a software component from a selected problem domain,
• a translator for the specification language that provides early prototyping of a generated solution,
• a reusable program transformation and translation pipeline that reshapes and improves the code of the prototype translator,
• a back-end code generator that produces code in a selected target language, conforming to a software interface specification.

The user of a program generator sacrifices control over the algorithms, data representations and coding standards that contribute to the design and implementation of a software component in exchange for the time, cost and reliability advantages of design automation. The SDRR method and supporting technology attempt to take the automation of software generation to a new level. The experiment described here was designed to evaluate how successful this effort is, by applying it in a single, non-trivial application domain and comparing it with an available, state-of-the-art reuse technology.

The research reported here was supported by a contract with the USAF Materiel Command. Pacific Software Research Center, Oregon Graduate Institute of Science & Technology, P.O. Box 91000, Portland, OR 97291-1000 USA, http://www.cse.ogi.edu/PacSoft/. Authors' e-mail addresses: [dick,mckinney]@cse.ogi.edu. To appear in the proceedings of ICSE'96. Copyright 1996, IEEE.

1.1 A domain-specific approach to reuse

Concepts in reuse technology have been moving beyond notions of reuse libraries towards collections of more flexible design artifacts. Ten years ago, a commonly held vision of software reuse was that developers would assemble systems from a standard library of reusable software components, in analogy to the way that hardware systems were designed. This concept has not been as successful as some had envisioned. For many reasons, the software industry has not developed the standards that govern hardware technology families and which help to assure compatibility of components. Lacking such standards, software components capture in code too many late design decisions such as interface types and algorithm details. This makes it costly to adapt existing software components to new applications and can adversely affect the performance of a resulting software product.

Ideas on reuse have evolved to a more flexible approach, trying to capture in reusable work products the domain engineering and pre-implementation design that is common to a class of software products. Often, the reuse class is a particular domain of applications which share an architecture and one or more domain-specific languages for describing requirements or specifications. A domain-specific reuse kit [7] is a collection of reusable artifacts, tools and a process for using them, centered on an application domain. A reuse kit enables construction of new applications within a specified domain.

There are many variations on the idea of domain-specific reuse, but certain concepts are central. The common aspects are:
• A software architecture, which describes the kinds of entities (components, types, interfaces) that will make up a software product and how they are related.
• A set of prototype components, or templates. These may be of a large or a very small scale, depending upon whether the reuse process emphasizes assembly or synthesis [10].
• Tools that support generation or customization of components from prototype templates. Generation can be directed by prototype forms that implement classes of features, providing a synthetic facility for the creation of components [3].
• An integration framework, guided by the architecture, that builds an application from existing, synthesized or customized components. The integration framework provides the interface seen by the application engineer and may use sophisticated inference tools to select and adapt components for inclusion in an application [11].

Both the MTV Generator and the MTV Templates fit into the framework of domain-specific reuse kits. With MTV-G, the emphasis is on synthesis and the integration framework provides a language-based interface, while the prototype templates are for code generation and are hidden from view of the application engineer. With MTV Templates, the emphasis is on customization of prototype templates, and these constitute the visible interface to the applications engineer.

1.2 Two reuse technologies

The templates-based technology with which the generator was compared was developed by the Software Engineering Institute [9] for the selected domain, message translation and validation (MTV) for Air Force Command, Control, Communications and Information (C3I) systems. It consists of a set of 12 Ada program templates that correspond to the generic types of fields expected in a message, including numeric fields of any width, text, records and enumerations. There is a template to specify patterns in which fields are to be encoded as strings of bits in an external message. The templates contain specially bracketed identifiers that are to be replaced by parameters. These code templates can be edited by search-and-replace to produce Ada functions for the MTV application, specialized to a particular message format. With the MTV Templates solution, a set of edited templates is synthesized into an Ada package of six functions that translate messages between their datastream representations and internal data structures, and which perform validity checks on the contents of a message. This synthesis and the generation of a simple test harness for the package are performed automatically.

For the MTV Generator (MTV-G) created using the SDRR technology, the format of a message is specified in a message specification language (MSL) designed specifically for the MTV application. MSL is the domain-specific specification language for MTV-G. MTV-G synthesizes an Ada package of six functions with interfaces like those of the templates-based solution. In addition, however, MTV-G provides an executable prototype of the message translator, with which a message specification written in MSL can be tested before Ada code is generated and compiled.

The MTV domain is an excellent candidate for software automation because MTV modules are ubiquitous. The function of an MTV module is to translate incoming messages that arrive as streams of bits or bytes into an internal data structure that can be further interpreted by a controller. As a message is translated, validity checks may be performed on fields of the message. A typical C3I system incorporates several different MTV modules to translate messages from its several sensors. Each MTV module performs similar functions but differs in a multitude of details. Thus conventional reuse of software components is difficult for MTV modules. The MTV domain was selected for this experiment because the templates-based solution was already available, providing a basis for comparison.
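To make the function of an MTV module concrete, the following is a minimal sketch, written in Python rather than Ada. The field layout, the names `Field`, `LAYOUT`, `translate` and `validate`, and the value ranges are all invented for illustration; they are not taken from any ICD, from MSL, or from the Ada packages produced by either technology.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Field:
    """Hypothetical fixed-width unsigned field (not an MSL or ICD construct)."""
    name: str
    width_bits: int
    min_value: int
    max_value: int


# Illustrative layout only: a 4-bit message type, a 12-bit track number
# and a 9-bit heading in degrees.
LAYOUT: List[Field] = [
    Field("msg_type", 4, 1, 3),
    Field("track_id", 12, 0, 4095),
    Field("heading_deg", 9, 0, 359),
]


def translate(bits: str, layout: List[Field]) -> Dict[str, int]:
    """Translate a string of '0'/'1' characters into an internal record."""
    total = sum(f.width_bits for f in layout)
    if len(bits) != total or set(bits) - {"0", "1"}:
        raise ValueError(f"expected exactly {total} binary digits")
    record: Dict[str, int] = {}
    pos = 0
    for f in layout:
        record[f.name] = int(bits[pos:pos + f.width_bits], 2)
        pos += f.width_bits
    return record


def validate(record: Dict[str, int], layout: List[Field]) -> List[str]:
    """Return the names of fields whose values fall outside their ranges."""
    return [f.name for f in layout
            if not f.min_value <= record[f.name] <= f.max_value]
```

Each real MTV module differs from this sketch in many details (variable-length fields, variant records, inter-field constraints), which is why a per-message specification, rather than a single reusable component, is the unit of reuse in both technologies.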

2 Hypotheses of the Experiment

Unpublished data gathered in the late 1980's by IBM Federal Systems Division showed that developer productivity was improved approximately four times by using domain-specific, reusable code templates, versus custom coding of applications [6]. This level of productivity increase is not inconsistent with that reported in a more recent experiment [2] that evaluated reliability and productivity benefits of code reuse with object-oriented modules. We believe reusable code templates are at least as effective as any other code reuse techniques.

The experiment we report on here was designed to test several hypotheses comparing the MTV Generator that was developed with SDRR technology with the MTV Templates. Specific hypotheses tested in this experiment are:
1. Flexibility: MTV-G can be configured to at least as many instances of message specification as can the MTV Templates.
2. Productivity: developers will be more productive in creating or modifying a message translation and validation module when using MTV-G than with the MTV Templates.
3. Reliability: fewer defects will be introduced in development with MTV-G than in a solution developed with MTV Templates.
4. Usability: developers will perceive MTV-G as easier to use than the MTV Templates.

3 Experiment Design

This section gives a brief overview of the design of the experiment. Complete details can be found in Software Design for Reliability and Reuse: Phase I Final Scientific and Technical Report [5].

Measurable attributes of the performance of human subjects in an experiment such as this one may be influenced by a variety of factors including (i) the technologies used, (ii) order of training in the technologies, (iii) difficulty of learning the technologies, (iv) work environment, (v) prior education and experience, (vi) differences in individual ability and motivation, (vii) boredom, level of difficulty of tasks, and (viii) familiarity with task-specific procedures. In designing an experiment to measure the effect of a single factor of analysis, in this case the technologies used, one must neutralize, insofar as possible, the effects of extraneous factors. The techniques available to control for extraneous factors are (A) matching to eliminate variation at each trial, (B) replication of trials, and (C) randomization of trials with respect to those factors that may vary systematically.

The most important residual factor is differences in the abilities of the human subjects. However, these differences can be accounted for if the experiment is designed to make them vary independently of the factor of analysis. The degree of independence can be determined by statistical analysis of the measured data. To obtain independence, it is essential to control for potentially systematic factors, such as the order of training of subjects in the two technologies, the order of presentation of tasks to subjects, a common work environment, and the rate of progress in learning new technologies. The only systematic factor remaining should be the factor of analysis.

Even after extraneous, systematic factors have been successfully controlled, there remains variability in the difficulty of individual tasks. This is controlled by matching (the same task is assigned in both technologies) and by replication, through which the random effects of variable factors are diluted in importance.

Porter, Votta and Basili [1] give a detailed discussion of the control of factors affecting human performance in a software engineering experiment.

3.1 Trial task specifications

Message format specifications for the experiment trials were taken from actual, unclassified Air Force message specifications from a number of different C3I systems. These specifications are quite rich in detail. The number of field types in an individual message varied from 7 to approximately 40. Both character-based and bit-based fields were included. Variable-length fields, optional fields, lists of fields and variant record fields all occurred. Up to 232 alternative string patterns were specified in a single variant field. Some fields specified value ranges and numeric scaling. Inter-field value constraints were encountered.

An Air Force message format is specified by an Interface Control Document (ICD), a tabular form that specifies the layout of a message field-by-field. It prescribes the length of each field or a terminating character if the length is variable, and gives its intended interpretation. The ICD's supplied for use in the experiment were selected by personnel at the Air Force Electronic Systems Center (ESC) at Hanscom AFB. They were reviewed prior to the beginning of trials by the OGI project manager, who obtained clarification on points of ambiguity or apparent inconsistency. After review, the specifications were given to other OGI personnel who were not otherwise involved in the design or analysis of the experiment, to generate acceptance test data for each specification.

3.2 Subjects

Four subjects were engaged in parallel development and simulated maintenance of specified MTV components. The subjects were first trained in the use of both technologies, including hands-on experience with sample problems prior to beginning the experiment. Intermetrics, Inc. contracted to provide and supervise four subjects to conduct trials for the experiment. These persons were selected to have similar qualifications and experience. Each held a BS or MS degree in computer science and had one to three years of Ada programming experience. There were two males and two females. They had no prior knowledge of the application domain. These subjects are probably somewhat over-qualified for the particular tasks required of them in conducting the experiment, but it was felt that Ada experience was required in order not to put them at a disadvantage in using the MTV Templates. The experiment trials were conducted at the Intermetrics facility in Cambridge, Massachusetts. Direct supervision of the subjects was provided by Intermetrics.

3.3 Training

At the start of training, each subject signed a participant consent form, which informed him/her of the uses to be made of the data that were collected over the course of the experiment. Training in use of the MTV Generator and the MTV Templates was provided during a ten-day period prior to the start of the experiment trials. The subjects were also trained in use of the experiment monitoring environment in which they were to conduct their work. The trainers in each technology were persons with prior experience in its development and use. The training order was varied: two subjects were trained first in the use of MTV-G and then in the MTV Templates, and two were trained in the reverse order.

3.4 Work Environment

The subjects worked in a restricted and controlled experiment monitoring environment on identical UNIX workstations. The subjects were informed that their work would be monitored, but were not told what data were being recorded. The environment recorded information automatically, without any user feedback or intervention. To protect subject privacy and the integrity of the data, the subjects were told in advance that the data collected in the experiment would not be used to evaluate their performance as individuals.

3.5 Experiment trial sequences

The experiment was designed to measure the performance of subjects both in developing translation and validation packages for new message formats and in maintaining MTV software as a message format evolves. Maintenance activity was simulated through a cumulative series of changes. The initial message specifications consisted of ICD's for 12 independent formats. Eight of the initial message format specifications were chosen for simulated maintenance. The maintenance tasks formed eight series, each consisting of seven cumulative modifications to an initial ICD, thus there were 56 modifications in all. Each of the message specifications was implemented in each of the two technologies, thus there were 24 initial design tasks and 112 maintenance tasks to be assigned among the four subjects.

Each subject implemented six of the initial ICD's, three in each technology. Each subject then implemented 28 maintenance tasks, randomly selected except that 14 were to be done with MTV-G and 14 were to use MTV Templates. The task sequences that were assigned to individual subjects were randomized, subject to the constraint that a subject alternated use of the two technologies in the initial six tasks. No subject was assigned to implement the same initial message specification in both technologies. The first task assigned to a subject required use of the same technology in which the subject was trained first.
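The task counts above follow from simple arithmetic, and the constraints on initial-task assignment can be met by a simple rotation. The sketch below, in Python, is only an illustration: the function `assign_initial_tasks` and its rotation scheme are assumptions made here, not the procedure actually used to prepare the schedule, and the ordering constraints (alternating technologies, first task matching first training) are not modeled.

```python
import random
from typing import Dict, List, Tuple

N_FORMATS = 12            # initial ICDs
N_SERIES = 8              # formats chosen for simulated maintenance
MODS_PER_SERIES = 7       # cumulative modifications per series
TECHNOLOGIES = ("MTV-G", "MTV Templates")
SUBJECTS = ("A", "B", "C", "D")

initial_tasks = N_FORMATS * len(TECHNOLOGIES)                 # 24
modifications = N_SERIES * MODS_PER_SERIES                    # 56
maintenance_tasks = modifications * len(TECHNOLOGIES)         # 112
maintenance_per_subject = maintenance_tasks // len(SUBJECTS)  # 28


def assign_initial_tasks(seed: int) -> Dict[str, List[Tuple[int, str]]]:
    """Hypothetical assignment: each subject gets three formats per technology,
    and never the same format in both technologies."""
    rng = random.Random(seed)
    formats = list(range(1, N_FORMATS + 1))
    subjects = list(SUBJECTS)
    rng.shuffle(formats)
    rng.shuffle(subjects)
    schedule: Dict[str, List[Tuple[int, str]]] = {s: [] for s in SUBJECTS}
    for i, fmt in enumerate(formats):
        # Rotating by one position guarantees two different subjects per format.
        schedule[subjects[i % 4]].append((fmt, "MTV-G"))
        schedule[subjects[(i + 1) % 4]].append((fmt, "MTV Templates"))
    return schedule
```

Shuffling both the formats and the subject order before rotating keeps the pairing of subjects with formats random while still satisfying the two balance constraints by construction.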

3.5.1 Out-of-order task performance

Although the maintenance tasks were assigned to subjects in random order, the ordering of these tasks was also constrained by the sequence of modifications. A maintenance task could not be performed until its predecessor had been completed. Since subjects worked at different rates, it often happened that a subject was blocked on his/her next assignment pending completion of a predecessor task by another subject. In order to maintain work flow in the experiment, a subject whose assigned task was blocked awaiting completion of a predecessor was allowed to progress to his/her next assigned task and perform it out of order. There were many cases in which this occurred.
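The out-of-order rule can be pictured as a scan of a subject's assigned queue for the first task that is not blocked. The Python sketch below assumes, for illustration only, that a task is identified by (series, step, technology), that step 0 denotes the initial implementation, and that the predecessor of a task is the previous step of the same series in the same technology; none of these representations comes from the experiment environment itself.

```python
from typing import List, Optional, Set, Tuple

# (series, step, technology); step 0 is the initial implementation (assumed encoding).
Task = Tuple[int, int, str]


def next_runnable(queue: List[Task], completed: Set[Task]) -> Optional[Task]:
    """Return the first queued task whose predecessor has been completed."""
    for task in queue:
        series, step, tech = task
        if step == 0 or (series, step - 1, tech) in completed:
            return task
    return None  # every queued task is blocked on another subject's work
```

A subject whose first queued task is blocked simply receives the next unblocked one, which is the behavior the work rule describes.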

3.6 Work rules

A randomized schedule of task assignments was prepared before the start of the experiment. When a subject obtained a new task, he/she checked in the task identifier with the experiment monitoring environment to obtain access to files pertinent to the task. These files remained accessible until the subject logged the task's completion. The monitoring environment did not allow a subject to have more than one task active at any time. Subjects were requested to exit from the environment to perform personal tasks not related to the experiment. They were also asked not to work on experiment tasks outside the environment. Conformance with this work rule was confirmed in post-experiment interviews.

The subjects were told not to work collaboratively on tasks and never to show one another the ICD specifying a task on which they were currently working. They were allowed to ask one another general technical questions, and did so. In the post-experiment interviews, the subjects reported that their technical conversations were of short duration, and predominantly concerned general problems, such as a suspected Ada compiler bug, problems with data or the environment, and work-arounds.

3.7 Task completion criteria

The normal criterion for completion of a task was that the implementation of the message specification passed an acceptance test whose data comprised 25 acceptable messages and 25 that violated the specification.

The experiment monitoring environment provided a test harness that could be invoked by a subject to test a module that had been created. Test runs were identified as either unit testing or acceptance testing. A subject provided his/her own test data for unit testing and the outcomes were not recorded. For acceptance testing, the test data had been pre-loaded into the environment in files that were not directly accessible to subjects. Test outcomes were recorded. Completion of a task occurred when it passed its acceptance test.

Because the MTV Generator provided full-function prototyping capability, a subject could complete acceptance testing of an MTV-G module prior to actually generating Ada code. As a substantial amount of time was required to run the program translation, then build and compile an Ada package, the subject was allowed the option of initiating these actions off line, during lunch breaks or overnight. The resulting Ada package was then subjected to confirmation tests against the same acceptance test data. Successful completion of these tests terminated the task. The MTV Templates solution could be tested only after building and compiling an Ada package. The same acceptance test data were used to validate the completion of a common task in either technology.
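The completion criterion amounts to requiring that a module accept every acceptable message and reject every violating one. A minimal Python sketch of such a check follows; the function `passes_acceptance`, the toy acceptor `is_four_digits` and the generated test messages are hypothetical stand-ins and do not reproduce the actual test harness or its data.

```python
from typing import Callable, Iterable, List, Tuple


def passes_acceptance(accepts: Callable[[str], bool],
                      cases: Iterable[Tuple[str, bool]]) -> bool:
    """True if every valid message is accepted and every invalid one rejected."""
    return all(accepts(message) == should_accept
               for message, should_accept in cases)


def is_four_digits(message: str) -> bool:
    """Toy stand-in for a generated MTV module: accepts exactly four digits."""
    return len(message) == 4 and message.isdigit()


# 25 acceptable messages and 25 that violate the (toy) specification.
cases: List[Tuple[str, bool]] = (
    [(f"{n:04d}", True) for n in range(25)]
    + [(f"{n:03d}", False) for n in range(25)]
)

task_complete = passes_acceptance(is_four_digits, cases)
```

Because the same `cases` can be run against a prototype or a compiled package, a check of this shape supports the confirmation testing described above without any change to the data.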

3.8 Out-of-scope tasks

Since the data for the experiment trials were assembled from a variety of existing Air Force ICD's, it was anticipated that some message specifications would be out of scope for the design of one or the other MTV solution. Subjects were instructed to report out-of-scope problems to their technical support contact and suspend the task until they received further instruction. For tasks to be done with MTV-G, the intended remedy was to supply the subject with an extended version of MSL that provided features that would enable the task to be completed. For tasks to be done with MTV Templates, the intended remedy was that new templates would be written, either by the subject or by the technical support person, to allow the task to be completed. However, no attempt was made either to provide or suggest such extensions to the basic technology unless they were specifically requested by a subject.

As it turned out, fewer MTV-G tasks were deemed to require out-of-scope extensions than had been anticipated by the OGI research team when they first analyzed the task specifications. The subjects were able to find work-arounds to complete most tasks without requesting modifications. There was a single case in which a subject requested and received an extension to the MSL specification language to allow a task to be completed.

Several tasks required the subject to modify or add templates in order to complete the tasks with MTV Templates. The subjects were generally successful in making the necessary modifications to the Ada templates. However, the subjects did encounter out-of-scope problems that could not be handled simply by creating or modifying templates.

4 Data collection

The experiment produced a large amount of data in several forms. Data that were directly reported or were extracted from computer files maintained by the experiment environment consist of:
• Task effort allocation reports, submitted by the subjects on a weekly basis. These reports attribute effort hours to tasks.
• Task assessment forms. A task assessment form was filled out by a subject as he/she completed each task, or at the end of a week, if a task was still being attempted. It asks a series of questions that can be answered with a number on a scale of 1-5 to assess the subject's perception of the difficulty of the task and the usability of the technology and environment.
• Session summaries logged by the experiment monitoring environment. These record the subject's identifier, the times at which the session began and ended, the times at which activities occurred (edit transactions, compilations, builds, test runs, etc.), the type of each activity and its outcome, if relevant, and the files accessed by the activity. The session summaries provide a detailed account of on-line work flow.
• Edit summaries extracted from RCS logs. The experiment environment was programmed to save file images to the RCS archive at the conclusion of each edit session. This gives a finely detailed, time-stamped summary of editing activity.
• Subject perceptions. After completion of the experiment, individual interviews were held with each subject. The subject was asked to respond to a series of questions that had been prepared in advance. The interviews were taped, and transcripts of the subject's answers were prepared for analysis.

4.1 Method of analysis

Data that directly compared the performance of subjects over a series of trials were analyzed using an analysis of variance. This is a statistical test of the hypothesis that observed differences in the mean value of a performance metric cannot be accounted for by random variation in the observed values. In an analysis of variance it is important to eliminate all conceivable sources of systematic variation of the observables other than those that are being tested. Then the unaccountable variations that are observed can reasonably be assumed to be random, and normally distributed.

Analysis of variance can take into account a single factor or multiple factors whose relationship to the observable is being tested. In the design of this experiment, the factors that could be correlated with observed differences in performance were the two technologies used and the individual differences among subjects. Other factors, such as order of use of the technologies, order of training in the technologies and prior familiarity with the particular task were eliminated in the design of the experiment. An unaccountable factor that contributed variability to the outcomes was the relative complexity of the individual design tasks. Unless otherwise noted, all differences between reported means have been confirmed with a confidence level of 99% or greater.
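For readers unfamiliar with the method, the core of a single-factor analysis of variance is the F statistic, the ratio of between-group to within-group mean squares. The Python sketch below computes it from first principles. It is a simplified illustration only: it treats technology as the sole factor, whereas the analysis described here also considered differences among subjects, and the effort figures in `example` are invented, not data from the experiment.

```python
from statistics import mean
from typing import Dict, List


def one_way_f(groups: Dict[str, List[float]]) -> float:
    """F statistic: between-group mean square over within-group mean square."""
    values = [x for g in groups.values() for x in g]
    grand = mean(values)
    k, n = len(groups), len(values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((x - mean(g)) ** 2 for g in groups.values() for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Invented effort hours per task, for illustration only.
example = {
    "MTV-G": [2.0, 3.5, 1.5, 4.0],
    "MTV Templates": [6.0, 9.5, 5.0, 11.0],
}
f_value = one_way_f(example)
```

The computed value would then be compared against the F distribution with (k - 1, n - k) degrees of freedom to obtain a significance level; in practice a statistics package would handle that step.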

5 Results of the experiment

5.1 Flexibility to meet varied specifications

The flexibility hypothesis was that the MTV Generator can be configured to at least as many instances of message specification as can the MTV Templates. This was demonstrated by examining several characteristics:
• Scope: the spectrum of problem instances that a solution could handle;
• Extensibility: the ability to modify a system to handle out-of-scope problems;
• System Performance: the ability to produce compilable Ada code for problem instances.

5.1.1 Scope and Extensibility

In the course of the experiment, the subjects encountered one out-of-scope problem in using the MTV Generator and nine out-of-scope problems using the MTV Templates. The MTV Generator handled a broader scope of specifications than did the MTV Templates.

To handle the out-of-scope problems, extensions were developed for both the MTV Generator and MTV Templates. For the MTV Generator, a minor extension to the MSL language was sufficient to encompass the out-of-scope problem. However, this extension could not be provided by the subjects themselves; it required the expertise of the OGI research team who designed and implemented the MSL language. Extension of the MTV Templates was partially successful in handling the nine problems encountered. Extensibility was gained through the Ada programming ability of the subjects that allowed them to produce new code templates. However, in the majority of instances, extensibility problems were handled not by writing new templates, but by finding workarounds or accepting partial solutions. One-third of the out-of-scope problems were not resolved within the course of the experiment.

5.1.2 Beginning-to-end system performance

Eleven of the 12 initial specifications and 47 of the 56 maintenance modifications were successfully expressed with MTV Templates. However, the use of legal but unusual typing strategies within the MTV Templates resulted in Ada code that manifested an unrecoverable error in the Sun Ada compiler, v1.1(j), resulting in the abandonment of a complete task sequence.

Of the 12 initial and 56 maintenance tasks, all were completed with the MTV Generator. However, MTV-G failed to generate Ada code for three of the task sequences due to time and space capacity problems within the generator. Post-experiment improvement of the generator subsequently solved these problems and the generator now produces Ada code successfully for all instances. However, the Sun Ada compiler v1.1(j) still cannot handle some generated code components, which exceed compiler-defined capacity limits on line length and file size. All acceptance testing for the MTV Generator solutions was done at the prototyping level, so that the lack of a generated Ada component did not impede progress through the experiment.

5.2 Productivity improvement

The productivity hypothesis, that developers will be more productive with the MTV Generator technology than the MTV Templates, was tested by measuring the amount of work completed per reported hour of effort for each technology. In this experiment, the work product required of each subject was ostensibly the same, so the effort required to perform this work was inversely proportional to his/her productivity. Figure 1 shows the total effort reported by each subject in using each of the two technologies on equal numbers of tasks.¹

Figure 1: Effort expended to accomplish tasks. [Chart axes: effort hours by subject (A to D); series: MTV-G and MTV Templates.]

The distributions of number of tasks completed in each technology versus reported effort hours are displayed in the chart of Figure 2. Because the tails of the distributions are extended, the horizontal axis has been compressed by using a logarithmic scale. Each distribution shows two modes. The mode at the smaller number of hours is associated with the tasks in which a specification was modified to simulate program maintenance. The second mode corresponds to original development tasks.

Figure 2: Distribution of effort hours per task. [Chart axes: number of tasks by effort hours (logarithmic scale, 0.25 to 128); series: MTV-G and MTV Templates.]

¹ The time logged by subjects in support tasks, such as developing or modifying templates to support out-of-scope specifications, investigating problems with acceptance data, or diagnosing compiler errors, was not counted as effort expended in performing the assigned tasks.

The ratio of the means, which is a direct indication of the productivity improvement observed with MTV-G relative to MTV Templates, was 2.92. These data were subjected to an analysis of variance to estimate the probability that the observed difference in the means on 57 tasks is attributable to use of the two different technologies, rather than to unaccounted factors. The calculated significance level is 99.5%. Other measures of the subjects' working time confirm the results obtained from the reported effort hours. The table below shows the results of the three collected effort measures: time spent in editing, total elapsed time spent on-line, and time reported on subject time sheets.

Ratios of average times per task: MTV Templates/MTV-G

           Elapsed Time     Elapsed Time     Reported
           in Editor*       On-Line*         Time
Ratio      3.20             2.67             2.92

* collected automatically

The small number of subjects used in the experiment posed a risk to the conclusion that the productivity differences we observed would be reproducible with other subjects. Note, however, that in spite of having selected individuals who were well matched in their levels of education and experience, Figure 1 shows that their measured productivities differed by ratios of up to 3:1. These differences in individual performance did not correlate strongly with the differences observed in their use of the two technologies. For every subject, the ratio of productivity between using MTV-G and the MTV Templates was nearly the same. This observation is consistent with the thesis that the observed differences in productivity are not particular to these subjects, but can be expected to hold for all subjects with similar training and experience.

In evaluating the implications of these productivity data, it is important to consider what they do and do not include. It is conventional wisdom that there is little to be gained by improving the productivity of software engineers in coding alone, as that activity typically accounts for only 7 to 15 per cent of the total cost in a project. The activity monitored and accounted for in this experiment accounts for far more than just coding. It encompasses conversion of application requirements (the ICD) into a specification, plus design, coding and unit testing of a solution. It does not account for original domain analysis (which is reused), nor for documentation and integration of the generated software artifact. For the family of frequently-repeated software modules that is represented by the MTV domain, the gain in productivity shown by the experiment is real.
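The two quantities reported above, the ratio of mean effort hours and an analysis-of-variance test on the difference of the means, can be illustrated with a minimal sketch. It assumes a plain one-way, two-group design; the paper does not give the details of its analysis, the function name `one_way_anova_f` is hypothetical, and the sample values are invented for illustration and are not the experiment's data.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way analysis of variance."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical effort hours per task (assumption; NOT the experiment's data).
generator_hours = [1.0, 2.0, 3.0]
template_hours = [4.0, 6.0, 8.0]

# Ratio of the means: Templates effort relative to generator effort.
ratio_of_means = mean(template_hours) / mean(generator_hours)
f_stat, df_b, df_w = one_way_anova_f([generator_hours, template_hours])
print(ratio_of_means, f_stat, df_b, df_w)
```

The F statistic would then be compared against the F distribution with the returned degrees of freedom to obtain a significance level; that lookup is omitted here to keep the sketch free of external libraries.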

5.2.1 Productivity and software reuse

The MTV Templates technology exhibits a typical limitation of reuse-based technology that depends on human identification of reuse opportunities. The reuse of MTV Templates was not always obvious or straightforward. In the post-experiment debriefing, subjects commented on template reuse: "Some templates were never used in the experiment. I do not know what they do." "Not all templates were covered in the training, and we had to discover them when we had a new application requiring the template." For reuse of MTV Templates, the user needs to remember the purpose of each individual template file and must have a scheme for file management. Extensions to the MTV Templates solution result in new template files to be documented and understood.

Reuse in MTV-G is accomplished using the specification language as a vehicle. Enhancements in MTV-G are expressed as MSL language extensions that become available to all users of the language. Distribution of enhancements is managed by their inclusion in updated versions of a language description through an accepted process. There is no opportunity for the sorts of reuse "leaks" that occur with templates technology when engineers at different sites reimplement templates to accomplish the same extension. That activity proliferates non-standard templates and inhibits reuse.

5.3 Reliability

As a predictive indicator of the reliability of an MTV module operating in a system, we have measured the number of acceptance tests that a module has failed to pass prior to completion of the task. The instructions given to the subjects were that they could perform as much of their own testing of a module as they believed was necessary, and should not submit it for acceptance testing until they were confident that it was correct. If a module passed its acceptance test on the first submission, its failure score was zero. Each additional time that an acceptance test was run on the module added one to its failure score. Work on a module was not completed until it passed an acceptance test run. The same suite of test data was applied to the module on each run.

Figure 3: Distribution of acceptance test failures. (Axes: number of failures, 0 to 10, against number of tasks, 0 to 40; series: MTV-G and MTV Templates.)

Figure 3 shows the distribution of acceptance test failures for tasks performed in each technology. (The graph for MTV Templates does not show one outlying point, a task for which there were 15 failures before the acceptance tests were passed.) These graphs appear to approximate Poisson distributions. The mean number of failures for MTV-G tasks is 0.8; for MTV Templates tasks it is 1.8. These data were subjected to an analysis of variance, which confirms the significance of this difference with a 97% confidence level (see footnote 2). The ratio of nearly 2.3 in the mean number of failed test runs confirms the hypothesis that fewer defects are introduced in modules implemented with MTV-G.

Footnote 2: In [8] it is suggested that a better estimate of significance of a test for difference of means of Poisson-distributed factors is obtained if the data are preconditioned by taking a square root. This removes the functional relationship between the mean and the variance, if the distribution is indeed Poisson. When this was done, the difference in the means remained significant, at a 95% confidence level.

5.4 Usability

Users' perceptions were captured both in the task assessment forms completed immediately after concluding work on each task and in the post-experiment debriefings.

5.4.1 Perceptions of flexibility

On the task assessment forms, subjects were asked to rate from 1 (easy) to 5 (difficult) the level of difficulty encountered in modifying a specification. The average difficulty for MTV Generator implementations was 1.9 and the average for MTV Templates implementations was 2.8. This indicates a perception on the part of the subjects of greater flexibility in the MTV-G framework. It is interesting to note that in the post-experiment debriefings, two of the four subjects volunteered the misconception that OGI personnel had generated the problem instances and as a result had biased the experiment in favor of MTV-G. The problem specifications had actually been provided by ESC, allowing OGI personnel only enough time to generate acceptance test data before the experiment commenced.

When asked, "Which technology better accommodated the entire range of ICD's used in the experiment?", three of the four subjects identified the MTV-G technology, while the fourth perceived no difference in flexibility between the two technologies. In general, the limitations of flexibility of the MTV Templates as identified by the subjects involved system limits, data representation difficulties, and runtime problem discovery. The subjects encountered scaling problems where they needed to adjust hard limits or handle data that exceeded an anticipated size. Sometimes they were successful, and sometimes they were required to seek only partially satisfactory workarounds such as splitting a large numeric field across two fields. Bit/byte boundary issues also caused problems within MTV Templates implementations. Requirements for new data representations such as variable-length fields and variant record types were handled using new templates.

These kinds of limitations are isolated from the domain programmer with the use of the MTV-G technology. For the MTV Templates, domain programmers were required to be concerned with lower-level system problems and creation of necessary data types. MTV-G allowed the subjects to express domain specifications without delving into non-domain-related issues. Whenever an application programmer must handle system details, it necessarily limits the portability and flexibility of the program produced.
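The kind of low-level workaround mentioned above, splitting a large numeric field across two fields, can be sketched as follows. This is an illustration only: it assumes binary fields and a 16-bit field width, neither of which is stated in the paper, and the function names are hypothetical. The subjects worked with Ada templates, not with code like this.

```python
def split_across_two_fields(value, low_width_bits):
    """Split a non-negative integer into (high, low) parts.

    The low part keeps the low_width_bits least significant bits; the high
    part carries the remaining bits and goes into a second field.
    """
    if value < 0:
        raise ValueError("only non-negative values are handled in this sketch")
    low_mask = (1 << low_width_bits) - 1
    return value >> low_width_bits, value & low_mask


def join_two_fields(high, low, low_width_bits):
    """Reassemble the original value from its two parts."""
    return (high << low_width_bits) | low


# Hypothetical case: a value too large for an assumed 16-bit field.
high, low = split_across_two_fields(70000, 16)
restored = join_two_fields(high, low, 16)
print(high, low, restored)
```

Every consumer of the message has to know about the split and the assumed width, which is one reason such workarounds were only partially satisfactory.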

5.4.2 Perceptions about productivity

The subjects' perceptions of their productivity in using the two technologies confirmed the measured results. Their feedback gives some clues as to the reasons for their better performance in using MTV-G. Their average rating of implementation difficulty over all tasks, on a scale from 1 to 100, was 15 for tasks performed with MTV-G and 37 for those performed with MTV Templates. When asked to rate their agreement from 1 (strongly disagree) to 5 (strongly agree) with the statement, "While working on this task, it was easy to determine which parts of a task were complete", the subjects' average response for MTV-G was 4.2, while the MTV Templates response was 3.7. A greater difference was observed when subjects were asked to rate their agreement with the statement, "While working on this task, it was easy to determine which parts of a task remained to be completed." For MTV-G, the response average was 4.2 while with MTV Templates it was 3.5. All three of these differences were statistically significant at a confidence level of at least 99%.

During the post-experiment debriefing, all four subjects confirmed that they believed that they were more productive using MTV-G, both for initial and for maintenance tasks. When asked more specifically how productivity could be improved with both technologies, three of the four subjects identified the use of a more sophisticated user interface for the MTV Templates as important for productivity gains. One subject stated, "...Template Solution would benefit from a GUI or even a text-based front end which would say these are the types, and don't make me look at the Ada code; go generate the Ada code." This description sounds remarkably like the MTV-G approach. When asked how to further improve productivity using MTV-G, the only suggestion offered was to improve generator performance during Ada code generation.

5.4.3 Confidence of subjects in their solutions

When the subjects were asked which technology afforded them greater confidence in the correctness of a given solution, three of the four expressed greater confidence in MTV-G and the other felt equally confident in both technologies. A complaint about the MTV-G technology that surfaced several times during the post-experiment debriefing sessions is the lack of ability to examine the Ada code. Confidence in the ability of the generator to produce correct Ada code was only moderate, and the personal desire to check the code directly was compelling. In the past, similar misgivings were common during the transition from programming in assembly languages to compiled languages. Programmers then wanted to see the compiler output. As domain-specific languages become more widely accepted, the need to see the generated output to subjectively affirm correctness can be expected to diminish.

Subjects were asked in the task assessment forms to rate their agreement (from 1 (strongly disagree) to 5 (strongly agree)) with the statement, "It was easy to locate errors." The average rating for MTV-G was 3.5, and for the MTV Templates was 3.0. During post-experiment debriefing, however, the subjects' answers to questions about ease of error location were not consistent with these data. Subjectively, only two of the four subjects stated that it was easier to locate errors in MTV-G. In particular, the MSL compiler's error messages were poor, which impeded location of syntactic errors. For the MTV Templates, the error messages were good and the use of the Ada debugger allowed tracing of problems.

Interestingly, the subjects also did not perceive a difference in the likelihood of error insertion between the two technologies, although the measurements of their work indicated that there was a significant difference. When asked to characterize the types of errors made, the subjects noted editing problems for both technologies, and for the MTV Templates, problems with field widths and bit positions. Also, one subject noted that it was difficult to reconstruct or edit an existing template since the textual cues were gone after instantiation.

5.5 Summary of results

This experiment has produced definitive results demonstrating a pronounced advantage for the MTV Generator constructed with SDRR technology in the areas of productivity improvement and a lower rate of delivered defects from applications developers. These advantages have been realized in a direct comparison with what is believed to be the best currently available technology for developing software components. The following table summarizes the quantitative results of the analysis:

                                      MTV-G     MTV Templates     Ratio
Productivity:
  Average effort hours per task       2.80      8.17              2.92
  Std. deviation                      4.2       13.5
Reliability:
  Avg. number of test runs failed     0.8       1.8               2.25
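As a quick arithmetic check, the ratios in the summary table follow directly from the reported means. The sketch below uses only values copied from the table; the variable and function names are illustrative.

```python
# Means copied from the summary table.
effort_hours = {"MTV-G": 2.80, "MTV Templates": 8.17}
failed_runs = {"MTV-G": 0.8, "MTV Templates": 1.8}


def templates_to_generator_ratio(measure):
    """Ratio of the MTV Templates mean to the MTV-G mean for one measure."""
    return measure["MTV Templates"] / measure["MTV-G"]


productivity_ratio = round(templates_to_generator_ratio(effort_hours), 2)
reliability_ratio = round(templates_to_generator_ratio(failed_runs), 2)
print(productivity_ratio, reliability_ratio)
```

Both results agree with the Ratio column (2.92 and 2.25).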

MTV-G also exhibited superior flexibility in handling a greater range of specifications within the original design than the MTV Templates. This result was achieved despite the fact that the MTV Templates had already been used in a deployed system. Furthermore, MTV-G demonstrated robustness in its first trial: the subjects encountered no errors in MTV-G. The most serious problem encountered was MTV-G's inability to handle large message specifications, a problem that has been substantially alleviated in a new version. The subjects favored MTV-G in their perceptions of its usability, although they had had no prior experience with it or any similar technology.

The experiment has confirmed the most important of the hypotheses about the advantages of generating software from specifications. It has demonstrated that SDRR is indeed a highly promising, new technology for developing software components.

References

[1] A. A. Porter, L. G. Votta, and V. R. Basili. Comparing detection methods for software requirements inspections: A replicated experiment. IEEE Transactions on Software Engineering, 21(6):563-575, June 1995.

[2] Victor R. Basili, Lionel C. Briand, and Walcelio L. Melo. Measuring the impact of reuse on quality and productivity in object-oriented systems. Communications of the ACM, 39 (to appear), 1996.

[3] Don Batory, Vivek Singhal, Marty Sirkin, and Jeff Thomas. Scalable software libraries. In Proceedings of ACM SIGSOFT'93, Symposium on Foundations of Software Engineering, December 1993.

[4] Jeffrey Bell et al. Software design for reliability and reuse: A proof-of-concept demonstration. In TRI-Ada '94 Proceedings, pages 396-404. ACM, November 1994.

[5] Pacific Software Research Center. SDRR project Phase I final scientific and technical report, February 1995.

[6] Walter Ellis. Private communication, June 1993.

[7] Martin L. Griss and Kevin Wentzel. Hybrid domain-specific kits for a flexible software factory. In Proceedings of 1994 ACM Software Applications Conference, SAC'94, pages 47-52, March 1994.

[8] N. L. Johnson and F. C. Leone. Statistics and Experimental Design in Engineering and the Physical Sciences, 2nd Edition, volume II. John Wiley & Sons, 1977.

[9] Charles Plinta, Kenneth Lee, and Michael Rissman. A model solution for C3I message translation and validation. Technical Report CMU/SEI-89-TR-12 ESD-89-TR-20, Software Engineering Institute, Carnegie Mellon University, December 1989.

[10] Dennis Volpano and Richard B. Kieburtz. Software templates. In Proceedings Eighth International Conference on Software Engineering, pages 55-60. IEEE Computer Society, August 1985.

[11] Richard Waldinger and Michael Lowry. AMPHION: Towards kinder, gentler formal methods. In Proceedings of the 1994 Monterey Workshop on Formal Methods. U.S. Naval Postgraduate School, September 1994.
