Issues in the Evaluation of User Interface Tools

Len Bass, Gregory Abowd
Software Engineering Institute¹

Rick Kazman
University of Waterloo

Abstract. We define a framework for the evaluation of user interface construction tools in terms of six software engineering qualities. These qualities are understood in terms of who judges the quality – developer or end user – and what artifact is judged – the development tool itself or the systems it produces. We identify four classes of evaluation techniques: feature inspection, architectural inspection, direct developer input, and benchmarking. We then examine each of the quality factors and discuss the effectiveness of different evaluation techniques for each.

1 Introduction

The evaluation and selection of user interface construction tools is a difficult problem. Such tools must be evaluated on criteria relative to the use of the tool and on criteria relative to the tool vendor and its support for the tool. The criteria relative to the use of the tool are quality attributes such as functionality, support for modifiability, reusability, and cost of use to develop other systems. Existing methodologies (e.g., [HS91]) usually compare tools on the basis of tool functionality. There is a need to consider other issues when doing such comparisons, but there is little direction on how such an evaluation should proceed.

In this paper, we present a framework within which such an evaluation can proceed and discuss different techniques for the evaluation. The framework is designed to evaluate selected quality attributes of both the tool and the systems constructed using the tool. The techniques discussed are feature inspection, architectural inspection, direct developer input, and benchmarking.

We have two goals in writing this paper. First, we want someone undertaking such an evaluation to understand what qualities are important, which techniques are appropriate for each quality, and what factors influence a quality. Second, we want to establish that an author proposing a new tool or user interface architecture should identify the quality attributes that are improved by their tool or architecture, and should provide some supporting evidence based on these evaluation techniques.

Overview. In Section 2, we outline the framework for evaluation and define the six quality attributes important for user interface tools. In Section 3, we define the four evaluation techniques. In Section 4, we address each of the six quality attributes and describe how they can be evaluated and what factors contribute to them. In Section 5, we discuss related work on the evaluation of user interface tools before concluding and discussing future directions of this work in Section 6.

1. This work is sponsored by the U.S. Department of Defense.

2 Framework for Evaluation

A number of different software quality taxonomies exist [GC87] [MRW77]. One of these taxonomies has even become embodied in an international standard [ISO91]. The taxonomies specify different relationships among the various quality attributes of a software system. ISO 9126, for example, specifies six primary product qualities (functionality, reliability, usability, efficiency, maintainability, and portability) and many secondary qualities that affect the primary ones. Evaluating the overall quality of a software system is a matter of rating the system with respect to individual quality attributes and then weighting the individual ratings to achieve an overall rating (a small illustration of this weighting appears after Table 1). We concentrate here on the quality attributes generally considered primary in the evaluation of user interface construction tools; the weighting of these attributes to achieve an overall evaluation remains the responsibility of the organization.

When evaluating a user interface construction tool, there are two systems that enter into the evaluation: the tool itself and the systems constructed using the tool, which we call the target systems. In addition to asking what is evaluated – tool or target – we can ask who is most concerned with the quality being evaluated – the user of the tool (or developer) or the user of the target system (or end user). In our evaluation framework, we focus on six quality attributes that, based on our experience, are the most important:
1. modifiability of the target system;
2. construction efficiency, or time to develop the target system;
3. usability of the target system as indicated by the style of interaction provided to the end user;
4. compatibility, both of the tool within the development environment and of the output of the tool within the target system;
5. reusability of previously existing components within the target system; and
6. response time of the target system.

Other quality attributes, such as reliability of the tool, are also important, but these six are the ones usually cited in support of a particular tool or architecture. Table 1 categorizes the above qualities with respect to what is evaluated (columns) and who is concerned with the quality (rows). We can see that the end user is concerned only with the interaction style and the response time of the target system; because the end user sees only systems that have been made to work, issues such as compatibility are not of concern to the end user. The other four qualities are concerns for the developer.

Table 1. Categorization of Qualities for User Interface Tools

                Tool                        Target system
  Developer     compatibility               modifiability
                construction efficiency     compatibility
                                            reusability
  End user                                  interaction style
                                            response time
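To illustrate the weighting step mentioned above (the illustration is not part of the framework itself), the following sketch combines per-attribute ratings with organization-specific weights into a single score. The attribute names follow Table 1; the weights, the rating scale, and the ratings are invented for illustration only.

```python
# Illustrative sketch: combine per-attribute ratings (assumed 0-10 scale) with
# organization-specific weights into a single overall score.

def overall_score(ratings: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of attribute ratings; weights need not sum to 1."""
    total_weight = sum(weights.values())
    return sum(ratings[attr] * w for attr, w in weights.items()) / total_weight

# Hypothetical weights reflecting one organization's priorities.
weights = {
    "modifiability": 3, "construction efficiency": 2, "interaction style": 2,
    "compatibility": 1, "reusability": 1, "response time": 1,
}
# Hypothetical ratings for one candidate tool.
ratings_tool_a = {
    "modifiability": 7, "construction efficiency": 8, "interaction style": 6,
    "compatibility": 9, "reusability": 5, "response time": 7,
}
print(f"Tool A overall score: {overall_score(ratings_tool_a, weights):.2f}")
```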

3 Evaluation Techniques

As we mentioned above, we highlight four different techniques useful for the evaluation of quality attributes. In order of increasing cost to the evaluator, they are: feature inspection, architectural inspection, direct developer input, and benchmarking. Of these four techniques, feature inspection and architectural inspection are objective, since the results are individually repeatable. Direct developer input and benchmarking are subjective, since the results are only statistically repeatable. We further define these techniques in the next four subsections.

3.1 Feature Inspection

Feature inspection is an easy technique to apply and one in common use. It involves making a checklist of characteristics that are important for a given quality attribute and then determining whether the tool or the target system has each characteristic. For example, one feature that affects reusability is language compatibility; features that affect interaction style include the supported input and output devices. The characteristics that can be included on such a checklist are ones that can be assigned a limited number of possibilities (e.g., yes/no). One cannot use this technique to evaluate characteristics that are continuous rather than discrete.

3.2 Architectural Inspection

The qualities of concern to the developer depend on both the development environment of the tool and the software architecture that the tool imposes on the target system. The architecture of a software system can be described from two perspectives: the functional partitioning of its domain and its structure.

A system's functionality is what the system does. It may be a single function or a bundle of related functions that together describe the system's overall behavior. For large systems, a partitioning divides the behavior into a collection of logically separated functions that together compose the overall system function but individually are simple to describe or otherwise conceptualize. Typically, a single system's functionality is decomposed through techniques such as structured analysis or object-oriented analysis, but this is not always the case. When discussing architectures for a broader domain, the functional partitioning is often the result of a domain analysis. One sign of a mature domain, such as user interface software, is that over time some functional partitioning emerges that a community of developers adopts. That is, the partitioning of functionality has been exhaustively studied and is typically well understood, widely agreed upon, and canonized in implementation-independent terms. We return to this idea when discussing analysis for modifiability in Section 4.1.

A system's software structure contains the following information (a small structural sketch follows the list):

• a collection of components that represent computational entities, e.g., modules, procedures, processes, or persistent data repositories;
• a representation of the connections between the computational entities, i.e., the communication and control relationships among the components; and
• a description of the behavior of interconnected components and connectors.
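For concreteness, the following is a minimal sketch of how such a structural description might be recorded. The paper does not prescribe a notation, and the component and connector names below are hypothetical.

```python
# Minimal sketch of a structural description: components, connectors, and the
# relationships among them. The example system is hypothetical.
from dataclasses import dataclass, field

@dataclass
class Component:
    name: str
    kind: str  # e.g., "module", "process", "data repository"

@dataclass
class Connector:
    source: str    # name of the originating component
    target: str    # name of the receiving component
    relation: str  # e.g., "calls", "sends events to", "reads"

@dataclass
class Structure:
    components: list[Component] = field(default_factory=list)
    connectors: list[Connector] = field(default_factory=list)

structure = Structure(
    components=[
        Component("presentation", "module"),
        Component("dialogue", "module"),
        Component("application", "module"),
    ],
    connectors=[
        Connector("presentation", "dialogue", "sends events to"),
        Connector("dialogue", "application", "calls"),
    ],
)
```

A behavioral description (the third item in the list) would be layered on top of such a skeleton, for example as annotations on the connectors.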

Architectural inspection for a quality attribute will depend on either the functional or the structural perspective, or both.

3.3 Direct Developer Input

Direct developer input involves querying and observing past and current users of particular tools; in our taxonomy, these users are the developers of the target system. Developer polling is the querying of developers for their opinions on the qualities of interest. A list of the qualities of interest is provided to some collection of users of individual tools. Each user provides some demographic information, such as the extent of experience with the tool and the types of systems on which they have used the tool, and then rates the tool with respect to the qualities on the questionnaire (a small tabulation sketch appears after the list of assumptions below).

There are several assumptions behind such polling:

• The developers who are polled are representative of the true user population in terms of background, the type of systems for which the tool would be used, and the types of use of the tool.
• The sample is statistically valid.
• The questions used on the questionnaire will be understood in the same way by all respondents (e.g., different developers may have different understandings of the words “usable” and “efficient”).
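The following is a minimal sketch of how polling results might be tabulated. The rating scale, the qualities shown, and the responses are assumptions made for illustration, not part of the technique as described above.

```python
# Illustrative sketch: average developer ratings per quality attribute.
# Each response pairs demographic data with ratings on an assumed 1-5 scale.
from statistics import mean

responses = [
    {"experience_years": 3, "ratings": {"modifiability": 4, "construction efficiency": 3}},
    {"experience_years": 1, "ratings": {"modifiability": 2, "construction efficiency": 4}},
    {"experience_years": 5, "ratings": {"modifiability": 5, "construction efficiency": 3}},
]

qualities = ["modifiability", "construction efficiency"]
for quality in qualities:
    scores = [r["ratings"][quality] for r in responses]
    print(f"{quality}: mean rating {mean(scores):.1f} (n={len(scores)})")
```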

Other techniques for getting direct input from developers, such as heuristic evaluation and user testing, are described in [HH93].

3.4 Benchmarking

Benchmarking involves identifying tasks that can be performed on different software artifacts (in our case, user interface tools) and some measurable characteristic of each task that serves as a means of comparison. In the context of user interface design, one benchmark task would be the implementation of a particular interface for a given application from scratch; the measurable characteristic would be the time it takes to implement the interface successfully. There are several common-sense, but frequently violated, assumptions involving representativeness and independence that are made in using this method:

• The benchmark tasks are representative of the actual tasks that will be demanded. For user interface design, we would want the application to be a typical application for the organization, and we would want the characteristics of the interface to be consistent with the kinds of interfaces the organization expects to develop.
• The test cases can either be abstracted from usage of current applications or constructed to reflect usage of future applications. If the test cases reflect current usage, the assumption is that the tool chosen won’t affect the interfaces constructed, only the cost of constructing those interfaces.
• Those who perform the benchmark tasks (the subjects) are representative of the population that will actually be using the tool. Since much of what is being measured concerns human performance, the subjects must be chosen as carefully as the set of tasks.
• Evaluators must account for learning effects for a single subject. Any problem a subject encounters in completing a task may recur across the different tools, and we would expect the subject to be better at solving it in later trials. One technique for compensating for learning effects is to counterbalance the order in which different subjects encounter the different tools, as in the sketch below.
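A simple way to realize such counterbalancing, assuming every subject uses every tool, is to rotate the order of the tools across subjects. The tool names in the sketch are hypothetical.

```python
# Sketch: rotate tool order across subjects so that each tool appears in each
# position equally often (when the number of subjects is a multiple of the
# number of tools). Tool names are hypothetical.
tools = ["Tool A", "Tool B", "Tool C"]

def counterbalanced_orders(tools: list[str], n_subjects: int) -> list[list[str]]:
    orders = []
    for subject in range(n_subjects):
        shift = subject % len(tools)
        orders.append(tools[shift:] + tools[:shift])  # cyclic rotation
    return orders

for subject, order in enumerate(counterbalanced_orders(tools, 6), start=1):
    print(f"subject {subject}: {' -> '.join(order)}")
```

This rotation balances the position in which each tool is encountered; a balanced (Williams-style) Latin square would additionally balance which tool immediately precedes which, at the cost of a slightly more involved assignment.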

3.5 Classification of Qualities by Evaluation Technique

In Section 4 below, we will address each important quality factor. Table 2 summarizes which techniques are used to evaluate the various qualities.

Table 2. Evaluation Techniques for Quality Attributes

  Technique                   Quality
  Feature Inspection          reusability, compatibility, interaction style
  Architectural Inspection    modifiability, reusability, construction efficiency, response time
  Direct Developer Input      all
  Benchmarking                modifiability, construction efficiency

4 Evaluating Quality Attributes for User Interface Tools

In this section, we discuss the various quality attributes that we have identified, the factors that influence those attributes, and how the analysis techniques are used to perform the evaluation. Because direct developer input can be used to evaluate any quality attribute, we do not discuss it individually as a technique applicable to particular qualities.

4.1 Modifiability

Oskarsson gives an informal characterization of classes of modifications in [Osk82]. Drawing on his work, we have enumerated the following classes of modifications:
1. Extension of capabilities: adding new functionality, enhancing existing functionality.
2. Deletion of unwanted capabilities: e.g., to streamline or simplify the functionality of an existing application.
3. Adaptation to new operating environments: e.g., processor hardware, I/O devices, logical devices.
4. Restructuring: e.g., rationalizing system services; modularizing, optimizing, creating reusable components.

Within the user interface domain, the types of modifications usually discussed are the first and the third: the addition of new functionality or a new “look and feel” to the user interface, and the replacement of one toolkit or platform by another. The replacement of a toolkit or platform is primarily an architectural issue; our concern is not the generation of the new toolkit but its inclusion into the target system and the effect the new toolkit has on the rest of the target system. The extension of capabilities is both an architectural issue and a development environment issue. For example, the addition of a new icon in the user interface requires modification of the current target system (the architectural concern) as well as the generation of the new icon and its placement within the existing user interface (the development environment concern).

“Modifiability,” even with the above breakdown, is by itself an abstract concept. A particular implementation of a collection of functionality may make one modification very easy and another very difficult, whereas a different implementation may reverse the difficulty. This suggests that modifiability must be discussed in the context of a set of likely modifications. Each candidate architecture can then be evaluated according to how well it supports each sample modification. This is an example of inspecting the architecture; a previous paper [KBA94] gives the details.

So-called “reference architectures” [UIMS92] [Cou87] [Pfa85] have been developed based on assumptions about the modifications a system will undergo. That is, they assume that modifications to the functionality of the target system, the platform on which the target system runs, and the user interface are the most common. This leads to a type of architectural inspection based on the comparison of a reference architecture to the architecture of the target system.

There are thus two different architectural inspection techniques for modifiability. The first relies on first principles and requires the evaluator to define a population of likely modifications; the evaluation is then performed based on samples from that population. The second compares a reference architecture to the architecture of the target system. In either case, the measure used is the locality of the modification: the more components involved in a particular modification, the worse the design is for that modification (a small sketch of this measure appears below). For the aspects of modifiability that are not captured through architectural evaluation (those depending on language or tool support), a benchmark test can be constructed. This benchmark involves making the sample modifications to a set of test cases and measuring the time it takes to complete the modifications.
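The following sketch makes the locality measure concrete. The architectures, component names, and modification scenarios are hypothetical; the paper defines the measure only informally.

```python
# Sketch: compare candidate architectures by the number of components each
# sample modification touches (smaller is better). All data are hypothetical.

# For each architecture, map each sample modification to the components it affects.
architectures = {
    "architecture A": {
        "add new icon":            {"presentation"},
        "replace toolkit":         {"presentation", "dialogue"},
        "add application feature": {"application", "dialogue"},
    },
    "architecture B": {
        "add new icon":            {"presentation", "dialogue", "application"},
        "replace toolkit":         {"presentation"},
        "add application feature": {"application"},
    },
}

for name, scenarios in architectures.items():
    touched = [len(components) for components in scenarios.values()]
    print(f"{name}: average components touched = {sum(touched) / len(touched):.2f}")
```

The uniform average here is a simplification; scenarios could instead be weighted by their expected likelihood.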

4.2 Construction Efficiency

Construction efficiency is the rate at which the target system can be constructed. There are two aspects to it, both of interest to the evaluation of user interface construction tools. The target system is a collection of components that have been integrated to form the total system; the two aspects of construction efficiency are (1) the efficiency of constructing the components and (2) the efficiency of the integration process. The efficiency of constructing the components depends partially on architectural considerations and partially on the development environment. The architectural considerations are:

• Power of data and control mechanisms. Constraint systems, for example, are provided in the runtime portion of some user interface construction tools, and they are used to reduce construction time by providing a single mechanism that makes control passing implicit within a data relationship (see the sketch after this list).
• Structure and applicability of reused components. Reused components that solve the problem at hand and that fit into the architecture of the target system will reduce construction time.
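As a rough illustration of implicit control passing within a data relationship (real constraint systems in user interface tools are considerably richer), the sketch below re-runs a dependent computation whenever the value it depends on changes. The class and names are invented for illustration.

```python
# Sketch of a one-way constraint: when a source value changes, dependent
# computations are re-run automatically, so the developer writes no explicit
# control passing between the data and the display.

class Constrained:
    def __init__(self, value):
        self._value = value
        self._dependents = []  # callbacks re-run whenever the value changes

    def depends(self, callback):
        self._dependents.append(callback)
        callback(self._value)  # establish the constraint immediately

    def set(self, value):
        self._value = value
        for callback in self._dependents:
            callback(value)    # constraint maintenance, not explicit control flow

temperature = Constrained(20)
temperature.depends(lambda v: print(f"gauge widget redrawn at {v} degrees"))
temperature.set(25)  # the display updates without the caller invoking the widget
```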

The influence of the development environment on the construction efficiency of the components is manifested through the following aspects:

• The power and ease of use of the user interface specification mechanisms. Graphical layout tools provide easy-to-use mechanisms to specify the appearance of a user interface. Even more powerful, but more difficult to use, are those tools that enable the specification of the dynamic behavior of the interface.
• Existence and accessibility of a reuse library. To reuse components, one must be able to locate them and assess their utility. This is a function of the development environment.

The efficiency of the integration process is primarily an architectural issue. The criterion is whether the architecture provides for a component integration mechanism. Those user interface architectures [Cou87] [BCH+90] [TJ93] that provide composition mechanisms are better suited for the construction of large systems.

The analysis techniques suitable for the evaluation of construction efficiency are the two types of inspection. Inspection of the target architecture will determine whether it has composition mechanisms, and feature inspection will enable the determination of the power of the various data and control mechanisms. Inspection of the architecture and the development environment will enable an evaluation of the power, applicability, and availability of the reusable components.

Benchmarking can be used to determine the power of the specification mechanisms. A collection of representative user interfaces, together with enough of the application to provide some interaction, is implemented using the user interface construction tool. Again, the time for implementation is the primary measure of the power of the tool for the applications of choice. When performing such a benchmarking evaluation, the information collected must reflect whether the tool was sufficiently powerful for the task at hand. For example, in our evaluation experiment constructing a simple direct manipulation editor with several of the standard interface development tools, we were forced to perform some of the selection and feedback operations by interacting directly with the X Window System rather than through the tools. This increased development time and also indicated how poorly the particular tool matched the types of interfaces being constructed.

4.3 Interaction Style of Target System

Any organization doing an evaluation must decide on the types and styles of interfaces in the systems it wishes to construct. An organization that develops process control software will require different types of interfaces than an organization that develops office automation software. Some interaction styles are in widespread use, such as form-filling and menu styles, and standard toolkits have been designed for these types of interactions. Less common interaction styles, such as gesture recognition, do not have standard toolkits available. An organization must also decide whether it wishes to have dynamic interfaces, in which the state of the widgets changes in real time to reflect the state of the underlying system. For example, an interface for air traffic control must have some representation of the aircraft that changes position on the screen to reflect the change in position of the aircraft. Different user interface construction tools cater to different interaction styles, and an organization must know whether any one toolkit will cover all anticipated interaction styles. Feature inspection is the evaluation technique to use in this case; Hix and Schulman [HS91] give an extensive checklist for performing this type of evaluation.

4.4 Compatibility

Two types of compatibility are important: the compatibility of the tool within the development environment and the compatibility of the output of the tool with the remainder of the target system. The user interface construction tool is one part of a development environment and, as such, must operate in conjunction with other elements of that environment, such as compilers, configuration management systems, and project management tools. This can be evaluated by feature inspection. The output of the construction tool is one part of the target system and, as such, must operate in conjunction with the other parts. Language and platform compatibility are two aspects of this compatibility requirement.
Again, feature inspection is the analytic technique used for this evaluation.
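The following sketch shows how such a feature-inspection checklist might be recorded and applied. The checklist items and the candidate tool's answers are invented for illustration and are not a recommended set.

```python
# Sketch: a yes/no compatibility checklist applied to one candidate tool.
# Checklist items and answers are illustrative assumptions.

checklist = [
    "generates code in the project's implementation language",
    "links against the organization's standard widget toolkit",
    "runs on the target platform",
    "can be driven from the configuration management system",
]

tool_answers = {
    "generates code in the project's implementation language": True,
    "links against the organization's standard widget toolkit": True,
    "runs on the target platform": False,
    "can be driven from the configuration management system": True,
}

satisfied = [item for item in checklist if tool_answers.get(item, False)]
print(f"{len(satisfied)}/{len(checklist)} compatibility features present")
for item in checklist:
    mark = "yes" if tool_answers.get(item, False) else "no"
    print(f"  [{mark}] {item}")
```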

4.5 Reusability

There are (at least) three different types of reuse associated with any system:
1. reuse of components within the current development activity;
2. reuse of components from previously developed applications; and
3. reuse of components for the benefit of future development activities.

Our focus is on the second type of reuse: the reuse of previously developed components. In the user interface domain, there are a large number of standard widget sets. To reuse previously existing components within the target system, two aspects of those components must be compatible with the target system:
1. the functionality achieved by the components; and
2. the structural types used in the construction of these components.

As with modifiability, one cannot discuss reusability in the abstract; one must choose a specific set of widgets (or a specific toolbox) to reuse. In the user interface domain, there is a small set of industry standard toolboxes (e.g., Motif, Windows, Apple Desktop). Specific organizations may have customized widget sets, but in any case there is a small set of toolboxes that can be considered for reuse.

Architectural inspection for reusability depends on choosing a particular toolbox and evaluating the suitability of the architecture of the target system for reusing the widgets within that toolbox. The assumption is that all elements of the toolbox have the same interface with the target system. If the interface is suitable, the evaluation consists of inspecting the target system’s software architecture to determine whether specific user interface functionality is achieved using the components available from the toolbox.

4.6 Response Time

Response time is another quality attribute that has multiple influences. Some influences on response time are the algorithms used, the complexity of the problem being solved, and the coding efficiency. The software architecture is another influence, since there are costs of communication and of transfer of control. The importance of architecture depends on the system being constructed: in some systems, architectural considerations play a small role in the total efficiency of the system, and in others they play a larger role. Evaluating a software architecture for response time requires inspecting the structural types used for computation and coordination. Associating each component, communication mechanism, and control mechanism with its respective resource consumption enables a very approximate estimate of the total resource consumption of the system.
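A very rough sketch of this kind of estimate follows. All of the costs, component names, and the interaction path are assumptions made for illustration; in practice the figures would come from measurement or vendor data.

```python
# Sketch: approximate the response time of one interaction by summing the
# (assumed) costs of the components and connectors it exercises.

component_cost_ms = {"presentation": 4.0, "dialogue": 2.5, "application": 30.0}
connector_cost_ms = {
    ("presentation", "dialogue"): 0.4,   # e.g., event dispatch
    ("dialogue", "application"): 1.2,    # e.g., inter-process call
}

# The path exercised by a hypothetical "update display" interaction.
path = ["presentation", "dialogue", "application", "dialogue", "presentation"]

estimate = sum(component_cost_ms[c] for c in path)
estimate += sum(
    connector_cost_ms.get((a, b), connector_cost_ms.get((b, a), 0.0))
    for a, b in zip(path, path[1:])
)
print(f"estimated response time: {estimate:.1f} ms")
```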

5 Related Work

This work sits at the intersection of three different areas of research. One of these areas is the emerging field of software architectures [GS93]. Another is the understanding of software quality attributes [ISO91]. The third area is user interface software [BD93]. Each of these has an extensive literature, and rather than attempt to survey the literature here, we have given citations to survey articles. There is also extensive literature on benchmarking as an evaluation technique for CPU efficiency (e.g., [UIMS92]). The use we make of benchmarking here should be familiar to human factors specialists [HH93] since many of the experimental design issues are common within that field.

The attempt to evaluate construction tools by reference to multiple software quality factors is closest in spirit to the ISO 9126 work, but the use of software architecture as a basis for evaluation is closest in spirit to the various user interface reference models.

6 Conclusions and Future Work

User interface construction tools should be evaluated in terms of the same qualities as other software. The strength of this work, and its contribution, is the establishment of a framework for that evaluation. Assuming that we have chosen the correct attributes to consider and correctly identified the factors that influence them, the work reported here raises many questions that serve as a basis for future work. For example:

• The predictions of architectural inspection, such as those regarding modifiability, must be validated with actual data.
• We have asserted the importance of software architecture for use in a particular kind of inspection technique. We have not, however, explained what percentage of a quality attribute is analyzable in terms of architecture.
• We have not suggested what other features besides the architecture might be important subjects for inspection. Assuming other subjects of inspection beyond the software architecture exist, we must determine inspection techniques for them.
• Since benchmarking is both expensive and somewhat problematic (see Section 3.4), a research goal that becomes apparent within this framework is to develop inspection techniques to evaluate those quality attributes that previously would have required benchmarking.

References

[BCH+90] L. Bass, B. Clapper, E. Hardy, R. Kazman, and R. Seacord. Serpent: A user interface management system. In Proceedings of the Winter 1990 USENIX Conference, pages 245-258, Berkeley, California, January 1990.

[BD93] L. Bass and P. Dewan (eds.). Trends in Software: User Interface Software. John Wiley & Sons, New York, 1993.

[Cou87] J. Coutaz. PAC, an implementation model for dialog design. In Proceedings of Interact ’87, pages 431-436, Stuttgart, Germany, September 1987.

[GC87] R.B. Grady and D.L. Caswell. Software Metrics: Establishing a Company-Wide Program. Prentice-Hall, Englewood Cliffs, New Jersey, 1987.

[GS93] D. Garlan and M. Shaw. An introduction to software architecture. In Advances in Software Engineering and Knowledge Engineering, Volume I. World Scientific Publishing, 1993.

[HH93] H.R. Hartson and D. Hix. Developing User Interfaces: Ensuring Usability Through Product and Process. John Wiley & Sons, New York, 1993.

[HS91] D. Hix and R. Schulman. Human-computer interface development tools: A methodology for their evaluation. Communications of the ACM, 34(3): 74-87, March 1991.

[ISO91] ISO/IEC, International Standard 9126. Information Technology - Software Product Evaluation - Quality Characteristics and Guidelines for Their Use. ISO/IEC Copyright Office, Geneva, Switzerland, 1991.

[KBA94] R. Kazman, L. Bass, G. Abowd, and M. Webb. SAAM: A method for analyzing the properties of software architectures. In Proceedings of ICSE-16, Sorrento, Italy, May 1994.

[MRW77] J. McCall, P. Richards, and G. Walters. Factors in Software Quality, three volumes. NTIS AD-A049-014, 015, 055, November 1977.

[Osk82] Ö. Oskarsson. Mechanisms of modifiability in large software systems. Linköping Studies in Science and Technology, Dissertations No. 77, 1982.

[Pfa85] G. Pfaff (ed.). User Interface Management Systems. Springer-Verlag, New York, 1985.

[Sof89] Software Engineering Institute. Serpent Overview (CMU/SEI-91-UG-1, ADA240606). Carnegie Mellon University, Pittsburgh, Pennsylvania, August 1989.

[TJ93] R. Taylor and G. Johnson. Separations of concerns in the Chiron-1 user interface development and management system. In Proceedings of InterCHI ’93, pages 367-374, Amsterdam, May 1993.

[UIMS92] UIMS Tool Developers Workshop. A metamodel for the runtime architecture of an interactive system. SIGCHI Bulletin, 24(1): 32-37, January 1992.