Proceedings of the 38th Hawaii International Conference on System Sciences - 2005
A Framework for Evaluating Collaborative Systems in the Real World

Michelle Potts Steves and Jean Scholtz
National Institute of Standards and Technology
[email protected] and [email protected]

Abstract

Collaboration technologies are seeing widespread deployment even though it is difficult to assess the effectiveness of these systems. This paper presents an evaluation method that addresses this issue and reports on use of the method in the field. The method uses a framework to structure evaluations by mapping system goals to evaluation objectives, metrics, and measures. The uppermost levels of the framework are conceptual in nature, while the bottom level is implementation-specific, i.e., evaluation-specific. Capitalizing on this top-down approach, an evaluation template specifying the conceptual elements can be constructed for a series of evaluations; implementation-specific measures are then specified for individual experiments. This structure makes the framework well suited to comparing the effectiveness of collaboration tools in a particular environment. We present our findings from use of the method in the field to assess the performance of a particular collaboration technology deployment and its impact on the work process.
1. Introduction

Many different types of software evaluations can be performed. Technical or algorithmic evaluations focus on software performance aspects such as computational speed, throughput, accuracy, and usability. In these evaluations a benchmark or "ground truth" is required for comparison. While necessary to inform system development efforts, these evaluations do not tell the whole story of how the software system will perform in the real world. The next step in evaluating these systems is to use real data; unfortunately, "ground truth" does not often exist in the real world. However, evaluations can be structured to assess improvements, e.g., in effort and speed, compared to current conditions. Finally, to achieve a more comprehensive picture of how a system will ultimately perform, the software system must be evaluated using actual data, real-world users, and their processes. This is impossible to achieve in a laboratory-controlled experiment; users change, tasks change, and data change. How, then, can we make valid comparisons under these conditions?

In this paper, we describe an evaluation framework [18] we have developed. We are using this framework on several large research and development projects, e.g., [10]. The multi-level framework allows us to define metrics and measures in the context of system goals and evaluation objectives, and to "roll up" or combine those assessments for an overall evaluation of a software system. Further, the conceptual levels of the framework provide for comparison, while the lower levels provide for application-specific feedback. We anticipate use of the framework as an aid to evaluation designers in systematically mapping system goals and evaluation objectives down through metrics and measures. This approach has several important benefits, namely: 1) articulation of the most important system goals and evaluation objectives focuses the evaluation during evaluation design, 2) data collection efforts are focused since required data elements are specified, 3) data interpretation is tied to stated goals, since the measurement data are tied to system goals, and 4) assessments of effectiveness with respect to both product and process can be conducted in operational environments. These benefits can be realized by use of the evaluation framework for any software evaluation. Note that the framework provides the mapping between framework elements; it is not a recipe for specifying an evaluation. It remains the evaluation designer's responsibility to determine the critical factors for success in any given evaluation scenario and to select the evaluation objectives, metrics, and measures that best assess those issues for the scenario.
2. Background

2.1. Problem statement

Evaluating collaboration software is difficult for many reasons. Grudin [7] notes that producing generalizable evaluations is difficult. Many factors confound evaluations, including the makeup of the group of users in the collaboration effort, the types of tasks those users undertake, and the necessity to interface with software and data outside of the collaboration space. For example, as the number of users of a system increases, design factors that did not surface with a small group can suddenly become an issue [16]. Additionally, the tasks that different groups anticipate performing with the collaborative software certainly have a profound impact.
An early framework for evaluation [4] discussed the various types of evaluations that might be done: feasibility, iterative, comparative, and appropriateness. Feasibility evaluations are based on performance and cost; iterative evaluations look for improvements in the collaborative software; comparative and appropriateness evaluations are of more interest to us, as they focus on comparing systems or determining whether a given system has the capabilities necessary to be useful in an organization's process. However, we are interested in going beyond evaluating the current fit between work and a given collaborative tool; we are interested in determining how the work is changed by the insertion of the tool. In particular, we are interested in how the organizational process or products are impacted. While we do not claim to have solved the problem of producing generalizable evaluations, we do believe we have developed a re-usable evaluation framework that can be applied in real-world settings for comparative and appropriateness evaluations.
2.2. Goal-oriented evaluation paradigms

Rombach states that for measurement to be successful, effective 'top-down' strategies that derive metrics from goals and interpret measurement data in the context of those goals are needed [15]. The literature offers several approaches that identify useful metrics from goals in this 'top-down' manner, which is considered a major improvement over the commonly used 'bottom-up' approach to measurement. They are the Quality Function Deployment (QFD) approach by Akao [9], the Software Quality Metrics (SQM) approach by Murine [12], based on prior work by Boehm [3] and McCall [11], and the Goal/Question/Metric (GQM) approach by Basili [1, 2]. While all three methods share the top-down measurement approach, they vary significantly in the scope of supported measurement goals and potential uses [15]. We focus on the GQM method since it is identified as one of the most widely used of such methods. Case studies of its use have shown sizable value and adaptability to specific environments [5]. Furthermore, GQM can be used for process as well as product quality, whereas SQM and QFD are limited to product quality [15]. Rombach further states that "GQM benefits include its general applicability to all kinds of measurement goals, as well as its support for identifying and tailoring of metrics for interpreting collected data in context, for validating the usefulness of the selected metrics early on, for involving all interested parties in the measurement process, and for protecting sensitive data." [15] GQM was developed for use with software improvement projects, and indeed, the majority of research and development software projects, and arguably many software selection and deployment projects, can be viewed as such. The GQM paradigm prescribes setting goals in tractable and operational ways.
Goals are then refined into a set of quantifiable questions that specify metrics. Data are tied to specific metrics that in turn are tied to specific goals. Because of this clear association, Basili and Rombach [1] state that use of GQM should “help in the interpretation of the results of the data collection process,” and that GQM facilitates identification of the link between the actual data and the purpose for its collection. The GQM method further prescribes a process of goal, question and metric selection via a set of templates and guidelines.
3. Evaluation framework

To construct our framework, we used a top-down approach. Of the methods described above, our framework is most like the GQM method. Although we recognize the benefits of the GQM method, the projects to which we planned to apply our framework had specific requirements that precluded use of the GQM method outright. Specifically, more leeway regarding how the various elements were specified was required, and we needed a differentiation between conceptual and implementation-specific measures, which GQM does not provide [17].

Our framework consists of five hierarchical levels or elements: system goals, evaluation objectives, conceptual metrics, conceptual measures, and implementation-specific measures. Each successive element in the framework is a refinement in scope. This refinement can also be thought of as a sphere of concern or level of abstraction. In the framework, a system goal may have one or more evaluation objectives. Each evaluation objective will have one or more conceptual metrics which, when assessed with their associated measures, contribute to an assessment of whether, and potentially how well, a particular system goal was met. First, we present a diagram of the framework and then discuss each of the elements in more detail.
Figure 1: Evaluation framework. The figure depicts the five levels of the framework, from top to bottom: System Goals (technical performance, process, and product goals); Evaluation Objectives (partitioned evaluation concerns); Metrics, conceptual (attribute assessments); Measures, conceptual (performance indicators); and Measures, implementation-specific (measured values).
3.1. System goals

The top-level element, system goal, is the intended benefit or functionality the software system will provide. Each goal may specify either a software technical performance goal or an organizational goal, e.g., a product or process improvement goal. In the design of the evaluation, the system goal element provides two critical ingredients: it determines which aspect(s) of the system are of primary importance in the work domain, and it poses the high-level question the evaluation must answer, namely whether the stated goal was met. At this uppermost level, the system goal constitutes the beginning of the evaluation design and the end of the analysis process. Therefore, careful construction of system goal statements is critical so that relevant evaluation questions will be formulated in the evaluation design and subsequently answered in the evaluation analysis.
3.2. Evaluation objectives
Evaluation objectives are at the next level of the framework. As such, evaluation objectives provide further refinement in the ‘top-down’ design of the evaluation. Each evaluation objective drives an aspect of the evaluation in the context of its parent system goal. Therefore, evaluation objectives for any particular system goal should be specified in such a way as to cover all the aspects of the system goal that are to be evaluated while not overlapping inappropriately with each other.
3.3. Metrics and measures

The lowest three levels in the framework constitute metrics and measures. Although they are somewhat interwoven, differences in levels of abstraction can be discerned. We start by defining the terms used. The Institute of Electrical and Electronics Engineers (IEEE) Standard for a Software Quality Metrics Methodology defines the term 'software quality metric' as "a function whose inputs are software data and whose output is a single numerical value that can be interpreted as the degree to which software possesses a given attribute that affects its quality." [8] We define a metric as the interpretation of one or more contributing elements, e.g., measures or other metrics, corresponding to the degree to which that set of attribute elements affects the software's quality. The interpretation can be a human assessment of the contributing elements or a computation, often a sum of the weighted values of the contributing elements. When weights are used, they are assigned to elements according to their importance to the respective attribute in the particular assessment scenario. One can see that our definition differs somewhat from the IEEE definition. The IEEE definition says that a metric is the assessment of one attribute, where an attribute is defined as "a measurable physical or abstract property of an entity" [8]. Our definition recognizes that an attribute may have multiple elements, referred to in this paper as measures, which contribute to the attribute's assessment, and that the assessment of a metric may be computed or interpreted. For example, a complicated attribute of a software component, such as efficiency, is partially derived from the interpretation of elements, i.e., measures, such as time, user ratings, and tool usage.

We define the term measure, a noun, as a performance indicator that can be observed singly or collectively. This concept corresponds to the term 'element' used to assess an attribute. Measures are countable events; they can be directly observed, computed, or calculated, and may at times be automatically collected. A simple way to distinguish between metrics and measures is the following: a measure is an observable value, while a metric associates meaning with that value by applying human judgment, often through a formula based on weighted values from the contributing elements or measures.
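The computed form of this interpretation can be sketched in a few lines of Python. This is only an illustration under assumptions of our own: the measure names, weights, and 0-to-1 normalization below are hypothetical, and in many evaluations the interpretation is a human judgment rather than a formula.

```python
# A computed metric as a weighted combination of contributing measures.
# Measure names, weights, and the 0-to-1 normalization are hypothetical;
# the framework leaves the form of the interpretation to the evaluation designer.

def metric_score(measures: dict, weights: dict) -> float:
    """Combine contributing measures (normalized to 0..1) into a single metric value."""
    total_weight = sum(weights.values())
    return sum(weights[name] * measures[name] for name in weights) / total_weight

# Example: an 'efficiency' metric interpreted from three contributing measures.
efficiency = metric_score(
    measures={"time": 0.7, "user_rating": 0.8, "tool_usage": 0.5},
    weights={"time": 0.5, "user_rating": 0.3, "tool_usage": 0.2},
)
print(f"Efficiency metric: {efficiency:.2f}")
```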
3.3.1. Metrics. Using the top-down design approach advocated in this framework, a metric is scoped by its parent evaluation objective. Likewise, assessments for each of the metrics within the set of metrics for a particular evaluation objective inform the assessment of that evaluation objective. Each metric scopes and is informed by its associated measures.

3.3.2. Measures – conceptual and implementation-specific. The lowest two levels in the framework constitute 'measures.' There are two levels of abstraction for measures: conceptual and implementation-specific. Conceptual measures identify the type of data to be collected. Implementation-specific measures identify the specifics about a particular collection instance, e.g., data element(s), associated tool, collection method, and so on. Once a value is obtained for an implementation-specific measure, that value is analogous to the term 'measured value' [14]; specifically, it is "the numerical result obtained from the application of a measurement method to an object, possessing a quantity." [14] The following example shows the difference between conceptual and implementation-specific measures: task completion time (a conceptual measure) can be calculated in different ways with different implementation-specific measures. The calculation can be based on start and end times, or on start time and duration.
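A brief sketch may make the distinction concrete: the same conceptual measure, task completion time, is realized below by two hypothetical implementation-specific measures. The function names and data layout are ours; an actual evaluation plan would name the real data elements, collection tool, and collection method.

```python
# The same conceptual measure ("task completion time") realized by two hypothetical
# implementation-specific measures: (1) start and end timestamps, (2) a logged duration.
from datetime import datetime, timedelta

def completion_time_from_endpoints(start: datetime, end: datetime) -> timedelta:
    # ISM variant 1: derive the duration from collected start and end timestamps
    return end - start

def completion_time_from_duration(logged_seconds: float) -> timedelta:
    # ISM variant 2: use a tool-logged duration directly
    return timedelta(seconds=logged_seconds)

start = datetime(2004, 6, 1, 9, 0, 0)
print(completion_time_from_endpoints(start, datetime(2004, 6, 1, 9, 12, 30)))  # 0:12:30
print(completion_time_from_duration(750))                                      # 0:12:30
```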
3.4. Context of the experiment

Although analysis is performed by assessing the specified metrics with collected measures and "rolling up" these assessments to answer the larger questions that the evaluation addresses, the specified measures may not capture the entire story of the experiment scenario. We find it is also necessary to document the context of the experiment, i.e., the factors that influence the scenario, to provide a fuller context in which to place the evaluation analysis. For example, when assessing collaboration technology adoption, factors such as organizational culture and characteristics of the team under study may impact adoption. While we advocate use of a top-down approach such as the evaluation framework, we also recommend documentation of influencing elements that provide context for the experiment and the resulting analysis.
3.5. Benefits of use

We expect the following benefits from use of the evaluation framework: 1) up-front attention to the most important system goals and evaluation objectives focuses the evaluation during the evaluation design stage, 2) data collection efforts are focused since the required data elements are specified, 3) data interpretation will be more efficient and effectively tied to stated goals, since the measurement data are tied to system goals, and 4) assessments of effectiveness with respect to both product and process can be conducted in operational environments. These benefits can be realized by use of the evaluation framework for any software evaluation.
4. Using the framework for comparative evaluations

As stated earlier, there are situations where it is desirable to conduct comparative studies of systems and to determine how organizational processes are affected when technologies are inserted into a particular process. In this section, we describe how the evaluation framework can support comparative studies. We give two examples. First, we briefly discuss a hypothetical example to illustrate use of the framework in a comparative study between two systems providing messaging capabilities, one audio-based and the other text-based. The second example reports on a field study using the evaluation framework to evaluate the effect of the insertion of a collaborative software tool into a particular process.
4.1. Structuring a comparative study

To structure a comparative study, the framework can be employed to develop an evaluation template in which the conceptual elements of the framework, i.e., goals, objectives, metrics, and conceptual measures, are specified once for a series of like-structured experiments. For each individual experiment, the low-level, implementation-specific measures are then specified, as they will differ to varying degrees for each experiment. Thoughtful specification of the evaluation template elements is critical so that the template is applicable to each experiment in the envisioned series.

Use of evaluation templates requires more up-front work in the design of the entire evaluation. However, we feel its use provides several important benefits, including: 1) results from a series of carefully constructed, like-structured experiments yield comparative studies, 2) data requirements are well understood prior to the analysis phase of the evaluation; therefore, data collection can be planned in an orderly manner and the impact of not being able to collect any particular data element is well understood in advance of the analysis phase, and 3) design time for each individual evaluation is greatly reduced.
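To make this division of labor concrete, the arrangement might be represented roughly as below. The class names, field names, and example content are illustrative assumptions on our part; the framework prescribes the levels and their mapping, not any particular representation.

```python
# An evaluation template as a data structure: the conceptual levels (goals,
# objectives, metrics, conceptual measures) are fixed for the whole series of
# experiments, while implementation-specific measures (ISMs) are bound per
# experiment. All names and example content are illustrative only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Metric:
    name: str                                          # e.g., "Efficiency"
    conceptual_measures: List[str] = field(default_factory=list)

@dataclass
class EvaluationObjective:
    statement: str
    metrics: List[Metric] = field(default_factory=list)

@dataclass
class SystemGoal:
    statement: str
    objectives: List[EvaluationObjective] = field(default_factory=list)

# Evaluation template: conceptual elements shared by every experiment in the series.
template = [
    SystemGoal(
        statement="The messaging tool provides capabilities to meet the identified "
                  "functional requirements.",
        objectives=[
            EvaluationObjective(
                statement="Assess synchronous communication capabilities of the tool.",
                metrics=[Metric("Efficiency", ["composition time"])],
            )
        ],
    )
]

# Per-experiment bindings of ISMs to conceptual measures (collection details differ by tool).
isms_text_tool = {"composition time": "start of typing until 'send' function activated"}
isms_audio_tool = {"composition time": "start of record until 'send' function activated"}
```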
4.2. Illustrative example

To illustrate the framework concept and mappings, we provide a hypothetical situation that represents a class of evaluation situations that exists in the real world, yet is straightforward enough to provide an introduction to use of the evaluation framework.
The scenario is as follows: a work group wants to institute a messaging system in its work process. Not only are different types of messaging systems being considered, e.g., text-based instant messaging and computer-based audio tools, but various tools within a particular capability category have been suggested for trial, e.g., different instant messaging tools. The group's management has requested that an assessment be performed before granting permission to install and maintain new software on the company's computing resources. The group documented its work process prior to any new technology insertion, a set of perceived functional requirements, and the list of candidate tools.

To help the group determine which tool best suits its needs, a comparative study could be constructed using an evaluation template for a series of evaluations, where each evaluation in the series examines a particular tool from the candidate list. The group identified two main areas of assessment interest: tool capabilities with respect to the group's documented functional requirements, and tool impact on the work group's process. Each of these can be reflected in a system goal, with its attendant evaluation objectives, metrics, and measures.

Below is an excerpt from an example evaluation template for this scenario. The excerpt shows how a head-to-head comparative study for tools with different capabilities, and potentially very different underlying technologies, is mapped using the framework. The excerpt includes the conceptual elements of the framework described in this paper: goal statements, evaluation objectives, and metrics mapped to conceptual measures (CM). Additionally, sample implementation-specific measures (ISM) are shown with their associated conceptual measures. Remember that implementation-specific measures are not specified in an evaluation template, but are individually specified for each evaluation in the series. (They are given in the structure here to show the mapping to their associated conceptual measures. Additionally, some implementation-specific measures might apply to both text-based and audio-based messaging tools, as noted, although the specific collection details may vary and would be noted in a fully specified evaluation plan.) Note that all elements presume that members of the work group are the intended users and that their work process is the process in question, as applicable. In a complete evaluation template, more evaluation objectives, metrics, and measures would clearly be required to give a fuller understanding of how well any particular tool might perform for this work group.
Goal statement 1: The messaging tool provides capabilities to meet the identified functional requirements.

Evaluation objective 1-1: Assess synchronous communication capabilities of the tool.

Metric 1-1a: Efficiency
  CM: composition time
    ISM (text): start of typing until 'send' function activated
    ISM (audio): start of record until 'send' function activated
Metric 1-1b: Effectiveness
  CM: user ability to detect on-line presence of others in work group
    ISM (text & audio): number of detection-related messages
  CM: editing capability
    ISM (text): number of editing features
    ISM (audio): absence/presence of re-record option
  CM: expressive capability
    ISM (text): number of formatting features
    ISM (audio): comfort level of people to use voice inflection in recording messages
Metric 1-1c: User satisfaction with interface
  CM: configurable message notification controls
    ISM (text & audio): user rating
  CM: frustration levels
    ISM (text & audio): user rating

Evaluation objective 1-2: Assess asynchronous communication capabilities of the tool.

(The metrics and measures for evaluation objective 1-2 will be quite similar to those for evaluation objective 1-1, and are not given here in the interest of brevity.)

Goal statement 2: The messaging tool will positively impact the work group's process.

Evaluation objective 2-1: Assess changes in communication response times.

Metric 2-1a: Efficiency
  CM: response times
    ISM (text & audio): time to listen, time to respond
Metric 2-1b: Effectiveness
  CM: usage statistics
    ISM (text & audio): number of round-trip communications
  CM: number of misunderstandings
    ISM (text & audio): count of clarification-only messages

Two system goal statements are given in the excerpt. The first is concerned with the usability of a candidate tool for the work group with regard to the group's identified functional requirements. In this evaluation template each functional requirement is addressed at the evaluation objective level; this reflects a choice made by the evaluation designer on how to partition evaluation concerns. Two sample evaluation objectives are given, covering synchronous and asynchronous communication capabilities. The second system goal is concerned with assessing the impact of tool use on the work process for this group. As stated, this system goal is quite broad. This is intentional. Of course, the ultimate goal of any system is to have a positive impact; however, in practice, many do not, or their success is mixed at best. It is the evaluation designer's responsibility to assess the impact, positive or negative, by determining the critical factors for success in each scenario and selecting evaluation objectives, metrics, and measures to assess those elements.
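Because every evaluation in the series shares the conceptual levels above, assessments made for different tools can be compared once they are rolled up. The sketch below illustrates this under assumptions of our own: the metric scores, weights, and weighted-average roll-up are hypothetical, and the framework permits a qualitative, human roll-up just as well.

```python
# "Rolling up" metric-level assessments through the shared conceptual levels so that
# two tools evaluated with the same template can be compared at the objective level.
# Scores, weights, and the weighted-average scheme are hypothetical.

def roll_up(scores: dict, weights: dict) -> float:
    """Combine lower-level assessments (0..1) into a higher-level assessment."""
    return sum(weights[k] * scores[k] for k in weights) / sum(weights.values())

weights = {"1-1a Efficiency": 0.4, "1-1b Effectiveness": 0.4, "1-1c Satisfaction": 0.2}
text_tool = {"1-1a Efficiency": 0.6, "1-1b Effectiveness": 0.8, "1-1c Satisfaction": 0.7}
audio_tool = {"1-1a Efficiency": 0.5, "1-1b Effectiveness": 0.6, "1-1c Satisfaction": 0.8}

for label, scores in (("text", text_tool), ("audio", audio_tool)):
    print(f"Evaluation objective 1-1, {label} messaging tool: {roll_up(scores, weights):.2f}")
```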
5. Using the framework in the real world for a comparative study

In this section, we report on use of the evaluation framework in a field study where use of a collaboration tool and its impact on process and product are of interest. Presently, two evaluations are envisioned to form a comparative study of the effect of the collaboration tool on collaboration between the group studied and others outside their group. Assessment requirements vary among the stakeholders. Upper levels of management are interested in assessing whether ad-hoc collaborations increase among distributed workers. Mid-level management is interested in assessing how effective the tools are in practice and the effect of collaboration tools on current processes. We designed an evaluation template to meet these varied assessment requirements. Currently, the first evaluation has been completed and data collection for the second evaluation in the series is in progress. We describe the environment in which the study was conducted and report on how well the framework and evaluation template performed in this field study.
5.1. Study environment

The study is situated in an environment where three information analysts monitor an information source to produce a daily summary focused on a particular topic; these analysts are the study group members. The daily summary, the product, is produced with the intent of reducing the amount of information others in the organization must read from the source to find information relevant to the particular topic.
Members of the study group are physically collocated in close proximity; face-to-face interactions among this group are the norm. Collaborations with others outside the work group to produce the daily summary were not routine prior to the insertion of the collaborative application.

Groove [6] was selected to provide collaboration capabilities for the study group members. (Any commercial product identified in this document is identified only for the purpose of describing a software environment; this identification does not imply any recommendation or endorsement by NIST, nor that the product is necessarily the best available.) Groove is a general-purpose collaboration tool, designed to be used by a wide variety of organizations. It is a peer-to-peer application that provides its users with synchronous as well as asynchronous collaboration functionality. Groove uses the metaphor of shared spaces to create collaborative environments. Templates are used to create a shared space, and functionalities are then incorporated. Functions provided by Groove include file storage, discussion spaces, calendar facilities, sketch pads, and so on. In addition, Groove provides text chat and audio chat, as well as an instant messaging tool. Groove was not used by our study group prior to the first evaluation period.

During the first evaluation period, only one of the three analysts in our study group had access to the collaboration tool, Groove. This information analyst had been trained on use of the tool and had contacts with other information analysts outside of the study group who also had Groove on their desktop machines. Additionally, some of these information analysts outside of our study group had access to other information sources that the analysts in our study group did not. When information gaps were noticed during production of the daily summary, the analyst with access to Groove would send a message to analysts outside our study group in an attempt to solicit additional information for the summary.

In the second evaluation period, all three analysts in the study group have access to Groove from their respective desktop machines. Additionally, they now also have access to the information source they considered the most valuable of those they did not have access to during the first evaluation period.
5.2. Evaluation template design

When designing an evaluation framework for a particular evaluation, or an evaluation template for a series of evaluations, it is critical to identify the primary issues and related system functionalities the evaluation(s) must address. For this study, we identified several issues of primary importance that each evaluation in the series would need to address. They are:
• Did ad-hoc collaborations increase? Why? And, by extension, would we expect any observed increase in ad-hoc collaborations to be sustainable?
• Were the tools effective in supporting collaboration in the work process?
• How did the work process change as a result of the inserted technology?
• How did the daily summary change as a result of the inserted technology?

Note that these questions deal with both process and product, i.e., the daily summary. Both can be assessed using the evaluation framework. Listed below is an excerpt from the evaluation template we used in this field study.

Goal statement 1: The inserted technology will effectively support ad-hoc collaborations for the study group.

Evaluation objective 1-1: Assess collaborations between study group analysts and other analysts.

Metric 1-1a: Did collaborations change?
  CM: number & type
Metric 1-1b: What was the user satisfaction with the tool with respect to support of desired interactions?
  CM: user rating
  CM: tool/feature use
  CM: use of other tools providing similar functionality
  CM: number and types of work-arounds
  CM: adoption (indirect)
  CM: recommendation (indirect)
Metric 1-1c: Does the tool support ad-hoc, light-weight interactions?
  CM: user rating
  CM: evaluator assessment

Evaluation objective 1-2: Assess the impact on work process for the study group.

Metric 1-2a: What was the impact from tool use on the process?
  CM: changes in efficiency
  CM: number and types of task changes
  CM: changes in interactions
Metric 1-2b: What was the efficiency of the process?
  CM: time
  CM: timeliness
  CM: productivity
  CM: changes in turn-around time
  CM: distribution of effort
  CM: adoption (indirect)

Evaluation objective 1-3: Assess the impact of collaborations on the daily summaries (product).
Metric 1-3a: Did the information received from collaborations change the quality of the daily summary?
  CM: confidence, vetting and confluence of data points
  CM: usefulness
  CM: coherence
  CM: adoption of process (indirect)

In this evaluation template only one system functionality goal is identified, i.e., effective support for ad-hoc collaborations. Within that system functionality context, assessments of the effect on process and product are of interest, as reflected in the evaluation objectives. Most of the metrics given will be used comparatively between the two evaluations envisioned for this series. However, Metric 1-1c is an example of evaluating appropriateness. It is specified in this template because tools supporting ad-hoc collaborations should be fairly light-weight and easy to initiate and use, e.g., [13, 19]. Therefore, assessment of these and related tool qualities is deemed important when considering whether the prospect of increased collaboration might be sustainable beyond the evaluation period.

Due to the nature of the analysts' work, their time pressures, and our expectation of somewhat limited access to the analysts, both in the study group and beyond, we knew some of the specified measures would be difficult or impossible to collect. Measures that were expected to be difficult to assess were left in the template in case the opportunity arose for their collection. To address the possibility of not being able to collect some of the direct measures, we tried to also specify indirect measures. For example, quantified efficiencies and distribution of effort for analysts outside the study group were not obtainable. In these instances, we were forced to rely on indirect measures of adoption to inform our assessment.

In addition to specification of the evaluation template, we also planned to document any factors of significant influence on each experiment. This information is contained in the context of the experiment, which is used to help place the experiment analysis in an appropriate context, as described in Section 3.4. For example, in this field study, we planned to document some of the more pertinent aspects of the organizational culture and characteristics of the team members and their interactions with others, e.g., were they early technology adopters in their organization? Were they motivated artificially to use the technology? Were there other barriers to adoption or sustained use of the technology?
5.3. How the framework performed
As stated earlier, two evaluations were originally envisioned for this particular field study. The first evaluation has been completed and collection is underway for the second. Although analysis for the second evaluation has not been completed, our expectation is that we will have sufficient data to complete the analysis and draw useful comparisons between the two evaluations.

Analysis of the first experiment period revealed that collaborations did indeed increase. The analyst from our study group with access to Groove created a space for the daily summary topic, invited domain experts to the space, posted informal requests for additional information to the space, and monitored the space for responses to those requests. Responses were characterized as short, understandable, useful, and generally timely. Some of these responses were incorporated into the daily summary. As word spread, other domain experts were invited to the space and contributed information. The two analysts in our study group who did not have access to Groove asked for accounts. Within one month of the completion of our evaluation period, seven other groups within the organization were given Groove access.

Answering the experiment questions:

• Did ad-hoc collaborations increase? Why? Yes, ad-hoc collaborations increased. However, it is not entirely clear why. Some analysts in our study group were using the Groove mechanism to indirectly gain access to additional information sources. The question that remains is whether, once the study group has direct access to this additional data source (as is the case during the second evaluation period), collaboration will decline, or whether there is other perceived value that provides an incentive for continued use.

• Were the tools, i.e., Groove, effective in supporting collaboration in the work process? Our assessment is that, yes, the general collaboration tools within Groove were adapted to the analysts' work process. Additionally, adoption was increasing throughout the latter half of the experimental period. In general, user satisfaction with Groove was good. The user interface for the features the analysts used was deemed easy to use and lightweight enough for informal use.

• How did the work process change as a result of the inserted technology? Additional tasks were added to use the collaboration tool. As noted earlier, one analyst from the study group created a Groove space for the daily summary topic, invited domain experts to the space, posted informal requests for additional information to the space, and monitored the space for responses to those requests.
A response was then shared (face-to-face) with the other analysts in the study group to determine if it should be incorporated in the daily summary. The extra tasks obviously required additional time and effort, but the perceived value outweighed those costs from the perspectives of both the analysts (users) and their management.

• How did the daily summary change as a result of the inserted technology? We were unable to gather much direct evidence of change to the daily summaries as a result of the collaborations. However, we note management's adoption of the process, with its recognized costs, which carries the presumption of some (unspecified) increase in the quality of the daily summaries.

We were able to use the framework to meaningfully assess the questions the evaluation was expected to address. Once analysis of the second experimental period is complete, comparisons between the two evaluations can be drawn. Additionally, although two studies were originally envisioned for this series, it is quite possible that additional evaluations using the same evaluation template will be undertaken to assess the emerging collaborative process for this group and organization.
6. Conclusion

We have developed a goal-oriented evaluation framework that includes system goals, evaluation objectives, metrics, and measures. While the lower-level measures change based on the specific software being evaluated, the abstractions at the higher levels allow us to compare different software tools for a particular evaluation scenario. This framework is particularly useful for comparing evaluation results where a number of conditions have changed, e.g., in field studies, and for performing assessments of effectiveness with respect to process and product in operational environments. As we use the framework for more evaluations, we will be assessing how much the context of any given experiment can vary and still produce comparable evaluation results. We are currently using this framework to perform comparative evaluations ranging from different software tools supporting the same tasks to the same software tools used in similar processes in different organizations. Additionally, we are gaining some experience with how to partition evaluation concerns using the framework, but this is evolving slowly. At the present time, no generalizations have emerged; each evaluation scenario has had its own constraints and prominent issues that have guided how its evaluation has been partitioned. Therefore, we do not presently provide a process or prescription for specifying the framework elements beyond the general guideline given earlier: identify the critical factors for success and the questions the evaluation must answer, and then specify framework elements that support the assessment of those concerns.
7. Acknowledgments

This work was funded in part by a classified funding source supporting the Research and Development Experimental Collaborative (RDEC) program.

8. References

[1] Basili, V. R. and H. Rombach, "Tailoring the software process to project goals and environments", Proceedings of the 9th International Conference on Software Engineering, 1987, pp. 345-357.

[2] Basili, V. R. and D. M. Weiss, "A Methodology for Collecting Valid Software Engineering Data", IEEE Transactions on Software Engineering, Vol. SE-10, No. 3, 1984, pp. 728-738.

[3] Boehm, B. W., J. R. Brown, and M. Lipow, "Quantitative Evaluation of Software Quality", Proceedings of the 2nd International Conference on Software Engineering, 1976, pp. 592-605.

[4] Damianos, L., L. Hirschman, R. Kozierok, J. Kurtz, A. Greenberg, R. Holgado, K. Walls, S. Laskowski, and J. Scholtz, "Evaluation for Collaborative Systems", Computing Surveys, special edition on the DARPA Intelligent Collaboration and Visualization program, No. 2es (electronic), 1999.

[5] Emam, K. E., N. Moukheiber, and N. H. Madhavji, "An Empirical Evaluation of the G/Q/M Method", Proceedings of the 1993 Conference of the Centre for Advanced Studies on Collaborative Research: Software Engineering, Vol. 1, Toronto, Ontario, 1993, pp. 265-289.

[6] Groove software, http://www.groove.net/.

[7] Grudin, J., "Why CSCW Applications Fail: Problems in the Design and Evaluation of Organizational Interfaces", Proceedings of CSCW, 1988, pp. 85-93.

[8] IEEE Standard for a Software Quality Metrics Methodology, IEEE Std 1061-1998, 1998, pp. 2-3.

[9] Kogure, M. and Y. Akao, "Quality Function Deployment and CWQC in Japan", Quality Progress, October 1983, pp. 25-29.

[10] Mack, G., K. Longeran, J. Scholtz, M. P. Steves, and C. Hale, "A Framework for Metrics in Large, Complex Systems", Proceedings of the IEEE Aerospace Conference, Big Sky, Montana, 2004.

[11] McCall, J. A., P. K. Richards, and G. F. Walters, "Factors in Software Quality", RADC TR-77-369, 1977.

[12] Murine, G. E., "Applying Software Quality Metrics in the Requirements Analysis Phase of a Distributive System", Proceedings of the Minnowbrook Workshop, Blue Mountain Lake, New York, 1980.

[13] Nardi, B. A., S. Whittaker, and E. Bradner, "Interaction and Outeraction: Instant Messaging in Action", Proceedings of CSCW, 2000, pp. 79-88.

[14] MEL/ITL Task Group on Metrology for Information Technology (IT), "Metrology for Information Technology", NISTIR 6025, National Institute of Standards and Technology, Gaithersburg, MD, 1997, p. 8.

[15] Rombach, H. D., "Practical Benefits of Goal-Oriented Measurement", in Software Reliability and Metrics, eds. N. Fenton and B. Littlewood, Elsevier Science Publishing Co., London, 1991, pp. 217-235.

[16] Scholtz, J., "Design of a One to Many Collaborative Product", Proceedings of Designing Interactive Systems, ACM, 1997, pp. 345-348.

[17] Scholtz, J. and M. P. Steves, "A Framework for Real World Software Systems Evaluations", Proceedings of CSCW, 2004, to appear.

[18] Steves, M. P. and J. Scholtz, "Metrics Mapping Concepts", NISTIR 7061, National Institute of Standards and Technology, Gaithersburg, MD, 2003.

[19] Whittaker, S., G. Swanson, J. Kucan, and C. Sidner, "TeleNotes: Managing Lightweight Interactions in the Desktop", ACM Transactions on Computer-Human Interaction, 1997, pp. 137-168.