A Theory of Benchmarking with Applications to Software Reverse Engineering
by
Susan Elliott Sim
A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Graduate Department of Computer Science
University of Toronto

© Copyright by Susan Elliott Sim, 2003
Abstract

Benchmarking has been used to compare the performance of a variety of technologies, including computer systems, information retrieval systems, and database management systems. In these and other research areas, benchmarking has caused the discipline to make great strides. Until now, research disciplines have enjoyed these benefits without a good understanding of how they were achieved.

In this dissertation, I present a theory of benchmarking to account for these effects. This theory was developed by examining case histories of successful benchmarks in computer science, my own experience with community-wide tool evaluations in software reverse engineering, and the literature from philosophy of science. According to the theory, it is the tight relationship between a benchmark and the scientific paradigm of a discipline that is responsible for the leap forward. A scientific paradigm, as described by Thomas S. Kuhn, is the dominant view of a science, consisting of explicit technical facts and implicit rules of conduct. A benchmark operationalises a scientific paradigm; it takes an abstract concept and turns it into a guide for action. In effect, a benchmark is a statement of the discipline's research goals, and it emerges through a synergistic process in which technical knowledge and social consensus proceed in tandem.

The theory of benchmarking is validated empirically against extant benchmarks and analytically against a hierarchical set of criteria. An important implication of the theory is that benchmarking can be used to cause a scientific discipline to make advances. To this end, a process model has been developed as a guide for others wishing to undertake a benchmarking project within their research communities. Also provided are a Benchmarking Readiness Assessment and criteria for evaluating the benchmarking process. The application of the theory and the process model is illustrated using two benchmarks that I have developed with the software reverse engineering community: the xfig benchmark for program comprehension tools and the C++ Extractor Test Suite (CppETS), a benchmark for comparing fact extractors for the C++ programming language.
Acknowledgements

Once upon a time, John Henshaw said to me, "Anything that is worth doing can't be done alone." I was frequently reminded of these words as I carried out the research for this dissertation. While my name is on the cover, this dissertation could not have happened without the help of many others.

My deepest thanks go to every member of the software reverse engineering community who participated in the various discussions and workshops that I have organised. This gratitude includes the participants who evaluated their own tools, as well as the attendees who came to listen and learn. Even though I was a lowly graduate student, they took part in these events. This research would not have been possible without them. Thank you, Holger Kienle, Mike Godfrey, Elliot Chikofsky, Rainer Koschke, Andreas Winter, Volker Riediger, Kostas Kontogiannis, Françoise Balmas, Homy Dayani-Fard, Andrew Walenstein, Hausi Müller, Tarja Systä, Sergei Marchenko, Árpád Beszédes, Rudolf Ferenc, Tibor Gyimóthy, Andrew Malton, Andrew Trevors, Johannes Martin, Ken Wong, Bruce Winter, Robert Mays, Eric Lee, John Tran, Tom Parry, Jingwei Wu, Jeff Michaud, Arthur Tateishi, Piotr Kaminski, Jeremy Broughton, Ryan Chase, Sonia Vohra, Paul Holden, Howard Johnson, Bruno Laguë, Charles Leduc, Joe Wigglesworth, Jeff Turnham, Dave McKnight, and Cindy Nie.

Peggy Storey and Tim Lethbridge have been crucial to the success of the benchmarks and workshops. I am extremely fortunate to have them as colleagues. Both of them have been generous with their time and ideas. Peggy and I have collaborated on many projects, including our first, and probably most successful, benchmarking workshop, the xfig structured demonstration. New research and workshop ideas are still arising from that project. Tim always understood what I was trying to do and could be counted on to participate in my evaluations.

I am immensely grateful to my co-supervisors, Ric Holt and Steve Easterbrook. Ric had the patience to stick with me for the seven years of my Master's and Ph.D., which is no mean feat. He deserves the credit for the interesting parts of this dissertation because he insisted that the theory should explain the camaraderie that he felt when participating in a benchmarking workshop. While I don't think the research ended up where he expected, Ric knew a good thing when he smelled it. Steve was invaluable in the crafting of the theory. We spent countless hours in his office getting the details to stay nailed down. He had an amazing grasp of the literature from different disciplines and was able to provide great suggestions at just the right times. It sometimes took months for the significance of an idea to become fully apparent.

Many thanks also to Renée Miller and Suzanne Stevenson, who were on my supervisory committee. Renée has read several hundred pages of text from me and has always managed to ask excellent, insightful questions. Suzanne was a meticulous reader, providing many detailed comments that were needed to get things right. She was the one who suggested that the title should be a theory of benchmarking. I would also like to thank Walter Tichy for serving as my external examiner. He fulfilled his duties conscientiously and was very supportive of the work. It was gratifying to have a fresh perspective on my work after labouring over the details for so long.

In addition to the aforementioned people who were directly involved with the thesis, I need to acknowledge people in the rest of my life. Thanks to my friends who provided me with support and needed distractions along the way, in the form of music, climbing, yoga, and scuba diving. It was a pleasure spending time with you and sharing the bond of friendship in my later formative years. Special thanks to Matt Dreger and Janice Cheung for help with proofreading.

Thank you to my parents, Charles and Anne Sim, for everything. They decided to move the family to Canada, so the children could have a better life, after I failed my kindergarten entrance examinations in Hong Kong.

Last but not least, I thank my husband, Jeffrey Sim Elliott, who has been my best friend and help-meet through all of this. He has provided encouragement, proofreading, technical support, and clean socks. When I suffered from writer's block, he frequently suggested "the" as the next word, which incidentally is the first half of "theory".

Toronto, Canada
October 3, 2003
Table of Contents

Chapter 1. Introduction ..... 1
  1.1 Research Contributions ..... 2
  1.2 Overview of the Theory ..... 3
    1.2.1 Scope of Theory ..... 4
    1.2.2 Benefits of Benchmarking ..... 4
    1.2.3 Dangers of Benchmarking ..... 5
  1.3 Validating the Theory ..... 6
  1.4 Applications of the Theory ..... 7
  1.5 Organisation of Dissertation ..... 8

Chapter 2. Benchmarking of Software ..... 9
  2.1 Validating Research Contributions ..... 10
  2.2 Benchmarking as an Empirical Method ..... 13
  2.3 Case History: TPC-A™ for Database Management Systems ..... 16
  2.4 Case History: SPEC CPU 2000 for Computer Systems ..... 18
  2.5 Case History: TREC Ad Hoc Retrieval Task ..... 20
  2.6 Analysis of Case Histories ..... 22
  2.7 Plan of Attack ..... 24
  2.8 Summary ..... 24

Chapter 3. Theory of Benchmarking ..... 27
  3.1 Definition of Benchmark ..... 27
    3.1.1 Motivating Comparison ..... 28
    3.1.2 Task Sample ..... 29
    3.1.3 Performance Measures ..... 29
    3.1.4 Proto-benchmarks ..... 30
  3.2 Scientific Revolutions ..... 30
    3.2.1 Scientific vs. Engineering Disciplines ..... 33
  3.3 Benchmarks as Paradigmatic: Processes for Progress ..... 35
    3.3.1 Consensus ..... 36
    3.3.2 Impact of Benchmarking ..... 37
    3.3.3 Role of Benchmarking in Stages of Revolution ..... 38
      3.3.3.1 Pre-Science ..... 39
      3.3.3.2 Normal Science ..... 40
      3.3.3.3 Crisis ..... 41
      3.3.3.4 Revolution ..... 41
      3.3.3.5 Retirement ..... 41
  3.4 Mechanisms for Progress ..... 42
  3.5 An Open Model of Science ..... 42
  3.6 Cross Fertilisation of Ideas ..... 44
    3.6.1 From Other Researchers ..... 44
    3.6.2 From Other Roles ..... 45
  3.7 Flexibility of Application ..... 46
    3.7.1 Advancing a Single Research Effort ..... 47
    3.7.2 Promoting Research Comparison and Understanding ..... 48
    3.7.3 Setting a Baseline for Research ..... 48
    3.7.4 Providing Evidence to Support Technology Transfer ..... 49
  3.8 Recapitulation of Theory ..... 49
    3.8.1 Assumptions ..... 49
    3.8.2 Concepts and Their Definitions ..... 50
    3.8.3 Relationships ..... 51
    3.8.4 Hypothesis ..... 52
  3.9 Summary ..... 52

Chapter 4. Validating the Theory of Benchmarking ..... 55
  4.1 Validation Using Contributing Benchmarks ..... 55
    4.1.1 Components of Contributing Benchmarks ..... 56
      4.1.1.1 TPC-A™ Components ..... 56
      4.1.1.2 SPEC CPU2000 ..... 57
      4.1.1.3 TREC Ad Hoc Retrieval Task ..... 57
      4.1.1.4 Discussion ..... 58
    4.1.2 Impact of the Contributing Benchmarks ..... 59
      4.1.2.1 TPC-A™ ..... 59
      4.1.2.2 SPEC CPU2000 ..... 60
      4.1.2.3 TREC Ad Hoc Retrieval Task ..... 62
      4.1.2.4 Discussion ..... 63
  4.2 Validation Using Novel Benchmarks ..... 64
    4.2.1 Optical Flow ..... 65
    4.2.2 KDD Cup ..... 66
  4.3 Analytic Validation of the Theory's Structure ..... 68
    4.3.1 Provides a Causal Explanation ..... 68
    4.3.2 Structural Soundness ..... 69
  4.4 Validation by Comparison to A Rival Theory ..... 69
    4.4.1 A Rival Theory ..... 70
    4.4.2 Postdiction: Which theory better fits the empirical data? ..... 70
    4.4.3 Generality of Explanans: Which theory can account for more phenomena? ..... 71
    4.4.4 Hypothetical Yield: Which theory produces more hypotheses? ..... 71
    4.4.5 Progressive Research Program: Which theory is more progressive? ..... 72
    4.4.6 Breadth of Policy Implications: Which theory provides more guidance for action? ..... 73
    4.4.7 Parsimony: Which theory is simpler? ..... 73
  4.5 Summary ..... 73

Chapter 5. Applying the Theory ..... 75
  5.1 Process Model for Benchmarking ..... 75
    5.1.1 Prerequisites ..... 76
      5.1.1.1 Minimum Level of Maturity ..... 77
      5.1.1.2 Tradition of Comparison ..... 79
      5.1.1.3 Ethos of Collaboration ..... 80
    5.1.2 Factors for Success ..... 80
  5.2 Readiness for Benchmarking ..... 81
    5.2.1 Readiness Assessment ..... 81
    5.2.2 Example: Requirements Engineering ..... 82
    5.2.3 Counterexample: Software Evolution ..... 83
  5.3 Checklists for Benchmarks ..... 84
    5.3.1 Terminology ..... 84
    5.3.2 Caveats ..... 85
    5.3.3 Technical Aspects of Benchmark Design ..... 86
    5.3.4 Checklists for Technical Criteria ..... 87
    5.3.5 Sociological Aspects of Benchmarking Process ..... 91
    5.3.6 Checklist for Sociological Criteria ..... 92
  5.4 Considerations for Software Engineering ..... 95
    5.4.1 Motivating Comparison ..... 95
    5.4.2 Task Sample ..... 96
    5.4.3 Performance Measures ..... 96
  5.5 Summary ..... 97

Chapter 6. Tool Research in Reverse Engineering ..... 99
  6.1 Tools for Reverse Engineering ..... 99
    6.1.1 Software Work Products ..... 101
    6.1.2 Extraction Tools ..... 101
    6.1.3 Analysis Tools ..... 102
    6.1.4 Presentation Tools ..... 103
    6.1.5 Tool Integration ..... 104
    6.1.6 Discussion ..... 105
  6.2 Tool Comparisons in Reverse Engineering ..... 105
    6.2.1 Comparing Call Graph Extractors: Murphy et al. 1998 ..... 106
    6.2.2 Comparing Architecture Recovery Tools: Bellay and Gall, 1998 ..... 107
    6.2.3 Comparing Architecture Recovery Tools: Armstrong and Trudeau, 1998 ..... 109
    6.2.4 Discussion ..... 110
  6.3 Benchmarking in Reverse Engineering ..... 110
    6.3.1 Prerequisites ..... 111
    6.3.2 Process ..... 112
  6.4 Summary ..... 112

Chapter 7. A Benchmark for Program Comprehension Tools ..... 113
  7.1 Description of xfig Benchmark ..... 113
    7.1.1 Motivating Comparison ..... 114
    7.1.2 Task Sample ..... 114
    7.1.3 Performance Measures ..... 115
  7.2 xfig Benchmark Development Process ..... 116
    7.2.1 CASCON99 Workshop ..... 116
    7.2.2 WCRE2000 Workshop ..... 118
  7.3 Impact of xfig Benchmark ..... 119
  7.4 Evaluation of xfig Benchmark ..... 121
    7.4.1 Evaluation Design Checklists ..... 121
    7.4.2 Process Checklists ..... 124
    7.4.3 Discussion ..... 125
  7.5 Summary ..... 127

Chapter 8. A Benchmark for C++ Fact Extractors ..... 129
  8.1 Description of CppETS ..... 130
    8.1.1 Motivating Comparison ..... 130
    8.1.2 Task Sample ..... 131
      8.1.2.1 Accuracy Category ..... 132
      8.1.2.2 Robustness Category ..... 133
      8.1.2.3 System Category ..... 135
    8.1.3 Performance Measures ..... 135
  8.2 CppETS Development Process ..... 136
  8.3 Impact of CppETS ..... 137
  8.4 Evaluation of CppETS ..... 139
    8.4.1 Evaluation Design Checklists ..... 140
    8.4.2 Process Checklists ..... 142
    8.4.3 Discussion ..... 143
  8.5 Summary ..... 145

Chapter 9. Conclusion ..... 147
  9.1 Summary ..... 147
  9.2 Future Work ..... 149
    9.2.1 Expansion of Theory to Other Stages ..... 149
    9.2.2 Validation of the Theory ..... 150
    9.2.3 Testing of Assessments and Questionnaires ..... 151
  9.3 Applications in Software Engineering ..... 152

Appendix A: Sample Definitions for the Term Benchmark ..... 155
Appendix B: Benchmarking Readiness Assessment ..... 156
Appendix C: Benchmark Evaluation Checklists ..... 160
Appendix D: Tool Developer Handbook ..... 166
Appendix E: Observer Handbook ..... 172
References ..... 176
List of Tables

Table 2-1: Comparison of Benchmarking to Experiments and Case Studies ..... 14
Table 3-1: Changes Over Stages of Revolution ..... 39
Table 5-1: Redwine and Riddle's Software Technology Maturation Phases ..... 78
Table 5-2: Interpretation of Readiness Assessment Scores ..... 82
Table 5-3: Checklist and Technical Criteria Traceability Matrix for Overall Questions ..... 88
Table 5-4: Checklist and Technical Criteria Traceability Matrix for Motivating Comparison Questions ..... 88
Table 5-5: Checklist and Technical Criteria Traceability Matrix for Task Sample Questions ..... 89
Table 5-6: Checklist and Technical Criteria Traceability Matrix for Performance Measures Questions ..... 90
Table 5-7: Checklist and Sociological Criteria Traceability Matrix for Development Questions ..... 93
Table 5-8: Checklist and Sociological Criteria Traceability Matrix for Deployment Questions ..... 94
Table 6-1: Summary of Research Design for Murphy, Notkin, Griswold, and Lan ..... 106
Table 6-2: Summary of Research Design for Bellay and Gall ..... 108
Table 6-3: Summary of Research Design for Armstrong and Trudeau ..... 109

List of Figures

Figure 3-1: Stages of Scientific Revolutions ..... 31
Figure 3-2: Truth Table for Implication ..... 50
Figure 4-1: Precision and Recall Measures ..... 58
Figure 5-1: Process Model for Benchmarking ..... 76
Figure 5-2: Processing of Benchmark Data ..... 84
Figure 6-1: Stages of Reverse Engineering Process ..... 100
Figure 8-1: Conceptualisation of Design Space for Extractors ..... 131
Figure 8-2: Test Buckets in Accuracy Category ..... 133
Figure 8-3: Test Buckets in Robustness Category ..... 134
Chapter 1. Introduction

Wednesday, October 10, 1999. 5:20PM

We had just wrapped up a two-day workshop to evaluate program comprehension tools at CASCON99 (Centre for Advanced Studies CONference). The room was still buzzing with excitement over the connections that had been made between research groups, the intricate technical discussions about the tools, and the shared sense of achievement. There was laughter and camaraderie in the air, and the researchers felt a renewed sense of purpose about their work.

The workshop was titled "A Collective Demonstration of Program Comprehension Tools." Prof. Margaret-Anne Storey (University of Victoria) and I had invited six development teams from research and industry to bring their program comprehension tools to CASCON to participate in a real-time, co-located evaluation [136]. On the first day at 9:30AM, we gave each of the teams a package of materials consisting of the source code for xfig 3.2.1 (an Open Source drawing program for UNIX) and a set of tasks to be performed on the source code. We also assigned industrial observers to some of the teams to help evaluate the tools.

The teams struggled throughout the day, as we did, with various technical glitches, as well as the assigned tasks. Initially, the atmosphere was somewhat competitive, but when the teams saw that they were all having problems, they started helping each other. The area in the exhibit hall where the evaluation was taking place was buzzing with activity, with some teams entering into very animated debates about how to solve the problems. Occasionally, a loud cheer would erupt when a team solved a particularly difficult problem. We forced the teams to stop at 5PM, so they could prepare their final submissions to us and their presentations for the second day of the workshop.

During the second day, the development teams, the industrial observers, and we, the evaluators, all made presentations about the various tools and the evaluation. All the presentations about the capabilities and shortcomings of the tools were frank and honest, but polite. The interactions on the previous day had already given the participants a good idea of what the various tools could do, so there was little embarrassment. The experience made the developers look at their tools and each other in a new way. The tasks in the evaluation and the feedback came from a point of view that they did not normally take. The workshop gave them a rare opportunity to learn from each other at a very detailed, technical level.

By the end of the day, all the participants were glad that they had taken part in the evaluation, regardless of how their tools performed. They had learned a lot and got to know their colleagues better. Clearly, the evaluation and the workshop were a success. One feeling that pervaded the workshop was that the achievement was bigger than what any one of us could have achieved individually, and it was more than an empirical evaluation. What had we done exactly? Why was it successful? And could it be done again?
The structured demonstration, I eventually realised, was a kind of benchmark. A theory of benchmarking is my answer to the above questions. A theory provides a causal explanation of a phenomenon by describing a set of concepts and the expected relationships between them. This theory provides an explanation for why benchmarking causes a scientific discipline to make advances. The theory was informed by other successful benchmarks in computer science [14, 58, 64, 92, 162], as well as ones that I had developed [134, 136]. Literature from philosophy of science and sociology of science was also used, because considering benchmarks solely as an evaluation was not sufficient to explain their success [68, 81, 95, 142]. In this introductory chapter, I give a synopsis of the theory and lay the foundation for a detailed presentation of the theory and its applications in later chapters.

1.1 Research Contributions

There are two major research contributions in this dissertation. The first is the theory of benchmarking. The theory provides insight into a research method that is widely used, but hitherto not well understood. With this theory, researchers can use benchmarks more effectively to realise their benefits while avoiding pitfalls. This theory has been published at a major conference, the International Conference on Software Engineering (ICSE) 2003 [132].

The second contribution of this dissertation is the two benchmarks that were developed for software reverse engineering tools. A benchmark for program comprehension tools and one for C++ fact extractors provided some much-needed public evaluations in the area and brought advances to the discipline. The materials for the benchmarks are available on the web, and papers on evaluations performed with them have already been published in peer-reviewed conferences [134, 136].

The theory and the case studies are particularly relevant for software engineering because it is an area in need of both increased maturity and greater validation of research results. Benchmarking is an empirical method suitable for use with software technology. The process of developing and deploying a benchmark engages the same processes that are responsible for scientific progress. Consequently, undertaking a benchmarking effort causes a discipline to become more mature. The two case studies provide examples of how benchmarking can be used with software tools. These two contributions will be outlined in the next two sections.
1.2 Overview of the Theory

A theory is a set of statements that puts forth a causal explanation of a set of phenomena and that is i) logically complete, ii) internally consistent, and iii) falsifiable [98]. A theory needs to provide a causal explanation, that is, to define a set of concepts and the expected relationships between them. The theory of benchmarking seeks to explain why benchmarks have had a significant positive impact on research communities in computer science.

Briefly, the theory states that a benchmark operationalises the paradigm for a scientific discipline. Scientific paradigms were first described by Thomas S. Kuhn in The Structure of Scientific Revolutions [81]. A paradigm is the world view and knowledge that defines a discipline, and it includes both an explicit body of technical knowledge and tacit knowledge about conduct and values. A benchmark operationalises a paradigm by taking an abstract concept and turning it into a concrete guide for action. It acts as a role model for a valid scientific problem and a scientifically acceptable solution, thereby communicating both the knowledge and the values of a discipline.

A benchmark consists of three components: a Motivating Comparison, a Task Sample, and Performance Measures. Like paradigms, benchmarks are developed through the accumulation of technical results and community consensus. Successful benchmarks are constructed through a collaborative process that involves the research community. The benchmark is also supported by scientific knowledge where it exists and further laboratory work where it is needed.
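To make the three components concrete, the minimal sketch below shows one way they could be written down as a simple record. The Python representation and the field values are only an illustration, loosely based on the xfig benchmark of Chapter 7; they are not a formal definition used later in the dissertation.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Benchmark:
        # Why the community wants to compare results at all.
        motivating_comparison: str
        # Representative tasks, drawn from the work users actually do.
        task_sample: List[str]
        # How performance on the task sample is judged.
        performance_measures: List[str]

    # Illustrative instantiation, loosely based on the xfig benchmark (Chapter 7).
    xfig = Benchmark(
        motivating_comparison="Which program comprehension tool best supports "
                              "maintainers working with unfamiliar source code?",
        task_sample=[
            "Documentation tasks on xfig 3.2.1",
            "Maintenance tasks on xfig 3.2.1",
        ],
        performance_measures=[
            "Quality of the submitted task solutions",
            "Observed tool capabilities and shortcomings",
        ],
    )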
A scientific discipline makes progress when both the quality of research results and the level of consensus in the community increase. The dual influences of technical and sociological factors can be found throughout science and, in turn, benchmarking. In summary, it is the process of benchmarking that brings advances to a discipline, rather than the existence of a benchmark. The theory of benchmarking provides an explanation of this tight relationship between benchmarks and paradigms, and between scientific progress and the process of benchmarking.

This theory has many implications, and one is examined extensively in this dissertation: the idea that benchmarking can be used to encourage scientific progress. In other words, benchmarking provides a means for tapping into the same processes that cause scientific maturation.

1.2.1 Scope of Theory

This theory is concerned primarily with benchmarks that are created and used by a technical research community. Benchmarks that are created by a single individual or laboratory and are not used widely tend not to have the same impact on the community and research results. The community of interest may include participants from academia, industry, and government, but they are all primarily interested in scientific research. Benchmarks designed for business or marketing purposes are optimised for different goals, so this theory does not extend to cover them.

1.2.2 Benefits of Benchmarking

Benchmarking can have a strong positive effect on the scientific maturity of a research community. The benefits of benchmarking include a stronger consensus on the community's research goals, greater collaboration between laboratories, more rigorous examination of research results, and faster technical progress.

Others have also observed the benefits of benchmarking. Walter Tichy wrote, "…benchmarks cause an area to blossom suddenly because they make it easy to identify promising approaches and to discard poor ones" [153] (p. 36). In his Turing Award acceptance speech, Raj Reddy wrote about the success of benchmarks in speech recognition: "Using common databases, competing models are evaluated within operational systems. The successful ideas then seem to appear magically in other systems within a few months, leading to a validation or refutation of specific mechanisms for modelling speech" [118].

Creating a benchmark requires a community to examine its understanding of the field, to come to an agreement on what the key problems are, and to capture this knowledge in an evaluation. Using the benchmark results in a more rigorous examination of research contributions, and an overall improvement in the tools and techniques being developed. Throughout the benchmarking process, there is greater communication and collaboration among different researchers, leading to a stronger consensus on the community's research goals.

1.2.3 Dangers of Benchmarking

Any discussion of benchmarking must include consideration of the costs and risks. There is a significant cost to developing and maintaining a benchmark, so there is a danger in committing to a benchmark too early. Tichy wrote: "Constructing a benchmark is usually intense work, but several laboratories can share the burden. Once defined, a benchmark can be executed repeatedly at moderate cost. In practice, it is necessary to evolve benchmarks to prevent overfitting" [153] (p. 36). A community must be ready to incur the cost of developing the benchmark, and subsequently maintaining it. Continued evolution of the benchmark is necessary to prevent researchers from making changes that optimise the performance of their contributions on a particular set of tests. Too much effort spent on such optimisations indicates stagnation, suggesting the benchmark should be changed or replaced.

Locking into an inappropriate benchmark too early, using provisional results, can hold back later progress. The advantage of having a benchmark is that the community works together in one direction. However, this commitment means closing off other directions, albeit temporarily. Selection of one paradigm, by definition, excludes others. For example, the Penn Treebank in computational linguistics accelerated research in statistical parsing techniques, but is impeding research into other parsing techniques, such as parsing into a semantic representation [32].
Benchmarking needs to be used with care in areas that should avoid over-using a data set, such as machine learning and data mining. In such areas, additional work needs to be done to ensure that the results of the comparisons are valid. It would be undesirable for a general-purpose machine learning algorithm to become tuned to data from a particular problem domain. This concern can be remedied by including a variety of data sets in a benchmark and updating the benchmark regularly, as Tichy suggests. One possibility is to retain the other components of the benchmark, such as the tasks and performance measures, while incorporating a new collection of data.

In data mining, a data set is examined for patterns using statistical techniques. When probabilistic tests are applied repeatedly to the same data, two issues arise: i) the tests can no longer be treated as independent, and ii) a low-probability random event will eventually occur. These problems can be remedied through appropriate use of statistical techniques for comparing results from using the benchmark [122].
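As a simple illustration of the second issue, consider the idealised case where k repeated significance tests are independent and each is run at level α. The probability of at least one spurious positive result is

    P(at least one false positive) = 1 − (1 − α)^k

For α = 0.05 and k = 20 comparisons this is already 1 − 0.95^20 ≈ 0.64, so a spurious "discovery" is more likely than not. Corrections such as Bonferroni's, which run each test at level α/k, are one standard way of keeping this family-wise error rate near α; when the tests are not independent, as in the first issue above, the simple formula no longer applies and more care is needed.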
1.3 Validating the Theory

Naturally, new theories need to be validated, and this one is no exception. This theory is validated in Chapter 4 both empirically and analytically. The theory is validated empirically using the case histories that informed its development. This validation tests how well the theory postdicts known data. A second empirical validation is conducted using novel benchmarks that were not considered in the formulation of the theory. This validation tests how well the theory can predict, or extrapolate from known instances.

Two further analytic validations of the theory are given. The first shows that the theory of benchmarking is well-structured and logical. A well-structured theory is i) logically complete, ii) internally consistent, and iii) falsifiable [98]. A theory is internally consistent if it does not contradict itself. It is logically complete when the claims that are deduced flow logically from the assumptions. Finally, it is falsifiable if the argument is not circular (or tautological) and can be tested empirically. The second analytic validation is performed using a hierarchical set of criteria: postdiction, generality of explanans, number of hypotheses generated, progressive research program, breadth of implications, and parsimony. These criteria are normally used to choose between two alternative theories, so a rival theory was constructed for this analysis.
1.4 Applications of the Theory

The value of a theory becomes apparent when it is applied to practice. Of particular relevance are the implications of the theory for future benchmarking efforts in computer science and software engineering research. Chapter 5 provides guidance in three specific ways. One, it presents a process model for benchmark development and deployment. Two, it gives a Benchmarking Readiness Assessment for determining whether a research area is ready to undertake a benchmarking effort. Three, it provides criteria and checklists for constructing and evaluating benchmarks.

Over the last few years, I have been working to define benchmarks for software reverse engineering tools. During this time I applied and refined the principles described in the previous section. The first was the xfig benchmark mentioned at the beginning of this chapter. The benchmark was for comparing program comprehension tools (i.e., tools that help programmers understand source code for the purpose of making modifications) [136, 137]. Users of the benchmark had to complete a number of maintenance and documentation tasks on xfig 3.2.1, an Open Source drawing program for UNIX. This package was initially used in a structured demonstration format at CASCON99 and has gone on to be used at WCRE (Working Conference on Reverse Engineering) 2000 and in the curriculum of a graduate course at the University of Tampere, Finland. Experiences with this benchmark sparked my initial interest in formulating a theory.

The second benchmark was CppETS (C++ Extractor Test Suite, pronounced see-pets) [134]. Fact extractors are used to analyse source code and produce facts to be used in subsequent reverse engineering analyses. I used a collection of small test programs written primarily in C++, together with questions about these programs, to evaluate the capabilities of different fact extractors. This benchmark was initially used at a workshop at CASCON2001 and followed up at a working session at IWPC2002 (International Workshop on Program Comprehension). This benchmark was created after the core of the theory had been articulated, but there were still details to be refined. The process used with CppETS was guided by the theory and our experience with the xfig benchmark.

Both of these benchmarks were developed in collaboration with other researchers, used by additional researchers and tool developers, and discussed at a workshop or conference. Both benchmarks produced technically interesting results, but a more significant contribution was the deeper understanding of the tools and the research problem that they brought to the community [134, 136]. These findings were consistent with the theory of benchmarking [132].
1.5 Organisation of Dissertation

This dissertation has three parts: Background, Theory of Benchmarking, and Applications. Part I, consisting of Chapters 1 and 2, provides an introduction and motivation for the theory. The theory is presented in Part II: the theory of benchmarking itself is found in Chapter 3 and the validation for it in Chapter 4. Part III is on Applications of the theory. Chapter 5 provides a road map for others who wish to use the theory as a starting point for developing benchmarks in their research fields, or to realise the other technical and sociological benefits of community-based benchmarks. It includes a process model for benchmarking and checklists for assessing the value of both types of contributions. The remaining three chapters describe two case studies where benchmarking was used to advance research in tools for software reverse engineering. Chapter 6 is an introduction to reverse engineering research and tools. It also describes the scientific community and prior evaluations upon which the two benchmarks were built. The xfig benchmark is described in Chapter 7, and its emergence is interpreted using the theory of benchmarking; that chapter includes a critique of the evaluation as a benchmark, using the checklists from Chapter 5. Chapter 8 presents CppETS and has a format similar to the previous chapter: it describes the development of CppETS and includes a critique of it. This dissertation concludes with a summary and a discussion of future work in Chapter 9.
Chapter 2. Benchmarking of Software

This chapter presents the motivation for this dissertation. In the introductory chapter, I described the success with the structured demonstration that led to my interest in benchmarking as an effective empirical method. Here, I will discuss the wider problem in software engineering that I seek to address with the theory, and the previous work with benchmarks that set the stage for my contribution.

The fundamental problem that I seek to remedy is the shortfall in validation or evaluation of research contributions within software engineering. Compared to other sciences, fewer published papers contain such evaluations, and those that are published use less rigorous techniques [154, 171]. An appropriate validation should provide evidence of the efficacy of the new result and can be in the form of empirical data or an analytic proof. Marvin V. Zelkowitz and Dolores R. Wallace provide a trenchant description of this state of affairs. "Experimentation is one of those terms that is frequently used incorrectly in the computer science community. Researchers write papers that explain some new technology; then they perform 'experiments' to show how effective the technology is. In most cases, the creator of the technology both implements the technology and shows that it works. Very rarely does such experimentation involve any collection of data to show that the technology adheres to some underlying model or theory of software development or that the software is effective" [171] (p. 23).

Clearly there are exceptions to this characterisation, and many disciplines do use benchmarking as a research community, such as databases, computer systems, speech recognition, and text compression. Some of these provide the case histories used in this chapter: TPC-A™ for database management systems (DBMSs) [58], SPEC CPU2000 for computer systems [64], and the TREC Ad Hoc Retrieval Task for information retrieval (IR) systems [162]. I will be discussing these benchmarks because they provide good illustrative examples of the phenomenon of community-based benchmarks. Later chapters will refer to these case histories, as they will be used as observations in the formulation and validation of the theory of benchmarking. These are well-known benchmarks that are supported by an organisational infrastructure, but they had their origins in intimate collaborations within a research community.

TPC-A™, SPEC CPU2000, and the TREC Ad Hoc Retrieval Task were chosen as case histories for two reasons: 1) they had a strong positive impact on their respective research areas; and 2) there was good documentation on the development process. While there are many publications that report on the technical aspects of a benchmark, few discuss the process of arriving at the design. The three benchmarks in this chapter are exceptional because these non-technical aspects have been discussed in the literature. Nevertheless, these examples appear to be representative of benchmarking in general, because they are consistent with each other and with my own experience.

The body of this chapter is concerned with substantiating the points raised in this introduction. It will examine the status of research validation in software engineering, describe the three case histories, and begin a discussion on their implications for software engineering.
Validating Research Contributions There are many definitions of scientific method, but they have in common a sequence of
steps that involve observing some phenomenon, formulating an explanation for the phenomenon, using the explanation to make predictions, and testing the explanation by comparing the predictions to empirical data. This last step is called validation and it is done to improve the accuracy of the explanation and predictions. These steps are repeated until there are no discrepancies between the observations and the explanation. In computer science, the phenomenon under study is information processes. The products of our study can be explanatory theories, algorithms and tools that implement them, and methods for controlling information processes. There is some disagreement on whether the scientific method applies in computer science, but Tichy provides rebuttals to these arguments [153]. Regardless of these abstract arguments about scientific method, the need for validation is apparent in the practical problem of technology transfer. Industry is unlikely to adopt a new tool or technique without good evidence that it will improve software development. As Vincent Shen wrote, “…after a new technology is adopted, an organization must be willing to take a hit on productivity as their people undergo training for the new technology. It is difficult to imagine that any organization will commit to an activity that will 10
temporarily decrease productivity, unless the technology has proven long-term great value” [129]. To assess the level of validation undertaken within computer science and software engineering, Tichy et al. sampled 400 research papers from journals and conferences [154]. They found that computer science validates a lower proportion of research results and when validation studies are conducted they use weaker methods such as assertion and demonstrations. Assertions are essentially logical arguments in favour of a contribution, such as a software tool or prediction model. Demonstrations illustrate functionality on an example problem and they produce only predetermined results, not objective measurements. Within computer science, software engineering validates an even lower proportion. They examined a complete volumes of ACM Transactions on Computer Systems, ACM Transactions on Programming Languages, IEEE Transactions on Software Engineering, SIGPLAN Conference on Programming Language Design and Implementation, as well as a random sample of papers from the INSPEC database and from the papers published by ACM in 1993. For comparison, they also looked at papers from Optical Engineering (an established engineering field where many results have immediate applications) and Neural Computing (a relatively young interdisciplinary field with partial overlap with computer science, like software engineering). Tichy et al. found that 40% of papers that presented results in design and modelling did not contain any experimental evaluation. This proportion rose to 50% for software engineering papers. In contrast, Neural Computing and Optical Engineering, had fewer such papers, 12% and 15% respectively. In a similar study, Zelkowitz and Wallace surveyed all the articles published in the International Conference on Software Engineering (ICSE), IEEE Software, and IEEE Transactions on Software Engineering for the years 1985, 1990, and 1995 [171]. Approximately one third of the papers had no experimental validation and about one third validated their results using assertion, i.e. arguing that the contribution is an improvement on previous work. While Zelkowitz and Wallace found some improvement over the years, Shaw reported similar results in her ICSE2001 keynote address, The Coming-of-Age of Software Architecture Research [128]. She found that half of the papers in her sample used persuasion as a validation method. Her
11
sample included 24 classic software engineering papers and 10 papers on software architecture from that year’s conference. This shortage of rigorous validation has a number of detrimental effects on the discipline. Zelkowitz and Wallace wrote, “Without a confirming experiment, why should industry select a new method or tool? On what basis should researchers enhance a language (or extend a method) and develop supporting tools? In a scientific discipline, we need to do more than simply say, ‘I tried it, and I like it’” [171] (p. 23). This lack of insight leads to other problems, as Shaw wrote, “Poor internal understanding leads to poor execution, especially of validation, and poor appreciation of how much to expect from a project or result. There may also be secondary effects on the way we choose what problems to work on at all” [128]. These authors agree that there needs to be more validation, but there are minor differences in their suggested approaches. (In particular, Tichy et al. and Zelkowitz and Wallace advocate the use of empirical validation above all others, while Shaw advocates the use of appropriate validation, including analytic methods. Empirical methods involve the collection of data. Analytic methods involve the formulation of a logical or objective argument, such as a proof.) They all recommend raising awareness of the problem within the community, increasing sensitivity to research strategies, and promoting the publication of validated results and empirical studies. A selective, consolidated list of their recommendations is as follows. •
• Recognise our research strategies and how they establish their results. (Shaw)
• Use common taxonomy and terminology for discussing experimentation and the results they can produce. (Zelkowitz and Wallace)
• “Wherever appropriate, publicly accessible sets of benchmark problems must be established to be used in experimental evaluations.” (Tichy)
• “In many areas within CS, rules for how to conduct repeatable experiments still have to be discovered. Workshops, laboratories, and prizes should be organised to help with this process.” (Tichy)
• “…computer scientists have to begin with themselves, in their own laboratories, with their own colleagues and students to produce results that are grounded in evidence.” (Tichy)
While community-based benchmarking does not magically remedy the shortage of validation in software engineering, it provides a means for implementing the suggestions given above. Even small advances in these areas would have significant benefits. The iterative process used to create the benchmark establishes a vocabulary for discussing evaluations and raises the level of discourse on technical problems. Finally, using benchmarks to evaluate technology creates a body of empirical results that are open to scrutiny. This is not a new idea, although this dissertation is the first to treat the issue thoroughly. In 1987, Weiderman et al. advocated the use of benchmarking for software tools [165]. There are many similarities between their work and this dissertation. They too note the importance of using realistic tasks that are based on user activities, employing empirical evaluation, and keeping the design of the evaluation independent of technology. The next section examines the merits of benchmarking as an empirical method. The case histories presented in the following three sections show how researchers working collaboratively to build a benchmark improved the quality of evaluations in a community.
2.2 Benchmarking as an Empirical Method
This section will argue that benchmarking is an appropriate empirical method for evaluating
tools and techniques in software engineering. It will look at two issues: where benchmarking is situated relative to other empirical methods, and how suitable benchmarking is as a validation method in software engineering. Benchmarking is a hybrid empirical technique that lies somewhere in between experiments and case studies, in terms of control and ability to provide explanations [136]. Its features support the appropriate handling of four factors (technology under test, subject system, tasks, and user subjects) critical in any evaluation of technology. A benchmark is expected to be used multiple times, so the materials are designed for portability, thus facilitating replication. Finally, the familiarity of the format to computer scientists makes the technique and results accessible.
Empirical evaluations in software engineering are typically experiments or case studies1 [73, 102]. These methods occupy opposite ends of a spectrum in terms of how much control the researcher has over the study setting [117, 170]. The main finding of an experiment is normally a statistic with a probability attached to it. In contrast, the main finding of a case study is an explanation or pattern of behaviour. It is possible to have exploratory experiments and case studies with quantitative results, hence these two study designs form a continuum. Benchmarking has characteristics from both case studies and experiments, and consequently shares features of each of these two well-understood empirical methods, as summarised in Table 2-1.

Experiments
  Features:
  • Use of control factors
  • Replicability
  Advantages for Benchmarking:
  • Direct comparison of results
  • Built-in replication
  Disadvantages for Benchmarking:
  • Not suitable for building explanatory theories

Case Studies
  Features:
  • Little control over the evaluation setting (e.g. choice of technology and user subjects)
  • No tests of statistical significance
  • Some open-ended questions possible
  Advantages for Benchmarking:
  • Method is flexible and robust
  Disadvantages for Benchmarking:
  • Limited control reduces generalisability of results

Table 2-1: Comparison of Benchmarking to Experiments and Case Studies
Like experiments, control of the Task Sample is used to reduce variability in the results—all tools and techniques are evaluated using the same tasks and materials. However, there is little control over the selection of tools or techniques to be evaluated and their human operators; in this sense, benchmarking is more like a case study because these factors are determined by the
1
The terms experiment and case study are often misused. As the quotation at the beginning of this chapter
from Zelkowitz and Wallace indicated, ‘experiment’ is sometimes used to mean exploration through trial and error. Similarly, ‘case study’ is frequently used to mean demonstration or experience report. These misuses hint at a range of technical problems in research design, but the most important distinction from the colloquial uses is that the research methods seek to test or construct hypotheses that have intellectual value beyond the exercise at hand.
situation. Furthermore, tests of statistical significance are not always performed on benchmark results. A convenient feature of benchmarking is that replication is built into the method. Since the materials are designed to be used in different laboratories, people can perform the evaluation on various tools and techniques, repeatedly, if desired. Also, some benchmarks can be automated, so a computer does the work of executing the tests, gathering the data, and producing the Performance Measures. In any evaluation of software tools and techniques, there are four factors that must be employed as a control or treatment factor. These factors are:
• Technology Under Test – The tools or techniques being studied.
• Subject System – The software being analysed, or manipulated using the tools.
• Tasks – The operations that must be performed on the subject system.
• User Subjects – The individuals who will be performing the tasks during the study.
A treatment factor is the intervention being studied. For example, in a study of pretty-printing formats for source code, colour-coding could be a treatment factor. (This treatment factor would have different levels, or doses. To continue the example, one treatment level would be no colour, a second treatment level could be coloured keywords, and a third would be coloured keywords and parentheses.) A control factor is held constant in a study to help rule out alternative explanations. To build further on the example, the pretty-printed code should all be given to programmers using the same medium, e.g. on screen or on paper [117]. These four factors, Technology Under Test, Subject System, Tasks, and User Subjects, are critical to the interpretation of the results, so they cannot be allowed to vary freely. An experiment will vary the levels of the treatment factors systematically and carefully control the levels of factors that affect the outcome. A case study will typically have one treatment factor and let the situation determine the other factors. A benchmark holds constant the Subject System and the Tasks to support consistent evaluation of the tools. There is little control over the selection of User Subjects, because researchers decide on their own whether to participate in a benchmark. Theoretically, the tool developers know their own products better than the average user, especially for research tools. Consequently, the results may not be typical.
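To make the handling of these four factors concrete, the following minimal sketch (in Python, with entirely hypothetical tool, system, and task names) shows how an automated benchmark harness might hold the Subject System and Task Sample constant while varying only the Technology Under Test. It illustrates the division into control and treatment factors; it is not a prescribed implementation.

```python
# Minimal sketch of an automated benchmark harness (hypothetical names throughout).
# Control factors: the Subject System and the Task Sample are fixed for every run.
# Treatment factor: the Technology Under Test changes from run to run.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass(frozen=True)
class Task:
    name: str          # e.g. "recover the call graph"
    question: str      # what the tool operator must answer about the subject system

SUBJECT_SYSTEM = "example-editor-2.1"   # hypothetical subject system, the same for all tools
TASK_SAMPLE: List[Task] = [
    Task("call-graph", "Which functions call save_file()?"),
    Task("dead-code", "Which functions are never called?"),
]

def run_benchmark(tool: Callable[[str, Task], str], tool_name: str,
                  reference_answers: Dict[str, str]) -> Dict[str, float]:
    """Apply one Technology Under Test to every task in the sample and record a
    simple Performance Measure (here, agreement with a reference answer)."""
    scores: Dict[str, float] = {}
    for task in TASK_SAMPLE:
        answer = tool(SUBJECT_SYSTEM, task)           # only this factor varies between runs
        scores[task.name] = 1.0 if answer == reference_answers[task.name] else 0.0
    print(f"{tool_name}: {scores}")
    return scores
```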
Having established that benchmarking is a valid empirical method for evaluating software technology, I will now examine three benchmarks that have had a significant impact on their respective research communities. I will return to these case histories in Chapter 4 to assess whether the theory of benchmarking adequately accounts for this set of phenomena.
2.3 Case History: TPC-A™ for Database Management Systems
The database area is an example of a research field where benchmarking has a prominent
role. A great deal of attention has been devoted to theoretical and technical problems in benchmarking existing and emerging technologies, e.g. object-oriented databases and XML (eXtensible Markup Language) databases [10]. However, this has not always been the case. Here, I describe the emergence of TPC-A, the Transaction Processing Performance Council (TPC) Benchmark™ A for Online Transaction Processing including a LAN or WAN network, one of the first standardised benchmarks for DBMSs. In the early 1980s, there was growing interest in measurement of database performance among vendors, customers, and researchers [58]. Due to improvements in DBMS software and the declining cost of computer hardware, there was increasing use of larger on-line transaction processing systems. Vendors claimed their DBMSs could sustain 1000 transactions per second, or more. Unfortunately, there was no commonly agreed upon definition for a transaction, nor was there a common method for measuring its rate. Consequently, a number of “ad hoc benchmarks” arose; these were isolated evaluations developed by a stakeholder to answer a specific question or to make a particular purchasing decision. The publication of poorly documented performance figures led to a widespread disillusionment with benchmarking. Jim Gray recalls some of these questionable tactics: “When a vendor did publish numbers, they were generally treated with skepticism. When comparative numbers were published by third parties or competitors, the losers generally cried ‘foul’ and tried to discredit the benchmark. Such events often caused benchmark wars. Benchmark wars start if someone loses an important or visible benchmark evaluation” [58] (p. 7). Thus begins a sequence of one-upmanship where the losing vendors rerun the benchmark using progressively more famous specialists and gurus. Gray continues,
“At a certain point, a special version of the system is employed, with promises that the enhanced performance features will be included in the next regular release. Benchmarketing is a variation of benchmark wars. For each system, there is some benchmark that rates the system as best” [58] (p. 7). In 1985, a paper was published in Datamation that marked the start of a multi-organisation collaborative effort to develop a standard benchmark [11]. This effort was spearheaded by Gray and had so many contributors from both industry and academia that the author on the paper was given as “Anon et al.” The paper described three tests, including DebitCredit, and proposed two metrics for evaluating database performance: the number of transactions per second (tps) and cost per tps ($K/tps), where cost was measured over five years. This paper struck a chord in the database community, among both researchers and vendors. The ideas in the Anon et al. paper became a starting point for standardisation. Because the paper was written collaboratively by respected members of the community, it garnered a lot of attention. Moreover, it attracted others to participate in the effort. While it did not eliminate the confusion and benchmarking wars, it was a start. Other papers were published refining Anon et al.’s ideas, and other researchers joined Gray in promoting the benchmarking effort [58]. Eventually, a number of representatives from industry and academia formed the TPC for the purpose of standardising and supervising benchmarks for DBMSs. The DebitCredit test evolved into TPC-A™, and was published in November 1989 [58]. Developing TPC-A™ took over two years and required nearly 1200 person-days of effort contributed by researchers and 35 database vendors who were members of the consortium. The process involved many meetings as well as laboratory work by the members. The final TPC-A™ specification is over 40 pages long with 11 different clauses covering issues such as transaction and terminal profiles, scaling rules, response time, and rules for full disclosure. In 1991, Gray wrote: “So far the TPC has defined two benchmarks, TPC BM™ A, and TPC BM™ B. These two benchmarks have dramatically reduced the benchmark wars. Customers are beginning to request TPC ratings from vendors. The TPC metrics capture peak performance and price/performance of transaction processing and database systems running simple update-intensive transactions. In addition, the TPC reporting procedures have a ‘full disclosure’ mechanism that makes it
difficult to stretch the truth too much. The benchmark must be done on standard hardware and released software. Any special tuning or parameter-setting must be disclosed. In addition, the TPC highly recommends that an independent organization audit the benchmark tests” [58] (p. 8). Version 2.0 of TPC-A™ was declared obsolete in June 1995. It was superseded by TPC-C™, which is now in version 5.0. The case history of TPC-A™ only tells part of the story of benchmarking in the database community. At the same time, other benchmarks were being developed including TPC-B™ (similar to TPC-A™, but without a network), the Wisconsin benchmark for evaluating relational query systems, and AS3AP for mixed workloads and utility functions [58]. Today there are countless database performance benchmarks. Some are proposals and innovations from researchers. Others are developed by vendors for marketing purposes. A few of these have become widely accepted and widely used. TPC remains an active organisation, standardising new benchmarks to reflect evolving DBMS technology [10].
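As an illustration of how simple the two Anon et al. metrics are to apply, the following sketch derives a price/performance figure of the kind reported in this style of benchmarking. The throughput and cost figures are invented for the example and do not come from any TPC result.

```python
# Hypothetical price/performance calculation in the style of the Datamation metrics.
throughput_tps = 420.0          # measured transactions per second (invented figure)
five_year_cost = 2_940_000.0    # total system cost over five years, in dollars (invented figure)

cost_per_tps = five_year_cost / throughput_tps
print(f"{throughput_tps:.0f} tps at ${cost_per_tps / 1000:.1f}K/tps")  # 420 tps at $7.0K/tps
```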
2.4 Case History: SPEC CPU 2000 for Computer Systems
Computer systems is an area that has long been associated with benchmarking. Performance
features of microchip architectures, system configurations, etc., have been easy to measure, but these measures have been hard to compare and interpret in particular usage contexts. In the 1980s, there was a desire to move away from synthetic benchmarks that posed relatively small problems towards macroscopic benchmarks with realistic tasks. By the second half of the decade, computer vendors were independently researching such benchmarks. A number of them, including Sun Microsystems, Hewlett-Packard (HP), and Digital Equipment Corporation (DEC), came together to form a consortium to work on this problem [108]. This organisation was SPEC (Standard Performance Evaluation Cooperative, later renamed Standard Performance Evaluation Corporation). It was founded in November 1988 and released its first CPU benchmark, SPEC89, the following year. The SPEC89 benchmark has been superseded three times with some changes in terminology, so the current version is CPU2000. These benchmarks have become widely used and widely quoted in the computer industry. Members of the non-profit consortium include hardware and software vendors, universities, and customers [64]. Committees are formed to
develop a particular benchmark and they solicit requirements, test cases, and votes on benchmark composition from consortium members and the general public. The committee uses “benchathons” to refine the benchmark. John Henning explained that at such events committee members from different organisations work together to solve a common problem: “The point of a benchathon is to gather as many as possible of the project leaders, platforms, and benchmarks in one place and have them work collectively to resolve technical issues involving multiple stakeholders. At a benchathon, it is common to see employees from different companies looking at the same screen, helping each other” [64] (p. 30). The benchmark selection process involves both technical and social issues. Some of the technical challenges include portability, coverage of hardware resources, methods for measuring comparable work on different systems, and tool support to automatically run the benchmark. The social issues primarily involve vendors being protective of their data and products. To ensure that the benchmark under development will have technical credibility, performance data from different platforms is needed. Henning wrote that while this problem is a concern, it can be solved: “But SPEC members are often employees of companies that compete with each other, and vendor confidentiality limits what can be said, for both business and legal reasons. The solution to this problem for CPU2000 was that all members simultaneously provided some amount of objective data, and the subcommittee kept this data confidential, thereby reducing management concerns. The process worked well: SPEC gained objective data about candidate benchmarks’ I/O, cache and main-memory behaviour, floating-point operation mixes, branches, code profiles, and code coverage” [64] (p. 30). Also, there is a temptation for a committee member to vote for a test because it favours his or her employer’s hardware. However, this bias is usually eliminated during peer review, and Henning explains how this process works: “Of course, SPEC members who are vendor employees keep their employer’s interests in mind. For example, an employee of a company that makes big-endian Unix machines makes sure that the playing field is not tilted in favour of little-endian
NT systems. Arguments to level the playing field are always welcome and quickly attract support. But attempts to tilt the playing field just don’t work” [64] (p. 30). CPU2000 consists of 26 programs (12 with only integer arithmetic and 14 with floating point) and includes well-known programs such as gzip (GNU zip), the GNU C Compiler, and the Perl programming language. The benchmark is scheduled to be replaced in 2004. The call has already gone out in publications, USENET newsgroups, and the SPEC web site for candidates to be considered for inclusion.
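The discussion above does not describe how results on the 26 programs are combined into the familiar headline numbers. Commonly reported SPEC CPU figures are formed by normalising each program's run time against a reference machine and summarising the ratios with a geometric mean; the sketch below illustrates that style of aggregation with invented timings and is not taken from the SPEC specification.

```python
# Illustrative SPEC-style aggregation: normalise each program's run time against a
# reference machine, then summarise the ratios with a geometric mean. Timings are invented.
import math

reference_times = {"gzip": 1400.0, "gcc": 1100.0, "perl": 1800.0}   # seconds on the reference machine
measured_times  = {"gzip":  350.0, "gcc":  275.0, "perl":  600.0}   # seconds on the system under test

ratios = [reference_times[p] / measured_times[p] for p in reference_times]
suite_score = math.prod(ratios) ** (1.0 / len(ratios))   # geometric mean of the per-program ratios
print(f"ratios: {[round(r, 2) for r in ratios]}, suite score: {suite_score:.2f}")
```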
2.5 Case History: TREC Ad Hoc Retrieval Task
Information retrieval (IR) is a research area concerned with the development of algorithms,
data structures, and systems for automatically analysing and classifying text documents. Examples of IR systems include web search engines such as Google, OpenText, and AltaVista, as well as customised subscription services provided by web sites such as CNN. Empirical studies have long had a role in this field and measures of precision and recall are widely accepted. In the early 1990s, both the National Institute of Standards and Technology2 (NIST) and the Defense Advanced Research Projects Agency (DARPA) initiated a major project to compare a number of IR systems from both industry and research. In 1992, the two organisations co-sponsored the first Text REtrieval Conference (TREC), which marked the beginning of a community-wide effort to develop a standard benchmark. In the overview of the conference proceedings, Donna Harman wrote, “In the 30 or so years of experimentation there have been two missing elements. First, although some research groups have used the same collections, there has been no concerted effort by groups to work with the same data, use the same evaluation techniques, and generally compare results across systems. The importance of this is not to show any system to be superior, but to allow
2
NIST has an interest in the evaluation of technology and has been leading efforts to create benchmarks for areas
including speech recognition and computational chemistry.
comparison across a very wide variety of techniques, much wider than only one research group can tackle. “…The second missing element, which has become critical in the last 10 years, is the lack of a realistically-sized test collection. Evaluation using the small collections currently available may not reflect performance of systems in a large full-text searching, and certainly does not demonstrate any proven abilities of these systems to operate in real-world information retrieval environments. This is a major barrier to the transfer of these laboratory systems into the commercial world” [60] (p. 1). This conference sought to tackle these two missing elements. Participating groups would apply their IR systems to a common set of tasks on a large corpus of text. TREC has become an important annual conference in IR and each year it has included challenges that are open to all participants. At TREC-1 in 1992, there were two challenges, a news article routing task and the ad hoc retrieval task [60]. This case history will focus on the latter. This ad hoc task consists of searching a corpus of text for articles that match a specific topic. The corpus used in this task originally contained about 740 000 articles and by 1998 had grown to more than twice this size [162]. It consists primarily of news items, such as those from the Wall Street Journal and the Associated Press. The formatting and length of the topics has also evolved since the inception of the task. A modern example of a topic is “What are the prospects of the Quebec separatists achieving independence from the rest of Canada?” [61]. In advance of each conference, the organisers made the rules, topics, and corpus available [60]. Participating teams used their IR systems on the given problems and submitted their results to the organisers prior to the conference. At the conference, organisers unveiled their results and participants gave presentations on the performance of their systems and the design of the task. The conference proceedings were written and published after the conference and the papers served as technical descriptions and experience reports. At TREC-1, 25 groups participated in the evaluation. At TREC-8 in 2000, there were 41 participants [60, 162]. The early TRECs were focused on refining the format of the evaluation, i.e. the statement of the topics, method for making relevance judgements, rules for reporting and participation. By TREC-4, the task had stabilised and additional tasks were being added. Over the years, there
have been as many as nine tasks, but in recent years this number has decreased to six. The ad hoc task was discontinued after 1998 because the benchmark was no longer pushing research forward, so the organisers felt that their energy would be better spent on other benchmarks. Ellen M. Voorhees and Harman wrote: “Retrieval effectiveness in the ad hoc task has improved dramatically since the beginning of TREC but now has levelled off. Because of these considerations and the fact that we now have 8 years worth of test collections, the ad hoc main task will be discontinued in future TRECs. This is not to say that we believe that the ad hoc task is solved. Indeed, Figure 8 shows absolute performance on the task is far from ideal. Rather it is an acknowledgement that sufficient infrastructure exists so that researchers can pursue their investigations independently, and thereby free TREC resources for other tasks” [162] (p. 15). Voorhees and Harman made reference to Figure 8, which shows a steady improvement in precision from TREC-1 to TREC-5, but only marginal changes from TREC-6 onwards. Evidence of this levelling off in the rate of improvement can also be found in the design of the tasks. In the early TRECs, there were improvements in both the design of the ad hoc task and the IR systems. These adaptations show that both the organisers and the participants were learning from their experiences. Later descriptions of the task in conference overviews became less detailed. This abbreviation showed that the community had reached a level of collective understanding about the problem and a common vocabulary had been established.
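The retrieval effectiveness and precision figures discussed above rest on the standard IR measures mentioned at the start of this section. As a minimal illustration, with invented document identifiers, precision and recall for a single topic can be computed as follows.

```python
# Precision and recall for one topic in an ad hoc retrieval run (document IDs are invented).
retrieved = {"WSJ870324-0001", "AP880212-0047", "WSJ910503-0112", "AP890101-0009"}
relevant  = {"WSJ870324-0001", "WSJ910503-0112", "FT921203-0220"}

true_positives = retrieved & relevant
precision = len(true_positives) / len(retrieved)   # fraction of retrieved documents that are relevant
recall    = len(true_positives) / len(relevant)    # fraction of relevant documents that were retrieved
print(f"precision = {precision:.2f}, recall = {recall:.2f}")   # precision = 0.50, recall = 0.67
```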
2.6 Analysis of Case Histories
While each of these case histories follows its own course, there are features common to all
of them. Maturity of Technology. In all three cases, the technology had matured to the point where a variety of alternative implementations were available and engineering issues began to dominate. As a research area emerged, different approaches were tried and initial successes were readily attained. Consequently, maturation of the field was concurrent with a proliferation of different solutions. In addition, there was growing industrial interest in these results. When SPEC was formed, there were a number of hardware vendors with competing chip and system designs.
Vendors of DBMSs had tests to show that their systems were powerful enough to sustain 1000 transactions per second, but it was difficult to compare systems using these ad hoc tests. Prior to TREC-1, IR systems could be found in academia and industry that used a variety of novel approaches. Interest in Evaluation and Comparison. Within the respective communities for research in database, systems architecture, and IR, there was already a history of empirical evaluations, often leading back to the 1960s. It was generally accepted that results needed to be validated, but standard techniques, tasks, or corpuses did not exist. Work on benchmarking was foreshadowed by isolated experiments or case studies and followed by a growing concern regarding the realism, validity, and scalability of these comparisons. In the computer systems community, there was a desire to use more realistic programs instead of synthetic ones in their benchmarks. For databases, there was also a multiplicity of tests, but using incommensurate techniques and definitions. Similarly in IR, DARPA had started the TIPSTER project with the participation of researchers for the purpose of developing a standard evaluation for IR systems. Collaboration Led by Champions. With these two factors mentioned so far as the backdrop, various members of the community began to work together on the problem of evaluation. Informal collaboration was followed by joint papers and/or meetings. In each case history, it is possible to identify a milestone that marks the start of an effort to develop a community-wide benchmark. For databases, this was the paper by Anon et al. For computer systems, it was the formation of SPEC. For IR systems, it was TREC-1. These events also served to identify the leaders for the work. While these people did make technical contributions along with other members of the community, they served primarily as organisers who kept the work active. Growth in Collective Understanding. Papers that reported on the benchmarks were initially concerned with minutiae of the evaluation process. They took pains to lay out the rationale and design decisions underlying the process. These descriptions became more laconic as the evaluation method became established. The consensus-based process used to develop the benchmarks brought with it a common vocabulary and set of concepts for discussing the evaluation. Detailed discussions about method and design became obsolete as the community
came to accept that the benchmark provided a fair and level playing field for comparing their tools. These similarities and my experiences with designing benchmarks provide a starting point for developing the theory of benchmarking. There are other commonalities that will be brought out in the presentation of the theory of benchmarking over the next two chapters.
2.7 Plan of Attack
The plan for this dissertation is to provide an explanation for the effect that benchmarking
has in a scientific discipline. This explanation, or theory, was developed by examining existing benchmarks in computer science and reviewing the literature from philosophy of science. The case histories of benchmarks illustrate the set of phenomena that the theory needs to account for. These case histories include the three discussed in this chapter, as well as the Penn Treebank for computational linguistics [32, 91], the Calgary and Canterbury Corpuses for data compression algorithms [14, 19, 115], and others. The background material from philosophy of science provides an understanding of how scientific disciplines function. It includes descriptions of the processes at work in these communities and serves as a conceptual and intellectual foundation for the theory [21, 25, 68, 81, 86, 99]. In addition, resources from sociology and political science were consulted for advice on theory construction [24, 47, 98]. Having constructed this theory, hypotheses can then be derived. Of particular interest are implications that provide guidance on how benchmarking can be used in software engineering. For example, they suggest when benchmarking is appropriate and how to undertake the work of building a community-based benchmark. This advice is tested in the development and deployment of two benchmarks in the reverse engineering community [134, 136].
2.8 Summary
In this chapter, I presented the motivation for developing a theory of benchmarking. Too
few of the research results in software engineering undergo any sort of validation and many of the results that are validated use methods that do not rely on evidence or objective data, such as 24
assertion or persuasion. Authors such as Shaw, Tichy, and Zelkowitz and Wallace have described the problems entailed by this shortfall and have suggested remedies. The success that other research areas such as databases, computer systems, and IR have had with benchmarking suggests that this technique can be used to increase validation in software engineering. In the next chapter, I will present the theory of benchmarking. It is the result of careful consideration of the case histories in this chapter and my own experiences in developing benchmarks for the reverse engineering community. The theory provides an explanation for beneficial effects that benchmarking has on a scientific discipline.
Chapter 3. Theory of Benchmarking
Part II of this dissertation consists of this and the next chapter, where the theory of benchmarking is presented and validated. In this chapter, the theory of benchmarking will be described in detail. In Chapter 1, it was claimed that a benchmark operationalises a scientific paradigm. A fuller explanation of this relationship is provided here. It begins by giving the definition for benchmark that is used in this theory in Section 3.1 and providing background on the ideas of Thomas S. Kuhn on scientific revolutions and paradigms in Section 3.2. From there the linkages and parallels between the two will be drawn. This relationship between benchmarks and the dominant scientific paradigm for a discipline forms the basis for mechanisms that bring on the salutary effects of benchmarking. Two of the reasons are: benchmarking promotes an open model of science that invites scrutiny and collaboration; and benchmarking fosters cross-fertilisation of ideas between researchers and the different roles that they can assume. These and other mechanisms are further described in Sections 3.5-3.7. Finally, the theory of benchmarking is recapitulated as a series of concepts, definitions, and relationships in Section 3.8.
3.1 Definition of Benchmark
Within this theory, a benchmark is defined as a standardised test or set of tests used for
comparing alternatives. A benchmark has three components: a Motivating Comparison, a Task Sample, and Performance Measures. This definition was developed by looking at existing definitions and case histories of benchmarks. A general definition is used to ensure wide applicability of the theory, but within this dissertation the discussion is restricted to benchmarks for computer science research. The term benchmark has been used in many places to mean many things. It likely originated in a trade such as carpentry where a mark was made on a work bench to determine whether a length of material was sufficient for a task, literally a bench mark [139]. This term then moved to land surveying where a bench mark is an established baseline from which measurements can be made [7]. Typically, it was on a stationary object with a previously determined position and
elevation. The term benchmark (one word) has come to mean “A standard by which something can be measured or judged” and has become closely associated with computers [7]. Within computing, benchmarking is used in a variety of ways. It is most frequently used to measure the performance of hardware and computer systems. It is also used in usability testing [116] and software metrics [69] with a slightly different meaning, closer to the sense of baseline, but still retaining the element of comparison. A number of definitions used in computing are included in Appendix A. The definition used in this dissertation draws heavily upon two definitions, one provided by SPEC and one provided by Walter Tichy. SPEC is well-known for its benchmarks for computer systems, graphics cards, and Java™ compilers. The SPEC glossary gives a general definition: “…a “benchmark” is a test, or set of tests, designed to compare the performance of one computer system against the performance of others” [139]. To this, I added Tichy’s definition, which states that a benchmark is “…a task domain sample executed by a computer or by a human and computer. During execution, the human or computer records well-defined performance measurements [153].” From this definition, I took the idea of a task sample and performance measurements that could be performed by a human. A benchmark has three components: a Motivating Comparison, a Task Sample, and Performance Measures. Development of a benchmark can begin with any of these components.

3.1.1 Motivating Comparison
This component encompasses two concepts, comparison and motivation. The purpose of a benchmark is to compare, so the comparison that is at the heart of a benchmark must be clearly defined and accepted. The motivation pertains to the need for the research area, and in turn the benchmark itself and the work on it. Thus, the Motivating Comparison captures both the technical comparison to be made, as well as the research agenda that will be furthered by making this comparison. The research goals for a benchmark can be as simple as learning about different approaches or as sophisticated as gathering evidence for technology transfer. The different purposes for comparison are discussed in detail in Section 3.7.
3.1.2 Task Sample
At its core, a benchmark is a set of tests that have been standardised, so they can be applied uniformly. The tasks in the benchmark should be a representative sample of the tasks that a class of tools or techniques is expected to solve in actual practice. Since it is not possible to include the entire population of tasks from the problem domain, a selection of tasks acts as a surrogate. At a minimum the standard will specify the tasks, the setting or environment for the tasks, procedural rules for running the tests, and a format for reporting the results. The tasks that form a benchmark are often controversial and Tichy has also commented on this, “The most subjective and therefore weakest part of a benchmark test is the benchmark’s composition. Everything else, if properly documented, can be checked by the skeptic. Hence, benchmark composition is always hotly debated” [153] (p. 36). Included in the composition of the benchmark are Performance Measures, which are discussed next.

3.1.3 Performance Measures
Performance is not an innate characteristic of the technology, but is a relationship between the technology and how it is used. As such, performance is a measure of fitness for purpose. These measurements can be made by a computer or by a human, and can be quantitative or qualitative. They can be on any measurement scale: nominal, ordinal, ratio, or absolute. In computing, the term “performance” is typically associated with the size, speed, or capacity of system hardware and software, such as the speed of a processor chip, the capacity of a network, the seek time and throughput rate of a drive, or thread creation and termination speed. However, other aspects of computing have performance aspects as well. Algorithms have a worst case running time that can be determined through analysis, while expected running time can be measured statistically. For information systems and database management systems, common ways of expressing performance are throughput (tps—transactions per second), cost per tps, response rate, search time, and index time. A paint program (or any other software with a user interface) can be said to perform well if users can complete tasks quickly and make few errors. The performance of a software development process is often characterised using the quality of the software produced. Extrapolating from these examples, performance is a measure of whether a system is fit for the purpose for which it is to be used. Consequently, a Performance Measure does not need to be quantitative, so long as it is an indicator of the strength of this relationship.
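To gather the three components in one place, the following sketch records a benchmark definition as a simple data structure. The example values are hypothetical; they are only meant to show that Performance Measures may be quantitative or qualitative and may sit on different measurement scales.

```python
# A hypothetical record of the three components of a benchmark. The example values
# are invented for illustration; the theory prescribes only the three components.
from dataclasses import dataclass
from typing import List

@dataclass
class Benchmark:
    motivating_comparison: str        # the comparison and the research agenda behind it
    task_sample: List[str]            # representative tasks standing in for the problem domain
    performance_measures: List[str]   # quantitative or qualitative, on any measurement scale

example = Benchmark(
    motivating_comparison="Which program comprehension tool best supports maintenance tasks?",
    task_sample=["locate the code implementing feature X",
                 "trace a reported defect to its cause"],
    performance_measures=["task completion time (ratio scale)",
                          "correctness of the answer (ordinal rating)"],
)
```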
Algorithmic correctness is not a performance measure that is used in benchmarking. While it is not difficult to prove an algorithm correct, it is not possible to test for this correctness. In other words, testing cannot distinguish those situations where a tool produces the correct output, but for the wrong reasons. It is, however, possible to test for correctness in the sense of whether the technology meets user needs and this is more appropriately called reliability. If algorithmic correctness is a user requirement, it will need to be established using other means.

3.1.4 Proto-benchmarks
A proto-benchmark is a set of tests that is missing one of these components. The most common proto-benchmarks lack a Performance Measure and are sometimes called case studies or exemplars [51]. These are typically used to demonstrate the features and capabilities of a new tool or technique, and occasionally used to compare different technologies in an exploratory manner. According to Feather et al., “The use of standard exemplars has become a widely accepted tool in specification research” [51] (p. 419). Commonly used exemplars in requirements research include the lift and library problems, a package router, and a railway crossing. Their definition for exemplars is consistent with the definition of benchmark used in this dissertation: “…a self-contained, informal description of a problem in some application domain; they are proposed as unique input for the specification process. Exemplars thus define, in the broadest sense, model specification tasks” [51] (p. 419).
3.2 Scientific Revolutions
“…[scientific revolutions] are not isolated events but extended episodes with a regularly recurrent structure. Discovery commences with the awareness of anomaly, i.e., with the recognition that nature has somehow violated the paradigm-induced expectations that govern normal science. It then continues with a more or less extended exploration of the area of anomaly. And it closes only when the paradigm theory has been adjusted so that the anomalous has become the expected” [81] (pp. 52-53).
In 1962, Thomas S. Kuhn published a ground-breaking work in the history and philosophy
of science. As the introductory quotation shows, Kuhn makes the claim that scientific progress
occurs through revolutions, in much the same way that social and political progress occurs. Moreover, these scientific revolutions have a repeating pattern, or structure, as shown in Figure 3-1 below, and hence the title of the book, The Structure of Scientific Revolutions. All disciplines or fields of study begin in the pre-science stage, in which there are numerous and extensive disagreements about “the nature of legitimate scientific problems and methods,” [81] (p. x) because the discipline lacks a paradigm. A paradigm is the world view and knowledge that defines a community. This knowledge includes both explicit elements, such as facts about the domain, required skills, and accepted techniques; and tacit elements, such as vocabulary and values. A paradigm is often taught or communicated through a quintessential problem or exemplar.
pre-science → normal science → crisis → revolution

Figure 3-1: Stages of Scientific Revolutions

A paradigm is necessary for a discipline to enter the normal science stage. Progress in the normal science stage is characterised as “puzzle-solving”, in that the problems worthy of being solved and the rules for evaluating the solutions are defined by the paradigm. There are three classes of legitimate problems in normal science: 1) determination of significant facts, 2) confirming theory with facts, and 3) further articulation of theory [81] (pp. 25-27). The first class of problems involves finding facts with greater precision and accuracy, and applying the knowledge to new sub-fields. The second class of problems involves comparing predictions from theory with observations or experimental results. The third class of problems involves providing additional details to the theory or resolving some ambiguities within the theory. In this stage, when unexpected anomalies occur, either the dominant model is modified in some small way to account for these results, or the aberrant results are further investigated. The paradigm itself is not questioned unless the discipline is already in a crisis state.
A scientific community enters the crisis stage when there is sufficient dissatisfaction with the current paradigm within the discipline. This dissatisfaction usually arises because the paradigm has been amended so extensively that it no longer fulfils its role as a unifying world view. The unease may be found within key individuals or the discipline as a whole. During this stage, anomalies play a very different role than during the stage of normal science. Because individuals are already dissatisfied with the paradigm, anomalies point the way to alternate explanations or world views. A revolution occurs when the wider scientific community accepts a new paradigm ushered in by a new theory that adequately explains all the results including the anomalies. The new paradigm is normally incommensurable with the old one; it is not possible to hold two world views simultaneously. Some members of the community can be reluctant to accept the change because the new paradigm challenges much of what they know about the world. On the other hand, some may have been in despair before and embrace the new paradigm, as Wolfgang Pauli did when he wrote, “Heisenberg’s mechanics has again given me hope and joy in life. To be sure it does not supply the solution to the riddle, but I believe it is again possible to march forward” [81] (p. 84). At this point, I will deal with some of the controversy concerning Kuhn’s definition of paradigm. The remainder of this section deals with a specialist debate of Kuhn’s theory and can be skipped with no loss of continuity. The term paradigm has drifted over the years, and some of the blame can be placed on Kuhn, as his understanding of the philosophy of science evolved between editions of The Structure of Scientific Revolutions [81]. This drift in definition has led to attacks on his theory. Care must be taken when using the term paradigm because different meanings have come to be associated with it over time. Within philosophy of science, this term is little used due to this evolving definition [160]. One of Kuhn’s early definitions for paradigm dovetails nicely with benchmarks. He described paradigms as “concrete problem solutions that the profession has come to accept” (p. 225). Consensus on a problem solution represents agreement on two things: “that a particular situation, articulated in a particular way constitutes a scientific problem, and agreement that a particular way of dealing with the problem constitutes a scientifically acceptable solution to it”
[68] (p. 134). They “are accepted not merely as what they are in themselves but also insofar as they provide guides for research, as the basis of scientific practice” [68] (p. 135). The main source of difficulty for this definition is a misleading whole-part distinction. This problem arises when trying to isolate a constituent component that is inseparable from the larger entity. In such cases, making this distinction is not useful, meaning that it doesn’t add to our understanding of the entity. The definition that Kuhn used in the first edition of The Structure of Scientific Revolutions is that a paradigm consists of symbolic generalisations (universal propositions), models of the phenomenon being studied, values entailed by prevailing scientific theory, and exemplary problem solutions [68]. In the broad sense, paradigms denoted all of these things. In the narrow sense, the term denoted an exemplary problem solution. At other times, Kuhn used paradigm to mean some combination of components. In the second edition of Structure, Kuhn attempted to repair the definition. He used “disciplinary matrix” to mean paradigms in a broad sense along with the education, publications, and professional societies associated with a discipline. The term exemplar was used to replace paradigms in the narrow sense. Despite this attempt at clarification, Kuhn stopped using the term paradigm entirely after 1972 [68]. I believe that these amendments were not effective because they further entrench a whole-part distinction that is confusing and unnecessary from the outset. Furthermore, it is not a useful distinction in the context of this dissertation. Suffice it to say that despite flaws in the details, Kuhn’s ideas are seminal and still serve as a point of reference for work in the field today. Moreover, outside philosophy of science this term remains popular due to its intuitive appeal.

3.2.1 Scientific vs. Engineering Disciplines
Since Kuhn’s theory addresses scientific disciplines, an important question is whether it can be applied to engineering disciplines, in particular, software engineering3. According to
3
For practical purposes, both research and industry consider software engineering to be an engineering
discipline because it is concerned with the application of computer science and other knowledge in the creation of large software systems [112, 125]. However, there are some, including Shaw, who argue that software engineering has not yet achieved the level of rigor required for this label [127]. This criticism is intended to be constructive, as Shaw argues that software engineering ought to become an engineering discipline and she provides a road map for how this can be achieved.
conventional wisdom these two types of disciplines are very different. Science is concerned with discovering knowledge and engineering is concerned with applying this knowledge. However, Bruce I. Blum argues that the two are really “mirror-image twins” that make progress using the same underlying processes [25]. While Blum’s view is unusual, his arguments provide a good basis for arguing that Kuhn’s structure of scientific revolutions should also apply to engineering. Blum based his argument on a synthesis of ideas from historians, philosophers, and sociologists who have written about the nature of science and technology. In order to apply Kuhn’s ideas to engineering, there need to be strong similarities between the two disciplines in the kinds of technical facts that are generated and the social processes at work in the communities. Blum observed that the skills and processes involved in discovering new knowledge and in discovering how to apply this knowledge are the same. He wrote, “There is no division between the two categories of knowledge, and both disciplines engage in design. For science, it is design for discovery, and for technology it is discovery for design. Mirror-image twins” [25] (p. 61). Walter G. Vincenti identifies seven knowledge-generating activities in engineering and Blum contends that there are analogues for all of them in science [159] in [25]. (These activities are i) transfer from science, ii) invention, iii) theoretical engineering research, iv) experimental engineering research, v) design practice, vi) product, and vii) direct trial.) This symmetry further illustrates the intellectual similarity between the two disciplines. Blum wrote, “It shows a continuum of knowledge that is both created and used by these twins. Notice that replacing the terms scientists and engineers with, say, physicists and chemists does not affect the diagram’s validity” [25] (p. 62). Since both disciplines are involved in discovery, their respective communities have similar underlying social processes. Blum cited a description from Edwin T. Layton Jr. on the social structures in the respective disciplines: “While the two communities [in the 19th century] shared many of the same values, they reverse their rank order. In the physical sciences the highest prestige went to the most abstract and general—that is to the mathematical theorists… In the
technology community the successful designer or builder ranked highest, with the ‘mere’ theorist the lowest. These differences are inherent in the ends pursued by the two communities: scientists seek to know, technologists to do. These two values influence not only the status of occupational specialists, but the nature of the work done and the ‘language’ in which the work is expressed” [87] (p. 576) cited in [25] (p. 57). This description further reinforced Blum’s assertion that science and engineering are mirror-image twins. Both science and engineering occur within a social environment and are evaluated by peers and the broader cultural milieu. In summary, Kuhn’s theory can be applied safely to software engineering, because scientific and engineering communities function in much the same way. In other words, since the underlying processes hold, the structure of scientific revolutions can be generalised to engineering without compromising the original ideas.
3.3 Benchmarks as Paradigmatic: Processes for Progress
In this section, I take Kuhn’s structure of scientific revolutions and use it as the foundation
for the theory of benchmarking. At the heart of Kuhn’s theory is the scientific paradigm. Scientific progress occurs through the incorporation of knowledge and rules into the paradigm through a process of community consensus. Standardised benchmarks are constructed through the same processes—incorporation of technical facts through collective agreement. Benchmarks operationalise paradigms. The term operationalise is taken from research design and describes the process by which concepts in a theory are turned into operational units and relationships to be manipulated or measured in an experiment. A benchmark takes an abstract concept (a paradigm) and makes it more concrete, so that it can serve as a guide for action. In other words, it reifies the paradigm. The emergence of a benchmark depends on two factors equally: the state of knowledge and the level of consensus. Both must progress together for a standard benchmark to emerge. For example, a discipline may have the research results necessary to design a good benchmark, but lack the agreement that such an effort ought to be undertaken. The tight relationship between knowledge and consensus means that a benchmark is indicative of the maturity of the paradigm.
Its existence means that there is value placed on validation of research results. Agreement on a Motivating Comparison is indicative of agreement that the group ought to be working towards a particular goal. Agreement on the Task Sample is indicative of agreement that the group ought to be working on particular problems. Finally, agreement on Performance Measures is indicative of agreement on how success should be measured. Paradigms emerge through the same processes. Scientific results are generated by researchers and they become incorporated into the paradigm by consensus. During pre-science, a discipline makes progress by establishing a paradigm. During normal science, a discipline makes progress by adding facts to the paradigm. Benchmarks make aspects of this amorphous and tacit maturation more explicit, and therefore easier to understand and accelerate.

3.3.1 Consensus
Consensus is an important component in scientific progress and deserves additional attention within the context of benchmarking. Consensus is the collective opinion or general agreement in a group. This agreement can be reached through either collaboration or competition. With collaboration, the agreement is obvious. People have agreed to work together collectively on a problem. They have resolved their differences sufficiently to work towards a common goal. Consensus can also be found in competition. Even when groups are in conflict and competing with each other, there is consensus. The participants agree to some degree on the competition and how a winner is declared. According to Dahrendorf, even in times of international conflict, there is consensus on the rules of conduct, for example, the Geneva Convention on the treatment of prisoners of war [42]. It is also possible to have too much of either collaboration or competition, or both. If a group collaborates too tightly, the result may be reduced innovation leading to slower progress. This loss of innovation can occur through inbreeding of ideas or through peer pressure causing members of the community to conform. On the other hand, a group that is too competitive may also suffer from reduced innovation. There may be a reluctance to share knowledge and techniques among members of the community. Some researchers participating in a benchmark evaluation may lose sight of the larger goal, that of producing high quality research, and instead put time and resources into winning.
Since consensus can be achieved through both collaboration and competition, they both have a place in benchmarking. Collaboration is needed to bring out commonalities. Competition is needed to bring out differences, thereby motivating people to distinguish themselves through excellence.

3.3.2 Impact of Benchmarking
A successful benchmark has an impact on the discipline. Authors quoted in the introductory chapter used words such as “blossom” and “magically” to describe the positive effect that benchmarking has on a research area [118, 153]. Benchmarks raise the quality of technical results through validation and they increase the cohesiveness of the community through collaboration. Unfortunately, benchmarking also has negative effects. Each design decision that selects one option by necessity closes off others. Often these decisions are made with best guesses and imperfect information, due to the nature of scientific research. Committing to a benchmark too early can slow later progress because it is too expensive and difficult to develop a new one. Some researchers may optimise their tools and techniques to perform well on a benchmark at the expense of general performance. In such cases, maintenance and evolution of the benchmark is necessary to prevent overfitting. Many of the negative traits found in benchmarks have been inherited from paradigms. By their nature, paradigms are exclusive; rightly or wrongly, they define the proper conduct of science for a discipline [68]. The impact of benchmarking is easy to identify when it is manifested within a scientific community, but hard to define and even harder to measure. This problem is part phenomenological and part methodological. It is easy to count the number of publications that use or refer to a benchmark, but it is hard to measure the influence of a benchmark on ideas. To measure this impact properly requires the application of survey techniques, i.e. questionnaires and interviews, from social science, where there is a good understanding of how to gauge attitudes and opinions. The development of a benchmark is influenced by the same results, trends, and people, as the work that follows. In other words, the benchmark and subsequent work are all part of the same stream and all are affected by upstream events. Consequently, it is difficult to partition the influence between the benchmark and previous work.
One of the implications of this theory is that benchmarking can be used to cause a discipline to move forward. Previously, scientific communities would create benchmarks to validate research and then marvel at the ensuing improvements. Instead, a benchmark can be used to increase the maturity of a discipline. This inference will be explored in greater detail in Part III of this dissertation.

3.3.3 Role of Benchmarking in Stages of Revolution
Benchmarking functions differently in the various stages of scientific revolution. An immature discipline, one that is in the pre-science stage, does not have a paradigm. By extension, it cannot have a benchmark. A discipline that has an accepted benchmark that is widely used is in the normal science stage. However, not all disciplines in the normal science stage have a benchmark. A discipline in the crisis stage will abandon and criticise a previously popular benchmark. During the revolution stage, a previously flawed benchmark may be repaired and restored to wide acceptance. In contrast to Kuhn’s stages, which account for how a discipline moves from one paradigm to another, a lifecycle describes the rise and fall of a single benchmark. It is important to note that a paradigm may have more than one benchmark. There may be multiple problems of interest or shifts in understanding without a significant change in the underlying belief system. Consequently, a benchmark may retire or fall out of favour in a community without a change in the underlying paradigm. As a result, the lifecycle of a benchmark has an additional stage, retirement, that does not appear in a scientific revolution. The status of a benchmark during the different stages of its lifecycle can be characterised by two parameters: the number of participants and the rate of change in the benchmark. The former reflects the cohesiveness of the community and the latter reflects the technical knowledge of the field. These parameters are defined as follows.
• Number of participants—a count of the number of people involved in developing and using the benchmark. This figure is further broken down into the number of contributors and the number of users. The number of participants in the benchmark is an indicator of community cohesiveness, i.e. the number of people who can work together collaboratively on a project.
• Rate of Change—the number of modifications to the benchmark. This number is further divided into major and minor changes. Minor changes are defect repairs, typographic corrections, and other small modifications that do not change the substance of the benchmark. Major changes are significant changes to the nature of the benchmark components. These figures are an indicator of the level of controversy surrounding the benchmark and, in turn, its scientific soundness.
Table 3-1 below shows how these parameters vary over the different stages in the lifecycle of a benchmark. These stages will be described in detail in the remainder of this section.
Parameter             Pre-science    Normal Science   Crisis   Revolution   Retirement
Participants
  # of contributors   low            high             low      changing     low to none
  # of users          none to low    high             low      changing     low to none
Rate of Change
  # major changes     high           low              high     low          none
  # minor changes     high           high             low      low          low to none
Table 3-1: Changes Over Stages of Revolution
3.3.3.1 Pre-Science
The pre-science stage is a time of change and disagreement, usually because the discipline has recently split from another research area [59]. The discipline does not have a paradigm nor a widely-accepted benchmark. Consensus within the community and scientific results move in lock-step to develop the paradigm and benchmark. Work on the benchmark can also help propel the discipline into the normal science stage. When work does occur, it is led by a small number of participants and is highly controversial. Participants may begin by working on any of the benchmark components simultaneously or individually. They may even make significant progress on some of the components, but it is not possible to create a full-fledged benchmark during this stage because there is no paradigm. Software evolution is an example of a discipline in the pre-science stage that is pursuing a benchmark. It is difficult to assess whether software evolution is a sub-area of software maintenance or the same discipline, but with a name change to reflect new understanding of the 39
problem. Software evolution appears regularly as a technical session at conferences on software maintenance (International Conference on Software Maintenance—ICSM, and European Conference on Software Maintenance and Reengineering—CSMR), and has recently started an annual workshop (International Workshop on Principles of Software Evolution—IWPSE). Over the last couple of years, key people in the field have published papers [45, 94] and proposed workshops to draw the community’s attention to the need for a benchmark. This work is motivated by a desire to better understand research results by comparing them using common Task Samples. This discipline does not yet have a well-defined paradigm, so it will be difficult to complete a benchmark during this pre-science stage. Fortunately, work on a benchmark will help accelerate the maturation process. 3.3.3.2
Normal Science
The normal science stage begins once a scientific paradigm is established, and research during this stage is characterised as "puzzle-solving". One of these puzzles could easily be a benchmark. At this point the discipline is sufficiently defined for a benchmark to be developed, although a benchmark is not needed for a discipline to enter the normal science stage. The rate of change for a benchmark in the normal science stage is relatively slow. After being established, there may be many minor changes or corrections, but few major changes. Even anomalies in the paradigm will be accommodated by minor changes only. As the discipline becomes more established, more participants will become involved with developing and using the benchmark. A successful benchmark will have a high participation rate, which can be achieved by having a large proportion of the community involved with the benchmark or by having an influential minority. The development process for a benchmark within the normal science stage will be discussed in Section 5.1. An example of normal science maturation for a benchmark can be found in the data compression community. This area researches algorithms for compressing data files, such as text, images, video, and audio, and considers other operations on compressed files, such as searching and browsing. For many years, the Calgary Corpus was used as a benchmark for lossless compression algorithms. It consists of eighteen files, including English texts, a bibliography, source code, a transcript of a terminal session, geophysical data, executable programs, and a bitmap image. The Performance Measures were compression ratio, compression time, and
decompression time [19]. It was in use for over ten years before being updated by the Canterbury Corpus [14, 115]. It is similar to the Calgary Corpus, but with changes in the details to reflect trends in typical files. This corpus contains eleven files, including some HTML (Hypertext Mark-up Language) files, a Microsoft Excel spreadsheet, and UNIX man pages. Progress in lossless compression algorithms is incremental. There are already many good algorithms, such as Huffman coding, multi-order character PPM, and block-sorting, and it would be difficult to achieve significant improvements [14, 115]. By extension, the Canterbury Corpus is an incremental revision of the Calgary Corpus.
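To make these Performance Measures concrete, the sketch below shows one plausible way to collect compression ratio, compression time, and decompression time for a single corpus file. It is not taken from the Calgary or Canterbury documentation; zlib stands in for whatever lossless compressor is under test, and the file names in the comment are hypothetical.

```python
# Illustrative sketch only: measures the three Performance Measures named above
# (compression ratio, compression time, decompression time) for a single file,
# using zlib as a stand-in for any lossless compressor under test.
import time
import zlib

def measure_file(path: str) -> dict:
    with open(path, "rb") as f:
        original = f.read()

    start = time.perf_counter()
    compressed = zlib.compress(original, 9)
    compress_time = time.perf_counter() - start

    start = time.perf_counter()
    restored = zlib.decompress(compressed)
    decompress_time = time.perf_counter() - start

    assert restored == original  # lossless compression must round-trip exactly
    return {
        "compression_ratio": len(original) / len(compressed),
        "compression_time_s": compress_time,
        "decompression_time_s": decompress_time,
    }

# Example: aggregate over a corpus directory (file names here are hypothetical).
# for name in ["book1", "paper1", "progc"]:
#     print(name, measure_file(name))
```

Running such a script over every file in a corpus and summarising the ratios is one plausible way to arrive at a single comparison figure, although the benchmark's own documentation defines the exact aggregation rules.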
3.3.3.3 Crisis
The crisis stage is marked by a dissatisfaction with the scientific paradigm and this dissatisfaction is transferred over to benchmarks as well. This stage is a highly controversial time for the discipline with many new paradigms proposed as alternatives to the previous one. There are divisions in the community between advocates of the various paradigms. There is a reluctance to build on the artifacts of the old paradigm, including benchmarks. Work that is done, if any, tends to be critical of the benchmark and will suggest major changes. Entirely new benchmarks associated with one of the new paradigms may be put forth. 3.3.3.4
Revolution
During the revolution stage, one of the competing paradigms from the crisis stage becomes accepted by the community. In general, a benchmark that is associated with a discarded paradigm does not survive the revolution, because it is no longer representative of the ethos and scientific interests of the discipline. As the revolution takes root and the discipline returns to the normal science stage, new benchmarks will be developed. There will be little additional work on the old benchmark, although it may be used as a historical curiosity. 3.3.3.5
Retirement
A benchmark moves into retirement when the community is no longer interested in using it. This can occur due to a change in the technology or a change in the community. For instance,
punch cards and vacuum tubes have become obsolete, so standardised evaluations for them are not needed. The community may also change and move on to other problems, with or without solving the one exemplified by the benchmark. The TREC ad hoc retrieval task is an example of such a retirement. As mentioned in the case history in Section 2.5, the ad hoc task was used for eight years (1992-1999) before being discontinued. By the fifth year, performance on the problem had levelled off and researchers were no longer making improvements to their tools. 3.4
Mechanisms for Progress
Over the following three sections, I will describe three mechanisms by which benchmarking
results in scientific progress. A mechanism is a manifestation of a process; it is a specialisation of abstract principles into a set of pressures and consequences in a particular context. These mechanisms are derived from the relationship between benchmarks and paradigms. The theory of benchmarking inherits a number of mechanisms from Kuhn's structure of scientific revolutions, but only three will be discussed here.
1. Benchmarking acts as a role model for an open and public science.
2. Benchmarking fosters cross-fertilisation of ideas.
3. Benchmarks are flexible, so they can be used for a variety of purposes, often simultaneously.
All of these mechanisms involve technological and sociological factors working in synergy with each other. Each mechanism is motivated by a technical issue and is resolved in a particular way due to the social context created by benchmarking.
3.5 An Open Model of Science
As concrete problem solutions, benchmarks make statements about the valid scientific
problems and solutions in the area and the implicit rules for conducting research. The presence of a benchmark states that the community believes that contributions ought to be evaluated against clearly defined standards. The benchmark itself promotes the conduct of research that is collaborative, open and public. Evaluations carried out using benchmarks are, by their nature, open and public. The materials are available for general use, and often so is the technology being tested. It is difficult 42
to hide the flaws of a tool or technique, or to aggrandise its strengths when everyone involved has a good understanding of how the tests were applied. Moreover, anyone could use the benchmark with the same tools or techniques, and attempt to replicate the results. Collaboration, openness, and publicness, result in frank and elaborate, technical interactions among researchers. This kind of public evaluation contrasts with the descriptions of tools and techniques that are currently found in conference or journal publications. A well-written paper is expected to show that the work is a novel and worthy contribution to the field. While this is an entirely reasonable expectation, it does not permit sharing of gritty implementation details and other advice on other practical problems. Susan Leigh Star and others have studied “the transmogrification and deletion of uncertainties from lab work to published reports” [142] (p. 392). Details from daily laboratory work are glossed over to achieve the level of assertiveness needed in a publication, as she wrote, “The published data reveal rather than hint; articles state, rather than guess at; subjects line up and are counted— by the time they get to the journals, they don’t run away and hide behind the lab equipment or try sabotaging the experiments” [142] (p. 392). These nuts-and-bolts details are invaluable for others tackling the same problems or using similar equipment. Star and others argue that solving these small, mundane problems form a significant part of scientific research. Benchmarking workshops and seminars provide a venue for communicating these dirty details of research. Details such as debugging techniques, design decisions, and mistakes, are forced out into the open and shared between groups. Latour and Woolgar performed an ethnographic study of a number of physics laboratories [86]. They found that labs were not able to replicate results from another lab, using only published accounts of the experiment. They needed someone from the original lab to instruct them on the subtleties of the experiment before it could be successfully replicated. This finding is consistent with our experience in reverse engineering. Benchmarking workshops expedite the process of knowledge transfer between laboratories. Wherever the benchmark and results are discussed, such practical, technical details emerge.
3.6 Cross Fertilisation of Ideas
The second reason for the power of the benchmarking process is that it encourages the cross
fertilisation of ideas. According to Michael Mulkay, cross fertilisation is “the interplay of divergent cognitive-normative frameworks” [99] (p. 138). A cognitive normative framework, sometimes called a frame of reference, consists of a person’s knowledge and expectations about the world. A paradigm strongly influences a researcher’s cognitive normative framework. Cross fertilisation occurs when scientists with different cognitive frameworks come together and there is a clash of ideas, resulting in innovation. Mulkay wrote, “Innovation occurs as the divergent cognitive frameworks are merged into some kind of synthesis” [99] (p. 141). Cross fertilisation is an important mechanism for innovation throughout the stages of a science, especially during normal science because it is one of the few ways that new ideas are generated. Two common and effective ways that cross fertilisation occurs are i) interaction between researchers and ii) adoption of a novel role or viewpoint by a researcher. The former occurs when researchers collaborate on the benchmark. Scientists from different laboratories working together often pick up ideas from each other. The latter occurs during deployment of the benchmark because the evaluation causes the scientist to view his or her contribution from a different perspective, typically that of an end user. Both of these mechanisms allow ideas to be spread and critiqued across the community. As Tichy and Reddy both observed, the good ones are adopted quickly and the bad ones drift away quietly [118, 153]. In this section, these two mechanisms for cross fertilisation will be further examined. 3.6.1 From Other Researchers The benchmarking effort typically brings researchers from diverse organisations and sectors (e.g. academic, industrial, government) who would not otherwise work with each other. Mulkay observed that loose social structures more effectively bring about cross-fertilisation. He wrote, “While normal science may be carried on quite effectively within centralised organizations, scientific innovation is facilitated by a fluid social structure which brings divergent frames of reference together” [99] (p. 140). Development of TPC-A™ required nearly 1200 person-days of effort contributed by 35 database vendors and universities [58]. The SPEC CPU2000 development process used “benchathons” to provide committee members an opportunity to work on the benchmark together in a week-long marathon session [64]. 44
Within the loose structure of the benchmarking process, I have observed different drivers of cross fertilisation during development and deployment activities. During development, it is newly-formed agreement that moves the discipline forward. During evaluation, the technical results move the discipline forward. During development the interactions are more collaborative, while during evaluation the interactions are more competitive. When the benchmark is being formed, the participants are working to build consensus on the benchmark components, which is a difficult task because the benchmark represents the paradigm for the discipline. During development, it is the agreement produced by work on the benchmark that advances the discipline. Consensus needs to be established on the benchmark components and other design decisions. Agreement is established through dialog among participants with a grounding in technical results. Cross fertilisation occurs because the participants expose their own ideas and assumptions about the discipline to be critiqued or affirmed by their peers. This is an invigorating and controversial time, filled with discussion and debate. Once the benchmark has been created, the wider community can then use it to evaluate research contributions. During this time, it is the results from the benchmark that advance the discipline. When a tool or technique appears on a list of benchmark results, it raises the visibility of that tool or technique. (As the status of a benchmark grows, so does the importance of appearing on the list of participants, even with a poor result. For example, it becomes a way for a new laboratory to show that it works at the same level as more established sites.) Solutions that rank highly are scrutinised carefully for the reasons behind their success and the good ideas are quickly disseminated throughout the community. The results often have greater impact when they are unveiled and discussed at a conference, as was the case for the xfig structured demonstration, CppETS, and the TREC tasks. This discussion increases the credibility of the results because the participants can conduct a post mortem on the evaluation. It also makes it easier to identify the lessons learned to be fed back into the benchmark. 3.6.2 From Other Roles Cross fertilisation of ideas also occurs when scientists take on different roles, such as practitioner, teacher, user, or patient, which causes them to look at their research from a different viewpoint. In a historical study of nineteenth century medical researchers, Joseph Ben-David 45
found that the most innovative scientists were “role hybrids” who had clinical cases and other contact with practice [21]. These practitioner-scientists were more willing to modify theoretical assumptions in their research because they were constantly being confronted by their practical experience. A benchmark attempts to bring practice into the laboratory by using a Task Sample that is representative of the problems found in actual practice. Additional roles often give the researcher new insights into their tools or techniques because they point out the researchers’ assumptions and biases about the research problem or solution. Researchers and developers often are not aware of their cognitive normative framework, so they build their assumptions and biases into the evaluations that they conduct. Twidale, Randall, and Bentley observed that it is difficult for a developer to design a good evaluation of his or her own tool [155]. They found that despite earlier validation studies with students, evaluations with actual end users yielded new user requirements and shortcomings in the tool. They wrote, “There are inevitably implicit assumptions about the nature and style of use (and about the user, task, etc.) in the design of the tool. In making up test problems, developers are in danger of incorporating the same assumptions. Thus the study will fail to reveal them” [155]. In other words, inbreeding occurs when tool designers also design the evaluations of the tools. Benchmarking overcomes this problem by bringing together researchers with different viewpoints to participate in the design of the evaluation. While individual researchers may be tied to the frame of reference associated with a particular tool, the group that participated in the design of the benchmark will bring many frames of reference to the evaluation. 3.7
Flexibility of Application
The raison d'être for a benchmark is to enable comparisons between alternatives, but this is
often done for different purposes, particularly in research with emerging technologies. For example, one team used the xfig structured demonstration to train a new researcher on their tools, another team used it to market their tools within IBM, while we, the organisers, used the occasion to evaluate program comprehension tools. In other settings, benchmarks have been used to demonstrate features of a new system, to highlight research directions, and to provide evidence for technology transfer.
A benchmark can be used by different researchers to fulfil multiple purposes at the same time. In this sense, benchmarks are boundary objects, and Star defines them in the following way: “Boundary objects are objects that are both plastic enough to adapt to the local needs and constraints of the several parties employing them, yet robust enough to maintain a common identity across sites” [143] (p. 103). Two defining features of a boundary object are plasticity and coherence. Plasticity in this context is the ability “…to adapt to different local circumstances…” [143] (p. 97) and coherence is the capacity to “…incorporate many local circumstances and still retain a recognizable identity” [143] (p. 97). Benchmarks have both these characteristics. The list at the beginning of this section includes the different purposes and goals of the xfig benchmark, thus demonstrating plasticity, and all the participants would agree that they were participating in an evaluation, thus demonstrating coherence. These characteristics make it easy for researchers with different perspectives to participate in the benchmarking process and still fulfil their own technical, intellectual, and organisational commitments. In the remainder of this section, I will describe typical purposes for benchmarking. This list expands on the work by Feather et al. [51]. This discussion will help show that i) benchmarks have the plasticity and coherence to be boundary objects and ii) this flexibility is part of the reason for the power of benchmarking. 3.7.1 Advancing a Single Research Effort A new technology can be run on a benchmark to illustrate its strengths and weaknesses. Even without comparing its results with those of other technologies, developers of a prototype can learn about a tool or technique by applying it to a Task Sample. A commonly-used and widely-understood example is effective for illustrating the key contributions of a new technology, particularly for a research paper. In this context, the benchmark is sometimes called colloquially a guinea pig, lab rat, or case study. A variant is the pathological problem, which motivates new research by highlighting shortcomings in the status quo [51].
3.7.2 Promoting Research Comparison and Understanding When a research community applies many of their tools or techniques to a common benchmark, they can make comparisons between proposed solutions. Feather et al. wrote, “Within their area of concern exemplars have proved to be excellent vehicles for the investigation of approaches. Since different approaches will typically have different sets of strengths and weaknesses, exemplars can be used to suggest how different approaches can perhaps be combined to yield a hybrid of their respective strengths. When an aspect appears problematic to all approaches, this suggests an area worth of attention by the field as a whole” [51] (p. 422). Hence, this comparison can lead to researchers to an improved understanding of the research undertaken by the field, their place within this community, and the direction of the field. At the same time, a benchmark can be used for pedagogical purposes, as a vehicle to educate students about the nature of the problem domain, and the available tools and techniques. Using a benchmark in this manner does not hinge upon having a well-defined Performance Measure and can be accomplished using a proto-benchmark or exemplar. Goals such as promoting understanding are difficult to reduce to a small set of measures. The lack of quantitative metrics can be beneficial because it reduces competitiveness—without final standings, there can be no winner. While it is enlightening to distil the results down to some key dimensions of performance, it can also hinder deeper examination of issues. Consequently, the output from these applications of a benchmark or proto-benchmark can be qualitative descriptions, lessons learned, and experience reports. This application still supports comparison and assessment of fitness for purpose, which is consistent with the intention of benchmarking. 3.7.3 Setting a Baseline for Research A benchmark can be used to move a research community forward by identifying the problems that are considered solved and directing attention to open problems for future research. By including solved problems, the benchmark sets expectations for capabilities in current tools or techniques. Open problems are included to mark the progress made by research. The solved and open problems are identified through community consensus, which also moves forward the research effort.
3.7.4 Providing Evidence to Support Technology Transfer A benchmark can also be used to demonstrate the maturity of a technology and thereby encourage uptake of results by industry. This purpose is most similar to making a purchasing decision, which originally motivated the work with TPC-A™. The aim of the TIPSTER project was to transfer IR technology into military intelligence applications. The Task Sample for such a benchmark must be large enough to demonstrate that the technology can work on commercialgrade problems. The problem should also provide an opportunity to illustrate how the innovation fits with and improves on extant technology. This second requirement implies that the benchmark must work with both the new and old technology, so they can be compared. Using the benchmark in this manner can be used to transfer techniques as well as tools. Some techniques are closely tied with tools, for example, clone detection and configuration management, and these ideas are adopted more slowly because they need to be accompanied by industrial strength tools. In contrast, techniques developed independently of tools, such as code inspections and design patterns, can be moved into industry more easily. Consequently, a benchmark needs to take these differences into consideration and to test the maturity of the technology in appropriate ways. 3.8
Recapitulation of Theory
In this section, the theory of benchmarking will be re-stated as a set of assumptions,
concepts, and inferences. The earlier sections of this chapter were concerned more with providing background material and conveying the richness of the theory and the interactions therein. In contrast, this section strives to be as clear and simple as possible. One shortcoming of this re-statement is that the simplicity obscures the depth and breadth that are necessary in the theory to account for complex social interactions of a community. 3.8.1 Assumptions Assumptions are the foundation upon which a theory is built. Surprisingly, assumptions need not be true for the theory to hold. Recall the truth table for the logical operator for implication.
P     Q     P > Q
T     T     T
T     F     F
F     T     T
F     F     T
Figure 3-2: Truth Table for Implication
In this table, the rows where the antecedent (P) is false still yield a true implication. The assumptions are the antecedent for the theory, which itself is the consequent. The theory of benchmarking is based on one assumption.
Scientific paradigms exist. The theory assumes that scientific paradigms as described by Thomas Kuhn exist and that they accurately describe how scientific disciplines function. One relevant feature is that a scientific discipline is defined both by empirical facts and by social processes.
3.8.2 Concepts and Their Definitions
Concepts are the important entities in the theory and their definitions provide a means for identifying them. Concepts tend to be abstract and need to be represented by a number of variables to make them quantifiable. The first three concepts, scientific discipline, scientific paradigm, and scientific progress, are taken from Kuhn's structure of scientific revolutions. The fourth concept, benchmarking, is defined by the theory itself.
Scientific discipline. A scientific discipline is one that seeks to construct theories about the world and to validate them empirically. A scientific discipline is defined in terms of a community of researchers with similar backgrounds and interests, working on a set of related problems. While Kuhn's theory addressed only scientific disciplines directly, I argued in Section 3.2.1 that the ideas can be extended appropriately to engineering disciplines without compromising their integrity.
Scientific paradigm. A paradigm is the collection of knowledge that is needed to function within a scientific discipline. It includes factual knowledge, such as familiarity with the literature and facility with techniques, and implicit rules for conduct and values. A scientific paradigm
arises through consensus among members of the scientific community. The community in turn consists of researchers who work on related problems, have similar educational training, read the same literature, and attend the same conferences [68, 81]. Scientific progress. Scientific progress is achieved through different processes during each stage within Kuhn’s structure of scientific revolution. During the pre-science stage, the discipline is seeking to establish a paradigm. During normal science, knowledge is added to the existing paradigm by consensus of the community. During crisis and revolution, an old paradigm is being discarded in favour of a new one. In the past, it was believed that science moved objectively and inexorably towards a better understanding of reality. Kuhn has argued convincingly that science is conducted by groups and that these groups are subject to the same social processes that occur elsewhere in society [81]. Benchmarking. Benchmarking is the process by which a scientific community develops a benchmark and deploys it to evaluate their results. The process includes improvements to these innovations in response to benchmark results. This definition of benchmarking builds on the definition of benchmark given in Section 3.1 where it is defined as a set of standard tests for comparing alternatives. A benchmark has three components: i) a Motivating Comparison, ii) a Task Sample, and iii) Performance Measures. An evaluation that is missing one of these components is a proto-benchmark. 3.8.3 Relationships The theory of benchmarking describes two relationships among the concepts. These two relationships form the heart of the theory and they have been described in this chapter and further argued in the next one. Benchmarks operationalise scientific paradigms. The term operationalise is borrowed from research design and describes the process by which concepts in a theory are turned into operational units and relationships to be manipulated or measured in an experiment. This inference was argued in Section 3.3. A benchmark takes an abstract concept, a paradigm, and makes it more concrete, so that it can serve as a guide for action. In other words, it reifies the paradigm. Over its lifetime, a paradigm can have more than one benchmark, but a benchmark can only be associated with a single paradigm (see Section 3.3.3). 51
Benchmarking causes a leap forward within a discipline. Both technical and sociological advances are needed to make scientific progress. A discipline makes progress by increasing consensus and collaboration within the community and through accumulation, codification, and dissemination of technical results. These are all activities in the benchmarking process. Evaluation provides empirical data regarding the relative merits of different innovations. Researchers become more aware of each other's work through collaboration. Consequently, good solutions are adopted more quickly across the community.
3.8.4 Hypothesis
From the theory, a number of hypotheses can be generated about the nature of benchmarking, scientific progress, paradigms, and so on. In this dissertation, one hypothesis will be explored in detail.
Benchmarking can be used to cause a scientific discipline to leap forward. Since a discipline makes progress by generating technical results and increasing consensus, and benchmarking attains these same effects, benchmarking can be used to achieve scientific progress. The various activities involved in designing and deploying a benchmark have benefits such as openness, collaboration, and the production of empirical results. In other words, it is the process of benchmarking that advances a discipline rather than the existence of a set of standardised tests. This hypothesis is concerned with the second relationship discussed in the previous subsection, but in reverse. While benchmarking is often undertaken to improve research, such efforts are usually concerned with the improvement of technical results only. An important contribution of this theory is the recognition of the significance of social processes. In other words, it is a recognition that increasing the level of consensus in a community also results in scientific progress. Moreover, tackling both technical and sociological issues in tandem results in increased maturation of the paradigm.
3.9 Summary
In this chapter, the theory of benchmarking was described in detail. After covering some
basic definitions, it explained the underlying processes that are responsible for the advances in a discipline. Benchmarks are closely tied to the paradigm for a scientific discipline. Their 52
emergence depends on both the accumulation of research results and the consensus of the community. They operationalise the paradigm into a role model for how science ought to be conducted. In particular, they act as a role model for a science that is open and public, and uses empirical validation. These processes are expressed as mechanisms that result in scientific progress. In the context of benchmarking, three such processes were discussed in this chapter. The first is that the adoption of benchmarking by a research community promotes a model of science that is open and public. The second is that interactions during benchmarking encourage cross fertilisation of ideas, which results in innovation. The final mechanism discussed is that the flexibility of benchmarks allows different researchers to participate in the effort, and at the same time meet their own goals, which in turn permits more widespread adoption of the benchmark. This set of characteristics is unique to benchmarking, and it sets this type of evaluation apart from other empirical studies. In addition to the merits of benchmarking as an evaluation, this activity makes a community's paradigm explicit and allows its members to discuss it as a technical artefact rather than a tacit set of beliefs. It is these interactions that cause the discipline to mature as much as the technical results derived from creating and using the benchmark. In the next chapter, the theory of benchmarking will be validated using empirical and analytic approaches.
Chapter 4. Validating the Theory of Benchmarking
A theory can be validated in two ways: empirically and analytically. An empirical validation involves comparing the theory against extant instances, in this case, benchmarks that have been used by scientific communities. This type of validation is best suited for checking the external validity of a theory. An analytical validation argues that the theory is correct, using a prescribed method or procedure such as formal logic. This type of validation is best used to check the internal validity of a theory and is similar to software verification. Both analytical and empirical validation have their place and in this chapter, both types of evaluation will be used. This chapter will begin with two empirical validations. In Section 4.1, the theory is used to interpret the three case histories from Chapter 2. Since these case histories were used in the formulation of the theory, this validation will assess how well it postdicts the data. The theory will then be compared in Section 4.2 against novel benchmarks, i.e. ones that were not used in its development. This validation will assess how well the theory makes predictions. Having established its external validity, we then turn to internal validity. Section 4.3 argues in prose that the theory of benchmarking is sound. A final analytic validation is performed in Section 4.4, where the theory of benchmarking will be compared to a rival theory using a hierarchy of criteria.
4.1 Validation Using Contributing Benchmarks
In this section, I will re-visit the case histories presented in Chapter 2 and interpret the
events using the theory of benchmarking. The three benchmarks are TPC-A™ for DBMSs [58], SPEC CPU2000 for computer systems [64], and the TREC ad hoc Retrieval Task for IR systems [60, 162]. At a minimum, the theory should postdict these benchmarks well because they were used in its development. I will examine the components of these benchmarks, the development process used, and the impact they have had on their communities.
4.1.1 Components of Contributing Benchmarks
In this section, I will identify the Motivating Comparison, Task Sample, and Performance Measures for each of the benchmarks and I will discuss how well they fit the definition.
4.1.1.1 TPC-A™ Components
Motivating Comparison. The purpose of the TPC-A™ is to rate DBMS on cost per transactions per second ($K/tps) within a class of update-intensive environments. These environments typically have multiple online terminal sessions, significant disk input/output, moderate system and application execution time, and require transaction integrity. Task Sample. The TPC-A™ Standard Specification is very detailed and spans approximately forty pages with eleven clauses. The application environment for the benchmark is a hypothetical bank. Clause 1.1.1 of the standard states: “…The bank has one or more branches. Each branch has multiple tellers. The bank has many customers, each with an account. The database represents the cash position of each entity (branch, teller, and account) and a history of recent transactions run by the bank. A transaction represents the work done when a customer makes a deposit or a withdrawal against his account. The transaction is performed by a teller at some branch” [58] (p. 43). The document goes on to specify the data transmitted in a transaction, transaction system properties, database design, scaling rules, test parameters, and so on. Performance Measures. The two key metrics in the benchmark are cost and tps. Cost is defined as the expenditures required to purchase and maintain all hardware and software over five years, excluding those used for the network. The rate of tps is measured using tpsA-Wide for systems that use a wide area network (WAN) and/or tpsA-Local for systems that use a local area network (LAN). This figure is based on a response time of less than 2 seconds for 90% of transactions measured at the test driver. A transaction is a deposit or a withdrawal made by a customer and includes the work to retrieve the new account balance.
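The bank described in Clause 1.1.1 implies a small schema and a short read-write transaction. The following sketch is not drawn from the TPC-A™ Standard Specification; it is a simplified illustration of that transaction profile with hypothetical table and column names, using SQLite as a stand-in for a commercial DBMS, and it omits the scaling rules, terminal emulation, and ACID tests that the real benchmark mandates.

```python
# Simplified illustration of a TPC-A-style debit/credit transaction.
# Table and column names are hypothetical; the real specification also
# defines database sizing, response-time constraints, and test parameters.
import sqlite3

SCHEMA = """
CREATE TABLE branches (branch_id INTEGER PRIMARY KEY, balance INTEGER);
CREATE TABLE tellers  (teller_id INTEGER PRIMARY KEY, branch_id INTEGER, balance INTEGER);
CREATE TABLE accounts (account_id INTEGER PRIMARY KEY, branch_id INTEGER, balance INTEGER);
CREATE TABLE history  (branch_id INTEGER, teller_id INTEGER, account_id INTEGER, amount INTEGER);
"""

def run_transaction(conn, branch_id, teller_id, account_id, delta):
    """Apply one deposit (positive delta) or withdrawal (negative delta)."""
    cur = conn.cursor()
    cur.execute("UPDATE accounts SET balance = balance + ? WHERE account_id = ?", (delta, account_id))
    cur.execute("UPDATE tellers SET balance = balance + ? WHERE teller_id = ?", (delta, teller_id))
    cur.execute("UPDATE branches SET balance = balance + ? WHERE branch_id = ?", (delta, branch_id))
    cur.execute("INSERT INTO history (branch_id, teller_id, account_id, amount) VALUES (?, ?, ?, ?)",
                (branch_id, teller_id, account_id, delta))
    # Returning the new balance counts as part of the transaction's work.
    new_balance = cur.execute("SELECT balance FROM accounts WHERE account_id = ?",
                              (account_id,)).fetchone()[0]
    conn.commit()
    return new_balance

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO branches VALUES (1, 0)")
    conn.execute("INSERT INTO tellers VALUES (1, 1, 0)")
    conn.execute("INSERT INTO accounts VALUES (1, 1, 1000)")
    print(run_transaction(conn, branch_id=1, teller_id=1, account_id=1, delta=-100))  # prints 900
```

Measuring tps would then amount to driving many such transactions concurrently from emulated terminals and checking that 90% complete within two seconds; that driver harness is omitted here.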
4.1.1.2 SPEC CPU2000
Motivating Comparison. The purpose of SPEC CPU2000 is to provide a suite of applications to simulate realistic workloads on various system configurations (i.e. CPU, cache, memory, and operating system). The suite includes a test harness for running the programs and collecting performance data. Task Sample. The Task Sample for CPU2000 is a suite of twelve integer applications and fourteen floating-point applications. One of the integer applications is written in C++ and the remainder in C. These include the GNU C compiler, the GNUzip compression program, a chess game, a word processor, and two placement and routing programs. The floating-point applications are written in Fortran (77 and 90) and C. These include a number of scientific computation programs, mathematical solvers, and simulations. Performance Measures. Performance is the amount of time required to complete a task in the benchmark relative to a baseline system. The baseline system is given a score of 100 and the results from other system configurations are scaled to this figure. While speed is the primary metric used, statistics are also collected on memory usage and number of cache misses.
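The scoring rule described above can be illustrated with a few lines of arithmetic. The reference and measured times below are invented, and the use of a geometric mean to combine the per-program ratios is an assumption made for this sketch rather than a detail taken from the SPEC documentation.

```python
# Illustrative scoring sketch: each program's ratio is reference_time / measured_time,
# scaled so that the reference (baseline) machine scores 100; the per-program ratios
# are then aggregated, here with an assumed geometric mean.
from math import prod

reference_times = {"compiler": 1400.0, "chess": 1100.0, "solver": 2100.0}  # hypothetical seconds
measured_times  = {"compiler":  700.0, "chess":  880.0, "solver": 1500.0}  # system under test

ratios = {name: 100.0 * reference_times[name] / measured_times[name] for name in reference_times}
overall = prod(ratios.values()) ** (1.0 / len(ratios))  # geometric mean of the scaled ratios

print(ratios)          # {'compiler': 200.0, 'chess': 125.0, 'solver': 140.0}
print(round(overall))  # single figure of merit for this configuration
```

The point of normalising against a baseline is that a single figure can be quoted for a whole configuration, while the per-program ratios remain available for diagnosing where a system does unusually well or badly.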
4.1.1.3 TREC Ad Hoc Retrieval Task
Motivating Comparison. The purpose of the ad hoc retrieval task is to evaluate the effectiveness of an information retrieval system as used by a dedicated, experienced searcher to perform ad hoc searches on archived data. Task Sample. The Task Sample consists of a text corpus and a collection of topics to be searched. The text corpus consists primarily of news items from sources such as the Wall Street Journal and the Associated Press. In 1992, there were 740 000 documents and by 1998 this number had grown to more than twice this size. The queries could be generated automatically by a tool, by hand, or some combination of the two. The following is topic #58 used at TREC-1 in 1992: “A relevant document will either report an impending rail strike, describing the conditions which may lead to a strike, or will provide an update on an ongoing strike. To be relevant, the document will identify the location of the strike or potential strike. For an impending strike, the document will report the status of 57
negotiations, contract talks, etc., to enable assessment of the probability of a strike. For an ongoing strike, the document will report the length of the strike to the current date and the status of negotiations or mediation" [60] (p. 9). Performance Measures. The TREC tasks use two Performance Measures, precision and recall. Precision is the proportion of retrieved documents that are relevant. In Figure 4-1 below, precision is the number of documents in the cross-hatched area divided by the number of documents retrieved. Recall is the proportion of relevant documents that are retrieved. In the diagram, recall is the number of documents in the cross-hatched area divided by the number of documents in the relevant box. Relevance is judged by human assessors who review the pool of documents retrieved by all the tools for a given topic.
[Figure 4-1 depicts the corpus as two overlapping sets, relevant and retrieved, whose cross-hatched intersection defines the two measures:]
precision = |relevant ∩ retrieved| / |retrieved|
recall = |relevant ∩ retrieved| / |relevant|
Figure 4-1: Precision and Recall Measures
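As a concrete check of these two definitions, the short sketch below computes precision and recall for a single topic from two sets of document identifiers; the identifiers are invented for illustration and do not come from the TREC collections.

```python
# Precision and recall for one topic, computed from sets of document identifiers.
def precision_recall(retrieved: set, relevant: set) -> tuple:
    hits = retrieved & relevant                     # the cross-hatched intersection
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical judgement pool for one topic.
relevant_docs = {"WSJ870324-0001", "AP880212-0047", "WSJ880520-0013", "AP890101-0002"}
retrieved_docs = {"WSJ870324-0001", "AP880212-0047", "AP900315-0099"}

p, r = precision_recall(retrieved_docs, relevant_docs)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=0.50
```

In an actual evaluation these per-topic figures are then aggregated over all topics in the Task Sample; that aggregation step is omitted here.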
4.1.1.4 Discussion
For all three benchmarks, the components were easily identified. The Motivating Comparison captures an essential problem in each of the research areas— transaction-intensive databases in a commercial setting, the efficiency of different system configurations, and text searching by an experienced searcher. The benchmarks provided a level playing field for making comparisons between systems. In each case, the Task Sample was designed to be as realistic as possible in a constrained setting. TPC-A™ attempts to emulate a hypothetical bank, complete with scaling rules for varying the size of the institution. CPU2000 consists of popular computer
applications. The TREC Ad Hoc Retrieval Task uses a large corpus of news articles and queries based on current events. Since each of the components can be easily identified in the three benchmarks, the definition fits and the rest of the theory can now be considered. 4.1.2 Impact of the Contributing Benchmarks According to the theory, a benchmark advances a scientific discipline because it improves technical results and it increases the cohesiveness of the community. These benefits are accrued as part of the process of developing and deploying the benchmark. For each of the case histories, I will argue that this was the case. 4.1.2.1
TPC-A™
Before TPC-A™, there was no agreed upon method for measuring performance of DBMSs. Consequently, there was a morass of unsubstantiated claims from vendors about the performance of their DBMSs. TPC-A™ gave the database community a standard method of measuring cost and tps, which improved research and development because it provided a means for comparing the performance of various systems. The creation of the TPC and the development of TPC-A™ arose through both collaboration and the contribution of research results. Jim Gray spearheaded the initial work, but he was quickly joined by other members of the community, and they produced the famous paper by Anon et al. Other individuals and organisations joined the effort which led to the formation of TPC [58]. In addition to this collaboration, this paper is noteworthy for its technical results: valuable insights into the design of benchmarks for DBMSs. This increased consensus in the community was carried over to the adoption of TPC benchmarks. Within a few months of the release of TPC-A™, virtually all major database vendors had released tpsA-wide or tpsA-local results for their products. It takes a non-trivial amount of effort to implement TPC-A™ for a database, which makes this high rate of acceptance even more remarkable. A number of vendors made modifications to their DBMS products to improve performance [90]. Some of these adaptations were positive, but others were not. Charles Levine, Jim Gray, Steve Kiss, and Walt Kohler wrote: 59
“…Some vendors eliminated bottlenecks, and so got huge improvements. Many of these improvements are generic, (e.g. write-ahead log, group commit, record locking, etc. so that conventional applications have benefited from them. But today, most systems have fixed their ‘general’ TPC-A problems and there is an alarming trend to special features to get the next leg up in the TPC-A race” [90] (p. 8-9) An example of a specialisation is discrete transactions in Oracle version 7.0. This feature was designed to improve the performance of short, non-distributed transactions under very restricted circumstances. The list of restrictions is long and includes constraints such as the database block can only be modified once per discrete transaction, and the discrete transaction cannot see the new value of the data after modifying it. These restrictions describe a very narrow set of transactions—ones that are typical of TPC-A™. The addition of discrete transactions to Oracle is entirely legal within the rules of TPC-A™, and improves their performance on the benchmark. Unfortunately, users will be hard pressed to take advantage of discrete transactions and will be unlikely to see the same performance improvements in their database applications. To this day, benchmarking has a prominent role in database research. TPC is an active organisation with a stable of benchmarks. In addition, TPC-A™has been superseded by TPCC™ [10]. Across the community, there are efforts to propose new benchmarks and to refine existing ones. There are many benchmarks available for a variety of problem domains. The emergence of a new research area, for instance XML databases, is accompanied by work on corresponding benchmarks. 4.1.2.2
SPEC CPU2000
When a number of hardware vendors realised that they were separately working towards the creation of realistic, macroscopic benchmarks for computer systems, they came together and formed SPEC to reduce overlap and share resources. According to legend, this coming together occurred in a bar in Campbell, California circa 1987 [108]. In the years since, SPEC has produced a series of benchmarks for computer systems, SPEC89, CPU92, CPU95, and CPU2000. With each successive benchmark, there have been technical advances in the benchmarks and greater collaboration among the community. The most significant 60
improvements have come as a result of criticism from the wider community. While this group includes more than scientific researchers, it works at the leading edge of technology, so the theory of benchmarking is applicable here as well. Virtually every aspect of the CPU benchmarks has been criticised, including the Motivating Comparison, the Task Sample, the Performance Measures, the test harness used for the suite, and the development process [1]. Despite this discord, the work by SPEC has advanced the discipline by putting out a widely-used, widely-cited standard [8]. The benchmark put in place a baseline or statement that represented the discipline's current understanding of the problem. These criticisms were an indicator more of the range of viewpoints present in the community and the influence of the benchmark, than of a fundamental flaw in the design of the benchmark. Because the benchmark acted as a lightning rod for criticism, improvements came quickly. Changes were made to the Task Sample and Performance Measures, but the Motivating Comparison has remained essentially the same [140]. One frequent criticism of the benchmark development process is that the process was not sufficiently inclusive, that is, researchers and practitioners who were not members of SPEC felt that they did not have enough input into the final product. Consequently, the CPU2000 and CPU2004 processes have provided opportunities for the general public and academics to provide input [49, 164]. SPEC has also attempted to reach out to the university community by reducing the cost of membership and making their products available to them for free. These efforts to increase inclusiveness expand on the collaboration that occurs within the benchmark committees. These groups have always worked cooperatively; they regularly hold week-long benchathons when the committee meets in one location to work on the benchmarks [64]. The SPEC CPU benchmarks illustrate that the process of benchmarking, as much as the benchmark itself, has a positive influence on the community. One could even argue that the positive outcomes occurred because there were shortcomings in the benchmark [1]. These benchmarks continue a tradition of benchmarking and research into performance in the computer systems discipline. Other benchmarks have been put forth by the computing press and technical magazines, but the SPEC benchmarks still hold an important place in the discipline [8]. Evidence of the impact of the SPEC benchmarks can be seen in the most recent version of an essential text in the area, Computer Architecture: A Quantitative Approach, Third Edition, by
John L. Hennessy and David A. Patterson [63]. The majority of the introductory chapter is spent discussing basic issues in benchmark design. Many of the ideas are illustrated using results from various versions of the SPEC CPU benchmark. The largest subsection in the chapter, Section 1.9 “Fallacies and Pitfalls”, is entirely concerned with lessons learned from developing and deploying the SPEC benchmarks. This section includes a discussion of how to keep benchmarks up to date in response to changing technology. As with TPC-A™, some of these changes were general improvements and others were to increase performance on the benchmark. Systems are allowed to submit two sets of performance figures: a baseline set, where the same compiler flags are used to compile the programs in all the tests; and a tuned or optimised set, where test-specific flags are used. Vendors have often added flags solely for the purpose of improving their results. In some cases the difference between the two sets of numbers can be substantial. For example, on SPEC CPU2000 the overall performance of the AlphaServer DS20E Model 6/667 is 1.12 times higher on the tuned numbers than the baseline. On the bright side, these special features eventually translate into improved overall performance, even for the end user. Hennessy and Patterson wrote “As compiler technology improves, a system tends to achieve closer to peak performance using the base flags. Similarly, as the benchmarks improve in quality, they become less susceptible to highly application-specific optimizations. Thus, the gap between peak and base, which in early times was 20%, has narrowed” [63] (p. 59). Thus, improvements in benchmark design result in enhancements in systems and compilers, which in turn result in improved products for users. 4.1.2.3
TREC Ad Hoc Retrieval Task
TREC was created as part of TIPSTER, a large, American, multi-agency, multi-contractor project “to advance the state of the art in text processing technologies through the cooperation of researchers and developers in government, industry and academia” [105]. TREC-1 was cosponsored by NIST and DARPA. The conference was held to improve evaluation methods for IR systems by creating a large, realistically-sized test collection that groups across the community would use to compare results. In other words, they were attempting to standardise a Task 62
Sample. This pattern of cooperating to solve technical issues continues through the subsequent decade of TRECs. In addition to the contractors from the TIPSTER project, many others from academia, industry, and government have participated in TRECs. The conferences organisers from NIST (Donna Harman, who was later joined by Ellen Voorhees) championed the evaluation, but it was a collaborative effort. Participation by creators of IR systems was key because they helped shape the design of the evaluation. The developers evaluated their own tools using the collection and helped to provide relevance judgements that were used to calculate the Performance Measures. The importance of participation is evident in the steadily growing number of participants; there were 25 groups at TREC-1 and 41 at TREC-8 in 2000 [60, 162]. This increased collaboration was accompanied by an improvement in technical results. Over the first four years of TREC, recall was improved from about 30% to as high as 75% [105]. Through the collaboration, the developers became aware of each other’s approaches and tools, so that when the evaluation identified the more promising algorithms, they quickly spread through the community. After TREC-8 in 2000, the ad hoc task was discontinued, so the organisers could channel their efforts into other evaluations. They had achieved their goal of creating a large standard text corpus, topics to be queried, and relevance judgements. They had also achieved the goal of improving the state of the art in IR tools, but the advances that were sought had stagnated over the last few years. It seems that both the task and TREC were starting points for further research and collaborative evaluations [162]. 4.1.2.4
Discussion
In each of the case histories reviewed in this section, benchmarking caused an improvement in technical results and increased consensus within a scientific community. By advancing both factors in tandem, a leap forward in the state of the art was made. The formation of a collaboration was a seminal event in the case histories. For TPC-A™, it was the paper by Anon et al., which led to Jim Gray being identified as a champion of a standardised benchmark. For SPEC CPU, it was the meeting of minds in a bar. TREC arose out of a large collaborative project co-sponsored by NIST, DARPA, and others. 63
This seed group of collaborators put forward a starting point for the benchmark, which was then refined through use and feedback by the wider community. In no case did an individual or a small group of individuals put forth a benchmark that was subsequently accepted wholesale by the community. The benchmark was often refined many times before arriving at a final version. The Anon et al. paper was published in 1984, but the TPC was not formed until 1987 and TPC-A™ was released two years later. SPEC CPU had a shorter timeline; the infamous bar meeting happened in the second half of the eighties, SPEC was formed in 1988, and the CPU benchmark was released the following year. The ad hoc task was introduced at TREC-1 in 1992, but did not settle into its final format until TREC-4. Successors to each of these benchmarks are still being evolved today. Experiences from benchmarking were fed back into both the benchmark and the technology under study. The result was advances in the state of knowledge of the respective disciplines. TPC-A™ brought clarity to a problem that was mired in controversy and suspicion. It allowed vendors, customers, and researchers to communicate with each other using common vocabulary and concepts. Today, there is a strong commitment to benchmarking in the database community. SPEC CPU had a similar effect on the computer systems community and there is a corresponding increase in sensitivity to issues of benchmark design and empirical techniques. For IR, there was a significant improvement in the precision and recall rates of document retrieval tools. The existence of a standard benchmark that has been developed by consensus and is widely accepted provides a beacon for the entire community. It lights the path for others to follow and shows that there is agreement on what is important. It focuses attention on a key problem, so that improvements are concentrated. It is these effects, which emerge out of both technical and sociological factors, that have lasting positive impact on the discipline. Hence the emphasis in this theory on the process of benchmarking, rather than simple possession of a benchmark.
4.2 Validation Using Novel Benchmarks
The second empirical validation in this chapter involves using the theory of benchmarking to
interpret the emergence of novel benchmarks, that is, ones that were not used to formulate the theory. Conceptually, this is similar to the machine learning procedure of testing a classifier on data different from that used in training. In this section, two case histories will be considered,
work by Barron et al. on optical flow from computer vision [16] and the KDD (Knowledge Discovery and Data mining) Cup [39, 75, 169]. Both of these case histories were suggested to me by people who were familiar with those research areas but believed that these benchmarks did not fit the theory. In both cases, further investigation yielded evidence that suggests that the theory does indeed fit. 4.2.1 Optical Flow After giving a presentation at University of Sydney, an audience member approached me and suggested that a benchmark developed by Barron, Fleet, and Beauchemin for optical flow techniques did not fit the theory [16]. Optical flow techniques are developed by computer vision researchers to track motion, or flow, of objects across a visual field. Barron et al. performed an empirical evaluation of optical flow techniques by implementing nine techniques and comparing their performance on a set of synthetic images. The audience member argued that this landmark paper on its own was sufficient to create a benchmark, and discussions and workshops were not necessary. Investigating the emergence and impact of a benchmark is often difficult due to lack of documentation on the social processes involved, and this one was no exception. My research unearthed a few hints about the role and status of the Barron et al. paper within computer vision research. While the Barron et al. paper was certainly influential, a slightly different picture emerges when viewed from the perspective of empirical validation within computer vision. There is a wide range of benchmarks in that field [113] and at least some of these have been developed in a manner consistent with this theory [28, 126]. Two workshops in particular stand out as counter-arguments to the audience member’s claim and these are a Dagstuhl Seminar on Evaluation and Validation of Computer Vision Algorithms [74] and an IEEE Workshop on Empirical Evaluation of Computer Vision Algorithms [28]. Both of these workshops were held in 1998, after Barron et al. paper was published. At both of these, the work by Barron et al. was considered important, but it was one of many noteworthy evaluations. In a report on the second workshop, Bowyer and Phillips divide empirical work in the field into three categories [29]. The first category is evaluations that are independently administered. These are tests that are organised by a group and individual competitors subject their algorithms to the evaluation. The second category is evaluations of a set 65
of classification algorithms by one group. The third category is problems where the “ground truth” is not self-evident and part of the process of evaluation is deciding what this ground truth, or correct answer, is. From their description, benchmarks most resemble the first type of evaluation, and the authors place the Barron et al. paper into the second group. I conclude that the Barron et al. paper does not contradict the theory of benchmarking, simply because it is not the same kind of benchmark that is addressed by the theory. The audience member did have a valid point in that laboratory evaluations can also have a significant impact.
4.2.2 KDD Cup
The KDD Cup is a challenge that has been held annually at the Knowledge Discovery and Data Mining conference since 1997. Like TREC, it offers multiple tasks. Unlike the other case histories discussed in this dissertation, the tasks change every year. Each KDD Cup has a different set of organisers who design an evaluation for a fresh problem. For example, KDD Cup 2002 had two tasks, “Information Extraction from Biomedical Articles” and “Yeast Gene Regulation Prediction” [39, 40, 169]. KDD Cup 2000, in contrast, had one data set (transaction records from an online store), but five independent questions were asked about this data [75]. Each of these tasks fits the definition of benchmark as used in this dissertation, and the benchmark components can easily be identified. Furthermore, there is a high level of community participation in the development and deployment of these benchmarks. However, there is no ongoing discussion, refinement, or maintenance. There are some differences between the KDD Cup and the case histories considered in this dissertation, but the theory of benchmarking can still be applied. Moreover, the KDD Cup is a valuable example of how a community can approach the construction of benchmarks. I will first examine the merits of the approach used in the KDD Cup and then argue that the theory does apply to this example. The main drawback of the KDD Cup approach is that there is no follow-on competition to provide a venue for researchers to demonstrate improvements to their systems. Consequently, there is less emphasis on learning from experience than would be the case with a repeated
benchmark. However, the KDD Cup approach of using new tasks each year is shrewd because it mitigates many of the dangers of benchmarking.
• Moderates Competitiveness. While some competitiveness is required in benchmarking, an excessive amount can actually reduce the scientific value of the project. Because each KDD Cup competition is run only once, there is no incentive to tune a system to excel at a particular benchmark task. Consequently, once the lessons have been learned from participation and discussion, the community moves forward.
• Distribution of Work. Researchers with different resources and areas of expertise can participate, spreading the workload more equitably. Since the competition is run annually, the amount of time that can be put into the creation of a single evaluation is limited, which in turn minimises the “cost” of moving on to another task.
• Inclusion. The KDD Cup model for organising a benchmark allows many people to become involved. As a result, the competition reflects a spectrum of opinion from across the community and the benchmark does not become over-committed to a single research direction.
Clearly, there are benefits to using the KDD Cup approach to benchmarking. As to whether the theory of benchmarking generalises to this case, the competition needs to be examined over several years, as was done with the other case histories. It does not seem reasonable to consider the KDD Cup a single benchmark, because the tasks change so radically. On the other hand, it also does not seem reasonable to consider it as a series of disconnected evaluations. Taken as a whole, the KDD Cup is a reification of the community’s paradigm. Both technical knowledge and community consensus come into play in the design of each year’s competition. The organisers are domain experts in the task and there are opportunities for participants to provide input on the design of the evaluation. In every KDD Cup since 2000 there has been a “question period” lasting 1-2 weeks between the initial release of the data and the final competition [40, 75, 107]. During this time, the competitors comment on the format of the task and the organisers make appropriate adjustments. This interaction allows the participants to come to an agreement on whether the task provides a level playing field for their systems. More importantly, the KDD Cup is a statement by the community on what constitutes a scientific problem and what are scientifically acceptable solutions. I would infer from these
competitions that the KDD community is concerned with the development of algorithms, data structures, and systems that discover patterns in data. While certain approaches work better with some types of problems and some researchers have expertise in a particular domain, the scientific results ought to be domain-independent. While the KDD community is a big tent, the selection of particular tasks and organisers is a recognition of current, possibly trendy, problems that are scientifically interesting. Consequently, the tasks included in the KDD Cup are indicative of the community’s paradigm, so the theory of benchmarking does generalise to this case history.
4.3 Analytic Validation of the Theory’s Structure
In this section, I will present an analytic validation of the structure of the theory of benchmarking.
A theory is a set of statements that puts forth a causal explanation of a set of phenomena that is i) logically complete, ii) internally consistent, and iii) falsifiable [98]. A theory needs to provide a causal explanation, that is, to define a set of concepts and the expected relationships between them, in order to set it apart from descriptive inferences, which only describe the empirical world. A theory is internally consistent if it does not contradict itself. It is logically complete if the assumptions needed to make the inferences are stated at the outset. Finally, it is falsifiable if the argument is not circular (or tautological) and can be tested empirically. In this section, I will describe these criteria further and argue that the theory put forth in this dissertation fulfils these criteria. This analysis will rely on the recapitulation of the theory from Section 3.8. 4.3.1 Provides a Causal Explanation A theory needs to provide a causal explanation, that is, it should say why something happens. The phenomenon being explained (the explanans) is described by giving definitions of the things in the theory and the relationships between them. In addition, the theory should state any assumptions that it makes. This theory puts forth a causal explanation for the impact of benchmarking in computer science and software engineering. The theory was constructed using Kuhn’s structure of scientific revolutions as a foundation and consequently inherits features from that theory, in particular, the nature of scientific paradigms and the processes for scientific progress. The explanans is the four concepts defined in the recapitulation, scientific discipline, scientific paradigm, scientific progress, and benchmarking. The first concepts three were taken 68
from Kuhn and the last one was given in Section 3.1 as part of the theory. The two relationships described by the theory are between benchmarks and scientific paradigms, and between benchmarking and scientific progress. It states that the process of benchmarking advances research by improving technical results and increasing the consensus in the community, because a benchmark is closely tied to a discipline’s paradigm.
4.3.2 Structural Soundness
The recapitulation of the theory in Section 3.8 states the theory using one assumption, four concepts and their definitions, and two relationships. Since the theory has few elements, it is not difficult to show by inspection that the theory is logically complete, internally consistent, and falsifiable. (For larger theories, it is possible to use more rigorous methods such as formal logic and model checking [22].) The assumption that scientific paradigms exist is necessary because the theory uses Kuhn’s work as a foundation. In other words, his ideas must be assumed to be correct, so that articulation of the theory of benchmarking can proceed. This assumption provides the basis for three of the four concepts and the two relationships. Therefore, the theory of benchmarking is logically complete because all the needed assumptions are stated at the outset. The theory of benchmarking is internally consistent because none of the elements contradict each other. The concepts build on each other and address related entities, but each of them is distinct. The two relationships connect different concepts, so there is no conflict there. Finally, the theory is falsifiable because the argument put forth is not circular. Furthermore, it is possible to negate the relationships and test these statements empirically. At the end of Section 3.8, there is a hypothesis that attempts to reverse one of the relationships in the theory. This hypothesis is not a circular argument, but an examination of the causal flow in the theory. In other words, there are many feedback arcs among the variables and concepts, and these allow social processes to be leveraged bi-directionally.
4.4 Validation by Comparison to a Rival Theory
A theory has value if it improves our understanding of a phenomenon and allows us to make
stronger knowledge claims. Consequently, a theory is evaluated by comparing it to existing, rival
theories. There is a hierarchy of six criteria for making this comparison: postdiction, generality of explanans, number of hypotheses generated, progressive research program, breadth of implications, and parsimony [98]. Two theories are compared using each criteria in turn until one is determined to be superior. Where possible, judgements should be based on criteria that appear earlier on the list. Since this is the only theory of benchmarking that I know of, there is no existing, rival theory that can be used for comparison. Consequently, I will need to construct a rival theory that represents our prior state of knowledge to perform this validation and unfortunately this rival theory will be rather hollow. This comparison will be conducted using all six criteria, although the process normally stops when one theory is found to be superior. 4.4.1 A Rival Theory The primary characteristic of the rival theory is that benchmarks are simply technical artifacts that exist for the purpose of performing empirical evaluations. They produce sound empirical results and their effectiveness is determined solely by the quality of the research design. Scientific progress is achieved because additional technical results have been generated. 4.4.2 Postdiction: Which theory better fits the empirical data? Postdiction is the ability to explain past events or existing empirical data. It is the opposite of prediction, being able to describe future events. The theory that fits empirical data better is preferred. This theory of benchmarking was constructed using a number of existing benchmarks as case histories, so it naturally postdicts that data. This postdiction will be discussed in detail in Section 4.1. Moreover, this theory was constructed because the conventional view was not sufficient to explain the events. For instance, benchmarking had an impact on the entire community, beyond what would be expected for an empirical evaluation. Other equally welldesigned experiments and case studies were recognised as valuable and even won best paper awards, but did not cause “winning” ideas to spread quickly. The conventional view also could not explain why researchers were willing to subject their tools and techniques to public evaluation (and possible failure), and why they were enthusiastic about repeating the experience.
Once Kuhn’s ideas about scientific communities and paradigms were applied, the impact of benchmarking became easier to explain. Scientific progress depends on both technical results and consensus in the community. Once the sociological component was added to the theory, arguments about values and cross fertilisation could be brought to bear. In addition, Kuhn’s structure of scientific revolutions provided a framework for interpreting the symbolic value of a benchmark at different stages. Usually this is the only criterion needed to decide between two theories and that is the case here. Once a theory is found to be superior to another one is it not necessary to continue making comparisons down this list of criteria. However, for the sake of discussion the remaining criteria will also be considered. 4.4.3 Generality of Explanans: Which theory can account for more phenomena? This criterion relates to the breadth of the phenomenon being explained, that is, the explanans. For example, a theory of program comprehension is more general than a theory of how novice programmers debug toy programs. Theories that are more general and have a larger explanans are preferred. The rival theory has a more general explanans. It can be applied to any evaluation called a benchmark. In contrast, the theory of benchmarking applies only to benchmarks as defined in the theory, i.e., community-based benchmarks. 4.4.4 Hypothetical Yield: Which theory produces more hypotheses? In this context, a hypothesis is a claim or statement that can be made about the world. A theory that can yield explanations for phenomenon other than the one under investigation is preferred. The theory of benchmarking has a greater hypothetical yield than the rival theory, simply because it is a larger theory. Since it uses Kuhn’s structure of scientific revolutions as a foundation, it inherits concepts, processes, and mechanisms. Some of the additional hypotheses that can be tested are the role of benchmarks during different stages of revolution, as described in Section 3.3.3. If benchmarks can be used as an indicator of scientific maturity, this theory of benchmarking would also be a contribution to 71
philosophy of science, since benchmarks could then be used as a measure in a sociological or anthropological study. 4.4.5 Progressive Research Program: Which theory is more progressive? Progressive and degenerative research programs were concepts introduced by Imre Lakatos [25, 98]. He expected research programs and their derivative theories to change over time and he wrote, “In a progressive research programme, theory leads to the discovery of hitherto unknown novel facts. In degenerating programmes, however, theories are fabricated only in order to accommodate known facts” [25] (p. 43). A degenerative theory accommodates contrary evidence by reducing the explanans, adding assumptions, and/or adding restrictions. A progressive theory expands the explanans, eliminates assumptions, and/or eliminates restrictions. The theory of benchmark is more progressive because it can easily be applied to other aspects of scientific disciplines where collaboration occurs. For example, the development of GXL (Graph eXchange Language) can also be explained using many of the ideas in the theory of benchmarking [67, 135, 167]. The GXL standard, like a benchmark, is symbolic of a paradigm. In this case, it is a statement about how reverse engineering tools are expected to share and process data. GXL was also created collaboratively by researchers from multiple sites and it was accepted by the community in a process of ongoing consultation supported by laboratory work. Consistent with the process model for benchmarking (discussed in the next chapter), the development effort was led by champions, work on the format built on extant exchange formats, and there was already a sprit of cooperation within the community. The shortcomings of the rival theory has parallels for standards as well. Looking at standards only as technical artifacts is not sufficient to explain their success. Some standards are used extensively, while others are not. Some standards are widely adopted, for example, UML, without the benefit of a regulating body. Others undergo a careful consultation process, such as those published by IEEE, but are not widely used. Other factors need to be brought in to account for this variability.
4.4.6 Breadth of Policy Implications: Which theory provides more guidance for action? The purpose of a theory is to help us understand how the world works. This knowledge should help us make decisions about how to act. A theory whose implications provide more guidance and useful advice for action should be preferred. This aspect of the theory of benchmarking is described at length in the next chapter. It provides a great deal more guidance than the rival theory. Its implications include ways to advance progress in a research community and how to develop a successful benchmark. 4.4.7 Parsimony: Which theory is simpler? A theory that is simpler, i.e. one that has fewer assumptions, is parsimonious and should be preferred. This criterion is an application of Ockham’s razor. If two theories are equally good at this point, then choose the simpler one. The rival theory is more parsimonious than the theory of benchmarking. It makes fewer assumptions and is a shorter, simpler explanation. This finding illustrates why this criterion is last on the list. A theory that is longer but improves our understanding of the world is better than a theory that is short but does not. 4.5
Summary This chapter was concerned with showing that the theory of benchmarking is correct. Two
types of validations were performed, empirical and analytic. Empirical data is needed to show external validity. The case histories of TPC-A™, SPEC CPU2000, and the TREC Ad Hoc Retrieval Task were used to show that the theory postdicts known instances. Two novel case histories, the KDD Cup and optical flow in computer vision, were used to test how well the theory predicts. A further two analytic validations were performed to show that the theory is internally valid. The first of these was performed to show that the theory does provide a causal explanation and is structurally sound. The second analytic validation used a hierarchical set of criteria to show that the theory of benchmarking had more explanatory power than a rival theory that represents the current view that benchmarking is simply an effective evaluation technique. This validation will continue in Chapters 6 to 8 with two benchmarks that were developed with the reverse engineering community.
Chapter 5. Applying the Theory
Part III of this dissertation concerns the application of the theory of benchmarking. This chapter provides practical advice for benchmarking and the remaining three chapters illustrate how benchmarking was used to advance research in reverse engineering. According to the theory, benchmarking improves scientific results and consensus in a discipline. This chapter hinges upon using benchmarking deliberately to gain these benefits rather than only enjoying them as a side-effect. Benchmarking can be used to validate research contributions, to codify technical knowledge, and to make a community more cohesive. In this chapter, a prescription is given for a benchmarking process that has a high degree of community involvement. I will begin by giving some guidance on how to determine whether to undertake such a project in a particular research area. For areas that are amenable to this work, I provide principles for the benchmarking process and checklists for evaluating the product.
5.1 Process Model for Benchmarking
The process for benchmarking is shown in Figure 5-1, which gives an overview of the
different actors and their interactions. The components of the diagram will be described here briefly before moving to a full discussion of the process model. There are three laboratories (A, B, and C) working on the same problem—developing tools to split a crystal. Each of these laboratories has developed a different tool (a hammer, a pickaxe, and a sword) for this purpose. These elements represent the minimum maturity needed before embarking on the benchmarking process. The diagram also shows interactions between laboratories. A researcher will visit another laboratory to learn about a new tool or to teach them about her or his tool. This interaction represents the ethos of collaboration needed for successful benchmarking. At the centre right of the diagram is a workshop to discuss and use the emerging benchmark. The workshop is led by a small number of champions. A champion is primarily an organiser
and may come from one of the participating laboratories, but works to advance the benchmark. Participating laboratories bring their tools to the workshop to be evaluated and provide feedback on the benchmark. These same participants take the benchmark results back to their laboratories to inform further work. These cycles ensure that the benchmark is supported by laboratory work.
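The cycle just described can also be summarised as a toy model. The sketch below is purely illustrative: the laboratories, tools, evaluation function, and number of cycles are hypothetical stand-ins for the actors in Figure 5-1, and the code is not part of the process model itself. What it shows is the repeated loop in which tools are evaluated at a workshop, results return to the laboratories, and the benchmark itself is refined.

```python
# Illustrative toy model of the benchmarking cycle (all names and values are hypothetical).
from dataclasses import dataclass, field

@dataclass
class Laboratory:
    name: str
    tool: str
    results: list = field(default_factory=list)  # findings brought back from workshops

@dataclass
class Benchmark:
    tasks: list            # Task Sample
    measures: list         # Performance Measures
    version: int = 1

def workshop(benchmark, laboratories, evaluate):
    """One benchmarking workshop: evaluate every tool and collect feedback on the benchmark."""
    results = {lab.name: evaluate(lab.tool, benchmark) for lab in laboratories}
    feedback = [f"{lab.name}: suggestions for the next version" for lab in laboratories]
    return results, feedback

def refine(benchmark, feedback):
    """Champions fold community feedback into the next version of the benchmark."""
    benchmark.version += 1
    return benchmark

# Hypothetical usage: three laboratories, several workshop cycles rather than a one-off evaluation.
labs = [Laboratory("A", "hammer"), Laboratory("B", "pickaxe"), Laboratory("C", "sword")]
bench = Benchmark(tasks=["split the crystal"], measures=["quality of split", "time taken"])
for _ in range(3):
    results, feedback = workshop(bench, labs, evaluate=lambda tool, b: {"completed": True})
    for lab in labs:
        lab.results.append(results[lab.name])   # results return to the laboratories
    bench = refine(bench, feedback)             # and the benchmark itself evolves
```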
Figure 5-1: Process Model for Benchmarking
5.1.1 Prerequisites
There are three conditions that must exist within a discipline before construction of a benchmark can be fruitfully attempted, and these are: a minimum level of maturity, a tradition of
comparing research results, and an ethos of collaboration. If these prerequisites are not met, it does not mean benchmarking cannot be employed. Rather, it means that steps to establish these prerequisites must first be undertaken. The first two prerequisites can be addressed by time and scientific results. In some situations, it is merely sufficient to wait for a critical mass of sentiment and technology before embarking on the road to a benchmark. In other situations, preliminary research on benchmark components may also be necessary. Laboratory experiments that compare the efficacy of different tools or techniques are helpful for demonstrating the value of comparison and validation studies. The third prerequisite can be addressed through a series of planned activities, but again, patience is necessary because it takes time to change the ethos of a group. Some ways to build up a collaborative ethos include informal meetings and joint projects between groups, time set aside for discussions at conferences, and workshops where new problems and ideas are explored. 5.1.1.1
Minimum Level of Maturity
The first prerequisite is that there must be a minimum level of maturity in the discipline. As a research area becomes established, diverse approaches and solutions proliferate. This is a necessary and appropriate stage of development, because the bounds of the area are being established and different methods are being applied. This proliferation is necessary, so there will be a variety of tools and techniques to be compared by the benchmark. In terms of the Redwine and Riddle characterisation of software technology maturation, the tools or techniques should have reached at least the Internal Enhancement and Exploration phase before benchmarking should be attempted. Samuel T. Redwine and William E. Riddle reviewed 17 different software technologies, such as abstract data types, formal verification, and UNIX, and identified six phases of maturity and the transition points between them [119]. These phases are summarised in Table 5-1. Internal Enhancement and Exploration is the fourth phase in the maturity model and transition into it occurs when usable versions of the tools or techniques become available. Activities during this phase are concerned with extending the technology, applying it to real problems, and the generation of results showing its efficacy.
A research area needs to attain this level of maturity, so that it has the wherewithal to act on the results of the benchmark. Also, at this level of technological maturity, benchmarking becomes more than an academically interesting activity because it provides knowledge necessary to reach the next level of maturity.
• Basic Research. Activities: investigation of ideas and concepts that later prove fundamental; general recognition of the problem and discussion of its scope/nature. Milestone: appearance of a key idea underlying the technology or a clear articulation of the problem.
• Concept Formulation. Activities: informal circulation of ideas; convergence on a compatible set of ideas; general publication of solutions to parts of the problem. Milestone: clear definition of the solution approach via a seminal paper or a demonstration system.
• Development and Extension. Activities: trial, preliminary use of the technology; clarification of the underlying ideas; extension of the general approach to a broader solution. Milestone: usable capabilities become available.
• Enhancement and Exploration (Internal). Activities: major extension of the general approach to other problem domains; use of the technology to solve real problems; stabilization and porting of the technology; development of training materials; derivation of results indicating value. Milestone: shift to usage outside of the development group.
• Enhancement and Exploration (External). Activities: same activities as the previous phase, but carried out by a broader group, including people outside the development group. Milestone: substantial evidence of value and applicability.
• Popularization. Activities: appearance of production-quality, supported versions; commercialization and marketing of the technology; propagation of the technology through the community of users. Milestones: propagation through 40% of the community, then through 70% of the community.
Table 5-1: Redwine and Riddle’s Software Technology Maturation Phases
This prerequisite is important because there is a significant cost to developing and maintaining the benchmark, and a danger in committing to a benchmark too early. As Tichy wrote: “Constructing a benchmark is usually intense work, but several laboratories can share the burden. Once defined, a benchmark can be executed repeatedly at moderate cost. In practice, it is necessary to evolve benchmarks to prevent overfitting” [153] (p. 36). The community must be ready to incur the cost of developing the benchmark, and subsequently maintaining it. Continued evolution of the benchmark is necessary to prevent researchers from making changes to optimise the performance of their contributions on a particular set of tests. Too much effort spent on such optimisations indicates stagnation, suggesting that the benchmark should be changed or replaced. Locking into an inappropriate benchmark too early, using provisional results, can hold back later progress. The advantage of having a benchmark is that the community works together in one direction. However, this commitment means closing off other directions, albeit temporarily. Selection of one paradigm, by definition, excludes others. 5.1.1.2
Tradition of Comparison
A tradition of comparison between research contributions paves the way for a successful benchmarking process for two reasons. One, it indicates that the community recognises the importance of validation and it does not need to be convinced of its value. Two, there is a body of research on empirical evaluation that serves as a starting point for benchmarking. This tradition may have started recently or it may have a long history. Communities with a newly-acquired interest would have only a few recent papers either reporting on or proposing an empirical study. In communities with a longstanding interest, a benchmark can be an extension of a ongoing empirical work or it may re-kindle a dormant tradition. Evidence of interest in making comparisons can be found in: •
an increasing concern with validation of research results;
•
comparison between solutions developed at different laboratories; 79
•
attempted replication of results; use of proto-benchmarks (or at least attempts to apply
solutions to a common set of sample problems); and • 5.1.1.3
an increasing resistance to accept speculative papers for publication. Ethos of Collaboration
The third prerequisite is that there must be an ethos of collaboration within the community. In other words, there must be a willingness to work together to solve common problems. A past history of collaboration demonstrates the presence of a good working relationship and sets up the expectation that community members take part in such work. This ethos, familiarity, and experience creates a community that is more receptive to the results and therefore more likely to use the benchmark. Evidence of this ethos can be found in: •
multi-site collaborative projects;
•
papers with authors from disparate geographic locations and sectors of the economy;
•
exchange visits between laboratories; and
•
standards for terminology and publication.
Seminars and workshops that provide opportunities for discussion and interaction also foster this ethos of collaboration. 5.1.2 Factors for Success The key principle underlying the benchmark development process is this: scientific knowledge and community consensus must progress in lock-step during creation of the benchmark. Therefore, the construction of the benchmark is as much about making design decisions as it is about building consensus in the community. The selected benchmark components must be endorsed by the community. Choosing the Task Sample will undoubtedly be controversial. Creating Performance Measures can also be difficult. In most cases, no obvious measures are available prior to the benchmarking effort. Although many people associate benchmarking with test of simple one-dimensional attributes (such as speed), in fact the measures can address any aspect of fitness for purpose, and hence can be quite complex. 80
Since knowledge and consensus must be advanced together, a successful benchmark development process needs to have the following attributes. •
The effort must be led by a small number of champions. This group keeps the work
active and the project often becomes identified with one or two key members of the group. They act primarily as organisers, co-ordinating activities such as opportunities to provide feedback and publication of results. While a champion may be affiliated with a particular research lab, it is important that he or she acts impartially when working on the benchmark. •
There must be many opportunities for the general community to participate and
provide feedback, in order to build up consensus. Pursuant to Tichy’s observation, there must be many opportunities to debate. These opportunities can come in a variety of formats. They can be newsgroups or mailing lists for informal discussion. A more formal, written Request for Comment (RFC) procedure could also be used. I have found that face-to-face meetings that occur as part of conferences and seminars work best. •
Design decisions for the benchmark need to be supported by laboratory work. The
benchmark should use established research results where possible. In addition, prototypes and small tests may be needed to further understanding or to support decision-making. For example, selection of a representative Task Sample may require data on usage in the field or a model of user behaviour.
5.2 Readiness for Benchmarking
The purpose of this section is to provide guidance for researchers who are trying to decide
whether their research area is ready to work on a community-based benchmark. To this end, it provides a questionnaire for assessing readiness for benchmarking. To illustrate its use, the assessment will be applied to two research areas: one that is ripe for benchmarking (requirements engineering) and one that is not (software evolution). The discussion of this instrument will conclude with an examination of its shortcomings.
5.2.1 Readiness Assessment
The full readiness assessment is found in Appendix B. First, a research community and a technology to be benchmarked are identified. The assessment asks a series of multiple choice
questions on the three prerequisites: minimum level of maturity (6 questions), tradition of comparison (6 questions), and ethos of collaboration (5 questions). Points are awarded for the answers (0 points for each “a”, 1 point for each “b”, 2 points for each “c”) and a score is calculated for each prerequisite. The scores are then interpreted according to Table 5-2. Ideally, a research area should be assessed as ready on all three prerequisites, but other combinations may also yield a successful effort.

Prerequisite     Too Soon   Ready for Benchmarking   What are you waiting for?
Maturity         0-4        5-9                      10-12
Comparison       0-4        5-9                      10-12
Collaboration    0-3        4-7                      8-10

Table 5-2: Interpretation of Readiness Assessment Scores
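The arithmetic behind these scores is simple enough to state as a short sketch. The Python below is illustrative only: the answer lists are hypothetical, and only the point values (a = 0, b = 1, c = 2) and the thresholds in Table 5-2 come from the assessment itself.

```python
# Scoring sketch for the Benchmarking Readiness Assessment.
# Answer lists are hypothetical; point values and thresholds follow the text and Table 5-2.
POINTS = {"a": 0, "b": 1, "c": 2}

# (ready_threshold, strong_threshold) for each prerequisite, taken from Table 5-2.
THRESHOLDS = {
    "maturity":      (5, 10),   # 6 questions, maximum score 12
    "comparison":    (5, 10),   # 6 questions, maximum score 12
    "collaboration": (4, 8),    # 5 questions, maximum score 10
}

def score(answers):
    """Sum the points for one prerequisite's multiple-choice answers."""
    return sum(POINTS[a] for a in answers)

def interpret(prerequisite, total):
    ready, strong = THRESHOLDS[prerequisite]
    if total >= strong:
        return "What are you waiting for?"
    if total >= ready:
        return "Ready for Benchmarking"
    return "Too Soon"

# Hypothetical answers for one research area and technology.
answers = {
    "maturity":      ["b", "b", "c", "a", "b", "b"],
    "comparison":    ["a", "b", "b", "a", "b", "a"],
    "collaboration": ["b", "c", "b", "a", "b"],
}
for prereq, ans in answers.items():
    total = score(ans)
    print(prereq, total, interpret(prereq, total))
```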
The questions were created by examining the prerequisites in the process model and identifying corresponding quantifiable features in the theory of benchmarking. The threshold values for the questions were selected by applying the questions to the case histories of TPC-A, SPEC CPU2000, and the TREC Ad Hoc Retrieval Task. As it stands, the Benchmarking Readiness Assessment is a specialist’s tool, meaning that it requires intimate knowledge of the theory of benchmarking to use appropriately. Consequently, the reliability and repeatability of this instrument have not been tested rigorously. Despite these shortcomings, the readiness assessment serves as a starting point for an important tool.
5.2.2 Example: Requirements Engineering
Requirements engineering has matured considerably as a research community over the past decade, with a series of annual conferences now in their tenth year, and a number of smaller regular workshops. Collaborative links such as the recent European-funded RENOIR network of excellence provide strong evidence of an ethos of collaboration. In some areas of requirements engineering, most notably software specification, proto-benchmarks have been used for years. A recent paper by Feather et al. characterises these as exemplars [51], and identifies the uses, strengths, and weaknesses of exemplars for evaluating specification techniques. When used for promoting research comparison and understanding, exemplars have both a Motivating Comparison and a Task Sample. However, as Feather et al. point out, exemplars lack appropriate evaluation criteria:
“How can we determine whether the specification language used on one exemplar has appropriate expressiveness, scaleability, evolveability, deductive power, development process efficiency, and so on?” [51] (p. 423). The preceding list provides an obvious starting point for defining Performance Measures. These characteristics suggest that the prerequisites have been met. I used the Benchmarking Readiness Assessment to examine the requirements engineering research area, in particular, requirements specification languages. This sub-area received the following scores:
Maturity: 6
Comparison: 4
Collaboration: 5
The Comparison score is one point shy of being ready, but the other two scores fall into an acceptable range. Efforts to construct a benchmark in this area and technology combination should begin with some comparisons, which will quickly raise the score for this prerequisite.
5.2.3 Counterexample: Software Evolution
Software evolution is an emerging research community, but it is already pursuing a benchmark. This research area is interested in studying large-scale changes that occur in software over time. Over the last couple of years, key people in the field have published papers [45, 94] and proposed workshops to draw the community’s attention to this problem. This work is motivated by a desire to better understand research results by comparing them using common Task Samples. In applying the Benchmarking Readiness Assessment, it was difficult to settle on a particular community and technology. There is some confusion about which researchers and results belong to this community. Furthermore, this research area is still trying to characterise software evolution, which in itself is an indicator of a lack of benchmarking readiness. Consequently, there are many ways to measure and examine software evolution, but little sense of how these techniques could be applied to industrial software development. The technology chosen for the assessment was tools for measuring software evolution, and these received the following scores:
Maturity: 3
Comparison: 2
Collaboration: 2
These results are consistent with a review of the literature and observations of the community. An attempt to organise a workshop to begin a benchmarking effort in 2002 was not successful, and there is little experience with comparison. Also, the community has not quite attained sufficient proliferation of research results. However, the needed multiplicity of approaches will come with time, and so will the consensus and results needed for a benchmark.
5.3 Checklists for Benchmarks
In this section, I present some criteria for evaluating benchmarks. As with other aspects of
benchmarking, there are both technical and sociological criteria. The technical criteria are used to assess the benchmark as an empirical evaluation. The sociological criteria are used to determine the extent to which the community has been involved in and affected by the process. These criteria are used to generate an initial list of questions for evaluating benchmarks.
5.3.1 Terminology
Figure 5-2: Processing of Benchmark Data
Before discussing these checklists, some of the terminology in the questions should be clarified. They use terms such as “raw data”, “solutions”, and “results.” The Task Sample includes all the input data included in the benchmark, for example, task descriptions and a text corpus. Raw data is the information generated by the tool or technique to complete the tasks. This data can be further analysed or documented to produce solutions. Depending on the context,
raw data and solutions can be the same, but they differ in that the solutions provide specific answers to the tasks. The results are the outcome of the evaluation after the solutions have been scored, graded, or ranked. This sequence of work products and processing steps are shown in Figure 5-2. The top row of boxes show components and data for the benchmark. The bottom row of boxes show user components and data. The solid arrows show processing steps that can be performed by a computer or a human. In some cases, no processing is required to turn raw data into solutions. Processing of benchmark data can proceed in a number of ways and SPEC CPU2000 and the xfig benchmark exemplify two of them. In the case of CPU2000, the Task Sample is the programs to be compiled and run. The technology is a computer system including the hardware configuration, operating system, and compiler. Raw data is generated when the compiled programs are run. A testing harness collects the raw data and transforms them into solutions, e.g., measures of work done and timing. The solutions are given to SPEC which calculates the results for a particular computer system by giving a score relative to a baseline system configuration. This processing sequence is almost entirely automated. With the xfig structured demonstration, the Task Sample consisted of source code and tasks for users to complete. A tool was used to analyse the source code and produce the raw data. Human operators manipulated and documented the raw data to produce solutions. The Performance Measure was industry observers who assessed the tools on their applicability and usefulness to their own development teams. The solutions and observations were returned to the organisers who summarised them into results, which, in this case, were lessons learned. This processing sequence had a great deal of user intervention, and consequently a higher degree of subjectivity. 5.3.2 Caveats The purpose of these questions is to make the criteria more concrete, as many software engineers lack the background in empirical methods and social science to interpret and apply criteria such as “external validity” and “engagement.” Benchmarks that fail despite having met these checklists will reveal shortcomings in the evaluation. These criteria will be discussed and a list of questions will be presented to aid in the evaluation of a benchmark.
These checlists have not undergone user testing as questionnaires. A key problem with them is that the questions, as given, are not sufficiently discriminating to elicit trustworthy responses. Questions such as T1 on the Task Sample which asks “Is it representative of problems found in actual practice?” are too easy to answer inaccurately. An enthusiastic benchmark champion could easily answer in the affirmative without providing evidence. These questions attempt to tackle underlying issues with the benchmark and the process, but are not phrased in ways that are resistant to misinterpretation or overly optimistic answers. 5.3.3 Technical Aspects of Benchmark Design At its core, a benchmark is an empirical study. Like other evaluations of this kind, it should yield conclusions that are accurate, produce results that can be repeated, and are relevant to science and society [117] (p. 52). In other words, the study should be valid, reliable, and replicable. In addition to these usual concerns, a benchmark needs to be re-usable by others without extensive training in empirical methods. Consequently, the rigor of the design as well as clarity and quality of the specification are particularly important. Relevance. In order for the benchmark to be relevant, it needs to tackle a problem that is important to more than a single researcher. Since the raison d’être for a benchmark is to make fair comparisons, the Motivating Comparison should be of interest to the broader scientific community and even to industry and society. Validity. In an empirical study, there are three types: internal validity, external validity, and construct validity. •
Internal validity relates to the logic and design of the study. An internally valid study shows that a particular treatment caused the outcome of the study and rules out other reasons for the same outcome.
•
External validity is the applicability of the results of the study to other contexts and is also called generalisability. A study may not generalise well if the problem is not representative of the larger world or the laboratory setting is too constrained or artificial.
•
Construct validity is the most technical of the three. It relates to how well the concepts behind the problem are turned into measurable aspects in the study. For example, TPC-A™ uses cost per transaction per second ($/tps) as a measure of DBMS merit. It defines cost as the purchase price plus the cost to maintain the DBMS for 60 months. 86
This definition was used because it had greater construct validity than using only the purchase price of the system. A construct that is valid should provide an adequate definition of the concept being measured and should be sufficiently sensitive to the reflect the manipulations in the study. An empirical study, and by extension, a benchmark needs to ensure that it is valid in all three ways. Internal validity can be achieved through a logical structure in research design. External validity is particularly important in the selection of the Task Sample. Construct Validity is manifest in the Performance Measures. Reliability and Repeatability. A study that is reliable will produce the same outcome when repeated. This criterion is particularly important because the outcome is decided by pooling data from researchers who have used the materials on different innovations in different settings at different times. A study that is repeatable is self-contained and resistant to the influence of random variation. Replication is needed in order to have independent verification of the results through additional evaluations. Consequently, the materials for conducting a benchmark need to be packaged for re-use with such issues in mind. 5.3.4 Checklists for Technical Criteria Using general criteria from the previous sub-section, questions have been formulated to assess whether a benchmark fulfils the technical criteria. This checklist can be used to aid the development of a benchmark or to assess a finished benchmark. The questions will be presented and explained here according to the components of the benchmark: the Motivating Comparison, Task Sample, and Performance Measures. I begin with some questions about the overall benchmark. The full list of questions and interpretation for responses are found in Appendix C. Table 5-3 shows the questions for the benchmark as a whole in the first column. In the remaining columns, it shows which criteria are assessed by the question. The most critical requirement for a benchmark is that it allows fair comparisons between alternatives, and this is reflected in the first question of the checklist. Subsequent questions in the checklist make this same query in different ways about finer grained details of the benchmark. As for questions O2 and O3, the benchmark should clearly state assumptions about how it is to be used. But at the same time, it should also be sufficiently flexible to survive minor violations of these assumptions. 87
Question (Category: Overall)
O1. Does the benchmark provide a level playing field for comparing tools?
O2. Are the tools or techniques that are intended to be tested defined at the outset?
O3. Can other tools or techniques use the benchmark?
Table 5-3: Checklist and Technical Criteria Traceability Matrix for Overall Questions (criteria columns: Relevance, Internal Validity, External Validity, Construct Validity, Reliability/Repeatability)
The questions in Table 5-4 are concerned with the Motivating Comparison. They all focus on the relevance of the benchmark. The benchmark should be concerned with one of the most important problems, if not the most important problem, in a discipline. All other things being equal, more significant research problems should have benchmarks produced for them first. However, there can be reasonable exceptions, for instance, a benchmark may tackle an intermediate problem in preparation for dealing with the larger problem.
Question (Category: Motivating Comparison)
M1. Is the benchmark concerned with an important problem in the research area?
M2. Does it capture the raison d’être for the discipline?
M3. Are there other problems in need of a benchmark? Should one of those have been developed before this one?
Table 5-4: Checklist and Technical Criteria Traceability Matrix for Motivating Comparison Questions (criteria columns: Relevance, Internal Validity, External Validity, Construct Validity, Reliability/Repeatability)
The next set of questions, in Table 5-5, are concerned with the Task Sample. Above all, the Task Sample must be representative of the problems seen in actual practice. This requirement for realism also applies to the expected usage context and user characteristics. Selection of the Task
Sample should be supported by empirical data and/or a theory of how the technology is used. Because the usage patterns for some research tools are not well understood, compromises may be made regarding the quality and source of the data (e.g. anecdotal vs. systematic) and theory (e.g. models from expert reviews and heuristic reviews).
Question (Category: Task Sample)
T1. Is it representative of problems found in actual practice?
T2. Is there a description of the expected user? Is it realistic?
T3. Is there a description of the usage context? Is it realistic?
T4. Is the selection of tasks supported by empirical work?
T5. Is the selection of tasks supported by a model or theory?
T6. Can the tasks be solved?
T7. Is a good solution possible?
T8. Is a poor solution possible?
T9. Would two people performing the task with the same tools produce the same solution?
T10. Can the benchmark be used with prototypes as well as mature technologies?
T11. Can the tasks be scaled to be more numerous, more complex, or longer?
T12. Is the benchmark biased in favour of particular approaches?
T13. Is the benchmark tied to a particular platform or technology?
Table 5-5: Checklist and Technical Criteria Traceability Matrix for Task Sample Questions (criteria columns: Relevance, Internal Validity, External Validity, Construct Validity, Reliability/Repeatability)
To facilitate comparisons, a benchmark should be scaleable so that more alternatives can be compared. For example, some rules or algorithms regarding how the problem can be made
smaller or larger would allow comparisons between research prototypes and industrial-strength products. Other scaling rules could allow for adjustment on additional dimensions, such as hardware configurations. Finally, the benchmark should not be biased in favour of or against a particular implementation or approach.
Question (Category: Performance Measures)
P1. Does a score represent the capabilities of a tool or technique fairly? i.e. Are the results for a single technology accurate?
P2. Can the scores be used to compare two technologies directly? i.e. Are comparisons accurate?
P3. Are the measures used in the benchmark indicators of fitness for purpose between the technology and the tasks?
P4. Is it possible for a tool or technique that does not have fitness for purpose to obtain a good performance score?
P5. Was the selection of Performance Measures supported by previous empirical work?
P6. Was the selection of Performance Measures supported by a model or theory?
P7. Would one person using the benchmark on the same tool or technique twice get the same result?
P8. Would two people using the benchmark on the same tool or technique get the same result?
Table 5-6: Checklist and Technical Criteria Traceability Matrix for Performance Measures Questions (criteria columns: Relevance, Internal Validity, External Validity, Construct Validity, Reliability/Repeatability)
The final group of questions on technical criteria pertains to the Performance Measures. As with other metrics, the Performance Measures need to be accurate and reliable. In other words,
they should measure exactly the parameter of interest and it should give the same result when the measurement is repeated. These requirements apply to both quantitative and qualitative measures. Performance Measures need to be accurate and reliable first for a single tool or technique, so that comparisons are likewise accurate and reliable. Like the Task Sample, the selection of the Performance Measures should be supported by both empirical data and/or theory. 5.3.5 Sociological Aspects of Benchmarking Process As a technical artefact, a benchmark can be designed to meet certain criteria. The benchmarking process can also be designed according to specified criteria. A successful benchmark is one that has an impact on the scientific community that created it. This impact can be achieved by engaging members of the community, making participation accessible, using a process that is responsive to feedback, and keeping the process and results transparent. All of the criteria for the sociological aspects are geared towards maximizing the impact of the benchmark. Engagement. As many people as possible should be engaged in the benchmarking process. The meaning here is close to the French word, engagement, that is, being actively committed, as to a political cause. It is a concept that spans both actions and emotion. Engagement in the benchmarking process can range from driving the work forward to giving feedback, from doing supporting laboratory work to passive interest. In order to achieve this requirement, there has be sufficient publicity and opportunity for the community at large to participate. Accessibility. The amount of effort, or overhead, required to participate in benchmarking should be kept low. In other words, it should be easy for someone to obtain the benchmark and using it should not consume a great deal of resources. By keeping the benchmark and process accessible, there will be more participants and subsequently greater impact. This criteria also needs to be reflected in the design of the benchmark. For example, an automated benchmark requires fewer resources than one that requires a great deal of manual intervention to create and grade the solutions. An exception to this criteria can be made if there is a very large reward for participating in a resource-intensive benchmark, for instance, for vendors to produce results for benchmarks from TPC or SPEC. Responsiveness. The benchmarking process needs to be responsive to new results and feedback from the community. It is important for participants to feel that their involvement
makes a difference. An indication that their input is being valued is the benchmark changing for the better in response to their suggestions. Transparency. The benchmark development and deployment process should be open, public, and scrutinised. Design and implementation decisions should be justified and open to discussion. Results from using the benchmark should be publicly available and may be subjected to an audit. This transparency makes it more difficult for the benchmarking process to be subverted by hidden agendas, side deals, or special treatment for some. It ensures that there is a level playing field for the benchmarking process as well. There are a number of ways of manifesting these criteria in a social process and the questions in the checklist point to some possibilities. 5.3.6 Checklist for Sociological Criteria The checklist of questions for the benchmarking process can be divided into those pertaining to the Development activity and those for Deployment activity. These activities can take place simultaneously, sequentially, or iteratively. The full list of questions and interpretation for responses are found in Appendix C. The questions for the Development activity are given in Table 5-7. The important criteria during this stage are Engagement and Responsiveness. The benchmark development activity should involve as many members of the community as possible, so that a broad range of viewpoints are represented. In order for the benchmark to benefit from a large number of participants, it needs to be inclusive. The design of the benchmark should reflect the consensus of the community. This acknowledgement and respect for disparate opinions encourages further participation. Having many minds involved provides the benchmark with a built-in peer review committee, which in turn will lead to improved adoption.
Question (Category: Development)
D1. Was there interest in comparing results within the discipline?
D2. Was the project well publicised?
D3. How many people are involved in the development of the benchmark?
D4. How many research groups are involved?
D5. Were there opportunities for the wider community to become involved?
D6. Were refinements made to the benchmark based on feedback?
D7. Was the benchmark prototyped?
D8. Was the design of the benchmark discussed at a meeting that was open to the community? More than once?
D9. Were the results of the benchmark discussed at a meeting that was open to the community? More than once?
Table 5-7: Checklist and Sociological Criteria Traceability Matrix for Development Questions (criteria columns: Engagement, Accessibility, Responsiveness, Transparency)
Table 5-8 shows the set of questions for Deployment of the benchmark. During this stage, Accessibility is the main concern; by making it easy to use the benchmark, more people will become involved. During this time, the process will have low Responsiveness because it is counterproductive to change a benchmark while it is actively being used for evaluations. Once the evaluations have been completed, the Development activity can be revived using the lessons learned. As with the Technical Criteria, the questions are listed in the first column of the table above and the remaining columns show which criteria are addressed. The benchmarking effort should be well publicised to involve as many people as possible. The process also benefits from having disparate points of view involved, such as people from different laboratories, countries, sectors, and research areas.
93
Category: Deployment (criteria columns: Engagement, Accessibility, Responsiveness, Transparency)
E1. What proportion of the eligible technology has been evaluated using the benchmark?
E2. Was the evaluation well publicised?
E3. Was participation in the evaluation open to all interested parties?
E4. Are the materials easy to obtain?
E5. Are the materials clear and unambiguous?
E6. Is there a fee for using the benchmark?
E7. Is there a license agreement for using the benchmark?
E8. How much time is required to run the benchmark?
E9. Is the benchmark automated?
E10. Are specialised skills or training needed to use the benchmark?
E11. Are the results easy to obtain?
E12. Are the results clear and unambiguous?
E13. Is it possible to examine the raw data as well as the benchmark results?
E14. Is there a fee for accessing the raw data or solutions?
E15. Is there a license agreement for accessing the raw data or solutions?
E16. Are there procedures in place for auditing the submitted solutions?
E17. Are the solutions audited?
E18. Is there a process for users to vet their results before they are released?
Table 5-8: Checklist and Sociological Criteria Traceability Matrix for Deployment Questions

Many of the questions in this set relate to access to, and protection of, raw data and solutions, because this data can easily be misused. It can be cited out of context and used to portray a particular system negatively. It may be misinterpreted, intentionally or unintentionally, compromising the integrity of the exercise. When the raw data and solutions are abused, the benchmark's credibility and stature in the community are eroded. Both SPEC and TREC control access
to the results of their benchmarks; people who wish to examine this data must agree to conditions of use that include publication restrictions.

5.4 Considerations for Software Engineering
Precisely because it is difficult to agree on the key problems and performance indicators in software engineering, developing benchmarks would be highly beneficial. As the discipline emerges, it is necessary to branch out into sub-problems and to allow a variety of tools and techniques to proliferate. As software engineering progresses into the next stage of maturity, we need to consolidate what we have learned and evaluate the effectiveness of the innovations. If each area started working on a single benchmark, the effort would focus the attention of researchers and serve as a flagship for the area, highlighting key problems and solutions.

In this section, I discuss considerations that relate specifically to applying the theory and process model to software engineering research. These issues relate both to our problem domain and to the state of software engineering as a scientific discipline. The discussion is organised using the three benchmark components: the Motivating Comparison, Task Sample, and Performance Measures.

5.4.1 Motivating Comparison

As discussed in Section 3.1.1, the Motivating Comparison captures two aspects, comparison and motivation. The motivation aspect refers to the underlying need for the discipline, the benchmark, and, finally, work on the benchmark. The comparison aspect relates to the acceptance of the need to compare and evaluate research results empirically. Both aspects raise fundamental questions, and many areas of software engineering cannot provide unanimous answers to them. For many sub-areas in software engineering, participation by the community in the creation of a benchmark would be a strong statement in support of empirical work. The accumulation of many benchmarks, and of work like them, would gradually make software engineering a more scientific discipline.
5.4.2 Task Sample

The Task Sample in a benchmark is a representative sample of the tasks that a class of tools or techniques is expected to solve in actual practice. Identifying a Task Sample depends on two factors: i) the task domain must exist in practice; and ii) the task domain must have been identified. For research tools and techniques, it is possible for one or both requirements to be absent. A research tool may not have a well-understood application domain because it is too novel, or because it is a radical departure from current technology. Sometimes software can create a task domain, as was the case with presentation software such as Microsoft® PowerPoint®. Before this class of applications existed, presentations were normally given lecture-style, sometimes with a blackboard, but rarely with a set of bullet-pointed visual materials. In this post-PowerPoint® world, presentations are commonly given using a laptop connected to a data projector. It is impressive that this software has driven the market for data projectors, which are expensive pieces of equipment. In situations where the task domain is emergent, i.e., it did not exist prior to the new technology, it will be difficult to anticipate users' expectations [83].

Since the tasks included in the benchmark need to be a representative sample, we need a good understanding of the population of tasks where the technology is used. For some task domains, analytic techniques are sufficient to describe the population. However, for the majority of task domains in software engineering, empirical studies of developers and/or processes are needed to characterise the task domain. These studies may be conducted in the laboratory or in the field, depending on the appropriate unit of analysis. Once the population has been characterised, a variety of techniques can be used to choose the sample [43]. The most statistically sound techniques involve randomisation, such as simple random sampling, systematic sampling, or stratified sampling. However, not all situations require this, and non-probabilistic sampling techniques may be appropriate, for instance, purposive sampling, quota sampling, and availability sampling.
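To make the sampling step concrete, the following sketch shows one way a benchmark designer might draw a stratified random Task Sample from a characterised task population. It is only an illustration: the task categories, weights, and sample size are hypothetical and are not taken from any study cited in this chapter.

import random

# Hypothetical task population, grouped into strata identified by an
# (imagined) empirical study of maintenance work.  Each stratum maps to
# the candidate tasks that fall into it.
task_population = {
    "defect repair":     ["fix crash on load", "fix memory leak", "fix off-by-one"],
    "perfective change": ["add export format", "reorganise menus", "add search"],
    "documentation":     ["document subsystems", "describe data structures"],
}

# Proportion of the benchmark that each stratum should occupy,
# e.g. derived from how often each kind of task occurs in the field.
stratum_weights = {"defect repair": 0.5, "perfective change": 0.3, "documentation": 0.2}

def stratified_sample(population, weights, total, seed=1):
    """Draw a stratified random sample of `total` tasks."""
    rng = random.Random(seed)           # fixed seed so the sample is reproducible
    sample = []
    for stratum, tasks in population.items():
        quota = round(weights[stratum] * total)
        quota = min(quota, len(tasks))  # cannot draw more tasks than exist
        sample.extend(rng.sample(tasks, quota))
    return sample

if __name__ == "__main__":
    for task in stratified_sample(task_population, stratum_weights, total=6):
        print(task)

A non-probabilistic alternative, such as purposive sampling, would simply replace the random draw with an explicit, argued selection from each stratum.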
5.4.3 Performance Measures

Performance Measures are an indicator of fitness for purpose between a technology and a task. Such a measure is not an innate characteristic of the technology, but a relationship between the tool or technique and the setting in which it is used.

Creating Performance Measures will be particularly difficult in software engineering. In this discipline, we are primarily interested in creating tools and techniques that help people build large software systems, a Phase 2 problem according to Thomas Landauer [83]. (He identified two phases, or classes, of computer applications. Phase 1 applications replace human computation, usually because computers can do the task more efficiently; examples are spreadsheets and telephone switches. Phase 2 applications assist human computation to complete a task more efficiently.) The assistance provided by Phase 2 applications is difficult to assess, and by some measures they have not yielded any productivity gains. Due to the difficulty of this problem, even small steps will result in significant advances in our understanding.

Most areas of software engineering already have people working on empirical studies, so they would be a good starting point for guidance. There are many books and papers on the topic. Shari Lawrence Pfleeger [111] and Barbara Kitchenham [72] each wrote a series of articles on this topic in ACM SIGSOFT Software Engineering Notes. The GQM (Goal-Question-Metric) Approach developed by Victor R. Basili, H. Dieter Rombach, and their colleagues is another good source of guidance [17]; its three steps dovetail with the three benchmark components. The guidelines for evaluating benchmarks in this chapter can also be used as a checklist during development.
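As a small illustration of how a GQM structure might be written down for a benchmark, the sketch below encodes one goal, its questions, and candidate metrics as plain data. The goal, questions, and metrics shown are invented for illustration; they are not drawn from Basili and Rombach's published examples.

# A minimal, hypothetical Goal-Question-Metric tree for a tool benchmark.
# The wording of the goal, questions, and metrics is illustrative only.
gqm = {
    "goal": "Assess how well program comprehension tools support maintainers "
            "working on an unfamiliar code base",
    "questions": [
        {
            "question": "How quickly can a maintainer locate the code relevant to a change?",
            "metrics": ["time to first relevant file", "number of files inspected"],
        },
        {
            "question": "How accurate is the documentation the tool helps produce?",
            "metrics": ["defects found in recovered call graph", "observer rating"],
        },
    ],
}

def print_gqm(tree):
    """Print the GQM tree so the mapping from goal to metrics is visible."""
    print("Goal:", tree["goal"])
    for q in tree["questions"]:
        print("  Question:", q["question"])
        for m in q["metrics"]:
            print("    Metric:", m)

if __name__ == "__main__":
    print_gqm(gqm)

The value of writing the tree down, even informally, is that each candidate Performance Measure can be traced back to a question and, ultimately, to the Motivating Comparison.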
5.5 Summary

The theory of benchmarking provides a causal explanation for why benchmarks drive advances in a scientific discipline. An implication of the theory is that benchmarks can be used deliberately to achieve improvements in a discipline. This chapter provided some starting points for this application. It prescribed a process model for others wishing to start a community-wide benchmarking project. A diagnostic tool, the Benchmarking Readiness Assessment, was presented to help assess whether a research area and technology pair is ready for such an undertaking. For those who have elected to begin work on a benchmark, checklists of technical and sociological criteria were provided to guide benchmark development and to evaluate existing benchmarks. The chapter concluded with guidance on applying the theory to software engineering.
Chapter 6. Tool Research in Reverse Engineering

The next three chapters of this dissertation describe a case study of how the theory of benchmarking was applied in reverse engineering. This chapter provides a brief introduction to reverse engineering, including an overview of the tools that researchers have developed and noteworthy evaluations that have been conducted on them. Chapter 7 presents the first case study, a benchmark for program comprehension tools. The second case study, described in Chapter 8, is a benchmark for C++ fact extractors.

The purpose of the background material in this chapter is to establish that reverse engineering has met the prerequisites for undertaking a benchmarking effort. A review of tools developed by researchers will serve to show that the field has attained the required minimum level of maturity. A critique of previous comparative evaluations of tools illustrates the tradition of comparison. These two prerequisites, along with the ethos of collaboration, will be examined at the end of the chapter.
6.1 Tools for Reverse Engineering

A standard definition for reverse engineering was given by Chikofsky and Cross in 1990: "Reverse engineering is the process of analyzing a subject [software] system to:
• identify the system's components and their interrelationships and
• create representations of the system in another form or at a higher level of abstraction" [35].
This definition has proven useful because it is flexible; it describes reverse engineering as a process without explicitly stating its inputs or outputs. In practice, the inputs to the process include source code, logs from configuration management systems, interviews with developers, and documentation. The outputs can be diagrams, new documentation, a searchable database, or re-formatted source code.
Research in reverse engineering is concerned with the creation of tools and techniques to facilitate this analysis, particularly for legacy software systems with thousands of lines of code (KLOC) or millions of lines of code (MLOC). There is a focus on tools in this research area because it is difficult to apply techniques to large legacy systems without tool support. Figure 6-1 shows a typical sequence of steps to reverse engineer a software system. The inputs and outputs are shown in boxes with rounded corners. The steps in the process are shown as rectangles with square corners. The first step is to extract the desired facts from the input. In the second step, the facts are further analyzed before being presented to the user in the third and final step. Below each of the boxes are example products or tools. This sequence of steps is similar to compilation with front end analysis, optimisation, and code generation. The distinction here between extraction and analysis is somewhat artificial in that most extraction and presentation tools perform some analysis, but it is useful because most tool suites have separate tools for each stage. Because the individual tools are difficult to use in isolation, most researchers create workbenches or pipelines that consist of tools that perform each step along with software to link them together. The integration software typically consists of controller software that invokes each of the tools in sequence and a database, or factbase, for storing and querying the facts. I begin by discussing each of the entities in Figure 6-1 before moving on to integration software.
[Figure 6-1 depicts the reverse engineering process as a pipeline. A Software Work Product (for example, source code, design diagrams, specifications, user manuals, or revision control logs) feeds into three successive stages: Extraction (parser-analysers, profilers, data import), Analysis (metrics, dependency, clone detection, clustering, slicing, re-factoring, layout), and Presentation (visual editors, text editors, code browsers, web browsers, search tools). The output is one or more New Views of the Product, such as diagrams, re-formatted code, documentation, and reports.]

Figure 6-1: Stages of Reverse Engineering Process
6.1.1 Software Work Products

As mentioned in the definition, the reverse engineering process can accept a variety of inputs. The majority of tools use source code as the sole or primary input. The conventional view is that source code provides the most complete, up-to-date information about a software system and can be examined without disrupting productivity. Other possible inputs are documents that were created at any stage of the development process, such as user manuals, logs from tools, email messages between colleagues, diagrams, and interviews with developers. Many tool suites use a combination of source code and other information. For example, the Esprit de Corp Suite (EDCS) uses software requirements, expressed as scenarios, as a starting point for architecture recovery on source code and subsequent re-design of the code [121]. The PBS (Portable Bookshelf) uses documentation and interviews with developers to create a conceptual architecture that is mapped onto a concrete architecture extracted from the source code [156].

6.1.2 Extraction Tools

Fact extraction is a challenging and important problem in reverse engineering, because subsequent analyses depend on it. Each type of work product, notation, and programming language needs its own extractor. Other programming language issues complicate the extraction process, for example, preprocessing, conditional compilation, dynamic linking, and language dialects. Some languages have a design and grammar that make them easy to parse and analyze, while others, such as C++, are difficult to handle. Consequently, researchers have developed a variety of approaches to this problem.

Extractors can be characterised using three dimensions: static vs. dynamic, sound vs. unsound, and full vs. partial. Most fact extractors are static and sound, and use a partial data model. Extraction can be performed by analysing the code statically, i.e., by treating it as text. It can also be performed by executing the code to resolve late type binding or run-time references; for example, the code can be run through a profiler and data extracted from the call traces [163]. An unsound extractor can use heuristics to make intelligent guesses about the code. Unsound extractors are sometimes used in reverse engineering to simplify implementation, or in situations where some errors can be tolerated. For example, SNiFF+, an interactive source code searching
environment, uses regular expressions to match definitions and uses of a variable [150]. This information is presented to a software developer, who can then make judgements before using the data. A sound extractor uses full parsing and analysis. Code transformation, such as re-factoring or migration to another programming language, requires a high degree of accuracy and reliability [76]. A partial extraction gathers only the information of interest. A full extraction collects all the facts necessary to regenerate the code; the resulting fact base is called "source complete." Some analyses only require a limited number of facts. For example, a call graph can be constructed using function calls [101]. A number of approaches with varying degrees of accuracy have been developed to perform partial extractions, such as island grammars [97], TXL factors [38], and LSME (Lightweight Source Model Extraction) [103].

6.1.3 Analysis Tools

The analyses performed during this second stage form the heart of reverse engineering. They are performed for a variety of reasons: to recover the architecture or high-level design of a system, to help a programmer understand the source code, to re-organise code so it is easier to maintain, and to collect data to inform management decisions. A few representative analyses are described briefly here.

Dependency. Two parts of a software system depend on each other if one part uses code or resources from the other. Two well-known measures of dependency are cohesion and coupling [52]. The parts can be lines of code, functions, variables, source files, database tables, executables, and scripts. The relationships between parts can be calls to functions, reads and writes of data, and invocations of programs. It is useful to know which parts of a system depend on each other when making small changes to maintain the code, as well as when making large-scale renovations.

Slicing. The purpose of slicing is to reduce the amount of code that a programmer needs to understand to solve a problem. A program slice is created by taking a variable and identifying all the code that affects the value of that variable. The slice should be an executable subset of the original program. This idea was first introduced by Mark Weiser [166] and since then there has
been extensive work on increasing the reliability and flexibility of slicing algorithms [62, 77, 106].

Clustering. This procedure takes pieces of the source code and groups them into subsystems or components [157]. The atomic units in this analysis can be classes or code blocks, but are most commonly source files. There are many approaches to this problem, including clustering based on dependency relationships, naming conventions [12], software metrics [57], genetic algorithms [46], and graph connectivity [104].

Architecture Recovery. Architecture is the abstract design of the software system at the highest level and encompasses all features of the software, including design elements, composition of design elements, assignment of functionality to design elements, physical distribution, communication protocols, and so on. The architecture is long-lived and persists beyond changes in the details of the design. Architecture recovery uncovers the architecture of an existing system using software work products and knowledge from experts. The process involves combining domain information, such as design patterns [71] and the mental models of developers [156], with information extracted from source code and other artifacts [70, 80].

6.1.4 Presentation Tools

Once the analysis is complete, the results need to be presented to the user. In general, there are two types of presentation tools: textual and visual. In practice, tools use both modes at the same time, but one will dominate.

Textual. When the analysis produces a report or other conclusions, this output is shown as static information. Text editors and web browsers are often used for this purpose. This format is often used for metrics, re-formatted code [18], and documentation [158]. Other tools allow users to explore the textual output; some examples are source code searching tools [48, 150] and environments for removing duplicated sections of code [15].

Visual. Software architecture is often presented visually because diagrams can show a lot of information in a limited space. Tools such as Rigi [100], SHriMP [168], and PBS [55] allow the user to browse and interact with pictures. Animation can be used to show events that happen over time, such as software evolution and execution traces [163].
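To illustrate the visual mode of presentation, the sketch below turns a small dependency fact base into Graphviz DOT text, which can then be rendered as a diagram. The facts shown are invented, and the output is deliberately minimal; real presentation tools such as Rigi or SHriMP offer interactive browsing rather than a static picture.

# Hypothetical dependency facts: (from_entity, relation, to_entity) triples.
facts = [
    ("main.c",   "calls", "parser.c"),
    ("main.c",   "calls", "ui.c"),
    ("parser.c", "reads", "grammar.h"),
    ("ui.c",     "calls", "render.c"),
]

def to_dot(triples):
    """Emit a Graphviz DOT digraph; edges are labelled with the relation."""
    lines = ["digraph dependencies {"]
    for src, rel, dst in triples:
        lines.append(f'    "{src}" -> "{dst}" [label="{rel}"];')
    lines.append("}")
    return "\n".join(lines)

if __name__ == "__main__":
    # Pipe the output through `dot -Tpng` (Graphviz) to obtain a diagram.
    print(to_dot(facts))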
6.1.5 Tool Integration

As mentioned earlier, it is difficult to use these tools in isolation. They each have their own calling conventions, data formats, and ontologies. Consequently, reverse engineering tools are used as part of an environment or suite that provides end-to-end support. The most widely cited tools, Rigi [100] and CIA (C Information Abstraction System) [34], are part of integrated suites, as are other environments that support program comprehension [48, 89, 168] and design recovery [70, 71].

Tool suites typically integrate components using a common interface and a shared repository. A common interface gives the appearance that a single tool is being used rather than a collection. Command line tools may use controller software that invokes each of the tools in sequence and passes parameters appropriately; this approach results in a pipeline of tools, for instance, the PBS [55]. GUI tools employ a window, or a set of windows, and user events to invoke component tools [100, 147]. A shared repository allows the tools to operate on common data and reduces the cost of some calculations. The shared repository can be a text file [55, 89] or it may be a knowledge base with special operators for reverse engineering [70, 71].

Both Figure 6-1 and this overview of tools show that there are similarities among tools within each stage of the reverse engineering process. Some researchers have attempted to combine the best features of tools rather than building a suite from scratch. GXL (Graph eXchange Language) has grown out of a desire to combine functionality from different suites [67, 135]. GXL is an XML (eXtensible Mark-up Language) sub-language for representing object-relational data such as graphs. Its design borrows heavily from a number of existing formats, including RSF (Rigi Standard Form), TGraphs, and TA (the Tuple Attribute language used by PBS and TkSee) [167]. The distinguishing feature of GXL is that it uses the same DTD (Document Type Definition) to describe both the data and the schema for the data. Although work is continuing on the format, GXL has been accepted as the format of choice for exchanging data between repositories of reverse engineering and re-engineering tools.
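The controller-plus-repository style of integration described above can be made concrete with a small sketch. The following is a minimal, hypothetical pipeline, not the architecture of any tool suite named in this chapter: a heuristic (and therefore unsound) regular-expression extractor produces call facts as triples, a trivial analysis computes fan-out, and a textual presentation step prints the result. The fact base here is just a Python list standing in for the shared repository.

import re

# A toy C-like source file standing in for the subject system.
SOURCE = """
void parse(void) { lex(); report(); }
void report(void) { printf("done"); }
void main(void)   { parse(); report(); }
"""

CALL_RE = re.compile(r"(\w+)\s*\([^)]*\)\s*\{(.*?)\}", re.S)
IDENT_RE = re.compile(r"(\w+)\s*\(")

def extract(source):
    """Unsound, partial extraction: guess call facts with regular expressions."""
    facts = []
    for caller, body in CALL_RE.findall(source):
        for callee in IDENT_RE.findall(body):
            facts.append(("call", caller, callee))
    return facts

def analyse(facts):
    """Trivial analysis: fan-out (number of distinct callees) per function."""
    fan_out = {}
    for _, caller, callee in facts:
        fan_out.setdefault(caller, set()).add(callee)
    return {caller: len(callees) for caller, callees in fan_out.items()}

def present(fan_out):
    """Textual presentation of the analysis results."""
    for caller, count in sorted(fan_out.items()):
        print(f"{caller}: calls {count} distinct function(s)")

if __name__ == "__main__":
    # The controller: invoke each stage in sequence over the shared fact base.
    present(analyse(extract(SOURCE)))

In a real suite, the fact base would be persisted in a shared format (for example, RSF- or GXL-style records) so that tools from different groups could read and write it.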
6.1.6 Discussion

This section gave an overview of the tools developed by researchers in reverse engineering. For the sake of brevity this was a selective introduction, but it illustrates that tool research in reverse engineering is a well-established pursuit. The abundance of tools in the discipline shows that:
• Research groups work on similar problems; and
• There are multiple solutions available within the community.
These facts contribute to the minimum level of maturity that is needed in a field before a successful benchmarking effort can be undertaken. In the next section, I will examine another indicator of this field's maturity, the empirical evaluations of the tools. Researchers undertake these comparisons when they want to move beyond proof-of-concept implementations to accumulating more lasting lessons learned.
6.2 Tool Comparisons in Reverse Engineering

While the critical comparison of tools has long been acknowledged as an important endeavour [36, 82], good empirical studies have been slow in coming. Most papers that introduce a new tool or technique include a "case study" consisting only of a technology demonstration that showcases the features of the tool. Fortunately, this state of affairs has been improving. There have been a number of studies that examine whether reverse engineering tools help or hinder programmers [23, 26, 144, 145], and there has been some examination of the methodological issues specific to the discipline [79, 96]. In this section, I review three studies where tools were compared: an evaluation of call graph extractors [101] and two comprehensive evaluations of tool capabilities [13, 20].

One study that is not included deserves mention at this point. In 1998, the Reverse Engineering Demonstration Project was organised by Elliot Chikofsky [3]. A number of reverse engineering and re-engineering tool developers were invited to apply their tools to a common subject system, WELTAB III, a set of tools for tabulating election results. The participants discussed their analysis results at conferences, but they did not publish any written reports.
6.2.1 Comparing Call Graph Extractors: Murphy et al., 1998

Murphy et al. looked at eight static call graph extractors by using them to analyse three large software systems written in C [101]. To evaluate these tools, they compared the output against a baseline and established criteria for assessing deviance from the baseline. They chose GCT (Generic Coverage Tool, for assessing test coverage) as the baseline because it was derived from a parser (GNU C) and gave the most complete and conservative results. Based on an analysis of the results from the eight tools, they suggest some guidelines for software engineers in selecting a static call graph extractor. The design of the study is summarised in Table 6-1. It shows the four factors that must be accounted for in a tool evaluation and whether they are treatment or control factors. A treatment factor is the intervention being investigated. Control factors are held constant or otherwise controlled to ensure that the study is valid and the comparison is fair. These were discussed in greater detail in Section 2.2.
Factor              Treatment / Control   Comments
Tool(s)             Treatment             cflow, cawk, CIA, Field, GCT, Imagix, LSME, Mawk, and rigiparse
Subject System(s)   Control               mapmaker (31 KLOC; a molecular biology application), mosaic (70 KLOC), gcc (290 KLOC)
Task(s)             Control               Creation of a static call graph
User Subject(s)     Control               The authors were the user subjects
Table 6-1: Summary of Research Design for Murphy, Notkin, Griswold, and Lan

The call graphs extracted by the eight tools varied a great deal. One extractor (cawk) came very close to approximating the output of the GCT baseline, while the rest uniquely reported false positives and false negatives, often simultaneously. A false positive occurs when a function call that does not exist is reported. A false negative occurs when an actual function call is not reported. Moreover, the quality of the extracted call graphs varied with the subject systems used. Murphy et al. developed the following categorisation of call graphs (not extractor tools).
                      No False Negatives    False Negatives
No False Positives    Precise               Optimistic
False Positives       Conservative          Approximate
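The categorisation can be read as a simple set comparison between an extracted call graph and the baseline. The sketch below makes that reading explicit; the call edges are invented, and the function is only an illustration of the definitions above, not code from Murphy et al.'s study.

def categorise(extracted, baseline):
    """Classify an extracted call graph against a baseline call graph.

    Both arguments are sets of (caller, callee) edges.  A false positive is
    an edge reported but not in the baseline; a false negative is a baseline
    edge that was not reported.
    """
    false_positives = extracted - baseline
    false_negatives = baseline - extracted
    if not false_positives and not false_negatives:
        return "precise"
    if not false_positives:
        return "optimistic"      # misses calls, but reports no spurious ones
    if not false_negatives:
        return "conservative"    # reports every real call, plus spurious ones
    return "approximate"         # both kinds of error

# Invented example: the baseline has three calls, the extractor misses one.
baseline  = {("main", "parse"), ("main", "report"), ("parse", "lex")}
extracted = {("main", "parse"), ("main", "report")}
print(categorise(extracted, baseline))   # prints "optimistic"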
Relative to the baseline, none of the tools produced precise call graphs. Four tools (Field, Imagix, LSMEcg, and MAWKcg) produced approximate call graphs for all three subject systems. The remaining four tools produced call graphs from more than one category. Murphy et al. go on to consider the implications of this unpredictable variability for software engineers selecting a tool. In making such a decision, they identified three aspects of the extractor that must be examined: constraints, behaviour, and engineering considerations.
• Constraints are essentially the design decisions regarding the format of the input to the extractor. Does the extractor work with unprocessed source, preprocessed source, object code with symbol table information, or executable code with symbol table information? How does the extractor deal with preprocessor directives? How are input files specified?
• Behaviour is how the tool deals with the input. What algorithms does it use to analyse the code? How does it deal with errors? Does it resolve function calls globally, or does it only provide local call information?
• Engineering considerations are issues that affect the use of the extractor by a software engineer. Murphy et al. mention programmability and the time required to perform the extraction. Is it possible to program or otherwise configure the extractor to change the output? How much time is required to write such a program? How much time does it take to run such a customised extraction?

Knowing the answers to the above questions will help a software engineer select a call graph extractor that is appropriate for the task at hand. Different tasks will have slightly different requirements. Murphy et al. suggest that identifying these requirements is an important research topic.

6.2.2 Comparing Architecture Recovery Tools: Bellay and Gall, 1998

Bellay and Gall [20] analysed an embedded software system using four reverse engineering tools to assess the applicability of the tools to that domain. They used Refine/C, Imagix4D, SNiFF+, and Rigi to analyse part of an embedded system that controls trains. The system contained
150 KLOC of C and Assembler. In the course of analysing the software, they produced a checklist of features for reverse engineering tools, which became the basis for their evaluation. The bulk of their results consisted of tables where each tool was rated on each feature as "excellent, good, minimal, or not at all." They provided some interpretation of these tables, and one of their observations was that there was no single best tool, as none of them were designed to analyze embedded software.

Factor              Treatment / Control   Comments
Tool(s)             Treatment             Refine/C, Imagix 4D, Rigi, SNiFF+
Subject System(s)   Control               Embedded train control system (150 KLOC of C and Assembler)
Task(s)             Control               Recovery of architecture information
User Subject(s)     Control               The authors were the user subjects
Table 6-2: Summary of Research Design for Bellay and Gall

In the course of analysing the train control software, the authors encountered some problems commonly found when reverse engineering a system. The software included both C and Assembler source code. The tools studied had fact extractors for C, but not for Assembler, so only data from the C portions of the system were analysed. Also, the software had to be analysed on the operating platform of the tools, which was different from the platform where the software normally ran, both of which were different from the platform where it was written. These problems would likely occur with other embedded software systems. In terms of the views of the software produced by the tools, the authors found that they, paradoxically, had both too little and too much information. The information available was produced using only static analysis, they had to build views manually to get a complete picture, and they were missing information about the Assembler portions of the code. Moreover, the capabilities of the tools were constrained by how the designers expected them to be used: on code that compiled and executed on the same platform as the recovery tool, with particular types of architectural designs. For example, they would have difficulty showing the call graph of a system with a client-server architecture. At the same time, the authors were overwhelmed by the amount of information presented in the views. The system was not large, 150 KLOC, yet the resulting graphs were difficult to read. They could view the whole diagram at once, but then the individual boxes and labels were too small to read. Alternatively, they could zoom in on the picture, so the labels were legible, but it was difficult to track the context of the local
neighbourhood. They found that they could not follow one entity, say a function, from one view to another, for example, from a call graph to the data flow graph. Clearly, the scalability of the tools was a problem that was not yet solved. In the end, they concluded that there was no single best tool, as none of them were designed to analyse embedded software. They observed that "the performance and capabilities of a reverse engineering tool are dependent on the case study and its application domain as well as the analysis purpose."

6.2.3 Comparing Architecture Recovery Tools: Armstrong and Trudeau, 1998

Armstrong and Trudeau [13] examined the fitness of five reverse engineering tools (CIA, Dali, PBS, Rigi, and SNiFF+) for architecture recovery. Their evaluation of the tools is based on an architecture extraction process with three phases: extraction, classification, and visualisation. In their paper, they describe how the tools were used during each phase and provide a critique of the available features. For the first phase, they used a small, degenerate C program, degen, which contains features that are problematic for fact extractors. In the remaining two phases, the tools were applied to CLIPS (the C Language Integrated Production System from NASA). Since SNiFF+ and CIA did not have classification features, only three tools were examined in the second phase. Similarly, Dali was not considered separately in the last phase because it uses Rigi for visualisation. In addition to degen and CLIPS, some of the tools were also used to explore the architecture of the Linux kernel.
Factor              Treatment / Control   Comments
Tool(s)             Treatment             CIA, Dali, PBS (Software Bookshelf), Rigi, and SNiFF+
Subject System(s)   Control               degen, CLIPS (40 KLOC)
Task(s)             Control               Architecture recovery: extraction, classification, and visualisation
User Subject(s)     Control               The authors were the user subjects
Table 6-3: Summary of Research Design for Armstrong and Trudeau

Armstrong and Trudeau found that each tool had strengths and weaknesses for architecture extraction. Of the tools evaluated, they found that PBS and Dali were the best ones for architecture recovery. CIA and SNiFF+ were good tools for analysing and understanding small
systems, but were weak in the classification features needed for examining architecture. Similarly, Rigi was strong in extraction and visualisation, but needed the classification features provided by Dali to be useful for large systems. They concluded that "…development should be continued by examining each others' deficiencies and features as well as incorporating new strategies and ideas for approaching the architecture analysis problem."

6.2.4 Discussion

These studies show the growing interest in the evaluation and comparison of tools in the reverse engineering community. All three of these studies were published in 1998, and the Reverse Engineering Demonstration Project took place the same year. The evaluations were based on applying tools to a common subject system and tasks. The comparisons served to highlight the relative strengths and weaknesses of the tools. However, the most dramatic benefit from these comparisons is that they expose hidden or unspoken assumptions made by the different researchers. Placing a tool in a context not designed by its developers, and examining its performance alongside those of other tools, opens up opportunities for discovery and cross-fertilisation. The overview of tools in Section 6.1 and the review of tool comparisons in this section set the stage for the work on benchmarks described in the next two chapters. The first benchmarking workshop was conducted in 1999 and there was one each year until 2002.
6.3 Benchmarking in Reverse Engineering
The first benchmark that I developed was for comparing program comprehension tools, i.e., tools that help programmers understand source code for the purpose of making modifications [136, 137]. Users of the benchmark had to complete a number of maintenance and documentation tasks on xfig 3.2.1, an Open Source drawing program for UNIX. The name of this subject system provides the name used to identify the benchmark. The second benchmark was CppETS (C++ Extractor Test Suite) [134]. I used a collection of small test programs, written primarily in C++, and questions about these programs to evaluate the capabilities of different fact extractors. Both of these benchmarks were developed in collaboration with other researchers, used by additional researchers and tool developers, and discussed in a workshop or conference. Both benchmarks produced interesting technical results, but a more significant contribution was
the deeper understanding of the tools and the research problem that they brought to the community.

6.3.1 Prerequisites

Recall from Section 5.1.1 that there are three prerequisites for working on a community-based benchmark: a minimum level of scientific maturity, a tradition of comparing tools empirically, and a culture of collaborative work. All three were present in the reverse engineering community before I began my work. The overview of tools in Section 6.1 shows not only that the community has developed a variety of tools, but that there are a number of implementations by different research groups. Many of these tools have similar features, but use different approaches. Consequently, the minimum level of maturity has been reached. The tool comparisons in Section 6.2 illustrate that there is a tradition of tool comparison in the community, so the second prerequisite was satisfied.

The community also had a track record of collaboration and discussion. The main conference in the area, the Working Conference on Reverse Engineering (WCRE), is organised to maximise discussion time. For example, paper presentations are 20 minutes long instead of the usual 30 minutes, and there is a half-hour discussion following three paper presentations. These efforts have fostered a culture of public debate and exchange of ideas at conferences. Further evidence of this community's desire to work together can be found in GXL, a format for exchanging data between software tools. This format was ratified for use by the community in January 2001 and has been adopted by over 30 researchers in 8 countries. A key segment of the community comes from Canadian universities that are part of CSER (Consortium for Software Engineering Research). CSER is organised as a collection of related nodes, and each node consists of an industrial partner and one or more university partners. It holds semi-annual meetings that all members of the research teams are encouraged to attend. At these meetings, two types of sessions are favoured: presentations of the latest research by graduate students and discussions of common themes, for example, empirical methods and exchange formats. These meetings provide further opportunities for researchers to exchange ideas and build relationships. This familiarity naturally transfers when the same researchers participate in international conferences and committees.
I applied the Benchmarking Readiness Assessment to the reverse engineering community and produced the following scores:

Maturity        8
Comparison      7
Collaboration   8

These scores were the same for both program comprehension tools and fact extractors; while the answers to specific questions differed, the final tally was the same. All three prerequisites were met, indicating the community was ready to undertake a benchmarking effort.

6.3.2 Process

CSER and CASCON served as friendly microcosms in which to try out new formats for tool evaluations. Both the xfig benchmark and CppETS had their origins in discussions at CSER meetings and were later debuted at CASCON before moving on to an international conference. Consistent with the factors for success from Section 5.1.2, the efforts were led by a small number of champions, who provided opportunities for feedback and discussion, and were supported by laboratory work by the participants. In both cases the benchmarks came about through joint work between multiple universities. The organisers, or champions, did the initial work of designing the benchmark and arranging for an opportunity to present and discuss the results at a conference. On one occasion, the actual evaluation took place at a conference. The process used for each benchmark will be described in detail in its respective chapter.
6.4 Summary

This chapter set the stage for the case studies in the next two chapters. It provided
background information on the reverse engineering community and the kinds of tools that it develops. There was a focus on tools because it is difficult to apply a technique to large legacy software without tool support. Four general categories of tools were discussed: Extraction, Analysis, Presentation, and Integration. Subsequently, three significant studies that reported on tool comparisons were reviewed: one of call graph extractors and two of architecture recovery tools. Finally, the state of the reverse engineering community was discussed to show that the discipline was ready to undertake benchmarking.
Chapter 7. A Benchmark for Program Comprehension Tools

This evaluation is often called the xfig benchmark because it uses xfig 3.2.1 as the subject system. It is intended to evaluate tools for their ability to help a developer understand source code, both conceptually and for the purpose of modifying the program. It consists of a handbook for the tool developer teams, which contains instructions and tasks for them to perform on the xfig source code, and a handbook for industrial observers, who are responsible for assessing the performance of the tools. This benchmark was first used at CASCON99 in a "structured demonstration" format; tool developers evaluated their own tools at the same time in a common setting [136]. A subset of the materials was used in a second evaluation and the results presented at WCRE 2000 [137]. This application omitted the observer handbook, because the teams worked on the tasks in their own laboratories without strict time limits, which made it difficult for them to arrange for industrial observers. The materials were then used as the basis for a term project in a graduate course at the University of Tampere [149]. In the second and third uses, the xfig demonstration should be considered a proto-benchmark because the Performance Measures were not used. This benchmark was developed jointly with Prof. Margaret-Anne Storey, University of Victoria. In this chapter, I describe the xfig benchmark and its development process, discuss its impact on the reverse engineering community, and evaluate it as a benchmark using the checklists from Section 5.3.
7.1 Description of xfig Benchmark

The design of the xfig benchmark is described in detail elsewhere [136, 137], and the Tool Developer Handbook and Observer Handbook can be found in Appendices D and E, respectively. In this section, a concise description of the Motivating Comparison, Task Sample, and Performance Measures will be given.
7.1.1 Motivating Comparison

The Motivating Comparison was to assess the out-of-the-box experience that a software maintainer would have with a program comprehension tool; in other words, to rate how quickly a tool provided useful information on a novel system. We assumed that the software maintainer was unfamiliar with the subject system, but was an expert software developer and an expert with the reverse engineering tools. We were interested in the end-user experience, starting with source code and documentation, through to when the teams used their tools to complete the assigned tasks. The assumptions about the user allowed us to consider issues such as ease of use, flexibility, and capability, rather than focusing solely on usability. It also allowed us to consider the usefulness of the tool from the task perspectives of program comprehension and software maintenance.

7.1.2 Task Sample

The Task Sample consists of the source code for xfig 3.2.1 and the Tool Developer Handbook. A précis of the tasks is given here. The teams were presented with a scenario that motivated their analysis of xfig, an Open Source drawing package for UNIX, and the deliverables we requested. They were given two reverse engineering tasks and three maintenance tasks. They were required to complete both of the former and at least one of the latter.

Reverse Engineering Task 1. Creating some documentation for the source code. Some suggestions given were a call graph, a subsystem decomposition, and a description of the main data and file structures, but the teams were free to include whatever they wished in any format.

Reverse Engineering Task 2. Evaluating the structure of the application. For this task, we gave a number of questions, such as "Was it well designed initially?", "How difficult will it be to maintain and modify?" and "Are there any GOTOs? If so, how many, and what changes would need to be made to remove them?"

The maintenance tasks were extracted from xfig's TODO file, which contained unresolved defects and new feature suggestions. The teams were instructed to outline the changes required to complete the task, but they were not required to change the code.
Maintenance Task 1. Modifying the existing command panel. The xfig program pre-dates modern conventions regarding the organisation of menus at the top of the window. For this task, the teams had to re-organise these menu items into "File," "Edit," "View," and "Help" drop-down menus.

Maintenance Task 2. Adding a new method for specifying arcs. Initially, arcs were created by specifying three points that were then used to create a spline curve. For this task, the teams had to add a feature that allowed a user to draw an arc by clicking on the centre of a circle and then selecting two points on the circumference, i.e., by specifying a radius and an angle.

Maintenance Task 3. Repairing a defect: loading library objects. When a user loaded an object from a library, the program crashed and gave a "Segmentation Fault" error. The developers had to identify the cause of the problem and suggest a solution.

7.1.3 Performance Measures

The Performance Measures were the judgements from the industrial observers. Their job was to determine whether a tool would be helpful to their own development team in industry. These observers were experienced software developers, and they were expected to act as "apprentices" to a tool team to learn about both the nuts-and-bolts aspects of running the tools and the concepts behind the analyses. The teams helped the observers with their job by teaching them about the tool. In addition, some of the questions in the observer handbook were included to provide data for the observer's assessment. This assessment was based primarily on the following questions.
• Would you use this tool as a developer?
• Do you feel that this tool has a place in your organization? For which tasks?
• Do you consider this tool difficult to use?
• After observing for the day, do you feel you are capable of using most of this tool's functionality?
• If you had not observed this tool being used, how long do you think it would have taken you to become an expert in the tool?
• To what extent did the developers use the tool to perform the maintenance tasks?
• Your team may have used other tools to perform the assigned tasks; would you find any of these tools useful in your work context?
This assessment was targeted mainly at uses of the tool for daily tasks and only peripherally at usability issues.

7.2 xfig Benchmark Development Process

The xfig benchmark arose from discussions held at CSER (Consortium for Software
Engineering Research) meetings. After meeting regularly for a couple of years and seeing presentations on the same tools repeatedly, it was clear that there were similarities. It was less obvious what the important differences and relative strengths were. There were suggestions as early as 1996 that we should find common "guinea pigs" for demonstrating tool capabilities. At the CSER meeting in the fall of 1998, this issue was discussed again, and afterwards Storey and I decided to undertake this project. We both had experience designing empirical studies, and the four tool comparisons published or held in that year (see Section 6.2) provided some examples of how this could be accomplished. In the summer of 1999, we invited various members of CSER to participate in the structured demonstration that we planned to hold during CASCON99 in October. Over the next few months, we used a web site to organise the participants [2], worked closely with the CASCON organisers to ensure that our requirements were met, and developed the materials for the benchmark.

7.2.1 CASCON99 Workshop

The structured demonstration consisted of two phases, each lasting a full day. In the first phase, the tool development teams demonstrated their tools in a live setting by applying them to a common subject software system. This took place in the technology showcase of the conference, so attendees were able to watch and ask questions. The second phase took place two days later and consisted of presentations by the tool teams and industrial observers. The day concluded with a series of panels and discussion. Five software development teams participated in the demonstration:
• Lemma, IBM RTP [161]
• PBS, University of Waterloo [55]
• Rigi, University of Victoria [100]
• TkSee, University of Ottawa [138]
• Visual Age C++, IBM Toronto Lab (this integrated development environment is no longer available; its functionality has been moved into a C/C++ plug-in for the Eclipse framework, see http://www.eclipse.org/cdt/)
A sixth team of software developers, called "Red Hack," used a set of UNIX tools to solve the same set of tasks. At the workshop, we learned a lot about the tools and about how to design such an evaluation. While some organisational details did not work out as planned, everyone, from the tool teams to the conference attendees, had an overwhelmingly positive experience. The most significant contribution of the event, and one that we had not anticipated, was the community building that occurred over the three days. We deliberately did not give any instructions on collaboration, neither condoning nor forbidding it. At first, the teams were quite competitive, but when they realised that they were having similar problems they began to work together. However, this sharing did not extend to comparing their solutions. During the workshop presentations there was a great deal of laughter as the participants looked back at their struggles. The structured demonstration allowed them to see the flaws in each other's tools, fostering a feeling of familiarity that paper presentations and technology demonstrations normally do not. The end of the workshop was an epiphany for us because the participants were so reluctant to leave; they wanted to stay behind and talk long after the workshop had ended. After CASCON99, Storey and I were already looking ahead to other evaluations that we could conduct, because the structured demonstration had given us many ideas. However, the xfig benchmark had applications beyond what we expected, and there was demand in the community to use it in other ways.
7.2.2 WCRE2000 Workshop

In February of 2000, Andreas Winter organised a Workshop on Algebraic and Graph-Theoretic Approaches in Software Re-engineering at the University of Koblenz-Landau, Germany. At this workshop, four program comprehension tools and their underlying approaches were discussed from a more theoretical point of view. One of the conclusions from the workshop was that the participating approaches were based on similar concepts and that a comparative demonstration of the tools would be very informative. The participants agreed to apply their tools to the subject system and tasks from the CASCON99 structured demonstration. A "Tools Comparison Workshop" was organised at WCRE 2000 to provide a forum for the teams from both workshops to review their experiences with the structured demonstration materials [41, 93, 109, 120, 137, 151]. In total, five tool teams reported on their experiences with this benchmark: three of the original six teams from the CASCON99 workshop (Rigi, PBS, and Red Hack) and an additional two teams from the Koblenz workshop:
• GUPRO, University of Koblenz [48]
• Bauhaus, University of Stuttgart [78]
There were two important differences between this workshop and the one at CASCON99. One, the tool teams completed the benchmark separately in their own laboratories. There were several reasons for this: the materials had already been released, so there would be no element of surprise; it was difficult to change the conference schedule to accommodate a full-day live demonstration; and, perhaps the most pressing reason, the teams needed to submit a short paper for the proceedings describing their experience and lessons learned. Two, they did not have industrial observers. As mentioned in the introduction to this chapter, it was difficult to find software developers in industry who had the time available. Also, the participants did not feel that observers were critical to achieving their objective, that is, learning from each other's tools.

At the workshop, the organisers and teams gave presentations. These largely covered the same material as the short papers: their results, the amount of time spent on the tasks, changes made to their tools to complete the tasks, other tools used to solve the tasks, and their recommendations for future changes to their tools. The presentations were followed by discussion that included the conference attendees.
The conclusion from the workshop was that we are still learning about reverse engineering tools and techniques, and about how to evaluate these contributions. This exemplar and workshop were important because they showed how tool comparisons could be done successfully.

7.3 Impact of xfig Benchmark

The primary reason that the benchmark had a major impact is that the tool developers witnessed
the shortcomings of their tools first hand. For example, defects were found in every tool, and four out of five teams had difficulty performing fact extraction on the code. In addition, there were usability issues with both the user interfaces and the functionality of the tools. This direct experience brought home to the tool developers many of the abstract lessons from previous tool evaluations, such as the importance of a good user interface and the difficulty of loading a new project into a tool. Storey and I had conducted a number of empirical studies previously [131, 133, 144, 145], so we were familiar with the findings of our own and other studies. We had expected the visualisation tools to do better on the reverse engineering tasks and the search tools to do better on the maintenance tasks. These expectations were confirmed, both by the performance of the tools and by observer comments. Moreover, the tasks were designed so that each category of tool (visualisation, advanced source code searching, and code creation) had opportunities to display its key features. The visualisation tools, PBS, Rigi, GUPRO, and Bauhaus, were designed for creating graph-based visual representations of software systems that showed file clustering. These teams focused on the same tasks and used diagrams in their documentation of the subject system. They also faced similar problems with the overwhelming amount of information in the diagrams and the difficulty of customising filters. TkSee and Lemma had features for advanced searching and tracing through the source code; both included grep-like functionality. The observers for both teams at CASCON99 were impressed with the functionality and were willing to use the tools in their daily work, but were concerned about the lack of high-level views. Finally, Visual Age and the UNIX tools are development environments, intended to be used to create and edit code.

The impact of the benchmarking process is evident in how the participants used the lessons learned. Here are some examples:
• One of the members of the PBS tool team, Eric Lee, used his experience on the structured demonstration as a starting point for his Master's thesis [88]. For this, he implemented a tool that added source code searching features to PBS to complement the existing visualisation capabilities.
• The development team for Rigi added a tool that simplified the task of loading an existing project [4]. This tool was part of their demonstration at CASCON 2000.
• The industrial observer for Lemma in the structured demonstration was assessing the tool for his development team at IBM. The DB2 group subsequently adopted Lemma into their tool kit. This was a success story for the Lemma developers because their ability to travel and attend conferences is contingent on reaching new users in the organisation.
In addition, the xfig benchmark has made contributions to the literature. There is one technical paper on the CASCON99 structured demonstration [136] and six short papers reporting on the WCRE2000 experience [41, 93, 109, 120, 137, 151]. It has also been used as an exemplar in new research [123], and the benchmark has been used to teach reverse engineering to graduate students at the University of Tampere in Finland [149]. The demonstrations clearly had benefits not only for the direct participants, but also for the wider community.

Based on the success of the xfig benchmark, we conducted two further evaluations. The first was a collaborative demonstration and the other was a benchmark for fact extractors. The former was more open-ended, and the latter was more constrained, with clearer Performance Measures. The collaborative demonstration was organised because we felt that not enough inter-tool comparison had been designed into the xfig evaluation. In this demonstration, teams worked on a common subject system and were required to share data with each other. Intermediate results were presented at WCRE2001 [146], and there are plans to revisit this exercise at a future conference, such as ICSM (International Conference on Software Maintenance) or VISSOFT (Workshop on Visualizing Software for Understanding and Analysis). The second evaluation arose in response to the problems that the teams had with fact extraction. Parsing the source code into their tools (a necessary step for all teams except the UNIX tools) was the biggest difficulty for most of the teams. Although Lemma only spent 20 minutes parsing and loading the subject system, a bug in their tool slowed their progress initially.
The others had to spend several hours modifying their parsers or customising scripts to load the software. The result is CppETS, which is discussed in the next chapter.

7.4 Evaluation of xfig Benchmark

In this section, the xfig benchmark will be evaluated using the checklists from Section 5.3.
The checklists for the benchmark as an empirical evaluation will be used in Section 7.4.1, followed by the checklists for the benchmarking process in Section 7.4.2. Since the checklists use open-ended questions, the evaluation of the benchmark will be discussed in Section 7.4.3.

7.4.1 Evaluation Design Checklists

xfig Benchmark: Overall
O1. Does the benchmark provide a level playing field for comparing tools or techniques?
    It provides a level playing field within the specified context of use: an expert tool user and expert programmer working with an unfamiliar Open Source program.
O2. Are the tools or techniques that are intended to be tested defined at the outset?
    Yes, tools that help a software maintainer to understand an existing code base.
O3. Can other tools or techniques use the benchmark?
    Yes. Architecture recovery tools were evaluated later using the benchmark at WCRE2000.
Motivating Comparison M1. Is the benchmark concerned with an important problem in the research area? M2. Does it capture the raison d’être for the discipline? M3. Are there other problems in need of a benchmark? Should one of those have been developed before this one?
Yes. Maintenance of legacy software with little documentation is a key problem. Yes. There are other technologies that could have been benchmarked (e.g. clustering, clone detection), but there were not compelling reasons to chose those over program comprehension tools.
xfig Benchmark: Task Sample
T1. Is it representative of problems found in actual practice? Yes.
T2. Is there a description of the expected user? Is it realistic? Yes, an experienced software maintainer. It was reasonable in the context of the evaluation, but in practice, few software maintainers are experts at using program comprehension tools.
T3. Is there a description of the usage context? Is it realistic? Yes. The situation was typical for a new team member becoming familiar with a program, but it was framed to be humorous.
T4. Is the selection of tasks supported by empirical work? Yes. There were previous empirical studies of programmers that were conducted in laboratories [144, 145] and in the field [131, 133, 138].
T5. Is the selection of tasks supported by a model or theory? Yes.
T6. Can the tasks be solved? Yes.
T7. Is a good solution possible? Yes.
T8. Is a poor solution possible? Yes.
T9. Would two people performing the task with the same tools produce the same solution? No. Familiarity with the tools was a factor in performance.
T10. Can the benchmark be used with prototypes as well as mature technologies? Yes.
T11. Can the tasks be scaled to be more numerous, more complex, or longer? No. However, there was a mix of required and optional tasks.
T12. Is the benchmark biased in favour of particular approaches? No.
T13. Is the benchmark tied to a particular platform or technology? Yes. The subject system, xfig, is written to run on UNIX and related operating systems. This made the problem more difficult for tools on other operating systems, but the tasks were still solvable.
xfig Benchmark: Performance Measures
P1. Does a score represent the capabilities of a tool or technique fairly? i.e. Are the results for a single technology accurate? Yes. While the tools were not scored, the judgements from the industrial observers were accurate assessments of the tool capabilities.
P2. Can the scores be used to compare two technologies directly? i.e. Are comparisons accurate? Yes. The purpose of the workshops was comparison.
P3. Are the measures used in the benchmark indicators of fitness for purpose between the technology and the tasks? Yes. Again, no score, but there was a good mapping between judgements and fitness.
P4. Is it possible for a tool or technique that does not have fitness for purpose to obtain a good performance score? Yes. Heuristic and expert reviews are common evaluation methods from human-computer interaction. Instead of using interface experts, we used task domain experts.
P5. Was the selection of Performance Measures supported by previous empirical work? No.
P6. Was the selection of Performance Measures supported by a model or theory? Unknown.
P7. Would one person using the benchmark on the same tool or technique twice get the same result? Possible, but learning effects are likely.
P8. Would two people using the benchmark on the same tool or technique get the same result? No. The judgements were based on the industrial observers' own development teams.
7.4.2 Process Checklists
xfig Benchmark: Development
D1. Was there interest in comparing results within the discipline? Yes.
D2. Was the project well publicised? No. Contact was mainly through direct email.
D3. How many people are involved in the development of the benchmark? Two people developed the benchmark, following discussion at a general meeting.
D4. How many research groups are involved? Two.
D5. Were there opportunities for the wider community to become involved? No.
D6. Were refinements made to the benchmark based on feedback? No.
D7. Was the benchmark prototyped? No.
D8. Was the design of the benchmark discussed at a meeting that was open to the community? More than once? No.
D9. Were the results of the benchmark discussed at a meeting that was open to the community? More than once? Yes.
xfig Benchmark: Deployment
E1. What proportion of the eligible technology has been evaluated using the benchmark? A majority of the eligible tools were evaluated, including some respected pioneers in the field.
E2. Was the evaluation well publicised? No. Contact was mainly through direct email. However, the workshops were publicised as part of the conference program.
E3. Was participation in the evaluation open to all interested parties? No. Some limitations were lack of publicity and the location of the workshops to discuss the benchmarks.
E4. Are the materials easy to obtain? Yes, they can be downloaded from a web site: http://www.csr.uvic.ca/~mstorey/cascon99/
E5. Are the materials clear and unambiguous? Could be improved. The tasks were completed, but not well-documented.
E6. Is there a fee for using the benchmark? No.
E7. Is there a license agreement for using the benchmark? No.
E8. How much time is required to run the benchmark? The benchmark can be completed in one day, but many teams took longer.
E9. Is the benchmark automated? No.
E10. Are specialised skills or training needed to use the benchmark? No.
E11. Are the results easy to obtain? Yes, they can be downloaded from a web site.
E12. Are the results clear and unambiguous? No. The results were based on judgements by industrial observers, so they were subject to interpretation. However, conclusions regarding the relative strengths of tools were clear.
E13. Is it possible to examine the raw data as well as the benchmark results? The solutions are available, but the raw data are not.
E14. Is there a fee for accessing the raw data or solutions? No.
E15. Is there a license agreement for accessing the raw data or solutions? No.
E16. Are there procedures in place for auditing the submitted solutions? No.
E17. Are the solutions audited? No.
E18. Is there a process for users to vet their results before they are released? No. The results were announced and discussed at a workshop before further publication, but the users had no prior knowledge of their results.
7.4.3 Discussion
The purpose of the checklists was to help determine whether a benchmark satisfied a number of technical and sociological criteria. In this section, I will use the answers given to the questions above to discuss how the xfig benchmark fulfils these criteria. There were three technical criteria for evaluating the benchmark as an empirical evaluation. These were relevance, validity, and reliability and repeatability.
• Relevance. The xfig benchmark was relevant to the discipline, and this was evident in the interest in the evaluation as an empirical study, in addition to the interest in the outcome of the comparison. After the structured demonstration, a second workshop was held at an international conference to compare tools. The participating teams used the lessons learned from the experience to improve their tools. In using industrial observers as Performance Measures, an attempt was made to make the results relevant to potential users of the tools.
• Validity. The evaluation had good internal validity because the study had a logical design and competing explanations for the results were accounted for and ruled out. The Task Sample was a realistic selection of problems encountered by software maintainers in industry, so the evaluation had good external validity. The study had reasonable construct validity as well. While the Performance Measures were based on a subjective judgement, this was mitigated by using multiple observers and different data sources to make comparisons.
• Reliability and Repeatability. The benchmark has been successfully used on multiple occasions for different purposes. The materials were resistant to misinterpretation, although the structured demonstration format was daunting to replicate. In other words, the xfig benchmark was reliable, but its repeatability could be improved.
There were four sociological criteria for evaluating the benchmarking process. These were engagement, accessibility, responsiveness, and transparency.
• Engagement. For the xfig benchmark, engagement during development was low compared to during deployment. The development effort involved two people from different research groups. The benchmark materials have since been used by many people on many tools around the world.
• Accessibility. The benchmark is highly accessible. It can be downloaded from a web site and can be completed in approximately a day by one or two people. (Although some teams and tools did require more time.)
• Responsiveness. The xfig benchmark was not responsive to feedback or results, as it was not changed between workshops or uses. However, the designers of the benchmark have used the lessons learned to develop additional tool evaluations, such as CppETS and a collaborative tool demonstration.
• Transparency. While the materials were developed in private to ensure a level playing field during the structured demonstration (i.e. the same level of ignorance in all team members), the deployment of the benchmark has been public and visible. The results have been discussed more than once, so the process has had a high degree of transparency.
For the most part, the xfig benchmark meets the technical and sociological criteria. Shortcomings in repeatability and responsiveness are overshadowed by the overall success and impact of the evaluation.
7.5 Summary
In this chapter, the xfig benchmark was presented in the context of the theory of benchmarking. Like the theory, the technical product was considered alongside the sociological process. The benchmark was evaluated both as an empirical evaluation and as a process for building consensus. These two parallel structures, product and process, technical and sociological, are responsible for scientific progress. The xfig benchmark had a significant impact both in improving research and in increasing the cohesiveness of a like-minded community.
Chapter 8. A Benchmark for C++ Fact Extractors
Fact extraction from source code is a fundamental activity for reverse engineering and program comprehension tools, because all subsequent activities depend on the data produced. As a result, it is important to produce the facts required, accurately and reliably. Creating such an extractor is a challenging engineering problem, especially for complex source languages such as C++ [44, 54]. Consequently, it would be useful to have a convenient means to evaluate a fact extractor. I have prototyped a benchmark for C++ extractors, called CppETS (C++ Extractor Test Suite, pronounced see-pets), to address this need [134].
Conceptually, writing a test suite for a fact extractor is straightforward; it is similar to writing a test suite for a compiler. The challenges appear when trying to evaluate the output from a fact extractor. The main difficulty is what to use as a standard for judging whether the answers are correct. For small hand-crafted inputs, it is also possible to hand-craft the answers for the purposes of evaluation. For larger inputs, an oracle or baseline is necessary. If there were an oracle that produced correct answers all the time, the fact extraction problem would be solved and this evaluation would not be necessary. This is clearly not the case. Murphy et al. used a baseline tool for their study [101], but in this benchmark there was greater variability among the extractors. The facts produced followed a variety of schemas, ranging from the abstract syntax tree level to the architectural level. They could be stored in a variety of formats, such as an in-memory repository or in a human-readable intermediate format, such as GXL [67]. Writing a tool to check the accuracy of facts as specified by a schema can be as difficult as writing an extractor itself. The compromise solution for CppETS was to use small hand-crafted inputs and outputs, and this design decision had a significant effect on scalability.
CppETS 1.0 was first used in the fall of 2001 and discussed in a workshop at CASCON2001. I was the champion for this prototype, but the effort had other less active supporters. It was guided by discussions held at CSER meetings on this topic, and Mike Godfrey (University of Waterloo) provided feedback and source code. Based on lessons learned at CASCON, CppETS 1.1 was developed and discussed at IWPC (International Workshop on
Program Comprehension) in June, 2002. Holger Kienle (University of Victoria) joined me as a champion for this second workshop.
This chapter will have a similar structure to the previous one. It will cover the components of the CppETS benchmark, the development process used, and its impact. The chapter will conclude with an evaluation using the checklists developed in Section 5.3.
8.1 Description of CppETS
The CppETS benchmark characterises C++ extractors along two dimensions: Accuracy and
Robustness. It consists of a series of test buckets that contain small C++ programs and related questions that pose different challenges to the extractors. Tool teams used their tools to produce solutions to these test buckets, and these solutions were scored by the benchmark organisers. The scores were used to evaluate the fact extractors, as well as the veridicality of the benchmark. Version 1.0 of CppETS was released in September, 2001 and was discussed at CASCON2001 the following month [130]. A paper on this evaluation was subsequently published at IWPC2002 [134], where CppETS 1.1 was discussed in a working session.
8.1.1 Motivating Comparison
The purpose of CppETS is to rate the accuracy and robustness of fact extractors for the C++ programming language. We reviewed the evaluations by Murphy et al. (Section 6.2.1) and by Armstrong and Trudeau (Section 6.2.3) in order to characterise the design space for fact extractors. It appeared that these extractors traded accuracy for robustness. Some extractors used a compiler-based approach and performed a full analysis of the source to produce facts. While these extractors tended to be highly accurate, they could not handle input that did not conform to the expected grammar. Examples of these tools are Acacia [33] and rigiparse [100]. Others used more approximate approaches, such as lexical matching, and these could handle unexpected constructs more easily. SNiFF+ [150] and LSME [103] are examples of this second approach. Their philosophy can be summed up as, "it's not perfect, but something is better than nothing." We used accuracy and robustness as the two dimensions for evaluation in CppETS; see Figure 8-1. (The data points have been included for illustrative purposes and do not represent any
particular extractors.) Full analysis approaches would be situated in the top left corner of the graph, with high accuracy but low robustness, as it is a significant engineering problem to incorporate a high degree of fault tolerance into a parser. Lexical matching approaches would be situated in the bottom right corner, with low accuracy but high robustness. An ideal extractor would have both high accuracy and high robustness.
[Figure 8-1 plots Accuracy (0 to 1) against Robustness (0 to 1), with the full analysis region at high accuracy and low robustness and the lexical matching region at low accuracy and high robustness.]
Figure 8-1: Conceptualisation of Design Space for Extractors
8.1.2 Task Sample
The Task Sample for CppETS is a collection of test buckets, each consisting of small C++ test programs and an associated Question File that asked about different facts that could potentially be extracted from the source code. Teams had to perform an extraction on these test buckets and show that the extractor produced answers to the questions. The solution to a test bucket consisted of the output from the extractor and an Answer File that served as documentation and/or concordance to the output.
The test buckets contained a collection of source code or test cases that were representative of the problems an extractor would have to deal with in actual practice. Selection of the Task Sample was performed by enumerating mundane and problematic C++ language features, analysis problems, and reverse engineering issues. This list was then used to create a series of test buckets. The source code for the test buckets came from a variety of sources. Some were specially written for the benchmark. Others were donated by IBM and by Michael Godfrey. Some were taken from books and web sites. These test cases were small, typically less than 100 lines of
code, and none more than 1000 lines. I had considered using C++ compiler test suites, such as the one distributed with GNU g++ [152], and commercial C++ validation suites from Perennial [110] and Plum Hall [114]. However, these suites test the minutiae of the C++ language using thousands or tens of thousands of test cases, typically using an automated testing harness. There are too many test cases with too much detail to include any suite completely in CppETS, and doing so would not have improved the quality of the benchmark, as there is already a representative sample.
We created three categories of test buckets, Accuracy, Robustness, and System, with the first two corresponding to the two dimensions of our Motivating Comparison and the third combining them. CppETS 1.1 contains 30 test buckets: 19 in the Accuracy category, 10 in Robustness, and one in System. These test buckets and the rationale for them will be discussed in the remainder of this section.
8.1.2.1 Accuracy Category
Figure 8-2 lists the groups of test buckets in the Accuracy category. All of the test buckets in this category used only ANSI standard C++ syntax [50]. Not all of them followed modern (i.e. post-ANSI standard) C++ idiom, but this was consistent with extant legacy code. The preprocessor directives present their own class of difficulties, so they were given their own test group (#1-3). The purpose of the Preprocessor group is to determine whether the extractor analyses the source code before or after preprocessing and the correctness of the facts produced. An extractor that analyses the source code before preprocessing will produce facts that are faithful to the programmer’s view of the source code, but may miss some information. For example, macros can be combined to create source code or to define different build configurations. On the other hand, an extractor that analyses the source code after preprocessing will be missing information about preprocessor directives such as macros.
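As a minimal illustration of why this matters (the fragment below is a hypothetical example written for this discussion, not one of the actual test buckets), consider a macro that generates a member function. An extractor that works on the un-preprocessed source sees the macro definition and its invocation but no function named get_size; an extractor that works on the preprocessed source sees get_size but no trace of MAKE_GETTER.

    // Hypothetical example: a macro that generates code.
    #define MAKE_GETTER(name, field) int get_##name() const { return field; }

    class Buffer {
    public:
        MAKE_GETTER(size, size_)   // expands to: int get_size() const { return size_; }
    private:
        int size_;
    };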
Preprocessing: 1. Macros; 2. Conditional Compilation; 3. Pragmas
C++ Syntax, Data Structures: 4. array; 5. enum; 6. union; 7. struct
C++ Syntax: 8. Variables; 9. Functions; 10. Templates 1; 11. Templates 2; 12. Operators 1; 13. Operators 2; 14. Exceptions; 15. Inheritance; 16. Namespaces; 17. Statements; 18. asm keyword; 19. Types
Figure 8-2: Test Buckets in Accuracy Category
The second group (#4-15) is concerned with C++ language features. The purpose of this group is to test identification of language features and resolution of references, mainly calls to functions and uses of variables. These test buckets include many of the potential extractor problems identified by Armstrong and Trudeau [13], such as an implicit call to a function using a pointer, array traversal using indices and pointer arithmetic, multiple variables with the same name, and usage of data structure elements.
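To make the flavour of these buckets concrete, the following fragment is a hypothetical sketch (not taken from the benchmark) that packs several of the probed constructs into a few lines: a call through a function pointer, two variables sharing a name in different scopes, array traversal by index, and uses of data structure members.

    // Hypothetical sketch of constructs probed by the Accuracy buckets.
    struct Point { int x; int y; };

    static int total = 0;                  // file-scope 'total'

    int sum(const int* values, int n) {
        int total = 0;                     // local 'total' shadows the file-scope one
        for (int i = 0; i < n; ++i)
            total += values[i];            // array traversal using an index
        return total;
    }

    int apply(int (*fn)(const int*, int), const int* values, int n) {
        return fn(values, n);              // implicit call through a function pointer
    }

    int manhattan(const Point& p) {
        return p.x + p.y;                  // usage of data structure elements
    }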
8.1.2.2 Robustness Category
Figure 8-3 lists the test buckets in the Robustness category. These test buckets are intended to represent the kinds of problems encountered in reverse engineering.
Incomplete Information: 20. Missing source; 21. Missing header; 22. Missing library
Dialects: 23. GNU g++; 24. MS Visual C++; 25. IBM VisualAge C++
Heterogeneous Source: 26. C and Fortran; 27. Embedded SQL
Generated Code: 28. lex/yacc; 29. GUI Builder
Figure 8-3: Test Buckets in Robustness Category
The Incomplete Information test buckets (#20-22) are standard C++ source code, but with a file missing. On a reverse engineering project, the client may have neglected to provide a file, or worse, may not be able to provide a file. The test buckets in the Dialects group (#23-25) contain compiler extensions. These tests can be considered to be C++ with extra keywords. These test buckets are representative of those situations where the legacy source code was developed using a compiler that has a slightly different grammar than the extractor. The Heterogeneous Source tests (#26-27) are C++ (or C) together with statements from another source language. Programming languages are often combined to perform special purpose tasks, for example embedded SQL for interfacing with databases and FORTRAN for scientific computing. The non-C++ code is normally handled by another tool, such as a preprocessor for embedded SQL or another compiler for FORTRAN. Unfortunately, appropriate tools for fact extraction are rarely available. The Generated Code tests (#28-29) contain files that are not C++ at all, but rather descriptions used to generate C++. These descriptions may be grammars, state charts, or GUI resources, and they, not the generated source code, are the maintenance artifacts. Consequently, these test buckets treat the inputs to the code generator as the maintenance artifacts. Often, the appropriate tool is not available to generate the source code or analyse the initial descriptions.
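For instance, a Dialects-style test case can be quite small and still defeat a strictly ANSI front end. The fragment below is a hypothetical example using GNU __attribute__ extensions (it is not one of the actual test buckets): a compiler-based extractor built on a strict ANSI grammar will reject it, while a lexical approach can simply skip the unfamiliar tokens.

    // Hypothetical Dialects-style test case: standard C++ apart from the
    // GNU-specific __attribute__ keyword.
    struct Packet {
        unsigned char type;
        unsigned int  length;
    } __attribute__((packed));

    void log_packet(const Packet& p) __attribute__((noinline));

    void log_packet(const Packet& p) {
        (void)p;   // logging elided
    }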
8.1.2.3 System Category
This final category contained only one test bucket, consisting of a portion of code from the base infrastructure of SUIF 2.2 (Stanford University Intermediate Format), a framework for supporting research on optimising and parallelising compilers. The purpose of this test bucket was to present a larger body of code that posed both accuracy and robustness problems at the same time. Three subsystems consisting of 29 000 lines of code that formed a self-contained unit were selected.
8.1.3 Performance Measures
The Performance Measures for CppETS were scores awarded by human judges according to a marking scheme. The solutions submitted earned points for correct answers and completeness of documentation.
It was not easy to find good Performance Measures for this benchmark. Taking an arbitrary extractor and examining its output for completeness and correctness is not a simple problem. The facts produced could be stored in memory, in a binary-encoded database, or in a human-readable intermediate format, such as GXL [67]. The output schema of the extractors could also vary significantly, ranging from the abstract syntax tree level to the architectural level. Writing a tool to check the accuracy of facts as specified by a schema can be as difficult as writing an extractor itself. We handled this challenge by making two simplifying assumptions.
1. The output of the extractors must be stored in a text file that was human-readable. Alternatively, the extractor could be accompanied by a tool that allowed users to query the factbase. This assumption excluded tools that store the facts in memory from using the benchmark. In practice, this assumption affected few tools and did not affect any research tools.
2. Operators/users of the extractors would be involved in assessing the output, to simplify the problem of comparing output with different schemas and formats.
Using these two assumptions, I devised the following Performance Measures for the tests in the benchmark. Along with the source code in each test bucket, there was a text file containing questions about the program. The answers to the questions were also provided and it was the
responsibility of the person operating the extractor to demonstrate that these answers could be found in the parser output. The questions covered a variety of topics, including simple recognition and resolution of language constructs and their attributes. For the recognition questions, the human operator of the extractor was required to show the output for a specified feature, such as a template or exceptions. Sometimes questions asked for a comparison of related language features, such as a class and a struct. In terms of resolution, there were questions to determine whether the extractor correctly linked a reference with its declaration or definition. For attributes, I asked for location information in varying combinations: file name, start, end, line, character on a line, and byte offset from the start of the file. The questions covered a wide range of functionality and data models, so a variety of extractors could be tested with the same material. Consequently, no single extractor was expected to be able to correctly answer all of the questions.
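To give a sense of the format (the wording below is hypothetical and does not reproduce an actual CppETS Question File), a recognition question might be posed against a fragment like the following, with the tool operator writing an Answer File that points to the corresponding records in the extractor's output.

    // Hypothetical test program for a recognition/comparison question such as:
    // "Show the facts your extractor produces for 'Config' and for 'Reader',
    //  and indicate whether it distinguishes a struct from a class."
    struct Config {
        int verbosity;
    };

    class Reader {
    public:
        explicit Reader(const Config& c) : config_(c) {}
        int verbosity() const { return config_.verbosity; }
    private:
        Config config_;
    };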
8.2 CppETS Development Process
The development of CppETS began as an off-shoot of the xfig benchmark. As mentioned in the previous chapter, the common problems faced by the tool teams further added momentum to work on fact extraction. In Section 5.1.2, the theory of benchmarking identifies three factors needed for a successful development process. These were champions to lead the process, design decisions that were supported by research, and opportunities for community participation and feedback. All three were present in the development of CppETS.
I started championing this benchmark in 2000 to build on the lessons learned from the xfig benchmark. I was joined by Holger Kienle (University of Victoria) after the CASCON2001 workshop on CppETS to work on Version 1.1. This benchmark built on discussions about C++ fact extraction that were already taking place in the reverse engineering community. Bruno Laguë, Sébastien Lapierre, and Charles Leduc [66, 84] gave early guidance on the contents of the Task Sample. Mike Godfrey critiqued the work in progress and contributed source code. While developing the benchmark, Kienle and I tested the benchmark in our own laboratories, using different compilers, operating systems, and computer systems. In addition, I used two publications extensively. The first was a paper by Bowman, Godfrey, and Holt that described
some issues in exchanging data between reverse engineering tools [27]. The second was a paper examining the schema and data model required for C++ at the abstract syntax tree level to be used with GXL [54].
To date, there have been two opportunities for the community to participate in the development of the benchmark. CppETS 1.0 was discussed at CASCON2001 in November 2001. The following fact extractors participated in this evaluation: Acacia (AT&T, represented by University of Waterloo) [33], cppx (University of Waterloo and Queen's University) [6, 44], Rigi C++ Extractor (University of Victoria) [4, 100], and TkSee/SN (University of Ottawa) [5]. CppETS 1.1 was discussed at IWPC (International Workshop on Program Comprehension) in June, 2002. The participants in this workshop were Acacia, cppx, TkSee/SN, and Columbus (University of Szeged and FrontEndART) [9, 53]. For both workshops, the benchmark was published in advance and teams submitted their solutions for scoring prior to the workshop. At the workshop, the teams gave presentations on their extractors and how they did on the benchmark. The organisers presented the rankings of the teams and chaired a discussion of the benchmark. At both workshops, the participants and the organisers gained new insights into both their tools and the underlying research issues.
8.3 Impact of CppETS
According to my theory, a successful benchmark is one that has an impact on the discipline. Some of this impact was immediately evident at the workshops. Other effects have appeared in the months since the workshops. Lessons learned on a conceptual level will take longer to appear. I will first examine the influence that the benchmark had on our understanding of the problem of fact extraction, and then discuss specific tools.
Since both the CASCON and IWPC workshops were small, every participant and attendee had an opportunity to speak. They felt that CppETS was filling an important gap in the work on fact extraction and that this work should continue. They also provided many suggestions for additions (but not deletions) to the benchmark. Everyone learned from the experience. For instance, the fact extractor with the highest score was a surprise to the participants. For the first workshop, the success of TkSee/SN was unexpected because it was not highly visible. At the second workshop, cppx's score was a surprise because it had placed last in the previous workshop.
CppETS has improved both the technical results and the cohesiveness of a community by acting as a vehicle to further our shared understanding of the problem of fact extraction. The benchmark consists of a series of small extraction problems, which made it easy to identify points of disagreement in terminology. Once we were able to establish a common vocabulary for the C++ language features and their analysis, we were able to discuss our conceptualisation of fact extraction more clearly.
An example of this improved understanding relates to identifying the location of a code fragment. Consider the question, "On what line does the definition for function 'average' start?" One possible answer is that it starts on the line containing the function signature, e.g. int average (int *list). Another possible answer is that it starts on the line containing the curly brace that denotes the start of the block containing the function body. An argument can be made for both interpretations, and neither can be said to be more correct without some idea of how this fact will be used by a subsequent analysis. (A small sketch of this layout question appears below.) This problem is further complicated by preprocessor directives and subsequent transformations that occur during the compilation process. These operations may cause a feature in the original source code to move or disappear entirely.
While this problem might seem trivial, it led us to the insight that the correct answer depended upon when fact extraction intervened in the compilation process. A typical sequence of stages for C++ compilation is: preprocessing, lexical analysis, syntax analysis, semantic analysis, code generation, and linking. Inserting fact extraction before or after any of these stages will yield a different "correct" answer. Moreover, the decision regarding where to insert extraction reflects how the facts will be used in subsequent analysis [65]. In other words, we need to have a Task Sample that consists of more than just source code. By the same token, fact extractors need to be clearer about what analyses their data models support. These are the kinds of insights needed to arrive at a standard schema for C/C++ suitable for use with GXL [54].
By the end of the day, there was a high level of energy in the room, and researchers felt a renewed sense of purpose for their work. They now have a shared experience of having worked together successfully, which further cements their relationships. These strengthened ties and personal history will make it easier for them to work together in the future. They are looking forward to the next version of CppETS and to co-authoring a paper together.
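The fragment below is a minimal, hypothetical layout (written for this discussion, not drawn from the benchmark materials) that makes the location ambiguity discussed above concrete: an extractor may report the definition of average as starting on the line with the signature or on the line with the opening brace beneath it.

    // Hypothetical layout illustrating the location question discussed above.
    int average(int* list, int n)   // does the definition start here...
    {                               // ...or here, at the opening brace?
        int total = 0;
        for (int i = 0; i < n; ++i)
            total += list[i];
        return n == 0 ? 0 : total / n;
    }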
The benchmark has had an impact on the development trajectory of several fact extractors. Following the CASCON2001 workshop, developers from IBM and Sun have downloaded the benchmark and are using it as part of their internal test processes. The benchmark tasks uncovered defects in all the extractors that participated in the workshops. The extractors that were in active development (Columbus, cppx, and TkSee/SN) made repairs both during the benchmarking exercise and in the time since. For TkSee/SN, the benchmark posed questions about what should be included in their schema. For example, they had previously assumed that their schema would not include information about templates, but their experience with CppETS caused them to re-visit this decision. The cppx extractor had the lowest score at the CASCON2001 workshop, so they put a concerted effort into refining their tool, and obtained the highest score at the IWPC 2002 workshop, only six months later. Some of these changes included completing the implementation of the extractor (including linking to resolve references) and improving the searching capabilities of their analysis tool [30]. Also, Ian Bull's Master's thesis generalised from experience gained while producing solutions to CppETS using cppx [31].
At the first workshop, all of the extractors that participated used COTS (commercial off-the-shelf) components as front ends. TkSee/SN was able to distinguish itself from the others because it used Cygnus Source Navigator as a front end plus some home-grown scripts to augment the factbase. Source Navigator is a code browser, so it is intended to be used to extract and display information about source code rather than compile it. The remainder relied on the intermediate representations from a compiler: Ccia used a front end from Edison Design Group; cppx used GNU gcc; and Rigiparse used VisualAge C++. Consequently, the capabilities of these extractors were quite similar.
8.4 Evaluation of CppETS
In this section, CppETS will be evaluated using the checklists from Section 5.3. The checklists for the benchmark as an empirical evaluation will be used in Section 8.4.1, followed by the checklists for the benchmarking process in Section 8.4.2. Since the checklists use open-ended questions, the evaluation of the benchmark will be discussed in Section 8.4.3.
8.4.1 Evaluation Design Checklists
CppETS: Overall
O1. Does the benchmark provide a level playing field for comparing tools or techniques? No. Over the course of the evaluation, we found it was very difficult to separate the performance of the fact extractor and the query tool used to create the solutions. Some fact extractors with poor query tools created poor solutions.
O2. Are the tools or techniques that are intended to be tested defined at the outset? Yes.
O3. Can other tools or techniques use the benchmark? Yes, for example for testing.
CppETS: Motivating Comparison
M1. Is the benchmark concerned with an important problem in the research area? Yes. Fact extraction is a fundamental, but unglamorous, problem in reverse engineering.
M2. Does it capture the raison d'être for the discipline? No, but it is a key technology.
M3. Are there other problems in need of a benchmark? Should one of those have been developed before this one? The arguments here are similar to the ones for the xfig benchmark. Fact extraction for C++ was chosen because i) it is a very difficult problem and ii) insights from this evaluation were needed to develop a standard schema for C++ ASTs for GXL.
CppETS: Task Sample
T1. Is it representative of problems found in actual practice? The benchmark is representative of the breadth of problems, but not the size (scale) of ones found in industrial settings.
T2. Is there a description of the expected user? Is it realistic? No. We assumed (incorrectly) that the performance of the fact extractors would be independent of user characteristics.
T3. Is there a description of the usage context? Is it realistic? No to both.
T4. Is the selection of tasks supported by empirical work? Yes, there was previous work on fact extraction and C++ language schemas [13, 27, 54, 101].
T5. Is the selection of tasks supported by a model or theory? Yes, C++ language references [50, 148].
T6. Can the tasks be solved? Yes. No extractor was expected to solve all the tasks, and some tasks could not be solved by any of the participating fact extractors.
T7. Is a good solution possible? Yes.
T8. Is a poor solution possible? Yes.
T9. Would two people performing the task with the same tools produce the same solution? No. The quality of the solutions depended a great deal on the user. We assumed that users would put in the required effort to obtain a good result on the benchmark.
T10. Can the benchmark be used with prototypes as well as mature technologies? Yes.
T11. Can the tasks be scaled to be more numerous, more complex, or longer? No.
T12. Is the benchmark biased in favour of particular approaches? No.
T13. Is the benchmark tied to a particular platform or technology? Yes, the C++ programming language, but it is otherwise independent.
CppETS: Performance Measures
P1. Does a score represent the capabilities of a tool or technique fairly? i.e. Are the results for a single technology accurate? Yes.
P2. Can the scores be used to compare two technologies directly? i.e. Are comparisons accurate? Yes.
P3. Are the measures used in the benchmark indicators of fitness for purpose between the technology and the tasks? Yes.
P4. Is it possible for a tool or technique that does not have fitness for purpose to obtain a good performance score? Yes. This was a problem uncovered over the course of the evaluation.
P5. Was the selection of Performance Measures supported by previous empirical work? No.
P6. Was the selection of Performance Measures supported by a model or theory? Yes.
P7. Would one person using the benchmark on the same tool or technique twice get the same result? Not necessarily. The results could improve (learning by the user) or degrade (inattentiveness).
P8. Would two people using the benchmark on the same tool or technique get the same result? No. User characteristics such as motivation and familiarity with the tool affected performance.
8.4.2 Process Checklists
CppETS: Development
D1. Was there interest in comparing results within the discipline? Yes.
D2. Was the project well publicised? No, mainly through direct contact by email.
D3. How many people are involved in the development of the benchmark? Primarily two designers, with feedback from many others, approximately ten.
D4. How many research groups are involved? Two, with feedback from five.
D5. Were there opportunities for the wider community to become involved? Yes.
D6. Were refinements made to the benchmark based on feedback? Yes.
D7. Was the benchmark prototyped? No.
D8. Was the design of the benchmark discussed at a meeting that was open to the community? More than once? Yes.
D9. Were the results of the benchmark discussed at a meeting that was open to the community? More than once? Yes.
CppETS: Deployment
E1. What proportion of the eligible technology has been evaluated using the benchmark? A majority. C++ fact extractors are difficult to implement, so there are only a few in the community.
E2. Was the evaluation well publicised? No, mainly through direct email. However, the workshops were publicised as part of the conference program.
E3. Was participation in the evaluation open to all interested parties? Yes.
E4. Are the materials easy to obtain? Yes, they can be downloaded from a web site.
E5. Are the materials clear and unambiguous? Yes, but there were errors in them.
E6. Is there a fee for using the benchmark? No.
E7. Is there a license agreement for using the benchmark? No.
E8. How much time is required to run the benchmark? About 3 days, excluding time to modify the fact extractor.
E9. Is the benchmark automated? No.
E10. Are specialised skills or training needed to use the benchmark? No.
E11. Are the results easy to obtain? Yes.
E12. Are the results clear and unambiguous? Yes.
E13. Is it possible to examine the raw data as well as the benchmark results? Yes.
E14. Is there a fee for accessing the raw data or solutions? No.
E15. Is there a license agreement for accessing the raw data or solutions? No.
E16. Are there procedures in place for auditing the submitted solutions? No.
E17. Are the solutions audited? No.
E18. Is there a process for users to vet their results before they are released? No. The results were announced at a workshop and discussed, prior to wider publication. However, users had no knowledge of the results beforehand.
8.4.3 Discussion
The checklists used in the previous two subsections provide a detailed evaluation of both the benchmark and the benchmarking process. Using this data, CppETS can now be assessed in terms of how well it met the technical and sociological criteria used to create the checklists. As an empirical evaluation, there were three technical criteria for examining CppETS.
• Relevance. Fact extraction is a fundamental problem in reverse engineering and C++ is a difficult language to analyse. CppETS tackled a problem studied by a small group of technical specialists, but with implications for the larger community. In other words, CppETS was relevant, but dealt with only a small, detailed aspect of the large problem of reverse engineering software.
• Validity. There were some problems with internal validity and construct validity, but external validity was good. The benchmark attempts to evaluate fact extractors independently from downstream tools and analyses. In practice we could not make a clean separation, for two reasons. One, factbases are not usable without a query tool. In other words, teams could not produce solutions to the test buckets without some way to search the voluminous output of the extractor. Two, the downstream analysis to be performed determines the requirements for the data, and in turn the criteria for evaluating the fact extractors. The problem of modelling how fact extractors function in a tool stream has still not been completely solved. This difficulty in operationalising fitness for purpose led to additional problems with the Performance Measures. However, both the benchmark designers and the tool builders felt that the final results were a fair assessment of their fact extractors and that the results were indicative of how the extractors would perform in practice.
• Reliability and Repeatability. The evaluation was repeatable (and has been repeated), but the amount of human processing required affected its reliability. User characteristics such as motivation, attentiveness, and expertise had a significant effect on the quality of solutions produced and in turn the scores received.
As a process for community building and scientific progress, there were four sociological criteria.
• Engagement. There were many opportunities for members of the community to become involved during both development and deployment. Prior to the start of development, there were opportunities for researchers to comment on the design of the benchmark. In addition, many people contributed source code for the test buckets or provided feedback. Following the first deployment of the benchmark, one of the users became a champion.
• Accessibility. CppETS had good accessibility. It could be downloaded from a web site and required approximately three days to complete (excluding time to repair defects in the extractor).
• Responsiveness. The process to develop and deploy CppETS was responsive. It used feedback obtained prior to development. As well, the test buckets and documentation were changed after the first workshop.
• Transparency. For the most part, CppETS was transparent. The test buckets and solutions were available for scrutiny and the results were discussed at workshops. However, the scoring system, while well-documented, was not obvious. This shortcoming is indicative of how difficult it was to find good performance measures.
Compared to the xfig benchmark, CppETS fulfilled more of the criteria but was less interesting to the general community because it dealt with a more technical problem. This was a deliberate choice because a constrained, technical problem allowed scores and rankings to be produced for the different tools. Despite this lack of charm, CppETS was responsible for improved cohesiveness and understanding among the small group that worked on this problem.
8.5 Summary
The previous three chapters of this dissertation demonstrated the application of the theory of benchmarking and the benefits of benchmarking within the reverse engineering community. Chapter 6 gave a brief introduction to reverse engineering and demonstrated that the community met the prerequisites, as described in Chapter 5, for undertaking a benchmarking effort. Chapters 7 and 8 described the xfig benchmark for program comprehension tools and CppETS, respectively. Both of these benchmarks were evaluated within the context of the theory using the technical and sociological checklists from Chapter 5.
CppETS, a benchmark for C++ fact extractors, was discussed in this chapter. It consists of a collection of test buckets, each containing source code and a file posing questions about the code. Users were expected to run their tools on the source code and to document where the answers to the questions could be found in the extracted facts. Both the CppETS benchmark and the benchmarking process were discussed. The former was critiqued as a technical artefact and an empirical evaluation. The latter was critiqued as a mechanism for building consensus and community cohesiveness. In both respects, CppETS had a positive impact on the small group of researchers working on this difficult problem.
Chapter 9. Conclusion
If you want to build a ship, don’t drum up people to collect wood and don’t assign them tasks and work, but rather teach them to long for the endless immensity of the sea. –Antoine de Saint-Exupery
Since this dissertation began with an ending, it is fitting that it ends with a beginning. Chapter 1 opened with a description of a few moments after the end of the structured demonstration at CASCON99. This dissertation presented the theory of benchmarking as I currently understand it. Like benchmarking, scientific discovery is a process. This final chapter of the dissertation is a launching point for future work to refine and apply the theory.
9.1 Summary
This dissertation argued that benchmarking causes a scientific discipline to leap forward
because it operationalises its paradigm, and it leverages the underlying mechanisms for scientific progress. This occurs because a benchmark turns an abstract concept into a concrete guide for action that the research community can use to pull together in the same direction. Paradigms become more mature through accretion; technical results are added through a process of community consensus. Consequently, scientific progress requires both high quality research results and a cooperative community that can agree. A benchmark is developed using the same mechanisms as scientific progress, technical refinements and social process. In other words, benchmarking causes a discipline to advance because it utilises the processes that cause maturation to occur. With this insight into benchmarking, it is a small step to infer that a discipline can be made more mature by using benchmarking. These ideas were borne out by the case histories of TPC-A™, SPEC CPU2000, and the TREC Ad Hoc Retrieval Task, and the benchmarks that I developed within the reverse engineering community.
The theory was validated both empirically and analytically. The empirical validation was used to establish the external validity of the theory. The theory was applied to the three case histories used in its formulation to assess how well it postdicts, and it was found that the theory did fit the data. The theory was also applied to two novel benchmarks, optical flow and KDD Cup, to assess how well it makes predictions, and it was found that the theory did so reasonably well. An analytic validation was used first to establish the internal validity of the theory. A second validation was accomplished by creating a rival theory that represented the status quo explanation for the effect of benchmarking in a scientific community, that is, considering benchmarks only as a well-designed and well-executed empirical study. The theory of benchmarking and the rival theory were then compared using a hierarchical set of criteria, which were postdictive power, generality of explanans, hypothetical yield, progressiveness of research program, breadth of policy implications, and parsimony. With this evaluation, progression through the criteria stops when one theory is found to be superior. It was found that the theory of benchmarking was better able to account for the impact of benchmarking on a scientific community than the rival theory, so only one criterion was needed.
While a theory provides a causal explanation, its practical value becomes evident in its applications and policy implications. To this end, a process model, a Benchmarking Readiness Assessment, and evaluation criteria for benchmarking were developed. The process model was created by using the theory to identify common elements in the case histories. From this process model, the Benchmarking Readiness Assessment was created. This instrument is used to determine whether a research area and an associated technology have met the prerequisites for a successful community-based benchmarking project. The evaluation criteria assess the benchmarking project both as an empirical evaluation and as a social process. These high-level criteria were made more concrete using a series of checklists.
All of these instruments can be used to guide a benchmarking effort. The process model provides a road map. The Benchmarking Readiness Assessment can be used as a gate to prevent false starts. The criteria and checklists can be used as decision-making aids during a benchmarking effort or to evaluate a finished product. Use of these tools was illustrated in the presentation of two benchmarks that I developed with the software reverse engineering community, the xfig benchmark for program comprehension tools and CppETS, the C++ Extractor Test Suite. The xfig benchmark pre-dates
the creation of the theory, so these tools provide an interpretative lens. CppETS was created after the theory was formulated, so it benefited from the guidance of some of these tools. Both of these benchmarks have been used extensively by the community. The xfig benchmark has been used as the topic of two workshops and as assignments in two graduate classes. Since its creation, it has been used to evaluate fifteen program comprehension tools. CppETS has been the subject of two workshops and has been used to evaluate three out of four C++ fact extractors in active development by researchers. Both of these benchmarks are available for download on the Internet and have been published at international conferences [134, 136].
9.2 Future Work
The research in this dissertation has only begun to explore a theory of benchmarking. The
theory itself and the associated models and checklists still need further refinement and validation. Unfortunately, the time scale of a Ph.D. is too short to create a benchmark and assess its impact. The development of TPC-A™ began in 1985 with the Anon et al. paper; the benchmark was released in 1989 and retired in 1995. SPEC CPU2000 replaced a benchmark that was five years old and slated to remain in use until 2004. The TREC Ad Hoc Task was used for eight years before it was retired. Consequently, there is a great deal of future work in refining and applying the theory. Like many research projects, there is more potential future work than was completed during the study. The future work falls into two broad categories: refinement and validation of the theory of benchmarking, and further validation of the instruments that guide its application.
9.2.1 Expansion of Theory to Other Stages
The theory of benchmarking as described in this dissertation has been primarily concerned with the normal science stage in the structure of scientific revolutions. During the normal science stage, the paradigm has been established and scientific work can be characterised as puzzle-solving. The creation of the benchmark is a process of articulating the paradigm as a standardised test for research contributions. While the theory adequately describes this process, it does not adequately account for the effect of benchmarking on the other stages of the scientific revolution. Of particular interest to software engineering is the role of benchmarking during the pre-science stage because there are many sub-areas where a paradigm has not yet been established.
Evidence of these maturity issues can be found in the concerns about lack of validation discussed in Chapter 2 and in debates on whether software engineering is an engineering discipline, which was mentioned in Section 3.2.1. The theory of benchmarking and the case histories tell us that a benchmark reifies a paradigm, that is, it takes an abstract concept and turns it into a technical artefact. Attempts to build a benchmark where a community does not have a paradigm will quickly expose this deficit. Aside from identifying this deficiency, what will undertaking a benchmarking project do for such a community, and how? The experience with benchmarking during the normal science stage suggests that work on the benchmark will be helpful, because it will increase consensus and in turn raise maturity. However, I have not done the data collection and analysis to make statements beyond this.
Expanding the theory to other stages of the scientific revolution will be hindered by the same scarcity of data on the social aspects of benchmark development as was seen in the research conducted for this dissertation. Researchers rarely document the emergence of a benchmark and the community reaction to it, and this problem is even more acute outside the normal science stage. During the pre-science stage, they lack the perspective or insight into the importance of the work. During the crisis and revolution stages, times are chaotic and researchers are often entrenched in the rejection or advocacy of a paradigm. Data to support expansion of the theory to the other stages may come from the work to validate the theory and to develop benchmarks in software engineering. These directions for future work will be discussed over the next few sections.
9.2.2 Validation of the Theory
This dissertation was mainly concerned with the technical artifacts produced by benchmarking and the social factors that affect them. Although beyond the scope of this dissertation, an empirical validation of the benchmarking process should be conducted using modes of inquiry commonly used in history and sociology. Such a study could interview researchers to ask them about their reasons for participating in benchmarking, assess the impact of specific events, and record their reactions during meetings. This data could be used to plot a timeline for events during benchmarking, provide additional guidance for managing the development and deployment process, and design better measures for sociological criteria.
Another way that the theory can be refined is through the examination of additional case histories. Existing benchmarks can be examples to support the theory or counter-examples that falsify the theory. In presenting these ideas to researchers from other branches of computer science, I have often been told of a benchmark that does not seem to fit the theory. I have pursued these benchmarks with interest and found that closer scrutiny reveals that the theory does in fact fit. These case histories were discussed in Section 4.2, where the theory of benchmarking was validated using novel benchmarks. A major problem in finding additional case histories is that benchmarks are usually described as technical artefacts and there is almost no information in the literature about the process used. In such situations, interviews of community members and meta-analysis of conference proceedings and journals can be used to unearth the required evidence.
Live benchmarks, that is, ones that are currently being used or developed, provide valuable opportunities to validate the theory. They can be fertile ground for the sociological study mentioned above and for further validation of the assessments and checklists presented in this dissertation. Furthermore, these live applications of the theory provide an opportunity to determine whether personal characteristics of the champions have an effect on the process or whether it is sufficient to have the other factors for success in place.
9.2.3 Testing of Assessments and Questionnaires
The tools derived from the theory of benchmarking, in particular the Benchmarking Readiness Assessment and the evaluation checklists, also need to be validated further. Both of these aids were developed by applying the theory and the process model, so they are conceptually sound. However, they have not undergone sufficient user testing and do not necessarily produce correct results when used by non-specialists. Consequently, their reliability and validity as questionnaires have not been established. As they stand, the Benchmarking Readiness Assessment and the evaluation checklists are specialist tools that require intimate knowledge of the underlying theory and can be easily misinterpreted. To illustrate these issues, I give one problem for each instrument encountered so far in their use.
151
The first step in using the Benchmarking Readiness Assessment is to select a research area and an associated technology. The difficulty with issuing such an instruction to software engineers is that most do not have sufficient background in philosophy of science and Kuhn's work to select an appropriately sized community. Without this grounding, it is too easy to choose a group that is too big or too small, for instance, all of computer science since its inception, or a particular professor and all of her current and former students. There is a similar problem in selecting the associated technology. Among the assumptions that I made in formulating the Assessment are that the technology to be benchmarked should have emerged out of research conducted by the community and that it should be of interest to a sufficiently large proportion of that community. These and other assumptions need to be made explicit, and the instructions for using the Assessment need to be improved, before it can be widely disseminated.
A key problem with the evaluation checklists is that the questions, as given, are not sufficiently discriminating to elicit trustworthy responses. Questions such as T1 on the Task Sample, which asks "Is it representative of problems found in actual practice?", are too easy to answer inaccurately. An enthusiastic benchmark champion could easily answer in the affirmative without providing evidence. These questions attempt to tackle underlying issues with the benchmark and the process, but they are not phrased in ways that are resistant to misinterpretation or to overly optimistic answers. To validate these checklists, the questions will need to be refined using principles of survey design from the social sciences [43, 56] and tested with typical users. Another potential solution is to develop instructions for administration and analysis so that the checklists can be used as an indicator of community consensus; a rough sketch of what such an analysis might look like is given below. Both the Benchmarking Readiness Assessment and the evaluation checklists are starting points derived from this initial work to better understand benchmarks. They are a springboard for additional work and have the potential to be important tools for benchmark creators in the future.
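As an illustration only, and not a validated analysis procedure from this dissertation, the following Python sketch shows one way that checklist responses gathered from several community members might be tallied as a crude consensus indicator. The question codes follow Appendix C, but the yes/no responses themselves are hypothetical.

# Illustrative sketch only: tally checklist responses gathered from several
# community members as a crude indicator of consensus. The question codes
# (e.g. T1, P1, D5) follow Appendix C; the yes/no responses are hypothetical.

from collections import Counter

def consensus(responses: dict[str, list[str]]) -> None:
    for question, answers in responses.items():
        counts = Counter(answers)
        # Agreement = share of respondents giving the most common answer.
        agreement = counts.most_common(1)[0][1] / len(answers)
        print(f"{question}: {dict(counts)} -> {agreement:.0%} agreement")

if __name__ == "__main__":
    consensus({
        "T1": ["yes", "yes", "yes", "no", "yes"],   # Task Sample representative?
        "P1": ["yes", "no", "no", "yes", "no"],     # scores fair for a single tool?
        "D5": ["yes", "yes", "yes", "yes", "yes"],  # community involvement possible?
    })

How such agreement figures should be interpreted, and what level of agreement constitutes consensus, are exactly the kinds of questions that the proposed administration and analysis instructions would need to settle.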
9.3 Applications in Software Engineering
As stated in Chapter 2, the motivation for the research undertaken in this dissertation was the shortage of appropriate validation in software engineering. Authors such as Zelkowitz and Wallace, Tichy, and Shaw all argued that researchers need to validate more of their research results and to do so more objectively. They argued that this lack of appropriate validation hurts both our credibility as a discipline and our understanding of research results and their implications. Each of these authors makes a variety of recommendations for addressing this problem, and I submit that benchmarking meets many of them. For reference, the list from Chapter 2 is reproduced here.
• Recognise our research strategies and how they establish their results. (Shaw)
• Use a common taxonomy and terminology for discussing experimentation and the results it can produce. (Zelkowitz and Wallace)
• "Wherever appropriate, publicly accessible sets of benchmark problems must be established to be used in experimental evaluations." (Tichy)
• "In many areas within CS, rules for how to conduct repeatable experiments still have to be discovered. Workshops, laboratories, and prizes should be organised to help with this process." (Tichy)
• "…computer scientists have to begin with themselves, in their own laboratories, with their own colleagues and students to produce results that are grounded in evidence." (Tichy)
The first two recommendations urge better understanding of different research strategies and their merits. This dissertation is my contribution to an improved understanding of benchmarking as a research method. It puts forth a terminology for benchmarks, a process model, and an explanatory theory. This work clarifies what can be achieved using benchmarking and how it can be achieved. Benchmarking helps us to leverage our knowledge of how to conduct empirical studies to maximise the impact of evaluation across a community of researchers; applying a benchmark even once evaluates a number of systems developed by the community.
The third recommendation, from Tichy, suggests the creation of publicly accessible benchmarks. This dissertation concurs and includes two benchmarks that can be downloaded from the Internet. Both the xfig benchmark for program comprehension tools and CppETS can be accessed by any researcher, student, or software engineer who is so inclined. Furthermore, these benchmarks have been used as the basis of four workshops, thus fulfilling the fourth recommendation as well. The xfig benchmark was used for workshops at CASCON99 and WCRE2000, and CppETS was used at CASCON2001 and IWPC2002. Others have fulfilled the last recommendation by using the benchmarks to teach graduate courses [149] and to evaluate systems in their own laboratories [37, 124].
To return to the problem that Zelkowitz and Wallace, Tichy, and Shaw were addressing, these recommendations were not made to advocate for a particular kind of validation or empirical method. They were all attempting to make software engineering more rigorous and to increase our awareness of research strategies. I agree that we need to make conscious choices about the research strategies that we use, in order to improve the quality of research results and to increase appreciation and respect from outside the discipline. What they suggest is, in effect, a paradigm shift, changing the accepted way that things are done. Such changes are difficult because they usually cannot be achieved through forceful arguments, but only through gentle shifts in community consensus. This theory of benchmarking is an attempt to increase awareness of our research strategies. It offers a solution to a problem, a deficit of rigorous validation in software engineering. Like the quotation from Antoine de Saint-Exupéry at the beginning of this chapter, it does so not by providing a treatise on validation methods, but rather by describing a scientific discipline to be longed for by software engineering researchers. Progress will be made not solely through the recruitment of willing hands, but by instilling into minds "the endless immensity of the sea."
Appendix A: Sample Definitions for the Term Benchmark

Institute of Electrical and Electronics Engineers, Inc. (IEEE)
IEEE Standard 610.12-90, Glossary of Software Engineering Terminology, defines a benchmark as: (1) A standard against which measurements or comparisons can be made. (2) A problem, procedure, or test that can be used to compare systems or components to each other or to a standard as in (1). (3) A recovery file. [141]

Standard Performance Evaluation Corporation (SPEC)
…a "benchmark" is a test, or set of tests, designed to compare the performance of one computer system against the performance of others [139].

Dictionary of Computer Science, Engineering, and Technology
a controlled experiment that measures the execution time, utilization, or throughput for a system using an artificial workload chosen to represent the actual operational use of the system. Benchmarks are used to obtain an estimate of performance, to compare similar systems, or to parameterize a model of the system that will be used to determine whether it will operate effectively as a component of a larger system. [85]

Walter F. Tichy, University of Karlsruhe
"Essentially, a benchmark is a task domain sample executed by a computer or by a human and computer. During execution, the human or computer records well-defined performance measurements" [153].
Appendix B: Benchmarking Readiness Assessment

Research Community: ______
Technology: ______

Maturity
How many years ago did this research area split from another one?
   a) four or fewer  b) five to ten  c) ten or more
How many implementations are there of the technology under study?
   a) two or fewer  b) three to five  c) six or more
What phase of maturity has the technology reached on the Redwine and Riddle Maturity Model?
   a) Basic Research, Concept Formulation, Development and Extension  b) Enhancement and Exploration (internal)  c) Enhancement and Exploration (external) or Popularization
How many annual conferences and workshops are dedicated to this research area?
   a) none  b) one or two  c) three or more
How many journals are dedicated to this research area?
   a) none  b) one  c) two or more
How difficult is it to publish a speculative paper in one of the above meetings or journals?
   a) not applicable or easy  b) somewhat difficult  c) very difficult
Comparison
How difficult is it to publish a paper that introduces a new technology without validation?
   a) easy  b) somewhat difficult  c) very difficult
How many different implementations of the technology have been applied to solve an industrial problem?
   a) two or fewer  b) three to five  c) six or more
When was a paper that compared three or more approaches or implementations last published?
   a) never or more than five years ago  b) one to four years ago  c) within the last year
Does the research area use standard proto-benchmarks or benchmarks to compare technology?
   a) no  b) proto-benchmarks  c) benchmarks
Have there been attempts to replicate the results of these comparisons?
   a) no  b) using the same technology  c) using the same technology and evaluation method
Have there been tutorials or workshops on how to conduct empirical studies in this particular research area?
   a) no  b) conference session (~1.5 hours) to half a day  c) full day or longer
Collaboration
Is time set aside at conferences and workshops for discussions?
   a) no  b) once or twice per meeting  c) almost every session of the meeting
Has there been a seminar or workshop in this research area dedicated to discussion and interaction, e.g. a Dagstuhl seminar?
   a) no  b) once  c) regularly
How often do research groups meet to exchange ideas, tools, or techniques?
   a) rarely  b) occasionally, but meetings are not repeated  c) regularly (once per year or more)
Have there been many multi-site, multi-year research projects, consortiums, or task forces?
   a) no  b) one or two  c) three or more
Are there community-wide standards for paper formats, data exchange, or auditing of research results?
   a) no  b) one  c) two or more
Scoring of Assessment
Each response is worth points as follows: a) 0 points, b) 1 point, c) 2 points.
Totals:
   Maturity: ______
   Comparison: ______
   Collaboration: ______

Interpretation of Scores
                   Too Soon   Ready for Benchmarking   What are you waiting for?
   Maturity        0-4        5-9                      10-12
   Comparison      0-4        5-9                      10-12
   Collaboration   0-3        4-7                      8-10
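The scoring rule above is simple arithmetic. As an illustration only, the following Python sketch (not part of the original assessment materials) tallies a set of a/b/c responses for each category and maps the totals onto the interpretation bands; the responses shown are hypothetical.

# Illustrative sketch: score a completed Benchmarking Readiness Assessment.
# Point values (a = 0, b = 1, c = 2) and interpretation bands follow the
# tables above; the responses in the example are hypothetical.

POINTS = {"a": 0, "b": 1, "c": 2}

# Per category: (upper bound of "Too Soon", upper bound of "Ready for Benchmarking").
BANDS = {
    "Maturity": (4, 9),        # 0-4, 5-9, 10-12
    "Comparison": (4, 9),      # 0-4, 5-9, 10-12
    "Collaboration": (3, 7),   # 0-3, 4-7, 8-10
}

def interpret(category: str, total: int) -> str:
    too_soon_max, ready_max = BANDS[category]
    if total <= too_soon_max:
        return "Too Soon"
    if total <= ready_max:
        return "Ready for Benchmarking"
    return "What are you waiting for?"

def score(responses: dict[str, list[str]]) -> None:
    for category, answers in responses.items():
        total = sum(POINTS[a] for a in answers)
        print(f"{category}: {total} points -> {interpret(category, total)}")

if __name__ == "__main__":
    # Hypothetical answers to the six Maturity, six Comparison, and five
    # Collaboration questions above.
    score({
        "Maturity": ["b", "c", "b", "b", "a", "b"],
        "Comparison": ["b", "b", "a", "b", "a", "a"],
        "Collaboration": ["b", "c", "b", "a", "b"],
    })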
Appendix C: Benchmark Evaluation Checklists

Overall
O1. Does the benchmark provide a level playing field for comparing tools or techniques?
O2. Are the tools or techniques that are intended to be tested defined at the outset?
O3. Can other tools or techniques use the benchmark?
Interpretation: The benchmark should provide a level playing field and not be biased in favour of or against an implementation. The benchmark should state these assumptions. A flexible benchmark should be usable by tools outside the initial assumptions.

Motivating Comparison
M1. Is the benchmark concerned with an important problem in the research area?
M2. Does it capture the raison d'être for the discipline?
M3. Are there other problems in need of a benchmark? Should one of those have been developed before this one?
Interpretation: The problem can be a large, prominent one or a small, fundamental one, i.e. a particularly thorny sub-problem. There should be a close relationship between the raison d'être and the Motivating Comparison. A benchmark should target a pressing problem, but there are worthwhile exceptions: it can target an interesting high-profile problem or it can tackle a less important problem that is easier to study empirically.
Task Sample
T1. Is it representative of problems found in actual practice?
T2. Is there a description of the expected user? Is it realistic?
T3. Is there a description of the usage context? Is it realistic?
T4. Is the selection of tasks supported by empirical work?
T5. Is the selection of tasks supported by a model or theory?
T6. Can the tasks be solved?
T7. Is a good solution possible?
T8. Is a poor solution possible?
T9. Would two people performing the task with the same tools produce the same solution?
T10. Can the benchmark be used with prototypes as well as mature technologies?
T11. Can the tasks be scaled to be more numerous, more complex, or longer?
T12. Is the benchmark biased in favour of particular approaches?
T13. Is the benchmark tied to a particular platform or technology?
Interpretation: The design of the benchmark needs to take into account the problems that are found in practice, the sampling method used, and whether the final sample is representative (T1). User characteristics form part of the description of actual practice or applied usage of the technology (T2), as does the setting (T3). Information about the task domain can be collected empirically and used to construct the Task Sample (T4); the task domain can also be modelled, and the model can guide task selection (T5). The Task Sample can be based on empirical data or theory or both. Questions T6 to T8 tap into the ability of the Task Sample to discriminate: it should be possible to at least partially solve the tasks, and there must be ways to succeed and ways to fail. T9 taps into whether evaluator or demonstrator characteristics affect the solution produced; automated benchmarks should not be affected by these user characteristics. A realistic Task Sample tends to work better with mature technologies and can be too large or too complex to be solved using prototypes (T10). The efficiency of different algorithms, system configurations, and so on may not become evident until a large load is placed on the technology (T11). T12 and T13 attempt to identify sources of bias in the Task Sample.
Performance Measures
P1. Does a score represent the capabilities of a tool or technique fairly? i.e. Are the results for a single technology accurate?
P2. Can the scores be used to compare two technologies directly? i.e. Are comparisons accurate?
P3. Are the measures used in the benchmark indicators of fitness for purpose between the technology and the tasks?
P4. Is it possible for a tool or technique that does not have fitness for purpose to obtain a good performance score?
P5. Was the selection of Performance Measures supported by previous empirical work?
P6. Was the selection of Performance Measures supported by a model or theory?
P7. Would one person using the benchmark on the same tool or technique twice get the same result?
P8. Would two people using the benchmark with the same tool or technique get the same result?
Interpretation: At a minimum, the measures need to be accurate for a single tool (P1). Measures that are accurate for multiple tools can be used to make comparisons (P2). High scores on the Performance Measures should map to good performance in actual practice (P3); P4 is the inverse of P3. P5 asks whether the selection of Performance Measures is based on the empirical development of accurate and consistent metrics, and P6 whether it is based on a model of how the technology functions. (The Performance Measures can be based on empirical data or theory or both.) P7 taps into the reliability of the benchmark results from one usage to another, holding user characteristics constant, while P8 taps into the reliability of the benchmark results when user characteristics change.
Development
D1. Was there interest in comparing results within the discipline?
D2. Was the project well publicised?
D3. How many people are involved in the development of the benchmark?
D4. How many research groups are involved?
D5. Were there opportunities for the wider community to become involved?
D6. Were refinements made to the benchmark based on feedback?
D7. Was the benchmark prototyped?
D8. Was the design of the benchmark discussed at a meeting that was open to the community? More than once?
D9. Were the results of the benchmark discussed at a meeting that was open to the community? More than once?
Interpretation: The benchmarking effort should build on existing interest and research directions (D1). People need to know about the benchmarking effort in order to participate in it, otherwise it is merely a private project; public calls for participation and word-of-mouth contact are key to raising the visibility of the project (D2). As many people as possible should be involved in the benchmark; it is not possible to have too many participants, and controversy polishes the final product. This participation can be in the form of comments or research to refine benchmark components (D3). D4 is asked separately from D3 to distinguish laboratories that participate as a bloc. The community at large should have as many opportunities as possible to comment on and use the benchmark (D5), and the feedback obtained from the community should be used to make meaningful changes and improvements to the benchmark (D6). Prototyping is another opportunity for feedback; when designing experiments and evaluations, prototypes are "dry runs" that help debug the materials (D7). Face-to-face meetings with many participants help to build consensus. Discussion of the design increases the validity and reliability of the evaluation (D8), and discussion of the results increases their acceptance and impact in the community (D9).
Deployment
E1. What proportion of the eligible technology has been evaluated using the benchmark?
E2. Was the evaluation well publicised?
E3. Was participation in the evaluation open to all interested parties?
E4. Are the materials easy to obtain?
E5. Are the materials clear and unambiguous?
E6. Is there a fee for using the benchmark?
E7. Is there a license agreement for using the benchmark?
E8. How much time is required to run the benchmark?
E9. Is the benchmark automated?
Interpretation: It is desirable to have a high proportion of the technology evaluated by the benchmark; alternatively, it may be sufficient to have the visible and respected implementations participate (E1). Awareness of the benchmark is necessary before researchers can use it, and publicity can be generated through mass mailings or personal contact (E2). The evaluation should be open to all who wish to participate; the value of additional participation usually outweighs any logistical difficulties (E3). To ensure a high level of participation, the benchmark should be easy to use and consume as little time and as few resources as is reasonable; questions E4-E10 address this issue. The benefits of some barriers to participation (e.g. fees, license agreements) outweigh the costs, but such barriers should be used with care. Obtaining the materials is the first step in using the benchmark, so it is important that this does not become a barrier to participation (E4). The materials should be easy to understand so that they will be used correctly; instructions that require less interpretation have fewer loopholes in them (E5). A fee may be needed to help maintain the benchmark, but it can also act as a barrier to participation (E6). A license agreement protects how the benchmark materials are used and can specify a code of conduct, but is another barrier to participation (E7). A benchmark should require the time necessary to perform a fair evaluation, but no more (E8). Automation reduces the human resources required to perform an evaluation, but may not be appropriate for many problems (E9).
E10. Are specialised skills or training needed to use the benchmark?
Interpretation: The amount of specialised knowledge needed should be kept to a minimum to ensure that the materials are usable by as many people as possible and that the results are understandable by the wider community.
E11. Are the results easy to obtain?
Interpretation: There is no clear correct answer for this question. Some organisations enthusiastically publicise results (e.g. TPC), but others are more circumspect (e.g. TREC). The results should be sufficiently accessible that they have an impact on the community, but protected from misuse.
E12. Are the results clear and unambiguous?
Interpretation: The meaning of the results should not be open to (mis-)interpretation or debate.
E13. Is it possible to examine the raw data as well as the benchmark results?
Interpretation: A similar difficulty exists with raw data as with results. Some raw data is very sensitive and other raw data is too complex to be easily compromised. Raw data should be available to some degree to permit scrutiny of the benchmark and the technologies evaluated.
E14. Is there a fee for accessing the raw data or solutions?
Interpretation: Fees can help to maintain the benchmark, but can be a barrier to participation. They can be easier to justify for the raw data or solutions than for the benchmark materials.
E15. Is there a license agreement for accessing the raw data or solutions?
Interpretation: Similar to fees, the protection provided by a license to the raw data or solutions outweighs any potential abuse or negative publicity.
E16. Are there procedures in place for auditing the submitted solutions?
Interpretation: Having an auditing procedure in place provides some recourse when malfeasance occurs.
E17. Are the solutions audited?
Interpretation: Actual application of the auditing procedures shows that they are not simply toothless regulations.
E18. Is there a process for users to vet their results before they are released?
Interpretation: The users who applied the benchmark should agree with the final results. A vetting process prevents unhappy users from becoming unpleasant detractors.
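Questions E8 and E9 concern running time and automation. Purely as an illustrative sketch, and not as part of any benchmark described in this dissertation, the following Python fragment shows the shape of a minimal automated harness that runs each tool on each task, records the wall-clock time, and notes whether the run succeeded. The tool commands and the tasks directory are hypothetical placeholders.

# Illustrative harness sketch: run each (hypothetical) tool on each benchmark
# task, recording wall-clock time and exit status. The tool commands and the
# "tasks" directory are placeholders, not part of CppETS or the xfig benchmark.

import subprocess
import time
from pathlib import Path

TOOLS = {
    "extractor-a": ["extractor-a", "--facts"],   # hypothetical command lines
    "extractor-b": ["extractor-b", "-o", "-"],
}
TASKS = sorted(Path("tasks").glob("*.cpp"))      # hypothetical task files

def run_all() -> list[dict]:
    results = []
    for tool, command in TOOLS.items():
        for task in TASKS:
            start = time.perf_counter()
            proc = subprocess.run(command + [str(task)], capture_output=True, text=True)
            results.append({
                "tool": tool,
                "task": task.name,
                "seconds": round(time.perf_counter() - start, 2),
                "ok": proc.returncode == 0,
            })
    return results

if __name__ == "__main__":
    for row in run_all():
        print(row)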
Appendix D: Tool Developer Handbook
A Collective Tool Demonstration
Tool Developer Handbook
A CASCON 1999 Workshop

Susan Elliott Sim
Department of Computer Science, University of Toronto
10 Kings College Rd., Toronto, Ontario, Canada M5S 3G4
Tel. +1 (416) 978-4158  Fax. +1 (416) 978-4765
[email protected]

Dr. Margaret-Anne Storey
Assistant Professor, Department of Computer Science, University of Victoria
PO Box 3055 STN CSC, Victoria, BC, Canada V8W 3P6
Tel: +1 (250) 721 8796  Fax: +1 (250) 721 7292
[email protected]
Thank you for agreeing to participate in the Collective Tool Demonstration. The day promises to be fun, exciting, and even nerve-wracking, for participants and organizers alike. Over the course of today and Wednesday, we expect to learn a lot about the program comprehension tools you have brought -- more than we could by doing separate demonstrations. We want to emphasize that the goal of this workshop is not to find a winner. We are more interested in learning which aspects of the studied tools are useful for particular tasks. In addition, comparing tools directly would be difficult, if not impossible, as it is likely that the participating tools were designed to support varying tasks, and will therefore have different strengths and weaknesses.
We have selected a subject system for you to analyze, and your tasks will be given to you in the form of a scenario. In the scenario, you will take the role of a development team assigned to work on an unfamiliar application. You will be using your tool to help understand this subject system. You will be given two sets of tasks. The first set of tasks is primarily concerned with the overall structure of the subject system, and you are required to complete these tasks. The second set of tasks is similar to the to-do list of a maintenance programmer. Please complete at least one task from this second set. As you complete the tasks, document your progress as suggested in the General Instructions, below.
The remainder of this handbook is organized as follows:
• General Instructions: outline of the basic procedures you should follow throughout the day.
• Scenario: description of the tasks to be completed
  1. Required Tasks: two tasks that must be performed.
  2. Maintenance Tasks: optional tasks; do at least one of these tasks.
• Preparing your presentation: reviews the main points you should cover in your 15-minute presentation on Wednesday. Read these now as they will help you form your solutions to the tasks.
General Instructions
An observer has been assigned to your team. The observer should be viewed as an apprentice trying both to learn how to use the tool set and to understand the concepts underlying the analysis that you perform. Please assist them with their assignment and answer whatever questions they have.
For each of the tasks, you need to submit two things: a deliverable and a description. The deliverable is the solution to the task. The description is a short note on how you arrived at your solution. If you use any additional resources, please make a note of this in your description. Additional resources include:
• other tools, i.e. ones not included in the tool description in the workshop report
• information from the Internet, e.g. downloaded files or browsed documents
• people from outside the team
We would prefer to receive the deliverables and descriptions as electronic files (which we can copy to floppy disk) but neatly handwritten answers would be OK. We will not require you to submit modified, running code as part of your solutions.
You may run the program to view its behaviour. The observer may participate in the analysis of the subject system. There will be a lunch break from 12-1PM. Stop work at 5PM; you've earned a break. Afterwards, you can work on your presentations, but your analysis should be essentially complete.
Scenario
xfig is a drawing application that runs on a variety of UNIX platforms. The current version is 3.2.1 and consists of about 30 000 lines of ANSI C. The old xfig team and manager quit the xfig project to move south for higher salaries! You have been assigned, along with some of your colleagues, to rescue the future development of the xfig application. You are placed under a new manager, a recent MBA graduate, who is impressed that you are going to use some fancy tools to get him and the rest of your new team up to speed. The first thing the new manager would like you to do is to use your tool(s) to create some documentation which would summarize the main structures and architecture of the xfig application. He would also like you to explore how you would go about implementing some of the changes that were identified in the inherited "TODO" list.
Section 1: Required Tasks
1.1 Documentation
Provide a textual and/or graphical summary of how the xfig source code is organized. This documentation should provide the manager with an overview of the system, and may include a call graph, subsystem decomposition, description of the main data and file structures, or any other appropriate information. Use whatever format you think is appropriate: text files, HTML, Word documents, graphics, etc.
Deliverable: The documentation you have created.
1.2 Evaluate the structure of the application.
Your manager would like you to form an opinion on the structure of the xfig program. In particular, you should answer the following questions:
1. Was it well-designed initially?
2. Is this original design still intact?
3. How difficult will it be to maintain and modify?
4. Are there some modules that are unnecessarily complex?
5. Are there any GOTOs? If so, how many, and what changes would need to be made to remove them?
Deliverable: Your opinion on the answers to the above questions, and justification for your answers.
Section 2: Maintenance Tasks
The manager would like you to examine in detail some of the tasks from the "TODO" list. He would like you to outline the changes that would need to be made to the code to complete each task (i.e. which parts will need to be modified) and the impact of the change on the rest of the system. Consider at least one of the following tasks (preferably more); order is not important.
2.1 Modify the existing command panel.
The buttons in the command panel, a.k.a. tool bar, at the top of the window are somewhat unconventional. The tool bar should be more consistent with those in other graphical user interfaces, with the headings "File", "Edit", and "View" justified to the left and the "Help" menu item justified on the right side. The buttons currently in the command panel should be rearranged in the following way:
File: New, Load/Merge, Save, Save As, Export, Print, Exit
Edit: Undo, Paste, Find, Replace, Spell Check
View: Landscape, Portrait, Redraw
Help: Xfig HTML Reference, Xfig tutorial in pdf, Xfig man pages in pdf, About Xfig
Identify which functions or files would need to be modified in order to implement this change. Deliverable: A list of functions or files involved in the change.
2.2 Add a new method for specifying arcs.
Currently, arcs are created by specifying three points (you may want to run the program to try this out), which are then used to create a spline curve. Add a feature that allows a user to draw an arc by clicking on the centre of a circle and then selecting two points on the circumference, i.e. by specifying a radius and angle. Identify the functions and files that need to be changed to add this feature. Also explain the approach you would take to implement this new feature.
Deliverable: A list of functions or files involved in the change.
2.3 Bug fix: Loading library objects.
Loading objects from a library causes the program to crash. This error occurs when the user attempts to load a library object using the bookshelf icon on the left-hand side of the screen. When you click on this icon, a dialog box opens that allows you to select a Library and an object to load. Doing so will cause xfig to hang and eventually crash with a "Segmentation Fault" error. Identify the functions or files that need to be changed to repair this defect.
Deliverable: A list of functions or files involved in the repair.
Note: If there are no libraries available to load, it means that xfig was not able to locate a library directory. You can specify one at the command line using the -library_dir or -li switch. In the distribution, the Libraries subdirectory is found in the Examples subdirectory. To start xfig from the distribution root with this option, type: ./xfig -li ./Examples/Libraries
Preparing your presentation
We would like you to prepare a short 15-minute presentation for Wednesday. During the presentation, we would like you to include the following topics:
• How long did it take you to read the source code into your tool?
• What difficulties did you encounter with your tool? Did it crash? Any other surprises?
• How long did you spend on the required tasks?
• What kind of documentation did you create?
• Which maintenance tasks did you do?
• How long did each of them take?
In addressing the questions about time, you do not need to give precise answers. We will have a laptop available with PowerPoint if you wish to use this to prepare your talk. We also have some transparencies and pens if you would prefer to use those.
Appendix E: Observer Handbook
A Collective Tool Demonstration
Observer Handbook
A CASCON 1999 Workshop

Dr. Margaret-Anne Storey
Assistant Professor, Department of Computer Science, University of Victoria
PO Box 3055 STN CSC, Victoria, BC, Canada V8W 3P6
Tel: +1 (250) 721 8796  Fax: +1 (250) 721 7292
[email protected]

Susan Elliott Sim
Department of Computer Science, University of Toronto
10 Kings College Rd., Toronto, Ontario, Canada M5S 3G4
Tel. +1 (416) 978-4158  Fax. +1 (416) 978-4765
[email protected]
Thank you for agreeing to be an observer for the Collective Tool Demonstration. We have invited developers of program comprehension tools from IBM and university research groups to participate in a formal demonstration by applying their tools to a common subject system. You will be observing the XXX team as they use their tool to analyse the subject system. You should consider yourself an "apprentice"; during the day your goal is to develop a mastery of XXX. In particular, you should consider how the tool could be used in your work as a software developer. You will report on your experiences on Day 2 of the workshop.
We will be presenting the tool teams with a scenario in which they play developers of the subject system, xfig. The subject system is written in C and consists of approx. 30,000 LOC. Your team will be using XXX to analyze xfig and complete two sets of tasks. The first set of tasks is required, and consists of tasks such as creating documentation to provide an overview of a software system. The second set is optional (although we would like them to do at least one of these tasks) and consists of maintenance tasks such as fixing a bug or adding a new feature.
We have attached a copy of the handbook that we have given to the tool developers; we encourage you to read the tasks at the beginning of the experiment so that you are familiar with them. The remainder of this document is organized as follows:
• General Instructions: these outline the basic procedures you should follow throughout the day.
• Preparing your presentation: reviews the main points you should include in your 10-minute presentation for Wednesday. Read this now as it will help you focus your observations.
• Questionnaire: this section describes a questionnaire which you need to complete on behalf of the tool team.
General Instructions
You should consider yourself an "apprentice" whose goal during the day is to develop a mastery of XXX. You will need to pay attention to both the nuts-and-bolts aspects of running the tools and the concepts behind the analysis that the team performs. In other words, note how they use XXX and why they use XXX the way they do. You will therefore need to ask the team members what they are doing and how they are using the tool. The team members may work silently; do not be afraid to ask questions, for example:
• What are you doing now?
• Why did you do that?
• Could you elaborate on that?
• Are things not working as expected?
Your interruptions may break their concentration, so you need to walk a fine line between finding out what they are doing and not interrupting them unduly. Please record important observations as you go along, as it is hard to remember important details later on. We will all pause for lunch from 12-1PM.
Read the points to be considered for preparing your presentation on Wednesday (see below). These points will help you focus your observations. We would like you to fill out a short questionnaire on behalf of your team. The questionnaire can be found on the last page of this handbook. When you have completed the questionnaire, please detach the page and give it to Peggy or Susan.
Preparing your presentation
We would like you to prepare a short 10-minute presentation for Wednesday. Consider some of the following questions as you prepare your presentation:
• Would you use XXX as a developer?
• Do you feel that XXX has a place in your organization? For which tasks?
• Do you consider XXX difficult to use?
• After observing for the day, do you feel you are capable of using most of XXX's functionality?
• If you had not observed XXX being used, how long do you think it would have taken you to become an expert in the tool?
• To what extent did the developers use XXX to perform the maintenance tasks?
• Your team may have used other tools to perform the assigned tasks; would you find any of these tools useful in your work context?
We will have a laptop available with PowerPoint if you wish to use this to prepare your talk. We also have some transparencies and pens if you would prefer to use those.
Questionnaire
Whenever it is convenient, please complete the following questionnaire for the team participants (you may summarize their responses on this page). Note: Ask for more than yes or no answers; dig a little to find out how much they have worked on or browsed such programs before.
Tool Team Name: ______
Observer Name: ______
1. Have you ever used (run) the xfig drawing package before? To what extent did you use it?
2. Have you used (run) other drawing packages? Which ones, and to what extent?
3. Have you written, modified or browsed the source code for xfig before today?
4. Have you written, modified or browsed the source code for another drawing package before?
References
[1]
“International SPEC Workshop, September 22-24, 1998,” http://mafalda.unipv.it/Laboratory/meetings/agspec.html, last accessed 4 July 2003.
[2]
“A Collective Demonstration of Program Comprehension Tools,” http://www.csr.uvic.ca/cascon99, last accessed 15 May 2000.
[3]
“Reverse Engineering Demonstration Project Home Page,” Available at http://pathbridge.net/reproject/cfp2.htm, last accessed 15 May 2000.
[4]
“Rigi Group Home Page,” http://www.rigi.csc.uvic.ca/, last accessed 8 January 2002.
[5]
“C/C++ Parser with TA++ and GXL Output,” http://www.site.uottawa.ca:4333/dmm/, last accessed 8 January 2002.
[6]
“CPPX Home Page,” http://swag.uwaterloo.ca/~cppx/, last accessed 8 January 2002.
[7]
“Merriam-Webster OnLine,” http://www.m-w.com/home.htm, last accessed 8 May 2002.
[8]
“The Benchmark Gateway,” http://www.ideasinternational.com/benchmark/bench.html, last accessed 3 July 2003.
[9]
“Front End Art Ltd. Home Page,” http://www.frontendart.com/, last accessed 7 July 2003.
[10]
“Transaction Processing Performance Council Home Page,” http://www.tpc.org/, last accessed 28 June 2003.
[11]
Anon et al., “A Measure of Transaction Processing Power,” Datamation, vol. 31, no. 7, pp. 112-118, April 1, 1985.
[12]
Nicolas Anquetil and Timothy Lethbridge, “File Clustering Using Naming Conventions for Legacy Systems,” presented at CASCON97, Toronto, Canada, pp. 184-195, 4-7 November 1997.
[13]
M. N. Armstrong and C. Trudeau, “Evaluating Architectural Extractors,” presented at Fifth Working Conference on Reverse Engineering, Honolulu, HI, pp. 30-39, 12-14 October 1998.
[14]
Ross Arnold and Tim Bell, “A Corpus for the Evaluation of Lossless Compression Algorithms,” presented at Data Compression Conference (DCC'97), Snowbird, Utah, pp. 201-210, 25-27 March 1997.
[15]
Magdalena Balazinska, Ettore Merlo, Michel Dagenais, Bruno Lagüe, and Kostas Kontogiannis, “Partial Redesign of Java Software Systems Based on Clone Analysis,” presented at Sixth Working Conference on Reverse Engineering, Atlanta, GA, pp. 326-336, 6-8 October 1999.
[16]
J.L. Barron, D.J. Fleet, and S.S. Beauchemin, “Performance of Optical Flow Techniques,” International Journal of Computer Vision, vol. 12, no. 1, pp. 43-77, 1994.
[17]
Victor R. Basili, Gianluigi Caldiera, and H. Dieter Rombach, “The Goal Question Metric Approach,” in Encyclopedia of Software Engineering, Two Volume Set, Gianluigi Caldiera and Dieter H. Rombach, Eds. New York City: John Wiley and Sons, Inc., pp. 528-532, 1994.
[18]
Ira D. Baxter, “Design Maintenance Systems,” Communications of the ACM, vol. 35, no. 4, pp. 73-89, 1992.
[19]
Tim C. Bell, Ian H. Witten, and J. G. Cleary, “Modelling for Text Compression,” Computing Surveys, vol. 21, no. 4, pp. 557-591, 1989.
[20]
Berndt Bellay and Harald Gall, “An Evaluation of Reverse Engineering Tool Capabilities,” Software Maintenance: Research and Practice, vol. 10, no. 5, pp. 305-331, 1998.
[21]
Joseph Ben-David, “Scientific Productivity and Academic Organization in Nineteenth Century Medicine,” American Sociological Review, vol. 25, pp. 828-843, December, 1960.
[22]
Merrie Bergmann, James Moor, and Jack Nelson, The Logic Book, Second Edition. New York, NY: McGraw-Hill Publishing Company, 1990.
[23]
David Binkley, “An Empirical Study of the Effect of Semantic Differences on Programmer Comprehension,” presented at Tenth International Workshop on Program Comprehension, Paris, France, pp. 97-106, 27-29 June 2002.
[24]
Hubert M. Blalock Jr., Basic Dilemmas in the Social Sciences. Beverly Hills, CA: Sage Publications, 1984.
[25]
Bruce I. Blum, Beyond Programming: To a New Era of Design. Oxford: Oxford University Press, 1996.
[26]
Robert W. Bowdidge and William G. Griswold, “How Software Tools Organize Programmer Behavior During the Task of Data Encapsulation,” Empirical Software Engineering, vol. 2, no. 3, pp. 221-267, 1997.
[27]
Ivan T. Bowman, Michael W. Godfrey, and Ric Holt, “Connecting Architecture Reconstruction Frameworks,” Journal of Information and Software Technology, vol. 42, no. 2, pp. 93-104, 1999.
[28]
Kevin Bowyer, “Report on IEEE Computer Society Workshop on Empirical Evaluation of Computer Vision Algorithms,” http://peipa.essex.ac.uk/benchmark/cvpr98-report.html, last accessed 21 July 2003.
[29]
Kevin W. Bowyer and P. Jonathon Phillips, “Overview of Work in Empirical Evaluation of Computer Vision Algorithms,” in Empirical Evaluation Techniques in Computer Vision, Kevin W. Bowyer and P. Jonathon Phillips, Eds. Los Alamitos, CA: IEEE Computer Society Press, 1998.
[30]
R. Ian Bull, Andrew Trevors, Andrew J. Malton, and Michael W. Godfrey, “Semantic Grep: Regular Expressions + Relational Abstraction,” presented at Ninth Working Conference on Reverse Engineering, Richmond, VA, pp. 267-276, 29 October - 1 November, 2002.
[31]
Robert Ian Bull, Abstraction Patterns for Reverse Engineering. Master's Thesis, Department of Computer Science, University of Waterloo, 2002.
[32]
Eugene Charniak, “Statistical Techniques for Natural Language Parsing,” AI Magazine, vol. 18, no. 4, pp. 33-44, 1997.
[33]
Yih-Farn Chen, Emden Gansner, and Eleftherios Koutsofios, “A C++ Data Model Supporting Reachability Analysis and Dead Code Detection,” IEEE Transactions on Software Engineering, vol. 24, no. 9, 1998.
[34]
Yih-Farn Chen, Michael Y. Nishimoto, and C. V. Ramamoorthy, “The C Information Abstraction System,” IEEE Transactions on Software Engineering, vol. 16, no. 3, pp. 325-334, 1990.
[35]
Elliot J. Chikofsky and James H. Cross II, “Reverse Engineering and Design Recovery: A Taxonomy,” IEEE Software, pp. 13-17, 1990.
[36]
Elliot Chikofsky, David E. Martin, and Hugh Chang, “Assessing the State of Tools Assessment,” IEEE Software, pp. 18-21, May, 1992.
[37]
Michael L. Collard, Huzefa H. Kagdi, and Jonathan I. Maletic, “An XML-Based Lightweight C++ Fact Extractor,” presented at Eleventh International Workshop on Program Comprehension, Portland, OR, pp. 134-143, 10-11 May 2003.
[38]
Jim R. Cordy, Kevin A. Schneider, Thomas R. Dean, and Andrew J. Malton, “HSML: Design Directed Source Code Hot Spots,” presented at Ninth International Workshop on Program Comprehension, Toronto, Canada, pp. 145-154, 12-13 May 2001.
[39]
Mark Craven, “The Genomics of a Signalling Pathway: A KDD Cup Challenge Task,” SIGKDD Explorations, vol. 4, no. 2, pp. 97-98, 2002.
[40]
Mark Craven, “KDD Cup 2002,” http://www.biostat.wisc.edu/~craven/kddcup/, last accessed 21 July 2003.
[41]
Jörg Czeranski, Thomas Eisenbarth, Holger Kienle, Rainer Koschke, and Daniel Simon, “Analyzing xfig Using the Bauhaus Tool,” presented at Seventh Working Conference on Reverse Engineering, Brisbane, Queensland, Australia, pp. 197-199, 23-25 November 2000.
[42]
R. Dahrendorf, Class and Class Conflict in Industrial Society. London: Routledge and Kegan Paul, 1959.
[43]
D. A. de Vaus, Surveys in Social Research, Fourth Edition. London: UCL Press, 1996.
[44]
Thomas R. Dean, Andrew J. Malton, and Ric Holt, “Union Schemas as a Basis for a C++ Extractor,” presented at Eighth Working Conference on Reverse Engineering, Stuttgart, Germany, pp. 59-67, 2-5 October 2001.
[45]
Serge Demeyer, Tom Mens, and Michel Wermelinger, “Towards a Software Evolution Benchmark,” presented at International Workshop on Principles of Software Evolution (IWPSE2001), Vienna, Austria, 10-11 September, 2001.
[46]
Diego Doval, Spiros Mancoridis, and Brian S. Mitchell, “Automatic Clustering of Software Systems Using a Genetic Algorithm,” presented at Software Tools and Engineering Practice, Pittsburgh, PA, 30 August - 3 September 1999.
[47]
Robert Dubin, Theory Building. New York, NY: The Free Press, 1971.
[48]
J. Ebert, R. Gimnich, H. H. Stasch, and A. Winter, GUPRO - Generische Umgebung zum Programmverstehen. Koblenz: Fölbach, 1998.
[49]
Rudolf Eigenmann and Reinhold Weicker, “SPEC Workshop on Performance Evaluation with Realistic Applications,” http://www.specbench.org/events/specworkshop/, last accessed 10 February 2003.
[50]
Margaret A. Ellis and Bjarne Stroustrup, The Annotated C++ Reference Manual. Boston, MA: Addison-Wesley Publishing Co., 1990.
[51]
Martin S. Feather, Stephen Fickas, Anthony Finkelstein, and Axel van Lamsweerde, “Requirements and Specification Exemplars,” Automated Software Engineering, vol. 4, pp. 419-438, 1997.
[52]
Norman E. Fenton and Shari Lawrence Pfleeger, Software Metrics: A Rigorous and Practical Approach, Second Edition. Boston, MA: PWS Publishing Company, 1997.
[53]
Rudolf Ferenc, Árpád Beszédes, Mikko Tarkiainen, and Tibor Gyimóthy, “Columbus Reverse Engineering Tool and Schema for C++,” presented at International Conference on Software Maintenance, Montréal, Canada, pp. 172-181, 3-6 October 2002.
[54]
Rudolf Ferenc, Susan Elliott Sim, Richard C. Holt, Rainer Koschke, and Tibor Gyimóthy, “Towards a Standard Schema for C/C++,” presented at Eighth Working Conference on Reverse Engineering, Stuttgart, Germany, pp. 49-58, 2-5 October 2001.
[55]
P.J. Finnigan, R.C. Holt, S. Kerr, K. Kontogiannis, H.A. Müller, J. Mylopoulos, S.G. Perelgut, M. Stanley, and K. Wong, “The Software Bookshelf,” IBM Systems Journal, vol. 36, no. 4, pp. 564-593, 1997.
[56]
William Foddy, Constructing Questions for Interviews and Questionnaires: Theory and Practice in Social Research. Melbourne, Australia: Cambridge University Press, 1993.
[57]
Jean-François Girard, Rainer Koschke, and Georg Schied, “A Metric-based Approach to Detect Abstract Data types and Abstract State Encapsulation,” Journal on Automated Software Engineering, vol. 6, pp. 357-386, 1999.
[58]
Jim Gray, The Benchmark Handbook: For Database and Transaction Processing Systems. San Mateo, CA: Morgan Kaufman Publishers, Inc., 1991.
[59]
W. O. Hagstrom, “The Differentiation of Disciplines,” in Sociology of Science, Barry Barnes, Ed. Middlesex, England: Penguin Books, 1972.
[60]
Donna Harman, “Overview of the First Text REtrieval Conference (TREC-1),” presented at Text REtrieval Conference (TREC-1), Gaithersberg, MD, pp. 1-20, 4-6 November 1992.
[61]
Donna Harman, “Overview of the Fourth Text REtrieval Conference (TREC-4),” presented at Text REtrieval Conference (TREC-4), Gaithersberg, MD, pp. 1-24, 1-3 November 1995.
[62]
Mark Harman and Sebastian Danicic, “Amorphous Program Slicing,” presented at Fifth International Workshop on Program Comprehension, Dearborn, MI, pp. 70-79, 28-30 May 1997.
[63]
John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, Third Edition. Amsterdam: Morgan Kaufmann Publishers, 2003.
[64]
John L. Henning, “SPEC CPU2000: Measuring CPU Performance in the New Millennium,” IEEE Computer, no. July, pp. 28-35, 2000.
[65]
Richard C. Holt, Michael W. Godfrey, and Andrew J. Malton, “The build / comprehend pipelines,” presented at Second ASERC Workshop on Software Architecture, Banff, Canada, 18-19 February, 2003.
[66]
Richard C. Holt, Ahmed E. Hassan, Bruno Laguë, Sébastien Lapierre, and Charles Leduc, “E/R Schema for the Datrix C/C++/Java Exchange Format,” presented at Seventh Working Conference on Reverse Engineering, Brisbane, Australia, pp. 284-286, 23-25 November 2000.
[67]
Richard C. Holt, Andreas Winter, and Andy Schürr, “GXL: Toward a Standard Exchange Format,” presented at Seventh Working Conference on Reverse Engineering, Brisbane, Australia, pp. 162-171, 23-25 November 2000.
[68]
Paul Hoyningen-Huene, Reconstructing Scientific Revolutions: Thomas S. Kuhn's Philosophy of Science. Chicago, IL: University of Chicago Press, 1993.
[69]
Capers Jones, Software Assessments, Benchmarks, and Best Practices. Reading, Massachusetts: Addison-Wesley, 2000.
[70]
Rick Kazman and S. Jeromy Carrière, “View Extraction and View Fusion in Architectural Understanding,” presented at Fifth International Conference on Software Reuse, Victoria, BC, pp. 290-299, 2-5 June 1998.
[71]
Rudolf K. Keller, Reinhard Schauer, Sebastien Robitaille, and Patrick Page, “PatternBased Reverse Engineering of Design Components,” presented at Twenty-first International Conference on Software Engineering, Los Angeles, CA, pp. 226-235, 16-22 May 1999.
[72]
Barbara Ann Kitchenham, “Evaluating software engineering methods and tools. Parts 1 to 12.,” ACM SIGSOFT Software Engineering Notes, vol. 21-23, 1996-1998.
[73]
Barbara Kitchenham, Stephen Linkman, and David Law, “DESMET: a methodology for evaluating software engineering methods and tools,” Computing and Control Engineering Journal, vol. 8, no. 3, pp. 120-126, 1997.
[74]
Reinhard Klette, Siegfried Stiehl, Max A. Viergever, and Koen L. Vincken, Performance Characterization in Computer Vision. Dordrecht, The Netherlands: Kluwer Academic Publishers, 2000.
[75]
Ron Kohavi, Carla E. Brodley, Brian Frasca, Llew Mason, and Zijian Zheng, “KDD-Cup 2000 Organizers' Report: Peeling the Onion,” SIGKDD Explorations, vol. 2, no. 2, pp. 86-93, 2000.
[76]
Kostas Kontogiannis, Johannes Martin, Ken Wong, Richard Gregory, Hausi Müller, and John Mylopoulos, “Code Migration Through Transformations: An Experience Report,” presented at CASCON98, Toronto, Canada, pp. 1-12, 30 November-3 December 1998.
[77]
Brian Korel and Janusz Laski, “Dynamic program slicing,” Information Processing Letters, vol. 29, no. 3, pp. 155-163, 1988.
[78]
Rainer Koschke, Atomic Architectural Component Recovery for Program Understanding and Evolution. Ph.D., Institute of Computer Science, University of Stuttgart, 2000.
[79]
Rainer Koschke and Thomas Eisenbarth, “A Framework for Experimental Evaluation of Clustering Techniques,” presented at Eighth International Workshop on Program Comprehension, Limerick, Ireland, pp. 201-210, 10-11 June 2000.
[80]
René Krikhaar, “Reverse Architecting Approach for Complex Systems,” presented at International Conference on Software Maintenance, Bari, Italy, pp. 4-11, 1-3 October, 1997.
[81]
Thomas S. Kuhn, The Structure of Scientific Revolutions, Third Edition. Chicago: The University of Chicago Press, 1996.
[82]
Arun Lakhotia and J.M. Gravely, “Toward Experimental Evaluation of Subsystem Classification Recovery Techniques,” presented at Second Working Conference on Reverse Engineering, Toronto, Canada, pp. 262-269, 14-16 July 1995.
[83]
Thomas K. Landauer, The Trouble with Computers: Usefulness, Usability, and Productivity. Cambridge, MA: MIT Press, 1996.
[84]
Sébastien Lapierre, Bruno Laguë, and Charles Leduc, “Datrix(TM) Source Code Model and its Interchange Format: Lessons Learned and Considerations for Future Work,” presented at WoSEF: Workshop on Standard Exchange Format, An ICSE2000 Workshop, Limerick, Ireland, pp. 40-45, 6 June 2000.
[85]
Philip A. Laplante, Dictionary of computer science, engineering, and technology. Boca Raton, FL: CRC Press, 2001.
[86]
Bruno Latour and Steve Woolgar, Laboratory Life. Beverly Hills, CA: Sage Publications, 1979.
[87]
Edwin T. Layton, Jr., “Mirror-Image Twins: The Communities of Science and Technology in 19th-Century America,” Technology and Culture, vol. 12, pp. 567-580, 1971.
[88]
Eric H. S. Lee, Software Comprehension Across Levels of Abstraction. MMath, Department of Computer Science, University of Waterloo, 2001.
[89]
Timothy Lethbridge and Nicolas Anquetil, “Architecture of a Source Code Exploration Tool: A Software Engineering Case Study,” University of Ottawa Computer Science Technical Report TR-97-07, 1997.
[90]
Charles Levine, Jim Gray, Steve Kiss, and Walt Kohler, “The Evolution of TPCBenchmarks: Why TPC-A and TPC-B are Obsolete,” Digital Equipment Corporation, San Francisco, CA SFSC Technical Report 93.1, 1993.
[91]
Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz, “Building a Large Annotated Corpus of English: the Penn Treebank,” Computational Linguistics, vol. 19, pp. 313-330, 1993.
[92]
Alvin Martin, Mark Przybocki, George Doddington, and Douglas Reynolds, “The NIST Speaker Recognition Evaluation - Overview, methodology, systems, results, perspectives (1998),” Speech Communications, vol. 31, pp. 225-254, 2000.
[93]
Johannes Martin, Kenny Wong, Bruce Winter, and Hausi Müller, “Analyzing xfig Using the Rigi Tool Suite,” presented at Seventh Working Conference on Reverse Engineering, Brisbane, Queensland, Australia, pp. 207-209, 23-25 November 2000.
[94]
Tom Mens and Serge Demeyer, “Evolution Metrics,” presented at International Workshop on Principles of Software Evolution (IWPSE2001), Vienna, Austria, 10-11 September, 2001.
[95]
Robert K. Merton, “The Ethos of Science,” in On Social Structure and Science, Piotr Sztompka, Ed. Chicago: The University of Chicago Press, pp. 267-276, 1996.
[96]
Brian S. Mitchell and Spiros Mancoridis, “CRAFT: A Framework for Evaluating Software Clustering Results in the Absence of Benchmark Decompositions,” presented at Eighth Working Conference on Reverse Engineering, Stuttgart, Germany, pp. 93-102, 2-5 October 2001.
[97]
Leon Moonen, “Generating Robust Parsers Using Island Grammars,” presented at Eighth Working Conference on Reverse Engineering, pp. 13-22, 2-5 October 2001.
[98]
Will H. Moore, “Evaluating Theory in Political Science,” http://garnet.acns.fsu.edu/~whmoore/theoryeval.pdf, last accessed 1 December 2002.
[99]
Michael Mulkay, “Cultural Growth in Science,” in Sociology of Science: Selected Readings, Barry Barnes, Ed. Middlesex, England: Penguin Books, pp. 126-142, 1972.
[100] Hausi Müller and Karl Klashinsky, “Rigi- A System for Programming-in-the-Large,” presented at Tenth International Conference on Software Engineering, Raffles City, Singapore, pp. 80-86, 11-15 April 1988. [101] Gail C. Murphy, David Notkin, William G. Griswold, and Erica S. Lan, “An Empirical Study of Static Call Graph Extractors,” ACM Transactions on Software Engineering and Methodology, vol. 7, no. 2, pp. 158-191, 1998. [102] Gail C. Murphy, Robert J. Walker, and Elisa L.A. Baniassad, “Evaluating Emerging Software Development Technologies: Lessons Learned from Assessing Aspect-Oriented Programming,” IEEE Transactions on Software Engineering, vol. 25, no. 4, pp. 438-455, 1999. [103] Gail Murphy and David Notkin, “Lightweight Lexical Source Model Extraction,” ACM Transactions on Software Engineering and Methodology, vol. 5, no. 3, pp. 262-292, 1996. [104] James M. Neighbors, “Finding Reusable Software Components in Large Systems,” presented at Third Working Conference on Reverse Engineering, Monterey, CA, pp. 210, 4-8 November 1996. [105] NIST, “TIPSTER Text Program,” http://www.itl.nist.gov/iad/894.02/related_projects/tipster/overv.htm, last accessed 10 February 2003. [106] Alessandro Orso, Saurabh Sinha, and Mary Jean Harrold, “Incremental Slicing Based on Data-Dependences Types,” presented at International Conference on Software Maintenance, Florence, Italy, pp. 158-167, 6-10 November 2001.
[107] David Page, “KDD Cup 2001,” http://www.cs.wisc.edu/~dpage/kddcup2001/, last accessed 22 July 2003. [108] Bodo Parady, “SPEC Benchmark Efforts,” presented at International SPEC Workshop, University of Pavia, Italy, 22-24 September 1998. [109] Thomas O. Parry, III, Eric H. S. Lee, and John B. Tran, “PBS Tool Demonstration Report on Xfig,” presented at Seventh Working Conference on Reverse Engineering, Brisbane, Queensland, Australia, pp. 200-202, 23-25 November 2000. [110] Perennial Inc., “C++ Validation Suite,” http://www.peren.com/pages/cppvs.htm, last accessed 8 January 2002. [111] Shari Lawrence Pfleeger, “Experimental Design and Analysis in Software Engineering, Parts 1-6,” ACM SIGSOFT Software Engineering Notes, vol. 19-20, 1994-1995. [112] Shari Lawrence Pfleeger, Software Engineering: Theory and Practice. Upper Saddle River, New Jersey: Prentice Hall, 1998. [113] Pilot European Image Processing Archive, “Benchmarking in Computer Vision,” http://peipa.essex.ac.uk/benchmark/index.html, last accessed 21 July 2003. [114] Plum Hall Inc., “C and C++ Validation Test Suites,” http://www.plumhall.com/suites.html, last accessed 8 January 2002. [115] Matt Powell, “Evaluating Lossless Compression Methods,” presented at New Zealand Computer Science Research Students' Conference, Canterbury, New Zealand, pp. 35-41, 19-20 April 2001. [116] Jenny Preece, Human-Computer Interaction. Harlow, England: Addison-Wesley, 1994. [117] William J. Ray, Methods Toward a Science of Behavior and Experience, Fourth Edition. Pacific Grove, CA: Brooks/Cole Publishing Company, 1992. [118] Raj Reddy, “To Dream The Possible Dream - Turing Award Lecture,” Communications of the ACM, vol. 39, no. 5, pp. 105-112, 1996. [119] Samuel T. Redwine and William E. Riddle, “Software Technology Maturation,” presented at Eighth International Conference on Software Engineering, London, UK, pp. 189-200, 28-30 August 1985. 186
[120] Volker Riediger, “Analyzing XFIG with GUPRO,” presented at Seventh Working Conference on Reverse Engineering, Brisbane, Queensland, Australia, pp. 194-196, 23-25 November 2000.
[121] Spencer Rugaber, “A Tool Suite for Evolving Legacy Software,” presented at International Conference on Software Maintenance, Oxford, England, pp. 33-39, 30 August-3 September 1999.
[122] Steven L. Salzberg, “On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach,” Data Mining and Knowledge Discovery, vol. 1, pp. 317-327, 1997.
[123] Kamran Sartipi and Kostas Kontogiannis, “Component Clustering Based on Maximal Association,” presented at Eighth Working Conference on Reverse Engineering, Stuttgart, Germany, pp. 103-114, 2-5 October 2001.
[124] Kamran Sartipi and Kostas Kontogiannis, “A Graph Pattern Matching Approach to Software Architecture Recovery,” presented at International Conference on Software Maintenance, Florence, Italy, pp. 408-418, 7-9 November 2001.
[125] Stephen R. Schach, Classical and Object-Oriented Software Engineering with UML and Java™, Fourth Edition. Boston: WCB McGraw-Hill, 1999.
[126] Lambert Schomaker, “Unipen Home Page,” http://hwr.nici.kun.nl/unipen/, last accessed 21 July 2003.
[127] Mary Shaw, “Prospects for an Engineering Discipline of Software,” IEEE Software, pp. 15-24, November 1990.
[128] Mary Shaw, “The Coming-of-Age of Software Architecture Research,” http://shawweil.com/marian/DisplayPaper.asp?paper_id=14, last accessed 17 February 2003.
[129] Vincent Shen, “What happened to development environments?,” IEEE Software, vol. 7, pp. 20, 24, January 1990.
[130] Susan Elliott Sim, “C++ Parser-Analysers for Reverse Engineering: Trade-offs and Benchmarks,” http://www.cs.toronto.edu/~simsuz/cascon2001/, last accessed
[131] Susan Elliott Sim, Charles L.A. Clarke, and Richard C. Holt, “Archetypal Source Code Searching: A Survey of Software Developers and Maintainers,” presented at Sixth International Workshop on Program Comprehension, Ischia, Italy, pp. 180-187, 24-26 June 1998.
[132] Susan Elliott Sim, Steve Easterbrook, and Richard C. Holt, “Using Benchmarking to Advance Research: A Challenge to Software Engineering,” presented at Twenty-fifth International Conference on Software Engineering, Portland, OR, pp. 74-83, 3-10 May 2003.
[133] Susan Elliott Sim and Richard C. Holt, “The Ramp-Up Problem in Software Projects: A Case Study of How Software Immigrants Naturalize,” presented at Twentieth International Conference on Software Engineering, Kyoto, Japan, pp. 361-370, 19-25 April 1998.
[134] Susan Elliott Sim, Richard C. Holt, and Steve Easterbrook, “On Using a Benchmark to Evaluate C++ Extractors,” presented at Tenth International Workshop on Program Comprehension, pp. 114-123, 27-29 June 2002.
[135] Susan Elliott Sim and Rainer Koschke, “WoSEF: Workshop on Standard Exchange Format,” ACM SIGSOFT Software Engineering Notes, vol. 26, pp. 44-49, January 2001.
[136] Susan Elliott Sim and Margaret-Anne D. Storey, “A Structured Demonstration of Program Comprehension Tools,” presented at Seventh Working Conference on Reverse Engineering, Brisbane, Australia, pp. 184-193, 23-25 November 2000.
[137] Susan Elliott Sim, Margaret-Anne D. Storey, and Andreas Winter, “A Structured Demonstration of Five Program Comprehension Tools: Lessons Learnt,” presented at Seventh Working Conference on Reverse Engineering, Brisbane, Australia, pp. 210-212, 23-25 November 2000.
[138] Janice Singer, Timothy Lethbridge, and Norman Vinson, “An Examination of Software Engineering Work Practices,” presented at CASCON '97, Toronto, Canada, pp. 209-223, 10-13 November 1997.
[139] Standard Performance Evaluation Corporation, “SPEC Glossary,” http://www.spec.org/spec/glossary, last accessed 6 December 2001.
[140] Standard Performance Evaluation Corporation, “Welcome To SPEC,” http://www.spec.org/, last accessed 6 December 2001.
[141] Standards Coordinating Committee of the IEEE Computer Society, IEEE Standard Computer Dictionary: A Compilation of IEEE Standard Computer Glossaries. New York: Institute of Electrical and Electronics Engineers, 1990.
[142] Susan Leigh Star, “Scientific Work and Uncertainty,” Social Studies of Science, vol. 15, pp. 391-427, 1985.
[143] Susan Leigh Star, “Cooperation Without Consensus in Scientific Problem Solving: Dynamics of Closure in Open Systems,” in CSCW: Cooperation or Conflict?, Steve Easterbrook, Ed. London: Springer-Verlag, pp. 93-106, 1993.
[144] M.-A. D. Storey, K. Wong, and H. A. Müller, “How do Program Understanding Tools Affect How Programmers Understand Programs?,” presented at WCRE '97, Amsterdam, The Netherlands, pp. 12-21, 6-8 October 1997.
[145] M.-A. D. Storey, K. Wong, P. Fong, D. Hooper, K. Hopkins, and H. A. Müller, “On Designing an Experiment to Evaluate a Reverse Engineering Tool,” presented at Third Working Conference on Reverse Engineering, Monterey, CA, pp. 31-40, 8-10 November 1996.
[146] Margaret-Anne D. Storey, Susan Elliott Sim, and Ken Wong, “A Collaborative Demonstration of Reverse Engineering Tools,” ACM SIGAPP Applied Computing Review, vol. 10, no. 1, pp. 18-25, 2002.
[147] Margaret-Anne Storey, Mark Musen, John Silva, Casey Best, Neil Ernst, Ray Ferguson, and Natasha Noy, “Jambalaya: Interactive Visualization to Enhance Ontology Authoring and Knowledge Acquisition in Protégé,” presented at Workshop on Interactive Tools for Knowledge Capture (K-CAP 2001), Victoria, Canada, 20 October 2001.
[148] Bjarne Stroustrup, The C++ Programming Language, Third Edition. Boston, MA: Addison-Wesley, 1997.
[149] Tarja Systä, “Seminaari: takaisinmallintaminen” (Seminar: Reverse Engineering), http://www.cs.tut.fi/~tsysta/sem/reveng.html, last accessed 21 July 2003.
[150] TakeFive Software Inc., SNiFF+ Reference Guide. Cupertino, CA: TakeFive Software, Inc., 1998.
[151] Arthur Tateishi and Andrew Walenstein, “Applying Traditional Unix Tools During Maintenance: An Experience Report,” presented at Seventh Working Conference on Reverse Engineering, Brisbane, Queensland, Australia, pp. 203-206, 23-25 November 2000.
[152] The GCC Team, “GCC Home Page - GNU Project - Free Software Foundation,” http://gcc.gnu.org/, last accessed 8 January 2002.
[153] Walter F. Tichy, “Should Computer Scientists Experiment More?,” IEEE Computer, pp. 32-40, May 1998.
[154] Walter F. Tichy, Paul Lukowicz, Lutz Prechelt, and Ernst A. Heinz, “Experimental Evaluation in Computer Science: A Quantitative Study,” Journal of Systems and Software, vol. 28, no. 1, pp. 9-18, 1995.
[155] Michael Twidale, David Randall, and Richard Bentley, “Situated Evaluation for Cooperative Systems,” presented at Computer Supported Cooperative Work, Chapel Hill, NC, pp. 441-452, 22-26 October 1994.
[156] Vassilios Tzerpos and Ric Holt, “A Hybrid Process for Recovering Software Architecture,” presented at CASCON '96, Toronto, Ontario, pp. 1-6, 12-14 November 1996.
[157] Vassilios Tzerpos, “Software Botryology: Automatic Clustering of Software Systems,” presented at International Workshop on Large-Scale Software Composition, Vienna, Austria, pp. 811-818, 28 August 1998.
[158] Arie van Deursen and Tobias Kuipers, “Building Documentation Generators,” presented at International Conference on Software Maintenance, Oxford, England, pp. 40-49, 30 August-3 September 1999.
[159] Walter G. Vincenti, What Engineers Know and How They Know It. Baltimore, MD: Johns Hopkins University Press, 1990.
[160] Erich von Dietze, Paradigms Explained: An Evaluation of Thomas Kuhn's Contribution to Thinking About the Implications of Science. Westport, Connecticut: Praeger, 2001.
[161] Anneliese von Mayrhauser and Steve Lang, “On the Role of Static Analysis during Software Maintenance,” presented at International Workshop on Program Comprehension, Pittsburgh, PA, pp. 170-177, 5-7 May 1999.
[162] Ellen M. Voorhees and Donna Harman, “Overview of the Eighth Text REtrieval Conference (TREC-8),” presented at Text REtrieval Conference (TREC-8), Gaithersburg, Maryland, pp. 1-24, 17-19 November 2000.
[163] Robert J. Walker, Gail C. Murphy, Jeffrey Steinbok, and Martin P. Robillard, “Efficient Mapping of Software System Traces to Architectural Views,” presented at CASCON 2000, Toronto, Canada, pp. 31-40, 13-16 November 2000.
[164] Reinhold Weicker, “SPEC International Workshop: SPEC Meets Researchers and Benchmark Users,” http://www.specbench.org/events/specworkshop/germany/paderborn/general.html, last accessed 10 February 2003.
[165] Nelson H. Weiderman, A. Nico Habermann, Mark W. Borger, and Mark H. Klein, “A Methodology for Evaluating Environments,” ACM SIGOIS Bulletin, vol. 22, no. 1, pp. 199-207, January 1987.
[166] Mark Weiser, “Program Slicing,” IEEE Transactions on Software Engineering, vol. 10, pp. 352-357, July 1984.
[167] Andreas Winter, Bernt Kullbach, and Volker Riediger, “An Overview of the GXL Graph Exchange Language,” in Software Visualisation - International Seminar, Dagstuhl Castle, Germany, May 20-25, 2001, Lecture Notes in Computer Science State-of-the-Art Survey, Stephan Diehl, Ed. Heidelberg, Germany: Springer-Verlag, 2002.
[168] Jingwei Wu and Margaret-Anne D. Storey, “A Multi-Perspective Software Visualization Environment,” presented at CASCON 2000, Toronto, Canada, pp. 41-50, 13-16 November 2000.
[169] Alexander Yeh, Lynette Hirschman, and Alexander Morgan, “Background and Overview for KDD Cup 2002 Task 1: Information Extraction from Biomedical Articles,” SIGKDD Explorations, vol. 2, no. 2, pp. 87-89, 2002.
[170] Robert K. Yin, Case Study Research: Design and Methods, Second Edition. Thousand Oaks, CA: Sage Publications, 1994.
[171] Marvin V. Zelkowitz and Dolores Wallace, “Experimental Models for Validating Technology,” IEEE Computer, vol. 31, no. 5, pp. 23-31, May 1998.