Quality-based software release management
by
Homayoun Dayani-Fard
A thesis submitted to the School of Computing in conformity with the requirements for the degree of Doctor of Philosophy
Queen’s University Kingston, Ontario, Canada May 2003
© Homayoun Dayani-Fard, 2003
To my mother and the loving memories of my father
Abstract

An obstacle to resolving software engineering challenges is securing the support of management. The issues of interest to management mostly relate to the encompassing business organization, and there is not enough empirical evidence to establish a strong link between the technical and non-technical issues surrounding software product development. This dissertation provides such a link: a framework for supporting the decision making process and improving software product management. The main pillars of this framework are: locality of focus, multiplicity of concerns, data centricity, and continuous improvement. In line with the fundamentals of the ISO 9000:2000 standard, the framework aims to formalize what the organization does, provide a means of monitoring its progress, and identify problems based on factual data. A key ingredient of this framework is its infrastructure: business intelligence technologies, in particular data warehousing, provide flexibility, scalability, and repeatability for the proposed framework. The dissertation concludes with an application of the presented ideas to an industrial case study.
Acknowledgment
First and foremost, I thank The Lord God for all his blessings. He indeed works in mysterious ways. I was blessed to cross paths with many who have contributed both to my life and my academic endeavors. Some were positive, others not so. I thank them all! I owe a debt of gratitude to my doctoral supervisors, Dr. Janice Glasgow and Dr. John Mylopoulos, for their wisdom, guidance, support, and most of all their friendship. The ideas presented in this dissertation and what I have learned over the years are the result of many hours of discussions with them. More specifically, I thank Dr. Glasgow for her attention to research methodology. For months, every Friday I traveled from Toronto to Kingston grappling with a thesis statement! Also, I thank her for believing in me when I failed to believe in myself. I thank Dr. Mylopoulos for his visionary insights and attention to technical detail. The first draft of this document was less than half the size of the current version. I thank them for helping me keep my feet firmly on the ground. The next challenge I face is to strive to become as good a mentor to my students as they have been to me. I thank my examining committee for their contributions to this dissertation. I thank Dr. Victor Basili, whose work on experimental software engineering shaped the foundation of this research. His comments and suggestions for improvements were greatly appreciated and motivated the subsequent revisions of this dissertation. I thank Dr. David Penny for his careful examination of the ideas of this dissertation and for suggesting areas of improvement, in particular listing the limitations of my work. I thank Dr. Pat Martin not only for evaluating my work, but more importantly for introducing me to data warehousing. I thank Dr. Brent Gallupe for providing a management
perspective on my research. I thank Dr. Kai Salomaa for his careful examination of the code-base health issues. Lastly, I thank Dr. Peter Taylor, the chair of the defense, for his professionalism and sense of humor. For over six years I have worked at the IBM Toronto Lab while pursuing my PhD. Without the support, guidance, and encouragement of my colleagues, both past and present, reaching this point would not have been possible. I thank them all. In particular, I thank Dr. Gabby Silberman for teaching me the importance of focus. During a performance review he corrected my sentence “... if I get my PhD ...” by pointing out the importance of “when” in place of “if”. I thank Stephen Perelgut, my manager at the IBM Centre for Advanced Studies, for teaching me the importance of timely decision making. I thank Sal Vella, my mentor at IBM, for introducing me to the many challenges a development manager must face on a regular basis. The idea of multiplicity of concerns comes from many hours of discussion with him. I thank Linda Jairam and Kelly Ryan for supporting my ideas from the early stages and encouraging me to enhance and apply them in a real environment. More specifically, I thank Kelly Ryan for teaching me that software development resembles a moving train and as such does not stop for “good” ideas, except through small but effective improvement initiatives. I thank past and present members of the CSER project. During my tenure at the IBM Centre for Advanced Studies I had the pleasure of working with many students and professors. The ideas presented here were incepted during many discussions with members of the CSER project. In particular I thank Dr. Hausi Müller for his constant challenge of “so what?”, forcing me to think of the impact of my research. I thank my long-time teacher, Dr. Ric Holt, for many spontaneous discussions that have shaped my understanding of software engineering issues. I thank Dr.
Igor Jurisica for his friendship and many discussions on decision support and the importance of factual evidence. I thank the members of the Queen’s University School of Computing who, despite my part-time status, always made me feel part of the family. In particular, I thank Dr. Jim Cordy, who over the years has been an inspiration. I thank him for pointing out the fact that “it is just a PhD”: there is more to life than a PhD and there is a lot more to being a researcher than one dissertation. I thank Dr. Henk Meijer for his advice at the start of my doctoral program that pursuing a PhD is life
itself, not just a road to it. I thank Dr. David Lamb for supervising the early part of my studies and encouraging me to seek the input of software practitioners, which consequently led to my joining IBM. I thank Irene LaFleche, who on numerous occasions has helped me deal with my graduate-studies blues. Each time, she reminded me that there was a spot on the shelf for my dissertation. I thank Debby Robertson, who made part-time off-campus studies a possibility. Without her help and cheerful attitude, dealing with the university administration would not have been easy. I thank my late father, Reza, for teaching me the importance of learning and knowledge, for his selfless love, and for trusting me and my dreams at difficult times in his life. My only regret is that he is not here to share this moment with me. I thank my mother, Naheed, who despite having to live far away from me has always managed to share her love, support, and well wishes with me through her letters, phone calls, and most recently emails. I thank my sister, Katayoun, brother-in-law Ali, and their children Yasaman and Saman for their encouraging emails and phone calls during the last stages of my work. In return, as I promised, I am graduating from university before they do. I thank my wife, Parisa, for putting up with my grunts and difficult work habits during the writing of this dissertation. I thank my uncle, Dr. Razeghi, for encouraging me at an early age to pursue graduate studies. Lastly, at times during my graduate career I needed a word or two of encouragement. Surprisingly, it came from the acknowledgments of the dissertations that I read. I owe a debt of gratitude to their authors. I hope to pay my debt by writing a few words of encouragement for those who may read this section searching for such words. Do not lose hope. Some have it easy, others have it difficult. It may be a long road, but to reach the end you must persist.
Despite all difficulties, in the end, you would not want it any other way!
Thank you Blair Adamache, Fran Beaulne, Chantal Buttery, Joseph Chang, Brenda Chow, Thomas Chu, Lena Frith, Ayaz Isazadeh, Hosein Isazadeh, Janet Holland, (late) Bob Howard, Debra Howard, Jennifer Howard, Ruth Howard, John Johnston, Ivan Kalas, Debbie Kilbride, Delarae Lee, Kelly Lyons, Anumpa Mukherjee, Martha Murphy, Derek Rayside, Reza Sadat, Ilene Samuels, Doug Warren, Jeff Weir, Darrin Woodard, and many of my students at York University Database course.
Statement of originality I, Homayoun Dayani-Fard, hereby certify that this Ph.D. dissertation is original and all the ideas and inventions attributed to others have been properly referenced.
Contents

Abstract . . . ii
Acknowledgement . . . v
Statement of originality . . . vi
Table of Contents . . . vii

1 Software management . . . 1
1.1 Software release life cycle . . . 3
1.2 Management issues . . . 5
1.3 A new approach . . . 7
1.4 Outline . . . 10

2 Quality overview . . . 11
2.1 Quality definitions . . . 11
2.2 Quality measurement . . . 15
2.3 Quality improvement . . . 19
2.4 Summary . . . 24

3 A multidimensional management framework . . . 26
3.1 Quality definitions . . . 27
3.2 Quality evaluation . . . 29
3.3 Quality monitoring . . . 36
3.4 A comprehensive framework . . . 39
3.5 Summary . . . 44

4 A software data warehouse . . . 45
4.1 An overview of data warehousing . . . 45
4.2 Software management via data warehousing . . . 48
4.3 Data sources in software development . . . 51
4.4 A software management strategy . . . 55
4.5 Summary . . . 57

5 An object-relational schema for programs . . . 58
5.1 Program representations . . . 58
5.2 An object-relational schema for programs . . . 62
5.3 Manipulating program data . . . 67
5.4 Summary . . . 69

6 An implementation methodology . . . 70
6.1 Approach . . . 70
6.2 Input layer: identifying data sources . . . 72
6.3 Middle layer: designing the data warehouse . . . 76
6.4 Outer layer: defining quality, progress, and health . . . 83
6.5 Analytic tools . . . 92
6.6 Summary . . . 98

7 A software management case study . . . 101
7.1 Overview . . . 101
7.2 Development processes . . . 102
7.3 Management processes . . . 104
7.4 Status quo . . . 105
7.5 Summary . . . 107

8 An implementation of a software data warehouse . . . 108
8.1 Overall architecture . . . 109
8.2 The central data warehouse . . . 110
8.2.1 Organizational chart . . . 111
8.2.2 System test . . . 112
8.2.3 Project diary . . . 113
8.2.4 CMVC . . . 115
8.2.5 Code-base . . . 117
8.2.6 Regression test . . . 117
8.2.7 Performance test . . . 120
8.2.8 Others . . . 120
8.3 Operation . . . 121
8.4 Summary . . . 123

9 A three-dimensional management model . . . 124
9.1 Quality . . . 124
9.2 Progress . . . 129
9.3 Health . . . 131
9.4 Overall management strategy . . . 136
9.5 Summary . . . 139

10 Management evaluation . . . 141
10.1 Management feedback . . . 142
10.2 Operational analysis . . . 143
10.3 Tactical analysis . . . 145
10.4 Strategic analysis . . . 147
10.5 Limitations . . . 151
10.6 Summary . . . 152

11 Conclusion . . . 154
11.1 Contributions . . . 154
11.2 Future Directions . . . 156
11.3 Summary . . . 158

Bibliography . . . 165

A Complete C/C++ data model . . . 166

Vita . . . 170
List of Tables

6.1 Input layer data sources and their characteristics . . . 76
6.2 A definition for defect management subgoal . . . 87
6.3 Example metric definitions . . . 93
8.1 Organizational chart table . . . 111
8.2 Schema for system test . . . 113
8.3 Schema for Features table . . . 115
8.4 Schema for tables Components, Defects, and Files . . . 118
8.5 Schema for the tables Program-entities and relationships . . . 119
8.6 Sampling and growth rates for input data sources . . . 121
List of Figures

1.1 Software release lifecycle . . . 4
1.2 A management support system . . . 6
1.3 A proposed architecture for software data warehouse . . . 9
3.1 An example of a Goal/Question/Metric hierarchy . . . 28
3.2 An example AND/OR tree with positive/negative influences . . . 28
3.3 A hierarchical definition with associated metrics . . . 29
3.4 Evaluating subgoals of a hierarchy . . . 30
3.5 An example goal hierarchy . . . 31
3.6 An example quality plan . . . 36
3.7 An example of quality monitoring . . . 38
3.8 A partial health definition . . . 40
3.9 A partial progress definition . . . 41
3.10 An example health plan . . . 42
3.11 A product development plan . . . 43
4.1 An architecture of a data warehouse . . . 48
4.2 A proposed architecture for software data warehouse . . . 51
5.1 A simple C++ program . . . 59
5.2 An example AST . . . 60
5.3 An example ASG . . . 61
5.4 Part of C/C++ data model . . . 64
5.5 C/C++ data model: entity set hierarchy . . . 65
5.6 C/C++ data model: associations . . . 66
6.1 The ERM for the organizational chart . . . 74
6.2 The schema for the project diary . . . 75
6.3 The schema for CMVC . . . 77
6.4 The ERM for the test logs . . . 78
6.5 The schema for the middle layer, part 1 . . . 79
6.6 The schema for the middle layer, part 2 . . . 80
6.7 The schema for the middle layer, part 3 . . . 81
6.8 The schema for the middle layer, part 4 . . . 82
6.9 Top level definition of a quality goal . . . 85
6.10 Lower level definition of a quality goal, part 1 . . . 85
6.11 Lower level definition of a quality goal, part 2 . . . 86
6.12 Top level definition of a progress goal . . . 88
6.13 Top level definition of a progress goal . . . 89
6.14 The definition of the health goal . . . 91
6.15 Aggregate tables for accelerating index computations . . . 94
6.16 A snowflake schema for the progress dimension . . . 95
6.17 A snowflake schema for the quality dimension . . . 96
6.18 A snowflake schema for the health dimension . . . 97
6.19 A snowflake schema for the three dimensional management model . . . 98
6.20 Star schema for defect analysis and reporting . . . 99
7.1 Software release lifecycle . . . 102
7.2 Management and development processes and data sources . . . 105
8.1 The architecture of the software data warehouse . . . 109
8.2 ER diagram for organizational data . . . 111
8.3 An excerpt of the ER diagram of the system test data . . . 112
8.4 An excerpt of the ER diagram for the project diary . . . 114
8.5 An excerpt of the ER diagram for CMVC . . . 116
8.6 An excerpt of the ER diagram for the regression test data . . . 119
8.7 An excerpt ER diagram for the performance test data . . . 120
9.1 An excerpt of quality definition . . . 126
9.2 An excerpt of the progress definition . . . 130
9.3 An excerpt of health definition . . . 135
9.4 The computed, planned, and predicted values of the progress goal . . . 136
9.5 The computed, planned, and predicted values of the health goal . . . 138
9.6 The computed, planned, and predicted values of the quality goal . . . 139
10.1 Ratio of code-base changes over number of developers . . . 149
10.2 Multiple releases . . . 151
11.1 A quality release management system . . . 156
Chapter 1
Software management

Improving software management directly contributes to the improvement of the overall software development process. The literature identifies management issues as a root cause of software development problems (Board, 1994; The Standish Group International, 1995; Jones, 1996). Software engineers have identified many problems contributing to cost overruns, late schedules, and poor quality of software products. Many solutions have been proposed to remedy such problems. However, software developing organizations rarely, if ever, adopt such proposed solutions. The adoption of new solutions requires the direct involvement of management. As a result, management inaction contributes to the ever-growing problems of software development projects. To put management inaction, or late action, in perspective one must take into account three management concerns. First, there are no silver bullets for software engineering problems (Brooks, 1995; Boehm and Basili, 2000). At best, proposed solutions are partial and/or temporary. Software systems continually evolve, which changes the nature of identified problems (Lehman, 1998b). Second, the problems of software engineering reflect the multiplicity of concerns that management must face. Quality objectives, resource allocations, skills, and cost and schedule constraints are among these concerns. A critical view of software management practices reveals that such concerns are not independent of one another, and management must make compromises in addressing them together. Third, software engineering still remains in its infancy with regard to empirical studies. Problems cited in the literature rarely come with sufficient documentation
to pin down their scope. Frequently, reported problems are scrutinized in the literature and counter-examples put forward, which results in the problem or its proposed solution becoming suspect (Davis, 1995; Glass, 1999). One approach for management to effectively contribute to the improvement of software development involves the application of a decision support system for tactical and strategic analyses: an (informational) decision support system that enables management to independently validate the existence of a problem, assess its impact, and evaluate the potential benefits of proposed solutions. This dissertation focuses on the issues of improving the management of software development projects, in particular tactical and strategic analyses. The underlying approach rests on narrowing the domain of software engineering, recognizing the management’s multiplicity of concerns, managing by data, and continuous improvement. First, to narrow the domain, this dissertation focuses on commercial software products (packaged or shrink-wrapped software (Carmel and Becker, 1995)). Such products have evolved through multiple successful releases and have become a contributor to the overall success of the encompassing business organization. Second, such organizations define multiple goals for achieving their overall objectives, and these goals may conflict with one another. Any potential problem must be evaluated with respect to these goals. Third, to make effective decisions in response to an identified problem, management benefits from a decision support mechanism based on factual data: more specifically, a system for identifying potential problems and their impact, evaluating proposed solutions, and learning from past decisions. Lastly, due to the partiality of the proposed solutions and evaluation mechanisms, the decision support mechanism must continually improve.
This dissertation proposes an (informational) decision support mechanism by drawing parallels between software management and business intelligence (BI) (Whitehorn and Whitehorn, 1999). BI refers to leveraging large and diverse data sources, representing various aspects of a business, to provide a balanced perspective of the overall business objectives for decision support. Similarly, the management of software product development may reduce to a more mainstream management scenario (e.g., similar to market analysis), with BI providing the necessary information for making appropriate decisions (Dutta et al., 1997).
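The parallel drawn here between software management and BI can be made concrete with a toy sketch. The example below is illustrative only: the table layout, component names, and defect records are hypothetical and are not taken from the warehouse schemas developed later in the dissertation. It loads a handful of defect records into an in-memory table and computes the kind of simple factual metric a manager could use for decision support:

```python
import sqlite3

# Hypothetical defect records of the kind a change-control system
# might export: (component, severity, state); severity 1 is worst.
defects = [
    ("parser", 1, "open"),
    ("parser", 3, "fixed"),
    ("optimizer", 2, "open"),
    ("optimizer", 1, "deferred"),
    ("runtime", 3, "fixed"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE defects (component TEXT, severity INTEGER, state TEXT)")
conn.executemany("INSERT INTO defects VALUES (?, ?, ?)", defects)

# A simple managerial metric: open defects per component, worst severity first.
rows = conn.execute(
    """SELECT component, COUNT(*) AS open_defects, MIN(severity) AS worst
       FROM defects WHERE state = 'open'
       GROUP BY component ORDER BY worst"""
).fetchall()

for component, open_defects, worst in rows:
    print(component, open_defects, worst)
```

The point of the sketch is not the query itself but the stance behind it: once development artifacts live in queryable tables, managerial questions become ordinary data analysis.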
Management will define the objective of a software development project using a multi-dimensional space, taking into account different concerns, and will use the available data to support its decisions. A key to realizing this approach is that, for management purposes, software is data. More succinctly, this dissertation claims that viewing software as data and leveraging BI methods and technologies, in particular data warehousing, will improve the management of commercial software product development. To demonstrate this claim, the dissertation provides a framework for defining and measuring management goals, both tactical and strategic. Furthermore, it provides an architecture for the monitoring, prediction, and analysis of these goals based on local factual data, which in turn can support the decision making process. The flexibility built into this framework provides a facility for continual improvement of the proposed architecture and enables continual improvement of product management in the face of software evolution. The next section presents an overview of the commercial software product release life cycle. Section 1.2 describes the issues facing management and the associated risks. Section 1.3 proposes a management improvement framework and a possible implementation for it. Section 1.4 concludes with the outline of the remainder of this dissertation.
1.1 Software release life cycle
In software product development the code-base used for creating the end product remains the key ingredient of the next release. After the development group releases a version of the software (i.e., makes it generally available), it splits the code-base to create two new projects. Figure 1.1 illustrates a typical release-based development life cycle. The bottom line in the figure represents the progress toward the future release of the product. The plan for the new product contains a set of new features or functionalities that must be added to the previous version of the product. Each feature requires a detailed design to determine the necessary changes to the code-base, followed by the implementation of these changes and unit testing to verify the proper functionality of the newly implemented feature.
[Footnote: Software in this dissertation follows the IEEE Standard Glossary, which defines software as “computer programs, procedures, rules, and possibly associated documentation and data pertaining to the operation of a computer system.”]
[Figure 1.1: two parallel pipelines, each reading Design Code Unit Test (DCUT) → Build → Function/System/Performance Testing; after General Availability (GA), the top (Defects) stream leads to a Fixpack/Minor Release and the bottom (Features/Defects) stream to the Next GA.]
Figure 1.1: Software release lifecycle: after the product becomes generally available, the code-base splits into two; the top line represents the service-stream (fix-pack) development while the bottom line represents the new release.
The development can potentially proceed in parallel, with each group working on different features. The result is multiple changes to the code-base. While one feature is under design, another may be under implementation or unit testing. For that reason, a version of the new product is created on a regular basis: changes to the code-base are committed, compiled, and linked to form the executable image (Cusumano and Selby, 1997). While unit testing can be carried out on any intermediate version of the product, more product-oriented validations require the product to reach a stable state (i.e., code freeze). At such points, system verification begins in order to determine whether the product as a whole satisfies its intended specifications. Regression testing is carried out throughout the development cycle, but after the code freeze it becomes more focused, to ensure that new functionality has not caused a regression from previously valid behavior. Lastly, performance testing must be carried out to determine non-functional attributes of the product such as execution speed and resource consumption. Any identified anomaly results in a defect: a request for a corrective or adaptive change. Depending on the severity of a defect and the overall project schedule, the defect may be fixed, deferred, or a work-around may be identified. Deferred defects, in conjunction with those found by customers, form the basis for the minor releases and/or fix packs. This path is shown in the top line of Figure 1.1. The objective of this project is to implement fixes for defects that were identified
with the previous release(s) of the product. Depending on the nature of identified defects, a fix may also be implemented in the future release. The two lines of the release cycle are equally significant. While the bottom line focuses on new functionalities, the top line focuses on quality and customer satisfaction.
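The defect routing described above (fix, defer, or work around, depending on severity and schedule) can be sketched as a small routine. This is one possible reading of the text, not code from the dissertation; the severity scale, the code-freeze rule, and the stream labels are assumptions introduced for illustration:

```python
# Sketch of the defect routing described above: severe defects are fixed
# in the current release; lesser ones found after code freeze are deferred
# to the service stream (fix-packs / minor releases).
def route_defect(severity: int, code_frozen: bool) -> str:
    """Return the stream that absorbs a defect (severity 1 = most severe)."""
    if severity == 1:
        return "fix in current release"  # blocks General Availability
    if not code_frozen:
        return "fix in current release"
    # After code freeze, non-critical fixes go to the top line of Figure 1.1.
    return "defer to fix-pack / minor release"

print(route_defect(1, code_frozen=True))
print(route_defect(3, code_frozen=True))
```

A real project would add more inputs (customer impact, available work-arounds, remaining schedule), but the two-stream structure of the routing stays the same.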
1.2 Management issues
Since a commercial product has multiple releases, there will generally be multiple projects and product versions. For instance, at any given point in time there may be several fix-packs against previous releases as well as the next release itself. In other words, management must cope with a family of products and their associated objectives and problems (Parnas, 1979). At any given point multiple defects/features are incorporated into the code-base, multiple versions of the code-base are created, testing scenarios are executed, and resources are allocated for each task. While older releases of the product are gradually retired, more current releases are maintained, the new release is actively developed, and future releases are planned. When dealing with small isolated software projects it may be possible for management to provide accurate reports on progress, quality, schedule, and cost. However, evidence suggests that such control is lacking in large software development projects (van Genuchten, 1991; The Standish Group International, 1999). To identify the issues facing software management one must determine what is expected of management. Management must set goals and provide direction for product development. It must devise a plan and monitor the progress made toward this plan (operational analysis). When problems arise, management must initiate an impact assessment of the problem and search for potential solutions and their implementations (tactical analysis). Furthermore, in the face of the continual evolution of software products, management must set longer-term goals beyond the current release and make strategic decisions (strategic analysis). An obstacle facing management is the lack of supporting evidence. In many cases, despite promising tools and technologies, there is little data supporting managerial decisions. Consider the problem of defect prediction.
There are several models for defect prediction (Card, 1998), yet all of them have come under severe scrutiny. For managers, the difficulties lie in finding answers to questions such as:
CHAPTER 1. SOFTWARE MANAGEMENT
• Does this problem apply to our project? What evidence supports the existence of the problem? And how does this problem impact the project?

• What are possible solutions? What evidence supports validity of these solutions? And what is the impact of implementing these solutions?

• How can such problems be avoided? What are the long term strategies?

To effectively cope with such difficulties, management must rely on evidence. This style of management is called managing by data. Since there are no a priori solutions to software development problems, management is forced to make compromises (McConnell, 2002). However, such compromises must be weighed against the overall objectives of the project and evaluated to assure continual improvement.
Figure 1.2: A management support system: the development processes create and modify data that is stored in the development databases, while the management processes use data from the warehouse to control and improve the development processes. The longer-term strategic planning, based on data analysis, provides control and feedback to the management processes.
1.3
A new approach
This dissertation is founded on the premise that leveraging BI technologies can improve the management of software development projects. To demonstrate the validity of this claim, this section proposes a new approach to software management. This approach has four key aspects:
• Locality of focus: Each software product development effort is unique: it has its own history, its own processes and procedures. To effectively manage software product development, management must focus on the characteristics of the local project (Glass, 2003). Management must identify problems based on local evidence. Similarly, solutions must be tailored to best fit the local problems.

• Data centricity: Software product development must be managed by data. There are no a priori models for software management; the only constant in the management equation is change itself. Management must continually engage in process improvement, but such improvements must be based on local evidence. To achieve this, the focus must shift from developing management processes to the storage and maintenance of supporting data. Software development teams must view software as data and create processes for the creation, modification, and analysis of this data. Developers create, modify, and delete data, while managers use it for analysis and decision support. In recent years, the literature has included many references to goal-oriented and measurement-based management (Basili and Musa, 1991; Basili et al., 1995). In such cases, data gathering plays an integral part in supporting management. Adopting a BI approach, management can view software as data that should never be discarded. Management tools must transform existing data into information for supporting the decision-making process. As management applies the information to decision making, it defines more structure on the underlying data. The added structure may take the form of an association between entities that previously had not been related, or of the creation of new entities.

• Multiplicity of concerns: A common goal for every software development project is to deliver a "quality" software product. However, as Chapter 2 will describe, there is no universal definition of quality, and any realistic definition must take into account its various aspects. Furthermore, quality cannot be defined in isolation; there are constraints that force management to make compromises when faced with conflicting goals.

• Continuous improvement: Software products evolve and change. Processes and procedures that were effective during one release may not be as effective during future releases. Furthermore, as discussed under data centricity, there are no absolute solutions. To assure continued success, all management and development processes, as well as supporting infrastructures, must be continually improved. In line with software evolution, the data representing the software must also grow to provide more depth and structure.
Taking into account the four pillars of the proposed management approach, this dissertation proposes a management support architecture (as depicted in Figure 1.2). This architecture divides the processes involved in a software development project along data lines: development processes that create, modify, and destroy data to reflect the current state of the product, and management processes that use the data to perform analyses and make decisions to meet the objectives of the project. There are two data repositories in the proposed architecture, corresponding to the development and management processes. While the development repositories store data reflecting the status of various aspects of the project, the software data warehouse integrates these data sources into a non-volatile, temporal, and subject-oriented repository for extracting information about the project as a whole. Figure 1.3 depicts the architecture of the software data warehouse. The available information, in turn, may result in feedback to the development processes. This feedback is part of the continuous improvement cycle. Based on management decisions, local development processes may be altered, optimized, or redefined. Similarly, as management identifies shortcomings in the data needed for making decisions, changes can be made to the development processes to create more data, maintain existing data, or add more structure to the data. This feedback results in improvements in the infrastructure as well as the management processes. In conclusion, the contributions of this dissertation can broadly be categorized as follows.
[Figure 1.3 diagram labels: input data sources (CMVC data, test data, project management data), populating subsystem, central warehouse, metadata, and focused areas such as progress, programs, quality, and health.]
Figure 1.3: A proposed architecture for software data warehouse: on the input layer data from development databases is extracted, cleansed, and loaded into the data warehouse, while on the outer layer the focused areas are defined by extracting data from the middle layer.
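To make the layering concrete, the sketch below builds a minimal star schema of the kind the middle layer might use, here in SQLite. The table names, column names, and sample data are hypothetical illustrations, not the schema of the case-study warehouse.

```python
import sqlite3

# A minimal, hypothetical star schema for a software data warehouse:
# one fact table (defect events) keyed to time and component dimensions.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE dim_time      (time_id INTEGER PRIMARY KEY, release TEXT, week INTEGER);
CREATE TABLE dim_component (comp_id INTEGER PRIMARY KEY, name TEXT, owner TEXT);
CREATE TABLE fact_defect   (defect_id INTEGER PRIMARY KEY,
                            time_id  INTEGER REFERENCES dim_time(time_id),
                            comp_id  INTEGER REFERENCES dim_component(comp_id),
                            severity INTEGER);
""")
cur.executemany("INSERT INTO dim_time VALUES (?,?,?)",
                [(1, "R1", 10), (2, "R1", 11)])
cur.executemany("INSERT INTO dim_component VALUES (?,?,?)",
                [(1, "parser", "team-a"), (2, "ui", "team-b")])
cur.executemany("INSERT INTO fact_defect VALUES (?,?,?,?)",
                [(1, 1, 1, 2), (2, 1, 2, 1), (3, 2, 1, 3)])

# An "outer layer" style query: defects per component for release R1.
rows = cur.execute("""
    SELECT c.name, COUNT(*) FROM fact_defect f
    JOIN dim_time t ON f.time_id = t.time_id
    JOIN dim_component c ON f.comp_id = c.comp_id
    WHERE t.release = 'R1'
    GROUP BY c.name ORDER BY c.name
""").fetchall()
print(rows)  # → [('parser', 2), ('ui', 1)]
```

The aggregation query plays the role of an outer-layer "focused area": it extracts a subject-oriented view (defects per component per release) from the integrated middle layer.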
• A multi-dimensional management framework

– Focus on management goals and their evaluation in the local project;

– Provide a means for defining measurable goals, enabling their values to be monitored and predicted;

– Provide a means for organizing defined goals along various dimensions of interest in a hierarchical manner; and

– Provide the flexibility needed to change and modify such goals in the face of unforeseen events.

• A data warehousing solution

– Focus on managing by data;

– Develop an architecture for the storage and maintenance of integrated, temporal, and subject-oriented software data;

– Implement management support for operational, tactical, and strategic analysis;

– Provide a means of storing project histories for replay; and

– Provide the flexibility necessary, based on commercial data management technologies, for the evolution of data.

• A case study

– Demonstrate an application of the proposed framework to an industrial case study;

– Implement a concrete architecture of a software data warehouse; and

– Assess the impact of the software data warehouse on management tasks.
1.4
Outline
The remainder of this dissertation is outlined as follows. Chapter 2 provides an overview of the concept of quality. A key objective of software development projects is to deliver a "quality" product. Chapter 2 shows that quality is multi-dimensional, must be locally defined, and must be continually improved. Chapter 3 describes the construction of a multi-dimensional management model. It begins by focusing on quality concerns and expands to capture other concerns that constrain quality. Chapter 4 provides an overview of data warehousing solutions and a high-level architecture for the software data warehouse. Chapter 5 follows up on a key challenge in developing a software data warehousing solution: storing program data. This chapter presents an object-relational model of program data that enables the storage of program data in commercial databases and the use of common query mechanisms. Chapter 6 presents an approach to the construction of a software data warehouse. This chapter uses a four-step methodology to construct a software data warehouse similar to the one developed in Chapter 8. Chapter 7 sets the stage by describing the status quo and identifying the input layer of the software data warehousing solution. Chapter 8 provides a high-level view of the middle layer of the software data warehouse, and Chapter 9 shows the definition of a three-dimensional management model, which constitutes the outer layer of the software data warehouse. Chapter 10 looks at the improvements made to the management tasks of a software development project by introducing the approach proposed in earlier chapters. Finally, the dissertation concludes in Chapter 11 with a summary and an outline of possible future directions.
Chapter 2
Quality overview

Chapter 1 discussed the multiplicity of concerns that software management must face. Quality is an integral concern of software management. But what is quality? This chapter begins by answering this question. It starts by arguing for the importance of having a definition of quality, a means of evaluating it, and a process for its continual improvement. The chapter also surveys the reported work on quality definition, evaluation, and improvement. It paves the way for the multi-dimensional management framework discussed in Chapter 1 and presented in Chapter 3. It will show that quality is multi-dimensional, subject to a number of constraints, must be defined locally, and must be continually improved.
2.1
Quality definitions
Before delving into the discussion of quality, it is important to realize the advantage of having a definition of software quality. A concise definition of quality can facilitate its evaluation and improvement. Furthermore, if there were a universal and unambiguous definition of quality, the concepts and ideas that are successfully applied in one domain could be transferred to other domains (e.g., software development). However, history has shown that quality is not a concept that can be easily and unambiguously defined (Gillies, 1992). From the early years of the 20th century, manufacturing industries grappled with the issues of quality management and improvement. Several
quality experts proposed various definitions of quality, each with different emphases that resulted in sometimes complementary quality management frameworks. Although these definitions do not converge on a common theme, one can identify some key attributes of the concept of quality. These attributes are at the heart of the approach presented in the rest of this dissertation. W. Edwards Deming, generally credited with founding modern quality management, defined quality as "A predictable degree of uniformity and dependability at low cost and suited to the market" (Deming, 1986). Deming was an advocate of statistical quality control and placed emphasis on predictability: the ability to repeat an achievement with the same degree of quality. In his definition, Deming acknowledged the existence of constraints that may affect the resulting quality, and the compromises that may have to be made. Juran, another quality advocate of the same era, provided a definition somewhat complementary to Deming's. Juran defined quality as "fitness for purpose" (Juran, 1979). His definition placed a great deal of emphasis on the concept of "customer" requirements. In a typical organization, there are internal as well as external customers: each stage of the development must satisfy its own customers. Juran's definition points out the existence of different views of quality that must be taken into account. A contrast to Deming's and Juran's definitions was proposed by Crosby (Crosby, 1979), who emphasized the importance of conforming to the specified requirements with no deviation. Hence, quality means "zero defects". Crosby argued that in order to achieve zero defects (i.e., perfection), the focus must be placed on prevention as opposed to traditional inspection and testing. The quality literature offers many discussions, additions, and objections to these definitions.
For example, the International Organization for Standardization (ISO) provides a definition of quality: "The totality of features and characteristics of a product or service that bear on its ability to satisfy specified or implied needs" (ISO, 1986). An important point that can be derived from the variety of definitions is that quality cannot be universally and unambiguously defined. However, based on these definitions one can outline a number of key attributes of quality (Gillies, 1992).

• Quality is relative: The definitions provided here take into account concepts such as "customer expectations," "cost," and "marketplace." Such concepts suggest that quality means different things in different circumstances to different stakeholders. For instance, consider a luxury car and an economy car; they represent quality in different ways. In the first case it is the features of the car, in the second its price and cost of operation.

• Quality is multidimensional: There are many contributing factors to quality. Various attributes of a product (or a process) can contribute to the overall perception of its quality. For instance, the quality of a product may depend on its usefulness, its ease of use, its cosmetic appeal, and the clarity of its manuals. Each of these dimensions may also represent a different stakeholder.

• Quality is subject to constraints and compromises: Aside from its multidimensionality, each contributing factor of quality may have a different degree of importance, which provides room for compromise. For example, the clarity of user manuals can be overlooked when a product is easy to use. Similarly, a product's cost or its time of availability on the market may overshadow other criteria in its quality assessment.

• Quality criteria are not independent: Quality criteria usually conflict with one another. For example, in project management, it is a rule of thumb that only two of the three factors of cost, functionality, and schedule can be improved at once. If more functionality is required, either the cost or the schedule must be adjusted to reflect this change.

These qualifications have resulted in various classifications of quality according to different views or perspectives. One of the most influential classifications was proposed by Garvin (Garvin, 1984), who studied the perception of quality in various domains and described quality according to five different views:

• The transcendental view relates quality to innate excellence: something that can be recognized but not defined.

• The user view relates quality to fitness for purpose.

• The manufacturing view relates quality to conformance to specification.
• The product view relates quality to the inherent characteristics of the product.
• The value-based view relates quality to the "price" a customer is willing to pay.

The discussion provided by Garvin has been applied to software quality. Common definitions posed by software quality experts emphasize "fitness for needs" and "conformance to specifications" (Kitchenham, 1990). Others identify different roles in a software project and associate a set of concerns with each of these roles (Gillies, 1992). Taking into account these characterizations of quality, there are two influential software quality models: McCall's and Boehm's. McCall's model (McCall, 1980; Watts, 1987) defines quality as a hierarchy of criteria, each of which is associated with a set of contributing factors and metrics for the evaluation of these factors. The model is divided into three broad areas at the top, which are refined into influential criteria as follows:

• Product operation: Can users quickly learn to efficiently operate the product? This area is further divided into: correctness, reliability, efficiency, integrity (security), and usability.

• Product revision: Can developers maintain, modify, and test the product? This area is further divided into: maintainability, flexibility, and testability.

• Product transition: Can the product be ported to new environments or inter-operated with other software products? This area is further divided into: portability, reusability, and interoperability.

The criteria described here are further refined into their respective contributing factors, and each factor is associated with a set of metrics. For example, McCall defines modularity as a contributing factor for maintainability, which may be associated with a complexity measure such as fan-out (i.e., the number of function calls from one module to other modules). Boehm's model (Boehm and Brown, 1978) is also hierarchical but is based on a larger number of criteria than McCall's model. Boehm's model is based on the uses that are made of the system: "general utility" or "as is utility".
The "as is" division is a subset of "general utility" and roughly reflects the product operation area of McCall's model described previously. The other divisions of "general utility" are maintainability and portability, each of which is further divided into more criteria. Lower in the hierarchy, criteria for the model include: reliability, efficiency, understandability, accuracy, structuredness, augmentability, and self-descriptiveness. Some of these criteria interrelate with one another, which points out that compromises must be made to achieve optimum quality. The overall quality is measured as a weighted summation of the various criteria. Both models share several common characteristics: they are based on the user's view, they take into account the developers' concerns, and the overall quality is defined as a summation of the identified criteria. However, neither of the two models claims to be, or represents, a universal and exclusive software quality model. In 1992, the International Organization for Standardization (ISO) proposed the ISO 9126 software quality standard. The standard defines the quality characteristics as: functionality, reliability, usability, efficiency, maintainability, and portability. Each organization can further refine these characteristics into their influential factors and define a set of metrics for each of these factors. The standard thus provides a recommendation from which an organization can develop its local definition of quality. ISO has also published a sample refinement of ISO 9126 that refines each view and identifies a set of associated metrics for it; however, no standard pertains to this refinement. The next section will take a closer look at metrics and quality measurement.
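The weighted-summation scoring used by Boehm-style hierarchical models can be sketched in a few lines. The criteria, weights, and scores below are invented for illustration; a real model would derive them from the refined hierarchy and its metrics.

```python
# Hypothetical criterion scores (0..1) and weights; the overall quality
# score is the weighted sum of the criteria, as in Boehm-style models.
scores  = {"reliability": 0.9, "efficiency": 0.7, "understandability": 0.6}
weights = {"reliability": 0.5, "efficiency": 0.3, "understandability": 0.2}

overall = sum(weights[c] * scores[c] for c in scores)
print(round(overall, 2))  # → 0.78
```

Because the criteria interrelate, the weights encode the compromises noted above: raising the weight of one criterion necessarily lowers the relative influence of the others.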
2.2
Quality measurement
As Section 2.1 discussed, in order to initiate software quality improvement an organization must measure quality. This section provides an overview of the ideas behind measurement and its applications to software quality. Measurement is a process by which a numeric (or symbolic) value is assigned to an attribute of an entity in the empirical world (Fenton and Pfleeger, 1997). The assigned value is then called a metric (or a measure). A metric can provide a varying degree of information based on its scale. The nominal scale provides only a classification, whereas the ordinal scale, in addition to classification, provides ordering. The interval scale, beyond the ordinal scale, provides magnitude and difference operations. Finally, the ratio scale provides a zero element and enables ratio operations. In addition to the measurement scale, in software quality measurement one must identify a
particular focus of measurement: process or product. Each of the two categories has its own set of measurements. Product measurement focuses on two attributes: size and structure. The motivation behind measuring size is to facilitate the planning, assessment, and comparison of software projects. The most basic form of size measurement involves counting the "lines of code" (LOC). This measurement is simple and yields a numeric value. Using the LOC values of two programs, it is possible to compare and assess differences between the two programs. However, the comparison of two LOC values is subjective. For that reason, the definition of "lines of code" has undergone several modifications. A line of code can simply be interpreted as a line in a file, e.g., a line of text or a blank line. More specifically, a line can be classified as a line of program, a line of comment, or a blank line. A line of program can be further divided into a compiler directive, a declaration, a label, or an executable statement (Fenton and Pfleeger, 1997). Another factor that can prove challenging for the use of LOC metrics is the "complexity" of the program. For example, a sorting program is generally considered more complex than a simple pretty-print program; however, the LOC value of the latter may well be larger than that of the former. For this reason, a second category of product measurements aims to take the complexity of a program into consideration. The idea behind this category of measurements is that the structure of a program and its complexity are related. Typically, the structure of a program is assessed by constructing a control-flow graph (CFG) and/or a data-flow graph of the program, then mapping the graph structure to a numeric value that is believed to be indicative of the program's complexity. The most common form of this measurement is McCabe's cyclomatic complexity (McCabe, 1976), which is computed by counting the number of edges and nodes in the CFG of a program.
The cyclomatic complexity of a program, V(G), is defined as: V(G) = e − n + 2p, where e is the number of edges, n is the number of nodes, and p is the number of connected components (usually 1) in the control-flow graph of the program. McCabe reported (McCabe, 1976) that, based on his empirical studies, program modules with cyclomatic numbers higher than 10 were problematic for testing and maintenance tasks. However, a counterexample to McCabe's model can easily be constructed. For example, consider a large program consisting only of a long sequence of simple if statements. The cyclomatic number of such a module is very high, yet the program is generally not considered complex. For this reason, variations of McCabe's cyclomatic complexity were proposed that take into account the data-flow graph (DFG) of a program as well as its CFG. Oviedo's complexity measure combines a weighted sum of a control-flow metric and a data-flow metric (Oviedo, 1980). Oviedo's metric C is computed as: C = aCf + bDf, where Cf is the total number of edges in the program's CFG, Df is the sum of the data-flow complexity of each block in the program (the number of variables referenced but not defined in a block), and a and b are weighting factors that can be assumed to be 1. Alternatives to McCabe's and Oviedo's metrics focus on more in-the-large characteristics of programs. Typically, large programs consist of many "modules". A module can be based either on the features provided by the programming language or on a convention defined by the designers of the program. In such cases, the number of interactions between modules, or the modules' interdependencies, are used as a basis for program structuredness. Two common examples are the fan-in and fan-out counts. Fan-in is the number of connections into a module, whereas fan-out is the number of connections from a module to other modules. Process quality metrics focus on assessing the effectiveness of the activities of a process. These measurements fall into two categories: rates of change and classifications. In the rate-of-change category, the most common measurements belong to the detection-rate family, which comprises metrics that measure the effectiveness of testing and checking techniques (Kitchenham, 1990).
Common examples of this family include:

• total number of faults found during a code inspection and review normalized with respect to the number of lines of code,

• the average number of faults detected per test case scenario, and

• the percentage of total faults found during a particular checking or testing activity.

In the classifications category, the objective is to identify a number of attributes for a fault. For example, a fault may be associated with attributes such as "phase found", "source of fault", and
"priority". An example of this type of classification and quality model is orthogonal defect classification (ODC), developed by Chillarege et al. (Chillarege et al., 1992). The idea behind ODC is that by classifying faults, observations and inferences can be made about the test and development processes. For example, faults can be classified based on where in the development cycle they were found (phase-found), the impact the fault had on the overall quality of the product, the stage of the process where the fault needed to be fixed (e.g., design, code), and the criticality or importance of the fault. The software quality literature offers a large number of measurements. However, the proposed measurements have some inherent flaws. These measurements are intended to provide insights into specific attributes of a product or process. For example, McCabe's complexity value is intended to measure the "maintainability" of a program, the ratio of LOC to the total number of comment lines in a program is intended to determine the "understandability" of a program, while the ratio of faults found in a system test cycle to the total number of faults found is intended to measure the "effectiveness" of the system test processes. However, these intentions are difficult to validate: the relationship between a metric and the concept it seeks to measure is complex, and it is difficult to establish correlations between the two. Furthermore, even if a correlation is shown to exist, it is difficult to show that the correlation is absolute. For example, it is very difficult to say that a certain percentage of all program modules with a complexity number greater than 10 are not maintainable. Overall, reports on the state of measurement in software engineering practice are not encouraging (Kitchenham and Pfleeger, 1996; Fenton and Pfleeger, 1997).
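The measurements surveyed in this section can be sketched directly from their definitions. The sketch below is illustrative only: the sample program, the graph sizes, the C-style `//` comment convention, and the fault records are all invented.

```python
from collections import Counter

def classify_loc(source: str) -> dict:
    """Naive LOC classifier: blank, comment, or program lines."""
    counts = {"blank": 0, "comment": 0, "program": 0}
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            counts["blank"] += 1
        elif stripped.startswith("//"):
            counts["comment"] += 1
        else:
            counts["program"] += 1
    return counts

def cyclomatic(edges: int, nodes: int, components: int = 1) -> int:
    """McCabe's cyclomatic complexity: V(G) = e - n + 2p."""
    return edges - nodes + 2 * components

def oviedo(cf: int, df: int, a: int = 1, b: int = 1) -> int:
    """Oviedo's metric: C = a*Cf + b*Df, with weights a and b."""
    return a * cf + b * df

program = "int main() {\n// entry point\n\nreturn 0;\n}\n"
print(classify_loc(program))  # → {'blank': 1, 'comment': 1, 'program': 3}

# A toy CFG with 9 edges and 7 nodes: V(G) = 9 - 7 + 2 = 4.
print(cyclomatic(9, 7))       # → 4
print(oviedo(cf=9, df=5))     # → 14

# ODC-style classification: tally hypothetical fault records by attribute.
faults = [
    {"phase_found": "unit test",   "fix_stage": "code"},
    {"phase_found": "system test", "fix_stage": "design"},
    {"phase_found": "unit test",   "fix_stage": "code"},
]
by_phase = Counter(f["phase_found"] for f in faults)
print(dict(by_phase))  # → {'unit test': 2, 'system test': 1}
```

Note how the validation problem discussed above shows up even in this toy: adding lines to `program` changes the LOC counts but leaves `cyclomatic` untouched, so neither metric alone captures "complexity".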
Quality measurements have not been widely deployed due to the nature of quality, the subjectivity of the measurements, the lack of empirical evidence for the effectiveness of measurements, and the cost associated with quality measurements. However, in order to improve software quality, it must be measured. Even if a measurement is considered subjective, it should be improved, not put aside. One advantage of measuring a concept such as quality is to gain a better understanding of it. As a team gains more understanding of the attributes of quality, its measurement processes must improve to reflect this change in understanding. Hence, to improve quality, it is necessary to measure quality, improve our understanding of quality, and improve the measurement processes and associated metrics. In
order to complete our overview of quality, the next section will look at some quality improvement frameworks and their relationship to software quality improvement.
2.3
Quality improvement
As discussed earlier in this chapter, a definition of quality is needed to facilitate its measurement and, consequently, its improvement. Each of the quality experts introduced in Section 2.1 proposed a quality improvement framework based on their own definition of quality. Each of these frameworks proposes a set of guidelines and actions for achieving quality goals. These frameworks have influenced various industries, including software development, and have had some degree of success. In this section, we will look at these efforts to identify their key ingredients, differences, and shortcomings. Deming's approach (Deming, 1986), Plan-Do-Check-Act/Analyze (PDCA), is based on the work of Shewhart (Shewhart, 1931) and focuses on the application of statistical process control (SPC). The objective of Deming's approach is to improve the production line based on a feedback loop. The four steps of Deming's PDCA cycle are:
• Plan: define the goals, identify available data, and set up measurement criteria; • Do: execute the plan (on a small scale), search for the data identified in planning step, and perform the measurements; • Check: observe the effects of the plan, and check the measurements against the targets; and • Act (or analysis): study the results, make projections (or predictions), and take necessary actions.
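The Check step's comparison of measurements against targets can be sketched in a few lines. The metrics, targets, and measured values below are invented for illustration.

```python
# Hypothetical targets and measurements for the Check step of PDCA:
# flag any metric whose measured value misses its target.
targets      = {"defect_rate": 0.02, "test_coverage": 0.80}
measurements = {"defect_rate": 0.05, "test_coverage": 0.85}

# For defect_rate lower is better; for test_coverage higher is better.
higher_is_better = {"defect_rate": False, "test_coverage": True}

def misses(metric: str) -> bool:
    """True if the measured value fails to meet its target."""
    if higher_is_better[metric]:
        return measurements[metric] < targets[metric]
    return measurements[metric] > targets[metric]

deviations = [m for m in targets if misses(m)]
print(deviations)  # → ['defect_rate']
```

The list of deviations is what feeds the Act step: each flagged metric triggers analysis, projection, and a corrective action in the next turn of the cycle.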
Deming also proposed a fourteen-point program for quality management. Among these are: leadership of management to set out goals and plans, involvement of the entire team (not just the management), removal of dependency on testing and inspections, and continual improvement of [development] processes. Deming strongly objected to a focus on testing (or mass inspection).
He argued that quality must be built into the product by improving the development processes – reworking non-conformances is expensive and unnecessary. Process improvements are achieved by making predictions, defining measurements, and checking them against targets. Juran instituted the notion of "fitness for purpose" (Juran, 1979). His focus is on conformance to customer satisfaction as opposed to specification. Juran's ideas resonate with the problems of software development: due to frequent specification changes, the resulting products generally do not meet customers' expectations. The ideas of Juran are complementary to Deming's. Juran's quality improvement framework is based on building awareness of the need and opportunity for improvement: setting improvement goals, carrying out projects, reporting progress, communicating results, and holding regular, structured quality improvement meetings as part of the company's regular processes. The structure of quality improvement is that the team must identify symptoms of failures (or shortcomings), determine candidate causes, validate each candidate, and devise a solution. Juran stated that survival and growth depend on breakthrough performances: the status quo is never enough. The two frameworks, Deming's and Juran's, concur on the importance of feedback loops, data collection, and analysis. Crosby focuses on achieving "zero defects" (Crosby, 1979). His notion of quality assumes the existence of a formal, binding requirement specification, which equates conformance to requirements with zero defects. Crosby stated that to achieve zero defects an organization must go through five maturing stages.
1. Uncertainty: At this stage quality is neither defined nor understood, problems are fought as they occur, and there is no formal improvement program.

2. Awakening: At this stage the importance of quality is recognized, teams are set up to attack problems, and there is a focus on short-range improvement efforts.

3. Enlightenment: This stage is identified by management support, the creation of quality departments, and the establishment of corrective action.

4. Wisdom: At this stage management becomes involved in quality issues, quality managers are assigned, and problems are identified early in the development phase.
CHAPTER 2. QUALITY OVERVIEW
5. Certainty: At the highest level quality management becomes an essential part of the company and prevention is the main concern.

The main pillars of Crosby's prevention mechanism are the identification of current problems, taking action to correct them, and determining the potential problems that may arise. A weakness of Crosby's framework is his view of quality as absolute. In contrast to his approach, Akao, working for the Japanese auto-maker Toyota, developed Quality Function Deployment (QFD) to address the issue of meeting customers' requirements (Akao, 1990). The philosophy behind QFD is that zero defects reduce negative quality but do not create positive quality. Positive quality is the value that makes customers choose one product over another (in conjunction with zero defects). In other words, as Juran argued, conformance to requirements is necessary but not sufficient. To achieve positive quality, Akao devised a matrix of customer requirements, both expressed and expected, versus the technical product specification. The focus of QFD is on improving communication with customers, recognizing the multi-dimensionality of customers' needs, reinforcing usability concerns, and prioritizing the features and functions to be included in the product. The concept of QFD, sometimes called The House of Quality, has been successfully applied in Japanese manufacturing (Hauser and Clausing, 1988) as well as software development (Zultner, 1992). The above-mentioned frameworks can be considered the influencing factors for Total Quality Management (TQM) (Feigenbaum, 1991; Kitchenham, 1990). Originally the term was used to describe the Naval Air Systems Command's management approach to quality improvement. Over the years, TQM has evolved and now means different things to different people.
However, at its core, TQM can be characterized by three pillars:

• Customer focus: meeting the customer requirements (both internal and external);

• Long-term plans: knowing where we are going and what we need to get there; and

• Total participation: everyone is responsible for quality.

TQM also focuses on improving processes in order to build quality into the product. Furthermore, TQM places the responsibility for a process primarily with the process owner, i.e., everyone must
take part in quality improvement efforts. This is in contrast to approaches in which a dedicated team of quality specialists is responsible for quality control. Several variations of TQM have been successfully applied by companies such as HP (Total Quality Control), Motorola (Six Sigma Strategy), and IBM (Market Driven Quality). The TQM focus on process can be linked to two sets of quality standards for process conformance: the International Organization for Standardization's ISO 9000 and the Software Engineering Institute (SEI) Capability Maturity Model (CMM). ISO 9000 emphasizes a Quality Management System (QMS) that a supplier employs to ensure consistent and reliable conformance to customer requirements. The definition of a QMS lists five components: organizational structure, responsibilities, procedures, processes, and resources. An organization seeking ISO 9000 certification must be audited by an external auditor. In short, the ISO 9000 motto can be expressed as "say what you do, do what you say, prove it, and improve it." The SEI CMM (Ahern et al., 2001; Paulk et al., 1993), in contrast to ISO 9000, places more emphasis on the maturity of the development processes. The philosophy of CMM is that as processes mature, the resulting products become more stable. The ideas behind CMM are similar to Crosby's maturity levels (Crosby, 1979): each level in CMM represents a logical continuation of the previous one.
The five maturity levels of CMM are:

• Initial: at this level software development processes are mostly ad hoc;

• Repeatable: at this level some processes are established and can be repeated;

• Defined: at this level software processes are documented, standardized, and integrated across the entire organization;

• Managed: at this level product and process efficiency are determined through systematic measurement;

• Optimizing: at the highest level, a continuous improvement process is established via defect prevention, technology innovation, and process change management.

Both the CMM and ISO 9000 standards have come under criticism for their focus on processes.
Critics argue that maturity of processes guarantees only a mediocre level of quality. For this reason, both CMM and the ISO standards have renewed their emphasis on the improvement cycle. There are several quality improvement programs for software development. Some adapt an existing quality improvement framework to software development: for example, Pittman et al. proposed an adaptation of Deming's PDCA cycle (Pittman and Russell, 1998), and Schulmeyer (Schulmeyer, 1990) adapted the concept of "zero defects." Other improvement programs were developed specifically for software engineering. One of the more innovative is the result of the work of the Software Engineering Laboratory (SEL) at the University of Maryland (Basili et al., 1995; Basili and Caldiera, 1995). What distinguishes this approach, the Quality Improvement Paradigm (QIP), from others is its emphasis on goal-oriented measurement, learning through experimentation, and continual improvement. The QIP can be viewed as a three-step approach:

• Planning: identifying the current project characteristics, selecting a process model, and setting quantifiable goals;

• Execution: carrying out the project tasks and procedures, constructing the product, collecting and validating the prescribed data, and performing analysis for corrective feedback;

• Analysis/Packaging: a post mortem analysis of the gathered data against the set-out goals, identifying problems, and packaging the experience and other forms of structured knowledge gained for future projects.

At the heart of the SEL approach is the Goal/Question/Metric (GQM) paradigm: a novel approach to setting quantifiable goals. GQM is a hierarchy that starts with a business goal specifying the purpose of the measurement, a set of questions that provide an operational definition for the goal, and a set of metrics and data collection methods.
To summarize, the QIP views a software development endeavor as an experiment: it requires setting goals for evaluating processes and products, learning from the experiment, and communicating the learning experience to other projects. In this chapter, we discussed a number of issues involving quality: its definition, measurement,
and improvement. What one can deduce from this discussion is that quality is an issue close to every project, yet hard to define, quantify, and improve. The next section summarizes the highlights of this chapter's quality discussion and relates them to the concepts of this dissertation.
2.4
Summary
The focus of this chapter was quality. It was shown that the problems surrounding quality are not unique to software development: quality is hard to define, difficult to measure, yet often easy to recognize. In the case of software, due to its soft nature, defining and measuring quality is even more challenging. The objective of this dissertation, as outlined in Chapter 1, is to aid managers in releasing quality software products. To do so, three problems must be addressed:

• Quality definition: Quality cannot be universally defined; instead, it must be defined locally. Quality, as shown in section 2.1, is multi-dimensional and subject to constraints and compromises. The standard definitions are too general to be adopted by every software development organization. Hence, an organization must define quality based on its overall organizational goals, constraints, and possible compromises. Furthermore, these definitions must take into account the multiplicity of concerns of the stakeholders: customers, developers, and managers.

• Quality measurement: Measuring software quality has not been the norm in the industry (Kitchenham and Pfleeger, 1996). Common objections to existing measurements include the lack of (independently verified) experimental data, the subjectivity of metrics, and cost (Gillies, 1992). To remedy these shortcomings, the focus must be shifted from process-centric to data-centric measurement. Since the measurements are subjective, they will be subject to repeated modification: as the organization learns about what it is measuring, it improves its measurement processes. Measurement programs are designed by focusing on defining what to measure, then gathering the necessary data and carrying out the measurement. The expensive part of the measurement, in this approach, is the data gathering
and storage. By focusing on identifying data sources and integrating them, the cost of data gathering and storage can be amortized over multiple releases.

• Quality improvement: As discussed in section 2.3, quality is not absolute and must continually improve. In fact, quality must be viewed as continuously improving rather than as a fixed attribute of a product or a process. This applies to the products, the development processes, the quality definitions, and the quality measurement processes. Key ingredients of continual improvement are:

– Management leadership: Management must define quality, the organizational quality goals, the means of measuring quality, the quality predictions, and the means of monitoring quality throughout the development cycle.

– Total participation: Each member of the organization has their own set of objectives and concerns, which must be defined in light of the overall organizational objectives. Moreover, as development progresses, the monitoring processes must inform every member with regard to their concerns and the overall picture.

– Continuous learning: Decisions are made about future concerns. The decision-making process involves two steps. First, looking forward: this requires mathematical models to determine future outcomes, identify problem areas, and determine possible corrective actions. Second, looking backward: this requires identifying patterns and linking seemingly unconnected events to facilitate the forward-looking step. In other words, to move forward, an organization must look back and leverage what was learned previously (Einhorn and Hogarth, 1999). This implies that learning must be continual. Every measurement made adds to the organization's memory and, if encapsulated properly, can benefit the decision-making processes (Basili and Caldiera, 1995).

To conclude, each software product development has its own unique characteristics, objectives, and constraints.
An organization cannot accept a development practice until it has been measured and proven to work in its environment (Grady, 1992). Chapter 3 presents a framework for locally defining quality and its constraints.
Chapter 3
A multidimensional management framework

Chapter 2 discussed the issues surrounding quality. It showed that there is no universal definition of quality: an organization must define quality locally, taking into account the multiplicity of its stakeholders' concerns. As the organization receives feedback from its stakeholders, it must improve both quality and its local definition. Hollenback et al. (1997), in their study, found that an organization improves what it measures; as a result, the local definition of quality must include a means of measuring it. Lastly, as Chapter 2 outlined, quality is subject to constraints and compromises. When defining quality, an organization must take into account the constraints and compromises it needs to address. This chapter builds on the concepts presented in Chapter 2. Sections 3.1-3.3 provide a novel approach to defining, measuring, and predicting quality based on proven technologies. Section 3.4 then provides a multi-dimensional framework for managing quality in light of its constraints: progress and (code-base) health.
3.1
Quality definitions
The literature offers several approaches that can be used for defining quality. This dissertation uses two of them: the Goal/Question/Metric (GQM) paradigm developed by Basili et al. (Marciniak, 1994; Basili et al., 1995) and the non-functional requirement specification proposed by Mylopoulos et al. (Mylopoulos et al., 1992). The two approaches were assessed by Bonifati et al. (2001) for designing data marts, a problem similar to the quality definition problem addressed in this dissertation. Bonifati et al. found the two approaches comparable and interchangeable for their task and selected GQM. This dissertation uses the two approaches in conjunction. The GQM paradigm was developed by Basili et al. at the University of Maryland (Marciniak, 1994; Basili et al., 1995); its main application is the systematic adoption of software measurement programs. Similar to the approach taken by this dissertation, GQM focuses on the local aspects of a measurement program, reflecting the view that there is no universal measurement program. The GQM process begins by identifying a set of conceptual, high-level goals that relate closely to the overall business objectives (e.g., at the enterprise level). Each goal is then refined by a set of questions that determine how to achieve it. Finally, at the lowest level, each question is associated with a set of metrics that provide a quantitative answer. Figure 3.1 shows the structure of a GQM hierarchy. Mylopoulos et al. (1992) proposed an alternative means of specifying non-functional requirements (such as quality). They defined a non-functional requirement as a hierarchy of goals related through different types of AND/OR links, positive/negative influences, and goal inclusion. This approach, like GQM, starts by defining a top-level goal and refining it into subgoals. The overall structure of these goals resembles an AND/OR tree.
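To make the structure concrete, the sketch below encodes a small GQM hierarchy in Python. The goal, questions, and metrics are illustrative inventions, not taken from the dissertation's case study, and the class names are assumptions for this example only.

```python
from dataclasses import dataclass, field

@dataclass
class Metric:
    name: str
    collect: str  # description of the data-collection method

@dataclass
class Question:
    text: str
    metrics: list = field(default_factory=list)

@dataclass
class Goal:
    purpose: str
    questions: list = field(default_factory=list)

# A hypothetical GQM hierarchy: one business goal, refined by questions,
# each answered by one or more quantitative metrics.
goal = Goal(
    purpose="Improve release quality",
    questions=[
        Question("How many defects escape system test?",
                 [Metric("escaped-defect count",
                         "count field-reported defects per release")]),
        Question("Is test coverage adequate?",
                 [Metric("scenario completion",
                         "percentage of planned test scenarios executed")]),
    ],
)

# Walk the hierarchy: goal -> questions -> metrics
for q in goal.questions:
    for m in q.metrics:
        print(goal.purpose, "|", q.text, "|", m.name)
```

The same nesting pattern carries over to the AND/OR goal trees discussed next; only the combination rules at the inner nodes differ.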
This approach also provides a mechanism for specifying external influences that positively or negatively affect a subgoal. Figure 3.2 shows an example of this approach. This chapter proposes a new approach for defining quality that combines the two approaches mentioned above. A definition begins from a high-level quality objective and is systematically broken down based on organizational structure (multiplicity of concerns) and, if available, work breakdown and
Figure 3.1: An example of a Goal/Question/Metric hierarchy
Figure 3.2: An example AND/OR tree with positive/negative influences. The two leftmost goals below the root are connected via an OR link. The leftmost goal also has a negative (-) external influence. The rightmost goal below the root has a positive (+) external influence.
product breakdown structures (Project Management Institute, 2000). The overall structure is a hierarchy with a goal at its root. This goal is broken down into sub-goals repeatedly until a set of metrics for evaluating the subgoals is identified. The hierarchical nature of this structure enables capturing the multiplicity of concerns, i.e., a goal is defined by combining its sub-goals. Figure 3.3 shows an example of this approach. To enable the specification of alternatives, sub-goals may form a disjunction using an OR link. For example, the quality factors for a new software product may indicate that the product must have a good user interface or a good set of user manuals. Figure 3.3 shows two sub-goals of the root, sub-goals 1 and 2, grouped in a disjunction; this is indicated by an arc crossing their links. Furthermore, we enhance our model using external influences as suggested by Mylopoulos et al. (1992). External
Figure 3.3: A hierarchical definition with associated metrics
influences can positively or negatively affect any non-leaf node of the tree, i.e., a sub-goal. This mechanism enables capturing dynamic changes to the model: events that take place after the construction of the model. Using external influences, the original definition remains intact in the face of unforeseen changes. For example, from time to time, due to business objectives, the specified definition may be modified, resulting in one or more sub-goals being removed or their role strengthened or weakened. In such cases, we can use an external influence to model the change.
3.2
Quality evaluation
The leaf nodes of the hierarchy are the actual metrics for evaluating the parent subgoal. In other words, there is a numeric value associated with each leaf node. These metrics can be combined to provide a value for their parent goal or sub-goal, all the way up to the root of the hierarchy. The inner nodes of the hierarchy provide both a conceptual model for the root goal and an accounting mechanism for nodes at the lower levels. The hierarchy thus represents not only the local definition of the root goal, in this case quality, but also a framework for associating a numeric value with that goal. In one extreme case, the value of the leaf nodes can be binary. In this case, the value of the immediate parent sub-goal is the logical evaluation of the conjunction and/or disjunction of the leaf
Figure 3.4: Evaluating subgoals of a hierarchy
nodes. For example, in Figure 3.4, if V(Gi) is the value of subgoal Gi and the value of each leaf node (i.e., metric) mij ∈ {true, false}, we have:

V(G0) = m01 ∧ m02
V(G1) = m11 ∨ m12
V(G2) = V(G21) ∧ V(G22)
V(G3) = V(G31) ∨ V(G32)
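As a sketch of this binary evaluation, the following Python fragment evaluates an AND/OR goal tree whose leaves are boolean metrics. The dictionary encoding is an assumption for illustration, not a structure prescribed by the dissertation.

```python
# Binary evaluation of a goal hierarchy: AND-nodes are conjunctions of
# their children, OR-nodes are disjunctions, leaves are boolean metrics.

def evaluate(node):
    if node["type"] == "metric":
        return node["value"]
    child_values = [evaluate(c) for c in node["children"]]
    if node["type"] == "and":
        return all(child_values)
    return any(child_values)  # an "or" node

# Two subgoals in the style of Figure 3.4, with illustrative metric values.
g0 = {"type": "and", "children": [{"type": "metric", "value": True},    # m01
                                  {"type": "metric", "value": True}]}   # m02
g1 = {"type": "or",  "children": [{"type": "metric", "value": False},   # m11
                                  {"type": "metric", "value": True}]}   # m12

print(evaluate(g0))  # True: m01 AND m02
print(evaluate(g1))  # True: m11 OR m12
```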
For monitoring and projection purposes, binary values are insufficient: they allow us neither to specify tolerances nor to estimate future values of quality in evolving systems. As a result, we normalize the values to a range [l, u] (e.g., 0 to 100): l ≤ mij ≤ u. The values associated with the inner nodes of the hierarchy can then be interpreted as the progress made towards achieving that subgoal. A simple means of evaluation uses fuzzy logic. In other words, in the previous example the values
are computed as

V(G0) = min(m01, m02)
V(G1) = max(m11, m12)
V(G2) = min(V(G21), V(G22))
V(G3) = max(V(G31), V(G32))
The fuzzy logic evaluation of conjunction is too pessimistic for the purposes of this dissertation. To remedy this, we replace the min function with the average function. In the previous example, the values are computed as

V(G0) = (m01 + m02)/2
V(G2) = (V(G21) + V(G22))/2
The value of multiple subgoals can similarly be defined. Given a goal with several subgoals, first we evaluate the disjunctions followed by conjunctions.
Figure 3.5: An example goal hierarchy
The purpose of using disjunction in our model is to accommodate deviations, and in the proposed model the evaluation of disjunction takes precedence. For example, in many development projects code review is mandatory. However, from time to time, due to the simplicity of the code, the review may be waived. In such cases, we need to differentiate between developers waiving the review and simply forgetting it. To model this case, we create a subgoal "assessment" which is the disjunction of two subgoals, "code review" and "justification": to complete the assessment, either a code review must be carried out or a justification presented. To demonstrate the concepts presented so far, consider the hypothetical quality definition depicted in Figure 3.5. Looking at the subgoal stress test, its definition implies that to achieve this goal the subgoal scenario1 must complete and at least one of the subgoals scenario2 or scenario3 must also complete. Each of the specified subgoals has metrics associated with it: numbers reported by testers indicating successful execution of an instance of the scenario. Assuming completion means achieving 80 percent of the upper limit, and given the values for the various executions, we can compute the value of the subgoal stress test as follows:

V(scenario1)   = 1/3(V(execution1) + V(execution2) + V(execution3))
               = 1/3(45 + 55 + 65) = 55
V(scenario2)   = 1/2(V(execution4) + V(execution5))
               = 1/2(35 + 65) = 50
V(scenario3)   = V(execution6) = 75
V(stress test) = 1/2(V(scenario1) + max(V(scenario2), V(scenario3)))
               = 1/2(55 + 75) = 65
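The computation above can be sketched in Python: conjunctions are averaged and disjunctions take the maximum, as defined earlier. The tuple encoding of nodes is an illustrative assumption.

```python
# Numeric evaluation of a goal hierarchy: conjunctions are averaged
# (replacing the pessimistic fuzzy min), disjunctions take the maximum.

def value(node):
    if isinstance(node, (int, float)):      # a leaf metric
        return node
    op, children = node
    vals = [value(c) for c in children]
    if op == "and":
        return sum(vals) / len(vals)        # average for conjunction
    return max(vals)                        # fuzzy max for disjunction

# The stress test of Figure 3.5: scenario1 AND (scenario2 OR scenario3)
scenario1 = ("and", [45, 55, 65])           # executions 1-3 -> 55
scenario2 = ("and", [35, 65])               # executions 4-5 -> 50
scenario3 = 75                              # execution 6
stress_test = ("and", [scenario1, ("or", [scenario2, scenario3])])

print(value(stress_test))  # 65.0
```

The disjunction is evaluated first (max of 50 and 75), then averaged with scenario1, matching the hand computation above.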
The disjunction of subgoals provides a means of modeling anticipated changes. In some cases, however, one may not know in advance the types of changes that may take place. For example, in the previous example a manager may, due to schedule pressures, decide to waive the assessment subgoal altogether. One solution is to modify the model; however, this approach loses the memory of the original model. Instead, to model such cases, we use external influences.
The evaluation function can now be defined as

VE(G) = V(G) + EG   if V(G) + EG < u
VE(G) = u           otherwise

where EG is the (positive) external influence on the subgoal G: the value of a subgoal cannot exceed the upper bound u. When the external influence is negative, the evaluation function is defined as

VE(G) = V(G) − EG   if V(G) − EG > l
VE(G) = l           otherwise

so that the value of a subgoal cannot fall below the lower bound l. Using external influences, we can maintain the history of each subgoal in the presence of continual changes: when changes occur, the external influence is increased (or decreased) by the appropriate amount EG, up to the upper limit u (or down to the lower limit l). To demonstrate this computation, consider the subgoal install in Figure 3.5. To compute the value of this goal, three metrics based on executions 7, 8, and 9 are averaged:

V(install) = 1/3(V(execution7) + V(execution8) + V(execution9))
           = 1/3(80 + 70 + 75) = 75

Initially, the objective for the install team was to reach at least 80 percent of the upper limit (i.e., u). However, due to changes in the business environment, management has lowered the expectation to 70 percent. To capture this change, we add a positive external influence to the subgoal install with a value of 10 percent of the upper limit (e.g., 10). The new value of the subgoal install can now be computed as:

VE(install) = V(install) + 10 = 85

In other words, the install subgoal has reached the acceptable level. The simple averaging function assumes that all subgoals contribute equally to the overall goal. However, different sub-goals may not be equally significant: one sub-goal may be more important than another, or one set of metrics more subjective than another. For example, as discussed in Chapter 2, the quality of the user interface for a new software product may be more
important than the quality of its documentation to the overall quality goals. To distinguish the contributions of various subgoals, we associate a weight factor wG with each node G, representing its importance or significance:

wG ≥ 1

The weight of the root is defined as 1. The weighted value of a node G is

Vw(G) = V(G) × wG

The value of a conjunction is defined as the weighted average of its subgoals:

Vw(G2) = (Vw(G21) + Vw(G22))/(w21 + w22)

where wij is the weight of the subgoal Gij. The value of a disjunction is defined as the maximum of the weighted values of its subgoals:

Vw(G3) = max(Vw(G31), Vw(G32))
For example, consider the usability subgoal in Figure 3.5. It consists of two subgoals, manual and on-line help. A customer survey indicates that twice as many customers rely on the on-line help as on the printed manual. To emphasize the importance of on-line help, we therefore assign it a weight factor of 2 and assign the manual subgoal a weight factor of 1. Furthermore, a recent survey indicates that usability has become more critical to the success of the product, and as such management has decided to raise the acceptable level by 10 percent. This increase is captured using a negative influence. The value of the usability goal can now be computed as:

V(usability) = 1/3(V(manual) + 2 × V(on-line help)) − 10
             = 1/3(75 + 2 × 60) − 10 = 55
The weighted average sum assumes that any progress towards accomplishing a goal is significant. However, there are cases where progress is not considered significant unless it reaches a certain threshold. For example, the clarity of a user manual may not be accurately assessed if only a small percentage of the document is written. As a result, we can enhance our summation function by
associating a threshold with each node. In other words, for a given subgoal G we define

Vτ(G) = V(G)   if V(G) > τ
Vτ(G) = 0      otherwise

The range of values of each subgoal is now divided into two regions: acceptable and unacceptable. The value τ defines the threshold of the acceptable region for each sub-goal. For example, regression testing determines whether previously implemented functionality still behaves as specified in light of current development. It involves executing a large number of test cases from all previous releases. In some cases, management does not count the achievements of regression testing until it has reached a certain threshold. For example, in Figure 3.5, management may not consider regression testing until it completes 60 percent of its upper limit. Given the current value of regression test as indicated in the figure, we can compute the quality value as:

V(regression)  = 0 since 45 < 60
V(system test) = 1/2(V(stress test) + V(install)) = 75
V(quality)     = 1/3(V(regression) + V(system test) + V(usability))
               = 1/3(0 + 75 + 55) ≈ 43
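The threshold function can be sketched as follows; the helper name is an illustrative assumption, and the values come from the regression-test example above.

```python
def thresholded(v, tau):
    # a subgoal contributes nothing until its value exceeds the threshold
    return v if v > tau else 0

print(thresholded(45, 60))  # 0:  regression test is below its 60% threshold
print(thresholded(75, 60))  # 75: a subgoal above the threshold counts in full
```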
In summary, depending on the application at hand, one can use one or more of the evaluation functions. If there is no evidence or heuristic at hand to determine the weight of a goal, all weights can be set to 1. As more data about the project accumulates, the weights can be adjusted accordingly. Similarly, threshold values can be added to nodes if there is agreement among the management team; otherwise, the threshold value should equal the lower bound l. Furthermore, the evaluation functions presented here may be redefined. The weighted average used in this dissertation was chosen for its simplicity and practicality in the case study described later; as more evidence is gathered, other functions may prove more appropriate. Finally, to monitor and predict quality values, we need to determine acceptable and planned levels against which to compare the values of each subgoal. The following section presents a framework for monitoring and predicting quality values based on the definitions provided in this section.
3.3
Quality monitoring
The first step in quality management is to define quality locally, based on enterprise objectives and concerns, using the method described in the previous section. The next step is to build a monitoring framework for determining the value of quality at different stages of the project and for predicting its future direction. Based on the local quality definition, the start time, and the planned completion time of the product development schedule, we can define a line connecting the initial quality value q0 at the beginning of the project at time t0 with the target quality value qc at the planned completion time tc, where tc > t0. This line represents the ideal line of quality. Ideally, at any point during the development of the software product the value of quality should not fall below the ideal line; otherwise, the product is at risk of not meeting its quality objectives. Figure 3.6 depicts the ideal line of quality and the risk region. We will refer to this diagram as the quality plan.
Figure 3.6: An example quality plan
As noted earlier, our quality management system is based on a data warehousing solution. A data warehouse, unlike a transactional database, must be regularly refreshed, i.e., new
data is added to the data warehouse but old data is not deleted. The quality monitoring framework involves evaluating the overall quality value, based on the quality definition hierarchy, at regular time intervals. Every time the data warehouse is refreshed, the overall value of quality is computed from the local definition. Repeating this process results in a set of ordered pairs (qi, ti), where qi is the overall value of quality at time ti. Connecting the points q0, q1, q2, ... yields the real quality curve, depicting the progression of quality values throughout the product development cycle. In other words, the real quality curve represents the value of quality at different stages of the product development. Furthermore, the predicted line of quality at any given point in time ti is defined as the straight line connecting the points q0 and qi: it projects the value of quality at a time tj > ti based on the past performance at points q0 and qi. We define the quality risk factor as the distance between the ideal line of quality and the predicted line of quality. Figure 3.7 shows an example of the monitoring framework's curves. As time passes, the value of quality changes; we can capture the change of the quality value q with respect to time t. More formally, if q0, qc, and qi represent the quality values at the start of the project t0, at the completion time tc, and at a given time ti respectively, such that t0 < ti < tc, then the ideal line of quality can be expressed by its parametric equation

Li: (t − t0)/(tc − t0) = (q − q0)/(qc − q0)

and, similarly, the predicted line of quality at ti as

Lp: (t − t0)/(ti − t0) = (q − q0)/(qi − q0)
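The two lines can be sketched in Python by solving each parametric equation for q at a given time t; the project numbers below are invented for illustration.

```python
def line(t, t0, q0, t1, q1):
    # quality value at time t on the line through (t0, q0) and (t1, q1)
    return q0 + (q1 - q0) * (t - t0) / (t1 - t0)

# Illustrative quality plan: plan (t0, q0) -> (tc, qc), observation (ti, qi).
t0, q0 = 0, 10      # project start and initial quality value
tc, qc = 10, 90     # planned completion time and target quality value
ti, qi = 4, 30      # observed quality at an intermediate point

ideal_at_tc = line(tc, t0, q0, tc, qc)        # the target itself: 90.0
predicted_at_tc = line(tc, t0, q0, ti, qi)    # projection from (t0,q0),(ti,qi)
risk_factor = ideal_at_tc - predicted_at_tc   # gap at completion time

print(predicted_at_tc, risk_factor)  # 60.0 30.0
```

Here the predicted line falls well short of the target at tc, signalling the kind of adjustment, to schedule, expectations, or resources, discussed next.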
There are two key points to stress here. First, as can be seen from the predicted line of quality in Figure 3.7, the quality goals must be adjusted based on performance since the start of product development: either the completion time must be pushed back or the quality expectations lowered. Alternatively, one can argue that more resources must be added to the project to raise the slope of the predicted line of quality so that it meets the target quality value at the completion time. Second, if the quality at a given time is far below the expected value, a
Figure 3.7: An example of quality monitoring
tactical analysis can be initiated by drilling down into the definition of quality and identifying the sub-goal(s) that have not made their proper contribution to the overall goal. Once the offending sub-goal(s) are identified, solutions can be devised and their effects closely monitored. The process of monitoring quality, predicting its future values, and resolving possible problems is a classical resource management problem that every project manager faces. A project's scope, as defined in the Project Management Body of Knowledge (Project Management Institute, 2000), comprises time, functionality, and cost. At any given point, only two of these three attributes can be optimized: a project is completed either on time and within budget (with reduced functionality), on time with full functionality (with perhaps a larger budget), or within budget with full functionality (perhaps at a later time). Recent studies have shown a direct correlation between program health, the quality of the software product, and the functionality that can be added to the product (Lehman, 1998b; Eick et al., 2000). In the next section, we elaborate on our monitoring framework to take these constraints into account.
3.4 A comprehensive framework
In the previous section, a quality monitoring framework was introduced. However, as discussed in Chapter 2, the notion of quality is subject to constraints and compromises. One of the key constraints in software product development is progress. A product development typically begins by defining a set of new features that must be added to the previously released version of the product. Each feature must go through the processes of requirement specification, impact analysis, high-level and low-level design, change planning, coding, unit testing, and reviews, and finally be included in the code-base. The quality of a software product is dependent on the development of these features. The inter-dependence of quality and progress is a familiar issue. As the product release date slips, either some of the planned features are dropped (i.e., the progress targets are modified) or the quality objectives are compromised (e.g., by shortening the test cycles). The next issue, though a significant constraint, has received little attention on its own: the health of the existing code-base. The new development cycle, as discussed in Chapter 1, must work on the existing code-base. As the code-base evolves through releases, it begins to decay (Eick et al., 2000). It has been shown that as a code-base evolves, it becomes more and more difficult to add new features (Lehman, 1980; Lehman and Belady, 1985; Lehman, 1998b): initial design documents frequently become obsolete, coding standards are compromised, and people with intimate knowledge of the code-base leave the product development team. Overall, the pressures of progress take their toll on the code-base health. This decay of the code-base, in turn, contributes to a lower quality of the resulting product and thus requires more resources to add new features to the code-base (Lehman, 1998b; Eick et al., 2000). In Chapter 2, two software quality models were presented: Boehm's (Boehm and Brown, 1978) and McCall's (McCall, 1980).
These models include some code-base health characteristics, e.g., modularity, understandability, and reusability. This dissertation divides the various characteristics of product development along three dimensions:
• Quality: meeting customers' demands and expectations
• Progress: adding new features to the product
• Health: the ability to (re-)use the existing code-base

Product development is constrained by deadlines: new features must be added within a specified time period. Furthermore, as pointed out earlier, quality and progress are directly affected by the code-base health. To effectively manage a software product development, one must take into account various concerns along the axes of quality, progress, and health. Each of these axes must be locally defined, their initial and final values estimated and/or computed, and their values regularly monitored. Both progress and health can be defined in the same manner as quality: a hierarchy of goals reflecting progress or health accomplishments.
Figure 3.8: A partial health definition
Typically, a code-base is divided into components (or modules), which are, in turn, divided into files (or other smaller units). We can define a set of goals for files and components, and refine these goals until they can be expressed by a set of metrics. Figure 3.8 shows an example of a code-base health definition. The definition of progress depends heavily on the local processes and procedures. Figure 3.9 shows an excerpt of the progress definition used in the case study that will be presented in Chapter 7. The hierarchy represents a check list in its simplest form, i.e., binary values for each
goal. This model can be enhanced if the organization has in place a means of assessing partial completion of tasks. For example, if a design document specifies five changes to be made to the code-base and three of them are complete, one can specify the value of the coding goal for that specific feature as 60 percent complete. In either case, the values of code-base health and progress can be computed, as with quality, as the weighted sum of the values of sub-goals plus/minus external influences.
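The weighted-sum computation just described can be sketched as a recursive roll-up over the goal hierarchy. The dictionary structure, weights, and names below are illustrative assumptions, not the thesis implementation.

```python
def goal_value(goal):
    """Roll up a goal's value: leaves carry a metric value directly; inner
    goals are the weighted sum of their sub-goals plus/minus an external
    influence term."""
    if "value" in goal:                                   # leaf metric
        return goal["value"]
    total = sum(w * goal_value(sub) for w, sub in goal["subgoals"])
    return total + goal.get("external", 0.0)

# A feature whose design is complete and whose coding goal is 3 of 5
# planned changes done (60 percent), with equal weights on the two sub-goals:
design = {"value": 1.0}
coding = {"value": 3 / 5}
feature = {"subgoals": [(0.5, design), (0.5, coding)], "external": 0.0}
```

Under these assumed weights the feature rolls up to 80 percent complete; the same function serves quality, progress, and health hierarchies alike.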
Figure 3.9: A partial progress definition
Similar to the quality definition, the notion of the ideal line of quality can be extended to an ideal line of health: the line connecting h0, the value of the code-base health at the start of the project, t0, and hc, the value of the code-base health at the planned completion time of the project, tc. The real health curve is defined by a sequence of values h0, h1, h2, h3, ..., where each hi represents the value of code-base health at a given time ti (i.e., when the data warehouse is refreshed). The predicted line of health at time ti is the line connecting the values h0 and hi. The risk region, as with quality, is the area below the ideal health line, and the risk factor is the distance between the predicted and the ideal lines of health. These definitions extend similarly to progress. Figure 3.10 shows an example health plan. The last graph that is needed to complete the software product monitoring framework is the product
Figure 3.10: An example health plan
development graph: a three-dimensional space whose axes represent the quality, health, and progress indicators. The ideal product development is a straight line connecting the two points A = (q0, h0, p0) and B = (qc, hc, pc). More formally, the line Li is specified by:

$$L_i: \quad \frac{q - q_0}{q_c - q_0} \;=\; \frac{h - h_0}{h_c - h_0} \;=\; \frac{p - p_0}{p_c - p_0}$$

Similarly, the real product development is a sequence of line segments connecting the points (q0, h0, p0), (q1, h1, p1), ..., (qi, hi, pi), ..., where qi is the quality at time ti, hi is the health value at ti, pi is the value of progress at ti, and t0 < t1 < t2 < ... < ti < .... Lastly, the predicted product development at time ti is the line defined by the two points (q0, h0, p0) and (qi, hi, pi). The risk factor is the distance between a point on the predicted product development line and the ideal product development line. Figure 3.11 shows an example product plan. To conclude, software product management requires local definitions of quality, health, and progress indicators, continual monitoring of these indicators, and the derivation of a predictive model. Should
Figure 3.11: In a product development plan, the values Lq, Lh, and Lp represent the lower bounds for quality, health, and progress respectively. Similarly, Uq, Uh, and Up represent the upper bounds for quality, health, and progress. The values q0, h0, and p0 represent the initial values (i.e., at time t0) for quality, health, and progress respectively, while qc, hc, and pc represent the planned values (i.e., at time tc).
there be a problem, as indicated by the risk factors, we can drill down into the hierarchical definitions and determine the problem areas. To remedy problems, management must make compromises, e.g., reduce the number of features, compromise the health of the code-base, or lower the quality estimates. The difference in such cases is that decisions are based on available data and a full understanding of the deviations from the original development plans.
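The risk factor in the product development graph, the distance from a point on the predicted line to the ideal line, can be computed with standard vector geometry. A minimal sketch (the function name is an invented convenience):

```python
import math

def risk_factor(point, a, b):
    """Distance from a product-state point (q, h, p) to the ideal product
    development line through A = (q0, h0, p0) and B = (qc, hc, pc),
    computed as |AB x AP| / |AB|."""
    ab = [b[i] - a[i] for i in range(3)]
    ap = [point[i] - a[i] for i in range(3)]
    cross = [ab[1] * ap[2] - ab[2] * ap[1],
             ab[2] * ap[0] - ab[0] * ap[2],
             ab[0] * ap[1] - ab[1] * ap[0]]
    return math.sqrt(sum(c * c for c in cross)) / math.sqrt(sum(c * c for c in ab))
```

A project that has advanced along one axis only (say, quality) while the plan assumed balanced gains along all three shows a growing risk factor, even though the quality index alone looks healthy.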
3.5 Summary
This chapter presented a framework for monitoring software product development. The underlying concepts of this framework are locality of definitions, regular evaluation, and continuous improvement. Quality, progress, and health must be locally defined and planned based on their initial and expected final values. As the project progresses, these values are computed and compared against the plan. Based on the current values and predictions, potential problems must be identified and appropriate decisions made. A product development cycle can be replayed based on different definitions for quality, progress, and health. This flexibility is due to our data warehousing solution. Chapter 4 outlines the architecture for a software data warehouse that realizes the potential of the model described in this chapter.
Chapter 4

A software data warehouse

Chapter 1 outlined a new approach to software management using business intelligence (BI). The objective of BI is to deliver the right information to the right person at the right time. A key ingredient in realizing this objective is data warehousing. Data warehousing involves the design and implementation of processes and tools to gather, manage, and deliver complete and accurate information for decision making (McGuff and Kador, 1999; Whitehorn and Whitehorn, 1999). Typically, a large organization collects data that are maintained in diverse databases scattered across geographic locations. As a result, data warehousing is a solution, not an off-the-shelf product. An enterprise, based on its BI needs, must develop a data warehousing solution. This chapter begins with a broad overview of data warehousing and follows with an outline for developing such a solution for software management.
4.1 An overview of data warehousing
A large enterprise consists of many groups, each with a defined set of objectives. Each group collects data to support its day-to-day operations. The data resides in local (operational) databases providing details about the entities of interest to the group. The database operations in such databases are well-defined and can be highly tuned for optimal performance. Such databases are called operational or transactional: the data reflect the real-time state of the entities of concern
CHAPTER 4. A SOFTWARE DATA WAREHOUSE
(Inmon et al., 1998). For example, an operational database may keep data about customers, their orders, and their credit history. The data must be accurate and up-to-date in order to effectively process customers' orders. Operations on such a database may include updates and pre-determined queries for extracting data about customers. At the other end of the spectrum, an enterprise needs to analyze its data, across various functions, to understand, at a higher level, patterns and trends in its business. To support such analytical tasks, an enterprise must access diverse data sources (from various groups) (Chaudhuri et al., 2001). Furthermore, such tasks may be ad hoc, which impacts the performance of databases. For example, an enterprise may be interested in identifying spending patterns of its customers across geographical regions. In such tasks, individual data items and their accuracy are not the primary concern; instead, the focus is placed on patterns of data or their aggregations. Such databases are used for analytical processing and decision support (Chaudhuri et al., 2001). In theory, operational databases should be able to support both transactional and analytical processing. However, various practical issues hinder the use of operational databases for decision support. Using an operational database for decision support tasks can degrade the performance of other query processing: operational databases are typically fine-tuned for a small number of well-defined operations, while decision support queries are ad hoc. Furthermore, the data needed to perform decision support tasks may reside in diverse databases distributed over a network. Similar entities may have varying semantic definitions, attributes may have different typing, or the overall data models may differ. Overall, the data sources can vary greatly in terms of structure, format, semantics, and location.
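The mismatches just listed are typically resolved by a cleansing and transformation step when data is brought together. The sketch below is purely illustrative (field names and formats are invented) and shows the flavor of such a step: unify field naming, normalize a date format, and derive defaults for missing fields.

```python
def clean_record(raw, defaults):
    """Normalize one record extracted from a heterogeneous source."""
    rec = {k.strip().lower(): v for k, v in raw.items()}   # unify field naming
    d = rec.get("date")
    if isinstance(d, str) and "/" in d:                    # unify date format
        month, day, year = d.split("/")
        rec["date"] = f"{year}-{int(month):02d}-{int(day):02d}"
    for field, value in defaults.items():                  # fill missing values
        rec.setdefault(field, value)
    return rec

row = clean_record({"Date ": "5/7/2003", "Owner": "jsmith"}, {"severity": 3})
```

Real integration layers handle many more cases (conflicting keys, unit mismatches, duplicate entities), but the pattern is the same: map each source's idiosyncrasies onto one agreed schema.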
These problems are exacerbated when temporal aspects of data are taken into account. Operational databases maintain the real-time state of the entities of concern; temporal data resides in backup sources or does not exist at all. Hence, an objective of data warehousing is to combine various heterogeneous databases to deliver the needed information to the decision makers (Singh, 1997). Taking into account the difficulties outlined in the previous paragraph, Inmon (Inmon, 1996) defines a data warehouse as a database that supports decision making and has the following four characteristics.
• Subject oriented: The data in a warehouse is organized around the essential entities of the business.
• Integrated: The warehouse enforces consistency in schemas, data types, relationships, naming conventions, and translations.
• Time variant: Data in the warehouse is organized by various time periods.
• Non-volatile: A warehouse is not updated in real time as operational databases are; rather, it is loaded with the existing data from operational databases at regular intervals.
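The time-variant and non-volatile properties imply that each refresh appends a dated snapshot rather than updating rows in place. A minimal sketch, with an invented record structure:

```python
from datetime import date

warehouse = []   # (load_date, record) snapshots; never updated in place

def refresh(operational_rows, load_date):
    """Append the current operational state, preserving all history."""
    for row in operational_rows:
        warehouse.append((load_date, dict(row)))

refresh([{"file": "a.c", "loc": 120}], date(2003, 5, 1))
refresh([{"file": "a.c", "loc": 135}], date(2003, 6, 1))
# Both states of a.c are retained, organized by load date.
```

The operational source holds only the latest state of a.c; the warehouse keeps every loaded state, which is what makes trend analysis possible.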
The above-mentioned characteristics require various types of data in a warehouse: old detailed data, current detailed data, lightly summarized data, highly summarized data, and meta-data. As new detailed data is loaded into the warehouse, the previous detailed data must be aged and eventually purged (Bischoff and Alexander, 1997). However, there must be a purging strategy that keeps the summary data consistent: when detailed data is purged, should its effect on the summarized data remain, or should the summarized data be recomputed without the old data? To facilitate analytic queries, different data summaries are created as the detailed data is loaded into the warehouse. The summarized data is the aggregation of detailed data. Highly summarized data requires more processing and transformation but provides a higher-level view of the data; lightly summarized data requires less processing and transformation but provides a lower-level view. In terms of analytical tasks, a warehouse must provide facilities to drill down from highly summarized data to lightly summarized and detailed data, as well as to move across to other subjects. Figure 4.1 depicts an architecture of a data warehouse. There are various heterogeneous data sources, from which data is extracted and integrated in the middle layer. The extraction process requires tools that are aware of the schema of each particular data source. The integration process requires data transformation, integrity checking, constraint checking, and data cleansing. This process is of great importance to the integrity of the warehouse and involves several steps. A common problem is the semantic differences between various data sources, resolving which may involve unifying the format of data fields and deriving values for fields with missing values. The middle
Figure 4.1: An architecture of a data warehouse
layer, the central data warehouse, stores all the data available for analytical processing, while the outer layer contains the data marts and analytical tools that extract data from the middle layer for their tasks. A key component of a data warehousing solution is meta-data, which provides a dictionary of what data is available and how it can be used. As previously mentioned, a data warehouse is not an off-the-shelf product; rather, it must be designed and developed based on the objectives of an enterprise (Inmon et al., 1998). However, several key constraints are placed on the overall data warehousing solution: open interfaces, scalability, data consistency, and multiple views of data (Bischoff and Alexander, 1997). Furthermore, based on the value it adds, a data warehousing solution must be continually improved. The next section draws parallels between the concepts of enterprise decision support and the issues surrounding software management.
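The levels of summarization discussed earlier, and drilling down between them, can be sketched as aggregations over detailed fact rows (the rows and keys below are invented examples):

```python
from collections import defaultdict

# Detailed facts: (component, week, defects_found)
detailed = [("parser", 1, 4), ("parser", 2, 2),
            ("optimizer", 1, 7), ("optimizer", 2, 1)]

def summarize(rows, key):
    """Aggregate detailed rows under a grouping key."""
    out = defaultdict(int)
    for component, week, n in rows:
        out[key(component, week)] += n
    return dict(out)

light = summarize(detailed, lambda c, w: (c, w))  # lightly summarized: per component, per week
heavy = summarize(detailed, lambda c, w: c)       # highly summarized: per component overall
```

Drilling down means moving from `heavy` back to `light` and, ultimately, to the detailed rows themselves.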
4.2 Software management via data warehousing
As noted earlier, decision support strategies in common business organizations share a few characteristics with software management. A business organization consists of various groups with defined sets of goals. These goals are defined to support the overall objectives of an organization.
Some goals are independent, while others directly or indirectly impact other goals. For example, a business organization may have finance, human resources, and marketing departments. While the finance department seeks to reduce payments to employees, the goal of human resources may be competitive salaries for experienced employees, and the marketing department wants competitive products, which calls for experienced employees.
From this viewpoint, the concern is to evaluate key performance indicators to determine if the organization is performing appropriately. If need be, based on the evaluated indicators, the organization must take corrective action. The main data concern involves temporal aspects of data necessary for identifying trends, not the real-time state of data. Software development organizations share similar characteristics with business organizations. Each group maintains its own data for supporting their goals. Their primary data concern includes real-time accuracy. For example, a requirement document may be organized into a set of use-cases
(Jacobson, 1994). Upon successful implementation of a use-case, developers must mark it as complete so the team is informed of the development progress. Any changes to these documents must be reflected immediately; otherwise, inconsistencies and prolonged development cycles may result. Similarly, a support team may keep customer information in conjunction with the respective defects that were initiated against the product. Their main concern is having an accurate picture of the defect resolution process. In short, software development organizations, like more mainstream business organizations, consist of different groups, each with their own goals, concerns, and supporting data. While the local data provides support for achieving a team's goal, the combination of such local data can shed light on the overall objectives of the organization as a whole. The main obstacle in realizing the parallels drawn here is that software development organizations do not treat their data as a first-class citizen. To effectively meet the objectives of a software development organization, the management must view software as data. The data may not be organized or structured in a manner similar to that of business organizations; nevertheless, it can provide support for the overall objectives. For example, from a programmer's point of view a program is data: it describes their goal, the implementation of a new functionality. Programs change regularly to implement other features, i.e., the concern of another development group, but always reflect the most recent state of the development. The program data is non-quantitative, which sometimes makes it difficult to view programs as data. However, different areas of computer science view programs as data: the theory of computation (Hopcroft et al., 2001), compilers (Muchnick, 1997), and analytic tools for calculating metrics (Fenton and Pfleeger, 1997).
This dissertation generalizes this view of programs: from the point of view of software organizations, programs are data. Furthermore, they represent operational data determining the current status of the development. The software management tasks include monitoring the "key performance indicators" to make sure the overall organization meets its objectives and, if need be, identifying problem areas and devising solutions to remedy them. To do so, management must combine data from various local sources to construct a complete picture. Figure 4.2 shows a conceptual architecture of a software (data) warehouse. In this figure, input data sources are the available data from a software system. Typically, there are several identifiable data sources in a software organization. The challenge
Figure 4.2: A proposed architecture for a software data warehouse
remains to structure these data sources so that they can be used with existing data warehousing tools. The next section provides a discussion of various data sources in large software development organizations.
4.3 Data sources in software development
The previous section discussed the commonalities between traditional businesses and software development organizations. Each has diverse sources of operational data belonging to different groups. In traditional businesses, data is perceived as a valuable commodity, while in software development, software is rarely viewed as data. In a software development organization, one must first identify a set of tasks, define the required data, and proceed with data gathering (Kan, 1995). This dissertation, as the previous section pointed out, views software as data. To complete the architecture of a software data warehouse, this section identifies some common data sources found in a mature software development organization, e.g., levels 2 and higher in the Capability Maturity Model (CMM) (Paulk et al., 1993). To identify the various data sources in a software development organization, one can leverage the existing organizational charts. An organizational chart identifies reporting structures, divisions
of the organization into groups, and relationships among various groups. For example, in the case study presented in Chapters 7-9, the organization includes a group that focuses primarily on product quality. The group's main concerns involve system testing, regression testing, and services. Another group's concerns include performance issues: evaluation and benchmarking. Overall, the most common data sources include:
• programs: files containing source code;
• configuration management and version control: tools for managing programs, their groupings, access, and other development processes;
• project diary: various project management tools, commercial or locally developed;
• test databases: test teams typically maintain data about test executions and their results; and
• other tools: different development teams use various tools that can provide data relevant to management.

Depending on the maturity of the organization, the data sources may be more structured, or additional data sources may be available. For example, more established organizations maintain data pertaining to customer satisfaction. Similarly, various data are kept in structured sources such as relational databases as opposed to textual files. Regardless of the number of data sources or their structures, to effectively manage a software development project all data sources must be leveraged for decision support. As the support mechanism evolves, new data sources will be added and existing data sources will become more structured. The remainder of this section elaborates on each of the data sources, their extraction processes, and their data models. The programs constitute one of the most important data sources. The literature contains numerous references on program analysis (Mendelzon and Sametinger, 1995; Wong et al., 1995; Tilley and Smith, 1996; Finnigan et al., 1997), and the approaches are broadly similar. Programs are typically stored in files, which are themselves stored in directories. Each file has a unique name following some naming convention. The simplest form of analysis uses code counting tools to determine the number of lines of code, comments, or blanks, as well as comparing two versions to
determine the number of lines changed, deleted, or added. More sophisticated analysis involves the use of parsers to extract more detailed information from the program for the construction of data-flow, control-flow, or other dependency graphs. The build process, where programs are compiled and linked into an executable, also provides a wealth of data, including program dependencies, the order of compilation and linkage, compilation units, and libraries. Also, in larger products, there is a large volume of compiler warning messages. These messages are mostly of little significance, and the compiler flags are usually set to the appropriate levels. The data from programs must be extracted using a parser or an extractor that recognizes the language of the program. These tools use a variation of an abstract syntax tree (AST) (Muchnick, 1997) to model the data in programs; the tree can be traversed to find the needed data. When dealing with a small set of programs, having only one or two versions of each file, and a predetermined set of queries, this model and its representations are sufficient. However, when dealing with a large number of program files with multiple versions, the trees are not appropriate. To effectively store the program data in the software data warehouse, this dissertation uses an object-relational model that enables storage of programs in relational databases. (Chapter 5 describes this model in more detail.) Similarly, warning messages and code counts can be converted into a relational model. Another data source available in a software development organization is the configuration management and version control (CM) tool. The CM tool maintains control over a program in a multiple-access environment. Such tools maintain data pertaining to the history of changes made to the programs, the various versions of programs, and program access. Depending on the tool, it may provide processes for changing the code (e.g.,
defect and feature records), a means of grouping files into modules, or the creation of multiple releases of a product. Some commercial tools use a relational database for storing their data; in such cases, data extraction can be achieved via commercial tools. Other tools use hierarchical repositories and provide interfaces for accessing them; in such cases, extraction tools must be developed for loading this data into the data warehouse. The case study described in later chapters uses the IBM Configuration Management and Version Control (CMVC) system (International Technical Support Organization, 1994). CMVC uses a relational database to store its data and uses the file system to store program files. The CMVC data can, with little
modification, be loaded into the data warehouse. Project diaries provide detailed data about the various phases of a project. Some organizations use commercial project management tools, while others develop internal tools. These tools enable teams to plan, monitor, and track their projects. From the BI view, however, these tools are simply data repositories that maintain data about the plans and progress of a project. In the case study described in Chapters 7-9, the internally developed project diary stores data about the proposed functionalities. The team member responsible for a functionality creates a specification, has it reviewed and approved by other team members, completes high- and low-level designs, develops test cases for the new function, implements the function, has the code reviewed, unit-tests the new function, and finally commits the changes to the code-base. The project diary stores this data in a form similar to a check list. Monitoring this data on a regular basis provides the management with a temporal view of the progress of a new function: when it was proposed, who owned it, when the design was complete, who reviewed it, and so forth. Depending on the type of data and the storage mechanism, a data model may need to be developed or the existing model modified, and an extraction process devised. The quality team has the responsibility to test the functional and/or non-functional properties of the new product. There are typically three distinct testing cycles: system test, regression test, and performance test. The testing process may be manual or tool-assisted. In either case, a wealth of data is created by the team and/or the tools.
Such data describes the testing process and may include:

• test case: the scenario being executed;
• build level: the version of the program being tested;
• environment: the hardware on which the program is running;
• start date: when the test case started;
• end date: when the test case completed;
• status: the current result of the test case execution;
• defect: which defect, if any, was found; and
• owner: who executed the test case.

To complete this section, two issues deserve further discussion: the granularity of data and its temporal characteristics. Some organizations do not keep detailed data because they see no value in it. For example, an organization may not keep track of the owner of a test case or the person(s) who executed it. By identifying such data sources and integrating them into the data warehouse, one can, over time, determine the advantages of having more detailed data based on the value it adds. As for temporal data, care must be taken to identify whether a data source is capable of storing history or its data is overwritten each time. By regularly extracting the data from operational databases and storing it in the data warehouse, the history of the data is maintained. Chapters 7-9 will describe, in more detail, an example implementation of a software data warehouse. The following section provides an overview of a management strategy based on the concepts presented thus far.
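The test-execution data source listed above maps naturally onto a flat record. A hypothetical sketch of one such record (field names follow the list; the types and example values are assumptions):

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class TestExecution:
    test_case: str                 # which scenario was executed
    build_level: str               # which version of the program was tested
    environment: str               # the hardware the program ran on
    start_date: date
    end_date: Optional[date]       # None while the test is still running
    status: str                    # e.g., "passed", "failed", "running"
    defect: Optional[str] = None   # defect id, if one was opened
    owner: Optional[str] = None    # may be absent in less mature organizations

run = TestExecution("login-001", "build-42", "rs6000",
                    date(2003, 5, 1), date(2003, 5, 2), "failed", "D1234", "jsmith")
```

Making the last two fields optional reflects the granularity discussion above: a source that does not record owners can still be loaded, and the value of the extra detail can be assessed later.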
4.4 A software management strategy
This chapter has so far presented an overview of data warehousing and a conceptual architecture for software data warehousing. The software data warehouse, in conjunction with the multi-dimensional management framework described in Chapter 3, provides a comprehensive management strategy parallel to the one used in BI. This strategy has four stages.
• Planning: Using the framework outlined in Chapter 3, the management defines a set of goals for each of the dimensions of interest. (As will be discussed in Chapter 11, these dimensions can be extended based on the needs of the organization; two other dimensions proposed after the case study presented in Chapter 7 were skills and infrastructure.) Furthermore, for each dimension two values are determined: the initial and the final values. Using these values, the ideal plans for each dimension are constructed.

• Operation: As the development cycle begins, at regular intervals the data warehouse is refreshed from the operational data sources. At this point, by extracting the necessary data from the warehouse, the value of each sub-goal of the project is computed and rolled up to determine the index values (the main goals) of each dimension. Comparing these values with the ideal plans reveals whether the project is on track (within its tolerance), ahead of, or behind its plan. If behind the plan, then the severity of the problem and its possible causes must be determined.

• Tactical: When a problem arises, the management focus will be on causal analysis. Using the data available in the warehouse, the management performs various analyses to diagnose potential causes of (or correlations with) the identified problem. Drilling down in the hierarchy of the problem goal, the sub-goal(s) that failed to meet their planned targets are identified. At this step, one can move across to other dimensions to identify possible causes (or correlations). For example, the case study presented in Chapters 7-9 identified a correlation between the number of unit testing defects and the size of program files (in lines of code) in a component: beyond a threshold, the higher the number of lines of code in a component, the higher the number of identified unit testing defects, which hindered the progress goal. Other means of analysis and diagnosis involve looking for particular distributions of defects. An example of this type of analysis is orthogonal defect classification (ODC) (Chillarege et al., 1992). The objective of ODC is to develop a set of mutually independent categories of defects and build a profile of the defect distribution. In the case study described later, a multi-dimensional model of defects was developed in the outer layer of the data warehouse, where each category was defined as a dimension. When analyzing defects, the management could view defects along various dimensions: by component, by manager, by function manager, by the phase where the defect was found, by date, and any other available dimension.
These are examples of ad hoc analyses that may be initiated based on the values computed in the operational stage, leveraging the available data in the warehouse.

• Strategic: When the (potential) cause of a problem is identified, a solution must be devised. Depending on the severity of the problem, the solution may involve changing the initial project plan, adjusting resource allocations, or modifying existing processes. The objective of this stage is to investigate the impact of the problem, as well as of its potential solution, in both the short term and the long term. For example, if a solution is to be implemented, what are the expected benefits and what are the “costs” involved?
In the example provided in the previous item, the correlation of lines of code and unit test defects, it was estimated that the number of unit test defects could be reduced by 15-20 percent if the number of lines of code were reduced, on average, by 40 percent. Reducing the number of lines of code in a program involves splitting a file into two or more files, which is costly. During the split, the developers do not (directly) contribute to the overall goals of the project. A possible approach is to identify one or two components with a high number of unit test defects and try to reduce their number of lines of code. If the expected results are achieved, the approach can be extended to other components, eventually introducing a local rule into the development process that constrains the number of lines of code in program files.
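The kind of correlation check described above can be illustrated with a small sketch: unit-test defects against file size, restricted to files above a size threshold. The numbers below are invented for the illustration and do not come from the case study.

```python
# Hypothetical (lines_of_code, unit_test_defects) pairs for a component.
files = [
    (500, 1), (900, 2), (2500, 6), (4000, 11), (6000, 18),
]
threshold = 1000  # only files beyond the threshold are considered
large = [(loc, d) for loc, d in files if loc > threshold]

def pearson(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    sx = sum((x - mx) ** 2 for x, _ in pairs) ** 0.5
    sy = sum((y - my) ** 2 for _, y in pairs) ** 0.5
    return cov / (sx * sy)

# A value near 1 would support the "more lines, more unit-test defects"
# observation for files beyond the threshold.
print(round(pearson(large), 3))
```

In practice, such a computation would run over data extracted from the warehouse, not over hand-entered values.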
It should be noted here that an identified problem may be the result of poor definitions of the project dimensions or of the data used in their evaluation. Such shortcomings motivate continual improvement of the goal definitions, both in terms of capturing all concerns and their respective evaluations (i.e., metrics), and of the data available from the warehouse, with respect to both missing data and the structure of the stored data. Such tasks are also part of the strategic stage.
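The tactical-stage drill-down across defect dimensions described earlier can be illustrated with a toy example. The schema and data are hypothetical; the point is the roll-up followed by a drill-down along another dimension.

```python
import sqlite3

# A tiny ODC-style defect table with two dimensions: component and
# the phase in which the defect was found. All values are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE defect (id TEXT, component TEXT, phase TEXT)")
conn.executemany("INSERT INTO defect VALUES (?,?,?)", [
    ("D1", "parser", "unit test"),
    ("D2", "parser", "unit test"),
    ("D3", "optimizer", "system test"),
])

# Roll-up: defects per component ...
by_component = conn.execute(
    "SELECT component, COUNT(*) FROM defect "
    "GROUP BY component ORDER BY component").fetchall()

# ... then drill down into the worst component along the phase dimension.
by_phase = conn.execute(
    "SELECT phase, COUNT(*) FROM defect WHERE component = 'parser' "
    "GROUP BY phase").fetchall()

print(by_component)  # [('optimizer', 1), ('parser', 2)]
print(by_phase)      # [('unit test', 2)]
```

In the warehouse, each such dimension would be a table in the outer-layer defect model rather than a column of one flat table.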
4.5 Summary
This chapter presented an overview of data warehousing and defined the conceptual architecture of the software data warehouse. As Section 4.1 discussed, a data warehouse is not an off-the-shelf product. A data warehouse is a solution for improving management decision support and as such must be developed based on the overall management strategy and the available data sources. Similarly, a software data warehouse must be developed based on the dimensions of the management strategy, its multiplicity of concerns, and the available data sources. As common data sources (e.g. configuration management tools) are integrated into software data warehouses, tools can be developed for facilitating data extraction and cleansing that could reduce the time and effort needed to develop a software data warehouse. The following chapter focuses on one such issue: extraction and storage of a program in a relational database.
Chapter 5
An object-relational schema for programs

Chapter 4 presented the architecture of a software data warehouse. Chapter 7 will introduce an industrial case study that uses an implementation of the proposed architecture. This chapter deals with an obstacle in the way of implementing a software data warehouse: storing programs in commercial databases. As Chapter 7 will show, the program data constitutes the largest part of a software data warehouse, which directly impacts the health definition and its evaluation. As a result, it is necessary to have a means of long-term storage and efficient access to this data. This chapter presents an object-relational schema for C/C++ programs, which facilitates the storage of programs in a relational database.
5.1 Program representations
To complete the software data warehousing solution, one must make a distinction among the common tasks performed on programs. These tasks can be divided into two groups based on their impact on the final (executable) product. On one side are the tasks that directly change the program, and its executable product, to reflect new requirements. Common programming and optimization tasks fall in this group. Analytical tasks fall into the other group: these are tasks
that do not modify the program and hence do not directly affect the executable product. Metric calculations, architectural discovery, and re-documentation are among these tasks. The proposed division of tasks is consistent with the dichotomy of tasks in traditional data management, where there are transactional and analytical tasks, each of which has its own set of constraints. In transactional data management, data is continually changed and its real-time state is of utmost importance. Hence, the data models for this group of tasks must facilitate fast creation, modification, and querying of the data. In analytical applications, data is not modified; it is loaded from the transactional systems into the data warehouse. The queries are ad hoc and the focus is mostly on trends and history rather than the real-time state of the data. Similarly, in the software data warehouse, the primary focus is on analytical tasks and the real-time state of the program is of lesser concern. As a result, the data models must be suitable for these tasks. The most common data model for programs is the abstract syntax tree (AST). The AST is an intermediate representation of a program that makes the structure of the program explicit. There is enough information in the AST to reconstruct the original program. Furthermore, from an AST, other information about the semantics of the program, such as variable typing and scoping information, can be derived. For example, consider the simple C++ program shown in Figure 5.1. Figure 5.2 depicts a corresponding AST for this program.
1: class foo {
2:   public: int id;
3: };
4: main() {
5:   int n;
6:   foo a;
7:   n = 0;
8:   a.id = 1;
9: }

Figure 5.1: A simple C++ program
Analytical tools, mainly those used for reverse engineering tasks, use some variations of AST, e.g. Aria (Devanbu et al., 1996), Reprise (Rosenblum and Wolf, 1991), and Rigi (Muller, 1986;
Muller et al., 1993). The tree nature of these models facilitates the creation of other abstractions, e.g., control- and data-flow graphs, using only tree traversals. ASTs are well suited for local queries, while global queries require more expensive tree traversals. To remedy this shortcoming, a symbol table, or other secondary data structures, can be constructed. However, these modifications further exacerbate a key weakness of ASTs: their lack of standardization. There is no standard for ASTs, nor for their query execution, nor for their long-term storage. ASTs were designed as temporary representations of programs to be used by compilers.
[Figure 5.2: An example AST for the program of Figure 5.1, with nodes for the class foo, the function main, the objects n, a, and id, their types, the assignment and member-access operators, and the literals 0 and 1.]
To effectively store and manipulate ASTs one must remove “unnecessary” data. In other words, a decision is made in advance on the level of granularity of the entities that are needed for the analytical tasks at hand and only those entities are stored and the remaining part of the AST is discarded. In such cases, tools extract facts (or relations) from the AST and store them in a linear form. For example, Rigi (Muller et al., 1993) and Software Bookshelf (Finnigan et al., 1997) store facts extracted from a program as three-tuple relations called Rigi Standard Format (RSF). Line number 5 in the example program can be represented as:
(type n int)
In essence, this approach flattens the tree by storing nodes and arcs of the tree. The resulting data model has one entity, nodes, which has a many-to-many recursive relation to itself. This data model
is simple yet elegant. However, the simplicity of the data model makes it inadequate for use in a relational database. Consequently, as the data volume grows, data manipulation becomes less efficient. Datrix (Mayrand et al., 2000) is a commercial tool from Bell Canada that was developed for third-party software acquisition (Mayrand and Coallier, 1996). Organizations that acquire software systems from other vendors analyze these systems to determine if they are fit for ongoing maintenance and enhancement. Datrix consists of a set of parsers that construct ASTs for programs written in C/C++ and Java. After construction of an AST, Datrix converts it into an Abstract Semantic Graph (ASG) (Bell Canada, 2000) by adding semantic data to the AST. Figure 5.3 depicts an ASG for the example program. In an ASG, the variable references are resolved, type information for variables is stored, and an order among arcs is identified using integer values.
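The flattened node-and-arc model described earlier, with facts stored as RSF-style triples, can be sketched as a pair of tables. This is an illustrative sketch in SQLite, not the actual Rigi or Bookshelf storage format; names are invented.

```python
import sqlite3

# One node table and one arc table: the arcs form the many-to-many
# recursive relation among nodes. All identifiers are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node (id TEXT PRIMARY KEY, kind TEXT)")
conn.execute("CREATE TABLE arc (src TEXT, rel TEXT, dst TEXT)")
conn.executemany("INSERT INTO node VALUES (?,?)", [
    ("n", "object"), ("int", "type"), ("main", "function"),
])
# Line 5 of the example program, 'int n;', stored as a triple:
conn.execute("INSERT INTO arc VALUES ('n', 'type', 'int')")
conn.execute("INSERT INTO arc VALUES ('main', 'declares', 'n')")

row = conn.execute(
    "SELECT dst FROM arc WHERE src = 'n' AND rel = 'type'").fetchone()
print(row[0])  # int
```

The single self-referential relation is exactly what makes global queries over large programs expensive in this model, as noted above.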
[Figure 5.3: An example ASG — the AST of Figure 5.2 with variable references resolved and type information attached to the nodes.]
The Datrix schema enhances the data model described earlier. It has two entities: nodes, which represent the entities of a programming language, and arcs, which represent the relationships among entities. Each entity is further refined by a class hierarchy. Datrix stores this data internally as a graph. The next section demonstrates the use of the Datrix schema as a basis for an object-relational schema for programs. (Datrix is a registered trademark of Bell Canada.)
5.2 An object-relational schema for programs
There have been several tools that leverage relational data models for the storage of program data. The motivation behind using relational data models is to leverage commercial databases and their optimized query facilities. A key difference between these approaches and those that use ASTs is the data modeling step. To take advantage of commercial databases, a fully developed data model is necessary. The drawback, in the case of the relational model, is that graph-based data is difficult to represent. As a result, it is necessary to make a decision on the granularity of entities. Typically, the granularity of the programming language entities is defined based on a predetermined set of analytical tasks; common examples are top-level declarations. For example, Chen et al. (1998) used types, functions, variables, and macros for their C/C++ program repository.

These approaches are in contrast to the objectives of this dissertation. One of the pillars of the proposed approach is to shift the focus from applications to data. The justification for this shift is that, due to the incompleteness of proposed solutions, applications are more prone to change. The possibility of application changes multiplies when the number of groups and people in a software development organization is taken into account. The large amount of program data requires shortening the development time for new analytic applications. As a result, a schema for programs must satisfy four constraints:

• Enable storage of complete data from an AST. This constraint ensures that the software data warehouse stores the entire program.

• Enable storage of historical data. Programs change frequently and the schema must facilitate historical and trend analyses.

• Facilitate frequent data populations. The data warehouse refreshes on a regular basis. The process of data population must not deter the data warehouse refresh frequency.
• Facilitate global and local queries. The Datrix schema (or other graph-based data models) can be stored in a relational database. However, such schemas are suited for a specific class of queries. A schema must facilitate a diverse group of queries.
The main entity set in the schema is ProgramEntity. This entity set represents all program entities, e.g., an iteration, a variable, or a function. A program entity is related to other program entities, which can be represented as a many-to-many recursive relationship. Common attributes of a program entity are:

• unique id: distinguishes an entity from others across the time-line;
• file name: the full path name of the file where the entity is stored;
• date: the date when the entity was extracted (to distinguish between different versions of the program);
• release: to distinguish among different releases of the program;
• platform: in case of multi-platform environments;
• location: the beginning and end columns and rows where the entity is stored in the file;
• block number: the scope of the entity within the file.

This entity set represents a programming language's entities (in our case, C/C++). One can identify these entities by examining the grammar of the language. Other program entities are then identified as specializations of ProgramEntity; in other words, the new entity sets inherit from ProgramEntity. The intermediate children are: Control Statement, Identifier, Expression, and Exception. Figure 5.4 shows an excerpt of the ER diagram with these entities and their attributes.

[Figure 5.4: Part of the C/C++ data model — the entity sets Exception, Expression, Identifier, and Control Statement, sharing the attributes Oid, file name, date, release, location, and block no, with a many-to-many Associates relationship among program entities.]

Each entity set inherits the attributes of the parent class and has a set of its own attributes that further describe its unique characteristics. For example, the Identifier entity set has a name and a visibility. We also identify a one-to-many relationship between the entity sets Expression and Exception; in programming notation, this expression is what follows a throw. Each of the entity sets of the second level can be further refined. As we identify new entity sets and place them in their respective locations in the inheritance hierarchy, we also identify their
relationships with other entity sets. For example, the entity set Identifier has six children: Type, Enumerator, Label, Scope, Object, and Function. The entity set Function has several attributes that further describe the entity. The boolean attributes static, virtual, external, and constant determine the type of the function. The entity set Function can also have an exception; in other words, there is a one-to-many relationship between Function and Exception. Furthermore, each function has many parameters that are unique to it. This is modeled using a one-to-many relationship between Function and the entity set FunctionParameter. Each function returns a value of a specific type; this can be modeled as a many-to-one relationship between the entity sets Function and Type. Figures 5.5 and 5.6 depict the ER diagram for C/C++ programs.

[Figure 5.5: C/C++ data model — entity set hierarchy]

[Figure 5.6: C/C++ data model — associations]

There are a few unique points in the presented schema that deviate from Datrix's. The number of entity sets cannot grow arbitrarily large for practical purposes. Thus, we use labels to identify subclasses that are not of great importance. For example, consider the entity set Type, which is further identified by an attribute called typeName. This attribute distinguishes among members of the entity set Type: it identifies whether the specified type defines a function signature, a reference to a variable, or an alias. In essence, several entity sets collapse into one. Such attributes (i.e., identifying labels) internally classify entity sets. One can also control whether children of an entity set inherit these attributes. The alternative here would be to expand the entity Type by defining several child entity sets. We use such an expansion for entities that
have a clear distinguishing attribute that further separates them from their parents. The entity Aggregate Type is one such example: it defines classes, structs, and other aggregate data types in a programming language. In the relational model, many-to-many relationships result in new tables. There are numerous entities in the proposed model, each of which maps to a table, resulting in a large number of tables. An empirical study of a code-base (used in the case study outlined in Chapter 7) revealed that the amount of data for these many-to-many relationships is small relative to the total number of relationships. As a result, we collapsed all many-to-many relationships into one Association relationship. This relationship has an attribute typeName that identifies the type of the relationship. If need be, one can use this attribute to verify whether an instance of the relationship is valid. Appendix A shows a mapping of these entities and relationships to an object-relational model implemented using DB2 V7.1. The presented data model can store the entire AST for a given C/C++ program. Programs are parsed using the Datrix parsers and loaded into the software data warehouse via a local loader program. The Datrix parsers output the ASG in the Tuple Attribute language (TA) (Holt, 1998). The loader program traverses the ASG, computes block numbers for each scope, creates corresponding entity sets for each node, resolves the relationships, and inserts them into the respective tables. The date attribute distinguishes two identical program entities at different points in time, e.g., an identifier whose type has changed from one version to another. The following section describes common groups of queries for the analysis of programs.
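Before turning to queries, the ProgramEntity attributes described above can be rendered, in a much simplified form, as a relational table. This is a sketch in SQLite, not the DB2 schema of Appendix A; all names and values are illustrative.

```python
import sqlite3

# A minimal relational rendering of the ProgramEntity attributes.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE program_entity (
        unique_id INTEGER,  -- stable across the time-line
        file_name TEXT,     -- full path of the containing file
        date      TEXT,     -- extraction date (version marker)
        release   TEXT,
        platform  TEXT,
        location  TEXT,     -- begin/end row and column
        block_no  INTEGER,  -- scope within the file
        PRIMARY KEY (unique_id, date)
    )
""")
# The same identifier extracted from two weekly builds: the composite
# (unique_id, date) key keeps both versions, preserving history.
conn.executemany("INSERT INTO program_entity VALUES (?,?,?,?,?,?,?)", [
    (7, "src/foo.cpp", "2002-06-01", "r5", "aix", "10:1-10:9", 2),
    (7, "src/foo.cpp", "2002-06-08", "r5", "aix", "12:1-12:9", 2),
])
n = conn.execute(
    "SELECT COUNT(*) FROM program_entity "
    "WHERE unique_id = 7").fetchone()[0]
print(n)  # 2
```

The composite key on (unique_id, date) is what supports the historical and trend analyses listed among the four schema constraints.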
5.3 Manipulating program data
The data schema described in the previous section was implemented using DB2 version 7.1 (DB2 is a registered trademark of IBM). Programs are parsed using Datrix and the resulting AST is stored in text files. These text files are then read by another program that was developed to add scope numbering, resolve relationships, and convert the AST entities to those of our object-relational schema. The process is slow in comparison to the build process (i.e., compilation and linking of the programs). However, the objective is the analysis of programs, and this process repeats less frequently than the build process. For example, in the case study described in Chapter 7, data was extracted from weekly builds. Furthermore, only those files that changed between two consecutive builds were parsed.

In terms of the manipulation of program data, the proposed approach has a clear advantage over traditional graph-based repositories. The relational database enables the creation of aggregations on the outer layer of the software data warehouse. Since the objective is to analyze the data, not to modify it, common queries that are slower are pre-computed and their results stored in aggregate data marts. For other queries that are simpler and less resource-intensive, one can create elaborate indices to improve performance. The queries can be grouped into three distinct groups based on complexity and on the use of existing query facilities.

• Linear queries: These search the database for specific attribute(s). Typical examples include the number of public, private, or protected members of a given class; the number of direct sub-classes; the fan-out of a function; the total number of decisions in a file or a function; and the total number of lines of code in a given file or function. Such queries can be expressed using basic SQL.

• Recursive queries: These are more computationally involved. Typical examples include the depth of a class hierarchy; the total number of direct and indirect sub-classes; reachability; the call graph for a function; and file inclusions. These queries represent simple tree traversals in ASTs and require the use of recursion and other advanced features of SQL.

• Free-form queries: This category requires more extensive analysis of the data. Typical examples include control-flow and data-flow graphs. To answer questions relating to control-flow or data-flow graphs, more sophisticated applications must be developed. Such applications extract data from the warehouse (i.e., in the data mart layer) and compute the answers.
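The contrast between linear and recursive queries can be shown on a toy class-hierarchy table. The schema and data are hypothetical, and the sketch uses SQLite's recursive common table expressions rather than DB2's recursion syntax.

```python
import sqlite3

# A toy "direct sub-class" relation: B inherits from A; C and D from B.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subclass_of (child TEXT, parent TEXT)")
conn.executemany("INSERT INTO subclass_of VALUES (?,?)", [
    ("B", "A"), ("C", "B"), ("D", "B"),
])

# Linear query: the number of direct sub-classes of B.
direct = conn.execute(
    "SELECT COUNT(*) FROM subclass_of WHERE parent = 'B'").fetchone()[0]

# Recursive query: all direct and indirect sub-classes of A.
all_subs = conn.execute("""
    WITH RECURSIVE subs(name) AS (
        SELECT child FROM subclass_of WHERE parent = 'A'
        UNION
        SELECT s.child FROM subclass_of s
        JOIN subs ON s.parent = subs.name
    )
    SELECT COUNT(*) FROM subs
""").fetchone()[0]

print(direct, all_subs)  # 2 3
```

The recursive query walks the hierarchy to arbitrary depth, which is why such queries are the natural candidates for pre-computation during the refresh process.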
With proper indexing, linear queries perform well and can be computed on demand. Recursive queries are more expensive, and their performance depends
on the program structure. In the case study presented in Chapter 7, such queries were used for determining health indicators. These queries were computed in advance, during the refresh process, and stored in temporary tables.
5.4 Summary
This chapter presented an object-relational schema for programs. The schema focuses mainly on the C/C++ programming languages; however, it can be modified to encompass similar languages such as Java. The proposed schema leverages the power of commercial databases for the long-term storage of program data, facilitating the health definition, historical studies, and other tasks such as the study of evolutionary patterns and structural documentation. The following chapters outline a case study of the application of the ideas and framework presented thus far.
Chapter 6
An implementation methodology

The previous chapters outlined the concepts behind the multi-dimensional management framework and its supporting data warehousing architecture. Subsequent chapters will present the application of these concepts to an industrial case study. However, due to issues relating to the size of, and confidentiality surrounding, the case study, this dissertation cannot delve into enough depth to enable the reader to implement a complete software data warehouse. As a result, to provide a step-by-step methodology for the design and implementation of a software data warehouse, this chapter uses a small example of a hypothetical software development organization sharing some of the characteristics of the case study. The chapter begins by outlining a four-step methodology for the design of a software data warehouse, followed by a detailed description of each step using an example. This methodology and its application will pave the way for the case study presented in later chapters.
6.1 Approach
A key objective of this dissertation is to provide support for tactical and strategic decision making. As software products evolve, their development processes need to be improved. As an example, during the case study presented in subsequent chapters, it was estimated that the number of customer-found defects would increase. A question facing the management was whether to shift resources to
system testing, reduce the number of new functionalities planned for inclusion in the product, or change the testing process. The answer to such questions depends on (local) facts from the project; management needs such factual data to make appropriate decisions. This chapter, using a hypothetical example, demonstrates how the ideas presented earlier in this dissertation can be implemented in a software development organization.

To facilitate the implementation process, several key issues must be taken into account. First, an initial objective is to identify the software development organization's goals and formalize them according to the methods presented in Chapter 3. This step must be non-intrusive. Any intuitive improvement idea must be supported by local facts; such improvements can proceed after completing the appropriate analyses. Second, each goal must be associated with a set of metrics, regardless of their level of subjectivity. The subjectivity of metrics can be the subject of future improvements; at this stage, we only formalize the goals and determine how the organization decides whether it has achieved them. Third, all available data sources should be identified and their data characterized in terms of structure, size, and growth. Data from these sources can provide the necessary evidence for assessing the software development organization's goals. Fourth, due to the continual evolution of software products, one should not aim for perfection: flexibility is of greater importance, as the software data warehouse must undergo continual improvement. Taking these issues into account, this chapter presents a four-step approach in line with the software data warehouse architecture shown in Figure 4.1.
1. Input layer: Identify and characterize the available data sources.

2. Middle layer: Design schemas to integrate the data from the input layer (as discussed in Chapter 4). Furthermore, develop tools for data extraction, cleansing, and transformation from the input layer.

3. Outer layer: Define the three-dimensional management model (as described in Chapter 3) based on organizational goals and available documentation, such as the ISO 9000 quality management system (ISO, 1990).
4. Analytical tools: Develop tools for fast and ad hoc query processing based on the three-dimensional management model.

Though these steps are identified sequentially, it will be necessary to execute them iteratively. Not all input layer data sources can be identified in the first pass; some data sources may not be easily accessible. Alternatively, in some cases there may be too many sources, which can delay the development of the software data warehouse. The resulting process will hence be evolutionary, providing two important benefits. First, it delivers quick impact in terms of management decision support; this support will eventually result in changes to the existing processes and data sources. Second, it reinforces the importance of continual improvement. The following four sections elaborate on each of these steps.
6.2 Input layer: identifying data sources
This step aims to identify the data sources that will feed the middle layer, the central data warehouse. It involves identifying the available data sources and characterizing them in terms of their size, structure, and growth. The availability of this data, in turn, will contribute to decision support and improvement efforts. A key benefit of identifying the input data sources is a partial business model of the software development organization. The available data sources and their characteristics indicate the maturity level of the software processes as described by the Capability Maturity Model (CMM) (Ahern et al., 2001). More mature organizations have repeatable processes that are measured and continuously optimized. In order to successfully create a software data warehouse, an organization must have reached the second or a higher level of maturity. The availability of data not only provides immediate leverage for decision making, but also supports historical analyses and the repeat execution of successful processes. Identifying data sources in a software development organization can present several difficulties. First, what this dissertation considers data to be used for analysis may not be perceived as such by the development organization. For example, the name of a person executing a test scenario may be
buried in the name of the log file and completely forgotten. From our perspective, the names of log files are a data source: such data can shed light on testing processes, resource allocations, and other facets of the software development project. For the development team, the naming of log files may simply be a good practice. As a result, the identification of the data sources must proceed in iterations. As data sources are identified, they are integrated into the central data warehouse. As various analyses are performed, new data sources will be identified, others may be created, and some become more structured. An initial step should involve searching for “important” data sources, which a mature organization must have. These include the code-base, configuration management tools, defect tracking and reporting tools, organizational charts, project diaries, development tools, and test execution logs. The next step involves the characterization of the data sources.
• How structured is the data source: plain text, structured text, or relational schema?
• What is the current size of the data source?
• What is the growth rate of the data source?
• Is the data source permanent or temporary (e.g., a pilot project)?

It may also prove beneficial to determine if there is any historical data available to use for initial hypothesis verification. For example, by looking at the data fields of a defect tracking and reporting tool, one can identify the usefulness of data items. In our case study we identified a field of the problem tracking database as a means of verifying customers' usage of the product. However, historical data revealed that, due to constant pressures, the data value was almost always null. To demonstrate the identification and characterization of input data sources, we use a hypothetical software development organization: a mature organization with a moderate number of employees (e.g., 100) that has released several versions of its product. We begin with the common data sources described earlier.
• Code-base: It remains one of the most important sources of data. It is permanent and continually changing, daily or weekly depending on the build process. It is semi-structured and maintains an extensive history. Our organization has organized its source code into components, each of which groups a number of related files. The programs are written in C/C++. We adopt the schema presented in Chapter 5 for the code-base.
• Organizational chart: Organizations maintain a hierarchical representation of their employees. Such data can provide valuable insight into the development processes. Also, for reporting purposes, it is necessary to divide the project along the management levels into local concerns (as discussed in Chapters 3 and 4). Our organization provides access to its organizational chart through a structured web repository. Figure 6.1 depicts the ER model representing the organizational chart.
Figure 6.1: The ERM for the organizational chart. (Entities: People, with attributes Name, Email, IsMgr, and Id; Department, with attributes Name and Id. Relationships: a manager manages many people, each person works in one department, and each department has one director.)
• Project diary: Depending on the organization's level of maturity, various tools may be used for planning, design, and tracking of the project. Some organizations develop in-house tools, while others acquire commercial tools. In either case, if no schemas are available, an abstract model of the tool must be created based on the data that can be extracted from it. In our example, the project diary maintains its data in a relational database. Figure 6.2 shows the schema for the project diary.
• Configuration management: A necessary prerequisite for reaching CMM level 2 is the use of configuration management tools. Such tools not only control access to the code-base through a pre-determined process, they also provide a wealth of data about the development processes. Such data is typically structured and includes some history. Our organization
CREATE TABLE LineItems (
    Id INTEGER NOT NULL,
    ReleaseName VARCHAR(25),
    Owner VARCHAR(50),
    Status VARCHAR(25),
    SpecDoc INTEGER,
    DesignDoc INTEGER,
    UnitTestDoc INTEGER,
    ScenarioDoc INTEGER,
    CodeReviewDoc INTEGER,
    PRIMARY KEY (Id),
    FOREIGN KEY (SpecDoc) REFERENCES Specifications ON DELETE RESTRICT,
    FOREIGN KEY (DesignDoc) REFERENCES Designs ON DELETE RESTRICT,
    FOREIGN KEY (UnitTestDoc) REFERENCES UnitTests ON DELETE RESTRICT,
    FOREIGN KEY (ScenarioDoc) REFERENCES Scenarios ON DELETE RESTRICT,
    FOREIGN KEY (CodeReviewDoc) REFERENCES CodeReviews ON DELETE RESTRICT
)

CREATE TABLE Specifications (
    Id INTEGER NOT NULL,
    Owner VARCHAR(50),
    Reviewer VARCHAR(50),
    Approver VARCHAR(50),
    Abstract VARCHAR(256),
    Description BLOB,
    Status VARCHAR(25),
    PRIMARY KEY (Id)
)

CREATE TABLE Designs (
    Id INTEGER NOT NULL,
    Owner VARCHAR(50),
    Reviewer VARCHAR(50),
    Approver VARCHAR(50),
    Abstract VARCHAR(256),
    Description BLOB,
    Status VARCHAR(25),
    PRIMARY KEY (Id)
)

CREATE TABLE UnitTests (
    Id INTEGER NOT NULL,
    Owner VARCHAR(50),
    Reviewer VARCHAR(50),
    Approver VARCHAR(50),
    Abstract VARCHAR(256),
    Description BLOB,
    Status VARCHAR(25),
    StartDate DATE,
    EndDate DATE,
    Success INTEGER DEFAULT 0,
    PRIMARY KEY (Id)
)

CREATE TABLE SystemTests (
    Id INTEGER NOT NULL,
    Owner VARCHAR(50),
    Reviewer VARCHAR(50),
    Approver VARCHAR(50),
    Abstract VARCHAR(256),
    Description BLOB,
    Status VARCHAR(25),
    StartDate DATE,
    EndDate DATE,
    Success INTEGER DEFAULT 0,
    PRIMARY KEY (Id)
)

CREATE TABLE Scenarios (
    LineItemId INTEGER NOT NULL,
    SystemTestId INTEGER NOT NULL,
    PRIMARY KEY (LineItemId, SystemTestId),
    FOREIGN KEY (LineItemId) REFERENCES LineItems ON DELETE CASCADE,
    FOREIGN KEY (SystemTestId) REFERENCES SystemTests ON DELETE CASCADE
)

CREATE TABLE CodeReviews (
    Id INTEGER NOT NULL,
    Owner VARCHAR(50),
    Reviewer VARCHAR(50),
    Approver VARCHAR(50),
    EstimatedChgLOC INTEGER,
    LinesAdded INTEGER,
    LinesDeleted INTEGER,
    LinesChanged INTEGER,
    Completed CHAR(1),
    FilesCheckedIn CHAR(1),
    SanityTest CHAR(1),
    ComponentsAffected VARCHAR(256),
    FilesChanged VARCHAR(256),
    PRIMARY KEY (Id)
)
Figure 6.2: The schema for the project diary.
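The project diary schema above can answer routine tracking questions directly. The sketch below is an illustration, not part of the methodology: it rebuilds a simplified slice of the LineItems and Specifications tables (invented rows, SQLite in place of the organization's database) and asks which line items have an approved specification.

```python
import sqlite3

# Minimal sketch of querying the Figure 6.2 schema: simplified columns,
# invented sample rows, SQLite used for illustration only.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Specifications (
    Id INTEGER PRIMARY KEY, Owner TEXT, Status TEXT);
CREATE TABLE LineItems (
    Id INTEGER PRIMARY KEY, ReleaseName TEXT, Owner TEXT,
    Status TEXT, SpecDoc INTEGER REFERENCES Specifications);
""")
con.executemany("INSERT INTO Specifications VALUES (?,?,?)",
                [(10, "alice", "approved"), (11, "bob", "draft")])
con.executemany("INSERT INTO LineItems VALUES (?,?,?,?,?)",
                [(1, "v1", "alice", "open", 10), (2, "v1", "bob", "open", 11)])

# Line items whose specification has been approved.
rows = con.execute("""
    SELECT li.Id, s.Status
      FROM LineItems li JOIN Specifications s ON li.SpecDoc = s.Id
     WHERE s.Status = 'approved'
""").fetchall()
print(rows)
```

The same join pattern underlies the progress computations discussed later in this chapter.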
uses CMVC (International Technical Support Organization, 1994) for its configuration management and defect tracking and reporting. CMVC uses a relational database for storing its data, and authorized users can access this data through a front-end tool. Our organization uses some of the CMVC processes. Figure 6.3 depicts the schema for the data relating to these processes.
• Test logs: The growth of automated test tools has resulted in more data: test plans, test executions, and test results. In the absence of such tools, audit trails and other log files exist that provide data about the execution of test scenarios. Our organization uses home-grown tools that execute test scenarios and write logs of their executions in text files. The log files are semi-structured. Figure 6.4 depicts an ER model of this data source.
Finally, Table 6.1 summarizes the characteristics of our data sources.

Data source               Structure        Size      Growth
Code-base                 semi-structured  100 KLOC  30% annually
Project diary             relational       100 MB    80% annually
Configuration management  relational       500 MB    60% annually
Organizational chart      structured      25 MB      5% annually
Test logs                 semi-structured  50 MB     100% annually

Table 6.1: Input layer data sources and their characteristics
The middle layer in the software data warehouse architecture integrates the data from the input layer. The integrated data provides a comprehensive view of the overall software development. The next section discusses the issues surrounding the integration of the input layer data, as well as the steps involved in extracting, transforming, and loading this data into the middle layer.
6.3 Middle layer: designing the data warehouse
This section discusses the issues surrounding the design of the middle layer, which involves the integration of the data sources in the input layer. The middle layer provides the outer layer with uniform access to consistent data. The outer layer contains the implementation of the three-dimensional management model (as discussed in Chapter 3) and other analytical tools.
CREATE TABLE Components (
    Id INTEGER NOT NULL,
    Name VARCHAR(100) NOT NULL,
    UserId INTEGER,
    AddDate TIMESTAMP,
    DropDate TIMESTAMP,
    PRIMARY KEY (Id)
)

CREATE TABLE Releases (
    Id INTEGER NOT NULL,
    Name VARCHAR(100) NOT NULL,
    UserId INTEGER,
    AddDate TIMESTAMP,
    DropDate TIMESTAMP,
    LastUpdate TIMESTAMP,
    Process VARCHAR(40),
    Track VARCHAR(10),
    Approve VARCHAR(10),
    PRIMARY KEY (Id)
)

CREATE TABLE Defects (
    Id INTEGER NOT NULL,
    Type VARCHAR(10),
    Prefix VARCHAR(40),
    CompId INTEGER,
    State VARCHAR(20),
    Severity VARCHAR(40),
    Abstract VARCHAR(150),
    Reference VARCHAR(40),
    LastUpdate TIMESTAMP,
    AddDate TIMESTAMP,
    AssignDate TIMESTAMP,
    ResponseDate TIMESTAMP,
    EndDate TIMESTAMP,
    OriginId INTEGER,
    OwnerId INTEGER,
    ReleaseId INTEGER,
    PhaseFound VARCHAR(50),
    PRIMARY KEY (Id),
    FOREIGN KEY (CompId) REFERENCES Components,
    FOREIGN KEY (ReleaseId) REFERENCES Releases,
    FOREIGN KEY (OriginId) REFERENCES Users,
    FOREIGN KEY (OwnerId) REFERENCES Users
)

CREATE TABLE Files (
    Id INTEGER NOT NULL,
    CompId INTEGER NOT NULL,
    SourceId INTEGER,
    PathId INTEGER,
    AddDate TIMESTAMP,
    DropDate TIMESTAMP,
    LastUpdate TIMESTAMP,
    BaseName VARCHAR(100) NOT NULL,
    Type CHAR(10),
    OwnerId INTEGER,
    DistType CHAR(1),
    Scode CHAR(1),
    PRIMARY KEY (Id),
    FOREIGN KEY (CompId) REFERENCES Components,
    FOREIGN KEY (PathId) REFERENCES Paths,
    FOREIGN KEY (OwnerId) REFERENCES Users
)

CREATE TABLE History (
    DefectId INTEGER NOT NULL,
    UserId INTEGER NOT NULL,
    Action VARCHAR(20),
    AddDate TIMESTAMP,
    PRIMARY KEY (AddDate, DefectId, UserId),
    FOREIGN KEY (DefectId) REFERENCES Defects,
    FOREIGN KEY (UserId) REFERENCES Users
)

CREATE TABLE Paths (
    Id INTEGER NOT NULL,
    Name VARCHAR(100),
    PRIMARY KEY (Id)
)

CREATE TABLE Users (
    Id INTEGER NOT NULL,
    Login VARCHAR(25) NOT NULL,
    Name VARCHAR(50),
    Department VARCHAR(50),
    DeptId INTEGER,
    PRIMARY KEY (Id)
)

CREATE TABLE Tracks (
    Id INTEGER NOT NULL,
    ReleaseId INTEGER NOT NULL,
    DefectId INTEGER NOT NULL,
    UserId INTEGER NOT NULL,
    State VARCHAR(20),
    AddDate TIMESTAMP,
    LastUpdate TIMESTAMP,
    PRIMARY KEY (Id),
    FOREIGN KEY (DefectId) REFERENCES Defects,
    FOREIGN KEY (UserId) REFERENCES Users,
    FOREIGN KEY (ReleaseId) REFERENCES Releases
)
Figure 6.3: The schema for CMVC.
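The Defects table above carries enough date fields to measure defect turn-around time, a quantity that figures in the hypothesis formed later in this chapter. The sketch below is an illustration only: it rebuilds a simplified slice of Defects with invented rows in SQLite and computes the average days from AddDate to EndDate per component.

```python
import sqlite3

# Hedged sketch: a simplified slice of the CMVC Defects table (Figure 6.3),
# invented rows, SQLite for illustration. Computes average turn-around
# time (days from AddDate to EndDate) per component.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE Defects (
    Id INTEGER PRIMARY KEY, CompId INTEGER,
    AddDate TEXT, EndDate TEXT)""")
con.executemany("INSERT INTO Defects VALUES (?,?,?,?)", [
    (1, 100, "2002-03-01", "2002-03-11"),
    (2, 100, "2002-03-05", "2002-03-09"),
    (3, 200, "2002-03-02", "2002-03-04"),
])
rows = con.execute("""
    SELECT CompId,
           AVG(julianday(EndDate) - julianday(AddDate)) AS AvgDaysOpen
      FROM Defects
     GROUP BY CompId
     ORDER BY AvgDaysOpen DESC
""").fetchall()
print(rows)  # components ranked by average days a defect stays open
```

Ranking components this way makes outliers visible and gives the analysis in the outer layer a concrete starting point.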
Figure 6.4: The ERM for the test logs. (Entities: Line items (Id, Name); Scenarios (Id, Name, Release, Type, S/W available, H/W available); Attempts (Sequence, Machine, Start, End, Status, Score, Tester); Defects (Id, Name, Originator, Severity); Environment (OS, CPU, Start, End, Software). Relationships: line items are tested by scenarios (N:M), scenarios require environments (N:M), a scenario is executed in many attempts (1:M), and an attempt finds many defects (1:M).)
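Because the test logs are semi-structured text files, extraction begins by recovering the fields embedded in the log file names themselves, as noted at the start of this chapter. The naming convention below (scenario, machine, and date encoded in the name) is hypothetical; a real extractor must be written against the organization's actual convention.

```python
import re

# Sketch of turning semi-structured test-log names into structured
# records. The naming convention is hypothetical, for illustration only.
LOG_NAME = re.compile(
    r"(?P<scenario>[A-Z]\d+)\.(?P<machine>m\d+)\.(?P<date>\d{8})\.log$")

def parse_log_name(name):
    """Extract scenario, machine, and ISO date from a log file name."""
    m = LOG_NAME.search(name)
    if m is None:
        return None  # unrecognized name: flag for manual inspection
    d = m.groupdict()
    d["date"] = f"{d['date'][:4]}-{d['date'][4:6]}-{d['date'][6:]}"
    return d

print(parse_log_name("I1.m05.20020412.log"))
```

Names that fail to parse are not discarded; they are reported, since they may indicate an undocumented practice worth adding to the model.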
The first step in the design of the middle layer involves adding a temporal domain model. One means of achieving this is event modeling, which requires adding a time-stamp to each entity. As the entity changes, the time-stamp enables us to capture its history. Figure 6.5 shows the temporal definition of the tables Scenarios and Attempts. The next step in the design of the middle layer involves integrating the various data sources (or subjects) in the input layer. This step can potentially identify shortcomings in the input data sources and indicate avenues for future improvement. Typically, in software developing organizations, three types of integration cases arise.
• Two data sources refer to the same concept. However, in one this may be an entity, while in another it may be an attribute. For example, the project diary from the input layer contains the attributes owner, reviewer, and approver. These attributes represent people in the organization. However, in the organizational chart, we have an entity "People", which contains detailed data about each member of the organization. The middle layer must resolve this linkage to integrate the two subjects.
• Two data sources refer to the same concept at different levels of abstraction. In some cases,
CREATE TABLE TestScenarios (
    OnDate TIMESTAMP NOT NULL,
    Id INTEGER NOT NULL,
    Name VARCHAR(50),
    ReleaseId INTEGER NOT NULL,
    PRIMARY KEY (OnDate, Id),
    FOREIGN KEY (OnDate, ReleaseId) REFERENCES TestReleaseCMVC
)

CREATE TABLE TestAttempts (
    OnDate TIMESTAMP NOT NULL,
    Sequence INTEGER NOT NULL,
    ScenarioId INTEGER NOT NULL,
    MachineId INTEGER NOT NULL,
    StartDate TIMESTAMP,
    EndDate TIMESTAMP,
    Status VARCHAR(10),
    Score FLOAT,
    TesterId INTEGER,
    PRIMARY KEY (OnDate, Sequence, ScenarioId, MachineId),
    FOREIGN KEY (OnDate, ScenarioId) REFERENCES TestScenarios,
    FOREIGN KEY (OnDate, TesterId) REFERENCES People
)

CREATE TABLE TestLineItems (
    OnDate TIMESTAMP NOT NULL,
    ScenarioId INTEGER NOT NULL,
    LineItemId INTEGER NOT NULL,
    PRIMARY KEY (OnDate, ScenarioId, LineItemId),
    FOREIGN KEY (OnDate, ScenarioId) REFERENCES TestScenarios,
    FOREIGN KEY (OnDate, LineItemId) REFERENCES ProjDiaryLineItems
)

CREATE TABLE TestDefects (
    OnDate TIMESTAMP NOT NULL,
    Sequence INTEGER NOT NULL,
    ScenarioId INTEGER NOT NULL,
    MachineId INTEGER NOT NULL,
    DefectId INTEGER NOT NULL,
    PRIMARY KEY (OnDate, Sequence, ScenarioId, MachineId, DefectId),
    FOREIGN KEY (OnDate, Sequence, ScenarioId, MachineId) REFERENCES TestAttempts,
    FOREIGN KEY (OnDate, DefectId) REFERENCES CMVCDefects
)

CREATE TABLE TestEnvironment (
    OnDate TIMESTAMP NOT NULL,
    OS VARCHAR(25) NOT NULL,
    CPU VARCHAR(25) NOT NULL,
    StartDate TIMESTAMP NOT NULL,
    EndDate TIMESTAMP NOT NULL,
    OtherSoftware VARCHAR(256),
    PRIMARY KEY (OnDate, OS, CPU, StartDate, EndDate)
)

CREATE TABLE TestRequire (
    OnDate TIMESTAMP NOT NULL,
    ScenarioId INTEGER NOT NULL,
    OS VARCHAR(25) NOT NULL,
    CPU VARCHAR(25) NOT NULL,
    StartDate TIMESTAMP NOT NULL,
    EndDate TIMESTAMP NOT NULL,
    PRIMARY KEY (OnDate, ScenarioId, OS, CPU, StartDate, EndDate),
    FOREIGN KEY (OnDate, ScenarioId) REFERENCES TestScenarios,
    FOREIGN KEY (OnDate, OS, CPU, StartDate, EndDate) REFERENCES TestEnvironment
)
Figure 6.5: The schema for the middle layer —part 1.
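The effect of the OnDate time-stamp in Figure 6.5 is that each load adds a snapshot rather than overwriting the previous state. The sketch below illustrates this with a simplified TestScenarios table and invented rows; it is not part of the production schema.

```python
import sqlite3

# Sketch of the event-modeling idea behind Figure 6.5: every load stamps
# its rows with OnDate, so successive snapshots preserve history.
# Rows are invented, SQLite used for illustration only.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE TestScenarios (
    OnDate TEXT NOT NULL, Id INTEGER NOT NULL,
    Name TEXT, ReleaseId INTEGER,
    PRIMARY KEY (OnDate, Id))""")
snapshots = [
    ("2002-04-01", 1, "install-basic", 7),
    ("2002-04-08", 1, "install-basic-v2", 7),  # scenario later renamed
]
con.executemany("INSERT INTO TestScenarios VALUES (?,?,?,?)", snapshots)

# Both versions of scenario 1 survive, keyed by snapshot date.
history = con.execute("""
    SELECT OnDate, Name FROM TestScenarios
     WHERE Id = 1 ORDER BY OnDate""").fetchall()
print(history)
```

Without the OnDate column, the second load would have silently destroyed the record of the scenario's earlier name.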
establishing the linkage between subjects involves more than simply identifying the missing links. For example, in the CMVC schema, a table represents the concept of a release and provides details about its various aspects. This concept is similar to that presented in the test logs and the project diary. However, the level of granularity varies from one of these tables to another. In CMVC, a release captures details about various aspects of the build process. In the project diary, a release simply refers to a period of time and the version of the final product. While the project diary refers to version 1, a CMVC release refers to product-GUI-version-1, product-Core-version-1, or product-Doc-version-1. To resolve such differences of granularity, we use a mapping table, as shown in Figure 6.6.

CREATE TABLE TestReleaseCMVC (
    OnDate TIMESTAMP NOT NULL,
    TestReleaseId INTEGER NOT NULL,
    TestReleaseName VARCHAR(50),
    CMVCReleaseId INTEGER NOT NULL,
    PRIMARY KEY (OnDate, TestReleaseId, CMVCReleaseId),
    FOREIGN KEY (OnDate, CMVCReleaseId) REFERENCES CMVCReleases
)
Figure 6.6: Mapping releases in one database to another. The schema for the middle layer —part 2.
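The mapping table resolves the one-to-many relationship between a project-diary release and its CMVC releases. The sketch below illustrates the lookup; for readability it stores CMVC release names rather than the integer ids of the actual schema, and the rows follow the version-1 example from the text.

```python
import sqlite3

# Sketch of the granularity mapping in Figure 6.6. Release names follow
# the example in the text; storing names instead of ids is a
# simplification for illustration, and SQLite stands in for the warehouse.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE TestReleaseCMVC (
    OnDate TEXT NOT NULL, TestReleaseId INTEGER NOT NULL,
    TestReleaseName TEXT, CMVCReleaseName TEXT,
    PRIMARY KEY (OnDate, TestReleaseId, CMVCReleaseName))""")
con.executemany("INSERT INTO TestReleaseCMVC VALUES (?,?,?,?)", [
    ("2002-04-01", 1, "version 1", "product-GUI-version-1"),
    ("2002-04-01", 1, "version 1", "product-Core-version-1"),
    ("2002-04-01", 1, "version 1", "product-Doc-version-1"),
])

# One coarse-grained diary release expands to its fine-grained CMVC releases.
cmvc = [r[0] for r in con.execute("""
    SELECT CMVCReleaseName FROM TestReleaseCMVC
     WHERE TestReleaseName = 'version 1' ORDER BY CMVCReleaseName""")]
print(cmvc)
```

Analyses phrased at the diary's granularity can thus be pushed down to CMVC's granularity, and results rolled back up, without altering either source.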
• Two data sources may seem to refer to the same concept, but no physical link exists. In other words, two subjects clearly relate to one another, yet the data sources lack the physical connection. In our organization, every line item results in changes to the code-base. These changes follow a CMVC process: for each change a developer opens a defect, creates a track, checks out the necessary files, and, after completing the modifications, checks those files back in. Although there exists a one-to-many relationship between line items and defects, we cannot establish it from the existing data sources. Such data could provide insight into various aspects of development, such as a line item's complexity or its impact. We leverage these cases to direct future improvements: by storing these links, we can determine more rigorously the complexity of line items in terms of their impact on the code-base.
The middle layer is now subject-oriented, time-variant, and integrated. Figures 6.7 and 6.8 depict the complete schema for the middle layer. To ensure the non-volatility of the data warehouse, we must examine the loading processes in detail.
CREATE TABLE LILineItems (
    OnDate TIMESTAMP NOT NULL,
    Id INTEGER NOT NULL,
    ReleaseId INTEGER NOT NULL,
    OwnerId INTEGER,
    Status VARCHAR(25),
    SpecDoc INTEGER,
    DesignDoc INTEGER,
    UnitTestDoc INTEGER,
    ScenarioDoc INTEGER,
    CodeReviewDoc INTEGER
)

CREATE TABLE LISpecifications (
    OnDate TIMESTAMP NOT NULL,
    Id INTEGER NOT NULL,
    OwnerId INTEGER,
    ReviewerId INTEGER,
    ApproverId INTEGER,
    Abstract VARCHAR(256),
    Status VARCHAR(25)
)

CREATE TABLE LIDesigns (
    OnDate TIMESTAMP NOT NULL,
    Id INTEGER NOT NULL,
    OwnerId INTEGER,
    ReviewerId INTEGER,
    ApproverId INTEGER,
    Abstract VARCHAR(256),
    Status VARCHAR(25)
)

CREATE TABLE LIUnitTests (
    OnDate TIMESTAMP NOT NULL,
    Id INTEGER NOT NULL,
    OwnerId INTEGER,
    ReviewerId INTEGER,
    ApproverId INTEGER,
    Abstract VARCHAR(256),
    Status VARCHAR(25),
    StartDate TIMESTAMP,
    EndDate TIMESTAMP,
    Success INTEGER
)

CREATE TABLE LISystemTests (
    OnDate TIMESTAMP NOT NULL,
    Id INTEGER NOT NULL,
    OwnerId INTEGER,
    ReviewerId INTEGER,
    ApproverId INTEGER,
    Abstract VARCHAR(256),
    Status VARCHAR(25),
    StartDate TIMESTAMP,
    EndDate TIMESTAMP,
    Success INTEGER
)

CREATE TABLE LIScenarios (
    OnDate TIMESTAMP NOT NULL,
    LineItemId INTEGER NOT NULL,
    SystemTestId INTEGER NOT NULL
)

CREATE TABLE LICodeReviews (
    OnDate TIMESTAMP NOT NULL,
    Id INTEGER NOT NULL,
    OwnerId INTEGER,
    ReviewerId INTEGER,
    ApproverId INTEGER,
    EstimatedChgLoc INTEGER,
    LinesAdded INTEGER,
    LinesDeleted INTEGER,
    LinesChanged INTEGER,
    Completed CHAR(1),
    FilesCheckedIn CHAR(1),
    SanityTest CHAR(1),
    ComponentsAffected VARCHAR(256),
    FilesChanged VARCHAR(256)
)

CREATE TABLE ORGCHARTPeople (
    OnDate TIMESTAMP NOT NULL,
    Id INTEGER NOT NULL,
    Name VARCHAR(50),
    Email VARCHAR(50),
    IsMgr CHAR(1),
    WorksIn INTEGER,
    MgrId INTEGER
)

CREATE TABLE ORGCHARTDepartments (
    OnDate TIMESTAMP NOT NULL,
    Id INTEGER NOT NULL,
    Name VARCHAR(50),
    Director INTEGER
)

CREATE TYPE CodeBaseProgEntity AS (
    OnDate TIMESTAMP NOT NULL,
    FileId INTEGER,
    BeginRow INTEGER,
    BeginCol INTEGER,
    EndRow INTEGER,
    EndCol INTEGER,
    Scope INTEGER
) REF USING VARCHAR(50) MODE DB2SQL

CREATE TABLE CodeBasePrgEnts OF CodeBaseProgEntity (
    REF IS OID USER GENERATED
)
Figure 6.7: The schema for the middle layer —part 3.
CREATE TABLE CMVCComponents (
    OnDate TIMESTAMP NOT NULL,
    Id INTEGER NOT NULL,
    Name VARCHAR(100),
    UserId INTEGER,
    AddDate TIMESTAMP,
    DropDate TIMESTAMP
)

CREATE TABLE CMVCReleases (
    OnDate TIMESTAMP NOT NULL,
    Id INTEGER NOT NULL,
    Name VARCHAR(100),
    UserId INTEGER,
    AddDate TIMESTAMP,
    DropDate TIMESTAMP,
    LastUpdate TIMESTAMP,
    Process VARCHAR(40),
    Track VARCHAR(10),
    Approve VARCHAR(10)
)

CREATE TABLE CMVCHistory (
    OnDate TIMESTAMP NOT NULL,
    DefectId INTEGER,
    UserId INTEGER,
    Action VARCHAR(20),
    AddDate TIMESTAMP
)

CREATE TABLE CMVCTracks (
    OnDate TIMESTAMP NOT NULL,
    Id INTEGER NOT NULL,
    ReleaseId INTEGER,
    DefectId INTEGER,
    UserId INTEGER,
    State VARCHAR(20),
    AddDate TIMESTAMP,
    LastUpdate TIMESTAMP
)

CREATE TABLE CMVCDefects (
    OnDate TIMESTAMP NOT NULL,
    Id INTEGER NOT NULL,
    Type VARCHAR(10),
    Prefix VARCHAR(40),
    CompId INTEGER,
    State VARCHAR(20),
    Severity VARCHAR(40),
    Abstract VARCHAR(150),
    Reference VARCHAR(40),
    LastUpdate TIMESTAMP,
    AddDate TIMESTAMP,
    AssignDate TIMESTAMP,
    ResponseDate TIMESTAMP,
    EndDate TIMESTAMP,
    OriginId INTEGER,
    OwnerId INTEGER,
    ReleaseId INTEGER,
    PhaseFound VARCHAR(50)
)

CREATE TABLE CMVCFiles (
    OnDate TIMESTAMP NOT NULL,
    Id INTEGER NOT NULL,
    CompId INTEGER,
    SourceId INTEGER,
    PathId INTEGER,
    AddDate TIMESTAMP,
    DropDate TIMESTAMP,
    LastUpdate TIMESTAMP,
    BaseName VARCHAR(100),
    Type CHAR(10),
    OwnerId INTEGER,
    DistType CHAR(1),
    Scode CHAR(1)
)

CREATE TABLE CMVCPaths (
    OnDate TIMESTAMP NOT NULL,
    Id INTEGER NOT NULL,
    Name VARCHAR(100)
)
Figure 6.8: The schema for the middle layer —part 4.
As we saw in the input layer, the structure of data sources can vary significantly. The structure of a data source determines how to extract, cleanse, transform, and load its data into the middle layer. The first issue in loading data into the middle layer involves the cleansing and transformation of data. To ensure data consistency, we must first ensure that all data items are converted into a uniform format. In our example, the concept of time has different formats depending on the input layer data source. In CMVC, the date is represented as a time-stamp; in the test logs, it is represented as day and month; and in the project diary, it is represented as day, month, and year. Also, in some data sources this value is mandatory, while in others it can be null. Upon successful loading of the data from the input layer, the middle layer provides uniform and consistent access to integrated, cleansed, and historical data, which can be viewed from different perspectives. As new input data sources become known, we add them by repeating the above steps. This need for flexibility is one motivation behind using traditional relational schemas instead of multi-dimensional models. However, to better facilitate analytical processes, we use multi-dimensional models in the outer layer. The next section focuses on the definitions of the three-dimensional management model and their implementation as a multi-dimensional data model.
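The date-cleansing step described above can be sketched as follows. The format list mirrors the three hypothetical sources in the text (CMVC time-stamps, day-and-month test logs, day-month-year diary entries); the default-year rule and the handling of unparseable values are assumptions of this sketch, not prescriptions.

```python
from datetime import datetime

# Sketch of date cleansing: heterogeneous source formats are converted
# to one ISO timestamp; missing values become None instead of failing
# the load. Formats and the default-year rule are assumptions.
FORMATS = [
    "%Y-%m-%d %H:%M:%S",  # CMVC-style time-stamp
    "%d %b",              # test logs: day and month only
    "%d %b %Y",           # project diary: day, month, and year
]

def normalize_date(raw, default_year=2002):
    if raw is None or not raw.strip():
        return None  # mandatory-vs-optional is enforced downstream
    for fmt in FORMATS:
        try:
            dt = datetime.strptime(raw.strip(), fmt)
            if dt.year == 1900:           # year absent in the source
                dt = dt.replace(year=default_year)
            return dt.strftime("%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue
    return None  # unparseable: record it in the cleansing report

print(normalize_date("12 Apr"))               # test-log style
print(normalize_date("2002-04-12 09:30:00"))  # CMVC style
```

Values that cannot be normalized are logged rather than dropped silently, since a cluster of failures usually signals an undocumented source format.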
6.4 Outer layer: defining quality, progress, and health
The outer layer provides the mechanism for creating hypotheses about the software development process. First, by defining a three-dimensional management model, we can extract the necessary data from the middle layer, facilitating the monitoring of the project. Second, by comparing the computed value for each dimension against its planned value, we can determine potential problems. Lastly, upon identification of a problem, we analyze the values along each dimension to formulate a hypothesis. For example, during our case study, a shortcoming in the quality value correlated with the size and growth of a particular component. When a defect remained unresolved against this component, the corresponding test scenario could not proceed with its execution, resulting in a lower quality index. This led us to hypothesize that the size and growth of components correlate with the defect turn-around time, which can impact quality.
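The comparison of computed against planned values can be sketched as a simple threshold check. The dimension names follow the three-dimensional model of Chapter 3; the planned values, computed values, and tolerance below are invented for illustration.

```python
# Sketch of the monitoring step: computed values along each management
# dimension are compared with their planned values; shortfalls become
# candidate problems to investigate. All numbers are invented.
def flag_problems(planned, computed, tolerance=0.05):
    """Return dimensions whose computed value falls short of plan."""
    return sorted(dim for dim, want in planned.items()
                  if computed.get(dim, 0.0) < want - tolerance)

planned = {"quality": 0.95, "progress": 0.80, "health": 0.90}
computed = {"quality": 0.78, "progress": 0.81, "health": 0.91}
print(flag_problems(planned, computed))  # the quality shortfall triggers analysis
```

A flagged dimension is not a conclusion but a trigger: the analyst then drills into its subgoals to formulate a testable hypothesis, as in the component-growth example above.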
To develop a three-dimensional management model, as described in Chapter 3, we must consult the organization's existing documentation on quality management and development processes. The following list outlines the potential documents to look for to construct a definition for each of the three dimensions. Once again, it must be noted that the goal of this step is to formalize the informal goals of the organization along the three dimensions. Improvement efforts must be deferred until supporting data becomes available.
• Quality: A key document that provides insight into an organization's quality objectives is the quality management system (QMS) (ISO, 1990). As required by the ISO 9000-2000 standards, this document explains the details of an organization's development and quality assurance processes. Using these processes, we can construct a local definition of quality. In cases where a QMS does not exist, we must rely on other sources, such as the organization's "shipping criteria". Such a document provides the set of criteria by which an organization determines whether or not to make its product generally available. In our example, the shipping criteria state that the system test team must complete 99 percent of their scenarios, the regression testing team must complete 95 percent of their test buckets, and the usability team must give the product a passing mark. Figure 6.9 depicts the top level of our quality goal. The regression testing team continually executes various test cases against the most recent version of the product. Test cases are grouped into two buckets, one focusing on critical and another on non-critical functionalities. The regression testing ensures that the new functionalities have not caused previous functionalities to regress. The value of each subgoal is the average score of its test scenarios: B1 to B3 for critical functions and B4 and B5 for non-critical functionalities. The non-critical functionalities subgoal represents aspects of the product that, in cases of severe problems, can be worked around. In other words, if the team faces a resource shortage, management may reduce the expectations on this subgoal to allocate resources elsewhere. To capture the management influence on this subgoal, instead of frequent modifications to the quality definition, we use a positive external influence, as shown in Figure 6.9.
Figure 6.9: Top level definition of a quality goal. (Quality decomposes into regression testing, usability, and system testing. Regression testing splits into critical functions, scored by buckets B1 to B3, and non-critical functions, scored by buckets B4 and B5, with a positive external "workaround" influence on the non-critical subgoal. System testing refines into planning, preparation, and executions, and further into install, stress, and interface scenarios (S1, S2, I1, I2, L1), each with line item coverage and environment analysis subgoals.)
Figure 6.10: Lower level definition of a quality goal —part 1. (The executions subgoal refines into stress, interface, and install scenarios S1, S2, I1, I2, and L1; each scenario carries a defect management subgoal and its individual execution attempts, here attempts 1 through 9, with a positive external "workaround" influence.)
Another subgoal immediately below quality is usability. The main concern of the usability team is to determine how easily and effectively users can utilize the product. However, as the definition of the central data warehouse indicates, we have no access to the data from this team. As such, we cannot refine this subgoal. When faced with such scenarios, we must make a choice. One approach ignores the usability team altogether and removes its subgoal from the definition of quality. In some cases, a concern such as usability is beyond the scope of the management's responsibility at this level. Another approach, as shown in Figure 6.9, includes the usability subgoal without further refinement. The value of this subgoal is binary: whether or not the product is "usable". The definition of usable remains undefined in the current management scope. Should the management, perhaps at higher levels of responsibility, require a more granular definition of this aspect of quality, we will need to integrate the usability team's data and refine the definition accordingly. This approach provides a means of scaling the quality definition based on the perspective of the responsible manager.
Figure 6.11: Lower level definition of a quality goal —part 2. (The prepare subgoal refines into stress, interface, and install scenarios S1, S2, I1, I2, and L1; each scenario carries H/W & S/W requirement and resource allocation subgoals.)
The system testing constitutes the last subgoal of quality. This subgoal is further refined along
three issues: planning, preparation, and execution. These issues aim to enforce the idea of full life cycle testing. At the early stages of specification, the quality assurance team begins by planning test scenarios, assessing the requirements for testing them, and, as time passes, allocating appropriate resources to carry out their execution. Each subgoal is further refined along the components of the testing activity: stress, interface, and install. The stress testing exercises the product in various customer scenarios under various operating environments for extended periods of time. The interface testing ensures that the product integrates with other products. Lastly, the install testing ensures that the product can be installed under various conditions. As Figures 6.10 and 6.11 show, each component is further refined along its respective scenarios. The value for each subgoal is computed by extracting the appropriate data from the central data warehouse. Table 6.2 shows an example definition of defect management for the execution of test scenarios.
100  if 2 or fewer low severity defects remain open
 80  if 3 to 5 low severity defects remain open
 50  if 1 or 2 high severity defects remain open
  0  otherwise

Table 6.2: A definition for the defect management subgoal.
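Table 6.2 translates directly into a scoring function. One point is an interpretation on our part: the table does not state how mixed counts combine, so the sketch below assumes the 100 and 80 bands apply only when no high-severity defects are open.

```python
# Transcription of Table 6.2. Assumption (not stated in the table):
# the 100 and 80 bands require zero open high-severity defects.
def defect_mgmt_score(low_open, high_open):
    """Defect management value for one scenario, from open defect counts."""
    if high_open == 0 and low_open <= 2:
        return 100
    if high_open == 0 and 3 <= low_open <= 5:
        return 80
    if 1 <= high_open <= 2:
        return 50
    return 0

print(defect_mgmt_score(low_open=1, high_open=0))  # 100
print(defect_mgmt_score(low_open=0, high_open=2))  # 50
```

The low and high counts come from counting a scenario's open defects by severity in the central data warehouse, as described in the surrounding text.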
By counting the number of defects for each scenario in the TestDefects table in the central data warehouse, we can determine the value of defect management for that scenario. The table TestAttempts in the central data warehouse stores the score for each scenario's execution (i.e., attempt). The values for the line-item coverage, environment analysis, H/W and S/W requirement, and resource allocation subgoals are binary: 100 if a record exists in the table TestLineItems, 0 otherwise. Currently, we cannot make finer distinctions on these subgoals based on the available data. To address such shortcomings, we raise the weight of other, less subjective subgoals. In our example, we raise the weight of the execution subgoal to a higher value, e.g., 2.
• Progress: In its simplest form, the definition of the progress goal resembles a checklist of
the activities to perform. Depending on the sophistication of an organization, there may exist more elaborate means of measuring progress. In either case, a key document that provides the necessary insight into the progress goal is the development process manual. This document typically outlines how to divide the requirements into groups, how to carry out a design, what steps to take to complete the implementation, and other details of the development process. For example, a requirements document may be structured as a set of use-cases (Jacobson, 1994), each of which will be designed and implemented by a group of developers. In our case study, which will be presented in the subsequent chapters, the requirements are divided into line items. Each line item groups a set of related functionalities. It is self-contained and can be assigned to a small group of developers (e.g., 1 to 5 people). Figures 6.12 and 6.13 depict the definition of the progress goal for our example.
Figure 6.12: Top level definition of a progress goal. (Progress decomposes into line items 1 through 4.)
The progress goal is refined along the line items (Figure 6.12). In other words, we define our progress goal as the achievement of line items 1 and 2 in conjunction with the completion of at least one of line items 3 and 4. Completion of each line item involves four steps. Figure 6.13 depicts these steps for line item 1. First, the specification of a line item must be "formalized". This involves the creation of a detailed specification document, reviewing it, assessing its impact, and finally approving it. In our example, we aim to emphasize the importance of the approval of specifications. As such, we assign a higher weight to this subgoal, e.g., 2, to capture this importance. A historical analysis may reveal that the review of functional specifications is frequently bypassed or not enforced. To model the lack of enforcement of the review process, we add a justification disjunct: a developer can either review a document or provide a justification. Furthermore, if in some cases functional specifications are bypassed altogether, we model
Figure 6.13: Top level definition of a progress goal. (Line item 1 decomposes into specification, design, coding, and integration. Specification comprises the functional specification (with a review-or-justify disjunct), impact analysis, approval, and documentation. Design comprises high level and low level designs, each with documentation and a review-or-justify disjunct. Coding comprises function modification, components affected, dependencies, review, and a unit test with check-in, build, sanity test, approval, review, and score subgoals.)
such scenarios using external influences. For example, due to schedule pressures, management may waive the functional specification for less complex line items. In other words, we model management's intervention as an external influence. The effect of the external influence is identical to removing the subgoal; however, it provides a historical picture of the changes made to the initial progress goal. The other subgoals of line item 1 are design, coding, and integration. As the schema for the project diary indicates (see Figure 6.2), the Designs table stores the design documents as binary objects. This implies that, without sophisticated tools, we cannot infer anything about a design document except whether or not it exists. To facilitate a finer distinction on a design, we must extract more data from these objects. Such an evolutionary step falls under continual improvement and, as such, must be driven by its potential benefits to the management process. The code modification subgoal relies on subjective metrics provided by the developers. From time to time, developers estimate the total lines of code that they need to write to implement the line item. We use this value against the total lines added or changed to estimate completion.
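The completion estimate described above can be sketched as a simple ratio. Capping the result at 100 percent is a design choice of this sketch, motivated by the fact that developer estimates are subjective and frequently low; the numbers are invented.

```python
# Sketch of the code-modification completion estimate: lines added plus
# lines changed, over the developer's estimated lines of code. Capping
# at 100 is an assumption of this sketch; values are invented.
def coding_completion(estimated_loc, lines_added, lines_changed):
    """Completion percentage of a line item's coding subgoal."""
    if estimated_loc <= 0:
        return 0.0  # no estimate recorded yet
    done = lines_added + lines_changed
    return min(100.0, 100.0 * done / estimated_loc)

print(coding_completion(estimated_loc=400, lines_added=250, lines_changed=50))
```

In the warehouse, the inputs correspond to the EstimatedChgLOC, LinesAdded, and LinesChanged fields of the code review records.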
The values of lines of code added or changed are updated prior to the regular project status meetings. The review subgoal is a binary flag: if the coding is complete, the reviewer information is in place, and the list of files changed is not empty, then we assume the review is complete. The same rule applies to the approve and review subgoals of unit testing: if the name of the person involved is not null in the database, then the approval (or review) is complete. The score of the unit test is a subjective metric provided by the developer, e.g., the number of functions tested over the total number of functions added or changed. The integration subgoal could be tracked through the CMVC tables. When files are checked in, they are grouped under a track; when a track is integrated into the nightly build, it changes from the working to the integrated state. However, as previously discussed, we have no means of connecting a line item to CMVC in our data warehouse. Instead, developers must update the project diary with the status of their changes.
• Health: The issues relating to code-base health rarely receive attention on their own. Typical organizations maintain a list of best practices as coding guidelines, which may or may not be enforced. Such documents typically combine issues relating to progress or quality with code-base health. For example, some organizations may use variations of line counting to estimate their progress. Others may use McCabe's complexity (McCabe, 1976) to assess the quality of the final product. This dissertation separates the concerns of code-base health from quality and progress. Based on local data, we study possible correlations among the three axes of the management model. The first step in defining health involves identifying the characteristics of interest.
These may include the average size of program source files, the number of functions per file, complexity measures of a file, and various notions of structuredness such as fan-in and fan-out. In our example, we define the code-base health as a conjunction of McCabe’s complexity, average code size, average growth rate, and structuredness. Figure 6.14 depicts the health definition. Each issue in Figure 6.14 is further refined along the components of the code-base. The figure shows two components: FileMgr and GUI. Each component further refines to its constituent
[Figure 6.14: The definition of the health goal. The figure shows a goal tree refining Health into Complexity, Growth, Size, and Structure subgoals; each subgoal refines along the components of the code-base (FileMgr, GUI, ...), then along their files (index.C, table.C, buf.C), down to function-level metrics such as the complexities of getNxtLnk, getPrvLnk, and getValue.]
files, and possibly to finer granularity such as functions. At the file level, the complexity is computed as the average of the complexities of a file’s functions. Similarly, at the component level, complexity is measured as the average of the complexities of its files. The overall complexity of the code-base is then defined as the average complexity of its components. Similarly, we define the other subgoals. A sensitive issue in the construction of the health goal involves the selection of the metrics. There are no standard definitions for metrics pertaining to a particular characteristic and their respective values (Kitchenham, 1990). Since our approach has a local focus, we base our metric definitions and their respective values on earlier releases of the product —ideally, a version that is considered “good”. In other words, we benchmark the health of the code-base against its previous releases. As data becomes available, we improve the definitions of the metrics and their associated values. Table 6.3 shows a set of example definitions.
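The function-to-file-to-component averaging just described can be expressed directly; the nested data below (components, files, function complexities) is a hypothetical snapshot invented for the illustration:

```python
from statistics import mean

# Hypothetical snapshot: component -> file -> list of function complexities.
codebase = {
    "FileMgr": {"index.C": [12, 18, 9], "table.C": [22, 15]},
    "GUI":     {"buf.C":   [8, 11]},
}

def file_complexity(function_complexities):
    # File-level complexity: average over the file's functions.
    return mean(function_complexities)

def component_complexity(files):
    # Component-level complexity: average over the component's files.
    return mean(file_complexity(funcs) for funcs in files.values())

def codebase_complexity(components):
    # Overall complexity: average over the components.
    return mean(component_complexity(files) for files in components.values())
```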
Using the definitions of the quality, progress, and health goals provided in this section, at regular intervals we extract appropriate data from the central data warehouse to compute indices for each of these goals and compare them against their planned values, facilitating tactical and strategic analyses (as discussed in Chapter 3). The following section shows an implementation of our three-dimensional management model using multi-dimensional data models and aggregate data sources (Kimball, 1996; Pedersen and Jensen, 2001).
6.5 Analytic tools
The definition of the three-dimensional management model provides a means of monitoring the software development project. As the project gets under way, we compute the indices and compare them against their respective planned values. When one or more of the indices fail to meet their planned values, detailed analyses may reveal the existence of a problem. In such cases, management initiates tactical and strategic analyses by drilling down the indices, formulating hypotheses, and gathering data to support or reject them. To facilitate this process, we develop several analytical tools. Here we address three such tools.
Size:            100 if L < 700;  90 if 700 < L < 900;  80 if 900 < L < 1100;  0 otherwise
                 (L is the total lines of code)
Growth:          100 if D < 20;  90 if 20 < D < 30;  80 if 30 < D < 40;  0 otherwise
                 (D is the growth percentage)
Complexity:      100 if M < 16;  90 if 16 < M < 20;  80 if 20 < M < 24;  0 otherwise
                 (M is McCabe’s complexity)
Structuredness:  100 if fi < a and fo < a;  50 if (fi < a and fo > a) or (fi > a and fo < a);  0 otherwise
                 (fi is the fan-in and fo is the fan-out)

Table 6.3: Example metric definitions.
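Table 6.3 translates mechanically into scoring functions. Note that the table leaves the boundary values (e.g. L exactly 700) undefined; this sketch folds each boundary into the lower-scoring band, which is our own assumption:

```python
def size_score(L):
    """Score for total lines of code L (Table 6.3)."""
    if L < 700:  return 100
    if L < 900:  return 90
    if L < 1100: return 80
    return 0

def growth_score(D):
    """Score for growth percentage D."""
    if D < 20: return 100
    if D < 30: return 90
    if D < 40: return 80
    return 0

def complexity_score(M):
    """Score for McCabe's complexity M."""
    if M < 16: return 100
    if M < 20: return 90
    if M < 24: return 80
    return 0

def structuredness_score(fan_in, fan_out, a):
    """Score for fan-in/fan-out against a benchmark threshold a,
    which the text takes from an earlier 'good' release."""
    if fan_in < a and fan_out < a:
        return 100
    if fan_in < a or fan_out < a:
        return 50
    return 0
```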
• Aggregate data sources: To accelerate the analysis of the indices in the three-dimensional management model, we compute some of them in advance and store them in aggregate tables. For example, computing the size of a file involves counting the number of statements in that file. Similarly, computing complexity involves counting the binary decisions (Kitchenham, 1990) in a function. These steps can be performed during the refresh periods, when loading the central data warehouse, storing the results in aggregate tables. Figure 6.15 shows the aggregate tables supporting the health indices. We can implement other aggregations to facilitate computing project indices.

CREATE TABLE AggFunctionComp (
    OnDate     TIMESTAMP NOT NULL,
    FunctionId INTEGER   NOT NULL,
    Complexity INTEGER   NOT NULL,
    PRIMARY KEY (OnDate, FunctionId)
)

CREATE TABLE AggFileGrowth (
    OnDate TIMESTAMP NOT NULL,
    CompId INTEGER   NOT NULL,
    FileId INTEGER   NOT NULL,
    Time1  TIMESTAMP,
    Time2  TIMESTAMP,
    Growth FLOAT,
    PRIMARY KEY (OnDate, CompId, FileId),
    FOREIGN KEY (OnDate, CompId) REFERENCES CMVCComponents,
    FOREIGN KEY (OnDate, FileId) REFERENCES CMVCFiles
)
CREATE TABLE AggFileSize (
    OnDate TIMESTAMP NOT NULL,
    CompId INTEGER   NOT NULL,
    FileId INTEGER   NOT NULL,
    LOC    INTEGER,
    PRIMARY KEY (OnDate, CompId, FileId),
    FOREIGN KEY (OnDate, CompId) REFERENCES CMVCComponents,
    FOREIGN KEY (OnDate, FileId) REFERENCES CMVCFiles
)

CREATE TABLE AggFileStructure (
    OnDate TIMESTAMP NOT NULL,
    CompId INTEGER   NOT NULL,
    FileId INTEGER   NOT NULL,
    FanIn  INTEGER,
    FanOut INTEGER,
    PRIMARY KEY (OnDate, CompId, FileId),
    FOREIGN KEY (OnDate, CompId) REFERENCES CMVCComponents,
    FOREIGN KEY (OnDate, FileId) REFERENCES CMVCFiles
)
Figure 6.15: Aggregate tables for accelerating index computations.
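The tables above are DB2 DDL; the following SQLite sketch shows how a refresh step could roll function-level complexity up into a file-level aggregate. The FileId column and the sample rows are invented for the illustration (the AggFunctionComp table above keys only on OnDate and FunctionId):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE AggFunctionComp (
        OnDate     TEXT    NOT NULL,   -- TIMESTAMP in the DB2 schema
        FunctionId INTEGER NOT NULL,
        FileId     INTEGER NOT NULL,   -- hypothetical link to the owning file
        Complexity INTEGER NOT NULL,
        PRIMARY KEY (OnDate, FunctionId)
    )""")
con.executemany("INSERT INTO AggFunctionComp VALUES (?, ?, ?, ?)",
                [("2002-11-01", 1, 10, 12),
                 ("2002-11-01", 2, 10, 18),
                 ("2002-11-01", 3, 11, 9)])

# Refresh step: pre-compute file-level complexity from the function level.
file_comp = con.execute("""
    SELECT FileId, AVG(Complexity)
    FROM AggFunctionComp
    WHERE OnDate = '2002-11-01'
    GROUP BY FileId
    ORDER BY FileId""").fetchall()
```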
• Project monitoring: Leveraging multi-dimensional data models and their associated schemas, we can efficiently implement our three-dimensional management model. Figure 6.19 depicts the top level of a snowflake schema for our project monitoring. The schema is fact-less (Kimball, 1996) with keys for each of its four dimensions.
– Time: This dimension captures when the data was sampled.
– Progress: This dimension represents the implementation of the progress goal as shown in Figure 6.16.
– Quality: This dimension represents the implementation of the quality goal as shown in Figure 6.17.
[Figure 6.16: A snowflake schema for the progress dimension. A progress fact carries time and line-item dimensions; each line item links to specification (with functional specification and impact analysis), design (high- and low-level), coding (with unit test), and integration dimensions, each with review, approval, and value attributes.]
[Figure 6.17: A snowflake schema for the quality dimension. A quality fact carries time, regression testing (critical and non-critical function keys), and system testing dimensions; system testing refines into planning (scenario, line-item coverage, external influence, stress planning), preparation, execution, and install dimensions, with interface, environment-analysis, and usability-score attributes.]
– Health: This dimension represents the implementation of the health goal as shown in Figure 6.18.
[Figure 6.18: A snowflake schema for the health dimension. A health fact carries time, size, growth, complexity, and structuredness dimensions; each refines along component-level values (component size, growth, complexity, structuredness) down to file- and function-level values.]
By joining these tables, we can determine the values of project indices at different times and phases of the project. The concept of “phase” in the time dimension provides a means of comparing two projects based on their relative time. For example, we can compare release 1 to release 2 with respect to the average progress indices during their initiation phases. Each of the quality, progress, and health goals is defined using a snowflake schema. Figure 6.18 depicts the schema for health. Health has four dimensions, each of which represents one of its subgoals. These dimensions further refine along their component dimensions. The leaf values are those computed by the aggregate tables or obtained directly from the central data warehouse. • Defect analysis: Defect analysis constitutes one of the main tools of software management. To demonstrate how the data from the central data warehouse can be leveraged for the creation of tools that facilitate management reporting, we develop a star schema (Kimball, 1996) for
[Figure 6.19: A snowflake schema for the three-dimensional management model. A fact-less project fact carries quality, progress, health, and time keys; the time dimension has day, month, year, and phase attributes.]
defects. Figure 6.20 depicts this star schema. Defects have five dimensions: owning manager, component, time of arrival, development phase in which the defect was detected, and files involved in the fix of the defect. Using this schema, we can generate various defect reports along different dimensions.
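A minimal cut of such a star schema, with only the component dimension and invented sample rows, shows how a report rolls up along a dimension:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- One dimension of the defect star schema; the manager, arrival-time,
    -- phase-found, and files dimensions are omitted from this sketch.
    CREATE TABLE ComponentDim (ComponentKey INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE DefectsFact (
        DefectId     INTEGER PRIMARY KEY,
        ComponentKey INTEGER REFERENCES ComponentDim,
        Severity     INTEGER
    );
    INSERT INTO ComponentDim VALUES (1, 'FileMgr'), (2, 'GUI');
    INSERT INTO DefectsFact VALUES (100, 1, 2), (101, 1, 3), (102, 2, 1);
""")
# A defect report along the component dimension:
report = con.execute("""
    SELECT c.Name, COUNT(*)
    FROM DefectsFact d JOIN ComponentDim c USING (ComponentKey)
    GROUP BY c.Name
    ORDER BY c.Name""").fetchall()
```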
Overall, the uniform access to the data in the middle layer facilitates the development of various analytical tools providing insights into the software development project. However, the three-dimensional management model must drive the development of such tools to avoid a redundant growth of tools that do not support specific analytical goals.
6.6 Summary
This chapter presented a methodology for the design and implementation of a software data warehouse. The four-step methodology involves the identification and characterization of the available data sources, the design of an integrated schema for these data sources, the definition of the three-dimensional management model, and finally the development of analytical tools using multi-dimensional
Component dimension Component Id
Manager dimension Id
Defects fact
Name
Manager key
Manager Id
Component key Arrival time key
Name
Number of files
Phase found dimension
Phase key
Phase Id
Year
Files key
Name
Month
State
Day
Severity
Arrival time dimension
Trigger
Files dimension Record Id File Id Name
Figure 6.20: Star schema for defect analysis and reporting.
CHAPTER 6. AN IMPLEMENTATION METHODOLOGY
100
data models. The key issues in the design of the software data warehouse include non-intrusion, flexibility, and fact-based improvement. Overall, a software data warehouse provides uniform access to temporal, integrated, and consistent data that, through a three-dimensional management model, can support the decision-making process. The next two chapters will review an industrial case study that used a similar methodology for the design and implementation of a software data warehouse. Finally, Chapter 10 will provide an evaluation of software data warehousing as described thus far.
Chapter 7

A software management case study

This chapter and the subsequent three chapters will present an application of the ideas introduced so far to an industrial case study. The chapter begins by providing an overview of the software product and its associated processes to familiarize the reader with the overall development project. Subsequent chapters will focus on the implementation of the middle and the outer layers of the software data warehouse as well as the management evaluations.
7.1 Overview
The software product in this case study is over ten years old and has gone through over five major releases. Each release has added numerous features to the previous release and has fixed many defects found in previous releases. At the start of this case study, the number of people involved in the development of this product was over five hundred. The total number of C/C++ files in the code-base currently exceeds twenty thousand, with over four million lines of code. An initial study revealed that the product has grown, on average, by 40 percent with each release. To put this growth in perspective, each release, on average, added nearly 40 percent more new features than the previous release. The size of the code-base increased by nearly 40 percent in terms of lines of program code and program files. The number of people working on the product has grown
by nearly 40 percent for each new release. This number includes all people involved in the design, implementation, testing, and management of the product.
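Since the 40 percent figure applies per release, the growth compounds; a quick check of the arithmetic (the numbers below only illustrate the calculation):

```python
def size_after(initial, releases, rate=0.40):
    """Size after a number of releases, each growing the code-base
    by the given rate (roughly 40 percent in this case study)."""
    return initial * (1 + rate) ** releases

# Five releases at 40 percent growth multiply the code-base roughly 5.4-fold.
factor = size_after(1.0, 5)
```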
7.2 Development processes
The product development follows a pattern similar to that described in Chapter 1 (see Figure 7.1). A team of marketing, business, and technical people propose a set of new features to be added to the previous release. They evaluate these features for their technical and business feasibility. Some features are dropped, while others are included in the final plan. The final features are approved and frozen as the official requirement of the new product release. Each feature is assigned to a team member. The more complex features are assigned to more experienced developers while less complex features are assigned to less experienced ones. Depending on the size of the assigned feature the owning developer may have several developers in her team. The owning developer is responsible for completing a specification document, which select members of the team can review and comment on.
[Figure 7.1: Software release lifecycle. Features and defects flow through Design Code Unit Test (DCUT), Build, and Function/System/Performance Testing to General Availability (GA); defects then feed Fixpack/Minor Release cycles leading to the next GA.]
Upon completion of the review of the specification document, the owning developer and her team proceed with the high- and low-level design, followed by implementation. (Due to the confidentiality of the studied product, numbers and dates are rounded to the nearest decimal.) The team implements
their feature on a local copy of the product. Each development team creates a copy of the product in their personal directory and applies their changes only to this copy. The developers are responsible for the unit testing and functional testing of their feature and of the new version of the product — one including the new feature. Depending on the complexity of a feature, the design document, program code, and test cases may be reviewed and approved by select members of the team. Upon completion of the unit testing and functional testing, the new feature is integrated into the product. During the regular building of the product, the new feature is integrated and some rudimentary tests are performed. If the build process successfully integrates the new feature with the code-base, other testing processes can begin; otherwise, the new feature is returned to the development team. In the case of successful integration, the test cases used for functional testing are transferred to regression testing. The objective of regression testing is to ensure that the new feature has not compromised other previously available functionalities of the product. In other words, the new release must support all of the previously developed features as well as the new ones when appropriate. In parallel to design and coding, a separate team plans the testing of the product as a whole, i.e., system testing. This team studies the new features and develops a set of test cases for exercising the product with its new features. Each test case aims to emulate a customer usage scenario, testing one or more of the new features in conjunction with the previously released features of the product. Since each test case covers a number of features and possibly requires some of the previously added features, the system test cannot begin until a number of features have been successfully integrated. As a result, the system test cannot begin until late in the development cycle.
Furthermore, the system test generally requires the code-base to be frozen, i.e., no new features can be added; otherwise, test cases must be rerun after every change to ensure that the product has not regressed. Performance testing ensures that the new release meets its non-functional requirements. Typically this involves testing the product against a set of industry-standard benchmarks. In cases where there are no set requirements, the performance of the current release is measured against the previous one — the new release should not regress more than a small amount.
Depending on the type of the product, it may be necessary to expose the product to more testing. For example, a product may be tested for its integration with other products. Also, the product may be assessed for its usability, ease of installation, help facilities, and other non-functional requirements. These verification steps must take place after the product has reached a stable state, i.e., when the product does not change except for the correction of defects. It is important to note that, as development progresses, defects may be found in previous releases. These may have to be forward-fitted to the current release.
7.3 Management processes
Similar to development processes, management follows a mature process. The product development is divided into sub-projects, each of which is assigned to a manager. Depending on the complexity of the sub-project, it may be further broken down into smaller sub-projects and assigned to other managers. All managers, regardless of the size of their sub-project, are responsible for the planning, execution, and monitoring of the sub-project — their concerns. The successful delivery of the product is the responsibility of the release team. This team consists of a release manager and a number of technical members from across the project. They apply project management principles across the product development to ensure that sufficient progress is made according to the initial plans and that the product will meet its quality criteria. Furthermore, the release team must identify resource shortcomings, schedule slippage, and any other problems and report these to the management team. All managers meet regularly to report on the status of their sub-project. Typically, each sub-project generates and stores its own data. The manager of the sub-project, using the available data and other rules of thumb, determines the status of the sub-project. During the status meetings each manager reports on the progress made, identifies possible problems and their impacts, and proposes possible solutions. Based on the nature of the problems, the development manager may inform the entire team of a critical situation and request extra effort to resolve the situation. A case in point is the reduction of customer defect backlog.
7.4 Status quo
The product under study has mature development and management processes. However, one challenge remains: to continually improve and optimize these processes to cope with the rapid growth of the product. The continual growth of the product has resulted in the isolation of processes, i.e., little or no data sharing. Figure 7.2 depicts the various processes that were identified, with their associated data sources. Currently, there is no formal process for defining goals, monitoring them, or predicting their outcomes. Goals are set informally and the monitoring processes are ad hoc. As problems arise, the management team requests reports on various aspects of the project. Each team, based on their local data, generates reports on what they have accomplished and their future plans — their local concerns.
[Figure 7.2: Management and development processes and their associated data sources, including the project diary, system test data, regression test reports, customer reported problems, CMVC, performance reports, and the code-base, kept as file-based textual data or structured data.]
The ad hoc nature of reporting contributes to several problems. One problem is that of isolated
data sources. Each team has its own concerns and maintains its own data accordingly. Though teams have their own concerns, these are not independent of one another. The performance of one team can have a significant impact on others. Similarly, a team’s data could provide insights into other teams’ concerns and the overall project. Other problems relating to data isolation are semantics and format. Due to the diversity of concerns among teams, there are frequently various meanings for common terms. For example, different teams have different meanings for the term “closed defect”. Developers consider a defect closed when its fix has been submitted for integration. The system test team considers a defect closed when its fix has been integrated into the product. Yet, the customer service representatives view a defect as closed when the fix is tested and verified. This problem is further exacerbated by data types and formats. For example, in this case study various teams used different data types to refer to the current release of the product, e.g., 2.1, v2.1, product-v2.1, product-v2. The informality of the project goals contributes significantly to troubles in the monitoring process. Much time during status meetings is spent verifying other teams’ data and their interpretations. Frequently, due to data inaccuracies, reports must change, which results in mistrust in reporting and further isolation of teams. This problem becomes of great significance when a project reaches a magnitude such as in the described case study. The stated problem is in accord with Brooks’ observation (Brooks, 1995) that as the number of people grows, communication becomes a major obstacle to success. Trying to correct a mis-communication can prove to be of little value. Another problem resulting from the ad hoc monitoring is prediction. Teams need to make predictions about their future performance, compare predicted values against actuals, and improve their performance.
However, the lack of integrated and temporal data from the monitoring processes makes predictions difficult and sometimes impossible. A sound prediction system requires access to past performance data in order to estimate future outcomes. Chapter 3 discussed the importance of the continual improvement of goal definitions, their monitoring system, and prediction models. However, the lack of access to the necessary data makes improvement undertakings very costly, if not impossible. As a result, management teams rely on anecdotal evidence for improvement or completely distrust new methods. During the case study we identified
numerous “rules of thumb” and “proven practices” that managers employed to cope with rising problems. These rules and practices are rarely, if ever, validated. In some cases, such rules and practices are of great value, but the lack of empirical evidence makes it difficult to expand the circle of their application. In other cases, they had become stale, but the lack of empirical evidence made it impossible to either discard or improve them. The above-mentioned problems also contribute to tactical problems such as causality analysis. When a problem is identified, a team is assigned to identify its cause and propose a solution. In most cases, the lack of access to integrated data results in senior technical members of the team leaving their jobs to undertake problem determination tasks. This process can be very time-consuming, and the lack of access to support tools can hinder its performance. Moreover, even after a possible cause is identified, a solution cannot be easily chosen. Once again, solutions are anecdotal or based on the intuition of senior members of the team. There is no framework for the incremental deployment of such solutions, monitoring and assessing their performance, and improving and optimizing them.
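One concrete cleansing task behind the format problems noted earlier, the variant spellings of the release identifier (2.1, v2.1, product-v2.1, product-v2), can be sketched as follows; the regular expression and the canonical major.minor form are our own assumptions, not the project's actual convention:

```python
import re

def normalize_release(raw):
    """Map variant spellings of a release identifier to 'major.minor'.
    An identifier without a minor part (e.g. 'product-v2') normalizes
    to '<major>.0' under this sketch."""
    m = re.search(r"v?(\d+)(?:\.(\d+))?$", raw.strip().lower())
    if not m:
        raise ValueError(f"unrecognized release identifier: {raw!r}")
    return f"{m.group(1)}.{m.group(2) or '0'}"
```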
7.5 Summary
The management for the discussed case study, similar to other successful commercial software developments, is reactive: management is always reacting to problems. In light of the growth of software products, the management must be pro-active: management must plan, predict, and prepare for potential problems. Furthermore, as the processes are executed they generate data. Some data is discarded, some is kept in unstructured sources, and very little is kept in formal and structured repositories. As the product ages, such data can shed light on the overall planning, prediction, and management of the product. A key supporting tool for improving the product management is a software data warehouse. Chapter 8 presents an overall picture of a software data warehouse based on the ideas discussed in Chapter 4.
Chapter 8

An implementation of a software data warehouse

The previous chapter discussed the overall picture of a software product development case study. It pointed out that as the product grows, the existing management processes often fail to produce the necessary results. A key contributing factor is the lack of factual data about various aspects of the development process. Data is needed for better planning and decision making. As the amount of data increases, a data-centric solution becomes crucial. This is due to the high frequency of changes to the planning, monitoring, and prediction processes. This chapter describes the implementation of the middle layer of our software data warehouse (see Figure 4.2). This layer represents a temporal and integrated view of the software project data. The chapter begins by showing the overall architecture of the prototype implemented for this case study. It provides an overview of the data stored in the middle layer and some of the operational issues. An objective of this chapter is to demonstrate the growth of the complexity of management tasks by outlining the size of the data, its growth, and the diversity of its sources. This discussion facilitates the discussions of the subsequent chapters.
8.1 Overall architecture
The first implementation of the software data warehouse used IBM DB2 V7.2 running on a Linux server. Figure 8.1 depicts the overall architecture, which has three layers: the input layer, the middle layer or central data warehouse (CDW), and the outer layer or data mart layer. The input layer extracts data from various sources, cleanses it, and loads it into the CDW. The CDW is a temporal and integrated repository of all data from the product development. The outer layer represents the subject-oriented data used by the analytical processes.
[Figure 8.1: The architecture of the software data warehouse developed for the case study. The input layer (Lotus Notes via LEI/DB2, test databases in DB2, the CMVC server, spreadsheets, and the Datrix loader) extracts, cleanses, and loads the data from various sources into the middle layer, a DB2 data warehouse on a Linux server; the data in the middle layer is made available through the web server and external clients.]
For the input layer, various scripts were developed that extract, cleanse, and time-stamp the data. For example, the program files are parsed using Datrix parsers, resulting in a number of abstract semantic graphs (ASGs). A transformer program reads these ASG files, converts them to the object-relational schema described in Chapter 5, and loads the result into the CDW. The data from the project diary is extracted using a set of scripts written in LotusScript and stored in DB2 tables. The majority of the test data is now stored in DB2. This data is similarly extracted, time-stamped, and loaded into the CDW.
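The extract, cleanse, and time-stamp pattern of these scripts can be sketched as follows; the table, column names, and sample rows are invented for the illustration:

```python
import sqlite3

def load_snapshot(con, on_date, raw_rows):
    """Cleanse raw (name, loc) rows from a source and load them
    into the CDW stamped with the sampling date."""
    cleansed = [(on_date, name.strip(), int(loc))
                for name, loc in raw_rows
                if name.strip()]                 # drop empty records
    con.executemany("INSERT INTO CDWFiles VALUES (?, ?, ?)", cleansed)
    return len(cleansed)

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE CDWFiles (OnDate TEXT, Name TEXT, LOC INTEGER)")
loaded = load_snapshot(con, "2002-11-01",
                       [(" index.C ", "812"), ("buf.C", "430"), ("", "0")])
```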
The CDW is refreshed at regular intervals, when the data from the specified sources is loaded into it. The CDW serves as the repository of all project data. The outer layer includes data marts that organize data along specific dimensions representing the management concerns. For example, in this case study the three concerns are quality, health, and progress. Another data mart includes a defect monitoring view: a defect can be viewed along its key attributes, namely components, owners, owners’ managers, and date. Based on the needs of the management team, other data marts are developed. The CDW can be accessed from any workstation, subject to management approval. Team members maintain their transactional data sources for daily operation while accessing the CDW to analyze their concerns with respect to other project concerns. For example, members of the system test team frequently analyze the turn-around time for their defects, the distribution of defects along components, or the amount of code changes for each defect. The following section describes in more detail the data stored in the middle layer, the CDW.
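The turn-around analysis just mentioned reduces to a grouped date difference over the defect data; a SQLite sketch with invented rows:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE Defects (
        DefectId  INTEGER PRIMARY KEY,
        Component TEXT,
        Opened    TEXT,   -- ISO dates, as time-stamped on load
        Closed    TEXT
    )""")
con.executemany("INSERT INTO Defects VALUES (?, ?, ?, ?)",
                [(1, "FileMgr", "2002-10-01", "2002-10-11"),
                 (2, "FileMgr", "2002-10-05", "2002-10-07"),
                 (3, "GUI",     "2002-10-02", "2002-10-08")])

# Average turn-around time in days, per component.
turnaround = con.execute("""
    SELECT Component, AVG(julianday(Closed) - julianday(Opened))
    FROM Defects
    GROUP BY Component
    ORDER BY Component""").fetchall()
```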
8.2 The central data warehouse
A commercial software product development has numerous diverse data sources. This dissertation focuses on viewing software as data for management purposes. Previously, such data was either discarded, hidden in log files, or at best poorly structured and maintained. During the case study, as we leveraged the available data to provide reports on development processes and to identify possible trends, more data sources were identified for integration into the data warehouse. Furthermore, some of the existing data sources became more detailed and structured. In other words, the direction of the maturation of the management processes was from having little or no data towards more data with more structure. The volume of data increased drastically over the period of the study (it almost doubled), which in turn exacerbated the need for the automation of data analysis. As more analysis was performed, data structures became more complex. The following subsections list several data sources that were identified and integrated into the middle layer. This data was used to drive the management tasks described in Chapter 9.
8.2.1 Organizational chart
The organizational chart contains data about members of the team including their email addresses, their department or group, their job responsibilities, and their managers. This data is needed to improve reporting and resource analysis. It provides links between the entities in various data sources, e.g., defect owners, code reviewers, and people who modified files. Figure 8.2 shows an excerpt of the entity relationship (ER) diagram for this data source. This ER diagram maps to two relational tables as shown in Table 8.1.
[Figure 8.2: An excerpt of the ER diagram for the organizational data. A person owns multiple unique CMVC IDs and reports to one person.]
PEOPLE
  on-date       when the data was sampled (or last changed)
  country       employee's location
  emp-no        employee number
  name          employee's name
  email         employee's email address
  mgr-country   where the employee's manager works
  mgr-emp-no    the manager's employee number
  Primary key:  on-date, country, emp-no
  Foreign key:  mgr-country, mgr-emp-no to PEOPLE

CMVC-IDS
  on-date       when the data was sampled (or last changed)
  id            CMVC id
  status        deleted or active
  country       foreign key into PEOPLE table
  emp-no        foreign key into PEOPLE table
  Primary key:  on-date, id

Table 8.1: Organizational chart tables
8.2.2 System test
The system test team keeps detailed data about their test plans, new features they test, their progress, and problems identified (i.e., defects opened). This data is kept in a relational database and is updated frequently by the testers — people responsible for executing test scenarios. The main concern of the system test team is to verify the quality of the product as a whole, identify problems, open defects and assign them to developers. The data from this database, in conjunction with other sources, provides insights into the quality aspects of the product. Figure 8.3 shows an excerpt of the ER diagram for this data source.
Figure 8.3: An excerpt of the ER diagram of the system test data
The entity set scenarios represents test cases that simulate customer usage patterns. Each scenario is created by taking into account a number of new features that were implemented. The data linking a scenario to the new features it covers was originally unavailable; it was hidden in the textual system test documentation. As reporting mechanisms became more granular, it became apparent that such data should be more structured. Currently, this relationship is being established by the system test team. Every time a scenario is executed, an attempt record is created. Upon discovery of a fault in the execution of a scenario, a defect is opened and the scenario's attempt aborted. Each scenario is
time-stamped to capture the sampling time (or the time of last change). The two main relational tables are Scenarios and Attempts, as shown in Table 8.2.

SCENARIOS
  on-date      when the data was sampled (or last changed)
  name         name of scenario
  status       deleted or active
  release      the release of the product that the scenario tests
  category     the testing category
  Primary key: on-date, name
  Foreign key: release to RELEASES

ATTEMPTS
  on-date      when the data was sampled (or last changed)
  name         name of scenario
  sequence     the sequence number for the attempt
  status       running, complete, abort
  score        percentage of execution completed
  Primary key: on-date, name, sequence
  Foreign key: on-date, name to SCENARIOS

Table 8.2: Schema for system test.
8.2.3 Project diary
The project diary is where all the data about plans, project execution, and minutes of meetings is kept. This data was stored in Lotus Notes databases. A Notes database consists of a set of forms, each of which contains several fields. The corresponding ER diagram uses these forms as the basis for its entities. In addition, all entities include a time stamp attribute to reflect the time of sampling (or last change). Figure 8.4 depicts an excerpt of the ER diagram for the project diary. As seen in Figure 8.4, each new feature is planned for inclusion in a particular release. The new feature is owned by a member of the team, who is responsible for creating a final specification (or justifying the lack of one). Each feature is in turn designed by a group of people and its impacts identified with respect to the affected (CMVC) components. Almost all records created in the project must be reviewed by a pre-determined group of people. In this diagram, the entity people is the same as that described in the organizational chart ER diagram. Similarly, the entities components and releases are from the CMVC model described later in this section.
Figure 8.4: An excerpt of the ER diagram for the project diary
The entities described here map into a set of relational tables. Table 8.3 shows the definition of the Features table.

FEATURES
  on-date        when the data was sampled (or last changed)
  id             Lotus Notes assigned unique id
  owner-country  country of the owning member of the team (F.K.)
  owner-emp-no   employee number of the owner (F.K.)
  feature-id     id of the feature it describes (F.K.)
  Primary key:   on-date, id
  Foreign key:   owner-country, owner-emp-no to PEOPLE

Table 8.3: Schema for the Features table
The extracted data facilitates determining the progress of each feature and the overall project, as well as capturing a record of the activities performed during the development of a release.
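One simple form of this progress and resource analysis is counting feature records per owner by joining the Features table (Table 8.3) to the organizational chart (Table 8.1). The sketch below uses Python's sqlite3 module; column names are adapted to underscores and the sample rows are invented.

```python
import sqlite3

# Feature records per owner: join project-diary data to the org chart.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE people   (on_date TEXT, country TEXT, emp_no INTEGER,
                       name TEXT, email TEXT);
CREATE TABLE features (on_date TEXT, id TEXT, owner_country TEXT,
                       owner_emp_no INTEGER, feature_id TEXT);
INSERT INTO people VALUES ('2002-06-01','CA',1,'Alice','alice@example.com');
INSERT INTO features VALUES
  ('2002-06-01','N1','CA',1,'F1'),
  ('2002-06-01','N2','CA',1,'F2');
""")
counts = conn.execute("""
    SELECT p.name, COUNT(*)
    FROM features f
    JOIN people p ON p.country = f.owner_country
                 AND p.emp_no  = f.owner_emp_no
    GROUP BY p.name
""").fetchall()
print(counts)  # [('Alice', 2)]
```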
8.2.4 CMVC
The configuration management tool, CMVC, not only manages the development process, but also provides a wealth of data about defects, their status, and changes made to the code-base (International Technical Support Organization, 1994). However, the interface to CMVC was designed for transactional tasks: checking in/out of files, opening and tracking of defects, and tracking changes to files. Performing ad hoc analytical tasks requires a different schema. Hence, we created a schema similar to the original CMVC schema but optimized for analytical queries. Figure 8.5 shows an excerpt of the CMVC ER diagram. CMVC groups files into logical components, which can have sub-components. Each component is owned by a developer who is responsible for its integrity: providing access to other members of the team, keeping track of activities against the component, and so forth. To change the code-base, a developer opens a feature or defect. The developer identifies the files that must be changed to resolve the defect/feature. To include changes into the code-base, for each defect, the developer must create one or more tracks. Each track groups a number of files that were modified for implementing a defect/feature against a particular release. Upon successful integration of the tracks into the code-base the defect is closed. Having access to this data in conjunction with other data
Figure 8.5: An excerpt of the ER diagram for CMVC
sources provides valuable information about the product and its associated processes. The entities shown in Figure 8.5 map to a set of relational tables. Table 8.4 shows the schema for the components, defects, and files tables. It must be noted that a locally developed program computes the lines of code and comments in each program file and inserts them into the tables in the data warehouse.
8.2.5 Code-base
The program itself is the largest and most revealing source of data. Most of the programs are written in C/C++. Each file goes through changes during the development cycles; as a result, during each time period multiple versions of each file exist. Datrix C/C++ parsers construct an abstract semantic graph (ASG) for each program file (as described in Chapter 5). A local program transforms the resulting ASGs into object-relational schemas and inserts them into DB2. Appendix A shows a complete listing of all tables used for storing program data. Table 8.5 shows the two top-level tables.
8.2.6 Regression test
The regression test team keeps detailed data about their test cases covering multiple releases, their execution environment, and the results of these executions. The main concern of regression testing is to ensure proper operation of the product's previous functions, i.e., to ensure that none of the previous functionality has regressed. The ER diagram for the regression test is similar to that of the system test. At early stages of the case study, all data was kept in log files. However, as the implementation of the data warehouse progressed, the team responsible for regression testing began using a relational database for their data. Figure 8.6 shows an excerpt of the ER diagram for the regression test data.
COMPONENTS
  on-date      when the data was sampled (or last changed)
  id           component id
  name         name of component
  status       deleted or active
  owner        owner's CMVC id
  parent       id of the parent component
  Primary key: on-date, id
  Foreign key: parent to COMPONENTS; owner to CMVC-IDS

DEFECTS
  on-date      when the data was sampled (or last changed)
  id           defect id
  status       open, working, verify, closed, canceled, returned
  release      id of the planned release for fixing
  originator   defect originator
  owner        defect owner
  phaseFound   phase where the defect was identified
  component    component id
  severity     severity 1, 2, 3, or 4
  priority     priority 1, 2, 3, or 4
  opened       date when defect was opened
  endDate      date when defect was closed/canceled/returned
  abstract     brief description
  Primary key: on-date, id
  Foreign key: component to COMPONENTS; release to RELEASES;
               owner to CMVC-IDS; originator to CMVC-IDS

FILES
  on-date      when the data was sampled (or last changed)
  id           id of the file
  path         full path of the file
  type         text, binary, program
  status       deleted or active
  total        total lines of code
  comment      total lines of comments
  version      version of the file
  Primary key: on-date, id

Table 8.4: Schema for tables Components, Defects, and Files
PROGRAM-ENTITIES
  oid          unique id assigned to each entity
  on-date      date of the committed build used
  compName     name of CMVC component owning the file
  filename     path name of the file
  begin-row    row where the entity definition begins
  end-row      row where the entity definition ends
  begin-col    column where the entity definition begins
  end-col      column where the entity definition ends
  scope        the scope id of the entity within the file
  Primary key: oid
  Foreign key: filename to FILES; compName to COMPONENTS

RELATIONSHIPS
  source-oid   oid of the source
  target-oid   oid of the target
  type         type of relationship
  order        order of the relationship
  Primary key: source-oid, target-oid, type, order
  Foreign key: source-oid to PROGRAM-ENTITIES;
               target-oid to PROGRAM-ENTITIES

Table 8.5: Schema for the tables Program-entities and Relationships
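Structural metrics such as fan-in (used in the health definition of Chapter 9) reduce to simple aggregations over the RELATIONSHIPS table of Table 8.5. The sketch below uses Python's sqlite3 module; the oids are invented, and the order column is renamed to ord because ORDER is an SQL keyword.

```python
import sqlite3

# Fan-in per program entity: count incoming relationships (Table 8.5).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE relationships (source_oid INTEGER, target_oid INTEGER,
                            type TEXT, ord INTEGER);
INSERT INTO relationships VALUES
  (1, 3, 'call', 0),
  (2, 3, 'call', 0),
  (3, 1, 'call', 0);
""")
fan_in = conn.execute("""
    SELECT target_oid, COUNT(*) AS fan_in
    FROM relationships
    GROUP BY target_oid
    ORDER BY fan_in DESC
""").fetchall()
print(fan_in)  # [(3, 2), (1, 1)]
```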
Figure 8.6: An excerpt of the ER diagram for the regression test data
8.2.7 Performance test
The performance test team generates and maintains detailed data about its evaluation tasks. The main concern of the performance team is to benchmark the performance of the product against industry standards. Performance is a key non-functional requirement of commercial software products that can determine their success or failure. Typically, each benchmark has several components that must be measured against the previous release of the product. This data provides an overall picture of the performance characteristics of the product from the start to the final release. In conjunction with the code-base data, this data can provide insight into coding styles and standards. Figure 8.7 shows an excerpt of the ER diagram for performance test data.
Figure 8.7: An excerpt of the ER diagram for the performance test data
8.2.8 Others
Besides the above-mentioned data sources, towards the end of the case study we identified additional data sources that could provide a more comprehensive picture of the product development. The first is the customer service database, which provides details about customer problems, defects opened against a release, their respective solutions, and the turnaround time. The other data source(s) deal with the integration test teams. These are the test teams that work closely
Source           Sampling     Growth
CMVC             daily        50 MB
System test      daily        5 MB
Regression       daily        5 MB
Performance      weekly       10 MB
Organizational   weekly       200 KB
Project diary    weekly       50 MB
Code-base        (bi-)weekly  300 MB

Table 8.6: Sampling and growth rates for input data sources
with partners to ensure smooth integration and operation of the products. Overall, there are many data sources in a software product development, each representing a set of concerns. These data sources are not semantically independent from one another — they all describe an aspect of the product. However, these dependencies are not formal, and as data sources grow, it becomes more difficult to establish them. As projects progress, more data is generated and new data sources are identified. Based on this feedback, as suggested in Chapter 1, the data warehouse must be improved to capture such changes. The cost of the software data warehouse for a large project quickly becomes significant; however, based on the value added, this cost can be justified and amortized over multiple releases. The next section looks at some of the operational issues of the software data warehouse.
8.3 Operation
The data in the middle layer grows very rapidly. Various data sources are sampled at regular intervals and added to the central data warehouse. During peak periods of the project, the CDW grows on average by 500 MB per week, while during off-peak periods this amount drops to 300 MB. Table 8.6 provides an overview of the sampling interval and the growth rate of each data source. The reported growth represents only the data for the current release of the product. As previously mentioned, at any given point there are two or three releases of the product under development or maintenance, which can contribute more data to the CDW.
In terms of sampling of data, the loading window is 8 hours long. This is the time needed to extract, cleanse, and load the data into the CDW. For the purposes of this dissertation, this time was acceptable. However, as the product grows, we anticipate the size of our data (and the number of data sources) to grow as well. Furthermore, some flexibility must be allowed for possible network or hardware problems.

In terms of query performance, we have some flexibility for optimizations. Since the data is never modified, it is possible to create indexes on tables involved in expensive queries. For example, for the defects and files tables we have created over 20 different indexes. However, as the data sets grow, the creation of indexes requires time that could compromise our loading window. This problem is most noticeable with the code-base data, which is larger than the other sources and incurs some penalty for its table hierarchies. For this reason, we developed more optimized schemas in the outer layer.

The outer layer extracts data from the central data warehouse to be used by analytical processes. For example, the multi-dimensional management model discussed in Chapter 3 is implemented in this layer. The layer organizes data into subjects: quality, progress, and health. Other ad hoc analyses can also be performed by aggregating data from the central data warehouse and storing it temporarily in local tables. In our prototype, upon completion of a refresh, various reports are generated and posted on the internal web site. One common report creates trends for defect arrivals, feature implementations, and defect closures.

In this case study, the release team owned the data warehouse. The release team is responsible for the execution and monitoring of the project — the product development. Based on the feedback from team members and the management team, the release team decides which reports are automatically generated and then published on the server.
In other words, what is performed on the data warehouse server represents the official management perspective of the product development. However, each team can define its own concerns by accessing the CDW from a connected workstation, can automate reports, and share the results with other members of the team.
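The indexing strategy described in this section — freely adding indexes because warehouse data is never modified — can be sketched as follows with Python's sqlite3 module. The table and index names are invented for illustration; the real warehouse carried over 20 indexes on the defects and files tables.

```python
import sqlite3

# Add an index on columns used by a frequent, expensive query and
# confirm the query planner picks it up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE defects (on_date TEXT, id INTEGER, "
             "status TEXT, release TEXT, severity INTEGER)")
conn.execute("CREATE INDEX idx_defects_rel_status "
             "ON defects (release, status)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM defects WHERE release = 'v5' AND status = 'open'"
).fetchall()
# The reported plan should mention idx_defects_rel_status.
print(plan)
```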
8.4 Summary
This chapter focused on the input and the middle tier of the data warehousing solution. It identified various data sources integrated in the software data warehouse, depicted a partial ER diagram of the data warehouse, and discussed some operational issues relating to the size and performance of the warehouse. The outer layer of the warehouse, where the data marts reside, organizes the data into subjects. This layer implements the multi-dimensional model described in Chapter 3. The data from the middle layer is extracted into this layer to facilitate monitoring and tactical analyses. The next chapter gives an example of a multi-dimensional management model using three subjects (or concerns): quality, progress, and health. It will show how the indices for each of these subjects are computed and monitored against their planned values. Furthermore, when a deviation from the plan occurs, it enables drilling down and across the multi-dimensional model to identify possible correlations and root causes.
Chapter 9

A three-dimensional management model

Chapter 8 provided an overview of the middle layer of the software data warehouse. It outlined the input data sources and sketched an excerpt of the ER diagram of the middle layer. This chapter focuses on the outer layer, the analytical layer, which implements the multi-dimensional management model and other analytical tools. Chapter 3 presented an approach for defining, monitoring, and predicting quality, progress, and health. The following sections provide an overview of the local definitions used in the case study. It must be noted that these definitions are based on the existing informal goals of the project. An objective of this case study is to formalize the existing definitions in order to study the impact of the new management framework by comparing the status quo definitions before and after.
9.1 Quality
The local definition of quality applies the criteria used for shipment of the product. The team responsible for quality assurance determines, in advance, a set of quality criteria as a pre-requisite for making the product generally available. Based on these criteria, the local product quality is
defined. The following is a simplified version of this definition. It defines quality as a conjunction of five sub-goals.

• System test: Achieving this goal assures the stability and reliability of the product in light of the newly added functionality.

• Regression: Achieving this goal assures that newly added functions have not caused the product to regress from its previously implemented functionality.

• Performance: Achieving this goal assures that the product can perform within accepted benchmarks — both industry standards and internal objectives.

• Usability: Achieving this goal determines how prospective customers will respond to the product user interfaces and the associated help facilities and manuals.

• Defect control: Achieving this goal establishes that the team is able to cope with possible future defects.

Figure 9.1 shows an excerpt of the depiction of this quality definition. Each of the sub-goals below the root represents a particular team's concerns. These goals, depending on the size of the team, can be further divided. This section focuses on two sub-goals.

The first sub-goal is defect control. During various stages of the product development, "defects" are found against the product. These may be serious flaws that deteriorate the behavior of the product, e.g., loss of data, or ones of little consequence, e.g., a typo in a help message. In either case, the team must be ready to respond with a plan of action: defer to a future release or fix within a reasonable amount of time. Hence, upon delivery of the product to market, there should be no unresolved defects, and the ratio of defect arrivals to defects resolved should be less than one in a given period of time (i.e., a fast defect turnaround time).

The other sub-goal described in this section is the system test, which is further divided into three sub-goals.
• Stress test: The product is able to perform for a long period of time under various loads.
• Co-existence test: The new release of the product performs in the presence of older versions of the product (e.g., in a client-server scenario).

• Install test: The new release of the product installs under various conditions and configurations.
Each of these sub-goals is achieved by successfully completing a set of test cases or scenarios. A scenario is a combination of several tasks that test one or more aspects of the product with respect to a specific sub-goal. When the execution of a scenario is attempted, it must successfully complete its tasks. Completion of each task adds to its total score, indicating the achievement of that particular sub-goal. For example, a score of 90 indicates that 90 percent of all tasks completed successfully. Should an attempt detect a fault in the product, it aborts, the tester opens a defect, and the execution of the scenario halts. Upon fixing the defect, another attempt at execution of the halted scenario begins — from the start. There may be several attempts at the execution of a scenario before it achieves its pre-determined score. In the ideal case, the least acceptable score is 100 (out of 100).
Figure 9.1: An excerpt of the quality definition used in the case study. The numbers shown outside each sub-goal represent its score at a particular point in time. In this example, the scores of scenarios 2 and 3 are 40 and 48, respectively. Using these values and that of scenario 1, the scores for the stress test, system test, and quality goals are computed.
A number of views were defined on the middle layer to facilitate computation of the indices. Figure 9.1 depicts a partially computed quality definition. To fully compute the quality index, the values for the Scenario 1, Stress test, and System test goals must be computed. To do this, first a view is defined to extract the score of each attempt for Scenario 1. The following view presents a simple grouping of the necessary data:
    CREATE VIEW Scenario1 (date, attempt, score) AS
    SELECT t.ondate, t.sequence, t.score
    FROM   attempts t, scenarios s
    WHERE  t.name = s.name
      AND  s.category = 'Stress'
      AND  s.release = 'v5'
      AND  t.ondate = (SELECT MAX(a.ondate) FROM attempts a)
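An executable version of this view can be sketched with Python's sqlite3 module (note that the MAX aggregate belongs inside the subquery that picks the latest sampling date). The sample rows are invented to match the worked example below.

```python
import sqlite3

# Materialize the Scenario1 view over the system test tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE scenarios (ondate TEXT, name TEXT, status TEXT,
                        release TEXT, category TEXT);
CREATE TABLE attempts  (ondate TEXT, name TEXT, sequence INTEGER,
                        status TEXT, score INTEGER);
INSERT INTO scenarios VALUES ('2002-06-01','Scenario 1','active','v5','Stress');
INSERT INTO attempts VALUES
  ('2002-06-01','Scenario 1',1,'abort',68),
  ('2002-06-01','Scenario 1',2,'abort',72),
  ('2002-06-01','Scenario 1',3,'running',84);
CREATE VIEW Scenario1 (date, attempt, score) AS
SELECT t.ondate, t.sequence, t.score
FROM attempts t, scenarios s
WHERE t.name = s.name
  AND s.category = 'Stress'
  AND s.release = 'v5'
  AND t.ondate = (SELECT MAX(a.ondate) FROM attempts a);
""")
rows = conn.execute(
    "SELECT attempt, score FROM Scenario1 ORDER BY attempt").fetchall()
print(rows)  # [(1, 68), (2, 72), (3, 84)]
```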
In this example, the values of 68, 72, and 84 are extracted for attempts 1, 2, and 3 shown in Figure 9.1, respectively. Similarly, counting the number of defects against the scenario by querying the defects table reveals that there are three severity-two defects outstanding. The value of the defect-count goal is binary: 100 for no defects and 0 for any number of defects. This value can also be defined based on the severity of the defects, for example, 0 for critical or high severity defects, 50 for low severity defects, and 100 for no defects. Given these values, the Scenario 1 goal in Figure 9.1 can be evaluated as:

    V(Scenario1) = (max(attempt1, attempt2, attempt3) + defect-count)/2
                 = (max(68, 72, 84) + 0)/2
                 = 42
Assuming the values given in Figure 9.1 for Scenario 2 and Scenario 3, one can propagate the values upward and compute the value of the Stress test goal as:

    V(Stress-test) = (V(Scenario1) + V(Scenario2) + V(Scenario3))/3
                   = (42 + 40 + 48)/3
                   ≈ 43
Similarly, given the appropriate values in Figure 9.1, the value of the System test goal is computed as:

    V(System-test) = (V(Stress-test) + V(Co-exist-test) + V(Install))/3
                   = (43 + 51 + 36)/3
                   ≈ 43
Finally, the quality indicator at this time is computed as:

    V(Quality) = (V(Usability) + V(Performance) + V(System-test)
                  + V(Regression) + V(Defect-control))/5
               = (30 + 30 + 43 + 40 + 10)/5
               ≈ 31
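The bottom-up propagation above can be expressed as a short program. This is a hedged sketch of the computation illustrated in Figure 9.1: each goal's value is the mean of its children's values, and a scenario's value averages its best attempt score with its binary defect-count score. The structure and numbers follow the worked example in the text.

```python
def scenario_value(attempt_scores, defect_free):
    """(best attempt score + binary defect-count score) / 2."""
    defect_score = 100 if defect_free else 0
    return (max(attempt_scores) + defect_score) / 2

def goal_value(children):
    """Equal-weight mean of child goal values."""
    return sum(children) / len(children)

s1          = scenario_value([68, 72, 84], defect_free=False)  # 42.0
stress      = goal_value([s1, 40, 48])                         # ~43
system_test = goal_value([stress, 51, 36])                     # ~43
quality     = goal_value([30, 30, system_test, 40, 10])        # ~31
print(round(s1), round(stress), round(system_test), round(quality))
# 42 43 43 31
```

Unequal sub-goal weights, discussed below, would replace goal_value with a weighted mean.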
In other words, the current value of the quality index of the product is 31. This value can now be compared against the planned value to determine whether quality lies in the risk region (i.e., below the expected level). Figure 9.6 depicts the values of quality against the plan. For simplicity, and to avoid sensitive issues, we assigned equal weights to each sub-goal in the quality definition. This is contrary to our experience, which indicates that some sub-goals are more important than others; frequently, less important sub-goals are sacrificed for more important ones. However, identifying and assigning reasonable weights to these sub-goals would require management support and more investigation on our part. (Such issues can be sensitive, as weighting may be interpreted as valuing one team's contributions over another's.)

An interesting issue was raised during the determination of tolerance. Chapter 3 discussed that a quality definition should take into account some degree of tolerance. The intention is to capture two values: the ideal and the acceptable. However, management replied that there is "zero tolerance" on quality. Though management's attitude towards quality is commendable, it must be noted that realistic goals are more important. If there is a shortcoming in quality, then management should attempt to identify possible causes and improve their processes for future releases. (A study showed that during previous releases, due to deadline pressures, there was a tolerance of less than five percent.) Not having such values formally identified at the start of the project — perhaps based on previous release data — can dampen improvement efforts.
9.2 Progress
Similar to quality, in order to define progress goals, we studied the existing processes, data, and rules. The progress goal is defined as a conjunction of a set of must-have features and a disjunction of a set of nice-to-have features. The must-have features represent functionalities that must be delivered in the new release of the product, whereas the nice-to-have features represent one or more functionalities that should be included in the new release. Any feature that does not make it to the current release is pushed back to future releases. The decision as to which features are must-have or nice-to-have is influenced by a combination of business and technical constraints. For example, technically a feature may be necessary for other features of the product or, from a business perspective, a feature may provide strategic advantages for the product. Figure 9.2 shows an excerpt of a depiction of the progress definition. Progress is defined as a conjunction of a set of features (or functionalities) that must be implemented for the new release of the product. Each feature in turn breaks down into five sub-goals as described here.
• Specification: Achieving this sub-goal indicates that the team has agreed upon the details of the new feature. This sub-goal is broken down into a disjunction of final approved specification or justification. In cases where the new feature is considered “simple”, there may not be a need for a formal specification. In such cases, a justification is provided indicating this decision. On the other hand, to achieve the final approved specification, an owner must be identified, possible dependencies with other features indicated, and the overall specification reviewed for completeness. • Design: Achieving this sub-goal indicates that a high-level plan for implementation of the feature is in place. To achieve this goal, five sub-goals must be achieved: sizing, high-level design, low-level design, impact analysis, and review. • Coding: Achieving this sub-goal indicates that the appropriate changes are made to the local copy of the code-base. This sub-goal requires opening of a feature record in CMVC, creation of tracks for grouping of the changed files, and code review.
• Unit testing: Achieving this sub-goal indicates that the new feature has passed local testing and can be integrated into the code-base. This sub-goal requires planning of necessary test cases, reviewing and finally carrying them out. When unit testing is completed, the test-cases are transferred to regression testing. • Integration: Achieving this sub-goal indicates that the changes made to the code-base are integrated into the common code-base. This sub-goal requires dependency checks (to ensure all files are checked in), compilation/linkage, and rudimentary tests to ensure that the product can perform its basic functionalities.
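The progress definition above — a conjunction of must-have features combined with a disjunction of nice-to-have features, each feature complete only when all five sub-goals are achieved — can be sketched as follows. The feature names and completion values are invented for illustration.

```python
def feature_done(subgoals):
    """A feature is complete when every sub-goal (spec, design,
    coding, unit test, integration) reaches 100."""
    return all(v >= 100 for v in subgoals.values())

must_have = {
    "F1": {"spec": 100, "design": 100, "code": 100,
           "unit_test": 100, "integrate": 100},
    "F2": {"spec": 100, "design": 100, "code": 60,
           "unit_test": 0, "integrate": 0},
}
# Nice-to-have features recorded simply as done / not done.
nice_to_have = {"F3": False, "F4": True}

must_ok = all(feature_done(s) for s in must_have.values())  # conjunction
nice_ok = any(nice_to_have.values())                        # disjunction
print(must_ok, nice_ok)  # False True
```

The negative influence at the root (defects from previous releases, schedule pressure) would subtract from the combined value; it is omitted here as the text gives no formula for it.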
Figure 9.2: An excerpt of the progress definition used in the case study
There is no imposed order on the sub-goals under each feature. The development process is a choice made by the developers. Based on the nature of the feature, developers may adopt a waterfall model for well-known situations, a spiral model for less known and risky situations, or prototyping for exploratory
development. In any case, each sub-goal must be achieved before the feature is considered complete. Based on the approach used by the developer, the metric used can be either binary or a subjective value indicating the progress made towards each sub-goal. For example, some developers report only whether their testing is complete, i.e., a binary value, while others report that, say, 60 percent of testing is complete. In either case, we rely on the values provided by the developers. However, as time passes, we will be able to provide developers with feedback on the accuracy of their estimates and/or metrics. The disjunction of nice-to-have features represents some degree of tolerance afforded by the management. However, there may yet be more problems due to aggressive scheduling/planning, an increase in the number of customer-found defects against previous releases, or other unforeseen events. As a result, at the root of the progress definition we add a negative influence to capture such dynamic changes. Similar to the quality definition, we assigned equal weights to each sub-goal. However, we observed that steps were frequently bypassed when there was schedule pressure. The intention of this case study was to observe and identify such anomalies for future plans.
9.3 Health
The notion of code-base health introduced in this dissertation is perhaps the most novel dimension in the multi-dimensional management framework. To the best of our knowledge, the health of a code-base has never been examined on its own merit, let alone planned. Typically, when faced with a problem, such as an increased number of defects against a part of the code-base, a number of metrics are gathered for complexity, fan-in/fan-out, or other structural properties of the code-base to explain the problem at hand. Upon resolving the problem, the team ceases to gather metrics or to investigate possible correlations between the structural properties of the code and the overall quality. In other words, evaluating code-base health in its own right is not the norm in most software development. A possible reason can be traced back to a reactive mentality: teams respond to problems. In contrast, the multi-dimensional framework aims to be
pro-active: to plan for achieving or sustaining code-base health. The objective of this case study was to gather common ideas about the characteristics of the programs, that is, the source code. No coherent health goals existed to facilitate the formalization of a definition; there were only various hints, suggestions, and common practices provided in the team's coding guidelines. Hence, using these guidelines, we performed a benchmark study to set a number of subjective metrics. A set of commonly accepted characteristics was identified, and the values of these characteristics during previous releases were evaluated (or sometimes estimated). The layout of the code-base was used for the first level below the root: the health of the product was defined as a conjunction of the health of its components. The definition presented here is a first step. As analyses are performed, and based on the feedback of the team (and external sources), the definition can be improved to provide a better picture of the effect of health on product development. For the purpose of this case study, the definition provided served as a starting point. Health is defined as a conjunction of the health of each component of the code-base. The health of each component is in turn defined as a conjunction of the health of the files in that component and the overall structure of the component itself. Figure 9.3 depicts an excerpt of the health definition. The overall health of a file is defined as a conjunction of five sub-goals:
• Size: the number of lines of code in the file;
• Growth: the overall growth of the file in terms of lines of code;
• Change: the overall number of lines of code changed;
• Structure: the number of calls to and from the file; and
• Complexity: the average cyclomatic complexity (McCabe, 1976) of the functions in the file.
These sub-goals represent properties of the files that should be maintained.
To determine proper values for these properties, we computed them for the previous release. The value system devised can be described as follows.1

1 Later in the case study it was noted that these values should be localized to each component. For example, some components that are considered healthy have a very large growth factor, which is normal for those components and has little impact on their overall health.
• Size s: a large number of lines of code (not including comments) can be indicative of poor understandability, as well as of contention when developing in serial mode — team members must wait longer for others to complete their modifications. As a result, based on previous releases, the following ranges were identified for the size definition:
– 100 if L < 700;
– 90 if 700 ≤ L < 900;
– 80 if 900 ≤ L < 1100; and
– 0 otherwise,
where L is the total lines of code.
• Growth factor g: fast-growing modules can be indicative of potential problems (Lehman, 2000; Lehman, 1998a). Based on previous growth of the code-base, we defined the following ranges for the growth definition of files:
– 100 if D < 20;
– 90 if 20 ≤ D < 30;
– 80 if 30 ≤ D < 40; and
– 0 otherwise,
where D is the percentage of growth in lines of code from one committed build to the next.
• Change factor d: while the growth factor can identify potential problems, it may also be misleading when a file changes rapidly but does not grow in size. As a result, we defined a change factor to identify files that have undergone significant changes:
– 100 if δ < 20;
– 90 if 20 ≤ δ < 30;
– 80 if 30 ≤ δ < 40; and
– 0 otherwise,
where δ is the percentage of lines of code changed from one committed build to the next.
• Structuredness: we used the number of calls to functions in a file from other files in the same component (fan-in) and the number of calls from a file to other files (fan-out) as an indication of
the structural properties of that file. If the number of calls into a file is very large (greater than a threshold), that could indicate a highly used file. Similarly, a large number of calls from a file could indicate an error-prone file (Kitchenham, 1990). As a result, a file with both a high fan-in and a high fan-out is considered to have a poor structure (score of 0), a file with a low fan-in and a low fan-out is considered to have a good structure (score of 100), and a file for which only one of the fan-in or fan-out values is high is considered to have a reasonable structure (score of 50). A unique threshold value was defined for each component based on the values of a previous release.
• Average complexity c: we computed the complexity of each function as the total number of predicates plus 1, since there were no jumps out of loops in the programs. The complexity of a file was then computed as the average complexity of its functions, and the component complexity as the average complexity of the files in the component. Based on the previous releases, we defined the following ranges:
– 100 if M < 16;
– 90 if 16 ≤ M < 20;
– 80 if 20 ≤ M < 24; and
– 0 otherwise,
where M is McCabe's cyclomatic complexity (McCabe, 1976).
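The file-level value system above can be sketched as a set of piecewise scoring functions. The thresholds are the ones listed above; the exact handling of boundary values and the example file metrics are assumptions.

```python
# Sketch of the piecewise file-health scores described above. The band
# boundaries come from the benchmark of the previous release; boundary
# handling (< vs <=) and the sample file metrics are assumptions.

def banded(value, bands):
    """bands: list of (upper_bound, score); returns 0 past the last band."""
    for upper, s in bands:
        if value < upper:
            return s
    return 0

def size_score(loc):       # lines of code, excluding comments
    return banded(loc, [(700, 100), (900, 90), (1100, 80)])

def growth_score(pct):     # % LOC growth between committed builds
    return banded(pct, [(20, 100), (30, 90), (40, 80)])

def change_score(pct):     # % LOC changed between committed builds
    return banded(pct, [(20, 100), (30, 90), (40, 80)])

def complexity_score(m):   # average McCabe complexity of a file
    return banded(m, [(16, 100), (20, 90), (24, 80)])

def structure_score(fan_in, fan_out, threshold):
    high = (fan_in > threshold) + (fan_out > threshold)
    return {0: 100, 1: 50, 2: 0}[high]

# A file of 850 LOC, 25% growth, 10% change, fan-in/fan-out of 12/3
# against a threshold of 10, and average complexity 18:
scores = [size_score(850), growth_score(25), change_score(10),
          structure_score(12, 3, 10), complexity_score(18)]
print(sum(scores) / len(scores))   # conjunction as equal-weight average -> 84.0
```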
The structure of each component was separately evaluated based on the number of files that were called by other components. In other words, the number of files that were called was indicative of proper interface definition. In an ideal scenario, one would consult the designers to determine the interface files in a component and evaluate the number of interface violations. However, due to the large number of components (over 100), we aimed for locality of calls — calls should be localized to a few files. Let n denote the number of files in a component and c the number of files that are called from other components; then
• if c < n × 0.1, the component is well structured (i.e. 100);
• if n × 0.1 ≤ c < n × 0.3, the component structure is acceptable (i.e. 50); and
• otherwise, the component has a poor structure (i.e. 0).
Figure 9.3: An excerpt of the health definition used in the case study
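The component-structure score described above (c files called from other components, out of n files in the component) can be sketched as follows; the example component sizes are illustrative.

```python
# Sketch of the component-structure score: of the n files in a component,
# c are called from other components. Boundary handling is an assumption.

def component_structure(n, c):
    if c < 0.1 * n:
        return 100        # well structured
    if c < 0.3 * n:
        return 50         # acceptable
    return 0              # poor structure

# A 50-file component at three levels of call locality:
print(component_structure(50, 4))    # -> 100
print(component_structure(50, 10))   # -> 50
print(component_structure(50, 20))   # -> 0
```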
Defining code-base health caused the most discussion in this case study. While everyone holds an opinion as to what should be used as a defining property of code-base health, no one objects to having such a definition. The objective behind the locality-of-focus pillar of this dissertation is to encourage teams to define their goals and objectives locally and to argue for or against them based on available data. For example, a link can be established between commonly contested local coding guidelines and styles, and the quality or progress goals of the product development. This idea is at the heart of the proposed data-centric solution: putting in place planning and monitoring processes, and improving these processes continually based on available data. Furthermore, having viable data supporting the importance of code-base health can motivate managers to place more emphasis on programming tasks that have no immediate impact on the success of the project, e.g., refactoring, restructuring, and beautification. The next section outlines an overall management strategy for the case study based on the three definitions presented thus far.
9.4 Overall management strategy
The overall management strategy of the case study based on the ideas presented in this dissertation can be described as follows.
Figure 9.4: This figure shows planned, computed, and predicted values of the progress goal. As the figure shows, at the early stages the progress goal deviated significantly from the plan, but later on it recovered. However, the prediction as of the last computed point indicates a large deviation from the final release plan. Such a figure can indicate aggressive planning or poor resource allocation.
• Planning: At this stage the management constructs a hierarchical definition for each of the quality, progress, and health concerns for the new release. The initial and expected final values for each of these concerns are computed and, if need be, tolerance levels for each of the concerns are determined. For each of the three dimensions, an ideal line is drawn connecting the initial and expected final values; these lines represent the initial plan of the project. Should there be a need for plan modification, external influences will be used for appropriate adjustments. This
way, the original plan remains intact for possible future improvement of the planning process.
• Operational: At regular intervals, quality, health, and progress are evaluated according to their definitions. The information gathered is then disseminated via a local web site. Using the computed value, a linear prediction is made as described in Chapter 3: a straight line connecting the initial value and the current computed value. The computed value at a specified interval is compared against the expected value — a point on the plan line. If the computed value falls below the expected value, it could indicate a possible problem. To identify possible causes of a potential problem, tactical analysis is initiated.
• Tactical: Upon identification of a problem, management begins searching for correlations and possible causes. Suppose the problem is caused by a drop in quality: by drilling into the numbers, one can identify the goal(s) that failed to contribute to the value of the overall hierarchy. Based on the identification of the goal(s) that have not been accomplished, other dimensions can be investigated to determine whether additional aspects of the project, such as health and progress, had any impact on quality. If the problem seems to be the result of aggressive planning, the overall plan should be adjusted according to the prediction at the current point. If a problem seems likely to endure for the remainder of the project and possible future releases of the product, a strategic analysis should be performed to find supporting evidence for future directions.
• Strategic: When management identifies the potential for a future problem or the re-occurrence of a similar problem, a longer-term solution must be defined. The solution can be the result of input from the strategic layer (e.g., from other sources) or internal to the team. In either case, a hypothesis is formulated and its validity is studied based on the data from the data warehouse.
Should the hypothesis prove to be of value, an impact analysis is performed and a strategy for the implementation of the solution, including its evaluation, is constructed. This case study, as described in Chapter 7, used some forms of defect analysis for its strategic analysis. As a result, a set of queries was formulated that, using the data from the warehouse, performed various analyses of defect arrivals, defect turnaround time, code growth, and customer-reported problems.
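The linear prediction used at the operational level — a straight line through the initial and current computed values, compared against the plan line — can be sketched as follows. The time scale and goal values here are illustrative, not case-study data.

```python
# Sketch of the linear prediction used in operational monitoring:
# extrapolate the line through (t0, v0) and (t_now, v_now) to the release
# date and compare with the planned line. All values are illustrative.

def linear_value(t0, v0, t1, v1, t):
    """Value at time t on the straight line through (t0, v0) and (t1, v1)."""
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

t0, v0 = 0, 10           # initial progress score at project start
t_rel, v_plan = 20, 100  # planned final value at release (week 20)
t_now, v_now = 8, 34     # computed value at the current interval

expected = linear_value(t0, v0, t_rel, v_plan, t_now)   # point on the plan line
predicted = linear_value(t0, v0, t_now, v_now, t_rel)   # linear prediction

print(expected)    # -> 46.0 (the plan expected 46 by week 8)
print(predicted)   # -> 70.0 (the current pace reaches only 70 by release)
```

Here the computed value (34) falls below the expected value (46), flagging a possible problem, and the prediction (70) falls well short of the planned final value (100).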
Figure 9.5: This figure shows planned, computed, and predicted values of the health goal. As the figure shows, the health of the system was initially close to the plan. However, as time passes, the health deviates more and more from the plan. This deviation can be explained in terms of pressures on the progress goal or a change of concern in terms of health goals. The subjectivity of the health definition requires a replay of the health goal based on other definitions of health.
Figures 9.4, 9.5, and 9.6 show the computed values for the progress, health, and quality goals respectively. These values were computed at regular intervals during the case study. Due to confidentiality issues, the times are not shown on the charts. These charts present the deviations from the plan that were used to initiate the tactical analyses presented in the next chapter.
Figure 9.6: This figure shows planned, computed, and predicted values of the quality goal. As the figure shows, the quality goals at the start of the project were behind the plan. As time passed, the quality goals moved closer to their plan. The prediction at the last sample point indicates a small future deviation. At this point, the team can either adjust their quality plans or shift resources to improve quality.
9.5 Summary
This chapter outlined a case study that is representative of commercial software product development. The product studied has mature development and management processes and has gone through several successful releases. As the product's growth approaches an exponential rate,
there is no guarantee that the existing processes can perform as before; these processes must continually be improved. However, changing processes for a product development of such complexity involves risk. To manage these changes effectively, they must be driven by data: the management must formalize the concerns, plan and monitor them, and, based on gathered data, identify possible causes and determine possible means of improving the processes. If a process performs well during one release, there is no guarantee it will perform as well during subsequent releases. Software products continually evolve: they grow and change. The same should hold for their development and management processes. The software data warehouse is a tool for accomplishing such tasks. The next chapter looks at the benefits gained using the new management framework.
Chapter 10
Management evaluation

The previous chapters discussed the overall characteristics of the product under study, the existing data sources, and an example three-dimensional management model. Initially, the case study aimed to shadow the management tasks as the development progressed, and to compare the status quo with the proposed framework of this dissertation. However, at an early stage, based on management enthusiasm, the case study became an integral part of the existing management framework. The management argued that effective improvement would require small steps taken while under the real pressures of the development cycle. As a result, it is more difficult to compare the status quo with the new management framework. This chapter begins with anecdotal evidence based on management feedback (Ryan, 2003).1 The subsequent subsections list improvements made to operational, tactical, and strategic aspects of the product management. The arguments presented in these subsections rely on possibilities or on the levels of difficulty facing management in the absence of the proposed framework.

1 A manager from the group used in the case study presented her assessment at the CASCON'02 workshop "From garage to factory".
10.1 Management feedback
The case study presented in the previous three chapters aimed not to interfere with the progression of the product development and its management. However, early on, the management showed an interest in the gradual application of the ideas presented in this dissertation. As a result, the proposed management framework became integrated into the overall management strategy. A key argument presented by the management suggested that, to effectively assess the potential improvements, the proposed framework had to be integrated gradually and its effects had to remain continual and small. Upon discussing the ideas of this dissertation, the management formulated their requests as follows:
• identification of major problems;
• determination of the significance of the problems;
• determination of possible causes;
• evaluation of potential solutions; and
• assessment of future impact.
The prototype software data warehouse provided first-line managers with regular and timely reports on features and defects, their status, and their overall impact. Each manager additionally provided more detailed requirements for the presentation of their reports. Upon detection of a potential problem, higher-level managers were informed, and evidence data, possible causes, potential solutions, and future impacts were presented to them. The following subsections present three aspects of the management with associated examples. It must be noted that some problems faced by the management are not noticeable at a smaller scale. For example, the total number of changes made to the code-base (features or defects) during this case study exceeds 50,000.
10.2 Operational analysis
Chapter 3 discussed the importance of planning, monitoring, and prediction to the management of software product development, and provided a new approach to the planning, monitoring, and prediction of software projects. The plans created for the case study, as described in Chapter 9, were the result of a formalization of existing plans based on textual documents. Prior to the case study, plans were not rigorously monitored, nor was their progression predicted. A key reason behind this inaction was the lack of access to reliable data to verify the effectiveness of the plans, together with the inability to find universally acceptable metrics. Furthermore, as the number of items in the plan documents and the number of team members grew, the existing infrastructure became inadequate. To effectively support the management processes, and their improvement, the underlying infrastructure must continually evolve. The software data warehouse provided such an infrastructure, which in turn simplified the adoption of the multi-dimensional framework presented in Chapter 3. In terms of planning, the ability to verify, monitor, and improve plans resulted in team members spending more time on planning. Use of the goal-oriented approach presented in Chapter 3 formalized existing planning processes, resulting in the identification of inconsistencies that could otherwise have gone unnoticed for some time. For example, due to the large volume of changes made to the code-base, developers were frequently unaware of tasks that had been assigned to them. Such problems generally were not identified until they became critical. However, the automation of the monitoring of progress and quality goals resulted in the identification of these cases within one refresh interval (e.g., a week). The new three-dimensional management model created an overall picture available to every member of the team: each manager could view their aspect of the project based on their needs and customizations.
For example, a manager responsible for a key component of the product regularly provided the number of developers working on various new features within his area. He was motivated not only to monitor the progress, but also to build a picture of resource allocation within his area over a period of time. Having such regular reports enabled him to demonstrate to higher-level managers the need for highly skilled individuals in his group.
The availability of the temporal data enabled us to demonstrate that even simple linear projections could provide a clearer picture of the status of the project at release time. For example, at a later date in the project, we estimated that the progress goal would be significantly behind schedule, which would contribute to missing a major milestone. As it turned out, the number of defects opened against the product grew significantly during this period. As Chapter 9 described, defect arrival is a negative external influence that reduces the overall progress. Using a linear projection, we determined that if the arrival of defects continued at the same pace, the project would not meet its target. Such evidence enabled the management to perform informed risk management and identify possible courses of action. A possible solution to the above-mentioned problem required the team members to evaluate incoming defects and defer "non-critical" ones to future releases. In small projects, the task of deferring a defect to future releases is trivial. However, as a project becomes larger (as in the case study presented in this dissertation), almost nothing is trivial. When a manager decides that a defect must be deferred, the decision can result in uneasiness among the developers who have, perhaps, worked for days on preparing a fix. Typically, long periods of time are allocated to discussions as to why a defect should be deferred to the next release. Prior to the development of the software data warehouse, emails were sent to the defect owners requesting that their defects be deferred. The process involved developers formally changing the necessary data in the defect record in the configuration management tool (CMVC). After a week, at most 5 percent of these defects had actually been deferred; instead, long periods of discussion ensued as developers presented their cases as to why their defects should not be deferred.
After making the results of monitoring broadly available, we observed that 60 percent of the defects marked for future releases were deferred within the first day of sending the emails. A possible explanation for this large increase lies in the multiplicity of concerns. A developer working on a defect is focused on her own work, i.e., her concern; from the developer's perspective, the defect at hand, rightly or otherwise, is critical. However, when the developer is able to see the large number of defects and the data indicating the limitations of the resources, she is able to see the justification behind the deferral decision. In other words, the availability of agreed-upon monitoring and reporting improves the communication channel among the members of the team.
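As a sketch of how such a deferral rate might be computed from the warehouse data, the fragment below counts defects whose state change in the configuration management records followed the request email within a day. The defect records and latencies are hypothetical.

```python
# Sketch of measuring how promptly defects marked for deferral were actually
# deferred. Keys are hypothetical defect ids; values are days between the
# request email and the recorded state change in the defect record.

latency_days = {101: 0, 102: 1, 103: 9, 104: 0, 105: 30}

within_first_day = sum(1 for d in latency_days.values() if d <= 1)
pct = within_first_day / len(latency_days) * 100
print(pct)   # -> 60.0 (% of marked defects deferred within a day)
```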
The monitoring process provided the necessary data indicating the existence of a potential problem and a means for narrowing it down. The next step involves determining possible causes of the problem: tactical analysis. The next subsection describes the tactical aspects of the case study.
10.3 Tactical analysis
Early feedback from the management indicated the importance of data evidence for identified problems, as well as an indication of their significance (e.g., their future impact). The software data warehouse provided the infrastructure for determining correlations and possible root causes. Upon identification of a problem in operational analysis, the problem was localized along its dimension to a sub-goal. Next, other dimensions were explored to identify possible correlations. Finally, a hypothesis was formulated and its likelihood evaluated. The uniform access to the project's subject-oriented, integrated, and temporal data enabled quick hypothesis evaluation; as a result, managers frequently requested formal causal analyses for their problems. The monitoring of the project's progress goal revealed that early projections did not match the performance of the development team. Drilling down the progress goal identified the culprit: a feature had fallen behind its schedule. Based on experts' input, we drilled across to the health goal. The new feature would modify a component that was considered "complex". The hypothesis formulated was that the higher "complexity" of the component contributed to the slow implementation of the new feature. As it turned out, the specified component's (structural) fan-in exceeded the average of other components by 35 percent, indicating possible high usage of the component. Furthermore, the most recently changed files in the component were much larger than other files in the same component: 1000 lines of code versus an average of 600. Lastly, the average time to complete a defect against this component exceeded the average of other components. A possible course of action suggested that more resources be allocated for the implementation of new features against this component. Until such data became available, the manager responsible for the implementation of these features could not present a strong argument to other managers.
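The drill-across comparison described above — one component's fan-in against the average of the other components — can be sketched as follows; the component names and counts are hypothetical, chosen only to illustrate the calculation.

```python
# Sketch of the tactical drill-across described above: compare one
# component's structural fan-in against the average of the remaining
# components. Component names and counts are hypothetical.

fan_in = {"compA": 270, "compB": 190, "compC": 210, "compD": 180}
suspect = "compA"

others = [v for name, v in fan_in.items() if name != suspect]
avg = sum(others) / len(others)
excess = (fan_in[suspect] - avg) / avg * 100

print(round(avg, 1))      # -> 193.3 (average fan-in of other components)
print(round(excess, 1))   # -> 39.7 (% by which the suspect exceeds it)
```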
Another example anomaly involved the quality goal. At a point during the project, the quality goal score fell below its acceptable levels. Drilling down the quality goal definition revealed that a
subgoal of the system test had failed to contribute to the overall quality score. At this time, the progress score should have reached a steady level — no more new code should have been introduced. However, drilling across to the progress goal revealed that coding of new features was still taking place. The hypothesis in this case stated that defects identified during the system test subgoal were not being fixed in a timely manner. The relevant data indicated that the average turnaround time for system test defects exceeded the acceptable levels. A proposed course of action required the management to evaluate the features whose implementation had fallen behind schedule and to defer them to future releases. Consequently, the development team could focus on the timely fixing of system test defects. The last example anomaly involved a common slippage scenario. When a project falls behind schedule, a common course of action is to shrink the later phases — typically system test and integration testing. A manager argued that a longer time allotted for the system test can reduce the number of defects identified by the customers (i.e., post release). Although the data from previous releases was not fully available, we identified that during the previous release the system test scenarios had been merged at the end of the cycle to reduce the testing cycle. The reduction, based on team members' recollections, was significant in comparison to previous releases. On the other hand, the number of customer-found defects had risen slightly above expected levels. Considering the steady growth of the project from release to release, we hypothesized that a reduction in system test can not only impact the quality of the product (i.e., more defects found by customers), but also affect the progress of the future release due to the allocation of resources to backlog defects (i.e., a negative external influence). Verification of this hypothesis requires post-release data.
However, the argument, based on a projection of current data, contributed to the strategies discussed in the next subsection. Once again, it must be emphasized that management decisions are not always based solely on software engineering principles; in many cases, business goals override software engineering concerns. The objective is to communicate to the management the possible outcomes so that their effects do not accumulate to unrealistic levels. For example, the management should be made aware of the number of new features that the team can implement for a new release, the number of defects that must be fixed, and the amount of testing required to achieve a certain level of quality (as
well as the consequences of not reaching such levels). The management in turn may choose to forgo some of the new features in favor of fixing more defects and improving the product's overall quality. Alternatively, the management may choose to implement more new features at the cost of a potential hit to the health of the code-base. Tactical analysis provides the necessary data for making informed decisions that are of immediate consequence. The next subsection describes the strategic, long-term aspects of the management of the case study.
10.4 Strategic analysis
Perhaps the most critical aspect of management is strategic analysis. It involves creating solutions for problems identified during tactical analysis. Such solutions are typically longer term and involve a significant degree of risk. As a result, in most software development organizations, the strategic aspects are the most likely to be cut under economic pressure. A key contribution of the software data warehouse involves facilitating the simulation and analysis of strategic decisions. Strategic analysis constitutes the improvement cycle of this dissertation's proposed management framework. The objective is to provide clear historical evidence as to what the problem is, to identify its impact in the long run, and to evaluate the impact of implementing a potential solution in terms of costs and benefits. The first example of a strategic analysis involves quality shortfalls. As discussed in the tactical analysis, the number of customer defects was fixed at a constant ratio. However, considering the growth rate of the product from one release to another reveals potential problems; adding to the equation the cost of fixing a customer's defect soon raises the criticality of the problem. Adopting a variation of the Butterfly model (Bassin et al., 1998; Bassin and Santhanam, 1998), a profile of the various types of defects for each component was constructed. For each component, the profile revealed the number of function test defects, system test defects, and customer-found defects. The average ratio for the components with the highest number of customer defects indicated that for every four development changes made to the components, four defects were found during function testing, two defects were found during system test, and one by the customer. In other words, if the past is indicative of future performance, a constant increase in the number of development
changes, combined with the current pressure on system testing, would result in a massive number of customer problems. A potential solution involves a variation of the usual testing strategy. An analysis of the top components revealed that these had undergone the most changes in terms of added, deleted, and changed lines of code. Hence, one strategy is to test these components to find the defects that remain hidden. To do so, a team consisting of a developer (knowledgeable in the functionality of the component), a regression tester (knowledgeable in function test cases), and a system tester (knowledgeable in black-box testing) undertakes the task of finding defects against the specified component. Tracking the number of customer-found defects can then point out the success or failure of the proposed testing approach. Should the data support the new approach, a dynamic testing strategy is needed that takes into account the health of each component, the ratio of defects found in the various stages, and the history of customer-found defects in previous releases. The objective of this testing is to identify defects, as opposed to customary scenario execution. Once again, in large projects such an undertaking can prove massive, involving a great degree of risk (e.g., the allocation of scarce resources to unproven tasks). The availability of evidence data and the ability to simulate and predict future impact can help the management make appropriate decisions. Another strategy, formulated along the same lines, focuses on variation in the development cycle. Mining the available data from previous projects and the current release revealed a pattern of shortening testing cycles. At the outset, the release cycle is a modified version of a waterfall model: the system test and integration test cycles cannot begin until all development, unit testing, and function testing have been completed.
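The defect-profile projection underlying this analysis (four development changes : four function test defects : two system test defects : one customer defect) can be sketched as follows; the volume of planned changes is hypothetical.

```python
# Sketch of projecting defect counts from the component defect profile
# described above: per 4 development changes, 4 function test defects,
# 2 system test defects, and 1 customer-found defect. The number of
# planned changes is hypothetical.

profile = {"function_test": 4, "system_test": 2, "customer": 1}
changes_per_unit = 4

planned_changes = 12000   # hypothetical development changes next release

units = planned_changes / changes_per_unit
projection = {phase: units * n for phase, n in profile.items()}
print(projection["customer"])   # -> 3000.0 projected customer defects
```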
When the project falls behind schedule, the pressure transfers to the last phases. The data suggest a correlation between the shortening of the system test cycle (the merging of system test scenarios via positive external influences) and the rise in customer-found defects. Schedule pressures are almost inevitable in today's software development projects. Customers demand more functionality, more inter-operability among software products, and more sophisticated user interfaces. As a result, projects are growing beyond the capabilities of the existing processes. Yet such processes are not easily modifiable due to the risk involved. For example,
CHAPTER 10. MANAGEMENT EVALUATION
in our case study we determined that from one release to the next the number of developers had increased by 30 percent on average, while the number of changes made to the code-base, i.e., to fix a defect or implement a new feature, had grown on average by 40 percent. During the current release, the number of developers increased by less than 30 percent while the number of changes made to the code-base exceeded its usual growth; as a result, the project will fall behind schedule. A study revealed that although the ratio of code-base changes to developers is roughly constant from one release to the next, it varies significantly during the development cycle. For example, consider the ratio curve depicted in Figure 10.1. (For demonstration purposes the curve is approximated by a straight line.) A proposed solution involves multiple small releases in place of one big release. In other words, assuming that at any given point during the development each developer can work on a limited number of code-base changes, we can have multiple release cycles that chop off the peak of the curve shown in Figure 10.1, as depicted in Figure 10.2.
Figure 10.1: Ratio of code-base changes over number of developers
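The effect of splitting one release into mini-releases can be illustrated numerically. A minimal sketch, assuming the straight-line approximation of the ratio curve and an even split of work across mini-releases; all numbers are invented.

```python
# Sketch: a linear ratio curve (changes per developer) peaking at release
# time, and the effect of splitting one release into k mini-releases.
# total_changes, developers, and steps are illustrative assumptions.

def ratio_curve(total_changes, developers, steps):
    """Linearly accumulating changes-per-developer ratio over one cycle."""
    per_step = total_changes / steps
    return [per_step * (t + 1) / developers for t in range(steps)]

def peak_ratio(total_changes, developers, releases):
    """Peak of the curve when the work is split evenly across mini-releases."""
    return max(ratio_curve(total_changes / releases, developers, steps=10))

single = peak_ratio(4000, developers=100, releases=1)  # one big release
split = peak_ratio(4000, developers=100, releases=4)   # four mini-releases
print(single, split)
```

Under this toy model the peak ratio drops in proportion to the number of mini-releases, which is exactly the "chopping off the peak" shown in Figure 10.2.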
The proposed strategy supports the extreme programming philosophy for managing larger projects (Beck, 1999). This philosophy rests on frequent development, testing, and releasing. The drawbacks of applying such an approach include:
• multiple releases require more planning,
• there is overhead between two consecutive releases, and
• the various testing activities must be synchronized.

The advantages of multiple releases include:
• the amount of testing performed during each release will remain almost constant,
• during the planning stage the management makes a conscious decision as to what new features may not be delivered in the new release, i.e., the last mini-release shown in Figure 10.2, and
• at any point in time, there is a mini-release that can become the candidate release of the product and be delivered to the market.
This model is not unfamiliar to release-based software product development. The multiple mini-releases model resembles the fix-pack model: typically, a software development organization provides a major release every 16 months but provides fix-packs once every 3 months. The idea behind fix-packs, as the name suggests, is to deliver defect fixes to the previous major release. We have determined an optimal ratio of code-base changes to developers based on our current release. To experiment with the new model, we proposed that during the next release a subproject be identified and given the autonomy to have multiple releases. This implies that the changes made to its code-base are integrated into the main code-base only at the end of each mini-release. To achieve this separation, we need to identify a subproject whose changes to the code-base can be localized. The success criteria for this experiment include:
• constant defect/feature turnaround time across mini-releases,
• only the last mini-release can slip, and
• a reduced number of customer found defects, since each release has been subject to rigorous testing.
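These success criteria lend themselves to a direct check against per-mini-release data. A sketch follows; the field names, the 20 percent tolerance on turnaround time, and the sample figures are all assumptions for illustration.

```python
# Sketch: evaluating the three mini-release success criteria against
# hypothetical per-release records. Field names are assumptions.

def criteria_met(releases, prev_customer_defects, tolerance=0.2):
    """releases: dicts with turnaround (days), slipped (bool), and
    customer_defects (int), ordered first to last mini-release."""
    turnarounds = [r["turnaround"] for r in releases]
    stable = (max(turnarounds) - min(turnarounds)) <= tolerance * min(turnarounds)
    only_last_slips = all(not r["slipped"] for r in releases[:-1])
    fewer_defects = sum(r["customer_defects"] for r in releases) < prev_customer_defects
    return stable and only_last_slips and fewer_defects

releases = [
    {"turnaround": 10, "slipped": False, "customer_defects": 4},
    {"turnaround": 11, "slipped": False, "customer_defects": 3},
    {"turnaround": 10, "slipped": True,  "customer_defects": 2},
]
print(criteria_met(releases, prev_customer_defects=15))
```

Such a check is only meaningful because the warehouse retains comparable data for the previous release, which supplies the `prev_customer_defects` baseline.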
Figure 10.2: Overlapping mini-releases can reduce the pressures placed on the development team.

A key factor determining the success of this model is estimating the amount of work the development team can undertake in each interval. For example, in the case study presented in this chapter, this model was used in the maintenance stream (or fix-packs) based on the number of defects that were fixed in previous releases.
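Estimating the work a team can absorb in the next interval can be as simple as a moving average over past intervals, which matches the case study's use of defects fixed in previous releases. The window size and the defect counts below are invented for illustration.

```python
# Sketch: estimate the next fix-pack interval's capacity as a moving
# average of defects fixed in previous intervals (numbers are made up).

def estimate_next_interval(history, window=3):
    """Average of the last `window` intervals' completed fixes."""
    recent = history[-window:]
    return sum(recent) / len(recent)

fixes_per_interval = [120, 150, 130, 160, 140]
print(estimate_next_interval(fixes_per_interval))
```

More elaborate estimators (trend fitting, seasonal correction) could be substituted, but a moving average already damps single-interval spikes.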
A software development organization needs a strategy for its continual growth and survival. However, changes involve risks that must be analyzed from different viewpoints. Having access to temporal data can facilitate strategic analysis and support gradual changes to the development processes.
10.5 Limitations
Chapter 1 stated that solutions to software engineering problems are at best partial and local. The software data warehouse presented in this dissertation also has its limitations. This section lists some of the major limitations, while the subsequent chapter formulates future directions based on these.
• Costs and benefits: Chapter 4 indicated that the data warehouse is a solution, not a product. Its logical schemas require extensive business process modeling, and its implementation costs
may exceed the budget of a small organization (Chaudhuri et al., 2001). This dissertation does not address the monetary evaluation of the proposed solution. However, the underlying assumption is that a mature organization with a long-term plan can benefit from the potential gains provided by the software data warehouse.

• Maturity levels: Software development organizations vary in terms of their processes and their applications. To effectively implement the ideas presented in this dissertation, an organization must have reached a high level of maturity in its development processes. Typical organizations have gone through several successful releases with multiple future releases in the planning stage. Such a requirement characterizes organizations at level 3 or higher of the SEI Capability Maturity Model (CMM) (Paulk et al., 1993).

• Continuous operations: As the results of the case study suggest, as a software organization evolves, its data sources and their structures change. In turn, the software data warehouse must evolve to reflect these changes. As a result, the organization needs to plan for the extra overhead of operating and maintaining the software data warehouse. Otherwise, as time passes, it can become obsolete.
10.6 Summary
To summarize, the creation of the software data warehouse for the case study has resulted in more questions being asked and novel answers being offered than in previous releases. The availability of data to managers has resulted in more focused planning, better tactical analysis, and improved strategy formulation. The key advantages gained can be summarized as follows:

• Operational analysis:
– Improved planning, monitoring, and projection processes.
– Improved communication among various stakeholders.
– Timely identification of problems and their locality.

• Tactical analysis:
– Improved causal analysis of identified problems.
– Access to subject-oriented, integrated, and temporal data for the project.
– Enabled benchmarking of various tasks for future releases.

• Strategic analysis:
– Facilitated long-term planning and simulation.
– Enabled cost-benefit analysis of potential solutions.
– Provided a means of improving planning processes.

One aspect of the software data warehouse not formally addressed here, but which raised great interest, is learning. In recent years, there has been an interest in mining software development databases (Khoshgoftar et al., 2001). Mining data from the software data warehouse could provide nuggets of information that would help development teams formulate best practices. The next chapter concludes this dissertation by summarizing its contributions and discussing potential avenues for future research.
Chapter 11
Conclusion

This dissertation claimed that viewing software as data and leveraging business intelligence methods and technologies, in particular data warehousing, will improve the management of software product development. The underlying pillars of this claim were: locality of focus, multiplicity of concern, data centricity, and continual improvement. Chapter 2 presented an argument for locality of concern, while Chapter 3 proposed a means for defining and capturing multiplicity of concerns. Chapter 4 outlined a data-centric architecture for management support. The case study presented in later chapters demonstrated an application of these concepts while emphasizing the importance of continuous improvement in light of software evolution. This chapter concludes the dissertation by summarizing its key contributions and outlining possible future directions.
11.1 Contributions
This dissertation began with an argument for improving software product management: to improve software development, its management must be improved. Software development organizations, like other business undertakings, are bound by numerous constraints that affect management's decisions. This dissertation proposed a framework for the continual improvement of software product management based on the ideas of business intelligence. The following list summarizes the contributions of this
dissertation.
• A multi-dimensional management framework:
– Focused management goals and their evaluation on local projects;
– Provided a means for defining measurable goals, facilitating monitoring and prediction of their values;
– Provided a means of organizing the defined goals along various dimensions of interest; and
– Provided flexibility for changing and modifying locally defined goals in light of unforeseen changes.

• A data warehousing support architecture:
– Focused on managing by data;
– Developed an architecture for the storage and maintenance of integrated, temporal, and subject-oriented software data;
– Implemented management support for operational, tactical, and strategic analyses;
– Provided a means of storing project histories for replay; and
– Provided flexibility for the evolution of data based on existing commercial data management tools and technologies.

• A case study:
– Demonstrated an application of the proposed ideas in an industrial case study, showing that the ideas are applicable and scale to large projects;
– Implemented a concrete architecture of a software data warehouse; and
– Assessed the impact of the software data warehouse on management tasks.
Overall, this dissertation presented a management framework based on: locality of focus, multiplicity of concern, data centricity, and continual improvement. To implement this framework, a novel architecture was developed for management support as depicted in Figure 11.1. The architecture
Figure 11.1: A quality release management system based on data division: development processes that create and modify (software) data and management processes that analyze it.
separates management and development processes. Development processes create, modify, and delete data (i.e., software), while management processes use the data to monitor and control the overall project. This dichotomy is demonstrated by separating development and management data. While the development data is transient (i.e., changes frequently), the management data is non-volatile. Furthermore, the development data is isolated representing local concerns, while the management data is integrated capturing multiple concerns. Finally, a set of strategic processes continually assess the project to improve the management processes. These processes also provide the link between various product development projects.
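The split between transient development data and non-volatile management data can be illustrated with a toy snapshot loader: the operational table is updated in place, while the warehouse table only ever grows, keyed by snapshot date. Python's sqlite3 stands in for the operational and warehouse stores; the schema and names are illustrative, not the case study's.

```python
# Sketch: transient operational data vs. non-volatile warehouse data.
# A snapshot copies the current operational state into an append-only
# history table; the warehouse is never updated in place.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE defects (id INTEGER, state TEXT)")                    # transient
db.execute("CREATE TABLE defect_history (snap TEXT, id INTEGER, state TEXT)")  # non-volatile

def snapshot(day):
    """Append the current operational state to the warehouse table."""
    db.executemany(
        "INSERT INTO defect_history VALUES (?, ?, ?)",
        [(day, i, s) for i, s in db.execute("SELECT id, state FROM defects")],
    )

db.executemany("INSERT INTO defects VALUES (?, ?)", [(1, "open"), (2, "open")])
snapshot("2003-05-01")
db.execute("UPDATE defects SET state = 'fixed' WHERE id = 1")  # operational data changes
snapshot("2003-05-02")

# The warehouse preserves the history that the operational table overwrote.
rows = db.execute("SELECT snap, id, state FROM defect_history ORDER BY snap, id").fetchall()
print(rows)
```

The temporal queries used by the management processes (trends, projections, replays) run against the history table, so they never interfere with, or depend on, the volatile development data.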
11.2 Future Directions
The ideas put forward in this dissertation have evolved from their conception through their evaluation and reporting. However, several avenues opened for exploration beyond the main focus of this dissertation. The following list summarizes those that deserve further investigation.
• Implementation of the software data warehouse: Though data warehousing is a solution, not a tool, the domain of software management is sufficiently narrow to justify models and tools for the creation and operation of software data warehouses. Chapter 5 of this dissertation focused on one such issue: the extraction and storage of program data. A means is needed for extracting and storing data from the data sources most common to software development. Furthermore, the development of business intelligence solutions need not start big; rather, it should start small and expand based on the value added.

• Effectiveness of the software data warehouse: The design and implementation of a software data warehouse adds significant overhead to the software development budget. Furthermore, its ongoing maintenance and operation will tax the organization's resources. Hence, to effectively leverage the software data warehouse, there needs to be a model for determining the effectiveness of the solution in terms of its monetary benefits (i.e., return on investment).

• Empirical studies of the multi-dimensional framework: The availability of data for a case study enables further study of the management model. For example, by changing the definitions of the three dimensions of the case study described in Chapters 7-9, it is possible to assess whether a different definition could have provided more granular information about the project. As Chapter 9 showed, all goals were far from their respective plans. This suggests that either the goal definitions should change or the planning of the project was too ambitious. Such studies can contribute to goal definitions. Furthermore, during the case study, the management team raised an issue regarding the enhancement of the management model. In essence, the management team agreed that two further dimensions impact the overall software product development: skills and infrastructure.
The results of an empirical study may be able to justify such concerns.

• Code analysis: The availability of data for a product from the initial step to final delivery can provide insight into the overall software product development. Hence, it is necessary to study the code, its characteristics, and its evolution through the development cycle. There are many rules of thumb in projects that can be verified. One example is the introduction of clones in the later stages of development. As the project reaches its final phase, developers
hesitate to apply any changes to code that has already been tested. Instead, from time to time, developers simply add new pieces of code that duplicate the existing code (code cloning).

• Knowledge management: As software development projects are carried out, there are data patterns that can capture nuggets of knowledge that may prove useful to other projects. It is necessary to store such "lessons learned" or common "analysis patterns" as suggested by Basili (Basili and Caldiera, 1995). What is learned needs to be encapsulated for sharing with future members of the team and, in the long run, with other projects. The ideal is a documented set of rules that provide the cause, the effect, and a degree of confidence based on the type of development project where the rule was deduced. Also, as the data grows, the development organization will require mechanisms for aging, aggregating, or deleting previous releases' data.
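The cloning rule of thumb above is exactly the kind of claim that program data in the warehouse would let one test. A naive sketch of clone detection over windows of normalized lines follows; the window size and identifier normalization are simplifying assumptions, far cruder than real clone detectors.

```python
# Sketch: naive clone detection. Identifiers are normalized to a
# placeholder so renamed copies still match, then repeated windows of
# normalized lines are reported by starting line number.
import re

def normalize(line):
    """Replace every identifier with a placeholder token."""
    return re.sub(r"[A-Za-z_]\w*", "ID", line.strip())

def find_clones(source, window=2):
    """Return groups of line numbers starting identical windows."""
    lines = [normalize(l) for l in source.splitlines()]
    groups = {}
    for i in range(len(lines) - window + 1):
        key = tuple(lines[i:i + window])
        if all(key):  # skip windows containing blank lines
            groups.setdefault(key, []).append(i + 1)
    return [locs for locs in groups.values() if len(locs) > 1]

code = """x = read()
if x < 0:
    x = -x
y = read()
if y < 0:
    y = -y"""
print(find_clones(code, window=2))
```

Run per release against the stored program data, such a detector would show whether the proportion of cloned code really does rise as the project approaches delivery.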
11.3 Summary
This dissertation reinforces the claim that there is no silver bullet for solving software engineering problems (Brooks, 1995). There are many solutions, each of which has its own merit and domain of application. To solve the problems of software engineering, one must focus locally and, based on supporting data, adapt existing solutions. Furthermore, as software systems evolve in response to their environment (Lehman, 2000), so do the problems associated with them. As a result, existing solutions must be continually improved to fit the new problems.
Bibliography

Ahern, D. M., Clouse, A., and Turner, R. (2001). CMM Distilled. Addison-Wesley.
Akao, Y. (1990). Quality function deployment: integrating customer requirements into product design. Productivity Press.
Basili, V., Zelkowitz, M., McGarry, F., Page, J., Waligora, S., and Pajerski, R. (1995). SEL's software process-improvement program. IEEE Software.
Basili, V. R. and Caldiera, G. (1995). Improve software quality by reusing knowledge and experience. Sloan Management Review.
Basili, V. R. and Musa, J. D. (1991). The future engineering of software: A management perspective. Computer.
Bassin, K. A., Kratschmer, T., and Santhanam, P. (1998). Evaluating software development objectively. IEEE Software.
Bassin, K. A. and Santhanam, P. (1998). Managing software development: the Butterfly model and ODC. Center for Software Engineering, IBM T. J. Watson Research Center.
Beck, K. (1999). Extreme Programming Explained: Embrace Change. Addison Wesley.
Bell Canada (2000). Datrix: Abstract semantic graph reference manual. http://www.iro.umontreal.ca/labs/gelo/datrix/.
Bischoff, J. and Alexander, T., editors (1997). Data Warehouse. Prentice Hall.
Board, D. S. (1994). Report of the Defense Science Board task force on acquiring defense software commercially. The Undersecretary of Defense, Acquisition and Technology.
Boehm, B. and Basili, V. R. (2000). Gaining intellectual control of software development. Computer.
Boehm, B. and Brown, J. R. (1978). Characteristics of Software Quality. Elsevier.
Bonifati, A., Cattaneo, F., Ceri, S., Fuggetta, A., and Paraboschi, S. (2001). Designing data marts for data warehouses. ACM Transactions on Software Engineering and Methodology, 10(4).
Brooks, F. P. (1995). The mythical man-month. Addison-Wesley, 20th anniversary edition.
Card, D. N. (1998). Learning from our mistakes with defect causal analysis. IEEE Software.
Carmel, E. and Becker, S. (1995). A process model for packaged software development. IEEE Transactions on Engineering Management, 42(1).
Chaudhuri, S., Dayal, U., and Ganti, V. (2001). Database technology for decision support systems. IEEE Computer.
Chen, Y., Gansner, E. R., and Koutsofios, E. (1998). A C++ data model supporting reachability analysis and dead code detection. IEEE Transactions on Software Engineering, 24(9).
Chillarege, R., Bhandari, I. S., Chaar, J. K., Halliday, M. J., Moebus, D. S., Ray, B. K., and Wong, M. (1992). Orthogonal defect classification – a concept for in-process measurement. IEEE Transactions on Software Engineering, 18(11).
Crosby, P. B. (1979). Quality is free. McGraw-Hill.
Cusumano, M. A. and Selby, R. W. (1997). How Microsoft builds software. Communications of the ACM, 40(6).
Davis, A. M. (1995). 201 principles of software development. McGraw-Hill.
Deming, W. E. (1986). Out of crisis. Massachusetts Institute of Technology.
Devanbu, P., Rosenblum, D., and Wolf, A. (1996). Generating testing and analysis tools with Aria. ACM Transactions on Software Engineering and Methodology, 5(1).
Dutta, S., Wierenga, B., and Dalebout, A. (1997). Designing management support systems using an integrative perspective. CACM, 40(6).
Eick, S. G., Graves, T. L., Karr, A. F., Marron, J. S., and Mockus, A. (2000). Does code decay? Assessing the evidence from change management data. IEEE Transactions on Software Engineering, 27(1).
Einhorn, H. J. and Hogarth, R. M. (1999). Harvard Business Review on Managing Uncertainty, chapter Decision making: going forward in reverse. Harvard Business School Press.
Feigenbaum, A. V. (1991). Total Quality Control. McGraw-Hill.
Fenton, N. E. and Pfleeger, S. L. (1997). Software Metrics: A Rigorous and Practical Approach. PWS Publishing Company.
Finnigan, P. J., Holt, R. C., Kalas, I., Kerr, S., Kontogiannis, K., Muller, H. A., Mylopoulos, J., Perelgut, S. G., Stanley, M., and Wong, K. (1997). The software bookshelf. IBM Systems Journal, 36(4).
Garvin, D. (1984). What does "product quality" really mean? Sloan Management Review.
Gillies, A. C. (1992). Software Quality: Theory and Management. Chapman and Hall.
Glass, R. L. (1999). The realities of software technology payoffs. Communications of the ACM.
Glass, R. L. (2003). Facts and fallacies of software engineering. Addison-Wesley.
Grady, R. B. (1992). Practical software metrics for project management and process improvement. Hewlett-Packard Professional Books. Prentice Hall.
Hauser, J. R. and Clausing, D. (1988). The house of quality. Harvard Business Review.
Hollenback, C., Young, R., Pflugard, A., and Smith, D. (1997). Combining quality and software improvement. CACM, 40(6).
Holt, R. C. (1998). Structural manipulations of software architecture using Tarski relational algebra. In Working Conference on Reverse Engineering.
Hopcroft, J. E., Motwani, R., and Ullman, J. D. (2001). Introduction to Automata Theory, Languages, and Computation. Addison Wesley.
Inmon, W. H. (1996). The data warehouse and data mining. Communications of the ACM, 39(11).
Inmon, W. H., Imhoff, C., and Sousa, R. (1998). Corporate information factory. John Wiley and Sons Inc.
International Technical Support Organization (1994). Did you say CMVC? IBM Red Books.
ISO (1986). ISO 8402 Quality Vocabulary. ISO.
ISO (1990). Quality management and quality assurance standards.
Jacobson, I. (1994). Object-Oriented Software Engineering: A Use Case Driven Approach. Addison Wesley.
Jones, C. (1996). Patterns of software systems failure and success. International Thomson Computer Press.
Juran, J. M. (1979). Quality Control Handbook. McGraw-Hill.
Kan, S. H. (1995). Metrics and Models in Software Quality Engineering. Addison Wesley.
Khoshgoftar, T. M., Allen, E. B., Jones, W. D., and Hudepohl, J. P. (2001). Data mining of software development databases. Software Quality Journal.
Kimball, R. (1996). The data warehouse toolkit: practical techniques for building dimensional data warehouses. John Wiley and Sons, Inc.
Kitchenham, B. (1990). Software development metrics and models, chapter 10 of Software Reliability Handbook. Elsevier.
Kitchenham, B. and Pfleeger, S. L. (1996). Software quality: the elusive target. IEEE Software, 13(1).
Lehman, M. M. (1980). On understanding laws, evolution and conservation in the large program life cycle. Journal of Systems and Software, 1(3).
Lehman, M. M. (1998a). Feedback, evolution and software technology: The human dimension. In Proceedings of the International Workshop on the Principles of Software Evolution, ICSE'98, Tokyo, Japan.
Lehman, M. M. (1998b). Software's future: Managing evolution. IEEE Software.
Lehman, M. M. (2000). Rules and tools for software evolution planning and management. Technical Report 2000/14, Imperial College.
Lehman, M. M. and Belady, L. A. (1985). Program evolution: processes of software change. Academic Press.
Marciniak, J. J., editor (1994). Encyclopedia of software engineering, volume 1, chapter Goal/question/metric paradigm. Wiley-Interscience.
Mayrand, J. and Coallier, F. (1996). System acquisition based on product assessment. In Proceedings of the International Conference on Software Engineering.
Mayrand, J., Patenaude, J., Merlo, E., Dagenais, M., and Laguë, B. (2000). Software assessment using metrics: A comparison across large C++ and Java systems. Annals of Software Engineering, 9.
McCabe, T. J. (1976). A complexity measure. IEEE Transactions on Software Engineering, 2(4).
McCall, J. A. (1980). An assessment of current software metric research. In EASCON80. IEEE.
McConnell, S. (2002). The business of software improvement. IEEE Software.
McGuff, F. and Kador, J. (1999). Developing Analytical Database Applications. Prentice Hall PTR.
Mendelzon, A. and Sametinger, J. (1995). Reverse engineering by visualizing and querying. Software – Concepts and Tools, 16:170–182.
Muchnick, S. S. (1997). Advanced Compiler Design and Implementation. Morgan Kaufmann.
Muller, H. A. (1986). Rigi – A model for software system construction, integration, and evolution based on module interface specifications. PhD thesis, Rice University.
Muller, H. A., Orgun, M. A., Tilley, S., and Uhl, J. S. (1993). A reverse engineering approach to subsystem structure identification. Journal of Software Maintenance, 5(4).
Mylopoulos, J., Chung, L., and Nixon, B. (1992). Representing and using nonfunctional requirements: a process-oriented approach. IEEE Transactions on Software Engineering, 18(6).
Ovideo, E. I. (1980). Control flow, data flow, and program complexity. In The 4th Computer Software and Applications Conference (COMPSAC 80). IEEE.
Parnas, D. L. (1979). Designing software for ease of extension and contraction. IEEE Transactions on Software Engineering, 5(2).
Paulk, M. C., Curtis, B., Chrissis, M. B., and Weber, C. V. (1993). Capability maturity model for software, version 1.1. Technical Report CMU/SEI-93-TR-24, Software Engineering Institute.
Pedersen, T. B. and Jensen, C. S. (2001). Multidimensional database technology. IEEE Computer.
Pittman, W. D. and Russell, G. R. (1998). The Deming cycle extended to software development. Production and Inventory Management Journal.
Project Management Institute (2000). Project management body of knowledge. Project Management Institute.
Rosenblum, D. and Wolf, A. (1991). Representing semantically analyzed C++ code with Reprise. In Proceedings of the USENIX C++ Conference.
Ryan, K. (2003). CASCON workshop report: from garage to factory. Technical Report TR-74.188-9, IBM Toronto Lab.
Schulmeyer, G. G. (1990). Zero defect software. McGraw-Hill.
Shewhart, W. A. (1931). Economic Control of Quality of Manufactured Product. Van Nostrand Company.
Singh, H. S. (1997). Data Warehousing. Prentice Hall PTR.
The Standish Group International, Inc. (1995). Chaos. Internal report.
The Standish Group International, Inc. (1999). CHAOS: A recipe for success. Internal report.
Tilley, S. and Smith, D. (1996). Coming attractions in program understanding. Technical Report CMU/SEI-96-TR-019, Software Engineering Institute.
van Genuchten, M. (1991). Why is software late? An empirical study of the reasons for delay in software development. IEEE Transactions on Software Engineering, 17(6).
Watts, R. (1987). Measuring software quality. NCC Publications.
Whitehorn, M. and Whitehorn, M. (1999). Business Intelligence: The IBM Solution. Springer Verlag.
Wong, K., Tilley, S. R., Muller, H. A., and Storey, M. D. (1995). Structural redocumentation: A case study. IEEE Software.
Zultner, R. E. (1992). Quality Function Deployment (QFD) for software. Zultner and Company.
Appendix A
Complete C/C++ data model

This appendix provides the complete definition of the tables described in Chapter 5.

CREATE TYPE ProgramEntity AS ( filename VARCHAR(600), buildDate DATE, beginrow INTEGER, begincol INTEGER, endrow INTEGER, endcol INTEGER, scope INTEGER ) REF USING VARCHAR(50) MODE DB2SQL
CREATE TYPE Expression UNDER ProgramEntity AS ( qualifier VARCHAR(30) ) MODE DB2SQL
CREATE TYPE Identifier UNDER ProgramEntity AS ( name VARCHAR(60), visibility VARCHAR(10) ) MODE DB2SQL
CREATE TYPE ThrowSpec UNDER ProgramEntity AS ( expression REF(Expression) ) MODE DB2SQL
CREATE TYPE Type UNDER Identifier AS ( qualifier VARCHAR(30) ) MODE DB2SQL
CREATE TYPE Scope UNDER Identifier AS ( qualifier VARCHAR(30) ) MODE DB2SQL
CREATE TYPE Jump UNDER ProgramEntity AS ( qualifier VARCHAR(30), destination REF(ProgramEntity) ) MODE DB2SQL
CREATE TYPE SwitchSelect UNDER ProgramEntity AS ( condition REF(Expression) ) MODE DB2SQL
CREATE TYPE IfSelection UNDER ProgramEntity AS ( condition REF(Expression), trueBlock REF(Scope), falseBlock REF(Scope) ) MODE DB2SQL
CREATE TYPE LiteralExp UNDER Expression AS (
type VARCHAR(40), initVal VARCHAR(400) ) MODE DB2SQL
CREATE TYPE NameReferenceExp UNDER Expression AS ( name VARCHAR(30) ) MODE DB2SQL
CREATE TYPE OperatorExp UNDER Expression AS ( operator VARCHAR(30) ) MODE DB2SQL
CREATE TYPE UnaryOp UNDER OperatorExp AS ( operand REF(ProgramEntity) ) MODE DB2SQL
CREATE TYPE BinaryOp UNDER OperatorExp AS ( oprnd1 REF(ProgramEntity), oprnd2 REF(ProgramEntity) ) MODE DB2SQL
CREATE TYPE TernaryOp UNDER OperatorExp AS ( oprnd1 REF(ProgramEntity), oprnd2 REF(ProgramEntity), oprnd3 REF(ProgramEntity) ) MODE DB2SQL
CREATE TYPE Enumerator UNDER Identifier AS ( initValue REF(Expression) ) MODE DB2SQL
CREATE TYPE Function UNDER Identifier AS ( constant CHAR(5), external CHAR(5), inline CHAR(5), static CHAR(5), virtual CHAR(5), resultType REF(Type), initValue REF(Expression), exception REF(ThrowSpec) ) MODE DB2SQL
CREATE TYPE Label UNDER Identifier AS ( qualifier VARCHAR(30), switch REF(SwitchSelect), value REF(Expression) ) MODE DB2SQL
CREATE TYPE ObjectEntity UNDER Identifier AS ( qualifier VARCHAR(30), type REF(Type), initValue REF(Expression), constant CHAR(5), external CHAR(5), static CHAR(5) ) MODE DB2SQL
CREATE TYPE FunctionParam UNDER ObjectEntity AS ( function REF(Function) ) MODE DB2SQL
CREATE TYPE AggregateType UNDER Type AS ( type VARCHAR(10) ) MODE DB2SQL
CREATE TYPE ForLoop UNDER ProgramEntity AS ( body REF(Scope), condition REF(Expression), increment REF(Expression) ) MODE DB2SQL
CREATE TYPE WhileLoop UNDER ProgramEntity AS ( qualifier VARCHAR(30), body REF(Scope), condition REF(Expression) ) MODE DB2SQL
CREATE TABLE programEntities OF ProgramEntity ( REF IS OID USER GENERATED )
CREATE TABLE expressions OF Expression UNDER programEntities INHERIT SELECT PRIVILEGES
CREATE TABLE identifiers OF Identifier UNDER programEntities INHERIT SELECT PRIVILEGES
CREATE TABLE throwSpecs OF ThrowSpec UNDER programEntities INHERIT SELECT PRIVILEGES
CREATE TABLE types OF Type UNDER identifiers INHERIT SELECT PRIVILEGES
CREATE TABLE scopes OF Scope UNDER identifiers INHERIT SELECT PRIVILEGES
CREATE TABLE jumps OF Jump UNDER programEntities INHERIT SELECT PRIVILEGES
CREATE TABLE switchselects OF SwitchSelect UNDER programEntities INHERIT SELECT PRIVILEGES
CREATE TABLE ifselections OF IfSelection UNDER programEntities INHERIT SELECT PRIVILEGES
CREATE TABLE literalexps OF LiteralExp UNDER expressions INHERIT SELECT PRIVILEGES
CREATE TABLE nameReferenceExps OF NameReferenceExp UNDER expressions INHERIT SELECT PRIVILEGES
CREATE TABLE operatorExps OF OperatorExp UNDER expressions INHERIT SELECT PRIVILEGES
CREATE TABLE unaryOps OF UnaryOp UNDER operatorExps INHERIT SELECT PRIVILEGES
CREATE TABLE binaryOps OF BinaryOp UNDER operatorExps INHERIT SELECT PRIVILEGES
CREATE TABLE ternaryOps OF TernaryOp UNDER operatorExps INHERIT SELECT PRIVILEGES
CREATE TABLE enumerators OF Enumerator UNDER identifiers INHERIT SELECT PRIVILEGES
CREATE TABLE functions OF Function UNDER identifiers INHERIT SELECT PRIVILEGES
CREATE TABLE labels OF Label UNDER identifiers INHERIT SELECT PRIVILEGES
CREATE TABLE objectEntities OF ObjectEntity UNDER identifiers INHERIT SELECT PRIVILEGES
CREATE TABLE functionParams OF FunctionParam UNDER objectEntities INHERIT SELECT PRIVILEGES
CREATE TABLE aggregateTypes OF AggregateType UNDER types INHERIT SELECT PRIVILEGES
CREATE TABLE forLoops OF ForLoop UNDER programEntities INHERIT SELECT PRIVILEGES
CREATE TABLE whileLoops OF WhileLoop UNDER programEntities INHERIT SELECT PRIVILEGES
CREATE TABLE relationships (
    type VARCHAR(30) NOT NULL,
    source REF(ProgramEntity) NOT NULL,
    target REF(ProgramEntity) NOT NULL,
    "ORDER" INTEGER NOT NULL,
    PRIMARY KEY (type, source, target, "ORDER")
)
Vita

Name
Homayoun (Homy) Dayani-Fard
Education
Queen's University, Ph.D., 1995-2003
Queen's University, M.Sc., 1993-1995
University of Toronto, Honors B.Sc., 1989-1993

Experience
Member of DB2 quality assurance team, IBM Toronto Lab, 2002-present
Member of DB2 release team, IBM Toronto Lab, 2001-2002
Research associate, IBM Centre for Advanced Studies, 1997-2001
Adjunct instructor, York University, 1998-present
Adjunct instructor, Queen's University, 1995-1997
Research assistant, Queen's University, 1994-1997
Teaching assistant, Queen's University, 1993-1994
Teaching assistant, University of Toronto, 1993
Service bench technician, Olivetti Canada, 1988-1993

Patents
Dynamic semi-structured repository for mining software and software-related information, US patent 6,339,776.

Publications
Improving Software Management Through Data Warehousing, 2000, IBM Toronto External Technical Report TR-74.169.
CASCON '98 Workshop Report: Legacy Software Systems – Challenges, Issues, and Progress, 1999, IBM Toronto External Technical Report TR-74.165-k.
Reverse Engineering by Mining Dynamic Repositories, 1998, Working Conference on Reverse Engineering.
CASCON '97 Workshop Report: Software Architectures, 1998, IBM Toronto External Technical Report TR-74.161.
A Study of Semi-Automated Program Construction, 1998, Queen's University External Technical Report 1998-416.
Bridging the Gap Between the Design and Implementation of Hard Real-Time Systems, 1996, Queen's University External Technical Report 1996-397.

Awards
IBM Invention Achievement Award, 1999
IBM Appreciation Award, 1999
IBM Special Achievement Award, 1998
Ian MacLeod Award, Queen's University, 1997 (co-recipient)
Queen's Graduate Award, 1993, 1995, 1996, 1997
Dean's Award, Queen's University, 1993
University of Toronto Dean's List, 1993
HP Award in Computer Science, University of Toronto, 1990

Academic activities
Exhibits co-chair, International Conference on Software Engineering, Toronto, Canada, 2001
Technical demos and research posters chair, CASCON, Toronto, Canada, 1999, 2000
Workshops chair, CASCON, Toronto, Canada, 1997, 1998
Member of organizing committee, ITRC/TRIO Researcher Retreat, Kingston, Canada, 1995-1997
Guest editor, Crossroads, ACM student magazine, special issue on software engineering, 1995