Empirical Software Engineering, 4, 199–215 (1999)
© 1999 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.
Metrics to Assess the Likelihood of Project Success Based on Architecture Reviews

ALBERTO AVRITZER, AT&T Network Computing Services, Middletown, NJ 07748
ELAINE J. WEYUKER, AT&T Labs—Research, Florham Park, NJ 07932
[email protected]
[email protected]
Abstract. Architecture audits are performed very early in the software development lifecycle, typically before low level design or code implementation has begun. An empirical study was performed to assess metrics developed to predict a project's risk of failure. The study used data collected during 50 architecture audits performed over a period of two years for large industrial telecommunications systems. The purpose of such a predictor was to identify, at a very early stage, projects that were likely to be at high risk of failure. This would enable the project to take corrective action before significant resources had been expended using a problematic architecture. Detailed information about seven of the 50 projects is presented, along with a discussion of how the proposed metric rated each of these projects. A comparison is made of the metric's evaluation and the assessment of the project made by reviewers during the review process.

Keywords: Architecture review, discovery review, architecture audit, risk metric
1. Introduction
It has been estimated that AT&T currently has on the order of 350 million lines of active code used to run its long distance network, covering such activities as network maintenance, network provisioning, and network capacity planning and design, as well as non-network-related activities, including a wide variety of other software-centric projects plus billing, accounting, and other back-office software systems. Because the number and complexity of AT&T's software systems are so enormous, a formal, standardized process of software discovery review and/or software architecture review is generally required whenever we begin the specification and design of a new software system that is ultimately to be deployed within the network, or when existing systems are to undergo major modifications. Although this process has been in place at AT&T for a number of years, we became involved in making these reviews more systematic and uniform approximately three years ago. The ultimate goal of our involvement is to develop a process that would allow the project to determine, at a very early stage of system development, whether or not the project is at high risk of failure. In this way, corrective actions could be taken, if necessary, to modify the software architecture to address identified weaknesses. With this end in mind, we describe in this paper metrics that we have developed to determine the risk of failure for a project, based on findings made during the review process, which is done very early in the system's lifecycle, before any code has been written or low level design performed.

One of the assets we had at our disposal to accomplish this was the set of findings collected during a total of 50 discovery reviews and architecture reviews performed during the period that
we have been involved in the process. All of these reviews were performed on very large industrial telecommunications systems that were intended to be built by our organization and used in production. Once our metrics had been defined, we were therefore able to attempt to validate them by selecting a subset of the 50 available reviews. We did this by choosing seven projects on which we had served as a member of the review team. This meant that we had intimate knowledge of both the project and the review process. We selected four projects that the review team had deemed to be at low risk of failure, and three projects that the review team had considered to be at high risk of failure. In this paper we describe in detail our findings for these seven software projects.

2. Software Architecture Basics
Before describing the process that we followed while attempting to define these metrics, and presenting both the metrics that we defined and discarded and the one ultimately selected, we first present some relevant terminology. The most important term that needs definition is, of course, the term software architecture itself. There is by no means unanimous agreement about what this term means. In fact, we have found that many papers in the literature essentially fail to define the term at all, rather than dealing with the obvious difficulty of defining it. Instead, the authors leave it as an undefined term, implying that everyone agrees on a meaning and that it is therefore unnecessary to define the term formally. For example, although reference (Abowd et al., 1997) presents a comprehensive description of techniques that have been used with success in industry for evaluating software architecture, the term nonetheless remains entirely undefined in the document. Other papers by well-known experts state explicitly that there is no universally agreed-upon definition (Garlan et al., 1995; Garlan and Perry, 1995). For example, in their excellent paper on architectural mismatch, Garlan, Allen, and Ockerbloom state:

There is currently no single, universally accepted definition of software architecture, but typically a system's architectural design is concerned with describing its decomposition into computational elements and their interactions (Garlan et al., 1995).1

In an introduction to a special issue dedicated to papers on software architecture, Garlan and Perry present the following definition of architecture. They credit this definition to a group that met at the SEI during 1994 to investigate software architecture issues, and note that there are a number of conflicting definitions of software architecture that appear in the relevant research literature.

The structure of the components of a program/system, their interrelationships, and principles and guidelines governing their design and evolution over time (Garlan and Perry, 1995).2

Since our mandate is to perform architecture audits, assess them in meaningful ways, and define appropriate metrics to predict the ultimate risk of failure for these audited projects,
it is certainly necessary to make concrete what we mean by the term "architecture". For these reasons, we have selected the definition that Shaw and Garlan use in their recent book:

Abstractly, software architecture involves the description of elements from which systems are built, interactions among those elements, patterns that guide their composition, and constraints on these patterns. In general, a particular system is defined in terms of a collection of components and interactions among those components (Shaw and Garlan, 1996).3

At AT&T, we have in place formal procedures for two different types of architectural assessment. The first type is intended to evaluate architectural alternatives, and to balance the benefits and risks of decisions. For this reason, it is done very early in the lifecycle, generally even before the requirements are completed. This is known as a discovery review and it is used to help a development organization make decisions. In contrast to this, an architecture review is done after the requirements are complete and high level design has been done, but before low level design or coding. Since it is done at a somewhat later stage than a discovery review, the primary architectural decisions have already been made, and so the goal of an architecture review is to guarantee the architecture's completeness and to aid in the identification of potential problems. We will use the phrase architecture audit to refer to both types of architectural assessment: discovery reviews and architecture reviews. We call the team that performs either type of audit a review team, and speak of its members as reviewers.

We have found that if we want to maximize the value derived from an architecture audit, it must be done by our most experienced and talented technical people. We also require that members of a review team not belong to the development team of the project they are reviewing, to prevent any lapses in objectivity or perceived conflicts of interest. An attempt is made to balance the skills of the review team members, including experts in the different technologies, development phases, and processes needed for the project under review to be successful. The review team chairperson typically invests about two hours with the project's lead architect prior to scheduling an audit. This is done to enable the chairperson to identify the necessary technological expertise that must be included on the review team. Once the review team personnel have been selected, the project must deliver requirements and architecture documents to the review team members with ample time for them to prepare adequately for the review. This enables the reviewers to determine which areas of the architecture will require detailed probing during the review.

Project team members make presentations during the audit on key features of the architecture. Issues that the review team determines require special attention by the development team are noted on cards, and after the presentations are complete, the review team meets to assign severities and categories to each identified issue. Once that has been done, these findings are reported to the project and also entered into a database that is used to track the resolution of each identified issue. In addition, after the audit is complete, a report detailing the findings is assembled and distributed both to the project's upper management and to the project's lead architect.
The report and database entries are used by the project to address problems identified by the review team. Even though it is frequently obvious to an experienced reviewer whether or not the project is in good shape, when a review team tells a project that it is at high
risk of failure, the project personnel are frequently distressed and sometimes question the review team's objectivity and request concrete evidence of this assessment. For this reason, we decided to develop a metric that would depend on the information collected during the architecture audit, and would be tied to explicit findings. In that way it would be clear to the project members that the risk assessment was based on empirical evidence rather than whim or malice.

3. Empirical Results
Table 1 presents data collected during 50 architecture and discovery reviews that were performed over a period of two years for many different projects developed by our organization. As a problem or issue was identified, it was assigned a category. The main categories were: project management, requirements, performance, operations administration and maintenance (OA&M), design, technology, network, security, testing, and other. Project management issues would typically include such things as scheduling, staffing, planning, and funding issues. A typical requirements problem for a feature would be that its requirement was missing. Included among performance issues would be the lack of such things as a performance budget, performance estimates, or data gathering efforts. Note that since no code has been written at the time that the architecture audit is performed, a finding classified as being performance-related would typically not indicate that the software was not able to handle the required throughput or that the response time was too long, or something of that nature. Examples of problems that would be classified as being OA&M issues include such things as user administration, disaster recovery, back-up and recovery, and fail-over. If the selected technology is unable to solve the required problem, then we would categorize it as a technology issue. If the system will not be able to access the required internal networks that are necessary to provide the connectivity needed by the user, then we would classify the problem as a network issue. If the architecture fails to comply with corporate security guidelines, then we would indicate that there is a security issue. Typically, a testing issue indicates that there is a question about whether or not the proposed testing approaches are feasible. Again, it does not indicate that there is a problem with the code identified during testing, since at the stage of an architecture audit there has typically not been any code written. The category "Other" is a repository for all other problem types that are identified during the audit.

Besides classifying issues using the categories enumerated above, each identified issue is also assigned a severity that will be indicative of how important it is to address the problem, and the timeframe necessary to make any changes. The most severe problems are designated project-affecting (PA) issues because if they are not addressed within a very short timeframe, they are likely to put the success of the project in jeopardy. Essentially, the project cannot move forward until a project-affecting issue is resolved. The next most severe problems are designated critical (CR) issues. These are problems that must be solved in order to validate the architecture. Like project-affecting issues, there is a time-pressure associated with critical issues. In this case, if a project is having an architecture review, the project cannot move out of the architecture phase and on to the low level design phase until the problem is resolved. If an issue that is deemed likely to impact customer satisfaction is
Table 1. Problems identified during 50 architecture audits.

Category              PA   Critical   Major   Minor   Rec.   Obs.   Total
Project Management    98       95       71      48     54     96     462
Requirements          43      102      103      72     33     72     425
Performance           14       56       71      32     34     32     239
OA&M                   9       12       95      50     32     29     227
Design                14       29       76      42     59     41     261
Technology             8       25       41      16     25     30     145
Network                0        1        6       1      2      0      10
Security               7       20       29      11     14      9      90
Testing                0        1        4       3      2      3      13
Other                 14       17       18       1     10     16      76
Total                207      358      514     276    265    328    1948
identified, then we classify it as a major issue, while if a problem is identified that might impact future releases, then it is classified as a minor issue. There are also categories called recommendations and observations. These are of lower consequence than the others, but should still be examined and carefully considered. However, there is not the same notion of immediacy as there is with the categories project-affecting or critical.

In order to get insights into the data collected from the 50 reviews, we created a table categorizing all findings by type and severity. This is shown in Table 1. There are a number of interesting observations that become obvious from this tabulated data. The first is that there is a very uneven distribution of issues by type. In particular, 47% of all project-affecting issues were project management issues, and about 27% of all critical issues were also classified that way. Overall, more than a third of all project-affecting and critical issues identified during the 50 reviews included in this study were categorized as being project management issues. Since these problems were so pervasive in this study, and considering the types of problems such findings represented, we hypothesized that project management problems were likely to play a key role in differentiating between high risk and low risk projects. This led us to make an initial attempt at a risk prediction metric that simply considered the percentage of project-affecting problems that could be attributed to project management issues for a given project. Another related metric that we looked at was the total percentage of both project-affecting and critical problems attributable to project management issues. Our hypothesis was that projects with low percentages of project-affecting problems that could be attributed to project management were likely to be at low risk of failure, while projects at high risk of failure would have high percentages of the most severe problems attributable to project management.
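To make the arithmetic behind these percentages concrete, the following Python sketch (ours, not part of the original study) recomputes the dataset-level figures from the totals in Table 1; the per-project counts in the final step are hypothetical placeholders rather than data from any of the audited projects.

```python
# A sketch of the category-share computation discussed above. The aggregate
# counts come from Table 1; the per-project counts at the end are hypothetical.

TABLE1_PROJECT_AFFECTING = {
    "project management": 98, "requirements": 43, "performance": 14,
    "OA&M": 9, "design": 14, "technology": 8, "network": 0,
    "security": 7, "testing": 0, "other": 14,
}  # sums to 207
TABLE1_CRITICAL_PROJECT_MANAGEMENT = 95
TABLE1_CRITICAL_TOTAL = 358

def category_share(counts_by_category, category):
    """Fraction of the findings at one severity level that fall into one category."""
    total = sum(counts_by_category.values())
    return counts_by_category.get(category, 0) / total if total else 0.0

# Dataset-level shares cited in the text (roughly 47% and 27%).
print(round(100 * category_share(TABLE1_PROJECT_AFFECTING, "project management")))  # 47
print(round(100 * TABLE1_CRITICAL_PROJECT_MANAGEMENT / TABLE1_CRITICAL_TOTAL))      # 27

# The same helper applied to a single project's project-affecting findings
# gives the candidate per-project indicator (hypothetical counts shown here).
example_project_pa = {"project management": 3, "requirements": 1}
print(category_share(example_project_pa, "project management"))  # 0.75
```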
In order to determine the reasonableness of this conjecture, we chose the reviews of seven projects from among the 50 architecture audits in our database. For each of these seven reviews, as well as for a number of other audits, we had personally served as a member of the review team. We had also had responsibility for the assignment of the review team members for each of the 50 reviews, and we were therefore quite sure that all of the audits in our database were staffed by the most talented individuals available and that all of the reviews were carefully conducted. Nevertheless, the audits that we attended were, of course, the only ones for which we had first-hand knowledge of the way they had been conducted, and therefore the only ones about which we could attest to the meaningfulness of their results. In order to make sure that our study included a wide range of situations, we selected four projects (A, B, C, and D) that were considered by the review team to be at low risk of failure, as well as three projects (E, F, and G) that were considered to be at high risk of failure. At this stage, the risk determination was made based on the team members' experience and intuition. Our goal was to come up with a more objective and uniform way to assess this risk, and that was our motivation in trying to define a metric that could be uniformly applied by the review team at the time a discovery or architecture review was performed.

For the reasons outlined above, we began by considering just the percentage of project-affecting issues that were attributed to project management problems as an indicator of likely project success or failure. There were four projects considered to be at low risk of failure among the seven projects we selected for our study. Of these, projects D and B had 75% and 100%, respectively, of their project-affecting findings attributable to project management causes. Note that this was significantly higher than the average percentage (47%) cited above for all of the projects. In contrast to this, projects A and C had, respectively, 0% and 20% of their identified project-affecting findings attributable to project management causes, which was significantly lower than the average. But looking at the detailed data for these four projects reveals that considering a percentage can be problematic or misleading. For example, Project A had no project-affecting issues in any category, and hence the percentage was considered to be 0. Project B, for which 100% of the project-affecting issues were categorized as being attributable to project management causes, had only a single project-affecting finding, which was classified as a project management problem. Had the project been in less good shape, and had more issues, that percentage might well have gone down.

Consideration of Project C revealed an interesting situation. There were fifteen project-affecting problems identified, of which three were project management issues. The remaining twelve were assigned the category "Other". The project was at a very early stage, and this was a discovery review to help the team weigh the benefits and risks of making certain decisions. The team had thought of using a new middleware product with which the organization had no production experience. During the discovery review, it was decided not to use this middleware.
Since all twelve of these "Other" findings were related to issues that would arise only if this middleware was used, all of these problems evaporated, thereby effectively reducing the total number of project-affecting issues for this project to three, all classified as being project management issues.
There were four project-affecting problems identified for Project D, of which three were classified as being project management issues, making the percentage of interest 75%. We therefore concluded that although all of the low risk projects had small numbers of project-affecting issues of any type identified during the architecture audits (assuming that we discounted the middleware-related problems for Project C), the percentages of project-affecting problems that were classified as being caused by project management issues varied significantly. This indicated that this percentage was not a good predictor of likely low risk of project failure.

To see whether this percentage was more successful at correctly identifying high risk projects, we looked at the percentage for Projects E, F, and G. Here again there were large differences in the percentages of project-affecting problems that could be attributed to project management. For Project E, this percentage was 50%, which was close to the average (47%), but for Project F the percentage was 100%, while for Project G it was only 12%. Even looking at the low risk and high risk projects as groups was not helpful. While the overall percentage of project-affecting issues attributable to project management averaged 47% for the 50 projects for which we had audit data, for the projects we had identified as being of low risk that percentage averaged 35%, and for the high risk projects that percentage was only slightly higher (and still less than the overall average) at 38%. Clearly, the percentage of project-affecting issues attributable to project management alone was not a useful metric.

We also considered a similar metric using the percentage of project-affecting issues categorized as requirements issues. We had found that for the 50 projects in our database, the average percentage was 21%. Projects A, B, and C, which were all rated as low risk projects, had no project-affecting issues attributed to requirements problems, but that was also true for Project F, which was a high risk project. High risk projects E and G both had percentages that were significantly above the average: 50% and 41%, respectively. For Project D, the percentage was 25%, somewhat above average. Thus, considering the percentage of project-affecting issues detected that could be attributed to requirements problems was also not an appropriate choice of metric.

Since considering the percentage of project-affecting issues attributable to a single fault category did not produce a useful metric, we next considered the overall percentage of the problems uncovered during an audit that were deemed to be project-affecting. Thus, for each project, we looked at the fraction of all issues identified for that project that were classified as being project-affecting. For the 50 projects, the average percentage was 11%. Among the low risk projects, the percentage varied significantly, from 0% for Project A, to 5% for Projects B and D, to 27% for Project C. When we discounted the twelve findings related to the selection of the middleware product, the percentage for Project C dropped to 7%, making the percentage for all four low risk projects less than the 11% average. For the high risk projects, each of the three projects had a higher percentage of project-affecting problems than the overall value, but sometimes it was only a little above the average. For Project E the percentage was 21%, for Project F it was 13%, and for Project G the percentage was 38%.
Even though all of the low risk projects had percentages of project-affecting issues that were below average, and all of the high risk projects had percentages of project-affecting
Table 2. Number of project-affecting issues.

A    B    C    D    E    F    G
0    1    3    4    6    6    17
issues that were above average, we concluded that the differences were not clear enough to use this as a predictor of likely project success. To use it as a predictor, one would have to be able to choose a value X such that any project for which the percentage of the issues uncovered during a review that were categorized as being project-affecting exceeded X% would be considered to be at high risk. Considering the seven projects in this phase of our study, what value would we recommend that X be? Although a value between 8% and 10% would work for these seven projects, we were reluctant to recommend this as a predictor. One would expect that projects with percentages only slightly above X% would be at lower risk than those with significantly higher percentages, but that did not seem true for these projects. For this reason, we decided to continue looking for other possible predictors of likely project success.

Since there was some potential usefulness in considering the percentage of issues classified as project-affecting as a metric, we next decided to consider the absolute number of project-affecting issues for a project, to see if that might be a useful predictor of likely project success. Again, we did see some possibility that it would be useful. Once the issues related to middleware selection were removed from consideration for Project C, the numbers of project-affecting issues for the seven projects were as shown in Table 2. As in the case of percentages of project-affecting issues, when we considered the absolute numbers of such issues, we saw that all of the low risk projects had fewer project-affecting issues than any of the high risk projects. But again the differences were not always great. In addition, there were significant differences in project size, and so it seemed unlikely that a simple count like this would provide a reliable metric in general.

At this point we concluded that we should consider more complex metrics for predicting likely project success, but we were still convinced that the use of data collected during discovery reviews and architecture reviews could be helpful in providing guidance regarding the likelihood of a project being at risk if it was continued without significant intervention. When we thought about why the metrics we had tried were not really satisfactory, we concluded that we needed a metric that not only penalized a project for identified issues, but also rewarded the project for common problems that it did not have. We therefore developed a detailed questionnaire that contained a tabulation of the most commonly-occurring architecture problems, with a severity associated with each such problem. Part of the reason for doing this was to try to make the reviews more uniform and less dependent on the particular reviewers on the team. Another part of the reason was to see if we could use it as the basis for a metric. For each project-affecting issue, the review team was asked to assign a value between 0 and 10, depending on the degree to which that project had that problem. So, for example, if a project had unrealistic schedules, they might
be slightly optimistic, in which case the team might assign a value of 2, or they could be totally unrealistic and be assigned a value of 10. For each critical issue, the team can assign a value from 0 to 5. We include the complete questionnaire in Section 4. In Section 5, we describe how this questionnaire was used as the basis for a metric that we have used to assess the likelihood of failure for the seven projects in our study. Each of these projects was assessed as being at high risk of failure, at moderate-high risk, at moderate risk, at moderate-low risk, or at low risk of failure. We also compare the rating of the projects by experienced reviewers with the assessment made by the metric.

4. Common Architectural Problems
In order to develop the questionnaire to be used for all architecture audits, we compiled a list of the most common issues encountered during earlier reviews that were likely to have a significant impact on project success. This was done by tapping into the experience of some of our most senior architects. Our hope was that this questionnaire could be used as the basis for defining a metric that would be appropriate to use to predict the likelihood that the project was at high risk of failure.

We included problems in three categories: project management, which accounted for almost half of the project-affecting issues in the 50 reviews that were in our database, and requirements and performance, since significant numbers of project-affecting and critical issues were also attributed to these categories. Together, these three categories comprised 75% of the project-affecting problems and 71% of the critical issues encountered during the 50 architecture audits. We included only problems judged to be either project-affecting or critical, since these are the ones that tend to have the most significant impact on the project's success or failure. By placing these limitations on the identified problems, we expected to be able to limit the questionnaire to a manageable size. Experience with earlier questionnaire design (Ostrand and Weyuker, 1984) has taught us that when questionnaires become too lengthy, people tend to become careless when filling them out. We learned that it was important to have a relatively simple categorization scheme, since the data collected using such a questionnaire is much more likely to be accurate, and therefore more useful, than if the scheme is extremely complex and requires the person completing it to spend a large amount of time making relatively subtle distinctions between categories.

The most commonly-occurring project-affecting issues, by category, are the following (an illustrative encoding of the full checklist appears after these lists):

Project Management

• The team has not identified the stakeholders.
• The team has not identified domain experts.
• The team has not formalized project funding.
• The team has not identified a project manager/leader.
• A project plan/schedule has not been put in place.
• There is an unrealistic deployment date.
• The team has not identified measures of success.
• Failure to select software architects.
• Each layer has an architect assigned; however, a chief architect with responsibility for the overall architecture has not been selected.
• The team has not written an overall architecture plan.
• An independent requirement writing team has not been selected.
• The team has not put a hardware and installation schedule in place.
• An independent team responsible for performance issues has not been selected.
• A quality assurance organization has not been selected.
• The team has not developed a system test plan.
• A tracking system for modification requests (MRs) has not been established.
• No contingency plan has been provided.

Many of these problems could have been averted by having a clearly defined project management team in place. Often when the types of problems listed above occur, it is because the project is very large, with many different managers whose primary function involves personnel management rather than project management. The difference between a project management team and a personnel management team is that the former has primary responsibility for assuring that the project is a success by monitoring its progress, while a personnel manager is charged with guiding a person's success by monitoring his or her output and progress. If no one has responsibility for a project's success, then problems such as those listed above often occur.

Requirements

• The team has not provided a clear statement of the problem.
• The team has not identified decision criteria that will be used to select the software architecture.
• The team has not provided a requirements document.
• An assessment of the size of the expected user community has not been done.
• The team has not provided a statement of the project's data storage requirements.
• The team has not provided a statement of the amount of data loss that the project can tolerate.
• The team has not identified expected outputs.
• The team has not defined operations administration and maintenance (OA&M).
• There are insufficient resources allocated for a new requirement.

Performance

• Performance requirements have not been established by the end user.
• The team has not gathered performance data.
• The team has not established performance budgets.
• The team has not established expected traffic rates.
• The team has not established a mechanism to measure either the number of transactions or the length of a transaction.
• The team has not established a mechanism to assure the handling of the required throughput.
• The team has not assured that processing requirements can be met by the hardware.
• A performance model is lacking.

The most commonly-occurring critical issues by category are:

Project Management

• The team has not allocated sufficient time for testing.
• The team has not established a plan to move the project to OO technology.
• The team has not allocated sufficient resources for build tasks.

Requirements

• Requirements are missing for a feature.
• The scope of the project is too broad.
• The team has not clarified some requirements.
• The team has not prioritized the requirements.
• The team has not characterized anticipated system usage.
• The team has not established a service agreement with a vendor.

Performance

• The team has not determined the user component of load on the server.
• The team has not determined the expected number of clients for the server.
• The team has not defined input processing requirements.
• The team has not established response time requirements.
• The team has not justified the need for a cache.
• A network performance model is lacking.
• There is an incomplete overall performance model.
• The team has not determined expected data volumes between clients and servers.
• The team has not performed an analysis of the implications of coexisting applications.
• The team has not determined network (WAN/LAN) capacity.
• No process flow diagram has been provided.
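As a rough illustration only (this encoding is ours and was not part of the review process), the checklist above can be represented as per-category item counts together with the per-item point values described in Section 5: up to 10 points for each project-affecting item and up to 5 points for each critical item.

```python
# Hypothetical encoding of the questionnaire above: item counts per category
# and severity, with the maximum point value per item (10 for project-affecting,
# 5 for critical) as described in Section 5.
CHECKLIST = {
    # (severity, category): (number of items, max points per item)
    ("project-affecting", "project management"): (17, 10),
    ("project-affecting", "requirements"):       (9, 10),
    ("project-affecting", "performance"):        (8, 10),
    ("critical", "project management"):          (3, 5),
    ("critical", "requirements"):                (6, 5),
    ("critical", "performance"):                 (11, 5),
}

max_score = sum(n_items * points for n_items, points in CHECKLIST.values())
print(max_score)  # 440, matching the worst-case total given in Section 5
```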
5. The Risk Metric
Since the earlier simple metrics that we considered were not entirely satisfactory as predictors of likely project risk, we decided to rely on information extracted from the data we collected using the questionnaire. As mentioned above, we assigned a value between 0 and 10 for each identified project-affecting issue, and a value between 0 and 5 for each critical issue. If every possible issue were as severe as it could possibly be, there would be a total score of 440, with 170 points coming from project management project-affecting problems, 90 points coming from requirements-related project-affecting problems, and 80 points coming from performance-related project-affecting problems. In addition, 15 points were possible from project management critical issues, 30 points might come from requirements-related critical issues, and another 55 points could come from performance-related critical issues. We next identified five ranges of risk:
• Low Risk: score < 75,
• Moderate-Low Risk: 75 ≤ score < 150,
• Moderate Risk: 150 ≤ score < 225,
• Moderate-High Risk: 225 ≤ score < 300,
• High Risk: 300 ≤ score.

Table 3. Risk computed for 7 projects.

Project   Score   Risk
A          10     Low
B          25     Low
C          50     Low
D         197     Moderate
E         230     Moderate-High
F         280     Moderate-High
G         330     High
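As a minimal sketch (the function and the tabular input format are ours, purely for illustration), the mapping from a questionnaire score to one of these risk ranges, applied to the scores reported in Table 3, looks like this:

```python
# Sketch: map a questionnaire score to one of the five risk ranges above.
# The thresholds come from the ranges listed in the text; the function
# itself is illustrative and not part of the original review process.
def risk_rating(score: int) -> str:
    if score < 75:
        return "Low"
    if score < 150:
        return "Moderate-Low"
    if score < 225:
        return "Moderate"
    if score < 300:
        return "Moderate-High"
    return "High"

# Scores of the seven projects as reported in Table 3.
table3_scores = {"A": 10, "B": 25, "C": 50, "D": 197, "E": 230, "F": 280, "G": 330}
for project, score in table3_scores.items():
    print(project, score, risk_rating(score))
# A 10 Low, B 25 Low, C 50 Low, D 197 Moderate,
# E 230 Moderate-High, F 280 Moderate-High, G 330 High
```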
We applied the metric and computed scores for the seven selected projects. Table 3 presents these scores and the associated risk assessment. Recall that the reviewers had categorized Projects A, B, and C as low risk projects based on their experience, and had categorized Projects E, F, and G as high risk projects. It was good to see that the metric categorized them similarly. It was surprising and disappointing to us, however, that Project D was rated by the experts as a low risk project, and yet was rated by the metric as being at moderate risk of failure. It was important for us to determine whether the problem was that the metric was inappropriate under certain circumstances, the reviewers’ assessment was incorrect, or if this particular project had some special characteristics that accounted for this discrepancy. When we examined the architecture audit findings, however, the issue was clarified. Project D, we concluded, was indeed a project at low risk of failure. Investigation of the review findings shows that more than half of the points contributing to the score were categorized as being performance-related. However, this project was doing a discovery review a year before its scheduled release date. At the time that the review had been done, the project had simply not yet addressed its performance issues. This is not unusual for a discovery review since its purpose is to help the project weigh alternatives. It is therefore entirely acceptable for the project not to have finalized some of its plans. By doing a discovery review a year before the scheduled project release, there would be ample time to weigh the alternatives, fill in missing factors, and thereby remove what had looked like risks to the project’s success. This confirmed our opinion that the project really was at low risk of failure, and this experience was valuable since it led us to conclude that the metric was primarily useful for projects that were undergoing an architecture review, rather than one going through a discovery review.
6. Follow-Up Findings
A significant amount of time has now elapsed since the projects went through the architecture audit process and the metrics were computed. We felt it was now important to follow up to see what had happened to the projects. Of course, all projects are supposed to use the results of their reviews to identify and correct problems. Therefore, even a project that was considered at high risk should now be deployed, but with far fewer problems than would have occurred if the architecture had not been assessed and (hopefully) corrected at a very early stage.

Our results were at first surprising. Of the four low risk projects, three had been cancelled. We learned that the reasons for cancellation of each of the projects were entirely independent of the project's quality or progress. They were all cancelled due to a change in business needs and technology. Only Project B from among A, B, C, and D was ultimately deployed. It is currently running successfully in the field. At least some portion of all three of the projects assessed as being at moderate to high risk of failure has been deployed in the field. The projects evidently addressed the identified problems and are now much more stable than they would otherwise have been. In each case they provided an essential business function, and so needed to be deployed close to their scheduled release date. This means that the identification of problems early in the lifecycle was particularly valuable for these projects.

The most important lesson that we learned from our study is that the biggest factor in determining whether or not a project is deployed is the business needs of the organization. This appears to be a much bigger factor than project quality. Given that this is true, it is clear that any identification of problems very early in the lifecycle is absolutely invaluable, since it allows an organization to get a faltering project back on track before the impact of the problems is catastrophic.

We have begun an investigation of simpler metrics that would not require the use of the detailed checklist that we used above. Although we believe that this metric is an excellent indicator of the likely risk of project failure, past experience involving such a checklist for software fault categorization (Ostrand and Weyuker, 1984) has taught us that it is sometimes difficult to get developers to understand the value and necessity of routinely filling out such a checklist carefully and consistently. Therefore, if we could define a metric that came directly from the findings included in the report generated after every architecture review, that would be ideal. For that reason, we are currently collecting separate assessments of each project having an architecture review to see if there is a correlation between some of the data standardly collected and the project's quality.
7. Conclusions
It is standard practice for all of our projects to go through a discovery review and/or an architecture review prior to low level system design. In this paper we proposed using data collected during such a review to predict the risk of failure that a project faces. We developed a questionnaire-based metric that projects can use for this purpose immediately following the completion of the review. In this way, a project that is at high risk of failure can be identified
early, and steps can be taken to rectify problems that have been found, before any low level design or implementation has begun. This should make the process significantly more cost-effective than if architecture problems are allowed to persist through development, making the ultimate system either poorly designed or in need of extensive rearchitecting, redesign, and reimplementation. Without this early intervention, it is likely that either deployment would be delayed, or the project cancelled.

Of course, the findings made during an architecture audit should always be used to improve the project's status, but sometimes a project does not realize just how serious its problems are. For this reason we developed a metric and an associated risk rating categorization scheme. By having a metric that provides both a numerical rating and an assessment of the likelihood of risk, we expect to be able to make the situation entirely clear to the project, and the audit findings should tell them precisely which areas need to be addressed. Since the questionnaire we developed provides a numerical rating indicating the degree to which the project suffers from each specific problem, it also provides guidelines that should help the project prioritize its effort by indicating which areas need to be addressed first.

We used the metric on seven large industrial telecommunications projects. Although all of the projects were done in a telecommunications environment, they included many different types of projects, with different goals, platforms, and requirements. What they all had in common was that they were large, with very high reliability requirements and very high availability requirements. These particular projects were selected because they were included in the database of architecture audit findings that we maintained, and specifically because we had first-hand experience with these projects and reviews, having served as a reviewer or chair of the review team for each. In this way we could be sure that the findings were accurate and complete, and further, that we were able to make an initial assessment of the likelihood of project failure against which we could compare the value determined by the metric during the empirical study described above. Since we had personally served as a member of fifteen different projects' architecture review teams, we were able to select both projects that we considered to be at low risk of failure as well as those that we deemed to be at high risk of failure.

For all but one of the seven projects, the assessment of likely risk made by the metric was in agreement with the reviewers' independent predictions. However, one project appeared to the reviewers to be at low risk of failure and yet was assessed by the metric as being at moderate risk of failure. When we looked at the findings that were responsible for this project's classification as being at moderate risk of failure, this puzzle was resolved. It led us to conclude that the metric was primarily useful for projects undergoing an architecture review, rather than those going through a discovery review, since this latter type of review is done to help a project make decisions. We hope to see the metric and questionnaire instituted as a standard part of the architecture audit process for our organization.
We believe that the metric that we introduced in this paper will significantly contribute to the early identification of projects at high risk of failure, as well as the determination of which particular phases of development have the greatest need of attention. We have found our preliminary results to be very encouraging. We are also continuing our investigation of other, simpler, metrics that would be indicative of project risk, but would not require the use of a detailed checklist.
Acknowledgments

There are many people in our organization who served as members of review teams, and we are very grateful for their dedication. We are particularly indebted to Emile Hong, Joe Tashjian, Stan Olochwoszcz, and Dawn Estelle for their support. The help of all of these people contributed to the success of this project, and we could not have done this work without it.
Notes

1. p. 20
2. p. 269
3. p. 1
References

Abowd, G., Bass, L., Clements, P., Kazman, R., Northrup, L., and Zaremski, A. 1997. Recommended best industrial practice for software architecture evaluation. Technical Report CMU/SEI-96-TR-025, available at http://www.sei.cmu.edu/products/publications.
Garlan, D., Allen, R., and Ockerbloom, J. 1995. Architectural mismatch: why reuse is so hard. IEEE Software: 17–26.
Garlan, D., and Perry, D. 1995. Introduction to the special issue on software architecture. IEEE Trans. on Software Engineering: 269–274.
Ostrand, T. J., and Weyuker, E. J. 1984. Collecting and categorizing software error data in an industrial environment. J. Systems and Software 4: 289–300.
Shaw, M., and Garlan, D. 1996. Software Architecture—Perspectives on an Emerging Discipline. Prentice-Hall, Inc.
Alberto Avritzer received a Ph.D. in Computer Science from the University of California, Los Angeles, an M.Sc. in Computer Science from the Federal University of Minas Gerais, Brazil, and a B.Sc. in Computer Engineering from the Technion, Israel Institute of Technology. He is currently a District Manager at AT&T Network Services, Middletown, NJ. He spent the summer of 1987 at IBM Research at Yorktown Heights. His research interests are in software engineering, particularly software testing and reliability, real-time systems, and performance modeling, and he has published several papers in those areas. He is a member of ACM SIGSOFT and IEEE.
Elaine Weyuker received a Ph.D. in Computer Science from Rutgers University, and an M.S.E. from the Moore School of Electrical Engineering, University of Pennsylvania. She is currently a Technology Leader at AT&T Labs—Research in Florham Park, NJ. Before moving to AT&T Labs in 1993, she was a professor of Computer Science at the Courant Institute of Mathematical Sciences of New York University, NY, where she had been on the faculty since 1977. Her research interests are in software engineering, particularly software testing and reliability, and software metrics. She is also interested in the theory of computation, and is the author of a book (with Martin Davis and Ron Sigal), "Computability, Complexity, and Languages, 2nd Ed.", published by Academic Press. Dr. Weyuker is a Fellow of the ACM and a senior member of the IEEE. She is a member of the editorial boards of ACM Transactions on Software Engineering and Methodology (TOSEM) and the Empirical Software Engineering journal, and is an Advisory Editor of the Journal of Systems and Software. She has been the Secretary/Treasurer of ACM SIGSOFT, has served on the executive board of the IEEE Computer Society Technical Committee on Software Engineering, and was an ACM National Lecturer.