Applying Benchmarking to Learn from Best Practices

Andrew Beitz and Isabella Wieczorek

Fraunhofer IESE, Sauerwiesen 6, D-67661 Kaiserslautern, Germany
{beitz, wieczo}@iese.fhg.de

Abstract. Assessments are a proven and widely used method to measure an organisation's software process strengths and weaknesses. This helps determine where to start software process improvement programs. However, an assessment uses information internal to an organisation and does not compare its processes to a competitor's. Benchmarking is a way to compare one's practices with those of other organisations. These comparisons reflect what are currently best practices within industry. In combination with assessment results, benchmarking can be used as a useful indicator of which processes to improve, based upon industry assessment data. In this paper we present initial benchmarking results using data from the SPICE (Software Process Improvement and Capability dEtermination) Trials. To obtain the results, we applied an analysis technique called OSR (Optimised Set Reduction). This technique is well suited to finding patterns in a database and deriving interpretable models. We describe the type of benchmarks that are going to be produced for the SPICE Trials participants and how they can be used for process improvement. Lastly, we describe how to integrate benchmarking into an assessment method.

1 Introduction

To improve software processes, one needs to know which processes need improvement. One way to achieve this is to compare the organisation's current set of practices with a set of best practices derived from industry. In this way, organisations learn what works best in other organisations and may then choose to adopt these practices themselves. An assessment is one approach to comparing organisational processes with industry best practices. Assessments provide a disciplined examination of the processes within an organisation to detect the areas that could be improved. The emerging international standard ISO/IEC 15504 (also known as SPICE) [1] integrates software industry best practices and provides a framework for assessing software processes. IT businesses today are using assessments to identify which processes need improvement. However, an assessment does not reflect how well one compares with industry. Therefore, after an assessment it can be difficult to determine or justify which process improvements will make a sufficient difference to the business. Each organisation differs in which processes are critical to achieving its business goals. Benchmarking is one technique that shows where to focus the improvement effort based upon the needs of the organisation. It allows an organisation to compare its performance with other projects or organisations to identify which best practices lead to better performance.

Fraunhofer IESE has developed a method, called FAME (Fraunhofer Assessment MEthod) [http://www.iese.fhg.de/fame], for performing assessments in a reliable and cost-efficient way [2]. FAME effectively applies existing well-known assessment methods like SPICE and BOOTSTRAP™, and uses the assessment standard ISO/IEC 15504 for software process assessment. It helps to determine the strengths and weaknesses of a company's current software processes and supports the company in making well-informed decisions on process improvement. One of the main components of FAME is the benchmarking service offered at the end of an assessment. This component contains state-of-the-art techniques, such as OSR (Optimised Set Reduction) [4], [5], and uses the assessment results to determine which best practices within an organisation can lead to better performance. Fraunhofer IESE has developed an OSR tool to offer such benchmarking services.

The SPICE project [6] was established in June 1993 to validate the use of ISO/IEC 15504. To help validate the upcoming standard, the SPICE project has conducted trials in three phases. The success of OSR has led to its use as the benchmarking tool for phase 3 of the SPICE Trials [7]. The SPICE project is now seeking participants for phase 3 of the trials. Participants who submit their assessment results will receive automatic international benchmark profiles from the official SPICE web site [http://www.sqi.gu.edu.au/spice/]. The SPICE project expects a large number of participants for phase 3 of the trials.

The paper starts by providing background information about the SPICE Trials and the concepts of benchmarking. We explain how OSR can be applied to provide benchmarking services for the SPICE Trials. We also provide some examples produced by the benchmarking tool and their interpretation. The examples are based upon the data set from phase 2 of the SPICE Trials. Further research into benchmarking will help determine which benchmarks will best help the SPICE trial participants of phase 3 to better focus their improvement efforts within their organisation. Lastly, we describe how benchmarking can be successfully used and integrated within an assessment method, like FAME, for participants of the SPICE trials.

2 Background

2.1 The SPICE Trials

Process assessment examines the processes used by an organisation to determine whether they are effective in achieving their goals. ISO/IEC TR 15504 has been widely recognised and used around the world as the emerging international standard for process assessment. It provides a common framework for the assessment of software processes. This framework can be used by organisations involved in planning, managing, monitoring, controlling, and improving the acquisition, supply, development, operation, evolution, and support of software. It has also been validated internationally in the SPICE trials [7], where it has proven useful for performing assessments. The SPICE trials are the most extensive joint effort of industry, the public sector, and academia to collect and validate process assessment knowledge. Research into the SPICE Trials is leading to a better understanding of how to improve assessments and provide better guidance towards process improvement. The results of this analysis, along with research at Fraunhofer IESE, are constantly being incorporated into the development of the FAME method.

In an assessment, the processes used in an organisation are mapped to the best practices defined in the ISO/IEC TR 15504 framework. This results in an assessment profile that shows which processes are being performed and how well they are being performed. Fig. 1 shows an example of an assessment profile for an organisation using the ISO/IEC TR 15504 framework. The profile depicts a two-dimensional view of the best practices performed: the process dimension and the capability dimension, both of which are described in detail in ISO/IEC TR 15504-2 (the reference model) [1]. The process dimension describes 29 processes for software development, grouped into the areas Customer-Supplier (CUS), Engineering (ENG), Support (SUP), Management (MAN), and Organisation (ORG). In Fig. 1, some of the processes are displayed on the horizontal axis, including software Development (ENG_1), Maintenance (ENG_2), Acquisition (CUS_1), and Documentation (SUP_1). The capability dimension describes how well these practices are performed (i.e., the management of each practice). In Fig. 1, the capability dimension is displayed vertically. There are six capability levels: Incomplete (Level 0), Performed (Level 1), Managed (Level 2), Established (Level 3), Predictable (Level 4), and Optimizing (Level 5). For example, ENG_1 has a capability score of 3, meaning that the process of developing the software (ENG_1) has been established (Level 3) within the organisation.

Fig. 1. An example of an assessment profile using ISO/IEC 15504 (capability levels 0 to 5 shown vertically for the processes ENG_1, MAN_1, CUS_1, and SUP_1).
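To make the two-dimensional profile concrete, the following minimal Python sketch (not part of ISO/IEC 15504 or FAME) shows one way an assessment profile could be represented; the ratings for ENG_1 and MAN_1 follow the example discussed in the text, while those for CUS_1 and SUP_1 are assumed for illustration.

CAPABILITY_LEVELS = {
    0: "Incomplete",
    1: "Performed",
    2: "Managed",
    3: "Established",
    4: "Predictable",
    5: "Optimizing",
}

# Process dimension: process identifier -> assessed capability level (0-5).
assessment_profile = {
    "ENG_1": 3,  # software Development (from the example in the text)
    "MAN_1": 1,  # Manage the Project (from the example in Section 3.3)
    "CUS_1": 2,  # Acquisition (assumed)
    "SUP_1": 2,  # Documentation (assumed)
}

for process, level in assessment_profile.items():
    print(process, "-> level", level, "(" + CAPABILITY_LEVELS[level] + ")")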

2.2 The Process of Benchmarking

In the past, the word “benchmark” has been used in various ways. The ability to benchmark the performance of their projects provides software organisations with a number of advantages. These include the ability to determine whether they are competitive in a given business sector and whether a significant productivity improvement is required for sustaining a particular business. For the construction of benchmarks, three issues need to be clarified. First, we define the general concepts of benchmarking. Second, we explain how values are compared, and third, how similarity is defined.

2.2.1 General Concepts

Benchmarking is a data-intensive process. This means that it is necessary to have a benchmarking database containing performance measures as well as other variables for a set of completed projects. Benchmarking is usually performed after a project is completed. Thus, one can use variables that are available at different stages of a project (e.g., at the beginning or at the end of a project). The basic benchmarking process is to identify a subset of projects in the database that are similar to the project to be benchmarked. Then, the project is compared to this subset to determine whether its performance is better or worse and by how much. In practice, benchmarking presents a number of difficulties. For example, what is meant by “compare productivity”, and how does one determine whether another project is “similar”? Of course, there are simplistic answers to these questions. However, these generally do not provide satisfactory results when put into practice.

2.2.2 Comparing Performance

Let us assume that we have found the “similar” projects, and now we want to compare our productivity with the productivity of those similar projects. Using productivity to compare project performance means that our variable is measured on a continuous scale. Thus, an obvious approach is to take the average productivity of the similar projects and see whether we are above or below it. This has a number of disadvantages. First, as is common with software project productivity data, there are likely to be some extreme productivity values among these similar projects. Such extreme values can have a large impact when using the average, hence giving a distorted benchmark result. This problem can easily be solved by using a robust statistic, such as the median, instead of the average. The second difficulty is that if we know, let's say, that our project is above the median productivity of similar projects, this still does not tell us by how much. Therefore, we need some measure of “distance” between the median project and our project. One approach is to convert a raw score into a percentile score. This has the advantage that no assumptions about the distribution of scores in the population need be made, and also that the derived score is intuitive. In the context of productivity benchmarking, one may choose to have four ranges that are equivalent to the quartiles of the productivity distribution of similar projects. The four ranges then represent four productivity benchmark levels.

Using the capability process rating values of SPICE to compare project performance, we deal with discrete variables. Thus, we have to take a slightly different approach from the one described above. For an identified set of similar projects, we use the SPICE capability ratings (see Section 2.1) and can thus derive a distribution of capability levels for the set of similar projects. Let's say our current project falls within capability level 2 and the distribution of projects similar to it is: level 1: 0%, level 2: 20%, and level 3: 80%. This means that 80% of the similar projects are one level above the current project, which indicates that the current project's assessment was relatively low compared to other similar projects.
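The following small Python sketch illustrates the two comparisons described above. It is not the SPICE Trials tooling; the productivity figures and capability levels are invented for the example.

import statistics
from collections import Counter

def productivity_benchmark(own_productivity, similar_productivities):
    # Position a project's productivity against similar projects using the
    # median (robust against extreme values) and a percentile score, and map
    # the percentile onto four quartile-based benchmark levels.
    median = statistics.median(similar_productivities)
    below = sum(1 for p in similar_productivities if p < own_productivity)
    percentile = 100.0 * below / len(similar_productivities)
    benchmark_level = min(int(percentile // 25) + 1, 4)  # levels 1..4
    return median, percentile, benchmark_level

def capability_distribution(similar_levels):
    # Discrete case: distribution of capability levels among similar projects.
    counts = Counter(similar_levels)
    total = len(similar_levels)
    return {level: 100.0 * counts[level] / total for level in sorted(counts)}

# Invented productivity values; note the extreme value 9.5 that would distort an average.
print(productivity_benchmark(0.9, [0.4, 0.7, 0.8, 1.1, 1.3, 9.5]))
# The example from the text: 20% of similar projects at level 2, 80% at level 3.
print(capability_distribution([2, 3, 3, 3, 3]))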

2.2.3 Defining Similarity

The next question is how do we find “similar” projects? Similarity has to be defined with respect to some set of project characteristics, such as application domain, region, or primary business area (characteristics referred to as attributes). Ideally, similar projects should have similar values on these attributes as well as performance values that do not vary too much. For example, let's say that our project has a team size of 7. If the “similar” projects are those that have team sizes between 7 and 10, then this class of projects in our benchmarking database should not have performance ratings that vary, say, tenfold. Otherwise, they are not similar enough, since they represent projects that vary too greatly in their performance. It then becomes prudent to use other variables that partition the set of projects with team sizes between 7 and 10 into smaller subsets to reduce the variation in performance. The above discussion indicates that the attributes have to represent important variables that distinguish companies in terms of their performance. But the attributes also have to be of importance for business decisions. For example, if my business domain is aerospace, I would not be interested in benchmarking my projects against projects in the computer games domain. Therefore, application domain would be an important variable to consider when identifying similar projects. If one is using an already existing database (as in our case), then the potential attributes are predefined. The attributes that we use are presented later in this article.

There are many analytical techniques that can be used for identifying similar projects. An obvious one is cluster analysis (also known as analogy in software engineering, e.g. [8]). However, this generally leads to clusters that are not optimal in terms of the variation in performance values. The reason is simple: cluster analysis only considers the attributes and does not take into account the actual performance values. A class of data analysis techniques that build clusters taking both the attributes and the productivity values into account are regression trees [3]. Another technique used in this context is the OSR technique [4], [5].
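As a simple illustration of the partitioning idea discussed above, the following Python sketch (with invented project records, not an actual benchmarking database) groups projects by a single attribute and checks how much performance varies within each group.

def partition(projects, attribute):
    # Group projects by the value of a single attribute (e.g. application domain).
    groups = {}
    for project in projects:
        groups.setdefault(project[attribute], []).append(project)
    return groups

def performance_spread(group):
    # Ratio of the best to the worst performance value within a group.
    values = [p["productivity"] for p in group]
    return max(values) / min(values)

# Invented project records.
projects = [
    {"domain": "aerospace", "team_size": 8, "productivity": 0.7},
    {"domain": "aerospace", "team_size": 9, "productivity": 1.0},
    {"domain": "aerospace", "team_size": 7, "productivity": 8.0},
    {"domain": "games", "team_size": 10, "productivity": 2.0},
]

for domain, group in partition(projects, "domain").items():
    # A large spread (e.g. tenfold) indicates that the group should be partitioned
    # further on additional attributes before being used as a benchmark baseline.
    print(domain, len(group), "projects, spread", round(performance_spread(group), 1))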

2.3 The Benchmarking Tool

The tool used for benchmarking the SPICE database is based on the OSR algorithm [4], [5]. OSR is a pattern-recognition-based data analysis technique that determines subsets in a historical data set. It combines machine learning principles with robust statistics to derive the subsets that best characterise the object (in our case, a project) to be assessed. The generated model consists of a collection of logical expressions (patterns) that represent trends in a data set. OSR has several main advantages: it is easy to use and interpret; it can deal with many variables of different types (discrete and continuous); and it can easily be used for benchmarking a given new project, since the past projects in the corresponding subset can be used as a baseline of comparison.

OSR dynamically builds a different model for each project to be benchmarked. The underlying data set for each project is decomposed into subsets.

The projects that have the same value (or belong to the same class of values) as the target project for a significant project attribute are extracted and build the subset. This is done recursively on each generated subset. The set reduction stops when a termination criterion is met, for example, when a subset consists of fewer than a certain number of projects or when no project attribute appears significant. A characteristic of the optimal subsets is that they have optimal probability distributions over the range of the dependent variable (in our case, process capability levels). This means that they concentrate a large number of projects in a small number of dependent variable categories (if categorical) or on a small part of the range (if continuous). A prediction is made based on the terminal subset that optimally characterises the project to be assessed; the most frequent capability level within this subset is used as the predictor.

An optimal subset is characterised by a set of conditions that are true for all projects in that subset, for example, “Var1 = low AND Var2 ∈ Class1”, where “Var1” and “Var2” can be any project attributes, and “low” and “Class1” are values in the given ranges of values for “Var1” and “Var2”, respectively. This logical expression is true for a given target project, and one can compare its process capability level with the predicted one, which is based on the probability distribution of this optimal subset. In contrast to other machine learning algorithms, OSR does not generate a static model based on a given data set. Rather, it dynamically generates patterns depending on the project to be benchmarked.

Fraunhofer IESE has developed a tool that automates this algorithm. Given the SPICE data set, OSR was used to build individual models (logical expressions) for each project in the data set. A v-fold cross-validation approach [9] was followed, building the model for a project based on the remaining projects in the data set. This is done for each project in turn. Thus, each project is characterised through an OSR model that determines a subset of projects that are similar to the current project. A set of generated patterns (identical models) was then analysed to find general trends in the data. These results are presented in Section 3.2.
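The following simplified Python sketch conveys the set-reduction idea described above. It is not the IESE OSR tool: it handles only discrete attributes via equality (OSR also handles continuous attributes through value classes), and the project records and threshold are invented.

from collections import Counter

def concentration(levels):
    # Share of the subset falling into its most frequent capability level.
    counts = Counter(levels)
    return counts.most_common(1)[0][1] / len(levels)

def osr_like_reduction(projects, target, attributes, min_size=5):
    # Recursively keep the projects that share the target's value on the
    # attribute that best concentrates the capability-level distribution;
    # stop when the subset would become too small or no attribute improves it.
    subset = projects
    conditions = []
    while True:
        best = None
        for attribute in attributes:
            candidate = [p for p in subset if p[attribute] == target[attribute]]
            if len(candidate) < min_size:
                continue
            score = concentration([p["level"] for p in candidate])
            if best is None or score > best[0]:
                best = (score, attribute, candidate)
        if best is None or best[0] <= concentration([p["level"] for p in subset]):
            break  # termination criterion met
        conditions.append(best[1] + " = " + str(target[best[1]]))
        subset = best[2]
    # The most frequent capability level in the final subset is the prediction.
    predicted = Counter(p["level"] for p in subset).most_common(1)[0][0]
    return conditions, predicted, subset

# Invented toy data, not SPICE Trials data.
projects = [
    {"RE": "Europe", "BU": "Finance", "level": 2},
    {"RE": "Europe", "BU": "Finance", "level": 2},
    {"RE": "Europe", "BU": "Finance", "level": 2},
    {"RE": "Europe", "BU": "IT", "level": 3},
    {"RE": "Canada", "BU": "Finance", "level": 1},
    {"RE": "Canada", "BU": "Finance", "level": 3},
    {"RE": "Canada", "BU": "IT", "level": 3},
]
target = {"RE": "Europe", "BU": "Finance"}
print(osr_like_reduction(projects, target, ["RE", "BU"], min_size=3))
# -> (['RE = Europe', 'BU = Finance'], 2, <the three matching projects>)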

3 Industry Benchmarks

Fraunhofer IESE plays a major role in the SPICE Trials in developing benchmark results for participants of the Trials. In the SPICE trials, benchmarking is performed against each process assessed, so the result is a benchmark profile. The benchmark profile will allow participants of the trials to determine where their processes are positioned within their industry. The information presented in this paper and to participants is aggregated to ensure confidentiality of all data in the international SPICE Trials database.

The goal of the benchmark in the SPICE trials is to predict the assessment rating of a process based on influential factors. Influential factors are used to group similar organisations against which to benchmark. For example, a small telecommunications company in Europe might want to compare itself against all other small telecommunications companies in Europe. One could also identify other factors that would be interesting to use in benchmarking. For example, the leading company in Europe among small telecommunications businesses may want to compare itself to large telecommunications businesses or even to companies world-wide. The benchmark that best fits a company will be based on where the organisation is currently positioned within industry and what business goals the company wants to achieve.

3.1 The Database

The SPICE Trials have collected a large amount of assessment data in phase 2, including a variety of project attributes, such as the region, the business area, the targeted business, the number of staff, or ISO 9000 certification. For our analysis, we used a subset of the database consisting of 168 projects (i.e., organisational units) assessed up to capability level 3. The regions are divided into South Asia Pacific, Canada, Europe, North Pacific, and USA. The biggest contributing regions to phase 2 of the trials were Europe (41%) and South Asia Pacific (37%).

The types of businesses assessed in the SPICE trials were: Finance (including Banking and Insurance), Business Services, Petroleum, Automotive, Public Utilities, Aerospace, Telecommunications and Media, Public Administration, Consumer Goods, Retail, Distribution/Logistics, Defence, IT Products and Services, Health and Pharmaceutical, Leisure and Tourism, Manufacturing, Construction, Travel, Software Development, and others. The majority of the assessments came from Telecommunications and Media (24%), IT Products and Services (18%), Software Development (17%), and Finance (16%). The SPICE trials also collected a good variation of assessment data in phase 2 from both small and large organisations.

3.2 Benchmark Results

Using the SPICE trials database from phase 2, the OSR tool generated a number of models to benchmark against. The models were generated using the influential factors region (RE), business (BU), and number of staff (ST). Each model shows a significant trend in the data set. One interesting pattern was generated for IT Products and Services in the South Asia Pacific area. The results of this benchmark are shown below for the following five processes, up to capability level 3: MAN_1 (Manage the Project), ENG_2 (Develop Software Requirements), ENG_3 (Develop Software Design), ENG_5 (Integrate and Test Software), and SUP_2 (Perform Configuration Management).

One OSR model was generated for the process MAN_1, and it can be shown as a simple heuristic rule:

Model (MAN_1): IF BU = IT Products and Services AND RE = South Asia Pacific THEN Predicted Capability Level = 2

This rule shows which of the factors (BU for Business, RE for Region, and ST for Number of Staff) have a significant influence on the predicted capability level of the process MAN_1.

The benchmark distribution for the process Manage the Project (MAN_1) is shown in Fig. 2. It shows the capability levels horizontally and the proportions (in percent) of assessment instances vertically. For this model, the majority of the assessment instances are at level 2. This means that IT Products and Services companies in the South Asia Pacific area have a predicted capability of level 2 for MAN_1. If an organisation has this capability level for MAN_1, it is within the majority of instances. However, 22% of these assessment instances are at level 3 for this process; those organisations would therefore have a competitive edge over this company. The benchmark allows an organisation to position itself within industry and determine where it should aim to be for the process MAN_1.


Fig. 2. Distribution corresponding to the Model for MAN_1
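To illustrate how such heuristic rules could be applied in practice, the following hypothetical Python sketch encodes the rule for MAN_1 and looks up the predicted level for an organisation. This is not the SPICE Trials benchmarking service, and rules with staff-size thresholds would additionally need numeric predicates.

# Illustrative, hypothetical encoding of the heuristic rules shown in this section.
rules = {
    "MAN_1": [
        ({"BU": "IT Products and Services", "RE": "South Asia Pacific"}, 2),
    ],
}

def predicted_level(process, organisation, rules):
    # Return the predicted capability level of the first matching rule, if any.
    for conditions, level in rules.get(process, []):
        if all(organisation.get(key) == value for key, value in conditions.items()):
            return level
    return None  # no model applies; fall back to a default prediction

organisation = {"BU": "IT Products and Services", "RE": "South Asia Pacific", "ST": 120}
print(predicted_level("MAN_1", organisation, rules))  # prints 2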

The benchmark distributions for the process Develop Software Requirements (ENG_2) are shown in Fig. 3. Three OSR models have been generated for this process:

Model 1 (ENG_2): IF RE = South Asia Pacific AND ST > 87.5 THEN Predicted Capability Level = 2
Model 2 (ENG_2): IF RE = South Asia Pacific AND ST < 87.5 THEN Predicted Capability Level = 3
Model 3 (ENG_2): IF BU = IT Products and Services THEN Predicted Capability Level = 3

For model 1, the majority of assessment instances are at level 2. For models 2 and 3, the majority of the assessment instances are at level 3. This means that IT Products and Services companies have a predicted capability of level 3 for ENG_2. However, for a large company (more than 87.5 staff) in the South Asia Pacific area, the predicted capability for ENG_2 is level 2. For a small company (fewer than 87.5 staff) in this area, the predicted capability for ENG_2 is level 3.


Fig. 3. Distributions corresponding to Model 1, 2, and 3 for ENG_2

The benchmark distributions for the process Develop Software Design (ENG_3) are shown in Fig. 4. Two OSR models have been generated for this process:

Model 1 (ENG_3): IF RE = South Asia Pacific AND BU = IT Products and Services AND ST > 87.5 THEN Predicted Capability Level = 2
Model 2 (ENG_3): IF RE = South Asia Pacific AND BU = IT Products and Services AND ST ...

The benchmark distribution for the process Integrate and Test Software (ENG_5) is shown in Fig. 5. One OSR model has been generated for this process:

Model (ENG_5): IF BU = IT Products and Services AND ST > 62.5 THEN Predicted Capability Level = 2

For this model, the majority of assessment instances are at level 2. This means that large companies (more than 62.5 staff) in the IT Products and Services area have a predicted capability of level 2 for ENG_5. Smaller companies in this area have the default predicted capability, since no model was generated for them.


Fig. 5. Distribution corresponding to the Model for ENG_5

The benchmark distributions for the process Perform Configuration Management (SUP_2) are shown in Fig. 6. Two OSR models have been generated for this process:

Model 1 (SUP_2): IF RE = South Asia Pacific AND ST > 200 THEN Predicted Capability Level = 1

Model 2 (SUP_2): IF RE = South Asia Pacific AND ST ≤ 200 THEN Predicted Capability Level = 3

For model 1, the majority of assessment instances are at level 1; for model 2, at level 3. This means that large companies (more than 200 staff) in the South Asia Pacific area have a predicted capability of level 1 for SUP_2. Smaller companies (200 staff or fewer) in the South Asia Pacific area have a predicted capability of level 3 for this process.


Fig. 6. Distributions corresponding to Model 1 and 2 for SUP_2

3.3 Learning From Best Practices

Benchmarking is a positive, proactive process to change operations in a structured fashion to achieve superior performance [10]. The benefit of using benchmarking is that functions are forced to investigate external industry best practices and incorporate those practices into their operations. This leads to profitable, high-asset-utilisation businesses that meet customer needs and have a competitive advantage.

In assessments, we compare the capability of processes in order to improve the performance of an organisation. Organisations with mature processes will perform better than those with low maturity [7]. Therefore, one would want to know which processes have a low maturity and target them for improvement. Benchmarking allows an organisation to determine what is an acceptable level of maturity for the company by comparing itself to industry. If the company is not reaching its goal, then an improvement program should be implemented. This type of approach can be integrated into FAME to better focus the improvement program.

FAME and other assessment methods help to determine the strengths and weaknesses of an organisation's current software processes. FAME contains a unique feature, called a Focused Assessment [2], that allows it to concentrate on the processes that are most relevant for the business. This type of assessment saves time and money, and it helps to focus the improvement program. After an assessment, benchmarking is used to compare the results with industry and thereby identify which of the processes with the most impact on the company need improvement.

For example, let us take the case of an IT Products and Services company in the South Asia Pacific area that has a business goal of getting products quickly onto the market (i.e., time-to-market), but finds that its competitors are faster and better at achieving this. A Focused Assessment is then performed on the processes that have an impact on the business goal of time-to-market. One such process that may strongly influence this goal is Manage the Project (MAN_1). An assessment result for this company may look like the one described in Fig. 1. The capability of the assessed process MAN_1 in this example is Level 1. If we compare this to the industry benchmark model described in Fig. 2, one can see that this company is lagging behind the rest of industry in this process. The company could then set up an improvement program to reach Level 2 for the process MAN_1, as shown in Fig. 7 below.


Fig. 7. An example of where to focus the improvement effort
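A minimal Python sketch of this gap analysis is shown below (illustrative only; the assessed levels for CUS_1 and SUP_1 and the benchmark level for ENG_1 are assumed, while the MAN_1 values follow Fig. 1 and Fig. 2).

def improvement_candidates(assessed_levels, benchmark_levels):
    # Flag processes whose assessed capability lies below the industry benchmark,
    # sorted so that the largest gaps come first.
    gaps = {}
    for process, assessed in assessed_levels.items():
        benchmark = benchmark_levels.get(process)
        if benchmark is not None and assessed < benchmark:
            gaps[process] = benchmark - assessed
    return dict(sorted(gaps.items(), key=lambda item: item[1], reverse=True))

# ENG_1 and MAN_1 follow the example in the text; CUS_1 and SUP_1 are assumed.
assessed = {"ENG_1": 3, "MAN_1": 1, "CUS_1": 2, "SUP_1": 2}
# Benchmark predictions: MAN_1 follows Fig. 2; ENG_1 is assumed.
benchmark = {"ENG_1": 3, "MAN_1": 2}
print(improvement_candidates(assessed, benchmark))  # {'MAN_1': 1}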

3.4 Other Types of Benchmarking

Other analyses will be performed in the SPICE Trials using benchmarking techniques. The aim is to better learn which techniques provide industry with the most informative results on best practices. As a result, industry will be better informed about which processes should be assessed in order to position organisations within their market.

Fraunhofer IESE is also performing internal benchmarking for companies that only wish to learn from best practices within their own organisation. Internal benchmarking is used to find out how a project compares to other projects in the company. It is also useful for evaluating the risks of taking up new projects by comparing them to previous performance. Benchmarking in general can be performed externally or internally, with the greatest benefit coming from performing both types. External benchmarking, like that in the SPICE Trials, is used to find out how an organisation compares to other similar organisations in the industry. It is also used by large acquirers of software systems to gauge the relative performance of their suppliers.

4 Conclusions and Future Directions

This paper presents a way to learn from best practices by comparing one's practices to an industry benchmark. Benchmarking can be included in an assessment method to better focus the improvement program. Fraunhofer IESE has developed a number of industry benchmarks by applying the OSR tool to the SPICE Trials data. The OSR tool generates a number of models by finding general trends in the data. These models provide a classification that companies can use to compare themselves with other companies of a similar type. A number of influential factors were used to determine what makes one company similar to another. This paper presented only a subset of the models generated from these influential factors. However, many other attributes (i.e., influential factors) associated with an assessment have been collected in the SPICE Trials, and these will be the subject of future research into better benchmarks.

Fraunhofer IESE is investigating how to better customise benchmarks for companies that require specific benchmarks to learn from. Research in this area will lead to better benchmarks being derived, which ultimately can lead to better improvement results within the company.

Acknowledgements

We would like to thank the participants of the SPICE trials for submitting their assessment data. The authors would also like to acknowledge Erik Dick and his team for their continuing work in the development of the OSR tool.

References

1. ISO/IEC. ISO/IEC TR 15504-2: Information Technology – Software Process Assessment – Part 2: A Reference Model for Processes and Process Capability. Technical Report type 2, International Organisation for Standardisation (Ed.), Case Postale 56, CH-1211 Geneva, Switzerland (1998)
2. Beitz, A., El Emam, K., Järvinen, J. A Business Focus to Assessments. SPI 99 Conference, Barcelona, 30 November - 3 December (1999)
3. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J. Classification and Regression Trees. Wadsworth & Brooks/Cole Advanced Books & Software (1984)
4. Briand, L.C., Basili, V., Thomas, W.M. A Pattern Recognition Approach for Software Engineering Data Analysis. IEEE Transactions on Software Engineering, vol. 18, no. 11, November (1992)
5. Briand, L.C., Basili, V., Hetmanski, C. Developing Interpretable Models with Optimized Set Reduction for Identifying High-Risk Software Components. IEEE Transactions on Software Engineering, vol. 19, no. 11, November (1993)
6. El Emam, K., Drouin, J., Melo, W. SPICE: The Theory and Practice of Software Process Improvement and Capability Determination. IEEE Computer Society (1998)
7. SPICE Project Trials Team. Phase 2 Trials Interim Report. June 1998. URL: http://www.iese.fhg.de/SPICE/Trials/p2rp100pub.pdf
8. Shepperd, M., Schofield, C. Estimating Software Project Effort Using Analogies. IEEE Transactions on Software Engineering, vol. 23, no. 12, November (1997) 736-743
9. Weiss, S.M., Kulikowski, C.A. Computer Systems that Learn. Morgan Kaufmann, San Francisco, CA, USA (1991)
10. Zairi, M. Benchmarking for Best Practice – Continuous Learning Through Sustainable Innovation. Reed Educational and Professional Publishing Ltd (1996)
