Enhancing Continuous Integration by Metrics and Performance Criteria in a SCRUM Based Process

Metrics and SCRUM in an Industrial Environment: A Contradiction?

Christian Facchi, University of Applied Sciences Ingolstadt, Esplanade 10, 85049 Ingolstadt, Germany

Peter Trapp, University of Applied Sciences Ingolstadt, Esplanade 10, 85049 Ingolstadt, Germany

Jochen Wessel, Nokia Siemens Networks GmbH und Co. KG, Liese-Meitner-Str. 7, 89081 Ulm, Germany


ABSTRACT

SCRUM, as an Agile Method for software development, has become widely adopted. Due to the basic ideas behind agile methods, metrics are only used in a very restricted way. We asked ourselves whether metrics can be used within SCRUM or whether there is a clear contradiction between metrics and SCRUM. To answer this question, we introduced metrics in several areas of SCRUM. A key success factor of SCRUM is permanent integration, which is achieved by the method of Continuous Integration. At Nokia Siemens Networks, we were in doubt whether our Continuous Integration is as steady as it should be. To address that problem we introduced qualitative methods in our development process. Metrics have been used for a well-founded evaluation of the steadiness of the Continuous Integration. Another area where metrics can be used is to ensure test quality. Additionally, to validate some non-functional criteria, we introduced performance measurements of our target in our test automation and Continuous Integration environment. In this paper, we present these metrics, their use and their validation in a real-life industrial case study. Based on the delivered data, we significantly improved the software production process. We have seen that a moderate use of metrics within SCRUM can help significantly.

Categories and Subject Descriptors

D.2.8 [Software Engineering]: Metrics—process metrics, performance measures; D.2.9 [Software Engineering]: Management

General Terms

Metrics, SCRUM, Continuous Integration, test coverage, statistical analysis, performance measurement, process improvement

Keywords

Agile Methods, SCRUM, Metrics, Continuous Integration, Test Coverage, Performance Measurement

EPIC 2010 Bolzano-Bozen, Italy

1. INTRODUCTION

During our SCRUM based development process of a larger software project, we had to answer some simple questions:

1. How continuous is our Continuous Integration?
2. Does the code have sufficient quality?
3. What is an appropriate definition of done?

We only had the feeling that the integration in use was not as continuous as it should be. The SCRUM teams have sometimes mentioned this as an impediment. However, even when the teams mention it, it remains a feeling that the integration might not be as continuous as it should be. The same holds for the question regarding code quality: the answer to that question depends strongly on the developer's mindset. The same phenomenon can be seen in the differences that have appeared regarding a "definition of done". It is not clear when a developer has finished his work in sufficient quality. For an engineer, it is not a good solution to base further corrections on imprecise feelings, especially if those are not homogeneous even within one team. So in the following, we tried to put the impression we had on a more solid basis. One possibility to do that is to define several measurements, so-called metrics. Some metrics are already in use in SCRUM projects, like burndown charts or metrics for productivity and velocity. One intention of their usage is to increase visibility. It is well known that the use of metrics in agile projects is discussed controversially, e.g. in [Glo08, Bea07, MZ07]. One suggestion, among others, is that if it has been decided to use metrics, they have to be handled with great care. In the case study it has been decided to use metrics to put the suspicion on a more solid basis. First of all, we wanted to get an answer to our initial questions. In addition, we tried to help the team members to identify the impediments faster and to give them the appropriate priorities. Furthermore, a first step towards an ongoing process improvement in an industrial environment has been made. Additional qualitative methods can now be introduced to the

development process. However, this has to be handled with care to avoid a contradiction with the responsiveness, and especially the speed, of Agile Methods. It might also be a possibility to decrease the gap between a plan driven approach and agile methods, because plan driven methods tend to depend more on metrics, especially if higher CMMI levels are to be reached. Hence, a fruitful coexistence of both might be possible [GDA+08]. Thus, not only the before mentioned questions have been answered within our project. An additional question appeared:

Is there a contradiction in using metrics within SCRUM?

This question appears because introducing metrics will add more bureaucracy and also more means of control for management, which is in contrast to SCRUM principles. This paper is an extension of [FW10a, FW10b], where only metrics using data delivered by the Continuous Integration environment, especially the build scripts, have been proposed. In this paper we go some steps further by including performance measurements of the target or simulation in the Continuous Integration environment. Furthermore, we discuss some metrics for measuring the quality of the test process.

2. BASICS

In this section, we present a short introduction to the concepts and environment used, to allow a better understanding of the paper.

2.1 SCRUM

Agile Methods, of which e.g. SCRUM and eXtreme Programming (XP) [1] are representatives, are based on the following principles [Bea09]:

• Individuals and interactions over processes and tools
• Working software over comprehensive documentation
• Customer collaboration over contract negotiation
• Responding to change over following a plan

Agile Methods put more value on the items on the left side of each principle. This is in contrast to a traditional plan driven development, where the items on the right side have the highest priority. SCRUM is an incremental method for software development. As an Agile Method, it supports the above mentioned goals. It is based on work items, which are collected in the Product Backlog List in a prioritized form. This list is owned by the Product Owner. These work items will be completed as far as possible within a sprint. A sprint has a duration of four weeks and a deliverable product as its target. During a sprint the sprint backlog, which is a subset of the product backlog list, will be completed by the SCRUM team. Every sprint backlog entry should have a maximal size of one working day for one team member. Every day a 15 minute meeting, the "Daily Scrum", is held. During this Daily Scrum, every SCRUM team member has to answer the following questions:

1. What have you done since the last meeting?
2. What will you do before the next meeting?
3. Which impediments did you have?

[1] Due to its broader area of usage, SCRUM is described as an agile technique for managing complex projects, whereas XP is specialized only on software development. However, in the scope of software development both are mentioned as representatives of agile methods.

At the end of a sprint a retrospective meeting takes place, where further improvements will be planned. Also a planning meeting for the next sprint is held. SCRUM has been described in more detail, e.g. in [SB02, Sch04, Glo08]. One of SCRUM's design goals has been to improve the productivity of software development by introducing self organizing teams and, as a consequence, empowering the team members. Increasing the velocity of feedback is also a key factor. Thus, feedback can be given very fast: e.g. a deliverable product has to be available after each sprint, and impediments have to be presented in the Daily Scrums every day. There is an ongoing discussion whether SCRUM leads to the desired productivity gain, which in SCRUM is named hyperproductivity, or whether plan driven development methods have a higher productivity. E.g. in [Boe02, BT05, GDA+08] a careful approach to agile methods has been suggested, either to combine the advantages or to select the appropriate method. Especially, it is often doubted whether SCRUM can be used in large and distributed environments, which is the case at Nokia Siemens Networks. This seems to be an issue in a lot of organizations, because a lot of hints are given on how to handle it, e.g. in [FG07, SVBP07, LV09, Glo08, LS05, SSRR08]. Nokia Siemens Networks decided to go for SCRUM. We believe that within SCRUM a highly creative working environment can be established. This is especially achieved by empowering the software developers. Moreover, we believe that SCRUM exposes already existing impediments early and directly. So the problems already existed, most even before the introduction of SCRUM, and they are only addressed directly and quickly by the SCRUM teams. This is something every engineer likes to have, because then the problem can be fixed. However, for the responsible department heads it has been a significant change, because they delegate some of their responsibility to the SCRUM teams. That can be compared with the relationship of parents and children. One aspect of education is to increase the freedom step by step. However, that is not easy for most parents, because they are still responsible for their children, but they significantly lose direct control and have only indirect control.

This can be compared to a software line manager, who is responsible for the team’s output, but only has limited direct control over the SCRUM team.

2.2 Definition of Done

One important question in a SCRUM based process is the definition of an appropriate quality level of the delivered product. This topic is of even more interest in a distributed environment, where different SCRUM teams are working together but are not located at one development site. Thus, a consensus regarding the desired quality level has to be established over all sites. SCRUM teams normally handle that by some rules, which are called the "Definition of Done" (DoD) [SB02, Sch04, LV09, Glo08]. These rules describe the agreed minimal principles which a product or a backlog item has to fulfill. The DoD is defined by the SCRUM team itself, because of the self organizing structure of the team. These rules may e.g. include successful semantic checks by tools, or a review by a peer of every code change. In a distributed SCRUM the DoD is aligned within every separate team and also between all teams.

2.3 Continuous Integration

One of the key elements of SCRUM is a continuous feedback loop, to identify impediments as soon as possible and also to include changes early in the development process. E.g. in [FG07] it has been mentioned that automation of the build and test is a key factor of any agile method. With Continuous Integration the different work items are integrated into the software product very often, nearly permanently, so interface problems show up early. In the best case, an automated software integration is started after each checkin of modified code into the version control system. This increases the speed of feedback and of possible corrective actions. For more information on Continuous Integration see e.g. [DMG07, Fow06].

2.4 Quality of Tests

Another software development method used by Agile Methods is test driven development [Bec02]. This method has also been described as test-first programming, because before starting to implement a functionality the corresponding test case has to be implemented first. The use of test driven development in SCRUM is not surprising, because a deliverable product should be the target of every step in SCRUM. So this is emphasized by putting a test case as the starting point of every development action. As a consequence, tests are required in the used environment. For a real product a thorough test phase is required. To achieve that goal the required level of tests has to be determined. But it is impossible to determine a sufficient test level exactly. For approximation a lot of different techniques or measurements exist. As a minimal criterion, branch coverage, which has also been named the C1 criterion [Mye04, Mar95, Lig02], has been stated. If that metric reaches 100%, then at least every branch of the software has been executed once during the test suite's execution. This is called a minimum criterion, because executing every branch does not expose every error. However, a branch coverage below 100% states that some parts of the software have never been executed in any test case, which might hide a potential error.
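To make the C1 criterion concrete, the following minimal sketch (illustrative only, not project code; all names are made up) shows a function with two branch outcomes and the tests needed to cover both of them.

    # Illustrative sketch: a function with a robustness branch and the unit tests
    # needed to reach 100% branch (C1) coverage for it.

    def parse_port(value: str) -> int:
        if not value.isdigit():          # branch outcome 1: robustness check
            raise ValueError("not a number")
        return int(value)                # branch outcome 2: regular path

    def test_regular_path():
        assert parse_port("8080") == 8080    # covers outcome 2 only

    def test_error_path():
        try:
            parse_port("abc")                # covers outcome 1
        except ValueError:
            pass
        else:
            raise AssertionError("expected ValueError")

    # Running only test_regular_path leaves one branch outcome unexecuted, so
    # C1 < 100%; both tests together reach full branch coverage of this function.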

2.5 Performance Measurement

The definition of done often includes that the code should satisfy all its functional and non-functional requirements. Hence, it is also important to constantly verify the non-functional requirements, such as software performance, as described in [Gun98, HL84, Hoc07]. Non-functional requirements in SCRUM based processes seem to be very hard to achieve and should permanently be in focus [CR08, Boe02]. So it might be good to consider performance criteria during the whole project. In [Gla98], software performance is considered one of the main technical reasons why software projects fail. The execution of software in these projects is often too slow to be used in productive environments. However, the importance of performance optimizations is not reflected in academic curricula and is often underestimated. Especially in complex software systems, such as telecommunication systems, the performance of software is particularly critical. Here, low performing systems will not be accepted by the users and therefore cannot be sold to the customers. Software performance is often targeted in the system design phase, e.g. different architectures will be evaluated and performance target specifications will be written. The methods for performance engineering of software are described in [Mar02, SS01] and contribute to the following areas in the software development cycle:

• Capacity Planning: A process to forecast workload and resource requirements and to plan the acquisitions needed to support future resource requirements.
• Software Engineering: A discipline to develop large software systems.
• Performance Management: Measurements on existing software to verify performance criteria or to identify specific performance problems.
• Performance Tuning: A discipline to improve slow code in order to reach the performance targets.
• Performance Modeling: A process to analytically solve models of complex environments.
• Software Quality Assurance: Includes requirements for audits and for software product quality evaluations.

Performance tuning, as described above, follows a strict line: analysis - test - improvement - verification (summarized from [Mar02]). This procedure depends strongly on the experience of the person realizing a performance optimization and can only roughly be used for estimations. Because Nokia Siemens Networks uses a Continuous Integration environment, the idea came up to utilize this environment to constantly evaluate the runtime performance behavior of the most frequently used functionality. This enables the developers to spot performance problems early in the development process. Moreover, tendency trends of the performance of the software can be estimated and, hence, evaluated. Additionally, the developers can be notified if their software functionality reaches some thresholds.

2.6 Research Cooperation

In 2005, a cooperation between Nokia Siemens Networks and Ingolstadt University of Applied Sciences started, first in the area of software performance optimization. In telecommunication systems performance is a crucial factor, because some so-called key performance indicators have to be guaranteed by the system. These key performance indicators are show stoppers: if they are not reached by the system, it can not be used by the customer. As a consequence, the customer will not buy the system at all. So a high effort has to be put into reaching these performance goals during the system's development phase. Therefore, a research cooperation started with the goal of supporting the consideration of the system's runtime performance during the development phase of the telecommunication system. This cooperation was started very early, before any performance problems in the system had been identified. In the area of performance optimization, a method for goal oriented performance optimization has been found. With this method a better identification of bottlenecks and a method for effort optimized performance improvement have been defined. For further information, the reader is referred to [TF07, TF08, TFB09, TMF10]. Due to its results the project has been prolonged by Nokia Siemens Networks several times and is still running. As an extension of the ongoing cooperation, the first of the authors has spent a sabbatical at Nokia Siemens Networks in 2009. Because the development department which initiated the cooperation started to convert its software development process to SCRUM, several research topics in that area have been identified. As a first goal, the improvement of SCRUM by metrics for Continuous Integration [FW10a, FW10b] has been determined. As a second topic, the question of what constitutes sufficient test quality appeared. That has resulted in some basic considerations regarding metrics for tests, which we present later in this paper. As a third topic, the combination of software performance measurements with the automated test environment has been identified, which we also present in this paper. Due to the long ongoing cooperation, where both sides are very flexible, several interesting areas have been identified and selected as work topics valuable for all involved parties.

2.7 SCRUM at Nokia Siemens Networks

First we have to note that Agile Development using SCRUM at Nokia Siemens Networks is still undergoing continuous changes in the exact mode of operation. But that is not surprising in a self organizing process with small feedback loops. Especially in the Radio Access area of software for mobile phone based networks, where the examined project is located, SCRUM is still rarely used. Instead, the major development is based on traditional plan driven processes. Also, the overall program in which the project in scope is developed follows a traditional process using overlapping streams, all of them developed in a waterfall fashion. As Nokia Siemens Networks is a worldwide operating company, it is quite common that the software development of projects is distributed over several sites. Our project is distributed over two sites with roughly 90 people working on it in total. In our project, we decided to introduce SCRUM, although the frame process does not support

this Agile Method. To do this, we defined the software as a product that is developed by the whole project. Also, the SCRUM teams do not follow every step of the process (i.e. from early specification to complete System Test), but take the specification of a certain level done by separate teams and deliver the software to System Integration teams. To fulfill the outside requirements given by the frame process, we installed a so-called Feature Expert Team (FET), which, among other things, separates the outside process from the SCRUM teams. Its responsibilities included introducing the features to the SCRUM teams, supporting the Product Owner in building up the product backlog and answering ad hoc requests coming from other program members. After approximately one year of working in this mode, we changed the setup in a way that reorganized that team again. Some of the people are now working as so-called SMEs (Subject Matter Experts), others are working for teams outside our project (e.g., higher level specification). The team is completed by one Product Owner and two Proxy Product Owners (one per site). Note that these are not Product Owners in the sense of SCRUM, as they do not have any "Return on Investment" responsibility, but they do a big part of the work of a Product Owner as defined in SCRUM. As the project started in a traditional mode, it consisted of five so-called System Components; the former System Component responsibles were also part of the FET and are now SMEs, to keep a combined knowledge of the complete project also outside of the SCRUM teams. The scope of the SCRUM teams is software development. This includes supporting the higher level specification, coding, unit/module testing, System Component testing and, in addition, a first integration testing in which the five System Components are integrated and tested as a product. Before we decided to switch to SCRUM, several activities had already been started to base the software on Continuous Integration. A fully automated test environment had already been developed, so it was a small step to put this under the control of a Continuous Integration automation, which is called the CI machine. As the Configuration Management System in use, ClearCase [2], does not provide atomic operations, we introduced a script called SCSync to have better control over the contributions towards the CI. SCSync is based on the label mechanism of ClearCase and provides functionality to hand over all changes a developer has done in one step, without interference by other developers. One additional functionality of SCSync is a roll-back, so in case of errors or problems during the handover to CI, the previous status can be recovered.

2.8 Project Specific Definition of Done

Despite the fact that in a SCRUM based process running code is the target of every action, the Definition of Done (DoD) [LV09, Glo08] is important for achieving sufficient software quality. This is especially the case in bigger projects with several distributed SCRUM teams working on the same product. In this subsection, we shortly mention some points of the "Definition of Done" currently in use within the software development of the project we have chosen. This project is located in the Radio Access component for mobile phone based networks.

[2] ClearCase: http://www.ibm.com/developerworks/rational/products/clearcase

The Definition of Done is checked by the SCRUM Master before the Sprint Review Meeting. All artifacts, e.g. links to updated documentation or reports of tools, are stored on a Wiki page. Only the points related to this paper are presented:

• Code compiles without errors using the official project build system.
• Code compiles without warnings. Warnings which can not be removed should be well commented in the code.
• New code must be free of Klocwork [3] findings of severity 1 to 3.
• Code is tested with valgrind [4] and the results do not show any memory leaks.
• Unit tests are stored under version control and have successfully passed.
• The branch coverage, or C1 coverage, of the unit tests measured with BCov [5] is on a satisfactory level. That means the branches of the main use cases are covered, the main error scenarios are covered, error situations that are unlikely to happen might not be covered, and inlined code does not have to be covered.
• The branch coverage of the unit tests does not decrease in percentage between sprints.

[3] Klocwork is a semantic program checker: http://www.klocwork.com/products/insight-pro
[4] Valgrind is a simulation tool for dynamic analysis: http://www.valgrind.org
[5] BCov is a tool for checking the branch coverage during a test execution: http://sourceforge.net/projects/bcov/
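The two coverage-related DoD points lend themselves to an automated check. The following is a minimal sketch under the assumption that branch coverage per component is available as a simple {component: percentage} mapping extracted from the BCov results; the data format and function name are illustrative, not part of the project's tooling.

    # Sketch only (assumed data format, not the project's tooling): verify that
    # branch coverage per component does not decrease between two sprints.

    def check_coverage_dod(previous: dict[str, float], current: dict[str, float]) -> list[str]:
        """Return a list of DoD violations (an empty list means the DoD is fulfilled)."""
        violations = []
        for component, prev_cov in previous.items():
            cur_cov = current.get(component)
            if cur_cov is None:
                violations.append(f"{component}: no coverage result in current sprint")
            elif cur_cov < prev_cov:
                violations.append(
                    f"{component}: branch coverage decreased from {prev_cov:.1f}% to {cur_cov:.1f}%")
        return violations

    # Example usage with made-up numbers:
    print(check_coverage_dod({"SC1": 75.1, "SC2": 90.0}, {"SC1": 51.4, "SC2": 91.2}))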

2.9 Related Work

The area of metrics has been studied for a long time, so a lot of different approaches to metrics for software are available. For object-oriented systems a lot of different metrics exist, see e.g. [CK94, XSZC00]. These metrics are based on properties of the source code. Some of these metrics can be used in our environment to estimate the quality of the code. But these metrics are not specialized for an agile development process. In [SGK07] a case study on the use of object-oriented metrics in agile projects based on eXtreme Programming (XP) has been done. However, these metrics do not directly address our problem of determining the quality of the Continuous Integration, because only a static view of the code has been taken. A tool based approach to metrics for software development has been given in [JKP+05], where plugins to collect data are available for a huge range of development environments. However, no specialized metrics for agile methods are currently available there. In addition, the specialized development environment of Nokia Siemens Networks, which consists of a specialized make environment and version control system, is not supported by these plugins. In [BA07], a specialization of metrics for Agile Methods has been given. The main focus of that approach is the effort

estimation of a project. In [MZ07] an abstract approach targeting the costs has been described. In [LS05] some general metrics are given, which can be collected manually or in an automated manner. These metrics concern the number of defects, code coverage, number of test cases and much more. This level of abstraction is the other extreme in relation to source code metrics. To measure the quality of the Continuous Integration, the abstraction level of these metrics is too high. In [SBB+06] metrics for the usage in agile projects have been introduced, where the abstraction level of the metrics is nearly in our desired area. So, e.g., in one metric the modification ratio of code is measured. However, the aspect of Continuous Integration has not been covered so far. In [DTHK05] different metrics at our desired abstraction level have been covered. There the pulse of Continuous Integration has been mentioned explicitly. However, it is only measured how many checkins [6] are done during a day. The quality of a checkin, e.g. in terms of visibility through a successful build or test phase, has not been considered. We believe that metrics which include this information add significant value for quality considerations of the software development process. This is a clear consequence of the fact that within SCRUM tested code is the result of every checkin. So if the code is not working anymore, further work is not possible. There is an ongoing discussion about which level of branch coverage is minimally required, because the target of full coverage can not be reached if the software contains code for robustness which can not be tested [ZHM97, KLV05, MLBK02a]. In [FW10a, FW10b] metrics using data delivered by the Continuous Integration environment, mainly the build scripts, have been proposed. Based on the delivered measurements, we achieved some improvements, which we elaborate further in this paper. Performance tuning in agile development environments has been described in [BGMJ09]. Here, a methodology to include performance tuning in the development cycle of a software product has been explained and supported by a case study. The paper mainly focuses on an overall integration of performance measurements and highlights the business value of different optimization possibilities evaluated together with the customer. We, however, explain the integration of performance measurements into the Continuous Integration environment, while the software development is done in an agile process. Our focus is to explain a possible realization of such measurements. Metrics for tests are widespread in software development. They have been described e.g. in [Mye04, PY07, Mar95, Lig02], and large tool support is available. How test coverage can be used to get a reliability estimation of software has been examined e.g. in [CLW97, MLBK02b]. Measurement and improvement of test quality has been shown in e.g. [ZHM97, RUCH01].

[6] A checkin is defined as the action of a developer handing over his code changes to the version control system. These changes are then frozen in the current version; further modifications have to be done in newer versions.

3. METRICS FOR SCRUM

In this section, the approach for using metrics within an agile environment is presented. For that purpose, metrics for Continuous Integration have been developed and, additionally, some already existing metrics, e.g. for test coverage, have been used. Finally, some measurements of the target software's runtime performance have been included. There exist process oriented methods for deriving metrics, e.g. GQM, Goal Question Metric [vSB99], where metrics are developed in a clearly described way. GQM is primarily goal oriented and consists of several phases:

Planning Phase: In this phase, an application, e.g. a software system, is selected for measurement and a planning process is started. The result of this phase is a first project plan.

Definition Phase: Here, the measurement will be developed. First a goal is clarified, which leads to several questions. Then several metrics are developed to answer these questions. Also some hypotheses for the validity of these metrics should be formulated.

Data Collection Phase: In this phase real data with respect to the previously defined metrics is collected.

Interpretation Phase: Here, the collected measurement data is interpreted with respect to the defined metrics. The measurement data is abstracted to answers, which are related to the previously defined questions. As a final abstraction, the goal attainment can be evaluated.

GQM is a very strong method, which describes a process for deriving metrics. However, we have seen that whenever we wanted to introduce some classical process oriented techniques in our SCRUM based process, this has led to significant acceptance problems within our SCRUM teams. This might have some psychological reasons, like the teams' fear of going back to a plan driven approach as well as the fear of losing speed. We, therefore, decided to start with metrics in an agile way.

3.1 Definition of Metrics for Continuous Integration

Now, some basic metrics to support the Continuous Integration will be defined. For all definitions, practical usability has the highest priority. The main target of this subsection is to answer the first question mentioned in the introduction: "How continuous is our Continuous Integration?"

3.1.1 Pulse of Continuous Integration

One basic requirement of SCRUM is an early and steady integration. So, a metric to determine the frequency of the integration is defined as follows.

Definition 1. The CI Frequency #CI of Continuous Integration is defined as the number of successful integrations per day.

With CI Frequency the pulse of Continuous Integration can be measured. This can be compared with the heartbeat of the integration. As a target, this value should be larger than or equal to the number of members of a SCRUM team. This is based on the fact that backlog items have to be completed within one day and every backlog item should result in deliverable code. Thus, as a statistical mean, every team member has to conclude at least one checkin per day. To test whether a checkin works, an integration has to be done. Only successful integrations are counted, because only they can be considered a correct delivery of a product part.
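A small sketch of how #CI can be computed, assuming each successful integration is recorded with a timestamp; the record format and function name are illustrative only.

    # Sketch only (assumed record format): compute the CI Frequency #CI, i.e. the
    # number of successful integrations per calendar day, and flag days where the
    # pulse falls below the team size.
    from collections import Counter
    from datetime import datetime

    def ci_frequency(successful_integrations: list[datetime], team_size: int) -> dict:
        per_day = Counter(ts.date() for ts in successful_integrations)
        return {day: {"count": n, "below_target": n < team_size}
                for day, n in sorted(per_day.items())}

    # Example with made-up timestamps for a team of eight developers:
    runs = [datetime(2009, 6, 9, 10, 5), datetime(2009, 6, 9, 14, 30), datetime(2009, 6, 10, 9, 0)]
    print(ci_frequency(runs, team_size=8))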

3.1.2 Quality of Continuous Integration

In the following, we define two metrics to count the number of successful or erroneous Continuous Integration attempts.

Definition 2. #CI_OK^{<phase>} is defined as the total number of successful Continuous Integrations within development stage <phase>.

Definition 3. #CI_ERR^{<phase>} is defined as the total number of erroneous Continuous Integrations within development stage <phase>.

A distinction of the different development phases or steps is necessary, because of the different failure possibilities. A script can fail during the pure build phase, in the unit test or in the product test phase. Hence, e.g., the number of errors during the build phase will be described by #CI_ERR^{build}. To determine the probability of a build or test failure we introduce, straightforwardly based on the above definitions:

Definition 4. ErrorRatio^{<phase>} := #CI_ERR^{<phase>} / (#CI_ERR^{<phase>} + #CI_OK^{<phase>})

Therefore, with ErrorRatio it can be determined how many build or test failures happen. E.g., if ErrorRatio^{unittest} = 0.25 holds, then one out of four unit test runs fails.
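The following minimal sketch only restates Definition 4 in code, with counts assumed to be taken from the Continuous Integration logs.

    # Sketch of Definition 4 (illustrative): compute ErrorRatio for one phase from
    # the counted outcomes #CI_OK and #CI_ERR of the Continuous Integration runs.

    def error_ratio(ci_ok: int, ci_err: int) -> float:
        total = ci_ok + ci_err
        if total == 0:
            raise ValueError("no Continuous Integration runs counted for this phase")
        return ci_err / total

    # Example: 30 successful and 10 failed unit test phases give ErrorRatio = 0.25,
    # i.e. one out of four unit test runs fails.
    print(error_ratio(ci_ok=30, ci_err=10))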

3.1.3 Duration of Continuous Integration

To be useful, Continuous Integration has to be automated and fast. Thus, the result has to be available to the developer as soon as possible. Only then can usable feedback be given to the developer. If this step takes too long, either the developer will not do a complete check, or parallel integration by different developers starts and, in consequence, the probability of integration side effects increases.

Definition 5. MaxDuration is defined as the maximal duration of the development step <phase> for <component>.

Hence, with MaxDuration the maximal time for receiving a result can be measured. This measurement can simply be extended by introducing average values and standard deviations to collect statistical information if desired.

3.1.4 Availability of Continuous Integration

The probability that a system, which is defined by successfully passing the appropriate phase, is up and running, is also of interest. A system is available at the current time if it has successfully passed the build, unit test or system test. Then, further development steps can be started. Therefore, the following definitions have been introduced:

Definition 6. uptime is defined as the set of time intervals [t_i, t_{i+1}] during which the system is up and running. Each time interval is described by a starting point t_i and an endpoint t_{i+1}. During these intervals the system is available for stage <phase>. All time intervals are maximal.

Definition 7. downtime is defined as the set of time intervals [t_i, t_{i+1}] during which the system is not available for stage <phase>. Each time interval is described by a starting point t_i and an endpoint t_{i+1}. All time intervals are maximal with respect to t_i < t_{i+1}.

Thus, uptime and downtime form a disjoint decomposition of the complete time interval, because the product under study can either be OK or not OK; there are no further possibilities. Furthermore, the time intervals are not equidistant, because they are determined by the results of the Continuous Integration. Based on these intervals, the different availability times can be summed up. Before that can be done, we have to define an order on the different build phases. E.g., a build failure is also regarded as a failure of the unit test. Using these definitions, the system's availability can be defined straightforwardly:

Definition 8. We define availability as the probability that the system is available in stage <phase> for a developer:

availability = Σ uptime / (Σ uptime + Σ downtime)

where Σ denotes the summed duration of the relevant time intervals.

Definition 9. We define the total order ≺ on the development phases: build ≺ unittest, unittest ≺ systemtest. All pairs not mentioned are defined not to hold, except those based on transitivity.

Based on this order, a simple implication can be stated: if a test case fails in a phase α, then it also fails for all ("later" or "higher") phases β where α ≺ β holds. E.g., a downtime in the build phase will also lead to a downtime in unit test or system test. Hence, the total availability can simply be taken as the probability that the system is available for a developer in the highest development stage with respect to the order ≺.
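A minimal sketch of Definition 8, assuming the up and down intervals of a phase are available as (start, end) pairs; the interval representation is an assumption for illustration.

    # Sketch of Definition 8 (illustrative data types): availability of a phase as
    # summed uptime divided by summed uptime plus downtime. Intervals are given as
    # (start, end) pairs of datetime objects, as in Definitions 6 and 7.
    from datetime import datetime

    def total_seconds(intervals: list[tuple[datetime, datetime]]) -> float:
        return sum((end - start).total_seconds() for start, end in intervals)

    def availability(uptime: list[tuple[datetime, datetime]],
                     downtime: list[tuple[datetime, datetime]]) -> float:
        up = total_seconds(uptime)
        down = total_seconds(downtime)
        return up / (up + down)

    # Example: 9 hours up and 1 hour down for the unittest stage gives 0.9.
    up = [(datetime(2009, 6, 9, 8, 0), datetime(2009, 6, 9, 17, 0))]
    down = [(datetime(2009, 6, 9, 17, 0), datetime(2009, 6, 9, 18, 0))]
    print(availability(up, down))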

3.2 Metrics for Tests

As the next step, we wanted to answer the second question given in the introduction, "Does the code have sufficient quality?", and as a consequence also the third question, "What is an appropriate definition of done?". Therefore we concentrated on the tests. First, we selected branch coverage as a way to characterize test cases. Second, we introduce some simple measurements of the test cases themselves.

3.2.1 Metrics for Test Coverage

Within our development department at Nokia Siemens Networks, we have immediately seen that a full branch coverage, or shorter C1 = 100%, as demanded in theory, cannot be reached. This is despite the fact that the argument for full branch coverage is quite easy to understand: how can the quality of a piece of code be estimated if not at least one test case covers it [7]? In practice, there is a demand to write robust code. This code might contain parts that check for situations which hardly ever occur in reality. These sections are only included to increase the quality, but they may be very hard to reach in any test case. In that case, the branch coverage will be less than 100%. Based on that experience, a branch coverage lower than 100% is reasonable. The question appears: what should be a minimal criterion for branch coverage? If 80% branch coverage is requested and reached, it might still be the case that some important branches have not been covered by any test case. We, therefore, have required that between sprints the branch coverage of a component should not decrease. This has proven to be a feasible approximation. Another possibility is to annotate branches which cannot be reached, as suggested e.g. by [PY07]. These branches should ideally not be counted in any coverage calculation. However, that requires a tool extension, which we have not realized due to effort estimations.

3.2.2 Metrics for Testcases

Normally, the number of test cases should correspond to the number of use cases. A direct correlation should exist in a test driven development. Therefore, a simple metric has been defined to measure the number of test cases. Please notice that in this first approach the number of use cases has not been considered.

[7] We are aware that this holds only in a test based quality approach. If formal methods are used for program development, quality can be determined with a minimal number of tests. However, formal methods are currently not used within SCRUM. That might be an interesting research question, too.

Definition 10. The number of test cases #TC is defined as the number of test cases available for the component <component>.

With #TC a simple metric exists which can serve as a first indicator for the quality of the tests, especially in combination with the earlier mentioned branch coverage. The trend of this number over time might also be an indicator. There can also be a correlation to reported bugs, because every bug should yield a new test case in a test driven development.

3.3 Metrics for Performance Measurements

Additional answers to the second and, in consequence, to the third question of the introduction (Does the code have sufficient quality? and What is an appropriate definition of done?) can partly be given by performance considerations. First of all, a metric to evaluate the system's performance behavior has to be chosen. Of the key performance indicators defined in the performance target specification document, the latency of messages is used in this case. In order to study the latency of the messages, a breakdown of the runtime spent in the involved software modules and functions has to be done. Moreover, the traffic model document is used to identify the most frequently used messages and software functionalities. This results in a ranked list of the used messages. The most frequently used messages will be used to study the performance behavior. Thus, the performance of these functionalities will be monitored within the Continuous Integration environment. Only the most frequently used messages will be evaluated, to reduce the overhead created by the evaluation.

Performance Measurements in Continuous Integration. The performance measurements will be included in the Continuous Integration environment. Hence, the test environment has to be capable of executing the performance tests as specified by the performance optimization team. Therefore, the message based test scenarios as well as the test environment are used to create a dedicated performance test environment. This enables load and stress tests of the system under test. Moreover, the performance test environment as well as the performance test evaluation will be integrated into the regression test environment and executed periodically. Another requirement on the performance test environment is that it deterministically delivers almost the same results for the same test setup.

Evaluation of the Results. The newly measured results are integrated into the history of the available results on a "per message" resp. "software functionality" basis. This makes it possible to identify changes in the performance behavior of the software, e.g., slow-downs after adding new software functionality (features) or speed-ups after an optimization. The results of the evaluations are put into graphs and a statistical calculation is done. The graphs display the time spent in the software processing on the y-axis, and the x-axis lists the date and time of the evaluation. Hence, tendency trends of the upcoming software performance become visible. The following statistical evaluation information will be provided to the developer: minimum, average, maximum, variance and the squared coefficient of variation (see also [SL05]). Based on the performance target specification document, thresholds can be specified on a "per message" resp. "software functionality" basis, and the developer of the component will be notified if the threshold is reached. Additionally, the developer is able to view any older available performance measurement results at any time. This evaluation will be done fully automatically.
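The statistical summary described above can be sketched as follows. This is not the project's perf-ci.pl tooling; the latency samples, units and threshold handling are assumptions for illustration.

    # Sketch (illustrative, not the project's tooling): per-message statistical
    # summary including the squared coefficient of variation, plus a simple
    # threshold check for the notification of the developer.
    from statistics import mean, pvariance

    def summarize(latencies_us: list[float], threshold_us: float) -> dict:
        avg = mean(latencies_us)
        var = pvariance(latencies_us)
        return {
            "min": min(latencies_us),
            "avg": avg,
            "max": max(latencies_us),
            "variance": var,
            "scv": var / (avg * avg) if avg else float("nan"),  # squared coefficient of variation
            "threshold_reached": max(latencies_us) >= threshold_us,
        }

    # Example with made-up latency samples (microseconds) for one message type:
    print(summarize([120.0, 130.0, 128.0, 240.0], threshold_us=200.0))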

4. CASE STUDY

Based on the previously described implementation of metrics, we deployed the metrics tooling within our software development department, which develops telecommunication software for Long Term Evolution (LTE), a successor technology of UMTS. We measured two independent development branches: one is mainly a stabilization branch, and on the other branch some very new features are developed. This has been done to validate the defined metrics on a broad range of development tasks. Please notice that not all previously defined metrics are presented in this section. Only some metrics are shown to clarify the results.

4.1 Setup

We concentrate on software for the evolved NodeB (eNB), a network element used within LTE. Two different feature teams with eight developers each are working at one development site. We restricted the test phase to the simulation phase; no target tests have been covered by our case study. As a build and test environment, regular Linux servers are used. All builds and tests are fully automated and integrated in CruiseControl [8] and later Hudson [9], environments for Continuous Integration. As the configuration management tool, ClearCase is used.

4.2 Implementation of Metrics for Continuous Integration

As a prerequisite for collecting the previously defined metrics, a completely automated build and test environment is necessary. At Nokia Siemens Networks the build and test environment is realized by different UNIX scripts. During the build and test phase a lot of information is written to a log file. Hence, we simply extended the build scripts to trace additional information to the logfile. This information has the format:

METRICS: <invocationScriptName> <state> <time>

It is written to the logfile at the start of a script and at its end.

[8] CruiseControl is a tool for visualization of Continuous Integration: http://cruisecontrol.sourceforge.net
[9] Hudson is a tool for visualization of Continuous Integration: http://hudson-ci.org

The evaluation of these log entries has been implemented in a small Perl program. For more information regarding the implementation the reader is referred to [FW10a, FW10b]. For a statistical interpretation of the derived data for Continuous Integration a spreadsheet program has been used. We simply read in the data, which is provided in a tabular format. Based on that, statistical values such as minimum, maximum, mean value and standard deviation can easily be determined. In addition, this data can be visualized, so an easier interpretation of the data can be achieved. This helps to gain an overview when a large amount of data has been collected. Based on simple statistical methods, peak values can easily be identified and analyzed further. Sometimes these peak values are caused by extraordinary circumstances and it has been decided to neglect them. In other cases they have shown us optimization candidates with significant improvement potential. However, the results have to be carefully analyzed, which requires in-depth knowledge of the production process.
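The actual evaluation at Nokia Siemens Networks is a small Perl program; the following Python sketch only illustrates the idea of pairing start and end METRICS records into per-script durations. The "start" state keyword and the integer time unit are assumptions, not the project's exact trace conventions.

    # Illustrative sketch (the real implementation is a small Perl program): parse
    # "METRICS: <invocationScriptName> <state> <time>" lines from a build logfile
    # and pair start/end records into per-script durations.
    import re

    LINE = re.compile(r"METRICS:\s*(?P<script>\S+)\s+(?P<state>\S+)\s+(?P<time>\d+)")

    def durations(logfile_path: str) -> dict[str, int]:
        started, result = {}, {}
        with open(logfile_path) as log:
            for line in log:
                m = LINE.search(line)
                if not m:
                    continue
                script, state, time = m["script"], m["state"], int(m["time"])
                if state == "start":                 # assumed state keyword
                    started[script] = time
                elif script in started:              # any end state closes the record
                    result[script] = time - started.pop(script)
        return result

    # The resulting {script: duration} table can then be exported in tabular form
    # for the spreadsheet-based statistical interpretation described above.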

4.3 Integration of Performance Measurements in the Continuous Integration Environment

As described in Section 3.3, the performance evaluation environment will be integrated into the Continuous Integration environment. The procedure used to measure the system's performance is as follows:

1. Identify the most frequently used software functionalities using the traffic model document.
2. Evaluate the thresholds for the different "messages" and "software functionalities" using the performance target specification document.
3. Execute the performance test environment within the regression test environment and integrate the results into the Continuous Integration environment. The performance test environment is already available but still needs to be integrated so that it is executed fully automatically.
4. The evaluation of the performance results is done using an available Python script called "perf-eval.py". This script parses the log files and determines the actual performance behavior of the "message" or "software functionality".
5. A partially developed Perl script, called "perf-ci.pl", parses the older performance results as well as the newly measured results and creates the performance measurement summary as described in Section 3.3. The script already supports the drawing of the evaluation graphs as well as the measurement history of the older results. The statistical summary, the support for the threshold, the notification of the developer and the support of a signal light still need to be implemented.

Using this environment, the performance evaluation team and the developers will be able to get a good overview of

the performance behavior of their software modules or functions. Moreover, they are able to see tendency trends of the upcoming software performance. Hence, they are able to identify performance problems early in the development process. Because of the threshold, a violation of the runtime behavior of a software functionality can easily be identified and handled accordingly. Additionally, a signal light can be used to easily monitor and visualize the status of the performance for any software functionality. For this, different performance states, e.g. the runtime of the software functionality is close to the threshold, have to be defined. This will hopefully improve the software performance optimization process.
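A minimal sketch of the signal light idea; the states and the warning margin are assumptions for illustration, not project-defined values.

    # Sketch of the signal light idea (states and margin are assumptions): classify
    # the measured runtime of a software functionality relative to its threshold.

    def signal_light(measured_us: float, threshold_us: float, warn_margin: float = 0.9) -> str:
        if measured_us >= threshold_us:
            return "red"      # threshold violated
        if measured_us >= warn_margin * threshold_us:
            return "yellow"   # close to the threshold
        return "green"        # well below the threshold

    print(signal_light(measured_us=185.0, threshold_us=200.0))  # -> "yellow"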

4.4 Results of the Case Study

The validation of these metrics has been carried out in a case study. Two SCRUM teams, which are located at one development site and are working on several of the four components, have been selected. As the timeframe, we included some complete sprints. One of the teams had been working under SCRUM rules for a long time. For the second team, it was the first sprint, so the complete process might not have been running stably. All teams are working on parallel branches, which we call increments for historical reasons. One increment is in a stable bugfix phase, the other one in the normal ongoing development phase. We monitored the metrics on these branches strictly separately.

4.4.1 Metrics for Continuous Integration

We saw that we had a low pulse of integration, which differs significantly between the components and teams. We had expected at least as many checkins per day as there are team members. However, even on well running components we only reached half of the expected ratio. The explanation was already known to us: within both teams, we have dedicated developers and testers in nearly the same ratio. This is caused by history, where a plan driven development process had been in use. This specialization is contradictory to SCRUM. However, the metrics have confirmed that fact, so the priority of countermeasures has been increased. We also noticed that there is no rush hour of checkins at the end of a sprint. All checkins are nearly uniformly distributed over time. This is a very good indication that the teams, even the newer one, are constantly working on the backlog items. We have seen that on some components the build and test time, which has been measured by MaxDuration with <phase> set either to build or to test, is too high. E.g., we had 1.5 hours build and 2 hours test time as maximum values. This sums up to 3.5 hours in the worst case. That is in contrast to our desired upper bound of 10 minutes for build and test time. Immediately, we have seen one main improvement, which has lowered the build time by 50%. That improvement has been realized during the first inspection of the reasons why a build takes so long. Furthermore, a detailed analysis of the build and test times has been started. Please note that this had been seen before without metrics, which is reasonable because of the factor of 20 between both values. However, the delivered numbers have increased the priority again, because they are not based on feelings or estimations. These numbers are hard facts, which can not be

put aside easily. We also realized that within our stabilization branch the build times are in the range of 0.5 to 1 hour. That is surprising, because we are using the winkin mechanism of ClearCase, where once compiled objects are reused. So either the developers have not compiled the software before checkin or the winkin mechanism is not working properly. This needs further investigation, which has been started. In addition, we found a significant improvement of our build time by improving the structure of the build mechanism. This improvement has been identified and realized very fast, by simply reusing intermediate build results. That has reduced the build time by half. Some additional possibilities for improvement have been identified and will be scheduled for the future, because no simple solution exists. Please note that these optimizations could also have been identified without metrics. However, the results of the metrics initiated these investigations.

Based on the derived data for Continuous Integration, we have analyzed it with a standard spreadsheet program. It can easily be determined from a high standard deviation whether the build times have some isolated peak values or not. Additionally, with gliding (moving) average values a trend of the build times can be determined.

[Figure 1: Buildtimes; y-axis: build time in seconds, x-axis: date of execution]

In Figure 1 the build times distributed over time are presented. The y-axis describes the build time in seconds. The x-axis shows the time point of execution in DD-MM-YYYY format. It can be seen that some very high peak values have occurred, which have been immediately analyzed. As a consequence, the build system has been significantly improved. Additionally, it can be seen in that diagram, by simply comparing the number (density) of checkins, that the checkin activity does not have any rush period.

[Figure 2: Buildtimes distributed over components; y-axis: build time in seconds, x-axis: obfuscated component names]

In Figure 2, a representation with the component names has been included. The y-axis describes the build time in seconds. The x-axis shows the obfuscated component names distributed over time. Our industrial partner does not want to publish the real names, because they may reveal hidden information about the architecture. The exact points in time are not shown in Figure 2 to prevent an overload of the diagram. In Figure 2 it can be seen that some components have really long build times, e.g. Comp3 and Comp8. These components have been analyzed first to find optimizations of the build procedure. Furthermore, we have seen that some components like Comp3 have more checkin activity compared to other ones. That has additionally been regarded as a hint for an inappropriate architecture.

The statistical interpretation of data has to be done very carefully. Otherwise false conclusions can be drawn. A careful statistical interpretation is very time consuming. Especially for achieving short term results the required effort is too high. Additionally, for a normal SCRUM based process the resources for doing that are not available. However, in the future we will put the visualization on our web interface for Continuous Integration. Thus, the teams can decide by themselves to use the information, as it has been done with other visualizations, e.g. burndown charts.

Even if a statistical interpretation of the data for day by day use in SCRUM is too time consuming, it might be necessary for longer periods of time due to the huge amount of data to be analyzed. These longer periods are often of interest in process optimizations.
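The kind of spreadsheet analysis described above (peak detection via the standard deviation and a gliding average for the trend) can be sketched as follows; the peak factor and the window size are illustrative choices, not values used in the case study.

    # Sketch (illustrative, mirrors what was done manually in the spreadsheet):
    # flag isolated peak build times via the standard deviation and compute a
    # gliding (moving) average to show the build time trend.
    from statistics import mean, stdev

    def peaks(build_times_s: list[float], factor: float = 2.0) -> list[float]:
        avg, sd = mean(build_times_s), stdev(build_times_s)
        return [t for t in build_times_s if t > avg + factor * sd]

    def moving_average(build_times_s: list[float], window: int = 5) -> list[float]:
        return [mean(build_times_s[i:i + window])
                for i in range(len(build_times_s) - window + 1)]

    # Example with made-up build times in seconds:
    times = [1800, 1750, 1900, 12000, 1820, 1780, 1850]
    print(peaks(times))              # the 12000 s outlier is reported
    print(moving_average(times, 3))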

4.4.2 Test Metrics

In a first attempt, it has been tried to get an impression of the distribution of the branch coverage, also called the C1 criterion, over time. Therefore, 5 components have been chosen.

[Figure 3: Branch coverage of several components (SC1 to SC5); y-axis: branch coverage, x-axis: build number]

In Figure 3 the branch coverage distribution over time is shown. The branch coverage is shown on the y-axis.

It ranges from 95% down to 50%. The x-axis shows the time points. The branch coverage has only been determined during builds in the checkin phase; so every time a developer does a checkin, the coverage of the affected component is determined. As a result of this analysis, it can be seen that for component SC1 there has been a drastic reduction of the branch coverage from 75.1% to 51.4% in software build 8. The cause was newly added functionality. However, that is in clear contrast to the Definition of Done in use, which has been shown in Section 2.8. So it is not astonishing that the branch coverage later increased again to 75.4%. On that component there has also been a slighter decrease of the branch coverage at build 27, which has been corrected very soon. On all other components there has not been much change. That might be because the timeframe of the collected data has not been very long. In the future the trend of the branch coverage will be recorded, which will lead to a broader picture. In Figure 3 it can be clearly seen that the value of the branch coverage differs significantly between components. A required branch coverage that has to be reached by all components, as sometimes suggested, is not very useful. Either the threshold value has to be chosen too low (in our example it would be at the level of 75%, which would be too low for all other components, especially for component SC4), or it is set too high and therefore cannot be reached by all components. So the DoD in use, where every component has an individual threshold value, is very useful. Another metric regarding tests has simply been the number of test cases:

         Comp1  Comp2  Comp3  Comp4  Comp5
  FB3       85    539     89    259    118
  FB4       98    551     98    320    152
  FB5       98    581     98    336    161
  FB6      100    637     98    376    221
  FB7      119    641     98    392    227

Table 1: Number of testcases

In Table 1 the number of test cases available for selected components with respect to time is shown. Our software production cycle groups two successive sprints into a so-called feature build (FB). In this table, these feature builds are named FB3, ..., FB7. We again obfuscated the component names. Based on this data, we have seen that the total number of test cases has been in the desired range, with one exception. The component Comp5 contains approximately 25% of the complete code. So it has been identified in feature build FB3 that the number of available test cases was too low and has to be increased. However, this process has not been finished yet. In particular, the exact correlation to use cases seems not to be easy to determine. The reason is that the process in place uses a mixture of stories and mainly technical specifications. So on that topic, too, further work has to be done.

4.4.3 Performance Metrics

As discussed earlier, the integration of the performance test environment into the regression test environment as well as into the Continuous Integration environment still needs to be done. So far, several scripts to support a fully automated process have been developed. The most frequently used software functionality and the respective thresholds have been defined, too. The scripts are partially available, and some test runs with synthetic test data have shown that our approach is applicable. Moreover, the results are promising and we are looking forward to getting real test data. Unfortunately, we cannot provide any results of a case study yet.
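
As a rough illustration of the intended threshold check, a gate of the following kind could be run in the Continuous Integration environment. The scenario names, the thresholds and the way the execution times are measured are purely illustrative assumptions; the real scripts and threshold values differ.

```python
"""Sketch of a performance gate for the Continuous Integration environment."""
import statistics
import time

# Purely illustrative scenario names and thresholds in seconds.
THRESHOLDS = {"most_used_function_a": 0.200, "most_used_function_b": 0.150}

def measure(func, repetitions=10):
    """Run the scenario several times and return the median wall-clock time."""
    samples = []
    for _ in range(repetitions):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

def performance_gate(scenarios):
    """Return all scenarios whose median execution time exceeds its threshold."""
    violations = {}
    for name, func in scenarios.items():
        duration = measure(func)
        if duration > THRESHOLDS[name]:
            violations[name] = duration
    return violations
```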

4.5 Discussion of the Metrics

We have seen that the metrics have significantly helped to put the thoughts of improvement on a more solid basis. Every metric has been found to be reasonable for answering the questions and putting the right priority on the improvements.

4.5.1 Metrics for Continuous Integration

With CI Frequency we have an indicator for the activity of our SCRUM teams. Because different types of backlog items exist and thus different completion times occur, we have seen that it can only be taken as a statistical average value. Thus, the pure number needs some very detailed interpretation. We have also monitored the number of failures during the build and test phase. Therefore, with the use of ErrorRatio we have a simple indicator for the quality status of a component. However, any ratio below 1 has to be taken with care. Normally, if a build breaks, the next checkin should solve that, because before every checkin a developer should prerun the tests. But we have seen that with bigger changes, like interface changes, more than one person has been involved. Hence, they check in their solutions, which have only been expected to work, and see that in combination with other items conflicts appear. But that is the normal integration business as it happens on bigger changes. In that case, a higher error ratio is to be expected. However, the same ratio can be seen if a developer checks in very hastily and stumbles from error to error. Instead of using the maximum value of the build time, we thought of using more statistically based information like average values and standard deviation. To do that, only the successful builds would have to be taken, because build failures normally can be detected in a shorter time. Then a collection of build times has to be interpreted statistically, e.g., with a spreadsheet program. However, we decided against doing that, because for our purpose the maximum, taken as a worst case, can be used as a pure indicator. For every next step, further examination has to be done at a very detailed level beyond a statistical level.
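
For illustration, the three indicators could be computed from the build history roughly as sketched below. The record format is an assumption, and the reading of ErrorRatio as successful builds per failed build is only one plausible interpretation of the metric.

```python
"""Sketch of CI Frequency, ErrorRatio and maximum build time from the build history."""
from dataclasses import dataclass

@dataclass
class BuildRecord:               # assumed record format, one entry per checkin build
    checkin_id: str
    duration_s: float            # wall-clock time of build and test phase
    success: bool

def ci_frequency(builds, working_days):
    """Checkins per working day; only meaningful as a statistical average."""
    return len(builds) / working_days

def error_ratio(builds):
    """Assumed reading: successful builds per failed build (below 1 needs a closer look)."""
    failed = sum(1 for b in builds if not b.success)
    passed = len(builds) - failed
    return float("inf") if failed == 0 else passed / failed

def max_build_time(builds):
    """Worst-case indicator; failed builds are excluded, as they usually abort early."""
    return max(b.duration_s for b in builds if b.success)
```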

4.5.2 Test Metrics

In Figure 3 it can be clearly seen that the branch coverage trend is very important for our DoD. So in the future more data will be analyzed. The differentiation between components has also been shown to be adequate. So the proposed metrics will be used further on. According to [Mye04], testing has a special nature:

Testing is the process of executing a program with the intent of finding errors.

So, a test of a program should have as its major target to identify some unknown error. A more positive approach, to improve the quality or even the confidence in a program, is according to [Mye04] not sufficient. So, if the following statement is kept as the major test principle, the test process might be too weak: Testing is the process of establishing confidence that a program does what it is supposed to do. However, that seems to be one of the targets every project manager has regarding the product to be delivered. He wants to have a high confidence level. But for a tester that is the wrong mindset [Mye04] and in consequence leads to an ineffective test process. In our environment, we move the before mentioned statement to a meta level, where we describe the confidence level of our tests. A first approach to that can be the combination of the metrics for branch coverage and the number of test cases. So, these metrics might give a first answer to the quality of the test cases and in consequence give a first indication of the software's quality.

4.6 Further Ideas for Metrics

Based on our experiences, we will take a closer look at the following metrics in the future. We have seen that the checkins differ significantly in size and complexity. Therefore, we will introduce a measurement for these different behaviors by the following definition, which is based on [SBB+06]:

Definition 11. changeSize(Ci) is defined as the sum of the numbers of added, changed and deleted lines for the checkin Ci.

Based on that, we can use changeSize(Ci) as an indicator for the complexity of a change. We know that a pure number of lines is not directly correlated to the complexity of a software product, as has been mentioned, e.g., by [SBB+06]. However, it can be used as a first indicator for the impact of a change. Additionally, the pure number of changed configuration items can be useful:

Definition 12. For a checkin Ci the number of changed configuration items is defined as changedItems(Ci).
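
For illustration only, both numbers could be taken directly from the version control system. The sketch below assumes a git repository and simplifies the diff handling (a changed line is counted as one added plus one deleted line), so it only approximates Definition 11.

```python
"""Approximate computation of changeSize(Ci) and changedItems(Ci) from git."""
import subprocess

def _git_show(commit, option):
    """Return the non-empty output lines of 'git show <option> --format= <commit>'."""
    out = subprocess.run(["git", "show", option, "--format=", commit],
                         capture_output=True, text=True, check=True).stdout
    return [line for line in out.splitlines() if line.strip()]

def change_size(commit):
    """changeSize(Ci): added plus deleted lines (a changed line counts as one of each)."""
    total = 0
    for line in _git_show(commit, "--numstat"):
        added, deleted, _path = line.split("\t", 2)
        if added != "-":                     # binary files are reported as "-"
            total += int(added) + int(deleted)
    return total

def changed_items(commit):
    """changedItems(Ci): number of configuration items touched by the checkin."""
    return len(_git_show(commit, "--name-only"))
```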

Also of interest is an indicator for a proper modularization. Therefore, all possible configuration items have to be disjointly distributed to the components. We can then count how many components are involved in a single checkin. Based on that distribution, we can define a simple measure, which is an indicator of the modularity of a component.

Definition 13. For a checkin Ci and a set of components Comp_i we define:

\[
\mathrm{LowCohesionIndicator}(Ci, Comp_i) :=
\begin{cases}
0 & \text{for } Ci \cap Comp_i = \emptyset \ \text{or}\ Ci \cap Comp_i = Ci \\
\dfrac{\#(Ci \cap Comp_i)}{\#Ci} & \text{else}
\end{cases}
\]

Thus, LowCohesionIndicator(Ci, Comp_i) is an indicator of whether a checkin is restricted to one component. In such a case

\[
\sum_{Comp_i} \mathrm{LowCohesionIndicator}(Ci, Comp_i) = 0
\]

holds. That is reasonable, because the change has only affected one component and is only a local one. In that case, no penalty has been added to any component. In the other case, if more than one component has been involved,

\[
\sum_{Comp_i} \mathrm{LowCohesionIndicator}(Ci, Comp_i) = 1
\]

holds. So a penalty will be distributed over all components. However, more effort in that area is necessary. E.g., the indicator is not symmetric, so one component with a high number of configuration items might have a higher LowCohesionIndicator than a smaller component. One possible solution could be to normalize this number by the size of a component. Additionally, the separation of configuration items into interface parts and implementation adds useful information, which should be included in further considerations. The information that a build or test fails can add significant hints for the software architecture as well. This has to be included in our considerations. In the area of metrics for tests there is also a need for improvement. As has been demonstrated in Table 1, the pure number of test cases is not sufficient, because it does not take the size of the components into account. So to really use that metric, it has to be weighted with the code size:

Definition 14. The number of test cases per lines of code is defined as

\[
\#TCperLOC = \frac{\#TC}{\#NLOC}
\]

where NLOC is defined as the number of non-commented lines of code for <component>. Thus, #TCperLOC should be an indicator of which component does not have sufficient test cases. Because this metric has not been introduced yet, a threshold consideration cannot be given in this paper.
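
To make the two definitions concrete, the following sketch evaluates LowCohesionIndicator and #TCperLOC for a hypothetical checkin; the mapping of configuration items to components and all example values are invented for illustration.

```python
"""Sketch of Definitions 13 and 14 with invented example data."""

def low_cohesion_indicator(checkin_items, component_items):
    """LowCohesionIndicator(Ci, Comp_i): 0 for purely local (or unrelated) changes,
    otherwise the fraction of the checkin's items that fall into the component."""
    common = checkin_items & component_items
    if not common or common == checkin_items:
        return 0.0
    return len(common) / len(checkin_items)

def tc_per_loc(test_cases, ncloc):
    """#TCperLOC: number of test cases per non-commented line of code."""
    return test_cases / ncloc

# Invented example: a checkin touching two components.
checkin = {"a.c", "a.h", "b.c"}
components = {"CompA": {"a.c", "a.h"}, "CompB": {"b.c", "b.h"}}
penalties = {name: low_cohesion_indicator(checkin, items)
             for name, items in components.items()}
# The penalty is distributed over the involved components and sums up to 1.
assert abs(sum(penalties.values()) - 1.0) < 1e-9
```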

5. CONCLUSION AND FUTURE WORK

So some of our initial questions have been answered: Our Continuous Integration is a regular integration, but the frequency has not been high enough. We have also seen that some results of the metrics show us facts we already knew, e.g., that we have explicit testers in our teams, which is not desired by SCRUM. However, based on the metrics, we have now seen that this reduces the productivity of our team significantly. Also, some first results of the metrics have to be carefully analyzed to prevent wrong conclusions. This takes time and it has to be done with the according experts, because in-depth knowledge is necessary. Hence, the results of the metrics have to be reflected on very carefully. We have also seen that a very broad knowledge is necessary to draw the right conclusions. Sometimes in-depth knowledge ranging from the build mechanism up to architectural or process topics is required. On bigger development projects this knowledge cannot be concentrated in one person. Several experts have to be involved, and those experts are normally restricted resources. If these experts are not included in the considerations, wrong conclusions can very easily be drawn. However, we have found some real improvements, which significantly reduce the build time and thus increase the productivity of the developers. This improvement could easily be checked by comparing the build times before and after the change. In this way, an improvement of the development environment has been achieved. This has reduced the cycle time for developers and thus improved the development process. On more abstract topics, only a discussion of concepts could be achieved for estimating the benefit. But this discussion can be put on a more solid basis with the available metrics. Some questions have not been answered precisely, for example whether the code has sufficient quality. But we have seen that there does not exist a formal answer to that question. We tried to use a simple definition of branch coverage to deal with that situation. It has been a pragmatic solution. We did not require an exact minimum level, but defined that the branch coverage is not allowed to decrease between sprints. Our proposed metrics for tests clearly give first hints for the answer regarding the quality of the tests. With the combination of branch coverage and the weighted number of test cases, a first step in that direction has been achieved. That holds especially in a test driven development, where a direct correlation to the code should exist. So our original second question regarding the quality of the software could not be answered directly. However, the quality of the tests should have an impact on the quality of the software. In the future, we would like to collect metrics for some more sprints to include some trends. Therefore, some graphical interpretation or real statistical analysis is required. Thus, a deeper reliability analysis based on statistical methods, e.g., using [Lyu96], will be realized. We would also like to include the metrics proposed in Section 4.6, because some hints on the software architecture might be given, too. However, we have seen that without permanent external help or support, the SCRUM teams will only use the metrics very seldom.

In an agile argumentation the conclusion is then quite simple: The effort of using the metrics has been too high compared to delivering products. As a consequence, there is no business value in using these metrics. However, we believe that a better visualization of the metrics, e.g., a better integration into Hudson, will help significantly and increase the acceptance. As an additional possibility, the team should suggest metrics, which are then implemented and installed. Another argument for using metrics is the existence of a distributed development environment. There, the determination of the impediments' priorities in the SCRUM of SCRUMs could be supported by the used metrics. Moreover, we think that integrating performance measurements in the Continuous Integration environment can help to identify a creeping slow-down of the software performance as well as upcoming software performance problems early. These problems can easily be passed over to the developers so that they can react in an appropriate way. We do not think that performance metrics integrated in a SCRUM based process influence the agile principles, as they only help the developers to pay attention to upcoming performance problems, which can be addressed in a backlog. Additionally, performance measurements integrated in the Continuous Integration environment can be used to verify some non-functional requirements, as defined in the "definition of done". This is also very interesting because systems performance as a non-functional requirement should be permanently taken into consideration in SCRUM based processes [CR08]. So it might be good to consider performance criteria during the whole project. That has been realized by integrating them into the Continuous Integration environment. So, to summarize the results of our initial questions:

1. How continuous is our Continuous Integration? We have seen that our Continuous Integration is as continuous as it should be. However, the frequency seems to be too low.

2. Does the code have sufficient quality? This question has not been clarified in detail. But with the introduced metrics we have developed some first impressions of the quality of our tests. In the future these metrics have to be further modified, and then it has to be investigated whether a correlation between test and code quality exists.

3. What is an appropriate definition of done? It has been shown that the used Definition of Done in the area of branch coverage has been appropriate. So a threshold value that differs between components has been shown to be useful.

Additionally, by using these metrics some modifications, which have not been part of our questions, have been identified. Thus, some major impediments in our build process have been shown and removed. After the correction, the cycle time of a development phase has been significantly improved. To summarize: Metrics can significantly support a SCRUM based process. But they have to be handled with care to keep

the agile principles, like self-organizing teams. Especially the psychological effect of using metrics in SCRUM teams must not be neglected, because metrics improve the ability of management to control SCRUM teams. If a SCRUM master cannot prevent this control from being misused by management, one major aspect of SCRUM, the self-organization, is destroyed. But if these metrics are used within the team, they significantly help to qualify impediments, especially in distributed teams. These metrics might even help SCRUM teams to spread within a bigger company. The first of the authors has long industrial experience with CMMI in the area of software development for mobile phones. In that period, a strict use of metrics has sometimes been shown to be not helpful, because these metrics did not help to increase productivity. In most cases these metrics had been introduced only due to management needs. In an agile environment that effect has been visible even more strongly. So a moderate use of metrics, defined by the team themselves, is suggested.

6. ACKNOWLEDGMENT

This research is granted by Nokia Siemens Networks. The authors would like to thank the LTE group, especially Helmut Voggenauer, Jörg Monschau, Karl Mattern and Joachim Bauernberger as representatives, for the excellent support and contributions to this research project.

We also thank Hans-Michael Windisch and the anonymous referees for careful reading and providing valuable comments.

7. REFERENCES

[BA07] Luigi Buglione and Alain Abran. Improving estimations in agile projects: Issues and avenues. In Proceedings Software Measurement European Forum (SMEF), 2007.
[Bea07] Paul A. Beavers. Managing a large "agile" software organization. In IEEE Agile 2007, pages 296–303, 2007.
[Bea09] Kent Beck et al. Manifesto for agile software development. http://agilemanifesto.org, Jul 2009.
[Bec02] Kent Beck. Test Driven Development. Addison-Wesley Longman, 2002.
[BGMJ09] S. Bhate, R. Gupta, M. Macwan, and S. Jaju. An Agile Approach to Performance Testing. Performance Engineering and Enhancement, 7(1):53–64, 2009.
[Boe02] Barry Boehm. Get ready for agile methods, with care. IEEE Computer, pages 64–69, 2002.
[BT05] Barry Boehm and Richard Turner. Management challenges to implement agile processes in traditional development organizations. IEEE Software, pages 30–39, 2005.
[CK94] Shyam Chidamber and Chris Kemerer. A metrics suite for object oriented design. IEEE Transactions on Software Engineering, 20:476–492, 1994.
[CLW97] Mei-Hwa Chen, Michael R. Lyu, and W. Eric Wong. Incorporating code coverage in the reliability estimation for fault-tolerant software. In Proceedings 16th IEEE Symposium on Reliable Distributed Systems, Durham, North Carolina, pages 45–52, 1997.
[CR08] Lan Cao and Balasubramaniam Ramesh. Agile requirements engineering practices: An empirical study. IEEE Software, 25:60–67, January/February 2008.
[DMG07] Paul M. Duvall, Steve Matyas, and Andrew Glover. Continuous Integration: Improving Software Quality and Reducing Risk. Addison-Wesley, 2007.
[DTHK05] Yael Dubinsky, David Talby, Orit Hazzan, and Arie Keren. Agile metrics at the Israeli Air Force. In Agile Conference 2005. IEEE, 2005.
[FG07] Chris Fry and Steve Green. Large scale agile transformation in an on-demand world. In AGILE, 2007.
[Fow06] Martin Fowler. Continuous integration. http://martinfowler.com/articles/continuousIntegration.html, May 2006.
[FW10a] Christian Facchi and Jochen Wessel. The definition of metrics for continuous integration in SCRUM. In SMEF (Software Measurement European Forum), Rome, 2010.
[FW10b] Christian Facchi and Jochen Wessel. The definition of metrics for continuous integration in SCRUM – how continuous is our continuous integration. In SCRUM Days 2010, Munich, www.scrum-day.de, 2010.
[GDA+08] Hillel Glazer, Jeff Dalton, David Anderson, Mike Konrad, and Sandy Shrum. CMMI or agile: Why not embrace both! Technical report, Software Engineering Institute, CMU/SEI-TN-003, 2008.
[Gla98] R. Glass. Software Runaways: Monumental Software Disasters. Prentice Hall, 1998.
[Glo08] Boris Gloger. SCRUM. Hanser, 2008.
[Gun98] N. H. Gunther. The Practical Performance Analyst. McGraw-Hill Education, 1998.
[HL84] P. Heidelberger and S. S. Lavenberg. Computer Performance Evaluation Methodology. IEEE Transactions on Computers, 33(12):1195–1220, 1984.
[Hoc07] D. Hoch. Linux Performance Monitoring. OSCON, 2007.
[JKP+05] Phillip M. Johnson, Hongbing Kou, Michael Paulding, Qin Zhang, Aaron Kagawa, and Takuya Yamashita. Improving software development management through software project telemetry. IEEE Software, 2005.
[KLV05] Mehdi Kessis, Yves Ledru, and Gérard Vandome. Klv2005. In Fifth Int. Workshop on Software Engineering and Middleware (SEM 2005), 2005.
[Lig02] Peter Liggesmeyer. Software-Qualität. Spektrum Akademischer Verlag, 2002.
[LS05] D. Leffingwell and H. Smits. A CIO's playbook for adopting the scrum method of achieving software agility. Technical report, Rally Software Development Corporation and Ken Schwaber – Scrum Alliance, 2005.
[LV09] Craig Larman and Bas Vodde. Scaling Lean & Agile Development – Thinking and Organizational Tools for Large-Scale Scrum. Pearson Education, 2009.
[Lyu96] Michael R. Lyu, editor. Handbook of Software Reliability Engineering. McGraw-Hill, 1996.
[Mar95] Brian Marick. The Craft of Software Testing. Prentice-Hall, 1995.
[Mar02] J. J. Marciniak. Encyclopedia of Software Engineering. John Wiley & Sons Inc, second edition, 2002.
[MLBK02a] Yashwant K. Malaiya, Michael N. Li, James M. Bieman, and Rick Karcich. Software reliability growth with test coverage. IEEE Transactions on Reliability, 51:420–426, 2002.
[MLBK02b] Yashwant K. Malaiya, Michael Naixin Li, James M. Bieman, and Rick Karcich. Software reliability growth with test coverage. IEEE Transactions on Reliability, 51:420–426, 2002.
[Mye04] Glenford J. Myers. The Art of Software Testing. John Wiley & Sons, Inc., 2nd edition, 2004.
[MZ07] Viljan Mahnic and Natasa Zabkar. Introducing CMMI measurement and analysis practices into Scrum-based software development process. International Journal of Mathematics and Computers in Simulation, 1:65–72, 2007.
[PY07] Mauro Pezze and Michal Young. Software Testing and Analysis: Process, Principles and Techniques. John Wiley & Sons, 2007.
[RUCH01] Gregg Rothermel, Roland H. Untch, Chengyun Chu, and Mary Jean Harrold. Prioritizing test cases for regression testing. IEEE Transactions on Software Engineering, 27:929–948, 2001.
[SB02] Ken Schwaber and Mike Beedle. Agile Software Development with SCRUM. Prentice Hall, 2002.
[SBB+06] Danilo Sato, Dairton Bassi, Mariana Bravo, Alfredo Goldman, and Fabio Kon. Experiences tracking agile projects: an empirical study. Journal of the Brazilian Computer Society, 2006.
[Sch04] Ken Schwaber. Agile Project Management with SCRUM. Microsoft Press, 2004.
[SGK07] Danilo Sato, Alfredo Goldman, and Fabio Kon. Tracking the evolution of object-oriented quality metrics on agile projects. In Agile Processes in Software Engineering and Extreme Programming. Springer, 2007.
[SL05] R. Srinivasan and O. Lubeck. MonteSim: A Monte Carlo Performance Model for In-order Microarchitectures. ACM SIGARCH Computer Architecture News, 33(5):75–80, December 2005.
[SS01] A. Schmietendorf and A. Scholz. Aspects of Performance Engineering - An Overview, pages IX–XII. Springer Verlag Berlin, Heidelberg, 2001.
[SSRR08] J. Sutherland, G. Schoonheim, E. Rustenburg, and M. Rijk. Fully distributed scrum: The secret sauce for hyperproductive offshored development teams. In IEEE AGILE 08, Toronto, pages 339–344, 2008.
[SVBP07] Jeff Sutherland, Anton Viktorov, Jack Blount, and Nikolai Puntikov. Distributed scrum: Agile project management with outsourced development teams. In HICSS'40, Hawaii International Conference on Software Systems, 2007.
[TF07] P. Trapp and C. Facchi. Performance Improvement Using Dynamic Performance Stubs. Technical Report 14, Fachhochschule Ingolstadt, August 2007.
[TF08] P. Trapp and C. Facchi. How to Handle CPU Bound Systems: A Specialization of Dynamic Performance Stubs to CPU Stubs. In CMG '08: International Conference Proceedings, pages 343–353, 2008.
[TFB09] P. Trapp, C. Facchi, and S. Bittl. The Concept of Memory Stubs as a Specialization of Dynamic Performance Stubs to Simulate Memory Access Behavior. In CMG '09: International Conference Proceedings. Computer Measurement Group, 2009.
[TMF10] Peter Trapp, Markus Meyer, and Christian Facchi. Using CPU stubs to optimize parallel processing tasks: An application of dynamic performance stubs. In ICSEA (International Conference on Software Engineering Advances) 2010, Nice (accepted paper), 2010.
[vSB99] Rini van Solingen and Egon Berghout. The Goal/Question/Metric Method. McGraw-Hill, 1999.
[XSZC00] M. Xenos, D. Stavrinoudis, K. Zikouli, and D. Christodoulakis. Object-oriented metrics: a survey. In Proceedings of the FESMA, Federation of European Software Measurement Associations, Madrid, Spain, 2000.
[ZHM97] Hong Zhu, Patrick A. V. Hall, and Johnson H. R. May. Software unit test coverage and adequacy. ACM Computing Surveys, 29:366–427, 1997.
