Timesheet Assistant: Mining and Reporting Developer Effort ∗

Renuka Sindhgatta, Nanjangud C. Narendra, Bikram Sengupta, Karthik Visweswariah
IBM Research India, Bangalore, India
{renuka.sr,narendra,bsengupt,vkarthik}@in.ibm.com

Arthur G. Ryman
IBM Rational, Toronto, Canada
[email protected]

ABSTRACT


Timesheets are an important instrument used to track the time spent by team members in a software project on the tasks assigned to them. In a typical project, developers fill timesheets manually on a periodic basis. This is often tedious, time-consuming and error-prone. Over- or under-reporting of time spent on tasks causes errors in billing development costs to customers and wrong estimation baselines for future work, which can have serious business consequences. In order to assist developers in filling their timesheets accurately, we present a tool called Timesheet Assistant (TA) that non-intrusively mines developer activities and uses statistical analysis on historical data to estimate the actual effort the developer may have spent on individual assigned tasks. TA further helps the developer or project manager by presenting the details of the activities along with the effort data, so that the effort may be seen in the context of the actual work performed. We report on an empirical study of TA in a software maintenance project at IBM that provides preliminary validation of its feasibility and usefulness. Some of the limitations of the TA approach, and possible ways to address them, are also discussed.

One of the key features of a software development or maintenance contract is that customers usually provide payment based on the time spent by software developers in executing that contract. This necessitates the use of timesheets, which are records of effort expended. These timesheets are manually filled in by developers, and then verified by the software vendor and the customer. Appropriate payment is made to the software vendor after the verification. In addition, timesheets are also used within the software organization to track effort and developer productivity. While several products that facilitate manual timesheet reporting exist [8, 1, 2, 3], accurately estimating the actual effort expended on software development tasks remains a thorny issue. Manually recording time in timesheets is a tedious and error-prone process. A number of reasons may lead to incorrect timesheet data being recorded; for example:

Categories and Subject Descriptors: D.2 [Software Engineering]: Design – methodologies; Design Tools and Techniques – computer-aided software engineering

General Terms: Design, Experimentation, Human Factors, Verification

Keywords: Timesheet, Development Activity, Mining, Estimation

∗ Thanks to Saurabh Sinha and Subhajit Datta for their feedback.


1. INTRODUCTION

• Developers often work on multiple development tasks in parallel, or they interleave these tasks with other activities (e.g., code reviews, learning activities, etc.). Later, it becomes difficult to separate out the time that was spent on these tasks and activities individually.

• The developer may not have expended the full effort for which he/she is expected to be billed, and to cover that up, the developer may over-report the effort spent on certain tasks.

• If the developer truthfully reports an effort that is significantly different from the one originally estimated (often by the team lead), then he/she may have to explain why a certain task took more/less time. It has been our experience that this is a situation many developers tend to avoid.

• Also, it is difficult for project managers to thoroughly analyze and validate timesheet data, given that a manager would usually not know the full development context, and manually retrieving and checking the necessary information from the project repositories for each timesheet entry is not practical.

Errors in timesheets that arise due to such challenges can be damaging in many ways. Under-reporting of effort results in lower revenue for the software vendor and also unrealistic expectations from the customer for future work, which could seriously impact the quality of software delivery.

On the other hand, over-reporting of effort – if detected by the customer – damages the vendor's reputation, leading to potential loss of business. Timesheet data is also used to refine estimation baselines in organizations following a process improvement model such as Capability Maturity Model Integration (CMMI, http://www.sei.cmu.edu/cmmi/). A wrong baseline resulting from poor-quality timesheet data can lead to severe estimation problems and delivery issues for the software organization in the long run.

To address these challenges, we present Timesheet Assistant (TA), a tool that mines developer activities to (i) estimate the time (i.e., effort) a developer may have actually spent on assigned tasks, and (ii) generate an activity report that reviewers may drill down into, at different levels of detail, to understand the characteristics of the work carried out. Unlike some existing approaches for timesheet automation (e.g., [5]) that involve intrusive (and therefore unpopular) methods such as recording keyboard/mouse clicks and monitoring the applications that the developer has accessed, TA is non-intrusive and does not require any change to the environment a developer is working in. One of the key requirements of TA is the availability of information on the files modified for a given task. In most modern development environments, such as IBM Rational Team Concert (RTC, http://www-01.ibm.com/software/awdtools/rtc/) or Visual Studio Team System (VSTS, http://msdn.microsoft.com/enus/teamsystem/default.aspx), this information is available as part of the tasks or work items assigned to the developer. In Section 2, we explain the role of such environments in our overall solution approach.

TA analyzes information in the following manner. First, it extracts all the tasks a developer has worked on in the given period of time. Second, for each task, TA mines the files that were changed and computes a set of metrics that help explain the overall development effort and context, for example, the size of the change, the expertise of the developer on the files that had to be changed, etc. Third, it uses some of the computed metrics and statistical techniques based on historical data (in this paper we demonstrate the use of linear regression) to estimate the time that may have been taken to complete the task. Finally, it reports all the relevant information along with the tasks in the timesheet to help reviewers drill down and validate the data, and periodically re-calibrates the estimation model based on newly submitted effort values. Section 3 presents the overall architecture of TA and explains the functioning of its different components.

Our evaluation of TA in a maintenance project at IBM has provided preliminary validation of its usefulness. For example, we have found that it is possible to develop reasonably accurate estimation models at the individual work item level and use them to suggest timesheet entries based on analysis of the development work undertaken. We call this "estimation in the small", as opposed to the traditional project-level "estimation in the large" exemplified by estimation methods such as COCOMO [9] or SLIM [7]. Also, user feedback from the case study suggests that summarizing and linking development activities with timesheet reports significantly eases the task of reviewing the reports or justifying the entries to project managers or customers. These results are encouraging and have helped spark a discussion with one of our business partners on possible productization of TA concepts, even as we carry out larger-scale validation.


At the same time, we have also identified some of the threats to the validity of our current approach (and possible ways to address those) and realized that there are certain effort determinants that are inherently difficult to capture fully, particularly using a non-intrusive approach such as ours. The results and lessons learnt from our empirical study of TA are detailed in Sections 4 and 5, while the rest of the paper discusses related work and presents our conclusions and directions for future work.

We conclude this section with a summary of the main contributions of our work:

1. We consider the challenges associated with timesheet reporting and validation, which are of great practical significance to the application services industry but have been little studied by the software engineering research community.

2. We present a novel framework called Timesheet Assistant (TA) that provides automated and non-intrusive support for addressing these challenges through a combination of software repository mining, statistical analysis and in-context reporting of development activities in timesheet reports.

3. Our solution approach departs in two significant ways from the traditional estimation literature: we perform estimation "in the small" at the level of individual tasks, and we estimate post-facto, based on actual development work carried out.

4. We demonstrate the feasibility and usefulness of TA on a real-life case study from a maintenance project at IBM and discuss the lessons learnt.

2. TOWARDS TIMESHEET AUTOMATION

While timesheets have been used in the software industry for decades, we believe that a number of recent trends now make it possible to mine project repositories and improve and automate the way timesheets are reported. To begin with, in the last few years, Integrated Development Environments (IDEs) have evolved from developer-focused tools providing features to compile, debug and deploy software programs into collaborative environments supporting project planning, work assignment, discussions, source code management, build and test management, project tracking and reporting. In such development environments, each task – whether planning, development, testing, or a defect fix – is modeled as a work item that is expected to deliver a development plan, design, feature enhancement, or code fix, as the case may be. A work item carries a set of basic attributes that are useful for tracking it, e.g., name, unique identifier, description, creator, owner, creation date, closure date, estimated effort, effort spent and so on. Custom attributes can also be defined as needed. Moreover, links may be established between a work item and associated software development artifacts (code, test cases, designs, plans, etc.) stored in a configuration management system through the definition of one or more change sets. A change set is a collection of files grouped together by the developer in a manner that is meaningful for the project. For example, all GUI-related file changes can be grouped together into a single change set and checked in against one or more work items.

This rich contextual information linked to work items makes it possible to track the development activities undertaken to implement the items, and is core to the Timesheet Assistant approach.

With IDEs becoming increasingly popular and data-rich, and tool vendors focusing on end-to-end integration across the software development lifecycle (SDLC) tool stack, we are also seeing the emergence of data warehouses to help archive large volumes of SDLC data efficiently and support fast querying and retrieval of information. Data from SDLC tools can be extracted, transformed and loaded into these warehouses, and business intelligence techniques can then be applied to the data to get deeper insight into the health of the project and obtain various kinds of reports for informed decision-making. Along similar lines, in the case of Timesheet Assistant, our goal is to run statistical analysis techniques on data and metrics extracted from the development environment to estimate the actual effort that may have been spent on assigned tasks, and then generate insightful reports on the development activities undertaken to help reviewers validate timesheet entries (see http://jazz.net/library/content/articles/insight/performancemanagement.pdf). The increasing adoption of data warehouses and BI techniques in mainstream software development [16], along with the modern IDEs described above, thus provides the necessary ingredients for developing an automated solution for timesheets. Specifically, Timesheet Assistant (TA) assumes Rational Team Concert as the development environment, Rational Insight (http://www-01.ibm.com/software/rational/products/insight/) as the data warehouse and reporting engine, and GNU Octave [4] for statistical analysis and predictive modeling of effort.
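The following is a minimal, illustrative sketch (in Python, not the actual RTC data model or Client API) of the work item and change set structure that TA mines; all class and field names here are our own and are used only to make the description above concrete.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

@dataclass
class FileChange:
    """One file touched in a change set, with its before/after versions."""
    path: str
    before_version: Optional[str]          # None for a newly added file
    after_version: str
    lines_added: int = 0
    lines_changed: int = 0

@dataclass
class ChangeSet:
    """A group of related file changes checked in against a work item."""
    description: str
    changes: List[FileChange] = field(default_factory=list)

@dataclass
class WorkItem:
    """Basic tracking attributes of a task, as described in this section."""
    identifier: int
    name: str
    item_type: str                         # e.g., "defect", "enhancement", "feature"
    creator: str
    owner: str
    created_on: date
    closed_on: Optional[date] = None
    estimated_effort_hours: Optional[float] = None
    reported_effort_hours: Optional[float] = None
    change_sets: List[ChangeSet] = field(default_factory=list)
```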

3. TA: TIMESHEET ASSISTANT

We begin this section by describing our approach towards estimation in the small, i.e., how we model factors that may determine or help explain the effort spent on individual development tasks in timesheets. Following this, we will present an overview of the Timesheet Assistant architecture and explain the functioning of its different components.

3.1 Estimation in the Small

A number of parametric software estimation models [9, 7] have evolved over the last several decades to accurately predict the overall cost, schedule and quality of a software product to be developed. These models typically embody estimation in the large; they apply across the software development lifecycle, are governed by a set of gross effort indicators whose values themselves need to be estimated (often subjectively) at the start of a project, and are then used for budgeting, project planning and control, tradeoff and risk analyses, etc. Given that these models have been built, refined and calibrated through a large number of completed software projects over several years, they provide an excellent starting point for us to study well-accepted factors that impact effort (in the large), and then consider which of those factors may also be relevant for estimation in the small, and how they may need to be re-interpreted. For our study, we used the well-known COCOMO II Post Architecture model [9], which estimates effort as a function of the size of the software project (in terms of thousands of source lines of code, or function points), adjusted by 22 variables (scale factors or effort multipliers) whose values are given by selecting a rating from a well-defined scale (ranging from Very Low to Extra High).

The choice of COCOMO II was motivated by its widespread acceptance in the software community. While the size of the source code produced (or modified) will naturally influence the effort needed even for individual development tasks tracked through timesheets, we performed an analysis of the 22 identified variables to determine their applicability for task-level estimation. We kept two things in mind while doing this. First, we wanted to fix the initial context of TA at the level of individual projects; in other words, we were interested in factors that can cause effort variations across tasks within the same project environment, rather than factors that are project-wide and likely to impact all tasks more or less uniformly. Second, since our goal is to automate timesheet reporting and validation to the extent possible, we wished to focus on factors whose values can potentially be mined from the development environment with relative ease and accuracy, without having to depend on subjective assessment by the team members who fill in timesheets, or burdening them with additional reporting overhead.

With these objectives in mind, we first identified the set of COCOMO effort factors that would normally characterize the project as a whole, and whose effect on individual development tasks we assume to be uniform. These include, for example, Development Flexibility (degree of conformance to requirements and interface standards), Analyst Capability (capability of personnel working on requirements), Team Cohesion (willingness of parties to work together), Personnel Continuity (annual personnel turnover), Process Maturity (e.g., CMM level), Use of Software Tools (degree of sophistication in tools used for the project), Platform Volatility (e.g., frequency of changes in compilers, assemblers), etc. One may argue, of course, that even some of these factors may affect one work item more than another within a project; for example, willingness to work together may vary from one group of team members to another, or some requirements analysts may be much more capable than others in the project. But we have decided to treat these factors as having uniform influence for now, given that it would in any case be difficult to objectively measure their influence on individual development tasks carried out by team members. Next, we studied COCOMO factors that may be relevant for individual work items once they have been re-interpreted for estimation in the small. We discuss these below.

3.1.1 Reliability

The COCOMO factor Required Software Reliability (more effort needs to be expended to avoid software failures that carry higher risks) may vary from one component (and its work items) to another within a project. Another factor, Data Base Size (effort required to generate test data), is considered a measure of the testing effort required and is also related to reliability. In our estimation model for timesheet tasks, we have included a factor called Required Reliability that is measured in terms of the number of test cases that have to be written (for new features) or executed (for defect fixes) before closing the work item. We have assumed that components that are deemed risky will be tested more comprehensively; hence we use the number of test cases as an indicator of a component's required reliability.

3.1.2 Complexity

The Product Complexity factor in COCOMO is factored into our model as task complexity in multiple ways.

First, we recommend building separate models for different types of work items, such as new feature implementations, enhancements or defects, since the dimensions of complexity vary across these types. For example, for the same change in terms of number of lines of code, a defect could take longer than original feature development [13]. On the other hand, the effort spent on developing a new feature may be influenced by the number of libraries being used (as the functionality of each of them needs to be understood), while in a defect-fixing activity this may not be as important, since the developer may already be aware of the library functionalities. This approach may be further extended to separate server-side work from client-side work if needed, since the former would typically involve more complex logic.

Second, for any work item, the complexity of its associated development activity is measured in terms of the number of distinct files changed, for each such file the number of methods changed, and for each such method, (i) the well-known cyclomatic complexity (which measures the number of independent paths through a program unit) and (ii) the fan-out, i.e., the number of external methods/functions being called. The number of files/methods changed helps us track whether the change was localized or relatively distributed (in the latter case it may be considered more complex). Note that not all of these measures will necessarily impact the effort spent on each work item in a statistically significant way; however, we still track them since they also help in summarizing the overall development context to timesheet reviewers.

Third, a work item may manipulate files of different types. The core functionality of the system may be implemented in some major programming language, but there will also be accompanying miscellaneous minor files such as configuration and build scripts, properties files, xml/html files, etc. Making a change to a minor file such as a properties file will in general be far less complex than an equal-sized change in a Java file. Classifying changes by identifying the types of files changed is therefore an important factor in sizing work. Accordingly, we classify key development files as "major" and miscellaneous files as "minor". For "minor" files, due to their simpler structure, we currently do not calculate any complexity measure.
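To make the complexity measures above concrete, here is an illustrative Python sketch (not the TA implementation, which computes these metrics via its Code Parser and Metrics Analyzer components); the file-type classification and the per-method complexity inputs are assumptions made for the example.

```python
from collections import Counter

# Assumed classification of "major" (core implementation) file types; anything
# else is treated as "minor" (configuration, resources, etc.). A real project
# would configure this list.
MAJOR_EXTENSIONS = {".java", ".js"}

def classify_file(path):
    """Label a changed file as "major" or "minor" based on its extension."""
    ext = "." + path.rsplit(".", 1)[-1].lower() if "." in path else ""
    return "major" if ext in MAJOR_EXTENSIONS else "minor"

def complexity_metrics(changed_methods):
    """Aggregate per-work-item complexity measures.

    changed_methods: list of dicts, one per changed method, e.g.
    {"file": "Foo.java", "cyclomatic": 4, "fan_out": 7}; the complexity
    values are assumed to be supplied by a code parser."""
    files = {m["file"] for m in changed_methods}
    major = [m for m in changed_methods if classify_file(m["file"]) == "major"]
    return {
        "files_changed": len(files),
        "methods_changed": len(changed_methods),
        "file_type_distribution": dict(Counter(classify_file(f) for f in files)),
        # Complexity is only computed for "major" files, as noted above.
        "avg_cyclomatic": sum(m["cyclomatic"] for m in major) / len(major) if major else 0.0,
        "avg_fan_out": sum(m["fan_out"] for m in major) / len(major) if major else 0.0,
    }
```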

3.1.3 Expertise

Team capability related factors in COCOMO, such as Programmer Capability (capability of programmers as a team), Applications/Platform Experience (level of applications or platform experience of the project team), Language and Tool Experience (of the team), etc., are re-considered within our timesheet model in terms of the expertise of the developer performing the task. This expertise has two dimensions. The first, which corresponds roughly to experience, may be used to differentiate how two developers with different levels of application/language/platform familiarity may perform the same activity. To handle this, we can cluster developers by experience level, which may be determined by indicators such as the time spent on the project, the overall development experience in years, etc.; the intuition being that the higher the experience, the less effort will be required for a given work item.

The second dimension of expertise deals with task-level familiarity. It may be mined from historical development data and is used to adjust the effort a developer spends on a work item based on his/her familiarity with the files that need to be updated. This low-level notion of expertise is particularly important for estimation in the small, since familiarity with files can significantly bring down the effort needed to update them to incorporate a new feature or fix a defect. We measure the work item level familiarity of a developer as the weighted average of the familiarity on each file changed by the developer, where the weight assigned to a file is the proportion of the total development size for the work item that corresponds to the file. For example, if a developer D has changed files F1 and F2 for a work item W, with 20 lines changed in F1 and 30 lines changed in F2, then D's familiarity index for W would be a weighted average of her familiarity with each file (ff), i.e., 0.4 × ff(F1) + 0.6 × ff(F2). Developer expertise computed in this way lies between 0 and 1, with a higher number indicating higher expertise.

The file familiarity ff of developer D with a file F is computed by first considering the ratio of the number of times F has been updated by developer D to the maximum number of times it has been updated by any developer throughout the evolution history of the file. This provides a relative measure of expertise compared to an expert on that file. The ratio is then used to classify developer file familiarity into three buckets (High, Medium, Low) based on threshold values that may be set by the user. Finally, each bucket is given a normalized score (we used 1.0, 0.6 and 0.3 for High, Medium and Low familiarity, respectively). For a user updating a file for the first time, familiarity is set close to 0. A small illustrative sketch of this computation is given at the end of this subsection.

We realize that modeling expertise is in itself an interesting topic of research [15], and hence we also experimented with a slightly different measure of file familiarity: the proportion of the total code in the file that has been updated by the developer, compared to the updates made to the file by other developers throughout its evolution history. However, when we computed both measures and cross-checked with a team of developers who were part of the TA case study, we realized that equating file expertise with the proportion of code updated may result in grossly inaccurate familiarity levels, since it ignores the familiarity that a developer gains of code submitted by others in the course of making his/her own updates. On the other hand, according to the first measure, familiarity increases each time a developer updates a file, since he/she gets an opportunity to review its source code, and this model resulted in familiarity values that, to a large extent, mirrored the level of familiarity the developers themselves reported. Hence we have used the first approach in our model. Details of our interviews with developers on the topic of file familiarity could not be included here due to lack of space.

Table 1 summarizes the effort factors and associated work item attributes and metrics that we track in our task-level model for timesheets. This information is presented to reviewers of timesheets to help them quickly get an understanding of the development work carried out. It is also used in designing statistical effort models for development activities.
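Below is the promised sketch of the familiarity computation, in illustrative Python. The bucket scores (1.0, 0.6, 0.3), the near-zero value for first-time updates, and the weighted average follow the description above; the threshold values are hypothetical, since the paper leaves them user-configurable.

```python
def file_familiarity(updates_by_dev, max_updates_by_any_dev,
                     high_threshold=0.5, medium_threshold=0.2):
    """Bucketed familiarity (ff) of a developer with a single file.

    The thresholds are user-settable; the values used here are illustrative,
    not the ones from the case study."""
    if updates_by_dev == 0:
        return 0.05                    # "close to 0" for a first-time update
    ratio = updates_by_dev / max_updates_by_any_dev
    if ratio >= high_threshold:
        return 1.0                     # High familiarity
    if ratio >= medium_threshold:
        return 0.6                     # Medium familiarity
    return 0.3                         # Low familiarity

def work_item_familiarity(file_changes):
    """Weighted average of per-file familiarity, weighted by each file's
    share of the work item's total change size.

    file_changes: list of (lines_changed, ff) pairs."""
    total_lines = sum(lines for lines, _ in file_changes)
    return sum((lines / total_lines) * ff for lines, ff in file_changes)

# Worked example from the text: 20 lines changed in F1, 30 lines in F2.
ff_f1 = file_familiarity(updates_by_dev=3, max_updates_by_any_dev=4)   # -> 1.0 (High)
ff_f2 = file_familiarity(updates_by_dev=1, max_updates_by_any_dev=5)   # -> 0.6 (Medium)
exp = work_item_familiarity([(20, ff_f1), (30, ff_f2)])                # 0.4*1.0 + 0.6*0.6 = 0.76
```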

Figure 1: Timesheet Assistant Architecture

3.2 Timesheet Assistant Architecture

We will now outline the architecture of TA, comprising four main components, as shown in Figure 1:

• Activity Tracker: Extracts all attributes and metrics of work items that provide the context of the development activity and are indicative of the effort spent on it. Steps (1), (2) and (3) in Figure 1 indicate the extraction and storage of this data.

• Effort Calculator: Uses statistical analysis techniques to build an effort model based on the effort predictors extracted by Activity Tracker. Subsequently, it uses the effort model to compute the effort possibly expended by the developer on a work item (steps (4) and (5) of Figure 1).

• Re-Calibrator: As more work item data is captured, the regression model may need to be refined. Re-Calibrator re-computes the regression coefficients, which are then used by Effort Calculator.

• Timesheet Visualizer: The data mined by Activity Tracker and the effort computed by Effort Calculator can be viewed using Timesheet Visualizer, represented by step (6) in Figure 1.

The following sections describe the components in detail.

3.2.1 Activity Tracker

The key components of Activity Tracker are Work Item Data Extractor, Code Parser, and Metrics Analyzer. Work Item Data Extractor uses the Rational Team Concert Client API (https://jazz.net/wiki/bin/view/Main/RtcSdk20) to extract work item attributes such as the type of work item, its creator, owner, status, estimated effort, and the change sets associated with it. For each file in a change set, the before and after versions of the file are extracted and the changes made are identified using the NetBeans (http://netbeans.org/) diff utility for Windows. Code Parser parses the file and its changes for method declarations, method invocations and decision statements, which are used by Metrics Analyzer to compute the metrics listed in Table 1. These metrics are stored in the Data Store. Metrics Analyzer also uses historical work item data available in the Data Store to compute the expertise of the developer on the changed files and, from that, on the work item as a whole.

To allow projects the flexibility of defining their own metrics that can be configured and extracted by Activity Tracker, we define an extension point for adding additional metrics – by extending the AbstractMetricProvider of the Metrics Analyzer component. A metric can be defined at different levels of granularity – work item, file and change. Activity Tracker extracts the metrics computed by all extension points and stores them in the Data Store for each work item as name/value pairs.

3.2.2 Effort Calculator

The work item and metrics information mined and computed by Activity Tracker from historical tasks, together with the effort reported for those tasks, is used by Effort Calculator, which applies statistical analysis techniques to predict effort for subsequent tasks. In our implementation so far, we have used linear regression for effort prediction. A user can create a model template for a project using TA. A model template allows the user to define the inputs for the linear regression model, such as the time period, software size, expertise and complexity metrics to be considered, and any other custom predictors defined in Activity Tracker. The model template can be instantiated, resulting in extraction of the relevant predictors and use of Ordinary Least Squares (OLS) regression to compute the model. If the predictor distributions are not normal, they are linearized by taking logarithms. The regression coefficients and the resulting effort curve are stored in the Data Store. The user can select, discard or refine the model template inputs based on a measure of goodness of fit (described later in Section 4). Once a model template is selected as suitable for prediction, the effort for any new work item is computed using the model contained in the template, i.e., its regression coefficients and predictors. As stated earlier, we have used GNU Octave for linear regression modeling.
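The effort equations reported later in Table 2 have the multiplicative form Effort = a · size^b · (1 − expertise)^c, which becomes linear after taking logarithms of both sides. The sketch below shows the same OLS computation in Python with NumPy for illustration only; the TA prototype performs this step in GNU Octave, and the sample data here is made up.

```python
import numpy as np

def fit_log_linear(effort, deltaloc, one_minus_exp):
    """Fit log(effort) = log(a) + b*log(DELTALOC) + c*log(1-EXP) by OLS.

    In practice (1-EXP) must be kept strictly positive (e.g., floored at a
    small value) before taking logarithms."""
    y = np.log(effort)
    X = np.column_stack([np.ones_like(y), np.log(deltaloc), np.log(one_minus_exp)])
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    a, b, c = np.exp(coeffs[0]), coeffs[1], coeffs[2]
    return a, b, c

def predict_effort(a, b, c, deltaloc, one_minus_exp):
    """Apply the fitted multiplicative model to a new work item."""
    return a * deltaloc ** b * one_minus_exp ** c

# Made-up illustrative values: reported effort (hours) per closed work item.
effort        = np.array([2.0, 3.5, 6.0, 4.0, 8.0])
deltaloc      = np.array([15.0, 40.0, 160.0, 60.0, 300.0])
one_minus_exp = np.array([0.1, 0.4, 0.7, 0.2, 0.4])

a, b, c = fit_log_linear(effort, deltaloc, one_minus_exp)
print(predict_effort(a, b, c, deltaloc=100.0, one_minus_exp=0.3))
```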

3.2.3 Re-Calibrator

As a project advances, the influence of various factors on the effort for a work item changes. Familiarity with the technology, stability of features through a development cycle, etc., may cause less effort to be expended for the same change compared to the initial stages of the cycle. On the other hand, in long-running projects, code decay can lead to an increase in change effort over time [13]. Either way, it becomes necessary to adapt the model to changes in the project environment. Re-calibration essentially amounts to the creation of new Effort Calculator model templates: during the project life cycle, a new model template can be instantiated containing the relevant work item information, metrics and predictors.

3.2.4 Timesheet Visualizer

The information mined by Activity Tracker and the effort predicted by Effort Calculator are used to visualize the timesheet via Timesheet Visualizer. As shown in Figure 2, the work items for a developer and the original estimated effort for each work item are listed. The effort predicted by the Effort Calculator is also presented. The Details view provides a summary of the files and the changes made, which can be further drilled down for details of the changes made to each file.

Table 1: Effort Factors and Related Attributes/Metrics

Effort Factor        | Metric/Attribute                | Description
Size                 | Lines of Code (LOC)             | Number of non-commented lines of code updated (added, changed)
Required Reliability | Number of Test Cases            | Number of test cases that have to be written/executed for a work item
Complexity           | Task Type                       | Type of task, e.g., feature development, enhancement, defect fix, etc.
Complexity           | Number of Files/Methods Changed | Number of distinct files/methods that were updated
Complexity           | File Types                      | Change distribution across different file types (e.g., core logic, properties, xml, etc.)
Complexity           | Cyclomatic Complexity           | Number of linearly independent paths through a program unit
Complexity           | Fan-out                         | Number of other functions called from a given program unit
Expertise            | Experience Level                | Developer's overall experience (e.g., in years), experience in the project, etc.
Expertise            | Task Familiarity                | Developer's familiarity with the files modified by the task

Metrics such as Size, Complexity and Expertise are provided for each file, which can help a developer or a project manager identify why a specific development activity took more or less time. The Visualizer can highlight metric values that cross a user-defined threshold or range. As Timesheet Visualizer runs on Rational Insight, its reporting component can be used to create customized reports of the information extracted by Activity Tracker.

4. EVALUATING TA

In this section we present the results of evaluating TA. The case study was a software development project at IBM during its maintenance cycle; hence we obtained only defect work items for our evaluation. Evaluation of TA on other work item types – original feature development, enhancements, etc. – is important and will be taken up in our next phase of experiments, as and when those types of work items become available to us. Nevertheless, we consider this case study significant, since maintenance activities involving defect fixes and minor enhancements constitute a significant share of the application services domain where timesheets are routinely used.

4.1 Goals and Method

The primary goals of our case study were to (i) determine how well the data extracted using TA could be used for estimating the actual effort at the work item level, and (ii) obtain feedback from the developers and their project managers on our tool and approach in general, and how it could be improved in the future. The project team that we chose was responsible for maintaining a web-based product, and the team used Rational Team Concert as their development environment. Each task was modeled as a work item, estimated time was defined, and the task was assigned to a team member. The team member would work on the assigned task and check in the relevant development artifacts. Work item data collected for 1.5 months was used as a training set to predict effort for an additional 1-month validation period. We chose linear regression modeling as our statistical technique, since it is one of the most widely used methods for effort estimation, and is simple to use and experiment with. For each work item, we computed the metrics set shown in Table 1 with two exceptions: Experience Level of the developer and Number of Test Cases. The former was not considered since all the 10 developers who were part of the maintenance activity had been with the team for more than a year and had helped develop the product they were then maintaining.

There did not seem to be any significant differences between their platform/application experience levels, although we did not collect information on their overall development experience. We could not collect data on the number of test cases that were executed for each defect item, since the testing activity was not formally recorded against the defect items in the repository. Correlation among the factors was then determined using the Pearson coefficient, and only relatively uncorrelated variables were selected [12] as independent variables for the regression analysis. These variables were: the number of lines of code updated (DELTALOC), the average cyclomatic complexity of changed methods (DELTACC), and (1-EXP), where EXP refers to the task familiarity of the developer, computed using the method outlined in Section 3.1. (The variable (1-EXP) measures the lack of expertise, which should positively correlate with effort.) To compute EXP values, file change history was extracted for a period of 6 months before the start of the case study.

The metrics calculated for determining the efficacy of the regression analysis are the well-known R² (coefficient of determination) and the Magnitude of Relative Error (MRE). R² is defined as 1 − Σ(actual effort − predicted effort)² / Σ(actual effort − mean effort)², whereas MRE is defined as the absolute value of (predicted effort − actual effort) × 100 / (actual effort). While the latter provides an indication of the typical fit error present in the model, the former measures the goodness of fit of the regression model, and may be interpreted as the proportion of response variation explained by the regressors in the model.

To ensure trustworthy learning data during the case study, a specific process recommendation was implemented by the project managers to ensure accurate reporting of effort data. This was done by checking each work item in the training data set and interviewing the developer responsible for that work item before the work item was closed in the Rational Team Concert tool.
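A minimal Python sketch of the two goodness-of-fit measures defined above, included for illustration; the numbers reported in this section were produced with TA's own tooling, not with this code.

```python
import numpy as np

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def mre(actual, predicted):
    """Magnitude of Relative Error, in percent, for each work item."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.abs(predicted - actual) * 100.0 / actual

def average_mre(actual, predicted):
    """Average MRE over a set of work items."""
    return float(np.mean(mre(actual, predicted)))
```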

4.2 Estimation Results

Using TA, we were able to extract a total of 100 work items as training data over the previously mentioned 1.5-month period. While fixing defects, the project team primarily developed code in "major" programming languages such as Java or Javascript, with an insignificant number of "minor" files updated – hence we did not consider the latter for our regression analysis. In order to ensure the accuracy of this historical data, we verified the reported effort for each work item with the respective developer and his/her project manager.

Figure 2: Timesheet Visualizer

We ran the regression analysis on the input combinations (DELTALOC), (DELTALOC, 1-EXP) and (DELTALOC, DELTACC, 1-EXP). Table 2 lists the results of our regression analysis for these three combinations. Table 2 also depicts the regression analysis results with and without outliers, which needs explanation. When we analyzed the distribution of MRE values, we realized that a small share of work items contributed a very significant share of the estimation error. Discussions with the project team revealed that some outliers will always exist; for example, a task may require a developer to learn a new framework/library, which takes significantly more time than the actual coding effort, and this will be included in the overall time spent. For the input combination (DELTALOC, 1-EXP), a small subset of work items – 10 out of a total of 100 – contributed about 80% of the overall fit error measured in terms of squared residuals, where the squared residual is calculated as (actual effort − predicted effort)². Project managers in the team told us that they tend to focus more on streamlining the development of the vast majority of work items; hence we decided to rerun the regression analysis after removing such outliers (as has also been recommended elsewhere, e.g., [18]), after which we got much more predictable results, as shown in Table 2.

It is clear from the regression results that the size of the development effort, DELTALOC, was a dominating variable in terms of its impact on effort, with around 54% of the effort variation attributable to it (after outlier removal), as shown by the R² value. While the MRE value may appear high in % terms, this is mainly a result of the fact that many defect work items take only a few hours to fix. When we included (1-EXP) in the formulation, the impact was small in terms of its coefficient but significant in terms of improving the regression results. Since most of the developers were working on defect items associated with development work they had carried out earlier, expertise was high in a large number of cases, and consequently not much of a differentiator in general.

However, we discovered that for certain work items where expertise was rated low by our model, the high development effort required (relative to the size of the work) was estimated much better, leading to the improved results.

Finally, we included the complexity factor DELTACC in our experiments. Interestingly, this led to a slight deterioration in the regression results relative to (DELTALOC, 1-EXP), although they were better than those obtained for DELTALOC alone. There could be a number of explanations for this, including the possibility that the metric we used – the cyclomatic complexity of updated methods – is too generic and does not accurately capture the actual dimensions of task complexity in a project. We will experiment with other complexity metrics going forward. However, our hypothesis, based on interviews with developers on the topic of file familiarity, is that in software development, code complexity and familiarity are closely related; as developers become familiar with a file, its perceived complexity comes down (even though absolute complexity measures remain the same). When a developer is fixing defects in a file he/she is familiar with, its actual complexity measure is no longer an effort determinant of significance; in fact, including it (along with expertise) to estimate effort may even add noise in some data sets, as seems to be the case in our experiments. On the other hand, for new feature development that leads to the creation of significant new code that a developer has no prior familiarity with, metrics such as cyclomatic complexity, number of libraries used, etc., may become quite relevant and help in post-facto explanation of the effort spent. We will test this hypothesis in our next round of experiments with TA, where we will have access to an IBM project in which significant new features will be incorporated in a product.

To validate the model, we collected data for an additional 49 work items over the 1-month validation period. The average MRE we obtained was 63% with outliers.

Table 2: Results of Linear Regression on Training Data

Effort Equation                                                | With Outliers                  | Without Outliers
Effort = 1.09 · DELTALOC^0.41                                  | R² = 0.26, Average MRE = 72.8% | R² = 0.54, Average MRE = 60.2%
Effort = 1.2 · DELTALOC^0.42 · (1 − EXP)^0.06                  | R² = 0.34, Average MRE = 68.5% | R² = 0.62, Average MRE = 54.5%
Effort = 1.18 · DELTACC^0.01 · DELTALOC^0.42 · (1 − EXP)^0.06  | R² = 0.34, Average MRE = 69.6% | R² = 0.57, Average MRE = 55.3%

Figure 3: Frequency Distribution of Squared Residuals for Training Data using (DELTALOC, 1-EXP)

Figure 4: % Deviation of Actual vs. Predicted Effort for Validation Data using (DELTALOC, 1-EXP)

Figure 5: Developer Feedback on Timesheet Assistant

Since the purpose of validation is to determine the estimation accuracy of the generated regression model, we used MRE values to remove the top 10% of outliers. After this removal, the average MRE value dropped to 37.1%, which is even better than the average MRE obtained on the training set. The relevant box plots of MRE values – with and without outliers – are displayed in Figure 4. Overall, the regression analysis results are promising, particularly in the absence of testing-related information, which was a clear gap, since every work item went through a testing phase, with some work items requiring many more test cases than others. At the same time, there are several ways in which our estimation model can be further improved, some of which are discussed in Section 5.
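The MRE-based trimming used for the validation set above can be sketched as follows (illustrative Python; for the training set the outliers were instead the small set of work items that dominated the squared residuals).

```python
import numpy as np

def trim_top_mre_outliers(actual, predicted, fraction=0.10):
    """Drop the `fraction` of work items with the largest MRE and report the
    average MRE over the remaining items (cf. the top-10% removal above)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    mre = np.abs(predicted - actual) * 100.0 / actual
    cutoff = np.quantile(mre, 1.0 - fraction)
    keep = mre <= cutoff
    return actual[keep], predicted[keep], float(np.mean(mre[keep]))
```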

4.3 Qualitative Feedback

We conducted semi-structured interviews [19] with the ten developers and the four project managers from the project team that participated in this case study. Each interview consisted of two parts: specific objective-type questions with pre-specified answer categories, and requests for qualitative feedback on TA. Figure 5 summarizes the responses against the pre-specified answer categories (details of the interviews are not presented here due to lack of space). A few key findings can be summarized from the interviews:

• It is interesting to observe that the developers interviewed seldom update the original estimated effort for tasks, even though the actual effort differs frequently (Figure 5). One of the managers also confirmed this and pointed out how it stands in the way of improved estimation for future tasks.

• Given this, both developers and managers felt that by linking development activities to timesheets, TA will help justify actual effort spent and promote greater transparency. Managers also felt that a tool like TA can help improve their own productivity by cutting down on project tracking effort, while the estimated effort value will at least provide a benchmark against which they can validate developer-reported effort.

• Two developers who did not find TA useful said that they were not sure how accurate the task-level effort prediction model would be in practice, and hence whether their reporting burden would actually be reduced. Indeed, there are limitations in our current approach that we discuss in Section 5, along with how some of them may be addressed.

• Some of the developers suggested that TA should be able to calculate and "pop up" the estimated effort as soon as a developer closes a work item, since this would act as a useful reference while recording effort. Currently, this calculation is done offline and sent directly to the timesheet report.

• Interestingly, while we internally used the metrics of R² and MRE for measuring the goodness of estimation, project managers suggested that we report these values through TA, as they may be helpful in identifying projects that show low predictability, so that appropriate process improvements can be initiated to drive down variation and make application services delivery more predictable.

• Finally, it was also suggested that we integrate TA with requirements management and design tools so that the approach may be broadened to support timesheet assistance for these activities.

5. DISCUSSION

In this section, we discuss some of the limitations of the TA approach and possible ways to address them. One of the threats to the validity of our approach comes from the focus on development size (lines of code) as the key effort determinant, and its possible impact on developer behavior. Expected size is central to all estimation approaches, but since we are doing post-facto analysis there is the danger that some redundant changes are deliberately made to increase development size and claim or justify higher effort than should have been the case. Currently, we do not perform any semantic analysis of code as part of our overall approach, and rely instead on process-level monitoring, e.g., code reviews, to discourage such practices. Since we link all changes made to the timesheet report, it becomes easier for reviewers to check the changes made and detect if something is amiss. Also, sometimes a trivial change made for valid reasons (e.g., renaming a widely used variable) can lead to many lines of code change; hence, at least some lightweight semantic analysis needs to be built into TA to ensure such changes do not lead to gross over-estimates of effort.

A related issue is whether such a tool will discourage developers from writing optimized code (which may also take more effort). In fact, developers may have different coding styles, with more experienced developers producing higher quality (with fewer defects) and more efficient code than their junior colleagues. Clustering developers by their experience levels and building models for each cluster may be one way to address this. Also, cloning of code can significantly boost productivity. This could be a significant factor for effort prediction of enhancements, especially when user interface code is involved, where reuse is very common. There is a rich literature on code clone detection [17], and we are currently reviewing how such techniques can be incorporated into our overall solution design.

There are at least two aspects of file familiarity which the expertise model discussed in Section 3.1 does not cover. First, we need to factor in the decay in a developer's familiarity with a file when no updates have been made by him/her over a significant period of time; this decay will be more pronounced when other updates have been made in the meantime. Second, we only consider file familiarity obtained by making code updates. What we found out during our case study is that developers also become familiar by simply reading code submitted by team members if it is pertinent to their own work, or through formal code reviews. In a non-intrusive approach like ours, it is difficult to accurately track this aspect of familiarity; possible solutions may include a process recommendation in IDEs such as Rational Team Concert whereby reviewers are associated with each work item or change set. This could be a factor in identifying the familiarity of a developer with a given file.

Apart from the actual act of writing code for an assigned task, developers can spend a significant amount of time on discussions with team members to get a better understanding of the problem, discuss solutions, and so on.

Modern IDEs provide collaborative features to hold such discussions in the context of work items, so persisted discussions can provide at least some indication of whether a task required significant brainstorming or not. However, while distributed teams are often heavy users of these collaborative features, collocated teams (like the one in our case study) often prefer face-to-face discussions, which a tool like TA will be unable to track.

Reusability is another aspect not considered within the current TA approach. It is well understood that there will be a cost (higher effort) associated with writing a generic and reusable framework or library. Code complexity metrics indicating reusability (such as fan-in) cannot be measured at the time the code is written, as these metrics evolve over a period of time. Hence, to identify a task that requires reusable components to be written, we may need to trace back to its associated design space and analyze class or collaboration diagrams to measure the reusability of the components and allocate additional time if needed.

6. RELATED WORK

Time Tracking Solutions: A number of commercial solutions (e.g., Actitime [1], Baralga [2], Tasktop [8]) help office workers manually record their effort for various activities, and also link that information with other corporate tools such as ERP/CRM and other project management tools. However, unlike TA, none of these solutions provides a means to automatically extract the actual quantum of work performed by a software developer and estimate the effort thereof via a statistical technique. In comparison, oDesk [5] actually measures the effort taken per task by monitoring keystrokes on a computer. A similar solution is also provided by the Eclipse environment, which can keep track of the times when files are opened, edited and closed. However, the intrusive nature of these approaches can limit their acceptance in practice.

Mining Change Data: In general, our TA solution is inspired by the rapidly emerging area of Development Intelligence [16], which is the application of Business Intelligence ideas to software development projects. Prominent among these ideas is how to effectively mine developer activities and change information from software repositories. In [14], the authors present a taxonomy of approaches for mining source code repositories for extracting developer activities and change data, such as extracting via CVS annotations, data mining, heuristics and differencing. In [22], Zimmermann describes a CVS plugin called APFEL that collects and stores fine-grained changes from version archives in a database. By searching for specific tokens such as method declarations, variable usages, exceptions, method calls and throw/catch statements, APFEL determines changes to files in terms of these tokens. On similar lines, in [23], the authors present a technique called the "annotation graph" that identifies line changes in a file across several versions. The work in [11] presents an algorithm that compares the sets of lines added and deleted in a change set, using a combination of information retrieval techniques. We will investigate such approaches for code differencing in TA.

Software Effort Estimation: Several parametric software estimation models have been proposed over the years (e.g., PROBE [6], SLIM [7], COCOMO [9]) and empirically calibrated using actual project data, with multiple regression being one of the commonly used techniques.

These models are useful for "estimation in the large", while for estimating effort for timesheet tasks we had to design models at a much lower level of granularity. In addition, our approach has the benefit of analyzing actual development work to estimate factors such as complexity and developer expertise. Along similar lines, [13] presents a multivariate regression model for estimating effort for a modification request based on the following factors: the nature of the work (e.g., defect or new feature development), the size of the change, the developer making the change, and the date the change was initiated (to account for a code decay factor). While different developers were found to expend varying levels of effort for comparable work, the reasons behind this were not investigated, and in particular, the impact of file familiarity in fixing defects was not considered. Moreover, our focus is on providing an end-to-end automation framework that applies repository mining, statistical analysis and data summarization techniques to address some practical challenges in timesheet reporting and validation; this also differentiates it from other research efforts where the primary motivation has been offline analysis of historical data to build explanatory models. On a different note, [21] presents an approach for calculating defect-fixing effort by extracting the effort for the "nearest neighbors" of the defect in question (based on a similarity threshold using a text similarity engine). It would be interesting to apply text similarity (e.g., on defect work item descriptions) to see if our TA estimation model may be improved.

7. CONCLUSIONS AND FUTURE WORK

When we set out to develop Timesheet Assistant, our aim was to determine the usefulness of our proposed "estimation in the small" approach in support of the filling and reviewing of timesheets. We feel that the case study reported in our paper does point towards the general feasibility of this approach, especially for maintenance projects. The effort fit of the training set, as well as the validation results, are definitely encouraging, and demonstrate the usefulness of expertise, along with development size, as key effort indicators. The qualitative feedback received from our semi-structured interviews is also quite positive, with the idea of TA being welcomed not only by project managers, but also by developers. We believe that this is due to the non-intrusive and overall helpful nature of TA. At the same time, we discussed some limitations of our current approach and ways in which the effort model may be enriched with additional parameters from the development environment.

Our future work will be along the following directions. First, we are initiating a case study of TA in a project that involves significant development of new features; it would be interesting to find out to what extent our current model would need to be tailored to account for effort in new development tasks. Second, continuous re-calibration of our prediction model has not yet been tested, for which we will be investigating techniques such as those reported in [20]. While outliers are currently eliminated in our model, using them to create and maintain separate regression models for outlier work items (cf. [18]) could also be useful. Finally, so far we have used linear regression for effort prediction since we found it the simplest to use. Perhaps other techniques such as Bayesian analysis [12] or decision tree analysis [10] would provide better results, and we would like to try them out.

8. REFERENCES

[1] Actitime. http://www.actitime.com.
[2] Baralga. http://baralga.origo.ethz.ch.
[3] FreeTime. http://www.zoo2.com.au.
[4] GNU Octave. http://www.gnu.org/software/octave/.
[5] oDesk. http://www.odesk.com.
[6] PROxy Based Estimation (PROBE) for Structured Query Language (SQL). http://www.sei.cmu.edu/reports/06tn017.pdf.
[7] QSM. http://www.qsm.com/tools/slim-estimate/index.html.
[8] Tasktop. http://www.tasktop.com.
[9] B. W. Boehm. Software Cost Estimation with COCOMO II. Prentice-Hall, Inc., 2000.
[10] L. Breiman, J. Friedman, C. J. Stone, and R. Olshen. Classification and Regression Trees. Chapman and Hall/CRC, 1st edition, 1984.
[11] G. Canfora, L. Cerulo, and M. D. Penta. Tracking your changes: A language-independent approach. IEEE Software, 26(1):50–57, 2009.
[12] S. Chulani. Bayesian analysis of software cost and quality models. In ICSM, pages 565–, 2001.
[13] T. L. Graves and A. Mockus. Inferring change effort from configuration management databases. In IEEE METRICS, pages 267–, 1998.
[14] H. H. Kagdi, M. L. Collard, and J. I. Maletic. Towards a taxonomy of approaches for mining of source code repositories. In MSR, 2005.
[15] A. Mockus and J. D. Herbsleb. Expertise browser: a quantitative approach to identifying expertise. In ICSE, pages 503–512, 2002.
[16] B. Rotibi. Development Intelligence: Business Intelligence for Software Development. http://www.borland.com/resources/en/pdf/solutions/lqmovum-developmental-intellligence.pdf.
[17] C. K. Roy, J. R. Cordy, and R. Koschke. Comparison and evaluation of code clone detection techniques and tools: A qualitative approach. Sci. Comput. Program., 74(7):470–495, 2009.
[18] Y.-S. Seo, K.-A. Yoon, and D.-H. Bae. An empirical analysis of software effort estimation with outlier elimination. In PROMISE '08, pages 25–32, New York, NY, USA, 2008.
[19] T. R. Lindlof and B. C. Taylor. Qualitative Communication Research Methods. SAGE, Inc., 2002.
[20] A. Trendowicz, J. Heidrich, J. Münch, Y. Ishigai, K. Yokoyama, and N. Kikuchi. Development of a hybrid cost estimation model in an iterative manner. In ICSE, pages 331–340, 2006.
[21] C. Weiß, R. Premraj, T. Zimmermann, and A. Zeller. How long will it take to fix this bug? In MSR, page 1, 2007.
[22] T. Zimmermann. Fine-grained processing of CVS archives with APFEL. In ETX, pages 16–20, 2006.
[23] T. Zimmermann, S. Kim, A. Zeller, and E. J. W. Jr. Mining version archives for co-changed lines. In MSR, pages 72–75, 2006.