Initial Evaluation of Data Quality in a TSP Software Engineering Project Data Repository

Yasutaka Shirai
Toshiba Corporation, Software Engineering Institute, 4500 Fifth Avenue, Pittsburgh, PA 15213, USA
[email protected]

William Nichols
Carnegie Mellon University, Software Engineering Institute, 4500 Fifth Avenue, Pittsburgh, PA 15213, USA
[email protected]

Mark Kasunic
Carnegie Mellon University, Software Engineering Institute, 4500 Fifth Avenue, Pittsburgh, PA 15213, USA
[email protected]

ABSTRACT
To meet critical business challenges, software development teams need data to effectively manage product quality, cost, and schedule. The Team Software Process (TSP) provides a framework that teams use to collect software process data in real time, using a defined, disciplined process. These data hold promise for use in software engineering research. We combined data from 109 industrial projects into a database to support performance benchmarking and model development. But are the data of sufficient quality to draw conclusions? We applied various tests and techniques to identify data anomalies that affect the quality of the data in several dimensions. In this paper, we report some initial results of our analysis, describing the amount and the rates of identified anomalies and suspect data, including incorrectness, inconsistency, and credibility. To illustrate the types of data available for analysis, we provide three examples. The preliminary results of this empirical study suggest that some aspects of the data quality are good and the data are generally credible, but size data are often missing.

Categories and Subject Descriptors
D.2.8 [Software Engineering]: Metrics – Process metrics, Process performance

General Terms
Management, Measurement, Performance, Reliability, Verification

Keywords
TSP, Team Software Process, Database, Data Quality

1. INTRODUCTION
Software development projects require data to inform effective product, quality, cost, and schedule management. The Team Software Process (TSP) addresses these critical business needs by applying software development process data collected by software developers as they perform their work [9] [10]. The TSP couples the use of integrated product teams with the Personal Software Process (PSP) to give engineers a defined, planned, and measured process [8] that increases their ability to succeed. The TSP and PSP provide a data-driven feedback framework for collecting and analyzing an individual engineer's process data to help that engineer track work progress, make better plans, and improve performance.

The Software Engineering Institute (SEI) at Carnegie Mellon University has collected data from organizations that have adopted TSP and has begun to analyze the data to contribute to empirical software engineering research. Other software-related data sets are available to researchers; well-known examples include those from ISBSG (the International Software Benchmarking Standards Group) and the PROMISE Data Repository. The ISBSG provides datasets from more than 5,600 software development projects and publishes data analysis reports [11]. The PROMISE Data Repository mainly stores product metrics of open source software and NASA projects [17]. Compared to other data sets, the TSP database provides a granular level of development detail that enables not only analysis with great precision at different levels (e.g., project, team, component, and individual developer) but also additional opportunity to assess data quality. We hope to make the TSP database more broadly available in the future to supplement the other research resources.

Data quality is important for data collection and analysis. Analysis with data of poor quality can lead to inaccurate results that mislead the community. Nevertheless, in empirical software engineering there is little published literature on the subject of data quality [14]. In our work, we focused on identifying measures, techniques, and tools to ensure the quality of our process data (e.g., time measures, defect measures, and size measures) before using the database for further analysis. In particular, we intended to address the following questions:

Q1) How much data are available in the database?
Q2) How much of the data are obviously incorrect?
Q3) Are the data internally consistent?
Q4) How credible are the data?


This paper describes the database and our efforts to study the quality of its data. Section 2 introduces background and related work. The schema of the TSP database is described in Section 3. Section 4 provides an overview of the procedure for verifying data quality and results. Based on the information in Section 4, Section 5 provides some examples of descriptive data that support analyses for three of the measurement dimensions (project, component, and individual). We summarize our conclusions in Section 6.

2. BACKGROUND
In our work, we focused on the use of the Software Process Dashboard to verify data quality in the TSP database. This section discusses background on the TSP and related work on data quality.

2.1 Using Process Data
Many software development organizations have applied the TSP to help engineers do quality work and to acquire a powerful framework for data collection and analysis. Davis and Mullaney [3] [4] described the performance of TSP projects by analyzing effort, schedule, quality, and size data. Sasao, Nichols, and McCurley [19] reported the results of validating the quality of effort data. Nichols, Kasunic, and Chick [16] documented that the SEI has started to deploy the TSP Performance and Capability Evaluation (TSP-PACE) to assess the performance of software development organizations using data collected during TSP projects. In this paper, we provide the schema of the database, the basis of process data analysis for empirical software engineering, and initial results on data quality from records of the four primary TSP measures.

2.2 Data Quality
While empirical software engineering researchers have paid attention to data quality, systematic reviews by Liebchen and Shepperd [14] [15] [20] and Bosu and MacDonell [1] [2] indicate that few researchers have explored data quality and that more research is needed in this area. If you are using data to make judgments, track projects, or draw conclusions, the data must be of known quality, and preferably of high quality. De Veaux and Hand [5] indicate that 60-95% of the total effort in data analysis work is spent on data cleaning. ISO/IEC 25012 [12] defines a general data quality model. Both show clearly that data quality is important for research. In 1998, Johnson and Disney [13] [6] examined the data recording process of the PSP for 89 exercises completed by ten participants. The data in their study were manually recorded and then entered into a database that automatically calculated various measures. They discovered 1,539 primary errors in this process and determined that 46% of the errors were caused by incorrect calculations. They suggested that PSP users should employ better tool support, which might record process data automatically. Data quality is also important for the TSP database. All projects included in this database use the Software Process Dashboard to collect individual metrics in a standard fashion [21]. However, even while using the tool, some items must still be entered manually. The dashboard also allows developers to overwrite data, letting them enter or modify data after the fact; therefore, it is necessary to verify the data quality in the TSP database.

2.3 TSP Database
Many organizations using the TSP collect data with an open source tool, the Software Process Dashboard. A companion tool, the Team Process Data Warehouse application, stores TSP project data in a relational database accessed via SQL [21]. The TSP database described here uses the Team Process Data Warehouse and, therefore, its schema. The database currently includes historic project data from 34 teams located in the U.S., France, Mexico, and South Africa, enabling studies of individual and team metrics at project, organization, and world-wide scales. Integrated metrics hold promise to provide new insights and value for every stakeholder. The organizations running these projects provided the project data to the SEI for research purposes. Our efforts focused on data collected by projects that launched after 2009 and used an automatic data recording tool. The projects were mostly small to medium, with a median duration of 46 days and a maximum duration of 90 days. Median team size was 7 people, with a maximum of 40. Data included time logs, defect logs, added and modified size logs, and WBS logs.

3. APPROACH
We show results for four dimensions of data quality [12]: the amount of data and the rates of anomalies, including obvious incorrectness, inconsistency, and low credibility, for the four primary tables in the TSP database.

3.1 Database Schema
Figure 1 depicts the high-level schema of the TSP database. It uses two types of tables: dimension and fact. Dimension tables include contextual information, such as process phase and identification keys for projects, teams, and individuals. To protect privacy, personal information is anonymized before it is stored in the database. Software development process data are recorded in fact tables. The TSP database includes a time log, defect log, size log, Work Breakdown Structure (WBS) log, and "plan item" as fact tables.

Figure 1. High-level schema of the TSP database (dimension tables: Person, Team, Project, Phase, Data Block; fact tables: Time log, Defect log, Size log, Plan item, WBS log)

Time log records include work start time, work end time, delta work time, and interruption time. Software engineers are often interrupted by meetings, requests for technical help, reporting, and so forth. These events are recorded, in minutes, as interruption time. Delta time is the time between the start and end of a work session, minus any interruption time. The defect log table has five key data fields for quality analysis: defect fix count, defect fix time, defect type, defect injected phase, and defect removed phase. A defect entry may include multiple defects; engineers record the number of related defects they fix in the defect fix count field. Defect fix time is the number of minutes required to fix the defects in that record. A defect's type is determined by the standard used by the project or organization. The final two fields, defect injected phase and defect removed phase, indicate when in the process a defect was introduced and then fixed. Knowledge of the number of defects injected and removed in particular phases is essential for conducting effective root cause analysis. Size is used to normalize other metrics. The size log stores planned and actual product size information divided into total size, added and modified size, base size, deleted size, and reused size. The size log includes measures for lines of code and various other measures, including document pages, number of requirements, and number of test cases. The Work Breakdown Structure (WBS) log stores the plan items. A typical project plan contains both the WBS elements and the tasks organized under those elements. The WBS element dimension describes only the part of the hierarchy above the task level; the hierarchical component information, recorded as the full path of a WBS element that can be split into its parts, is stored in the WBS log. A plan item, which has a phase key, a project key, and a WBS element key, stores task-level information and is used to join the two dimension tables, the phase table and the project table, with the fact tables. All fact tables have a plan item key. The data block table connects the person and team tables to the fact tables through the data block key stored in the time, defect, and size logs. Connecting dimension tables to fact tables makes it possible to analyze the data from many perspectives.
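To make the join structure concrete, the following minimal sketch shows how a fact table can be connected to its dimension tables in the way just described. It is written in Python with pandas, and the table and column names (time_log, plan_item, data_block, and so on) are illustrative assumptions rather than the actual Team Process Data Warehouse schema.

```python
import pandas as pd

# Fact table: one row per time log entry, keyed by plan item and data block (illustrative columns).
time_log = pd.DataFrame({
    "plan_item_key": [1, 2],
    "data_block_key": [10, 11],
    "start_time": pd.to_datetime(["2013-05-01 09:00", "2013-05-01 13:00"]),
    "end_time":   pd.to_datetime(["2013-05-01 10:30", "2013-05-01 14:00"]),
    "delta_minutes": [85, 55],
    "interrupt_minutes": [5, 5],
})

# Dimension tables: a plan item carries phase and project keys; a data block carries person and team keys.
plan_item  = pd.DataFrame({"plan_item_key": [1, 2], "phase_key": [3, 5], "project_key": [7, 7]})
phase      = pd.DataFrame({"phase_key": [3, 5], "phase_name": ["Code", "Unit Test"]})
data_block = pd.DataFrame({"data_block_key": [10, 11], "person_key": [100, 101], "team_key": [20, 20]})

# Connect the time log to its contextual dimensions, as the schema description suggests.
enriched = (time_log
            .merge(plan_item, on="plan_item_key")
            .merge(phase, on="phase_key")
            .merge(data_block, on="data_block_key"))

# Example analysis perspective: total delta time per person per phase.
print(enriched.groupby(["person_key", "phase_name"])["delta_minutes"].sum())
```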

3.2 Tests Applied to the Data
The amount of data, the number of facts stored in a database, is often used to normalize error ratios in data quality work; it also plays an important role in benchmarking. We began by counting the number of projects, project personnel, and records of various types. Correctness requires that every set of stored data represent a real-world situation. We examined the four types of logs to identify obviously incorrect data, including illegal formats, missing data, and wrong values. Consistency is defined as the absence of contradiction between the multiple sources of stored data, so it is useful to examine related logs such as the time log and the defect log. For example, the date that a defect is found in the code inspection phase should never be earlier than the start date of the code inspection phase. We investigated two kinds of consistency between the time log and the defect log. First, we verified the consistency of sequence by checking the timestamps of the time and defect log entries. Sequence consistency requires a valid order of events for plan items and defect recording; we verified that defects found during work on a plan item had time stamps between the start time and the end time recorded for that plan item or task. Credibility is defined as the degree to which the stored data are believable in a specific context of use. Data recorded during or immediately after work are considered to be more credible than data recorded at a later time. Most of the time, data are recorded by the "push button" operations of the Software Process Dashboard, but it is possible to overwrite the data or enter the data at a later time. These later or overwritten data values are not necessarily incorrect, but they are less worthy of trust. Verification of credibility must be defined and performed precisely. One direct test of credibility is to check for overlapping time records; another is to look for questionable work durations. Indirect tests involve distributional properties, such as leading digits, trailing digits, the durations of individual time records, and so forth. Overlapping time does not occur under normal use of the Software Process Dashboard because only one time log recording window can run at a time; the task must be paused or stopped to begin a new record. However, overlapping time can result from overwriting the data or recording the data manually at a later time. For each developer, we counted the number of overlapping records and the total time within those records. We found some overlap in 14,214 records, 13.8% of the total number. The total time in the time log is 3,693,940 minutes, or 61,565.7 hours. To identify questionable work time, we examined the recorded durations to see whether they were consistent with reasonable work durations. First, for a single person, we searched for days on which the summed delta times exceeded 1,360 minutes (nearly the full 1,440 minutes in a day).
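The sketch below illustrates two of these tests, again assuming a pandas DataFrame with illustrative column names (person_key, start_time, end_time, delta_minutes) rather than the actual warehouse schema: detection of overlapping records per developer, and detection of daily delta time totals above the 1,360-minute threshold.

```python
import pandas as pd

def overlapping_records(time_log: pd.DataFrame) -> pd.DataFrame:
    """Flag time log records whose interval overlaps the previous record for the same person."""
    df = time_log.sort_values(["person_key", "start_time"]).copy()
    prev_end = df.groupby("person_key")["end_time"].shift()   # previous record's end time per person
    df["overlaps_previous"] = df["start_time"] < prev_end
    return df[df["overlaps_previous"].fillna(False)]

def implausible_daily_totals(time_log: pd.DataFrame, cutoff_minutes: int = 1360) -> pd.DataFrame:
    """Sum delta time per person per calendar day and flag totals above the cutoff."""
    daily = (time_log
             .assign(day=time_log["start_time"].dt.date)
             .groupby(["person_key", "day"])["delta_minutes"].sum()
             .reset_index(name="daily_minutes"))
    return daily[daily["daily_minutes"] > cutoff_minutes]
```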

4. RESULTS

4.1 Amount of Data
As of January 2014, the TSP database contained data from 109 TSP projects. The projects started between July 2009 and September 2013; they included 34 teams and 309 people. Among the database fact tables, there are 103,023 time log records, 18,408 defect log records, and 7,464 size log records. The amount of data in the primary fact tables is indicated in Table 1, column two.

Table 1. Amount and incorrectness of data in the fact tables

Fact Table    Amount of Data   Illegal Format   Lack of Data   Illegal Value
Time Log      103,023          1                2              0
WBS Log       11,449           0                0              0
Defect Log    18,408           0                442            277
Size Log      7,464            0                2,688          24

4.2 Incorrectness
Table 1 shows the amount of data and the counts of anomalous records in the primary tables of the TSP database. The time log table has only two null data values and one value that was entered in an illegal format. The low incorrectness rate of 0.003% results from automatic rather than manual operation during data recording: two "push button" operations start and stop recording of a time log record. The WBS log has no incorrectness; its data value consists only of a manually entered string naming the WBS element. The defect log contains 719 incorrect records, producing a 3.9% incorrectness rate. Of the 719 incorrect records, 277 had a zero defect fix time. Normally, the defect fix record is initiated by a button operation that increments the count, with the remaining fields to be entered manually. For the records with a defect fix time of zero, it is possible that the actual defect fix time was manually recorded with other defects, making those defect records incorrect as well. Of the 719 incorrect records, 442 have no data entered in the fix defect field. All 442 were injected during the compile or test phases of the project, and they represent 53.4% of defects injected during those phases. Most defects injected in either compile or test should have a value in the fix defect field corresponding to the defect that was being fixed when another defect was injected. When defects are injected in the compile or test phase, the developer must enter many data fields manually; we believe this manual recording led to the high rate of incorrectness (53.4%). In the case of the size log, 53% of actual size data are missing.
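A hedged sketch of how such incorrectness counts can be computed is shown below; the column names (fix_defect_ref, fix_time_minutes, actual_added_modified_size) are illustrative assumptions, and the real field names may differ.

```python
import pandas as pd

def defect_log_anomalies(defect_log: pd.DataFrame) -> dict:
    """Count the two defect log anomaly types discussed above and the combined rate."""
    missing_fix_ref = int(defect_log["fix_defect_ref"].isna().sum())      # "lack of data"
    zero_fix_time = int((defect_log["fix_time_minutes"] == 0).sum())      # "illegal value"
    total = len(defect_log)
    return {
        "missing_fix_ref": missing_fix_ref,
        "zero_fix_time": zero_fix_time,
        "incorrectness_rate": (missing_fix_ref + zero_fix_time) / total,
    }

def missing_actual_size_rate(size_log: pd.DataFrame) -> float:
    """Fraction of size log records with no actual added-and-modified size."""
    return size_log["actual_added_modified_size"].isna().mean()
```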

4.3 Consistency
We extracted 33,877 records by connecting the time log and defect log by plan item and found that 776 of the 33,877 (2.3%) had a timestamp inconsistency. Among those inconsistent data, 170 records had the defect recorded date before the start date of the task, and 618 records had the defect found date after the end date of the task. Next we examined the consistency between the effort applied to a task and the defect fix effort recorded for that task. To verify time and defect effort consistency, we checked that the effort logged for the task was greater than the effort recorded for finding and fixing the defects identified in that task. Defects found in a task are normally fixed during the same task, so the sum of defect fix times should be shorter than the task effort time. Of the 18,408 defect records, 121 records (0.66%) had an inconsistency between the total phase effort and the phase defect fix effort.
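The following sketch shows one way to implement both consistency checks, under the same illustrative column-name assumptions used earlier: defects are joined to their task's time log entries on the plan item key, timestamps are tested against the task interval, and summed fix times are compared with summed task effort.

```python
import pandas as pd

def timestamp_inconsistencies(joined: pd.DataFrame) -> pd.DataFrame:
    """Defect found-date should fall between the task's start and end timestamps."""
    before_start = joined["defect_found"] < joined["task_start"]
    after_end = joined["defect_found"] > joined["task_end"]
    return joined[before_start | after_end]

def effort_inconsistencies(defect_log: pd.DataFrame, time_log: pd.DataFrame) -> pd.DataFrame:
    """Sum of defect fix times for a plan item should not exceed the effort logged for it."""
    fix_totals = defect_log.groupby("plan_item_key")["fix_time_minutes"].sum()
    task_totals = time_log.groupby("plan_item_key")["delta_minutes"].sum()
    merged = pd.concat([fix_totals, task_totals], axis=1,
                       keys=["fix_minutes", "task_minutes"]).fillna(0)
    return merged[merged["fix_minutes"] > merged["task_minutes"]]
```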

4.4 Credibility
We found that nine people out of 309 (2.9%) had a summed delta time of over 1,360 minutes in a day. We also found 389 time log records whose duration exceeded an arbitrary but plausible cutoff of 360 continuous minutes; this is equivalent to 0.38% of the total number of records.
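For completeness, a minimal sketch of the per-record duration check follows; as before, the delta_minutes column name is an assumption.

```python
import pandas as pd

def long_session_records(time_log: pd.DataFrame, cutoff_minutes: int = 360) -> pd.DataFrame:
    """Flag individual time log records longer than a continuous-work cutoff and report the rate."""
    flagged = time_log[time_log["delta_minutes"] > cutoff_minutes]
    rate = 100.0 * len(flagged) / len(time_log)
    print(f"{len(flagged)} of {len(time_log)} records ({rate:.2f}%) exceed {cutoff_minutes} minutes")
    return flagged
```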

5. DISCUSSION
In this section we discuss our results and include some examples of data extracted from this database to consider how the quality aspects we have identified might be used.

Q1) How much data are available in the database?
The counts of projects and people are straightforward, but detailed demographics are not yet readily retrievable; this remains for future work. The largest volume of records is for time. Figure 2 shows a scatterplot of the planned time vs. the actual time recorded for items from the Work Breakdown Structures. As we segment the data by project or domain, the amount of data becomes a key consideration in the statistical analysis.

Figure 2. Planned time vs. actual time for WBS elements (planned and actual time in minutes)

Q2) How much of the data are obviously incorrect?
The time logs had the lowest incorrectness rate, followed by defect logs. The very low incorrectness rate on the time log suggests we should be more concerned with consistency and credibility than with incorrectness per se. The 3.9% incorrectness rate within the defect logs suggests that several potential issues must be considered before use. The most common problem was missing defect fix values. This issue would affect studies of secondary defects, or "breakage." The zero values for defect fix times may indicate missing values or cases where the time was mistakenly assigned to a different defect; in either case, the zero value is an instance known to be incorrect. Calculations of average defect fix times or fix time distributions must consider these known sources of data quality issues. By contrast, a large volume of size data was missing: 53%, dominated by values of actual size from projects. We believe that the lower rate of incorrectness in the time logs occurred because data entry is simpler. The timer is initiated with a single button click, so task time is automatically managed by the tool. However, the defect and size logs contain data that were manually entered, and the individual entering the data must remember to enter size data when development of the software component has been completed. Therefore, we believe that the higher error rates are associated with these manual methods of data entry. Data collection tools that are incorporated within an integrated development environment (IDE) or a revision control system might be helpful to address this issue of missing data due to manual entry of defect and size data.

The frequency of defect density in test by project, shown in Figure 3, illustrates some of the risks introduced by missing or incorrect data. A Weibull distribution, which is commonly used in reliability engineering [18] and found to fit the Alberg diagram of defects [22], is superimposed upon the data. The value for defects/KLOC is, of course, sensitive to zero values in the denominator (software size as measured by actual added and modified code). There is a risk that missing size values could bias the distribution results.

Figure 3. Defect density values in the testing phase (defects/KLOC), with a Weibull overlay (shape 0.9392, scale 4.952, N = 53)
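The sketch below shows one way to reproduce this style of analysis: compute defect density only for projects with a non-missing, non-zero actual size and fit a two-parameter Weibull distribution to the result. The data values and column handling are illustrative assumptions; scipy's weibull_min is used for the fit.

```python
import numpy as np
from scipy import stats

def test_defect_density(projects):
    """projects: iterable of (defects_in_test, actual_added_modified_loc) pairs."""
    densities = [1000.0 * d / loc                   # defects per KLOC
                 for d, loc in projects
                 if loc and not np.isnan(loc)]      # drop missing or zero size to avoid a biased denominator
    return np.array(densities)

# Toy data standing in for per-project counts; the last project has a missing size and is excluded.
densities = test_defect_density([(12, 4800), (3, 2100), (25, 9500), (7, float("nan"))])
shape, loc, scale = stats.weibull_min.fit(densities, floc=0)   # two-parameter Weibull (location fixed at 0)
print(f"shape={shape:.3f}, scale={scale:.3f}, n={len(densities)}")
```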

Q3) Are the data internally consistent?

The inconsistency rate between the defect and time logs with respect to time stamps is modest, 2.3%, while the duration inconsistency rate was small, 0.66%. This suggests that some of the data were not entered in real time, leading to questions of correctness and credibility. Inconsistency of the time stamps does not necessarily imply incorrect data in all cases; however, these instances are somewhat less credible. Inconsistency between the time log and the defect log implies that one entry or the other is incorrect. The inconsistencies must be addressed if the entries are to be included in any analysis derived from those values. Future work will continue to investigate data entry mechanisms that prevent the data entry issues that lead to inconsistency.

Q4) How credible are the data?
To estimate one aspect of data credibility, we examined the data for overlapping times and implausibly long work times. Work recorded as it is being performed is generally considered to be more trustworthy than data entered after the fact. In this data set we estimate that approximately 86% of the time data appear to have been recorded in real time as the work was performed, and 3.3% of the time log records contained implausible durations. Low rates of implausible durations suggest that the entered data may still be reasonable; implausible times might represent aggregated effort that may be correct for a given purpose. This aspect requires more study. One potential approach is to use distributions of work session effort; see, for example, Figure 4. We have reason to believe that work session duration follows an approximately log-normal distribution. The spikes occurring at 30-minute intervals suggest data that were estimated and entered at a later time rather than as the work was performed. Future work will quantify this indirect measure of reliability. We will also examine distributional properties such as leading and trailing digits of work session duration to estimate values that were guessed or grossly rounded.

Figure 4. Frequency of work time data of less than 500 minutes, with a log-normal distribution overlay (loc 2.860, scale 1.224, N = 101,975)
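As an illustration of the indirect credibility tests discussed under Q4, the sketch below estimates the log-normal parameters of work session durations and the share of sessions recorded as exact multiples of 30 minutes, a symptom of rounded, after-the-fact entry. The durations input is an illustrative assumption.

```python
import numpy as np

def lognormal_params(durations_minutes):
    """Return (loc, scale) of the log of the durations, matching the Figure 4 overlay convention."""
    logs = np.log(np.asarray(durations_minutes, dtype=float))
    return logs.mean(), logs.std(ddof=1)

def rounding_spike_share(durations_minutes, step=30):
    """Fraction of sessions whose recorded duration is an exact multiple of `step` minutes."""
    d = np.asarray(durations_minutes)
    return float(np.mean(d % step == 0))

durations = [12, 30, 47, 60, 90, 23, 120, 35]   # toy data standing in for time log delta times
print(lognormal_params(durations))
print(rounding_spike_share(durations))
```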

6. CONCLUSION
In this paper, we described our empirical study of data quality in the TSP database. Our examination, based on four initial questions, suggests that we can quantify several aspects of the quality of these data and therefore make judgments about the value of using them in software engineering research. However, much work remains. First, we intend to apply additional tests of data credibility by examining subsets of the data against expected distributions. More significantly, we have yet to include the contextual data needed to segment projects by industry, technology, and so forth. The database is also growing, and we will continue to maintain both the database and the quality analysis. Finally, we plan comparative studies of data quality with other databases, such as ISBSG and PROMISE.

7. ACKNOWLEDGMENTS
We are grateful to the reviewers for their valuable and constructive comments and to the TSP partners for submitting project data to the SEI to support our research. This research is being conducted at the Software Engineering Institute.

This material is based upon work funded and supported by cost recovery from TSP partner fees under Contract No. FA8721-05-C-0003 with Carnegie Mellon University for the operation of the Software Engineering Institute, a federally funded research and development center sponsored by the United States Department of Defense.

This material has been approved for public release and unlimited distribution. Team Software Process and TSP are service marks of Carnegie Mellon University. DM-0001057

8. REFERENCES
[1] Bosu, M.F. and MacDonell, S.G. Data Quality in Empirical Software Engineering: A Targeted Review. In 17th International Conference on Evaluation and Assessment in Software Engineering (Porto de Galinhas, Brazil, 2013), ACM, 171-176.
[2] Bosu, M.F. and MacDonell, S.G. Data Quality Challenges in Empirical Software Engineering: An Evidence-Based Solution. In 22nd Australasian Conference on Software Engineering (Melbourne, Australia, 2013), IEEE.
[3] Davis, N. and Mullaney, J. The Team Software Process (TSP) in Practice: A Summary of Recent Results. CMU/SEI-2003-TR-014, Software Engineering Institute, Carnegie Mellon University, Pittsburgh, 2003.
[4] Davis, N., Mullaney, J., and Carrington, D. Using Measurement Data in a TSP Project. In EuroSPI 2004 (Trondheim, Norway, 2004), Springer, 91-101.
[5] De Veaux, R.D. and Hand, D.J. How to lie with bad data. Statistical Science, 20 (3), 231-238.
[6] Disney, A.M. and Johnson, P.M. Investigating Data Quality Problems in the PSP. In Proc. ACM SIGSOFT Symposium on the Foundations of Software Engineering (Lake Buena Vista, FL, 1998), ACM, 143-152.
[7] Gokhale, S. and Mullen, R. A multiplicative model of software defect repair times. Empirical Software Engineering, 15 (3), 296-319.
[8] Humphrey, W.S. Using a defined and measured Personal Software Process. IEEE Software, 13 (3), 77-88.
[9] Humphrey, W.S. Introduction to the Team Software Process. Addison-Wesley Professional, Reading, MA, 1999.
[10] Humphrey, W.S. The Team Software Process. CMU/SEI-2000-TR-023, Software Engineering Institute, Carnegie Mellon University, Pittsburgh, 2000.
[11] International Software Benchmarking Standards Group (ISBSG). http://www.isbsg.org/
[12] ISO/IEC 25012 Software Product Quality Requirements and Evaluation (SQuaRE) - Data Quality Model. ISO/IEC, 2008.
[13] Johnson, P.M. and Disney, A.M. A Critical Analysis of PSP Data Quality: Results from a Case Study. Empirical Software Engineering, 4 (4), 317-349.
[14] Liebchen, G.A. and Shepperd, M. Data Sets and Data Quality in Software Engineering. In 4th Workshop on Predictive Models in Software Engineering (PROMISE 2008) (Leipzig, Germany, 2008), ACM, 39-44.
[15] Liebchen, G. Data Cleaning Techniques for Software Engineering Data Sets. Ph.D. Dissertation, Brunel University, London, 2011.
[16] Nichols, W., Kasunic, M., and Chick, T.A. TSP Performance and Capability Evaluation (PACE): Customer Guide. CMU/SEI-2013-SR-031, Software Engineering Institute, Carnegie Mellon University, Pittsburgh, 2013.
[17] PROMISE data repository. Retrieved from https://code.google.com/p/promisedata/
[18] Ramakumar, R. Engineering Reliability: Fundamentals and Applications. Prentice Hall, Englewood Cliffs, NJ, 1993.
[19] Sasao, S., Nichols, W., and McCurley, J. Using TSP Data to Evaluate Your Project Performance. CMU/SEI-2010-TR-038, Software Engineering Institute, Carnegie Mellon University, Pittsburgh, 2010.
[20] Shepperd, M. Data Quality: Cinderella at the Software Metrics Ball? In 2nd International Workshop on Emerging Trends in Software Metrics (Honolulu, HI, 2011), ACM, 1-4.
[21] Software Process Dashboard Initiative. Retrieved February 2014 from http://www.processdash.com/
[22] Zhang, H. On the Distribution of Software Faults. IEEE Transactions on Software Engineering, 34 (2), 301-302, 2008.