Discovering Dynamic Developer Relationships from Software Version Histories by Time Series Segmentation

Harvey Siy§, Parvathi Chundi∗§, Daniel J. Rosenkrantz‡, Mahadevan Subramaniam†§

§Computer Science Department, University of Nebraska at Omaha, Omaha, NE 68182 {hsiy,pchundi,msubramaniam}@mail.unomaha.edu
‡Computer Science Department, University at Albany, SUNY, Albany, NY 12222 [email protected]

Abstract

Time series analysis is a promising approach to discover temporal patterns from time-stamped, numeric data. A novel approach to apply time series analysis to discern temporal information from software version repositories is proposed. Version logs containing numeric as well as non-numeric data are represented as an item-set time series. A dynamic programming based algorithm to optimally segment an item-set time series is presented. The algorithm automatically produces a compacted item-set time series that can be analyzed to discern temporal patterns. The effectiveness of the approach is illustrated by applying it to the Mozilla data set to study the change frequency and developer activity profiles. The experimental results show that the segmentation algorithm produces segments that capture meaningful information and that carry more information content than segments obtained by arbitrarily partitioning the time period into regular time intervals.

1 Introduction

Software version repositories contain an enormous wealth of information regarding development and maintenance activities in a project. Recently, there has been considerable interest in analyzing these repositories to discern information pertaining to changes performed on software artifacts, study developer profiles, understand the impact of the development process on software quality by predicting future changes, and so on. The importance of explicitly considering the temporal dimension while analyzing version repositories, based on time stamps associated with the version logs, has been well recognized in these works.

Time series analysis is a well-established area of research that has been highly successful in discovering nuggets of temporal information from time-stamped data. Time series analysis has traditionally been applied to numeric measurements or observations performed at regular intervals of natural phenomena like rainfall or man-made phenomena like stock prices.

The overall objective of this paper is to enable the application of time series analysis techniques to discover information from version logs. This poses several interesting challenges. First, unlike traditional time series data, which consists of numeric measurements, version logs typically include both numeric and non-numeric data. Effective analysis of such logs requires that we extend current time series analysis techniques to handle non-numeric data. Further, version logs of a large project can span a long period of time with rich and varying temporal patterns of activities. These version logs can result in non-numeric time series data containing hundreds to thousands of measurements. Methods are needed to compactly represent such time series data to highlight the underlying temporal patterns.

One approach that has been successfully used to compactly represent time series data is segmentation. Segmentation of a time series automatically partitions the time period associated with the time series into a sequence of time intervals (or segments), which can then be further analyzed to find interesting temporal patterns. The importance of segmenting time series is well established, and several studies have demonstrated the advantages of segmentation to compactly represent a time series [8, 14, 12].

In this paper, we propose a novel approach that models version logs as time series data and describe a dynamic programming based method to construct optimal segmentations of version logs. The effectiveness of the proposed approach is illustrated by applying it to one aspect of software version logs, namely, developer activity profiles. We automatically segment a given version history such that each segment identifies developers with significant activity in that time period. This information can then be used to understand the core group of developers active at a given time and to relate trends in developer activity to global events such as release dates and future fixes. We believe that the proposed approach is general enough to be applied to study trends in several other aspects of version logs, such as change couplings and developer file ownerships.

The approach presented in this paper represents a version history as an item-set time series¹, a time series where each observation is a set of discrete items. We then show how an item-set time series can be segmented to obtain a compact representation. There are many ways to segment a time series. One can simply group all observations within a day, a week, a month, etc., into a single segment. We call these fixed segmentations. Alternatively, one can divide the time period into k segments of uniform length. Such a fixed manner of producing segments may not be well suited for time series data such as version logs, in which temporal patterns may occur in bursts. Therefore, automatic segmentation methods that discover variable length segments that closely represent the observations of an item-set time series are essential. We present an approach to automatically generate a segmentation of item-set time series data.

Our main contributions are as follows.

• A software version history is modeled as an item-set time series where each observation in the time series is a set of user ids of developers who made changes to files at the same time. The notion of an item set of a segment is used to compactly represent the item set that results from combining consecutive observations of an item-set time series. A dynamic programming based algorithm is presented to construct an optimal segmentation of an item-set time series that minimizes the difference between the item set of a segment and the individual item sets that were combined to generate that segment.

• The proposed approach has been applied to the Mozilla data set to study change frequency over time and developer activity profiles. The preliminary results are extremely promising and highlight the power of the proposed approach in applying time series analysis to version histories. The data set is represented as an item-set time series, and several optimal as well as fixed segmentations were constructed and compared.

• The results show that the segments generated capture time durations involving significant events in the history of Mozilla. The segments identify groups of developers active over a certain period of time and also reflect the difference in composition of active developers before and after major releases.

• The variable length segments produced by the algorithm outperform the fixed segmentations (produced by arbitrarily partitioning the project time period at regular intervals) in terms of active developers identified per segment, the distinctiveness of developers identified across adjoining segments, and the quality of changes made by active developers within a segment.

The rest of the paper is organized as follows. Section 2 introduces the terms and definitions for the optimal segmentation problem and describes a dynamic programming algorithm to construct an optimal segmentation. Section 3 describes the Mozilla data and the effectiveness of applying optimal segmentation to the Mozilla data. Section 4 describes the related work and Section 5 concludes the paper.

∗ Supported in part by NSF Grant IIS-0534616 and by Grant Number P20 RR16469 from the National Center for Research Resources (NCRR), a component of the National Institutes of Health (NIH).
† Supported in part by NSF Grant CCF-0541057.
¹ An item-set time series is much like the sequence data in market-basket analysis. The only difference is that consecutive item sets in an item-set time series are measured or recorded at regular intervals.

2 Segmentation of Software Version History

Let I be a finite set of items d1, d2, ..., dm. An item set is a subset of I. The fractional difference between two item sets x and y is (|x − y| + |y − x|) / |x ∪ y| if x ∪ y is nonempty, and is 0 otherwise. An item-set time series T consists of a finite sequence of n samples x1, ..., xn, where each xk is an item set, recorded at successive time points t1, ..., tn. A time point ti may represent different units of time such as seconds, minutes, hours, or days. Below, for ease of exposition, we use time points at the granularity of a day.

Example 2.1 Let I be a set of developer ids. A software version history can be modeled as an item-set time series TD where each sample xi is a set of developers. Item set xi denotes the set of developers that checked in files on the day denoted by ti.

A segment s(a, b) (1 ≤ a ≤ b ≤ n) of a time series T consists of the consecutive time points ta, ..., tb. Given segments s1 = s(a, b) and s2 = s(b + 1, c), the concatenation of s1 and s2, denoted s1 s2, is the segment s(a, c).

Example 2.2 Suppose the time points Jan 21, 2003, Jan 22, 2003, Jan 23, 2003, Jan 24, 2003, and Jan 25, 2003 appear consecutively in an item-set time series. Then s(1, 3) is a segment containing the three time points Jan 21, 2003, Jan 22, 2003, and Jan 23, 2003. Segment s(2, 2) contains the single time point Jan 22, 2003.

A measure function (denoted by f) is used to assign a numeric value to each item in a segment to capture the relevance of an item to that segment². There are many types of measure functions that one can formulate to capture the relevance of items in a segment. We define a measure function based on the occurrence frequency of an item in that segment.

Definition 2.1 The density measure (fm) takes an item dq and a segment s(a, b) as input, and returns the fraction of the item sets in s(a, b) that contain dq.

The numeric values assigned to items by a measure function f in a given segment s(a, b) are used to identify items that are deemed to be significant for that segment, as follows. Let α be a user-specified threshold. An item dq is called significant in segment s(a, b) if f(dq, s(a, b)) ≥ α. The item set of segment s(a, b), denoted by Iα(s(a, b), f), is the set of all significant items in s(a, b).

Example 2.3 Suppose we have the following item-set time series TD containing developer sets constructed from a software version history, where each developer set contains all developers that checked in files on the day of the corresponding time point. TD = { ⟨a, b, c⟩, ⟨b, c, d⟩, ⟨a, b, e⟩, ⟨a, b, f⟩ }, recorded at time points Jan 21, 2003, Jan 22, 2003, Jan 23, 2003, and Jan 24, 2003. Consider segment s(1, 2). fm(a, s(1, 2)) = 0.5, fm(b, s(1, 2)) = 1.0, fm(c, s(1, 2)) = 1.0, and fm(d, s(1, 2)) = 0.5. Therefore, developers a and d checked in files on half the days during Jan 21 – 22, 2003, while developers b and c checked in files on both days. Let α = 0.5. Then Iα(s(1, 2), fm) = {a, b, c, d}. Similarly, Iα(s(2, 4), fm) = {a, b}, since a appears in two of the three item sets of s(2, 4) and b in all three, while c, d, e, and f each appear in only one.
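The definitions above can be checked mechanically. The following is a minimal Python sketch, not code from the paper: the function names (fractional_difference, density, segment_item_set) and the list-of-sets representation are illustrative assumptions, segments use the paper's 1-indexed inclusive bounds, and the threshold α is assumed to be positive so that only items occurring in the segment need to be considered.

```python
from fractions import Fraction

def fractional_difference(x, y):
    """(|x - y| + |y - x|) / |x ∪ y|, or 0 when the union is empty."""
    union = x | y
    if not union:
        return Fraction(0)
    return Fraction(len(x - y) + len(y - x), len(union))

def density(item, series, a, b):
    """Density measure f_m: fraction of item sets in s(a, b) containing item."""
    window = series[a - 1:b]  # s(a, b) is 1-indexed and inclusive
    return Fraction(sum(item in x for x in window), len(window))

def segment_item_set(series, a, b, alpha):
    """I_alpha(s(a, b), f_m): items whose density in s(a, b) is at least alpha.

    Assumes alpha > 0, so items absent from the segment can be ignored.
    """
    candidates = set().union(*series[a - 1:b])
    return {d for d in candidates if density(d, series, a, b) >= alpha}

# The item-set time series T_D of Example 2.3.
T_D = [{"a", "b", "c"}, {"b", "c", "d"}, {"a", "b", "e"}, {"a", "b", "f"}]
alpha = Fraction(1, 2)

print(density("a", T_D, 1, 2))                      # 1/2
print(density("c", T_D, 1, 2))                      # 1
print(sorted(segment_item_set(T_D, 1, 2, alpha)))   # ['a', 'b', 'c', 'd']
print(sorted(segment_item_set(T_D, 2, 4, alpha)))   # ['a', 'b']
```

Exact rational arithmetic (fractions.Fraction) is used here only so that comparisons against α are not affected by floating-point rounding; floats would work as well for exploratory use.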
The item set associated with a segment represents developers that have checked in files on at least a fraction α of the days in that segment (at least half of the days for α = 0.5) and thus are deemed to have been active during the time period associated with the segment.

A segmentation Π of a time series T is defined as a sequence s(b_0, b_1), s(b_1 + 1, b_2), ..., s(b_{l−1} + 1, b_l) of segments such that the concatenation s(b_0, b_1) s(b_1 + 1, b_2) · · · s(b_{l−1} + 1, b_l) = T. The size of Π, denoted by |Π|, is l, the number of segments in Π. A segmentation Π of a time series T is a fixed segmentation if all of the segments in Π contain the same number of time points.

Example 2.4 A size 3 segmentation of TD is s(1, 1), s(2, 3), s(4, 4). A size 2 fixed segmentation of TD is s(1, 2), s(3, 4).

² The term measure function first appeared in [4].

2.1. Non-Homogeneity of a Segmentation

Let s(a, b) be a segment and th be a time point such that a ≤ h ≤ b. Let δh denote the fractional difference between Iα(s(a, b), f) and xh. The segment difference of segment s(a, b), denoted by δ(s(a, b)), is the sum of these fractional differences over the segment: δ(s(a, b)) = Σ_{a ≤ h ≤ b} δh. The segment difference of a segment represents how closely the item set of the segment captures the item sets of individual time points in that segment. The following example illustrates the segment differences for a couple of segments of the previous example.

Example 2.5 Consider Iα(s(1, 2), fm) from the previous example. Iα(s(1, 2), fm) = {a, b, c, d}. δ1 = 0.25 and δ2 = 0.25. Therefore, δ(s(1, 2)) = 0.5. Iα(s(2, 4), fm) = {a, b}. Here δ2 = 0.75, since {a, b} and x2 = {b, c, d} differ in the three items a, c, and d out of the four items in their union, and δ3 = 0.33, δ4 = 0.33. Therefore, δ(s(2, 4)) ≈ 1.42.

A desirable property of a segmentation is that the item set of each of the segments closely reflects the item sets of the time points contained in that segment. The segment difference of a given segment is a measure of how internally homogeneous that segment is; a smaller value indicates a more homogeneous segment. There are a variety of ways to measure the non-homogeneity of a given segmentation of a time series, given the segment difference of each segment in the segmentation. We describe three such measures here. The summation difference measure, denoted by ∆sum, is the sum of the segment differences of the segments in the segmentation. The average difference measure, denoted by ∆avg, is the average segment difference (ratio of the summation difference to the size of the segmentation). The max difference measure, denoted by ∆max, is the maximum segment difference.
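As a rough illustration of the segment difference and the three non-homogeneity measures, the sketch below evaluates them on the series TD of Example 2.3 with the density measure and α = 0.5. It is a sketch under stated assumptions rather than the authors' implementation: all function names are hypothetical, a segmentation is represented as a list of 1-indexed inclusive (a, b) pairs, and α is assumed positive. Applying the Section 2 definition of fractional difference literally, the per-point differences in s(2, 4) come out as 3/4, 1/3, and 1/3, so δ(s(2, 4)) = 17/12.

```python
from fractions import Fraction

def fractional_difference(x, y):
    """(|x - y| + |y - x|) / |x ∪ y|, or 0 when the union is empty."""
    union = x | y
    if not union:
        return Fraction(0)
    return Fraction(len(x - y) + len(y - x), len(union))

def segment_item_set(series, a, b, alpha):
    """Items contained in at least a fraction alpha (> 0) of the sets in s(a, b)."""
    window = series[a - 1:b]  # 1-indexed, inclusive bounds
    return {d for d in set().union(*window)
            if Fraction(sum(d in x for x in window), len(window)) >= alpha}

def segment_difference(series, a, b, alpha):
    """delta(s(a, b)): sum of fractional differences between the segment's
    item set and the item set of each time point in the segment."""
    rep = segment_item_set(series, a, b, alpha)
    return sum((fractional_difference(rep, x) for x in series[a - 1:b]),
               Fraction(0))

def non_homogeneity(series, segmentation, alpha):
    """Summation, average, and max difference measures of a segmentation."""
    diffs = [segment_difference(series, a, b, alpha) for a, b in segmentation]
    total = sum(diffs, Fraction(0))
    return {"sum": total, "avg": total / len(diffs), "max": max(diffs)}

T_D = [{"a", "b", "c"}, {"b", "c", "d"}, {"a", "b", "e"}, {"a", "b", "f"}]
alpha = Fraction(1, 2)

print(segment_difference(T_D, 1, 2, alpha))   # 1/2
print(segment_difference(T_D, 2, 4, alpha))   # 17/12

# The size 2 fixed segmentation s(1, 2), s(3, 4) of Example 2.4.
fixed = non_homogeneity(T_D, [(1, 2), (3, 4)], alpha)
print(fixed["sum"], fixed["avg"], fixed["max"])   # 1 1/2 1/2
```

Because all three measures are derived from the same list of segment differences, comparing candidate segmentations under a different measure only requires reading a different key of the returned dictionary.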

2.2. Optimal Segmentation Problem

Segmentation of a time series reduces the number of samples to be examined, while hopefully preserving much of the information of the original time series. For a given measure function and difference measure, the optimal segmentation problem takes as input an item-set time series, segment difference values for each of the O(n^2) segments, and an upper bound p on the size of the desired segmentation of the input time series, and constructs a segmentation of at most p segments with minimal non-homogeneity. A dual formulation of the optimal segmentation problem is to take as input an item-set time series and an upper bound on the amount of non-homogeneity, and construct a minimal size segmentation whose non-homogeneity does not exceed the given limit.

Dynamic programming is typically employed to solve the optimal segmentation problem [8, 12]. The dynamic programming approach uses as input data the segment difference values for each of the O(n^2) segments of the input time series, i.e., the δ(s(i, j)) value for each segment s(i, j) (1 ≤ i ≤ j ≤ n). Prior to carrying out the dynamic programming algorithm, these segment difference values for all segments are computed, using the measure function specified by the user.

The dynamic programming algorithm operates as follows. We assume that the item-set time series to be segmented begins at index 1 and that p ≤ n. A two-dimensional table R is maintained by the dynamic programming algorithm. Entry R[j, k] in the table records the minimum possible amount of non-homogeneity that can be incurred in combining time points 1 through j into k segments (j ≥ 1, k ≤ j, k ≤ p). If k is 1, the value of R[j, k] is set to δ(s(1, j)), regardless of which non-homogeneity measure is being used. Otherwise, a recursive equation is used to compute the value of R[j, k] from previously computed entries in the table. The specifics of the recursive equation depend on which non-homogeneity measure is being used. For the summation difference measure of non-homogeneity, each entry R[j, k], k > 1 is set to mink−1≤z