Non-invasive software process data collection for expert identification

Andrea Janes, Alberto Sillitti, Giancarlo Succi
Free University of Bolzano/Bozen, Center for Applied Software Engineering
{ajanes, asillitti, gsucci}@unibz.it
Abstract

Software companies depend heavily on knowledgeable employees. Competence and skills management are essential instruments to understand how to employ the available skills in an optimal way. Unfortunately, implementing knowledge management strategies such as competence and skills management is challenging because resources, time, and effort are required before benefits become visible. This paper presents an approach to collect data non-invasively (i.e., without requiring any effort by developers) about "who" is working on "what" during software production. We present two examples that show how to answer three questions: "who is the expert of a specific part of the code?", "who should do pair programming with whom?", and "what knowledge gap arises if a specific developer leaves?".
1. Introduction

Competence management and expert identification are activities within the field of knowledge management which aim to find out who knows what [1]. This information can serve various purposes, e.g., to find the right employees to staff new projects [2], to match positions with skills [3], or to support software maintenance [4]. In distributed environments or in larger development teams, knowledge about who wrote a particular piece of code, who knows about a particular set of classes, or who is responsible for a particular requirements document is essential. According to [5], software developers apply just as much effort and attention determining whom to contact in an organization as they do getting the job done.

To partly solve this problem, this paper proposes a measurement framework to extract knowledge about who knows what about software development artifacts (such as source code, documents, slides, spreadsheets, etc.) non-invasively, i.e., without the need for the developers to spend time documenting their knowledge within a knowledge management system. The extracted information builds on the idea that the programmer's activity, i.e., the adding, modifying, deleting, and reading of code, is an indicator of the knowledge that the programmer has about that part of the code [6]. Other approaches such as expertise recommenders, which suggest who has expertise in particular parts of the program, also base their recommendations on this assumption. Tools such as Expertise Recommender [7], EEL [8], and Expertise Browser [9] make recommendations based on commits to source code repositories.

The objective of the expert identification measurement framework is twofold: the primary goal is to track the time spent editing the artifacts that are created during the software development process, such as documents, source code, spreadsheets, etc., to infer the user's knowledge of those artifacts. Additionally, data that describes the artifact being accessed is collected to allow the retrieval of an artifact using its properties as search criteria.
2. Related work

Existing implementations can be categorized into two groups depending on whether the knowledge about who knows what has to be provided by the users of the system or whether it is extracted from other sources. Examples of tools of the first group are the StepStone Skills & Competency Management Module [10] or SAP ERP Human Capital Management [11], in which employees maintain their own skill profiles or those of their subordinates. A common problem of knowledge engineering is how to generate knowledge with as little effort and as few resources as possible [12]. Asking employees to maintain their knowledge profiles takes time, which means that it generates costs: if experts spend time sharing knowledge, they will be less productive [13]. To overcome this barrier to adopting knowledge management, tools were developed that try to extract knowledge about who knows what automatically from existing artifacts. Two examples are AnswerGarden [14], which creates a knowledge repository storing the questions and answers exchanged between help desks and their clients, and ActiveNet [15], which extracts knowledge from the e-mail traffic, instant messaging, and digital workspaces that employees use in their everyday work.
Proposals like the "Knowledge Dust to Pearls" approach [13] build on the AnswerGarden approach by refining the collected knowledge into "experience pearls" that are collected in an "experience base" to be reused for the planning of future activities.
3. Measurement framework
To extract the desired knowledge from the ongoing software development process, we developed a measurement framework that is able to identify the artifact a user is currently working on, to read properties of the artifact currently accessed, and to store this information in a central database. Our focus during development was that all steps can be carried out non-invasively, i.e., without the need for the developer to interact with the knowledge management system.

Currently, our measurement framework can identify artifacts accessed using the Microsoft Office suite1 (Word, Excel, PowerPoint, Visio, Frontpage, and Outlook), the OpenOffice.org office suite2 (Writer, Calc, and Impress), and software development environments such as Microsoft Visual Studio3, Eclipse4, NetBeans5, and IntelliJ IDEA6. These applications allow reading the currently accessed artifact through a provided API. We developed a set of measurement probes (one per application) that constantly poll these applications for the current artifact accessed. As soon as the reported value changes, the amount of time passed since the last change is written to the local log file together with the current date, time, user name, and name of the artifact accessed. At regular intervals, the collected log is transferred to a central server.

We assume that the API of the application to observe provides a function getCurrentArtifact() that we use to access the current artifact modified by the user (in most cases this is the artifact that currently has focus). The following pseudo code describes how one measurement probe (i.e., an application of our measurement framework connected to the API of an application to observe) collects data.

initialize user with the current user
initialize app with the application name
initialize artifact using getCurrentArtifact()
initialize start with the current date and time
while application to observe is running
    // we want to collect data with the
    // granularity of one second so that
    // our application does not consume
    // too many resources on the machine
    // of the developer
    wait for 1 second
    if getCurrentArtifact() ≠ artifact then
        set now to the current system date and time
        set duration to now - start
        append user, app, artifact, start, duration to the local log file
        set artifact = getCurrentArtifact()
        set start = now
    end if
end while
The pseudo code above shows how the probe constantly polls the observed application and generates an entry in the local log file whenever the current artifact changes. The activity log files obtained in this way are transferred to a central database where all data is stored. The three steps, artifact identification, local data caching, and data transfer, are illustrated in figure 1.

[Figure 1 illustrates the three steps: step 1, identify the active artifact (e.g., a web page in Mozilla Firefox, a C# class in Microsoft Visual Studio, a spreadsheet in OpenOffice Calc); step 2, store data about the identified artifact in a local log file; step 3, transfer the collected data via a data transfer daemon to the database on a central server.]
1 Microsoft Office, http://office.microsoft.com
2 OpenOffice.org, http://www.openoffice.org
3 Microsoft Visual Studio, http://msdn2.microsoft.com/vs2008/products
4 Eclipse.org, http://www.eclipse.org
5 Sun NetBeans, http://www.netbeans.org
6 JetBrains IntelliJ IDEA, http://www.jetbrains.com/idea
Figure 1. Overview of the framework
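For illustration, such a probe could be sketched in Java roughly as follows. This is a minimal sketch, not our actual implementation: the ObservedApplication interface, the log format, and all names are assumptions standing in for the application-specific APIs mentioned above.

import java.io.FileWriter;
import java.io.IOException;
import java.time.Instant;

public class Probe {
    // Stand-in for the application-specific API (e.g., the Eclipse or
    // Visual Studio extension points); hypothetical interface.
    interface ObservedApplication {
        boolean isRunning();
        String getCurrentArtifact();
    }

    public static void observe(ObservedApplication app, String user,
                               String appName, String logFile)
            throws IOException, InterruptedException {
        String artifact = app.getCurrentArtifact();
        Instant start = Instant.now();
        while (app.isRunning()) {
            // one-second granularity keeps the probe cheap on the
            // developer's machine
            Thread.sleep(1000);
            String current = app.getCurrentArtifact();
            if (current != null && !current.equals(artifact)) {
                Instant now = Instant.now();
                long duration = now.getEpochSecond() - start.getEpochSecond();
                // append one log entry per change of the active artifact
                try (FileWriter log = new FileWriter(logFile, true)) {
                    log.write(user + ";" + appName + ";" + artifact + ";"
                            + start + ";" + duration + "\n");
                }
                artifact = current;
                start = now;
            }
        }
    }
}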
What we consider as the concrete artifact depends on the application: within Microsoft Office and OpenOffice, it is a file. This means that we track the accesses to files and the modifications of their properties. Within programming environments, we consider the method as the artifact, i.e., the time spent editing, adding, or deleting a method is the finest granularity of the data that is collected. This data can be aggregated later, e.g., at the class, namespace, or file level. The result of the data collection step consists of artifacts, properties of artifacts, and the sequence of accesses to the artifacts (see figure 2). In this way, the described measurement framework collects who accesses which artifact and when, and, to allow the filtering of artifacts, logs properties that describe the artifacts accessed by the user.
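To make the aggregation step concrete, the following minimal Java sketch (all names are hypothetical) rolls method-level durations up to the namespace level; a fully qualified method name such as "a.b.c.D.m" is attributed to its namespace "a.b.c".

import java.util.HashMap;
import java.util.Map;

public class DurationAggregator {
    // secondsPerMethod maps fully qualified method names to the seconds
    // spent on them; the result maps namespaces to aggregated seconds.
    public static Map<String, Long> byNamespace(Map<String, Long> secondsPerMethod) {
        Map<String, Long> result = new HashMap<>();
        for (Map.Entry<String, Long> e : secondsPerMethod.entrySet()) {
            String qualified = e.getKey();                        // "a.b.c.D.m"
            int classSep = qualified.lastIndexOf('.');            // strip ".m"
            int nsSep = qualified.lastIndexOf('.', classSep - 1); // strip ".D"
            String namespace = nsSep > 0 ? qualified.substring(0, nsSep) : qualified;
            result.merge(namespace, e.getValue(), Long::sum);
        }
        return result;
    }
}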
[Figure 2 shows the entity-relationship model of the collected data: artifacts (id, name) with their artifact_type, properties (value, timestamp) with their property_type (id, name), and accesses (timestamp, duration) by users (id, name).]
Figure 2. EER diagram of the measurement framework
4. Data analysis
To help users search for experts within the system described above, we categorize the collected time spent editing artifacts according to different criteria. For example, if the source code of a specific company is organized in such a way that the namespace indicates the component of the software system, it is reasonable to group the time spent per artifact by namespace. In this case, knowing the namespace, i.e., the component, helps to find who dedicated the most time to accessing artifacts within that namespace, i.e., the employee with the most knowledge of this component: the expert [6]. Other examples of criteria for classifying the access duration are the project name, prefixes of class names (e.g., "test" to find the testing experts), or types of documents.

To be able to change the rules easily, we use Prolog rules to define how the data should be classified (using InterProlog [16] as a bridge between Java and Prolog). Within Prolog we define a set of predicates that correspond to the properties collected by the measurement framework and allow access to the table "properties" (see figure 2) from within Prolog. Currently, the following predicates are available for use within Prolog rules:
a) path(X, N): true if the artifact with the id X is a file stored within the folder with the name N (N is specified as a Prolog list, e.g., ['c', 'a', 'b'] for the path "c:\a\b");
b) file(X, N): true if the artifact with the id X is a file with the name N (without path);
c) class(X, N): true if the artifact with the id X represents a class with the name N;
d) method(X, N): true if the artifact with the id X represents a method with the name N;
e) namespace(X, N): true if the artifact with the id X is contained within the namespace N (N is specified as a Prolog list, e.g., ['a', 'b'] for the namespace "a.b");
f) access(Y, U, X, S): true if the access with the id Y by the user U to the artifact X lasted S seconds;
g) access_date(Y, D): true if the access with the id Y occurred on the date D.
To define a classification of the artifacts, we expect that the predicate classification_artifact(T, X, C) exists and that it returns true if the artifact with the id X is contained in the class with the name C according to the classification criteria with the name T. To classify the records stored within the table access (e.g., to assign accesses within a certain time range to a specific category), we expect that the predicate classification_access(T, A, C) exists and that it returns true if the access with the id A is contained in the class with the name C according to the classification criteria with the name T.

The resulting classifications are cached within the database so that they can be easily queried using SQL. Figure 3 shows the EER diagram of how the classifications of artifacts and accesses to these artifacts are stored within the database: classifications can be made according to different criteria. For example, it is possible
to classify artifacts according to their importance and according to their size. All criteria T that the classifying Prolog predicates classification_artifact and classification_access return are stored within the table classification_criteria. All classes C used within the two classification predicates are stored in the table classification_classes.
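The caching step could be sketched in Java as follows. This is a minimal illustration under assumptions: the interface ClassificationRules stands in for the Java-Prolog bridge (InterProlog in our implementation, whose actual API is not shown), and the cache-table name artifact_classification is hypothetical.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class ClassificationCache {
    // Hypothetical stand-in for the Java-Prolog bridge: true if artifact X
    // belongs to class C under the classification criteria T.
    interface ClassificationRules {
        boolean classifyArtifact(String t, long x, String c);
    }

    public static void cache(Connection db, ClassificationRules rules,
                             String criteria, long[] artifactIds,
                             String[] classes) throws SQLException {
        String sql = "INSERT INTO artifact_classification "
                   + "(criteria, artifact_id, class) VALUES (?, ?, ?)";
        try (PreparedStatement stmt = db.prepareStatement(sql)) {
            for (long artifact : artifactIds) {
                for (String cls : classes) {
                    // evaluate the Prolog rule once and store the result,
                    // so later analyses can use plain SQL
                    if (rules.classifyArtifact(criteria, artifact, cls)) {
                        stmt.setString(1, criteria);
                        stmt.setLong(2, artifact);
                        stmt.setString(3, cls);
                        stmt.executeUpdate();
                    }
                }
            }
        }
    }
}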
[Figure 3 shows the classification tables classification_criteria (id, name) and classification_classes (id, name), linked to the tables artifacts and access.]
Figure 3. EER diagram of how the classifications of artifacts and accesses are stored in the database
5. Examples

In the following two examples we will show how the described measurement framework can be used to address expert identification and skill management issues.
5.1. Example 1
In the following example, we describe the analysis of experts within a case study carried out for a company in the domain of industrial automation, which, for confidentiality, we call "Acme" in the following. Acme's IT department consists of about 50 employees, 23 of whom develop software for internal use. We installed our measurement framework and collected the time spent per method, class, namespace, and file.

Within Acme, the developers agreed that the development effort should be grouped at the namespace level, considering only the first two packages, since for them the first package always represents the name of the project and the second package the name of a main component. If, e.g., a class is contained within the namespace a.b.c.d, we attribute all the time spent editing this class to the class a.b. Therefore, as rules, we defined the predicate first_two(X, H1, H2, R) so that it is true if H1 and H2 are the first two elements of the namespace of the artifact X and R is the remainder:

first_two(X, H1, H2, R) :- namespace(X, Y), [H1|T1] = Y, [H2|R] = T1.

Now, classification_artifact(T, X, C) can be defined as follows:

classification_artifact(T, X, C) :- T=expert, first_two(X, H1, H2, R), concat(H1, '.', P1), concat(P1, H2, C).

This means that for the type of classification expert, we consider an artifact with the id X part of the class C if the first two packages of the namespace of the artifact X correspond to the name of C.

To visualize the data obtained using our measurement framework, every tool able to connect to a database using JDBC can be used; we used OpenOffice Calc7 to query the database and to generate a pivot table of the format shown in figure 4. In this table, all classification items accessed within the analyzed period are shown as rows, and the users accessing these classification items as columns. Within the pivot table, each value represents the sum of the time spent by the specific user on a specific classification item.

[Figure 4 shows the schema of the pivot table: classification items as rows, users as columns, and the sum of time spent per classification item in the cells.]
Figure 4. Schema of the pivot table to visualize the effort distribution
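The numbers behind such a pivot table can be computed with a single SQL aggregation. The following JDBC sketch illustrates this; the join table artifact_classification and the column names are assumptions based on figures 2 and 3, not the literal schema.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class PivotQuery {
    // Prints, for the given classification criteria, the total seconds
    // each user spent per classification class.
    public static void printEffort(Connection db, String criteria)
            throws SQLException {
        String sql = "SELECT c.class, u.name, SUM(a.duration) AS seconds "
                   + "FROM access a "
                   + "JOIN user u ON u.id = a.user_id "
                   + "JOIN artifact_classification c ON c.artifact_id = a.artifact_id "
                   + "WHERE c.criteria = ? "
                   + "GROUP BY c.class, u.name";
        try (PreparedStatement stmt = db.prepareStatement(sql)) {
            stmt.setString(1, criteria);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    System.out.printf("%s %s %d%n", rs.getString("class"),
                            rs.getString("name"), rs.getLong("seconds"));
                }
            }
        }
    }
}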
In this way, by calculating the i-th largest value of time spent on each classification item, we obtain the i-th expert of that classification item. Within Acme, we used percentages (the time spent in relation to the time spent by the top expert) to ease the understanding of the resulting pivot table. In the example shown in table 1, user 2 has almost the same amount of experience as user 1 considering classes within namespaces starting with project1.a, but is the only one that has experience with classes within project2.c. Within our case study, we formatted the table to ease the understanding of the data for the user: we color the cells in dark green if the value is above 90% and in light green if it is above 50%.
7 OpenOffice.org Calc, http://www.openoffice.org/product/calc.html
The column "Experts" shows the number of users within the current row with values above 50%. If the number of users in this column is equal to 1, there is only one user who knows about that part of the code. Within the case study, we colored these cells in red.
Classification   User1   User2   User3   Experts
project1.a       100%    90%     -       2
project1.b       10%     -       100%    1
project2.c       -       100%    -       1
project3.d       10%     -       100%    1
Table 1. Example of pivot table representing the knowledge of each user about the code within a classification item
To summarize, using the mentioned measurement framework and a tool to query the obtained data such as OpenOffice Calc, it is possible: a) to determine the i-th expert of a specific part of the code, assuming that the time spent adding, modifying, and updating reflects the knowledge of the code; b) to show the experience of a specific user; and c) to evaluate the knowledge gap that will occur if a specific user leaves.
Point c) addresses the problem that when a person with critical knowledge leaves an organization, severe knowledge gaps arise. It is crucial to understand what knowledge is lost to prevent valuable knowledge from disappearing. Additionally, knowing which knowledge disappeared can help to decide which skills are needed in a new employee and on which areas he or she should work.
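The check behind point c) can be expressed directly on the pivot data. The following Java sketch (hypothetical names, using the 50% expert threshold from our case study) lists the classification items for which a departing user is the only expert:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class KnowledgeGap {
    // percentPerItemAndUser: classification item -> (user -> percentage
    // relative to the top expert, as in table 1)
    public static List<String> gapsIfLeaving(
            Map<String, Map<String, Double>> percentPerItemAndUser,
            String leavingUser) {
        List<String> gaps = new ArrayList<>();
        for (Map.Entry<String, Map<String, Double>> item
                : percentPerItemAndUser.entrySet()) {
            Map<String, Double> perUser = item.getValue();
            long experts = perUser.values().stream()
                    .filter(p -> p > 50.0).count();
            boolean leaverIsExpert =
                    perUser.getOrDefault(leavingUser, 0.0) > 50.0;
            if (experts == 1 && leaverIsExpert) {
                gaps.add(item.getKey()); // knowledge exists in one head only
            }
        }
        return gaps;
    }
}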
5.2. Example 2

Acme is using Extreme Programming [17] as its software development methodology. Within agile methodologies, the focus lies on "working software over comprehensive documentation" [18]. The produced source code is considered the most valuable asset, representing the knowledge of the development team. This knowledge has to be shared among team members using practices like "pair programming" (meaning that all code has to be written in pairs) or "collective code ownership" (meaning that the entire team is responsible for the source code and that everybody has the right to modify everything) [17].

In such contexts, it is important that everybody knows everything about the produced code. A measurement framework as described here can be used to determine the knowledge of different developers about different subparts of the code and could be extended to recommend who should do pair programming with whom to optimally distribute the knowledge. In the case of Acme, we decided to consider only the time spent in the last half year as an indicator of experience. If, e.g., a programmer did not dedicate time to the development of a specific component for more than half a year, he or she had to do pair programming with one of the developers currently working on that part to be up to date with the latest modifications. For this, we defined the predicate classification_access(T, A, C) as follows:

classification_access(T, A, C) :- T=expert, access_date(A, B), parse_time(B, P), get_time(D), E is D-P, E < 15768000, C=pp.

This means that for the type of classification expert, we consider an access with the id A part of the class pp (pair programming) if the access date (obtained as a string B and converted to the Prolog timestamp P) is less than six months (15,768,000 seconds) ago. Within our SQL query, we sum up all the accesses belonging to the class pp. As in example 1, we used OpenOffice Calc to visualize this data. The resulting table is formatted in the same way (see table 2 for an example), except that now the data shows who worked on which classification item (i.e., the first two elements of the namespace) during the last six months.
Classification   User1   User2   User3
project1.a       100%    90%     20%
project1.b       80%     80%     100%
project2.c       90%     100%    80%
project3.d       70%     100%    80%
Table 2. Example of pivot table representing the knowledge of each user about the code within a classification item (considering only the last 6 months)
In the example in table 2, it is visible that user 3 should do pair programming with either user 1 or user 2 for the next requirement on the component project1.a.
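Such a recommendation could be automated along the following lines; this Java sketch (hypothetical names) picks, for a given developer and classification item, the colleague with the highest percentage in the six-month table as pair-programming partner.

import java.util.Map;

public class PairRecommender {
    // recentPercentPerUser: user -> percentage for one classification item,
    // computed over the last six months (as in table 2)
    public static String recommend(Map<String, Double> recentPercentPerUser,
                                   String developer) {
        String best = null;
        double bestPercent = -1.0;
        for (Map.Entry<String, Double> e : recentPercentPerUser.entrySet()) {
            if (e.getKey().equals(developer)) continue; // skip the developer
            if (e.getValue() > bestPercent) {
                bestPercent = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }
}

For the data in table 2, recommend would return User1 for User3 on project1.a (100% vs. 90% for User2).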
6. Acknowledgements

We thank Tadas Remencius for his support on this article.
7. References

[1] Rus, I. and Lindvall, M. Guest Editors' Introduction: Knowledge Management in Software Engineering. IEEE Software, Vol. 19, No. 3, 2002, pp. 26-38.
[2] Becerra-Fernandez, I. Searching for experts on the Web: A review of contemporary expertise locator systems. ACM Transactions on Internet Technology, Vol. 6, No. 4, New York, NY, USA: ACM, 2006, pp. 333-355.
[3] Becerra-Fernandez, I. Facilitating the Online Search of Experts at NASA using Expert Seeker People-Finder. In: Reimer, U. (ed.), Proceedings of PAKM 2000, Vol. 34, 2000.
[4] Sarkar, S., Sindhgatta, R. and Pooloth, K. A collaborative platform for application knowledge management in software maintenance projects. Compute '08: Proceedings of the 1st Bangalore Annual Compute Conference, New York, NY, USA, 2008, pp. 1-7.
[5] Perry, D. E., Staudenmayer, N. and Votta, L. G. People, Organizations, and Process Improvement. IEEE Software, Vol. 11, No. 4, Los Alamitos, CA, USA: IEEE Computer Society Press, 1994, pp. 36-45.
[6] Fritz, T., Murphy, G. C. and Hill, E. Does a programmer's activity indicate knowledge of code? ESEC-FSE '07: Proceedings of the 6th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, New York, NY, USA, 2007, pp. 341-350.
[7] McDonald, D. W. and Ackerman, M. S. Expertise recommender: a flexible recommendation system and architecture. CSCW '00: Proceedings of the 2000 ACM Conference on Computer Supported Cooperative Work, New York, NY, USA, 2000, pp. 231-240.
[8] Minto, S. and Murphy, G. C. Recommending Emergent Teams. ICSEW '07: Proceedings of the 29th International Conference on Software Engineering Workshops, Washington, DC, USA, 2007, p. 5.
[9] Mockus, A. and Herbsleb, J. D. Expertise browser: a quantitative approach to identifying expertise. ICSE '02: Proceedings of the 24th International Conference on Software Engineering, New York, NY, USA, 2002, pp. 503-512.
[10] Skills & Competency Management Software by StepStone Solutions. [Online] StepStone. [Cited: March 11, 2008.] http://www.stepstonesolutions.com/Solutions/Skills_Competency_Management/Skills_Competency_Management.php
[11] SAP ERP Human Capital Management. [Online] SAP. [Cited: March 11, 2008.] http://www.sap.com/solutions/business-suite/erp/hcm/index.epx
[12] Rus, I., Lindvall, M. and Sinha, S. S. Knowledge Management in Software Engineering. DACS State-of-the-Art Report, Data & Analysis Center for Software (DACS), 2001. http://www.cebase.org:444/umd/dacs_reports/kmse_-_nicholls_final_edit_11-16-01.pdf
[13] Basili, V. et al. An Experience Management System for a Software Engineering Research Organization. SEW '01: Proceedings of the 26th Annual NASA Goddard Software Engineering Workshop, Washington, DC, USA, 2001, p. 29.
[14] Ackerman, M. S. and Malone, T. W. Answer Garden: a tool for growing organizational memory. Proceedings of the ACM SIGOIS and IEEE CS TC-OA Conference on Office Information Systems, New York, NY, USA, 1990, pp. 31-39.
[15] ActiveNet - Facilitating Real-Time Collaboration. [Online] Tacit Software. http://www.tacit.com/products/activenet/technology.html
[16] Calejo, M. InterProlog: Towards a Declarative Embedding of Logic Programming in Java. In: Alferes, J. J. and Leite, J. A. (eds.), Logics in Artificial Intelligence, 9th European Conference, JELIA 2004, Lecture Notes in Computer Science 3229, Lisbon, Portugal, 2004, pp. 714-717. DOI 10.1007/b100483. ISBN 3-540-23242-7.
[17] Beck, K. and Andres, C. Extreme Programming Explained: Embrace Change. 2nd Edition. Amsterdam: Addison-Wesley Longman, 2004. ISBN 0321278658.
[18] Beck, K. et al. Manifesto for Agile Software Development. [Online] 2001. http://agilemanifesto.org