Information Sciences xxx (2013) xxx–xxx
Information Sciences journal homepage: www.elsevier.com/locate/ins
Creating Process-Agents incrementally by mining process asset library
Hui Huang a,b, Junchao Xiao a, Qiusong Yang a,⇑, Qing Wang a, Hong Wu a,b
a Laboratory for Internet Software Technologies, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
b Graduate University of Chinese Academy of Sciences, Beijing 100049, China
Article info
Article history: Received 1 January 2010; Received in revised form 12 November 2012; Accepted 1 December 2012; Available online xxxx
Keywords: Software process; Human resource; Clustering; Process-Agent; Trustworthiness
Abstract
Software process trustworthiness is the degree of confidence that a software process produces the expected trustworthy work products that satisfy requirements. Software processes are dynamic and highly people-dependent. The performance of a software process relies not only on the process itself but also on the capabilities of the personnel. Therefore, managing human resources and evaluating a company's workforce capabilities are crucial and affect software process trustworthiness. Our software process modeling method OEC-SPM (Organization-Entity Capability based Software Process Modeling) has been shown to take personnel capabilities into account and to group software developers with certain capabilities into a Process-Agent, which is a way of organizing human resources and process asset libraries in software organizations and helps to improve the trustworthiness of software processes. This paper proposes a novel method for incrementally mining Process-Agents from process asset libraries to support OEC-SPM. The method can automatically and incrementally create Process-Agents under three scenarios with high efficiency. Furthermore, we assess the method with data from a real industrial setting. The results show that the utilization of human resources in an organization can be optimized when personnel capabilities are taken into account. Additionally, reasonable resource scheduling that makes use of Process-Agents results in higher trustworthiness. © 2013 Elsevier Inc. All rights reserved.
⇑ Corresponding author. E-mail addresses: [email protected] (H. Huang), [email protected] (J. Xiao), [email protected] (Q. Yang), [email protected] (Q. Wang), [email protected] (H. Wu).
0020-0255/$ - see front matter © 2013 Elsevier Inc. All rights reserved. http://dx.doi.org/10.1016/j.ins.2012.12.052
Please cite this article in press as: H. Huang et al., Creating Process-Agents incrementally by mining process asset library, Inform. Sci. (2013), http://dx.doi.org/10.1016/j.ins.2012.12.052
1. Introduction
Software process trustworthiness is defined as the degree of confidence that a software process produces expected trustworthy work products that satisfy their requirements [3,53]. It is important during software development [3], where the quality of software largely depends on the quality of the software development process [36]. According to [2,50], software processes are dynamic and highly people-dependent. The capabilities of personnel and the allocation of human resources among process activities therefore determine whether a process can be performed as expected, which affects software process trustworthiness and, in turn, software trustworthiness. Considering the influence of human resources' capabilities on software process trustworthiness, we proposed an Organization-Entity Capability based Software Process Modeling method (OEC-SPM) [50,63]. In OEC-SPM, a Process-Agent (PA) is used to group human resources with similar capabilities (goals, skills, processes, experiences, etc.). Additionally, each PA includes knowledge accumulated in historical projects, such as descriptive knowledge, process knowledge and experience data. Through a group of well-defined PAs, not only can the OEC-SPM method be well supported, but human
resources can be better utilized and their capabilities can be accurately evaluated. Based on the improved utilization of human resources, a software process with high trustworthiness can be realized through reasoning and cooperation among PAs.
PAs can be created by managers according to their tacit knowledge about their workforce. However, it is difficult to judge the capabilities of developers objectively. Furthermore, a lot of knowledge about human resources, e.g. descriptive knowledge, process knowledge and experience data, needs to be extracted for PA creation, so creating PAs manually is tedious and error-prone.
In mature software development companies, such as those that have achieved CMMI (Capability Maturity Model Integration) maturity level 3 or higher [7], a massive amount of organizational process assets, including standard processes, project processes and the corresponding process performance data, is usually maintained for directing project development. By analyzing the process asset library, descriptive knowledge, process knowledge and experience data that reflect human resources' capabilities can be extracted. From this data, PAs are created by combining human resources who have similar capabilities, which supports better utilization of those resources. Risks such as the departure of human resources can be reduced. Through improved human resource allocation and more stable process execution, software process trustworthiness is improved.
To obtain a high level of software process trustworthiness, a mature method for creating Process-Agents should have the following characteristics:
(1) Efficient. Data processing for PA creation is usually time-consuming and space-consuming because the amount of process execution data in an organization's process asset library is very large. A mature PA creation method should handle such issues efficiently.
(2) Flexible.
Software processes are dynamic and the capabilities of human resources change over time, so a mature PA creation method should be able to update PAs dynamically when those capabilities change.
(3) Comprehensive. To evaluate human resources accurately, all data that reflect their capabilities, including descriptive knowledge, process knowledge and experience data, should be obtained through the method.
In our previous work [50], the PA structure was outlined, and a tool for creating PAs was described in [58]. The tool, however, does not meet efficiency requirements and cannot create PAs incrementally and dynamically, i.e. it has to create PAs from scratch whenever the capabilities of human resources change. The experience data, which is an important part of determining human resources' capabilities, is also not considered in [58]. In addition, the tool cannot assure the quality of PAs and thus lowers the trustworthiness of software processes.
To overcome these shortcomings, this paper presents a method for creating PAs incrementally by mining the process asset library. The method adopts a representative tree based heuristic tree clustering algorithm that makes it highly efficient. According to the variability of human resources and their experience data, three scenarios of PA creation are considered in order to make the creation more flexible: (1) creating PAs from scratch; (2) creating PAs incrementally for new human resources; (3) creating PAs incrementally for human resources whose capabilities have changed. Moreover, both process knowledge and experience data are taken into account when determining human resource capabilities. Finally, an empirical study of mining the process asset library in a software enterprise has been performed.
As demonstrated in the empirical study, the introduction of PAs not only optimized the utilization of human resources but also revealed problems in process execution and human resource management, such as incorrect usage of standard processes or too many roles for one employee, which gives further decision support for software process improvement.
The goal of this paper is to create PAs to better utilize human resources, evaluate their capabilities and support the OEC-SPM method. The contributions of the paper are as follows:
(1) The method improves efficiency by adopting a representative tree based heuristic tree clustering algorithm and creating PAs incrementally.
(2) The method can create PAs dynamically under the three scenarios, which suits the dynamics of software processes.
(3) The method organizes human resources more accurately and determines their capabilities by extracting both process information and experience data.
(4) Using PAs, human resources are organized more efficiently and their capabilities are clearly evaluated, which helps to improve the trustworthiness of software processes and gives further decision support.
The remainder of this paper is organized as follows. Section 2 discusses related work. Section 3 gives the background of the PA-based software process modeling method and the general structure of the process asset library. The method for creating PAs by mining the process asset library is presented in Section 4. Based on data from a software enterprise, Section 5 applies our method and further discusses decision support for software process improvement. Section 6 concludes the paper and discusses future work.
2. Related work
Amoroso pointed out that there are two possible approaches to assessing trustworthiness [3]. The first approach emphasizes techniques for examining software products directly, and a lot of work has been done on it. For example, [26,43,59] provided methods to evaluate the performance of software projects and the quality of software. Software defect/fault/bug detection and analysis methods were extensively studied in [4,6,27,28,32,35,37,39,41,45], and many support tools were developed to help examine the quality of software products [9,12,15,42]. Because of the combinatorial explosion problem, exhaustive testing is practically impossible [56]. Therefore, the second approach, which emphasizes examination of the processes used to create these products, has also been studied extensively. For example, Peng et al. studied risk assessment and management during software development in [5,38]. Methods for software process representation were provided in [21,34], and methods for software process description were provided in [25,54]. In addition, many software process models, such as CMMI and OEC-SPM [7,50,63], were provided to help build trustworthy software processes.
Process extraction is an important research topic in the software process literature and has been a focus of the second approach in recent years. A lot of important information can be found by mining development processes from existing repositories, which gives decision support for process improvement. For example, Jensen et al. proposed using text analysis techniques for extracting instances of process meta-model entities from community repositories [24]. VanHilst et al. presented a method applying artifact mining in a global development environment to support measurement-based process management and improvement [46]. Huo et al.
proposed a method for detecting consistent patterns in process enactment data in order to uncover the actual development process and thereby provide evidence for improving the quality of a planned software process that an organization will follow in the future [20]. Rember et al. provided a preliminary approach to mining multiple perspectives of a business process, whose different characteristics can be measured [40]. Wen et al. proposed a method to mine process models with "non-free-choice" process structures [51]. Several challenging problems were identified by Aalst in [1], including mining processes with hidden tasks, duplicate tasks, non-free-choice constructs and loops, using time, mining different perspectives and dealing with noise. However, software processes are dynamic and highly people-dependent [2,50]. Although mining multiple perspectives of processes was mentioned in [1,40], these works did not consider processes from the standpoint of the individual human resource.
As the latter approach, which emphasizes the examination of processes, focuses on the manner in which activities are performed during an actual development lifecycle, the burden of demonstrating trustworthiness is placed on the developers [3]. Therefore, developers' capabilities ultimately affect the trustworthiness of software. As one of the nine areas (integration, scope, time, cost, quality, human resources, communications, risk, procurement) in project management [30], human resources are regarded as the largest source of variation in project performance [2,50]. Many researchers evaluate the behaviors and capabilities of developers by mining data from different kinds of repositories. Historical data not only helps us to understand software [16], but can also tell us a lot about developers. For example, Liu et al.
used the historical information stored in their CVS repository to understand how students interact and to find the correlation between students' grades and the nature of their collaboration [29]. Mierle et al. defined various quantitative metrics for student behavior and code quality and correlated these metrics with grades [31]. Gousios et al. introduced their work on mining developer distribution from software repository data [17]. Yu et al. demonstrated how to understand open-source developers' roles by mining CVS repositories [55]. Zhang et al. mined individual performance in collaborative development environments from code bases, version control systems and bug tracking systems [60–62]. However, these methods do not analyze the project management data of process execution, such as effort and productivity in each activity and the processes used by human resources, and most of them do not consider the capability similarity between human resources.
The performance of processes is related not only to the processes themselves but also to the capabilities of developers. As different developers might be good at different development activities, the relationships between human resources and software processes, reflected in the process execution data, are very important. As a result, mining the process capabilities of human resources from the process asset library is useful for process improvement and process trustworthiness. Although Zhang et al. presented a method to assess the capability of individual software development processes [61], they neither extracted individual processes nor used the project management data. To address this issue, our method extracts an individual process for each developer, and evaluates and organizes human resources using their experiential project data mined from organizational process asset libraries, which consist of a large amount of enactment data generated during process executions.
From these data, the roles of human resources, the activities that human resources can assume, their effort and productivity, and the quality of process execution can be mined. This can then help organizations to make project plans, manage human resources and improve software processes.
3. Process-Agent and process asset library
3.1. Process-Agent
A PA comprises two parts: Infrastructure and Engine [50]. Infrastructure is the base of reasoning and cooperation. Its structure is shown in Fig. 1. It comprises three kinds of knowledge: Descriptive Knowledge, Process Knowledge and Experience Library [50].
[Figure 1: diagram relating the Descriptive Knowledge (Process-Agent goals 1..n, Process-Agent skills 1..m, and basic information of human resources in the Process-Agent), the Process Knowledge (a tree of steps) and the Experience Library (step performance data). Edge labels: Realize, Determine, Execute step, Generate data, Estimate resources needed by the step execution.]
Fig. 1. Knowledge in Process-Agent.
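To make the structure in Fig. 1 concrete, the following is a minimal Python sketch of the Infrastructure as plain data types. All class, field and function names are illustrative assumptions and do not come from the OEC-SPM tooling. The estimation function is likewise only one naive way in which experience data might be used for prediction (mean historical workload per unit of work product size); the paper does not prescribe this formula.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List, Optional


@dataclass
class ExperienceItem:
    """One historical execution record of a step (hypothetical fields)."""
    planned_workload: float  # e.g. person-hours
    actual_workload: float
    product_size: float      # work product size in any consistent unit


@dataclass
class Step:
    """A node of the process step tree (Process Knowledge)."""
    name: str
    implementation: str = "DIRECT"   # DIRECT or SUBPROCESS
    kind: Optional[str] = None       # Choice, Parallel or Sequence
    children: List["Step"] = field(default_factory=list)
    experience: List[ExperienceItem] = field(default_factory=list)


@dataclass
class Infrastructure:
    """Descriptive Knowledge plus the root of the process step tree."""
    goals: List[str]
    skills: List[str]
    members: List[str]   # basic information reduced to names here
    root: Step


def estimate_workload(step: Step, size: float) -> Optional[float]:
    """Assumed estimator: mean historical workload per size unit times size."""
    rates = [e.actual_workload / e.product_size
             for e in step.experience if e.product_size > 0]
    if not rates:
        return None  # no usable experience data for this step
    return mean(rates) * size
```

Attaching the experience items directly to each step mirrors the statement that every node of the process step tree carries its own experience data.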
(1) Descriptive Knowledge: describes what a PA can do, including the goals it can realize, its skills and a description of the human resources inside the PA.
(2) Process Knowledge: describes what a PA does to realize its goals. It consists of a group of steps that the PA needs to execute to realize the goals described in the descriptive knowledge [52]. Each step in process knowledge has properties such as Implementation and Kind, which are used to describe PA behaviors.
(3) Experience Library: organizes the experience data of the human resources involved in PAs. The experience data is accumulated gradually during the execution of historical projects. Each data item may have attributes such as planned and actual workload, start time, end time and work product size. The data in an experience library can be used to predict time, workload, cost, quality and so on when a step described in process knowledge is performed in a new project.
The three parts of Infrastructure are closely interrelated and indispensable. As shown in Fig. 1, process knowledge inside PAs can be organized as a tree, called the process step tree. Each node in the tree is a process step with experience data in an experience library, such as workload and actual execution time, reflecting the performance of human resources on the associated development activities. Meanwhile, descriptive knowledge depends on both process knowledge and the experience library. Based on the above knowledge, PAs can effectively evaluate human resource capabilities and organize them accordingly.
As another integral part of PAs, Engine provides an acting mechanism for PAs, which reasons about PA behavior based on their Infrastructure and is driven by the external environment. When a PA starts to work, it continually perceives the environment. If development goals are put into the environment, a PA can respond actively and autonomously.
Through reasoning and negotiation inside the Engine, a project development plan can be constructed and human resources can be allocated to realize the set goals. During process execution, enactment data will be collected by PAs in order to optimize and improve their capabilities.
As a PA comprises large amounts of information, such as descriptive knowledge, process knowledge and the experience library, creating PAs manually is tedious and error-prone. A method is needed to create PAs automatically from the organizational process asset libraries.
3.2. Process asset library
Organizations with higher maturity levels usually establish infrastructures for collecting the various kinds of data produced during project development. The data includes descriptions of projects, processes adopted by projects, tasks, human resources and their performance data, such as the productivity when executing a task. These data are stored in process asset libraries. In addition, a process asset library may contain standard processes for task allocation in project development processes [7]. Generally, a process asset library (PAL) is structured as shown in Fig. 2. The entities in a process asset library are described as follows:
(1) Activity of standard process: a standard process consists of a group of standard activities. This entity describes the name, description, input, output, and pre- and post-conditions of activities.
(2) Project: includes the name, description and type of a project. Part of the descriptive knowledge of PAs may come from this entity.
(3) Task: a project process usually consists of a group of tasks at different levels. Parent tasks are split into sub-tasks, and all the tasks in a project form a task tree. A task entity specifies various attributes of a task, such as name, corresponding standard process activity (if the task is not generated from the standard process, this item is recorded as "null"), parent-task ID, planned and actual start/end time and workload, and the input/output work products. Information on the development process will be extracted from this entity.
(4) Task member: includes the planning information about the tasks that each human resource executes, such as the planned start/end time and effort. Part of the experience library of PAs will be created from this entity.
(5) Member report: contains the actual information about task execution in each report period. It includes information such as the actual start/end time and the workload, which are also part of the experience library.
(6) Human resource: describes information about the human resource, such as name, address and the timetable of resource availability. The descriptive knowledge of PAs will be mainly extracted from these data.
With these data and their relationship information in process asset libraries, all the projects and tasks that each human resource has experienced in a software organization can be determined. In addition, human resources' performance data in the execution of tasks can be obtained. Consequently, the capabilities of human resources in process execution can be determined.
4. Creating PAs incrementally by mining PAL
Human resources in organizations and their capabilities change over time. Creating PAs from scratch every time is therefore time-consuming and causes instability during process execution. To adapt to the dynamics of software processes, PAs can be incrementally and dynamically created in three scenarios (see Fig. 3).
Scenario 1 – Creating PAs from scratch.
In this scenario, there are no existing PAs, so all PAs are created from scratch. First, a set of human resources with information on the development process and experience data is extracted directly from the process asset library. The PAs are then created by a data clustering algorithm. After PA creation, the process asset library is in turn reorganized by these PAs.
Scenario 2 – Creating PAs incrementally for new human resources. In this scenario, some PAs have already been created, but new human resources have been added and should be distributed to the proper PAs. They can be processed using the same creation process as applied in Scenario 1; the data clustering step and the reorganization of the process asset library thereby cover all human resources, i.e. existing and new ones.
Scenario 3 – Creating PAs incrementally for human resources whose capabilities have changed. In this scenario, some PAs have already been created, but the knowledge and capabilities of some human resources have changed (e.g. after Personal Software Process [19] training, organizational process improvement or taking part in more projects). These human resources should be redistributed to the proper PAs if necessary. Through incremental creation, the actual capabilities of human resources can be reflected.
Algorithm 1 deals with the above three scenarios. It adopts the representative tree based heuristic tree clustering algorithm presented in [33] and creates PAs incrementally by merging human resources who have similar capabilities through clustering analysis. A typical clustering activity involves several steps: pattern representation, data abstraction, proximity measure definition, clustering and assessment of output [22,23]. In this work the pattern is represented as a tree, and each step in the tree has corresponding experience data.
The subsequent sections show how to extract data, cluster human resources, define the similarity score and assess the output.
[Figure 2: entity diagram of the process asset library. Entities and attributes: Project (Name, Description, Type, ...); Human Resource (Name, Address, TimeTable); Task (Name, StandardActivityID, ParentTaskID, InputWorkproduct, OutputWorkproduct, PlanStartTime, PlanEndTime, ActualStartTime, ActualEndTime, PlanWorkload, ActualWorkload, ...); Task Member (PlanStartTime, PlanEndTime, PlanWorkload, ...); Member Report (ActualStartTime, ActualEndTime, ActualWorkload, Workproduct, ...); Activity of Standard Process (Name, Input, Output, Pre-Condition, Post-Condition, ...).]
Fig. 2. Entities of process asset library.
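As an illustration of how the entities in Fig. 2 relate a human resource to task performance data, the sketch below models a subset of the entities in Python and sums the reported workload per task for one human resource. The identifier fields (task_id, member_id, hr_id) are assumed foreign keys introduced only for this example; the actual schema of a process asset library may differ.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Task:
    task_id: int
    name: str
    standard_activity_id: Optional[int]  # None if not from a standard process
    parent_task_id: Optional[int]        # None for the root of the task tree


@dataclass
class TaskMember:
    member_id: int       # assumed key linking reports to this assignment
    task_id: int
    hr_id: int
    plan_workload: float


@dataclass
class MemberReport:
    member_id: int
    actual_workload: float  # workload reported for one report period


def actual_workload_per_task(hr_id: int,
                             members: List[TaskMember],
                             reports: List[MemberReport]) -> Dict[int, float]:
    """Sum the reported workload of one human resource for each task."""
    by_member: Dict[int, float] = defaultdict(float)
    for report in reports:
        by_member[report.member_id] += report.actual_workload
    totals: Dict[int, float] = defaultdict(float)
    for member in members:
        if member.hr_id == hr_id:
            totals[member.task_id] += by_member[member.member_id]
    return dict(totals)
```

Assignments without any report yet appear with a workload of 0.0, which keeps planned but unreported tasks visible in the result.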
[Figure 3: diagram with the elements PAL, PAL organized by PA and Process-Agents, and the actions "Create PAs from scratch", "Create PAs incrementally for new HRs" and "Create PAs incrementally for HRs whose capabilities have changed".]
Fig. 3. Process-Agent creation.
The algorithm takes as input a process asset library as defined in Section 3.2 and a similarity threshold LIMIT (see Definition LIMIT below). The process asset library is the source from which the necessary data are extracted. Two step trees are identified as similar, and can be composed, only when the similarity score between them is larger than LIMIT. The output of the algorithm is a set of created PAs, i.e. human resources grouped by their capabilities, together with descriptive knowledge, process knowledge and an experience library.
Definition LIMIT: LIMIT is the similarity threshold used to assure capability similarity among the human resources in one Process-Agent.
Algorithm 1. AgentCreator algorithm
Input: Process asset library (PAL) and similarity threshold LIMIT
Output: Process-Agents
AgentCreator ()
  Initiate the ClusterSet
  Get human resources HRs from PAL
  For each human resource hr ∈ HRs
  Begin
    Extract data from PAL and construct process step tree T for hr
    For each cluster C ∈ ClusterSet
    Begin
      Calculate similarity score Score(T, C) between T and C
      If (Score(T, C) > MaxScore)
      Begin
        MaxScore = Score(T, C)
        MaxCluster = C
      End
    End
    If (MaxScore > LIMIT)
    Begin
      Add hr to MaxCluster
      Compose T and MaxCluster to produce a new representative tree T' for MaxCluster
    End
    Else
    Begin
      Add hr to ClusterSet as a new cluster and T is the representative tree
    End
  End
  For each cluster C ∈ ClusterSet
  Begin
    Construct a Process-Agent according to C
  End
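The control flow of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the similarity score and the tree composition are passed in as functions because their definitions belong to Sections 4.2 and 4.3, and the `jaccard` and `union` helpers are toy stand-ins that treat a step tree as a plain set of step names. The sketch also assumes that the best score is reset for every human resource, which the pseudocode leaves implicit. Passing `initial_clusters` corresponds to starting from existing PAs in Scenarios 2 and 3.

```python
from typing import Callable, Dict, List, Optional, Set


class Cluster:
    """A group of human resources with one representative tree."""

    def __init__(self, members: List[str], tree):
        self.members = list(members)
        self.tree = tree


def agent_creator(trees: Dict[str, object],
                  limit: float,
                  score: Callable[[object, object], float],
                  compose: Callable[[object, object], object],
                  initial_clusters: Optional[List[Cluster]] = None) -> List[Cluster]:
    """Assign each human resource to its most similar cluster or open a new one."""
    clusters = list(initial_clusters or [])  # empty in Scenario 1
    for hr, tree in trees.items():
        best, best_score = None, float("-inf")  # reset per human resource
        for cluster in clusters:
            s = score(tree, cluster.tree)
            if s > best_score:
                best, best_score = cluster, s
        if best is not None and best_score > limit:
            best.members.append(hr)
            best.tree = compose(tree, best.tree)  # new representative tree
        else:
            clusters.append(Cluster([hr], tree))
    return clusters


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Toy similarity over sets of step names (not the paper's score)."""
    return len(a & b) / len(a | b) if (a | b) else 1.0


def union(a: Set[str], b: Set[str]) -> Set[str]:
    """Toy composition: keep every step seen in either tree."""
    return a | b
```

Because each human resource is compared only with one representative tree per cluster rather than with every other human resource, the number of similarity computations grows with the number of clusters, which is the source of the efficiency gain the method aims for.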
(1) Cluster set initialization. The cluster set is initialized differently for the three scenarios of PA creation. For Scenario 1, the cluster set is initialized as empty. For Scenario 2, the cluster set is initialized with the existing PAs, which are the basis for creating PAs incrementally. For Scenario 3, as in Scenario 2, the cluster set is initialized with the PAs created before.
(2) Data extraction. In this step a process step tree is constructed and experience data is extracted from the process asset library for each human resource. The method for extracting personal data is described in detail in Section 4.1. The three scenarios differ slightly:
For Scenario 1, because no PA exists and the process asset library is processed for the first time, data is extracted from the whole process asset library for every human resource.
For Scenario 2, because the data of the original human resources has already been extracted, only the data of the newly added human resources in the process asset library is extracted. This involves much less process asset library data and improves the efficiency of the algorithm.
For Scenario 3, only newly accumulated data from the process asset library is extracted for each human resource. Human resources who do not have new data are ignored and kept in their original status, which also improves the efficiency of the algorithm. A new process step tree of a human resource is constructed by merging the newly extracted data with the representative process step tree of the cluster to which the human resource belonged. The rules for merging used in this paper are:
(a) For each human resource, extract her/his new personal data using the algorithm given in Section 4.1. Human resources who do not have newly extracted data are ignored. Thus each remaining human resource has a new process step tree.
(b) For each extracted human resource, find the representative tree of the cluster to which she/he originally belonged and put all knowledge from the representative tree into the new process step tree.
(c) Remove the human resource from the original cluster. If the cluster no longer contains any human resource, remove the cluster from the cluster set.
(3) Similarity score calculation. The similarity score is a measure of capability similarity between human resources. It makes sure that the human resources in one PA have similar capabilities. Corresponding to the two important kinds of knowledge in a PA (Process Knowledge and Experience Library), the similarity score is composed of two parts: the process knowledge similarity score and the experience library similarity score. The calculation of the similarity score is shown in Section 4.3.
(4) Clustering. For every human resource, the similarity score between her/his process step tree and the representative tree of each cluster is calculated to identify the highest score. If the highest score is greater than LIMIT, the human resource is assigned to that cluster, and a new representative tree is generated by composing her/his process step tree with the representative tree of the cluster. Otherwise, the human resource is added to the cluster set as a new cluster, and her/his process step tree is the representative tree of the new cluster. The composition of step trees is shown in Section 4.2.
(5) Process-Agent construction. After the previous four steps, human resources with their process knowledge and experience data are grouped into clusters. For each cluster, the name of the PA is generated, and descriptive knowledge is constructed for the PA according to the basic information about its human resources. PA construction is completed by applying the information on the development process and the experience library.
4.1. Data extraction
Extracting process knowledge and experience data is the premise for PA creation.
In this section, data about human resources is extracted from the process asset library to measure their capabilities. The process for extracting personal data is as follows:
(1) Extract all task information: for each human resource, all data of the historical projects in which the human resource has participated is extracted from the process asset library, including project information, task information and the performance data accumulated in the projects.
(2) Eliminate tasks that cannot determine the capabilities of the human resource: the tasks in historical projects usually constitute a task tree, and the data that determines the capabilities of a human resource involves two parts. One part is the tasks in which the human resource has participated. It determines what the human resource can do and how the human resource performs. The other part is the sub-tasks of the tasks in which the human resource has participated. It shows that the human resource can realize the task by assigning sub-tasks to others through cooperation. Accordingly, the Implementation of a task is set according to whether the human resource is a performer of the task:
1. Yes: the Implementation of the corresponding step is set as DIRECT.
2. No: the Implementation of the corresponding step is set as SUBPROCESS, which means that executing the step requires cooperation.
Please cite this article in press as: H. Huang et al., Creating Process-Agents incrementally by mining process asset library, Inform. Sci. (2013), http://dx.doi.org/10.1016/j.ins.2012.12.052
Therefore, only the tasks in which the human resource has participated and their sub-tasks are retained in this step. Fig. 4 shows an example (the letter in brackets stands for the task executor).
(3) Eliminate tasks not imported from standard processes: a mature organization usually has standard processes, which are used to direct software process development. Only tasks imported from standard processes are useful and comparable. Thus only tasks imported from standard processes are retained, while the others are eliminated.
(4) Extract experience data of the human resource for each task: the execution data of each task are extracted for the human resource, including the planned and actual start time, end time, workload and work product size. The rule is that if the human resource realizes the task directly, her/his own execution data on that task are taken as her/his experience data; if she/he realizes the task through cooperation, the execution data of all task members on that task are taken as her/his experience data.
(5) Merge similar tasks: tasks that are imported from the same standard process step and have the same Implementation are merged into one task.
(6) Construct process steps: based on the previously extracted tasks, one step is constructed for each task, and the property Kind is set for each process step. According to the human resources that execute the task and the relationship between the task and the others in the project software process, Kind is set using the following rules:
If a process step has children which are imported from the same standard process activity, set the Kind property of this step as “Choice”.
If the task has sub-tasks and the execution times of the sub-tasks overlap, set the Kind property of this step as “Parallel”.
If the task has sub-tasks and the execution times of the sub-tasks do not overlap, set the Kind property of this step as “Sequence”.
Else set the Kind property of this step as “Leaf”.
The other properties of this step are set according to the attributes of the task and the standard process. Relationships between steps are constructed according to the parent–child and sibling relationships in the standard process.
(7) Construct the process step tree: if the steps do not yet form a single process step tree, a virtual root is added to combine the forest into one tree. The Kind property of the virtual root is set as “Choice”. The process step tree of the human resource is now constructed.
4.2. Process step trees composition
In [33] Mostafa et al. provided an algorithm to compose two trees into one representative tree, which is the basis of the process step tree composition algorithm used in this work. Before introducing the algorithm, two important definitions used in it are given.
Definition (distinct path): The distinct paths of a tree T are constructed by traversing T in preorder. During this operation: (a) whenever a backtracking (moving from a node to its ancestor) occurs, a new distinct path is constructed and added to the end of the list of distinct paths, and (b) whenever a node is reached, it is added to the end of the last constructed distinct path. Each distinct path has two properties, depth and parent, which are the depth and the parent of the first node in the path.
Definition (same parent): Two distinct paths have the same parent if their parents have the same label, the same Implementation and the same Kind.
Given a human resource, she/he should be assigned to the proper cluster by clustering, and the representative tree of the new cluster is the composition of the human resource's process step tree and the representative tree of the original cluster. The algorithm in [33] for composing two trees finds the possible common part of the distinct paths of the two trees and uses the common part to compose a new tree that represents the characteristics of both trees.
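To make the definition of distinct paths concrete, the following is a minimal Python sketch of their construction. The classes `Step` and `DistinctPath` and the function `distinct_paths` are hypothetical names introduced only for illustration and are not taken from the tool described in this paper. The sketch assumes that a run of consecutive backtracking moves opens only one new path and that no empty path is recorded at the end of the traversal; the definition above does not state either detail.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """Hypothetical process step node; only the label matters for this sketch."""
    label: str
    children: List["Step"] = field(default_factory=list)


@dataclass
class DistinctPath:
    depth: int               # depth of the first node in the path (root = 0)
    parent: Optional[str]    # label of the first node's parent (None for the root)
    nodes: List[str] = field(default_factory=list)


def distinct_paths(root: Step) -> List[DistinctPath]:
    """Build the distinct paths of a tree by a preorder traversal."""
    paths: List[DistinctPath] = []
    start_new = True  # the first node reached opens the first path

    def visit(node: Step, depth: int, parent: Optional[str]) -> None:
        nonlocal start_new
        if start_new:
            # Assumption: a new path is created lazily, when the next node
            # is reached after one or more backtracking moves.
            paths.append(DistinctPath(depth, parent))
            start_new = False
        paths[-1].nodes.append(node.label)
        for child in node.children:
            visit(child, depth + 1, node.label)
            # Returning from a child is a backtracking move.
            start_new = True

    visit(root, 0, None)
    return paths


# Example tree: A has children B and C; B has children D and E.
tree = Step("A", [Step("B", [Step("D"), Step("E")]), Step("C")])
for p in distinct_paths(tree):
    print(p.nodes, p.depth, p.parent)
```

For this example tree the traversal yields three distinct paths: A-B-D (depth 0, no parent), E (depth 2, parent B) and C (depth 1, parent A).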
Based on the composition algorithm provided by [33], a new composition algorithm that is more suitable for our work is given. First, the distinct paths of both the process step tree of the human resource and the representative tree of the original cluster are constructed. Then the two distinct path lists are compared, and the trees are composed by applying the following rules:
[Figure: task tree diagram; each node is labeled Task(executor), for example Task2(HR1).]
Fig. 4. Task elimination for human resource HR1.
(1) If two distinct paths have the same parent, find the longest common subpath of the two distinct paths. If it exists, add it to the composed tree and select the next two distinct paths from the two lists. If it does not exist, the distinct path whose list has more remaining paths is ignored and the next distinct path is selected from its list.
(2) If two distinct paths have different parents and different depths, the deeper one is ignored, and the next distinct path is selected from its list.
(3) If two distinct paths have different parents but equal depth, the distinct path whose list has more remaining paths is ignored and the next distinct path is selected from its list.
However, the algorithm provided in [33] does not always find the longest common subpath. For example, for the two paths ‘1-4-5-6-2-3’ and ‘1-3-5-6-4-2’, the algorithm finds ‘1-4-2’, but the expected longest common subpath is ‘1-5-6-2’. The reason is that the authors adopted a greedy algorithm, which finds a locally optimal solution instead of the globally optimal one. Finding the longest common subsequence is a classic problem that can be solved by dynamic programming. Let X_i = {x_1, x_2, ..., x_i} and Y_j = {y_1, y_2, ..., y_j} be two sequences, and let c[i][j] be the length of the longest common subsequence of X_i and Y_j. The following recurrence holds, on which the dynamic programming algorithm is based (see [44] for an implementation).
\[
c[i][j] =
\begin{cases}
0, & i = 0 \text{ or } j = 0 \\
c[i-1][j-1] + 1, & i, j > 0,\ x_i = y_j \\
\max\{c[i][j-1],\ c[i-1][j]\}, & i, j > 0,\ x_i \neq y_j
\end{cases}
\tag{1}
\]
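As an illustration of the recurrence in Eq. (1), the following Python sketch fills the table c and then backtracks through it to recover one longest common subsequence. The function name `longest_common_subsequence` is hypothetical. Nodes are compared with plain equality here, whereas a comparison of process steps would presumably also take Implementation and Kind into account; that detail is an assumption and is left out.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def longest_common_subsequence(x: Sequence[T], y: Sequence[T]) -> List[T]:
    """Return one longest common subsequence of x and y (dynamic programming)."""
    m, n = len(x), len(y)
    # c[i][j] is the LCS length of x[:i] and y[:j]; row 0 and column 0 stay 0.
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i][j - 1], c[i - 1][j])

    # Walk back from c[m][n] to collect the elements of one optimal solution.
    result: List[T] = []
    i, j = m, n
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            result.append(x[i - 1])
            i -= 1
            j -= 1
        elif c[i - 1][j] >= c[i][j - 1]:
            i -= 1
        else:
            j -= 1
    result.reverse()
    return result


# The two paths from the example in the text.
print(longest_common_subsequence([1, 4, 5, 6, 2, 3], [1, 3, 5, 6, 4, 2]))
# prints [1, 5, 6, 2]
```

The table has (m + 1) × (n + 1) entries, so both the running time and the memory use are proportional to the product of the two path lengths.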
4.3. Similarity score calculation
4.3.1. Process knowledge similarity score
Process knowledge is organized as a process step tree, which indicates how a PA realizes its work. The process knowledge similarity score (PS) is an approximate measure of the structural similarity between trees. To keep it simple and comprehensive, PS is defined as follows, with PS ∈ [0, 1].
\[
PS(\text{tree } T, \text{cluster } C) = \frac{N}{M} = \frac{\sum_{\forall i} N_i}{M} \tag{2}
\]
where N is the size of the composed tree of T and C, M is the larger of the sizes of T and C, i ranges over the compose relations between two distinct paths, and N_i is the size of the composed path of the two distinct paths in relation i.
4.3.2. Experience library similarity score
The experience library reflects how well a PA can do the work. As emphasized by CMMI, etc., performance and stability are very important for a process, so the experience library similarity score (ES) is measured from these two aspects. In [49] Wang et al. suggested two statistics, the average μ and the standard deviation σ, which can be taken as indicators of a PA's performance and stability. When μ is close to the expectation, the performance of a PA is high, and the smaller σ is, the more stable the PA is. Therefore μ and σ are used to calculate ES. Assume that n_dim dimensions of the human resources' experience data (such as schedule and effort) need to be measured. The deviation of dimension i is defined as dev_i = (actual_i − plan_i)/plan_i. Then ES is defined as follows, with ES ∈ [0, 1].
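\[
ES(\text{tree } T, \text{cluster } C) = 1 - \sum_{i=1}^{n\_dim} \omega_i \times \frac{|\mu_{ti} - \mu_{ci}|}{|\mu_{ti}| + |\mu_{ci}|} \times \frac{|\sigma_{ti} - \sigma_{ci}|}{\max\{\sigma_{ti}, \sigma_{ci}\}} \tag{3}
\]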
where
\[
\sum_{i=1}^{n\_dim} \omega_i = 1
\]
and n_dim is the number of dimensions to be measured, ω_i is the weight of the ith dimension, μ_ti is the average of dev_i in tree T, σ_ti is the standard deviation of dev_i in tree T, and μ_ci and σ_ci are the corresponding values for cluster C.
4.3.3. Similarity score
A PA involves a group of human resources who have similar capabilities. It contains process knowledge, which indicates how it does the work, and an experience library, which reflects how well it can do the work. Therefore human resources in one PA should be similar in both process knowledge and experience library. Because descriptive knowledge depends on the other two parts of knowledge, it can be represented by process knowledge and experience library, and is therefore not considered here. The similarity score of two process step trees is defined as follows, with Score ∈ [0, 1].
\[
Score(\text{tree } T, \text{cluster } C) = \omega_{ps} \times PS(T, C) + \omega_{es} \times ES(T, C) \tag{4}
\]
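The three scores of Eqs. (2)–(4) can be sketched in Python as follows. All function and parameter names are hypothetical. The sketch rests on several assumptions that the text does not settle: the two ratios in Eq. (3) are read as being multiplied, which keeps ES within [0, 1]; the population standard deviation is used; and a ratio whose denominator is zero is treated as 0. The default weights 0.7 and 0.3 are the values reported in Section 5.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence


def process_similarity(n_composed: int, size_t: int, size_c: int) -> float:
    """Eq. (2): size of the composed tree over the larger of the two tree sizes."""
    return n_composed / max(size_t, size_c)


def _safe_ratio(numerator: float, denominator: float) -> float:
    # Assumption: a term with a zero denominator contributes no difference.
    return numerator / denominator if denominator else 0.0


def experience_similarity(
    dev_t: Dict[str, Sequence[float]],
    dev_c: Dict[str, Sequence[float]],
    weights: Dict[str, float],
) -> float:
    """Eq. (3), with the mean ratio and the deviation ratio multiplied.

    dev_t and dev_c map a dimension name to its observed deviations
    dev_i = (actual_i - plan_i) / plan_i; the weights must sum to 1.
    """
    penalty = 0.0
    for dim, w in weights.items():
        mu_t, mu_c = mean(dev_t[dim]), mean(dev_c[dim])
        sd_t, sd_c = pstdev(dev_t[dim]), pstdev(dev_c[dim])
        mean_term = _safe_ratio(abs(mu_t - mu_c), abs(mu_t) + abs(mu_c))
        sd_term = _safe_ratio(abs(sd_t - sd_c), max(sd_t, sd_c))
        penalty += w * mean_term * sd_term
    return 1.0 - penalty


def similarity_score(ps: float, es: float, w_ps: float = 0.7, w_es: float = 0.3) -> float:
    """Eq. (4): weighted sum of the two partial scores."""
    return w_ps * ps + w_es * es


# Illustrative (made-up) deviations for one dimension.
ps = process_similarity(n_composed=6, size_t=8, size_c=10)
es = experience_similarity(
    {"schedule": [0.0, 0.2]}, {"schedule": [0.1, 0.5]}, {"schedule": 1.0}
)
print(ps, es, similarity_score(ps, es))
```

Because both ratios in Eq. (3) lie in [0, 1] and the weights sum to 1, the penalty term cannot exceed 1, so ES and Score stay within [0, 1] under these assumptions.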
In Eq. (4), ω_ps and ω_es are the weights of the process knowledge similarity score and the experience library similarity score, respectively.
4.4. Assessment of output
It is important to measure the quality of PAs because it directly influences the understanding and management of human resources, the performance of software processes and the OEC-SPM method. High quality PAs should have higher intra-similarity and lower inter-similarity. Two commonly used metrics are introduced here to measure the quality of Process-Agents: the DUNN measure [11] and the Davies Bouldin (DB) measure [8]. The DUNN measure is the smallest inter-cluster distance divided by the largest intra-cluster distance, which reflects the difference among PAs. Thus the larger the DUNN measure is, the lower the inter-similarity of the PAs [18]. The DB measure indicates the average similarity of human resources within a PA [33]. So the lower the DB measure is, the higher the intra-similarity of a PA [18]. In both measures, the difference between data points, which are process step trees in this paper, is defined as the tree edit distance, and the similarity between data points is defined as the reciprocal of the tree edit distance. The method presented by Zhang and Shasha [57] is used to compute the tree edit distance. The DUNN measure is defined as follows [11]:
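\[
DUNN = \frac{\min_{1 \le i, j \le n,\ i \ne j} \{ d(c_i, c_j) \}}{\max_{1 \le k \le n} \{ diam(c_k) \}} \tag{5}
\]
where
\[
d(c_i, c_j) = \min_{t_1 \in c_i,\ t_2 \in c_j} \{ \mathrm{treeEditDistance}(t_1, t_2) \}
\]
\[
diam(c_k) = \max_{t_1, t_2 \in c_k} \{ \mathrm{treeEditDistance}(t_1, t_2) \}
\]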
and c_i is the ith PA, n is the number of PAs and t represents one process step tree in a PA. The Davies Bouldin measure is defined as follows [8]:
\[
DB = \frac{1}{n} \sum_{i=1}^{n} \max_{1 \le j \le n,\ i \ne j} \{ (s_i + s_j) / d_{ij} \} \tag{6}
\]
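Both quality measures can be sketched in Python with the distance left as a parameter. In the sketch below, `dist` stands in for the tree edit distance of [57], which is not implemented here, and the example uses plain numbers with the absolute difference purely for illustration. Here s_i is the average distance from the members of cluster i to its representative, and d_ij is the distance between the representatives of clusters i and j. The function names, and the choice to return infinity when every cluster has zero diameter, are assumptions.

```python
from itertools import combinations
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def dunn(clusters: Sequence[Sequence[T]], dist: Callable[[T, T], float]) -> float:
    """Eq. (5): smallest inter-cluster distance over the largest cluster diameter.

    Needs at least two clusters; dist is assumed to be symmetric.
    """
    min_between = min(
        dist(a, b)
        for ci, cj in combinations(clusters, 2)
        for a in ci
        for b in cj
    )
    max_diam = max(
        (dist(a, b) for c in clusters for a, b in combinations(c, 2)),
        default=0.0,  # every cluster is a singleton
    )
    # Assumption: the measure is reported as infinity when all diameters are 0.
    return min_between / max_diam if max_diam else float("inf")


def davies_bouldin(
    clusters: Sequence[Sequence[T]],
    representatives: Sequence[T],
    dist: Callable[[T, T], float],
) -> float:
    """Eq. (6); raises ZeroDivisionError if two representatives coincide."""
    n = len(clusters)
    s = [
        sum(dist(t, r) for t in c) / len(c)
        for c, r in zip(clusters, representatives)
    ]
    total = 0.0
    for i in range(n):
        total += max(
            (s[i] + s[j]) / dist(representatives[i], representatives[j])
            for j in range(n)
            if j != i
        )
    return total / n


# Illustration with numbers instead of process step trees.
example_clusters = [[0, 1], [5, 7]]
example_reps = [0.5, 6.0]
abs_dist = lambda a, b: abs(a - b)
print(dunn(example_clusters, abs_dist))                          # prints 2.0
print(davies_bouldin(example_clusters, example_reps, abs_dist))
```

In the illustration the smallest distance between the two clusters is 4 and the largest diameter is 2, which gives a DUNN value of 2.0.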
In Eq. (6),
\[
s_i = \frac{1}{|c_i|} \sum_{t \in c_i} \mathrm{treeEditDistance}(t, r_i)
\]
\[
d_{ij} = \mathrm{treeEditDistance}(r_i, r_j)
\]
and r_i is the representative tree of the ith PA, |c_i| is the number of members of the ith PA, and t and n have the same meaning as in DUNN.
Besides the two measures DUNN and DB, whether a human resource can find similar colleagues and how many human resources can find similar colleagues are also very important. Two further metrics are therefore computed to measure the cluster ratio. The cluster ratio of Process-Agents is the ratio of the number of PAs involving more than one human resource to the total number of PAs, and the cluster ratio of human resources is the ratio of the number of human resources involved in PAs having more than one human resource to the number of clustered human resources. To create high quality PAs, all four measures, i.e. DUNN, DB, the cluster ratio of Process-Agents and the cluster ratio of human resources, should be taken into account.
4.5. Time complexity and space complexity
4.5.1. Time complexity
In the algorithm presented in Section 4, the data extraction step and the clustering step are the most time consuming. Assume that n human resources need to be processed. For each human resource, the process step tree can be constructed by scanning the tasks several times, so the time complexity of data extraction is O(m), where m is the number of tasks related to the human resource. Each constructed step tree is then compared with all the clusters and composed with the proper cluster in the clustering step. Composing two trees costs O(s × h) time [33], where s is the size of the process step tree and h is the size of a distinct path. If there are c clusters, the time complexity of clustering is O(s × h × c). Thus processing one human resource has a time complexity of O(max{m, s × h × c}), and the time complexity of the whole algorithm is O(n × max{m, s × h × c}). It is much more efficient than the method given in [58].
4.5.2. Space complexity
When dealing with human resources, the algorithm processes one person at a time.
The space cost of one human resource is the size of the historical project data related to that human resource. If one human resource has m tasks, the space complexity is O(m). In addition, the clusters must be kept in memory. If there are c clusters and each cluster has s nodes, the space cost of the clusters is O(c × s). Thus the space complexity of the algorithm is O(m + c × s), while the algorithm in [58] needs O(n × m) space. This means that the algorithm in this paper can deal with much larger data sets.
5. Experiment and analysis
A PA creation tool has been developed according to the algorithm proposed in this paper. The tool was applied in a software company that has passed CMMI ML4. This company has about 400 employees, who are responsible for work such as web-project development, compiler-project development, software process improvement and software quality assurance. The company has a group of standard processes to direct the establishment of project software processes. The company adopts SoftPM [47], developed by the Institute of Software, Chinese Academy of Sciences, to manage the process asset library and the execution of project processes. SoftPM is an integrated system that supports project managers, higher level managers, engineers, testers, quality assurance people and other supporting people in working together, sharing the collected data and their respective visions, understanding the schedule, effort and quality of a project, and communicating effectively. The process asset library in SoftPM has a structure similar to the one defined in Section 3.2 and can be transformed easily. The company has collected a lot of data from its software development processes. This section introduces the application of the algorithm in this company.
The dataset used in this paper was obtained from SoftPM's database on July 27th 2009 and contains data from December 1st 2005 to July 27th 2009, including 91 projects, 27,889 tasks, 45,518 task-member relationships, 45,342 member-reports and 738 activities of standard processes. The total number of human resources in the data set is 321. The tool is implemented in Java and runs on the Windows XP operating system. The Java virtual machine arguments are set as -Xms256m -Xmx512m.
The computer has a dual-core 2.4 GHz CPU, 2 GB RAM and a 5400 rpm hard disk. Before applying the tool, several parameters need to be set: the similarity threshold LIMIT, the weight of each dimension when computing ES, and the weights of PS and ES when computing the similarity score. By default LIMIT is set to 0.5. The weight of each dimension is set to 1/|dims|, where |dims| is the number of dimensions involved when calculating ES. In our experiments the dimensions used are schedule time and workload, so the weight of each dimension is 0.5. The weight of PS is set to 0.7 and the weight of ES is set to 0.3. The weight of PS is larger than the weight of ES because the experience library depends on process knowledge: experience libraries are comparable only if the process knowledge is similar. These parameters were set based on previously executed experiments. Unless otherwise stated, the parameters take their default values.
Based on the whole process asset library, PAs can be created by the tool. The results are shown in Table 1 and are used as a baseline for comparison with the results under the three scenarios. The number of clustered human resources is 49, which is much smaller than the total number of human resources. The reason is that human resources with fewer than two process steps in their step trees are ignored: when the process step tree is too small, it cannot represent the knowledge of that human resource. We do not have enough knowledge about that person, so she/he is ignored by the algorithm this time. In total, 30 Process-Agents are created and 10 of them contain more than one human resource, and 29 of the 321 human resources can find colleagues with similar capabilities. It takes the tool nearly 12 min to produce the experiment result.
5.1. Experiment result
In this section the efficiency and flexibility of the algorithm are examined. The experiment has three parts, corresponding to the three scenarios of PA creation presented in Section 4.
It uses the data in the process asset library to simulate the dynamic changes of human resources in the organization. First the data before October 2008 were used to create PAs from scratch, and then the remaining data were used to simulate the incremental creation of PAs.
The first part is to create Process-Agents from scratch. Scenario 1 in Table 2 shows the experiment results when there are 200 original human resources and 68 historical projects. The number of clustered human resources is 34. In total, 25 PAs are created and 5 of them have more than one human resource. Moreover, 14 of the 200 human resources can find colleagues with similar capabilities. Compared with the results in Table 1, it produces fewer PAs, but the distribution of human resources is similar. It takes the tool about 7 min to produce this part of the experiment result.
The second part is to create Process-Agents for newly arrived human resources. Based on the 25 existing PAs in scenario 1, the tool creates PAs incrementally when the 121 new human resources arrive. The results are shown as scenario 2 in Table 2. The number of clustered human resources is 49. In total, 28 PAs are created and 10 of them have more than one human resource. Moreover, 31 of the 321 human resources can find colleagues with similar capabilities. The tool is sensitive to the order of the human resources. Compared with the results in Table 1, it produces fewer PAs but the same number of clustered human resources. It takes the tool almost 4 min to produce this part of the experiment result.
The third part is to create Process-Agents for human resources whose capabilities have changed. Based on the 28 existing PAs in scenario 2 in Table 2, the tool can create PAs incrementally when the capabilities of human resources have changed and they have taken part in more projects. Scenario 3 in Table 2 shows the experiment results. The number of clustered human resources is 49. In total, 31 PAs are created and 10 of them have more than one human resource.
Moreover, 29 of the 321 human resources can find colleagues with similar capabilities. The results are almost the same as in Table 1, except that Agent8 in Table 1 is broken into two agents (Agent1 and Agent10) in scenario 3 and Agent9 in Table 1 is merged into two other agents (Agent4 and Agent9 in scenario 3), which is acceptable. Producing this part of the experiment result takes less than 1 min.
Table 1
Process-Agents created from the whole PAL.
Input
  HRs: 321
  Projects: 91
Output
  Num_PA: 30
  Num_CHR: 49
  Process-Agents involving more than one HR:
    Agent1: LN, DSHZH, XJSH, LDP, LB
    Agent2: CHMR, WYX
    Agent3: WXL, YWW, ZHSH, LQ, HP
    Agent4: WXCH, ML
    Agent5: DZHM, YB, LXN
    Agent6: WWP, GL
    Agent7: ZHXY, TGQ
    Agent8: CQW, ZHT, WXP, WHY
    Agent9: LX, WP
    Agent10: JL, DWW
Num_PA: number of PA; Num_CHR: number of clustered HRs.
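The two cluster ratios defined in Section 4.4 follow directly from the entries of Table 1. The Python sketch below, in which the function name `cluster_ratios` and the variable names are hypothetical, applies those definitions to the Table 1 data.

```python
from typing import Dict, List, Tuple


def cluster_ratios(
    multi_hr_agents: Dict[str, List[str]], num_pa: int, num_chr: int
) -> Tuple[float, float]:
    """Return (cluster ratio of Process-Agents, cluster ratio of human resources).

    multi_hr_agents holds only the PAs that involve more than one HR,
    num_pa is the total number of PAs and num_chr the number of clustered HRs.
    """
    ratio_pa = len(multi_hr_agents) / num_pa
    hrs_with_colleagues = sum(len(members) for members in multi_hr_agents.values())
    ratio_hr = hrs_with_colleagues / num_chr
    return ratio_pa, ratio_hr


# Process-Agents involving more than one HR, copied from Table 1.
table1_agents = {
    "Agent1": ["LN", "DSHZH", "XJSH", "LDP", "LB"],
    "Agent2": ["CHMR", "WYX"],
    "Agent3": ["WXL", "YWW", "ZHSH", "LQ", "HP"],
    "Agent4": ["WXCH", "ML"],
    "Agent5": ["DZHM", "YB", "LXN"],
    "Agent6": ["WWP", "GL"],
    "Agent7": ["ZHXY", "TGQ"],
    "Agent8": ["CQW", "ZHT", "WXP", "WHY"],
    "Agent9": ["LX", "WP"],
    "Agent10": ["JL", "DWW"],
}
ratio_pa, ratio_hr = cluster_ratios(table1_agents, num_pa=30, num_chr=49)
print(f"{ratio_pa:.2%} {ratio_hr:.2%}")
```

For Table 1 this gives 10/30 for the Process-Agents and 29/49 for the human resources, the latter matching the 29 human resources reported to have found colleagues with similar capabilities.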
The analysis of the experiment results shows that the algorithm provided in this paper can deal with the three scenarios of PA creation efficiently. With the created PAs, human resources in the organization are organized efficiently and can find colleagues with capabilities similar to their own. Both the project manager and the human resource herself/himself can clearly know her/his strengths and weaknesses, which helps the human resource department, project managers and human resources to improve their performance, and human resources can be scheduled more reasonably as well. Moreover, the software process becomes more stable, because risks such as human resource demission can be reduced. Therefore, the software process performed by these human resources will have higher trustworthiness and the software produced by the software process will have higher quality.
5.2. Assessment
The two measures DUNN and DB are computed for different values of LIMIT to assess the quality of the clustering results. As shown in Fig. 5, the DUNN measure increases as LIMIT increases, while the DB measure decreases as LIMIT increases. This means that a larger LIMIT generates PAs of higher quality. However, as shown in Fig. 6, the cluster ratio of Process-Agents and the cluster ratio of human resources decrease as LIMIT increases. This means that the higher the required similarity is, the harder it is for human resources to find similar colleagues. To obtain the most suitable PAs, a suitable LIMIT value should be found by balancing the requirements of quality and cluster ratio. With such a LIMIT, human resources can be organized with high quality and similar colleagues can still be found, which helps to improve the trustworthiness of the software process.
5.3. Validation and comparison
Based on the experiment results shown in Table 2, expert consultation is introduced in this section.
Two product managers and seven project managers from the company were interviewed to verify the experiment results. One of the product managers is in charge of the whole product line, including five versions. The other product manager has two years of process
Table 2
Create Process-Agents in the three scenarios.
Scenario      Input               Output
              HRs    Projects     Num_PA   Num_CHR
Scenario 1    200    68           25       34
Scenario 2    121    23           28       49
Scenario 3    321    23           31       49
Process-Agents involving more than one HR:
Scenario 1
  Agent1: LN, DSHZH, XJSH, LDP, LB
  Agent2: CHMR, WYX
  Agent3: WXL, YWW, ZHSH
  Agent4: WXCH, ML
  Agent5: WWP, GL
Scenario 2
  Agent1: CHY, CQW, ZHT
  Agent2: LN, DSHZH, XJSH, LDP, LB
  Agent3: CHMR, WYX
  Agent4: WXL, YWW, ZHSH, LQ, HP, WP
  Agent5: WXCH, ML
  Agent6: DZHM, YB, LXN
  Agent7: WWP, GL
  Agent8: ZHXY, TGQ
  Agent9: LX, JL, DWW
  Agent10: WXP, WHY
Scenario 3
  Agent1: ZHT, CQW
  Agent2: LB, DSHZH, LDP, XJSH, LN
  Agent3: CHMR, WYX
  Agent4: WXL, YWW, ZHSH, LQ, HP, WP
  Agent5: ML, WXCH
  Agent6: YB, LXN, DZHM
  Agent7: GL, WWP
  Agent8: TGQ, ZHXY
  Agent9: DWW, JL, LX
  Agent10: WXP, WHY
improvement consulting experience, four years of project management experience and three years of development experience in the company. The seven project managers are in charge of different parts of the product. Their teams have about 3 to 10 members. For example, project manager “JCHX” was responsible for the main branch of the product from August 2008 to November 2010. The similarity of PAs was measured on a five-degree scale: very similar, similar, cannot tell, dissimilar and very dissimilar. The experts determined one degree for each Process-Agent separately. A PA is considered to be created correctly if it is rated similar or very similar; otherwise, it is considered to be created incorrectly. Based on the experts' answers, voting was used to determine the final degree of each PA. Moreover, Fleiss' Kappa [13,14] for multiple raters was calculated to check the reliability. The overall Kappa is 0.5749 with a standard error of 0.0286, which means that the experts' decisions are in moderate agreement.
The number of PAs in each degree and the correctness of the experiments are shown in Fig. 7. The X-axis in the figure represents the four experiment results shown in Tables 1 and 2. The Y-axis in the left graph represents the number of PAs in each degree and the Y-axis in the right graph represents the correctness of the four experiments. As seen in Fig. 7, there are no very similar or very dissimilar PAs. This is a main difference between software engineering and traditional manufacturing, in which identical machines can be found easily. When creating PAs in scenario 2, the correctness is only 60%, which is lower than that of the other scenarios. The analysis shows that this is because new human resources are considered in scenario 2. These newly arrived human resources have little experience data, which brings down the correctness of the experiment.
But as the human resources take part in more projects and more historical project data are accumulated, the correctness rises (see the “V” part of the curve). Sufficient experience data is therefore crucial. As can be seen in Fig. 7, most PAs are in the similar degree, which strongly supports the conclusion that the human resources of the company are well organized by the created PAs.
The comparison between this work and the previous work is shown in Table 3, which includes the DUNN and DB measures, the cluster ratio of Process-Agents, the cluster ratio of human resources, the accuracy, the recall ratio and the extracted knowledge. In the previous work, Zhang et al. did not consider the experience data and used only the process knowledge to measure human resources' capabilities. In addition, human resources were considered to have similar capabilities only when their process step trees had the same structure, and no other parameters were taken into account. This is too rigid for measuring human resources' capabilities. The parameters of this work are given in Section 5. As shown in Table 3, although the previous work has high Process-Agent quality and accuracy, its cluster ratio of Process-Agents, cluster ratio of human resources and recall ratio are very low. It is of little help to most human resources who want to find colleagues with similar capabilities. This work solves the problem and performs better on cluster ratio and recall ratio. Besides, LIMIT can be flexibly reset to different values to meet special needs, and much more knowledge is extracted to evaluate the human resources' capabilities.
Amoroso et al. have pointed out that software process trustworthiness is a necessary condition for software trustworthiness [3] and should be taken as a capability indicator for measuring and improving software trustworthiness [53].
Human resources, which may introduce major threats to software trustworthiness during the development life cycle [48,53], are a critical success factor in the Trustworthiness Assurance Process Area (TAPA) [10,53]. Compared with traditional software development methods, this method aims to solve the problem that human resources seriously affect the trustworthiness of software and software processes. It differs from traditional software development methods in that it is based on human resource capabilities and helps to reduce human resource related risks, such as demission and insufficient capacity. It breaks the organization's processes into individual processes and evaluates the performance of each human resource on the individual processes. Because the human resources' process capabilities are extracted from the actual execution of the organization's processes, the created PA can tell what a human resource can do, how she/he does the work and how well she/he does it. Therefore, when a new software process is built, the process can have higher trustworthiness because proper process performers can be allocated.
5.4. Discussion
The algorithm introduced in this paper and the Process-Agents created under various conditions bring several benefits.
X-axis is LIMIT with different value between 0 and 1; Y-axis is measure score of DUNN or DB. Fig. 5. DUNN and DB measure score for different LIMIT.
Fig. 6. Cluster ratio of Process-Agents and human resources.
Fig. 7. Similarity degree of Process-Agents.
Table 3
Comparison between this work and the previous work.
Measures                  Previous work       This work
                                              LIMIT = 0.5   LIMIT = 0.7   LIMIT = 0.9
DUNN                      1                   0.14286       0.25          1
DB                        0                   3.15053       1.42857       0
Cluster Ratio of PA (%)   0.00                27.27         18.42         4.55
Cluster Ratio of HR (%)   0.00                59.18         38.78         14.29
Accuracy (%)              100                 80.00         83.33         75.00
Recall (%)                0.00                77.78         55.56         33.33
Knowledge                 Process knowledge   Process knowledge and experience library
(1) PAs are created automatically and incrementally with high efficiency, and human resources in the organization are organized by their capabilities. A PA can be created by hand, and its descriptive knowledge, process knowledge and experience library can be input manually too. However, manual creation depends on the experience of the creator and its efficiency is low. Based on our algorithm and tool, a group of PAs was created automatically from SoftPM. The experiments in Section 5 show that the algorithm proposed in this paper has high efficiency and flexibility. With the created PAs, the capability distribution in the organization is clear and the performance of human resources on process activities is determined. Using these well organized human resources to perform the software processes will greatly improve the trustworthiness of these processes.
(2) Support the OEC-SPM method well. The three kinds of infrastructural knowledge (Descriptive Knowledge, Process Knowledge and Experience Library) needed by the OEC-SPM method can be constructed automatically and flexibly by mining the process asset library. The DUNN measure and the Davies Bouldin (DB) measure calculated in the experiments ensure, to some extent, the capability similarity of human resources in the same PA.
(3) Reveal the strengths and weaknesses of human resources' process execution capabilities. Two statistics are calculated for each human resource: the average value μ and the standard deviation σ. They reflect the performance and stability of the human resource when executing the process. The average difference between planned and actual workload and the productivity of each activity execution can also be determined for each human resource. These data describe the strengths and weaknesses of a human resource when executing different kinds of activities.
(4) Provide decision support for project managers. PAs can provide project managers with a lot of useful information for decision making.
The three main kinds of benefits that can be obtained by PA creation are:
Project team construction: If the organization is a project-oriented company, a project team needs to be constructed whenever a new project arrives. With the method proposed in this paper, the project manager can clearly see, from the created PAs, the current status of each kind of human resource in the organization, which makes it possible to construct a suitable project team.
Project plan making: A PA tells what kinds of activities a human resource can do, how she/he implements an activity, and her/his performance on that activity. With the help of PAs, managers can therefore assign the proper human resources to each job. If the project needs to be finished quickly, a human resource with high performance is selected. If the project needs to be finished stably, a human resource with high stability is selected. This improves both the utilization of human resources and the performance of the standard process. In particular, when a team member leaves during the project, the manager can easily and quickly find a successor in the same PA.
Human resource management: Analysis of the experimental results shows that only a small fraction of the PAs contain more than one human resource. The human resources in the remaining single-member PAs turn out to play more than one role when participating in projects, and have in fact taken part in many different activities. This reveals a problem in human resource scheduling and management: these people are not specialized in their work. The company should therefore improve its resource management capability.
(5) Information and guidance are provided for process improvement. When personal data are extracted in data preprocessing on the condition that a standard process was applied, the number of PAs obtained is small. For example, the database contains 321 human resources in total, but only 49 human resources and 30 PAs are obtained in Table 1. An analysis of the PAs and of the projects concerned found that, among the 91 projects, 58 are history projects and 62 have applied standard processes. Only 41 history projects have used standard processes, and only 88 of the 321 human resources have used standard processes. Because the other projects generated their project software processes without using a standard process, they were not developed in a standardized way. This means that the standard process is not commonly used in this mature organization. Process improvement should be implemented either by prescribing the use of standard processes or by setting up a new group of standard processes that reflect the actual situation. The creation results and their analysis therefore give information and guidance for process improvement.
(6) The distribution of human resources in the organization is reflected. A PA contains human resources with similar capabilities, and human resources from different PAs have different capabilities. Every PA can thus be seen as one kind of human resource, such as designers, test engineers or quality assurance staff. The distribution of human resources in an organization is therefore reflected by the states of its PAs, which can help the Human Resource Department draw up its hiring plan.
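As an illustration of item (3) and of the selection rule under "Project plan making", the following minimal sketch computes the average value μ, the standard deviation σ and the average difference between actual and planned workload for each human resource, and then picks a person by performance or by stability. It is not the tool described in this paper: the record layout, the names `summarize` and `pick`, the use of a per-activity productivity value as the quantity behind μ and σ, and the choice of the population standard deviation are all assumptions made for the example.

```python
from statistics import mean, pstdev

# Hypothetical activity records, one per executed activity:
# (person, planned_workload, actual_workload, productivity)
EXAMPLE_RECORDS = [
    ("alice", 10.0, 12.0, 1.5),
    ("alice", 8.0, 8.0, 0.5),
    ("bob", 10.0, 10.0, 0.75),
    ("bob", 6.0, 9.0, 0.75),
]


def summarize(records):
    """Return {person: {"mu", "sigma", "avg_workload_diff"}}.

    mu and sigma are taken over the productivity values (an assumption);
    sigma is the population standard deviation.
    """
    by_person = {}
    for person, planned, actual, productivity in records:
        entry = by_person.setdefault(person, {"prod": [], "diff": []})
        entry["prod"].append(productivity)
        entry["diff"].append(actual - planned)
    return {
        person: {
            "mu": mean(entry["prod"]),
            "sigma": pstdev(entry["prod"]),
            "avg_workload_diff": mean(entry["diff"]),
        }
        for person, entry in by_person.items()
    }


def pick(summary, prefer="performance"):
    """Pick the person with the highest mu, or the lowest sigma for stability."""
    if prefer == "performance":
        return max(summary, key=lambda person: summary[person]["mu"])
    return min(summary, key=lambda person: summary[person]["sigma"])


if __name__ == "__main__":
    summary = summarize(EXAMPLE_RECORDS)
    print(summary)
    print(pick(summary, "performance"), pick(summary, "stability"))
```

In this toy data the two criteria select different people, which is the trade-off between finishing quickly and finishing stably described above.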
6. Conclusion and future work
Software processes are highly people-dependent, and software process trustworthiness relies on the capabilities of human resources. Human resource organization and evaluation are therefore crucial. This paper organized human resources as Process-Agents and proposed an efficient, flexible and comprehensive method for creating PAs by mining the process asset library to support the OEC-SPM method. The method takes the process asset library as input and extracts the process data and experience data that determine the capabilities of human resources. Through clustering analysis, human resources with similar capabilities are classified into one group, from which a PA is created. With the created PAs, human resources are properly organized and their capabilities are clearly evaluated, which helps to improve the trustworthiness of processes and gives further decision support.
The work in this paper has resolved the following shortcomings of the previous work:
(1) Creating PAs by hand is impractical. This method creates PAs automatically and incrementally by mining the process asset library.
(2) The previous method for PA creation is inflexible. With this method, PAs can be created flexibly under different scenarios when the process asset library has changed, which is better suited to the dynamic nature of software processes.
(3) The previous method is inefficient and not comprehensive. This method uses a heuristic clustering algorithm with high efficiency, and extracts both process information and experience data for each human resource to evaluate their capabilities. It can therefore organize human resources and determine their capabilities more accurately.
(4) The previous method cannot assure the quality of PAs. This method creates more suitable PAs by computing four measures.
An empirical study has been conducted in a mature software organization. To keep the experiment concise, we assumed that all human resources are added in one step when PAs are created incrementally; in future work the human resources should be added gradually and more experiments will be done. The analysis shows that the creation is accurate and efficient. The results of the creation indicate that the organization should improve both its processes and its resource management.
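The quality assurance in item (4) relies on cluster-validity measures, two of which are named in the discussion: the Dunn measure and the Davies-Bouldin (DB) measure. The sketch below shows the standard definitions of these two indices only. It assumes that each human resource is represented by a numeric capability vector and that similarity is measured by Euclidean distance; both are assumptions for the example, and the function names are hypothetical rather than taken from the tool described in this paper.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def centroid(cluster):
    """Component-wise mean of the points in one cluster."""
    n = len(cluster)
    return tuple(sum(p[d] for p in cluster) / n for d in range(len(cluster[0])))


def dunn_index(clusters):
    """Smallest between-cluster distance over largest cluster diameter.

    Higher values indicate compact, well-separated clusters.
    """
    min_between = min(
        dist(p, q) for a, b in combinations(clusters, 2) for p in a for q in b
    )
    max_within = max(
        (dist(p, q) for c in clusters for p, q in combinations(c, 2)),
        default=0.0,
    )
    # If every cluster has a single member, all diameters are zero.
    return float("inf") if max_within == 0 else min_between / max_within


def davies_bouldin(clusters):
    """Mean over clusters of the worst (S_i + S_j) / d(c_i, c_j) ratio.

    S_i is the average distance of the members of cluster i to its centroid.
    Lower values indicate better separation.
    """
    cents = [centroid(c) for c in clusters]
    scatter = [
        sum(dist(p, m) for p in c) / len(c) for c, m in zip(clusters, cents)
    ]
    k = len(clusters)
    worst = [
        max(
            (scatter[i] + scatter[j]) / dist(cents[i], cents[j])
            for j in range(k)
            if j != i
        )
        for i in range(k)
    ]
    return sum(worst) / k


if __name__ == "__main__":
    # Two hypothetical PAs, each holding two 2-D capability vectors.
    pas = [[(0.0, 0.0), (0.0, 1.0)], [(3.0, 0.0), (3.0, 1.0)]]
    print(dunn_index(pas), davies_bouldin(pas))
```

Both functions need at least two clusters, and `davies_bouldin` assumes that no two clusters share the same centroid. Because many PAs in the case study contain a single human resource, the all-singleton case is handled explicitly in `dunn_index`.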
There are some directions in which this work will be extended:
(1) Mining individuals' capabilities in specific activities. This paper treats each person as a whole. Because some people are good at certain aspects and should be deployed to relevant activities according to their capabilities, it would also be interesting to describe the types of human resources in more detail and to look at the specific activities and capabilities of individuals. Such information will help with human resource management in organizations.
(2) Using time information. Because history data accumulates gradually, the latest data reflects the current status of human resources and may be more important. The capabilities of human resources also change over time, and the trend of this variation may be found by using time information.
Acknowledgments
This research was supported by the Natural Science Foundation of China under Grant Nos. 60903051, 61003028, 61073044 and 91218302 and the National Science and Technology Major Project under Grant No. 2012ZX01039-004.