Science and Information Conference 2014 August 27-29, 2014 | London, UK
A Model for Work Distribution in Global Software Development Based on Machine Learning Techniques Abdulrhman Alsri, Sultan Almuhammadi, Sajjad Mahmood {g201102450, muhamadi, smahmood}@kfupm.edu.sa. Information and Computer Science Department, King Fahd University of Petroleum and Minerals, Dhahran 31261, Saudi Arabia.
Abstract—The Global Software Development (GSD) initiative aims to facilitate the software development process by providing access to skilled workers at a relatively low cost and a 24/7 software development model. Previous work suggests that half of the companies that have tried GSD have failed to realize the anticipated outcomes, which has resulted in poor outsourcing relationships, high costs and overall poor software products. One critical factor for successful GSD projects is the allocation of tasks, as project managers not only need to consider their workforce but also need to take into account the characteristics of the geographically distributed sites and their relationships. In this paper, we present a task allocation model based on a neural network that identifies a fit site for a given task and then finds related sites to the fit site. The related sites can be used as alternatives to the fit site or as additional sites to run more tasks in parallel. The proposed model provides project managers with a list of potential sites for the given tasks from which to select the appropriate GSD sites. We also discuss and evaluate the proposed task allocation model compared with other approaches.
organizations. In this paper, we present a task allocation model that finds a fit site for a given task based on a neural network algorithm and suggests a number of alternative sites, as desired by the project manager, from which to select appropriate GSD sites. The alternative sites are chosen according to the Euclidean distance of their attribute vectors in the n-dimensional space using a technique similar to the one used in the k-nearest neighbour (k-NN) algorithm. The proposed model considers several task allocation factors such as time difference and quality to suggest a suite of suitable sites for a given GSD project. We also present an evaluation of the proposed task allocation model.
Keywords: Global software development; task allocation; work assignment; neural networks.
Lamersdorf et al. [10] proposed a Bayesian Network model that takes three sites and considers cost, time and quality factors. The model suggests the appropriate site for each task using probabilistic values. Doma et al. [11] argue that global development introduces time and cost overhead due to significant coordination and communication requirements between geographically distributed sites and a workforce from different cultural backgrounds. Their model is based on the critical path calculation for project scheduling and only considers time as a task allocation factor. Grinter et al. [12] presented an approach to deal with coordination issues in a distributed work environment.
I. INTRODUCTION
Over the last decade and a half, many firms around the world have started global software development (GSD) projects. A large number of these organizations have adopted GSD to reduce development cost and increase overall software quality [1, 2]. In GSD, a company (client) contracts out all or part of its software development activities to another company (vendor), which provides services for remuneration [3]. Client organizations benefit from GSD because vendors in developing countries cost less than in-house development [4]. Furthermore, client organizations also get access to a skilled workforce and take advantage of the 24/7 development model. Despite the steady growth of GSD projects, half of the companies that have tried GSD have failed to realize the anticipated outcomes, which has resulted in poor global relationships, misunderstanding of the projects’ requirements and higher development costs [5]. There are many reasons for these failures [6, 7, 8]. They are usually traced back to two main causes: insufficient abilities at different sites (e.g. absence of domain knowledge, high turnover rate etc.); and problems at the interfaces between distributed sites due to communication, coordination and cultural barriers [2]. A number of GSD challenges are directly impacted by the decisions taken at the task allocation phase of GSD projects [9]. Despite the importance of this problem, little research has been carried out to improve task allocation processes at GSD
The rest of this paper is organized as follows: Section II reviews the related literature. In Section III, we present the task allocation model and discuss an experimental scenario. We conclude the paper and discuss future work in Section IV.
II. RELATED WORK
Furthermore, Lamersdorf et al. [9] presented an empirical study to identify key factors that impact work distribution of a GSD project. The study concludes that cost is the single most important factor during the task allocation process in a GSD project. Beecham et al. [13] presented a decision support system for global software development that gives a set of recommendations to project managers with the aim of facilitating appropriate site selection that best meets the business needs of the project. More recently, Wickramaarachchi and Lai [14] presented a task allocation model based on a high abstraction level of the development process models. The model selects the suitable site based on work and site dependency. It identifies four parameters: the work-specific characteristics, the relation between the work phases, the dependency between sites and the site-specific characteristics. A site is selected if the product of these four parameters is maximal for that site. The drawback of this method is that each site is treated independently.
www.conference.thesai.org
III. TASK ALLOCATION MODEL
In this section, we present the task allocation decision model. First, we briefly introduce neural networks and the Euclidean distance, which are the key components of our task allocation decision model. Next, we discuss the task allocation decision model in detail and present an experimental scenario to evaluate the usefulness of the approach.

A. Neural Network Overview

The Artificial Neural Network (ANN) approach is a machine learning technique that has learning abilities and is helpful in solving problems with uncertain conditions. An ANN is constructed from neuron processing elements. The elements are connected by a network of connections, where each connection is assigned a weight. Furthermore, an ANN can be used as a classifier. A classifier maps input vectors to output vectors. The network learns the inputs along with their output values. Once the network has learned, it can be used as a predictor for future data [15]. Software engineers have taken advantage of ANNs to solve problems such as defect prediction [16] and development effort estimation [17]. The proposed model uses the neural network to identify a fit site for a given task.

B. Euclidean Distance

The Euclidean distance between two points in the n-dimensional space is a useful parameter for determining the relationship between these points, taking into consideration the differences in all dimensions equally. It is defined as the square root of the sum of the squared differences in each dimension. Formally, let a = (a1, a2, ..., an) and b = (b1, b2, ..., bn) be two points in the n-dimensional space. Then the Euclidean distance between a and b is given by

$$\delta(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$$

The Euclidean distance is used to find the nearest points to a given target point in many algorithms, such as the k-nearest neighbour (k-NN) algorithm [18]. In this paper, we use the Euclidean distance to determine the alternative sites that are ’close’ to the fit site obtained by the neural network.

C. The Proposed Task Allocation Model

In a traditional software development environment, a project manager typically distributes work tasks among team members who are present at a single development site. However, in GSD projects, a project manager needs to assign tasks to teams that are usually present at different geographical locations. This introduces extra complexity at the task allocation phase of a project, as one GSD vendor can be cheaper than another while another vendor might have more skilled workers. We present the task allocation decision model, shown in Fig. 1, as a tool that helps managers in such situations. The proposed model decides what the appropriate site is based on an objective given by the manager, such as better development quality or lower development cost. For instance, if the goal is to find a site which provides good quality, the model responds with a GSD provider at a specific site that has more skilled people.

Next, the manager can choose whether to look for more sites that are related to that site in order to assign more tasks to them. The proposed task allocation decision model uses the neural network technique to determine an appropriate site, called the fit site, and then, based on a parameter k provided by the project manager, it searches for the k most related sites that have a low Euclidean distance to the fit site (i.e. it finds the k nearest neighbours of the fit site) and lists them as alternatives. Initially, the neural network is trained on a sample data set. Each sample in the training data set represents a site and contains its own feature values. The site’s feature values represent the task allocation criteria such as time differences, quality, cost, collaboration maturity between sites, trust, etc. Each sample in the training data set has its own output value of 0 or 1, representing No or Yes. After the neural network is trained on the training data set, a project manager can provide it with the feature values of one site to predict its output (as shown in Fig. 1). If the site is accepted, it is considered a fit site. The manager can also express their GSD objectives by specifying a value for one feature that is the main criterion (for example, quality, time difference, cost etc.) for the task allocation. The output of the neural network is either 0 or 1. If the output is 1, the given site is recommended, or fit, to be assigned a task with the given objective. On the other hand, no task is assigned to a site for which the neural network output is 0.
Fig. 1: Task Allocation Decision Model

If the manager has more than one task to assign, then they can select more related sites along with the fit site using the Euclidean distance. The data set used for neural network training is also used to measure the Euclidean distances to the related sites. The model provides as many additional sites as the project manager needs, chosen as those nearest to the fit site (as shown in Fig. 1). Furthermore, it is important to note that the k related sites also satisfy the same task allocation objective (e.g. cost, quality etc.) as the fit site. Finally, as the output of the task allocation decision model, the project manager obtains the fit site and the k most related sites that are recommended for the given task and for other similar tasks to be assigned in parallel if needed.

Model Output = (fit site, k-most related sites)

where k ≥ 0 represents the number of related sites specified by the project manager.
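The second stage of the model, selecting the k sites nearest to an already identified fit site, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the rows are written in the style of Table 2 (Section III-E), the use of all four inputs (Time-Diff, Quality, Cost, Focus) in the distance is an assumption, and the names `SITES`, `related_sites` and `model_output` are hypothetical. The fit site is taken as given here, standing in for the neural network's decision.

```python
import math

# Illustrative rows: (site name, (Time-Diff, Quality, Cost, Focus)).
# Values follow the style of Table 2; they are used only as sample data.
SITES = [
    ("Delhi",      (0.25, 0.2, 0.1, 0.1)),
    ("Canberra",   (0.8,  0.3, 0.3, 0.1)),
    ("Cairo",      (0.1,  0.1, 0.2, 0.2)),
    ("Hadhramout", (0.0,  0.1, 0.2, 0.3)),
    ("Canberra",   (0.8,  0.3, 0.3, 0.3)),
    ("Washington", (0.8,  0.3, 0.3, 0.2)),
]


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def related_sites(fit_index, sites, k):
    """Return the names of the k rows closest to the fit site (itself excluded)."""
    fit_vec = sites[fit_index][1]
    scored = [
        (euclidean(fit_vec, vec), name)
        for i, (name, vec) in enumerate(sites)
        if i != fit_index
    ]
    scored.sort(key=lambda pair: pair[0])  # smallest distance first
    return [name for _, name in scored[:k]]


def model_output(fit_index, sites, k):
    """Model Output = (fit site, k most related sites), with k >= 0."""
    return sites[fit_index][0], related_sites(fit_index, sites, k)


# Assumption: the classifier has already marked row 5 (Washington) as fit.
fit, related = model_output(5, SITES, k=1)
print(fit, related)
```

Setting k = 0 returns the fit site alone, which matches the case where the manager has only one task to assign.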
D. Discussion

We discuss here some of the strengths of the proposed task allocation decision model. First, the model of Lamersdorf et al. [10] requires several sites to be at hand so that they can compete against one another for a specific objective. In contrast, our model can be given criteria values chosen by the project manager that represent the task allocation criteria for a single site, and the decision produced for that site is either recommended or not recommended. If the site is recommended, the model reports the site name and other information from the data set. Therefore, our model does not require the manager to have several sites at hand or to be familiar with their criteria values. The manager needs only one real example of a site, or a set of values constructed to reflect the project's needs. In either case, the example can be tested on the model for a task allocation recommendation. A benefit of this is that the manager does not need prior knowledge of commonly chosen sites, such as India for lower cost or Australia for better quality.

Second, the method of Lamersdorf et al. [10] handles more than one site by treating one site as the winner for the focus objective, such as quality, while the others have equal weight for cost and productivity time. Our model differs in that it supports managers who seek more than one site for the same objective; alternatively, the process can be run three times, each with a focus on a different objective. The proposed model thus has the flexibility to be used in either of these two ways.

Third, since the number of tasks can increase or decrease because of requirements changes, the model provides a flexible number of selected sites: setting the parameter k as needed yields a total of k + 1 sites.
Also, we can predict more than one site at a time by feeding the neural network a set of sites to decide whether they are fit or not. Furthermore, the criteria for task allocation are weighted with discrete numeric values, whereas the values used by the Bayesian network of Lamersdorf et al. [10] are fuzzy. For example, a value of medium is subjective: two people may interpret it differently, as somewhat low or somewhat high. We argue that using the neural network technique alone is insufficient to solve our problem, but extending it by finding the nearest neighbors of the fit site produces a reasonable solution. The two techniques differ in kind: the neural network is a supervised machine learning technique, whereas the nearest-neighbor search is used here in an unsupervised manner, as a form of clustering or grouping [19]. The neural network is given data samples together with their output (target) values so that, once it has learned those examples with their output values of 0 or 1, it can predict an output of 0 or 1 for new site examples based on the weights adjusted during the learning process. The nearest-neighbor search, on the other hand, groups data examples with other examples in the feature space based on the similarity of their criteria (feature) values and not on their output values. To illustrate, the neural network can be used to predict the fitness of a site based on previous learning,
and by extending it with the nearest neighbors we can group k other sites based on their similarity to the fit site.

E. Input Factors

In this section, the input data for the algorithm is described. Task allocation depends on several factors, which form the features of the input data set for the neural network. Table 1 presents a set of important task allocation factors during a GSD project [9].

Factor   Description
CD       Culture Differences
LD       Language Differences
LC       Labor Cost
CM       Collaboration Maturity
TD       Time Difference
R        Reliability
Q        Quality
PtC      Proximity to client

Table 1: GSD Task Allocation Factors
In this paper, we have only considered the common task allocation factors, namely, time, quality and cost [9], as shown in Table 2. Each of the common task allocation factors is assigned a value. For example, the time difference parameter, Time-Diff, represents the time zone difference between the client and vendor sites. The Cost and Quality values range between 1 and 3 and they represent the level of cost and quality for a specific site. The Focus is an index of one of the features: Time-Diff, Quality or Cost. This factor represents the project manager’s main objective/criterion during the GSD project. The Class is a binary value {0,1} that indicates the desired output for a particular site, that is, whether it is recommended or not recommended by the task allocation decision model.

Site         Time-Diff   Quality   Cost    Focus   Class
Delhi        [0.25]      0.2       0.1     0.1     1
Canberra     [0.8]       0.3       0.3     0.1     0
Cairo        0.1         [0.1]     0.2     0.2     0
Hadhramout   0           0.1       [0.2]   0.3     1
Canberra     0.8         0.3       [0.3]   0.3     0
Washington   0.8         [0.3]     0.3     0.2     1

Table 2: GSD Task Allocation Common Factors
Each feature’s range is standardized so that a feature with a wider range of values does not dominate the distance relative to the other task allocation factors. The features are scaled using the formula [20]:

$$x'(j, i) = 2\,\frac{x(j, i) - \min}{\max - \min} - 1$$

where max and min are the maximum and minimum values of the feature being scaled; as written, this formula maps each feature to the range [−1, 1]. Furthermore, we have applied White Gaussian Noise to the data features to generate more examples so that there is enough data for the neural network to learn. The White Gaussian Noise is based on the mean and standard deviation of the feature data.

F. Experimental Scenario
We have used MATLAB R2012b to simulate our proposed task allocation decision model. In this experiment, the neural network is first constructed, trained and tested. A backpropagation neural network is structured as (4-10-1): there are 4 input factors, 10 hidden neurons and 1 output neuron. The nearest-neighbor search uses the Euclidean distance as its distance metric. Once a ’fit site’ is selected by the neural network, the Euclidean distance is used to find the sites related to the selected ’fit site’. The output of this step is a set of k sites. As a result, the output of the task allocation decision model is the k sites plus the ’fit site’, as shown in Fig 5.
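The preprocessing of Section III-E and the (4-10-1) structure can be sketched in plain Python as follows. This is an illustrative sketch only and not the MATLAB simulation used in the paper: the sigmoid activations, the learning rate, the number of epochs, the noise level (a fraction of each feature's standard deviation) and the names `scale`, `augment` and `Net` are all assumptions, and the six rows are taken from Table 2 as sample data.

```python
import math
import random
import statistics


def scale(rows):
    """Per-feature scaling with x' = 2 (x - min) / (max - min) - 1."""
    cols = list(zip(*rows))
    lo, hi = [min(c) for c in cols], [max(c) for c in cols]
    return [[2 * (x - l) / (h - l) - 1 if h > l else 0.0
             for x, l, h in zip(row, lo, hi)] for row in rows]


def augment(rows, labels, copies, noise_scale, rng):
    """Keep the originals and add noisy copies (assumed: zero-mean Gaussian
    noise with std = noise_scale * feature std); labels are carried over."""
    stds = [statistics.pstdev(c) for c in zip(*rows)]
    out_rows, out_labels = [list(r) for r in rows], list(labels)
    for row, label in zip(rows, labels):
        for _ in range(copies):
            out_rows.append([x + rng.gauss(0.0, noise_scale * s)
                             for x, s in zip(row, stds)])
            out_labels.append(label)
    return out_rows, out_labels


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class Net:
    """4-10-1 feed-forward network trained by backpropagation (squared error)."""

    def __init__(self, rng, n_in=4, n_hidden=10):
        # The last entry of every weight list is the bias term.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                   for _ in range(n_hidden)]
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        h = [sigmoid(sum(w * v for w, v in zip(row, x)) + row[-1])
             for row in self.w1]
        y = sigmoid(sum(w * v for w, v in zip(self.w2, h)) + self.w2[-1])
        return h, y

    def train_step(self, x, t, lr):
        h, y = self.forward(x)
        d_out = (y - t) * y * (1 - y)                      # output delta
        d_hid = [d_out * self.w2[j] * h[j] * (1 - h[j])    # hidden deltas
                 for j in range(len(h))]
        for j, hj in enumerate(h):
            self.w2[j] -= lr * d_out * hj
        self.w2[-1] -= lr * d_out
        for j, dj in enumerate(d_hid):
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * dj * xi
            self.w1[j][-1] -= lr * dj

    def predict(self, x):
        """1 = recommended (fit), 0 = not recommended."""
        return 1 if self.forward(x)[1] >= 0.5 else 0


def mse(net, rows, labels):
    return sum((net.forward(x)[1] - t) ** 2
               for x, t in zip(rows, labels)) / len(rows)


rng = random.Random(0)
raw = [[0.25, 0.2, 0.1, 0.1], [0.8, 0.3, 0.3, 0.1], [0.1, 0.1, 0.2, 0.2],
       [0.0, 0.1, 0.2, 0.3], [0.8, 0.3, 0.3, 0.3], [0.8, 0.3, 0.3, 0.2]]
classes = [1, 0, 0, 1, 0, 1]
X, T = augment(scale(raw), classes, copies=20, noise_scale=0.05, rng=rng)

net = Net(rng)
before = mse(net, X, T)
for _ in range(200):                 # epochs (assumed value)
    for x, t in zip(X, T):
        net.train_step(x, t, lr=0.5)
after = mse(net, X, T)
print(f"mean squared error: {before:.3f} -> {after:.3f}")
```

A site whose `predict` result is 1 would then be treated as the fit site and passed on to the nearest-neighbor step.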
Fig. 2: Sample Training Data
Fig. 3: Sample Training Data
Fig. 4: Sample Training Data

To simulate the model, the data set containing the values of the task allocation factors is given as input to the model. We now walk through one example of the model process. Fig 2 shows a sample of the training data set. For testing, some mutations were made to the training data, with a focus on the objective features (the mutated values have a gray background in Fig 2 and Fig 3). In the experimental scenario, the mutations were made to the time difference, quality and cost features; the changed cells, shown with a gray background, are those of the objective indicated by the Focus feature. Assume the mutations expressed by the two tables in Fig 2 and Fig 3 are as follows: (0 → 0.8) for the time difference, (0.3 → 0.1) for the quality and (0.3 → 0.1) for the cost. The results of the task allocation decision model are as follows. First, we consider each record as an example of a site. Then, we notice that for the first row of Fig 2 and Fig 3, the desired output changed from (0) in Fig 2 to (1) in the corresponding row of Fig 3, which means Yes. That is, the site became recommended because the time difference in the record became longer (0.8 instead of 0). In other words, when the manager asks to focus on time, that site is recommended for the time difference objective, whose value is (0.8). That value is preferred because, when the time difference between the sites is long enough, one team is working while the other team is off work. This is the mutation of the first record; the same idea applies to the other mutated examples in the tables of Fig 2 and Fig 3. The mutation above is introduced only to demonstrate that, when a site is under test, the neural network decides based on the values of the task allocation factors, examples of which are listed in Table 1. In practice there is no such mutation: the model is assumed to be applied to real data, which differs from the data used for training, and the mutation stands in for that difference.
Fig. 5: Model Simulation. K-Nearest Neighbour.

In the next step, we find the alternative sites that are related to the fit site in terms of the task allocation criteria. As shown in Fig 4, the fit site is Washington. The related site is Canberra (in this case, k=1). In the second case, when Canberra is the fit site in Fig 4, the task allocation decision model suggested 2 sites, London and Canada. These two sites satisfy the good-quality objective and are the most related to Canberra. Fig 5 shows the simulation results for the nearest-neighbor search using the Euclidean distance, where a fit site is represented as ”X” and the related sites are represented as circled ”O”.
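As a worked illustration of the distance computation behind this step, consider the Washington, Canberra and Delhi rows of Table 2, written as vectors (Time-Diff, Quality, Cost, Focus). The data of Fig 4 are not reproduced here, so the Table 2 values and the choice to include Focus in the distance are assumptions made only for the example.

```latex
\begin{align*}
\text{Washington} &= (0.8,\ 0.3,\ 0.3,\ 0.2), \quad
\text{Canberra} = (0.8,\ 0.3,\ 0.3,\ 0.3), \quad
\text{Delhi} = (0.25,\ 0.2,\ 0.1,\ 0.1) \\[4pt]
\delta(\text{Washington}, \text{Canberra})
  &= \sqrt{0^2 + 0^2 + 0^2 + (0.2 - 0.3)^2} = \sqrt{0.01} = 0.1 \\[4pt]
\delta(\text{Washington}, \text{Delhi})
  &= \sqrt{(0.8 - 0.25)^2 + (0.3 - 0.2)^2 + (0.3 - 0.1)^2 + (0.2 - 0.1)^2} \\
  &= \sqrt{0.3025 + 0.01 + 0.04 + 0.01} = \sqrt{0.3625} \approx 0.602
\end{align*}
```

Under these assumed values Canberra is much closer to Washington than Delhi is, so with k=1 Canberra would be listed as the related site.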
IV. CONCLUSION
The global software development approach is adopted by organizations with the aim of reducing development cost, improving overall software quality and increasing productivity by having work carried out around the clock using the follow-the-sun concept. Task allocation is a key phase of GSD projects that directly impacts the benefits of adopting GSD. In this paper, we have presented a task allocation decision model that uses a neural network to find a fit site and applies the Euclidean distance to find the nearest neighbors of the fit site in order to select an appropriate set of sites. We also evaluated the proposed model on an experimental scenario. As a threat to validity, the proposed model still needs to be evaluated using a real data set. This opens doors for further extensions of this work. Future work may include: (1) evaluating the proposed model on real-world case studies, (2) proposing another solution based entirely on the k-nearest neighbor technique, (3) comparing the performance of the kNN solution in (2) with the model proposed in this paper, and (4) improving the model by considering more than one task allocation objective to select a fit site.
V. ACKNOWLEDGMENT
The authors would like to thank King Fahd University of Petroleum and Minerals (KFUPM), Dhahran, Saudi Arabia for its continuous support of research. This research is supported by the Deanship of Scientific Research at KFUPM under Research Grant IN131013.

REFERENCES
[1] A.A. Bush, A. Tiwana, and H. Tsuji. An empirical investigation of the drivers of software outsourcing decisions in japanese organizations. Information and Software Technology Journal, 50(6):499–510, 2008.
[2] M. Niazi, S. Mahmood, M. Alshayeb, M. Rehan Riaz, K. Faisal, and N. Cerpa. Challenges of project management in global software development: Initial results. In Science and Information Conference (SAI), 2013, pages 202–206, Oct 2013.
[3] T. Kern and L. Willcocks. Exploring information technology outsourcing relationships: theory and practice. Journal of Strategic Information Systems, 9:321–350, 2000.
[4] L. McLaughlin. An eye on india: Outsourcing debate continues. IEEE Software, 20(3):114–117, 2003.
[5] S. Islam, M.M.A. Joarder, and S.H. Houmb. Goal and risk factors in offshore outsourced software development from vendor’s viewpoint. In Global Software Engineering, 2009. ICGSE 2009. Fourth IEEE International Conference on, pages 347–352, July 2009.
[6] M. Cataldo, M. Bass, J.D. Herbsleb, and L. Bass. On coordination mechanisms in global software development. In Global Software Engineering, 2007. ICGSE 2007. Second IEEE International Conference on, pages 71–80, Aug 2007.
[7] Mary C. Lacity and Joseph W. Rottman. Effects of offshore outsourcing of information technology work on client project management. Strategic Outsourcing: An International Journal, 2(1):4–26, 2008.
[8] J. Stark, M. Arlt, and D.H.T. Walker. Outsourcing decisions and models some practical considerations for large organizations. In Global Software Engineering, 2006. ICGSE ’06. International Conference on, pages 12–17, Oct 2006.
[9] Ansgar Lamersdorf, Jürgen Münch, and Dieter Rombach. A survey on the state of the practice in distributed software development: Criteria for task allocation. In Global Software Engineering, 2009. ICGSE 2009. Fourth IEEE International Conference on, pages 41–50. IEEE, 2009.
[10] Ansgar Lamersdorf, Jürgen Münch, and Dieter Rombach. A decision model for supporting task allocation processes in global software development. In Product-Focused Software Process Improvement, pages 332–346. Springer, 2009.
[11] Supraja Doma, Larry Gottschalk, Tetsutaro Uehara, and Jigang Liu. Resource allocation optimization for gsd projects. In Computational Science and Its Applications–ICCSA 2009, pages 13–28. Springer, 2009.
[12] Rebecca E Grinter, James D Herbsleb, and Dewayne E Perry. The geography of coordination: dealing with distance in r&d work. In Proceedings of the international ACM SIGGROUP conference on Supporting group work, pages 306–315. ACM, 1999.
[13] Sarah Beecham, John Noll, Ita Richardson, and Deepak Dhungana. A decision support system for global software development. In Global Software Engineering Workshop (ICGSEW), 2011 Sixth IEEE International Conference on, pages 48–53. IEEE, 2011.
[14] D. Wickramaarachchi and R. Lai. A method for work distribution in global software development. In Advance Computing Conference (IACC), 2013 IEEE 3rd International, pages 1443–1448, 2013.
[15] James A Anderson and Joel Davis. An introduction to neural networks, volume 1. MIT Press, 1995.
[16] Venkata Udaya B Challagulla, Farokh B Bastani, I-Ling Yen, and Raymond A Paul. Empirical assessment of machine learning based software defect prediction techniques. International Journal on Artificial Intelligence Tools, 17(02):389–400, 2008.
[17] Heejun Park and Seung Baek. An empirical validation of a neural network model for software effort estimation. Expert Systems with Applications, 35(3):929–937, 2008.
[18] Wikipedia. K-nearest neighbors algorithm — wikipedia, the free encyclopedia, 2013. [Online; accessed 17-December-2013].
[19] Simon S Haykin. Neural networks and learning machines, volume 3. Prentice Hall New York, 2009.
[20] Wikipedia. Feature scaling — wikipedia, the free encyclopedia, 2013. [Online; accessed 5-January-2014].