Data Science for Software Engineering

Tim Menzies, Ekrem Kocaguneli, Fayola Peters
Lane Department of CS&EE, West Virginia University, Morgantown, WV, USA
[email protected], [email protected], [email protected]

Burak Turhan
University of Oulu, Oulu, Finland
[email protected]

Leandro L. Minku
CERCIA, School of Computer Science, The University of Birmingham, Birmingham, UK
[email protected]

Abstract—Target audience: Software practitioners and researchers who want to understand the state of the art in using data science for software engineering (SE). Content: In the age of big data, data science (the knowledge of deriving meaningful outcomes from data) is an essential skill that software engineers should be equipped with. It can be used to predict useful information about new projects based on completed projects. This tutorial offers core insights about the state of the art in this important field. What participants will learn: Before data science: this tutorial discusses the tasks needed to deploy machine-learning algorithms to organizations (Part 1: Organization Issues). During data science: from discretization to clustering to dichotomization and statistical analysis. And the rest: When local data is scarce, we show how to adapt data from other organizations to local problems. When privacy concerns block access, we show how to privatize data while still being able to mine it. When working with data of dubious quality, we show how to prune spurious information. When data or models seem too complex, we show how to simplify data mining results. When data is too scarce to support intricate models, we show methods for generating predictions. When the world changes, and old models need to be updated, we show how to handle those updates. When the effect is too complex for one model, we show how to reason across ensembles of models. Pre-requisites: This tutorial makes minimal use of maths or advanced algorithms and should be understandable by developers and technical managers.

I. INTRODUCTION

Topic: Data and model issues in SE, and machine learning algorithms to solve those issues. Why the topic would be of interest to a broad section of the software engineering community: In the age of big data, data science for software engineering is a very active area. A search through Amazon.com reveals dozens of new data science texts published just in the last two years. Yet most of those texts are overly concerned with the specific details of particular data miners. Industrial practitioners should be able to use data mining approaches as decision support tools, which allow prediction of useful information about new software projects based on completed projects. For that, what industrial practitioners and researchers need is a view of data mining that is higher level than (e.g.) how to build a Naive Bayes classifier, and that at the same time takes into account the particularities of software prediction tasks. The presenters of this tutorial have been active in this area for many years. Their work has found general principles that cover multiple data mining methods. This tutorial will present those principles.


Note that there is much evidence that such a “higher view” is urgently required. Recent work by Martin Shepperd shows just how much user expertise can alter the conclusions reached by a data miner [15]. His work clearly shows that even supposedly skilled researchers can use these tools very poorly. Therefore it is time to take a second, better look at these tools, at a higher level. The overall goal of the tutorial: better data mining by better-skilled software engineers. Concrete objectives to be achieved: The following list repeats paragraph 3 from the abstract: (1) Before running data mining algorithms, this tutorial discusses the tasks needed to deploy learners into an organization. (2) Where a lack or scarcity of local data is a problem, this tutorial shows how to adapt data from other organizational sources to local problems. (3) When privacy concerns block access to data, this tutorial shows how to privatize data while preserving our ability to mine that data. (4) When working with data of dubious quality, this tutorial shows how to prune spurious attributes and examples. (5) When data or models seem too complex, this tutorial shows how to simplify the results of data mining. (6) When data is too scarce to support intricate models, this tutorial discusses case-based reasoning methods for generating predictions. (7) When the world changes and old models need to be updated, this tutorial shows how to handle those updates. (8) When the effect being studied is too complex for one model, this tutorial presents methods for reasoning across ensembles of models.

II. OUTLINE OF THE PROPOSAL

A. Part 1: Organization Issues

• Know your domain
• Let the experts talk
• Suspect your data
• Data collection is cyclic
• Other notes from [2], [3], [9]

B. Part 2: Data Issues






• How to solve lack or scarcity of local data: methods to use cross data, i.e. data from other organizations (see the relevancy-filter sketch after this list).
  – Naive use of cross data is bad [17], [18].
  – Must use relevancy filtering [6], [7], [17].
• How to prune data, simpler and smarter?






  – Feature selection [1]
  – Simpler and smarter feature selection: QUICK
• How to advance simple CBR methods?
  – Easy-path principle: TEAK [7] and D-ABE
• How to keep your data private?
  – Data privacy [14]
  – Test case privacy [5]
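To make the relevancy-filtering idea above concrete, the sketch below shows one common variant of a nearest-neighbor filter in the spirit of [6], [17]: for every local project, the k most similar cross-company projects are collected, and only that union is used for training. This is an illustrative sketch, not the exact published algorithms; the function name, the Euclidean distance, and the default k=10 are assumptions made here for brevity.

```python
import numpy as np

def nn_relevancy_filter(cross_X, local_X, k=10):
    """Return indices of cross-company rows relevant to the local projects.

    For each local project, find its k nearest cross-company neighbors
    (Euclidean distance over the shared numeric features) and return
    the union of those neighbors' indices. Illustrative sketch only.
    """
    selected = set()
    for row in local_X:
        dists = np.linalg.norm(cross_X - row, axis=1)
        selected.update(np.argsort(dists)[:k].tolist())
    return sorted(selected)

# Hypothetical usage: train a defect or effort model only on the
# filtered cross-company rows instead of all of them.
# idx = nn_relevancy_filter(cross_X, local_X)
# model.fit(cross_X[idx], cross_y[idx])
```

In practice the features should be normalized before computing distances, so that no single attribute dominates the filter.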

C. Part 3: Model Issues

• Problems of SE models: what problems has the accumulation of decades of model building revealed?
  – Instability in effort [10]
  – Instability in defects [10]
  – Instability in process [16]
• Solutions: methods to handle the instability of models (see the ensemble sketch after this list)
  – Envy-based learning [11]
  – Ensembles
    ∗ Static [8], [13]
    ∗ Temporal [4]
    ∗ GAC-based
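To make the ensemble entries above concrete, here is a minimal sketch of a static averaging ensemble in the spirit of [8], [13]: several different learners are trained on the same effort data and their predictions are combined by a simple mean. The particular learners, the function names, and the unweighted mean are assumptions made for illustration, not the exact configurations studied in those papers.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

def fit_static_ensemble(X, y):
    """Train several diverse effort estimators on the same project data."""
    learners = [
        LinearRegression(),
        KNeighborsRegressor(n_neighbors=3),   # analogy-style estimator
        DecisionTreeRegressor(random_state=0),
    ]
    return [m.fit(X, y) for m in learners]

def ensemble_predict(models, X):
    """Combine the individual estimates with a simple (unweighted) mean."""
    preds = np.column_stack([m.predict(X) for m in models])
    return preds.mean(axis=1)

# Hypothetical usage on a table of project features and known effort:
# models = fit_static_ensemble(train_X, train_effort)
# estimates = ensemble_predict(models, new_project_X)
```

Swapping the unweighted mean for a weighting scheme that is re-learned as new projects arrive would move this static sketch toward the temporal ensembles of [4].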

III. ABOUT THE PRESENTERS

Tim Menzies is a full Professor at WVU. His experience in data analysis is extensive. He is the author of 200+ refereed publications and one of the co-founders of the PROMISE repository for repeatable SE experiments. Since 2001 he has been one of the leading proponents of applying data mining to software engineering data. His TSE 2007 paper on data mining SE data is the most cited paper in that journal for 2007 to 2012 [12]. He is the inventor of two new data mining algorithms (TAR3 and KEYS2) and is one of the co-organizers of the PROMISE conference on data mining SE data. He has organized workshops at ICSE 1999, ICSE 2005, ICSE 2007, and ICSE 2012, and co-located conferences at ICSE 2008 and ICSE 2009. He was the PC co-chair for ASE'12 and is a member of the editorial boards of IEEE TSE, ESE, JLVC, and the ASE journal. He organized all the PROMISE conference meetings from 2005 to 2011. He has organized special issues for the Empirical SE journal (five times), the Requirements Engineering journal, IEEE Intelligent Systems, and, twice, the Journal of Human-Computer Studies. As a result of all the above, Dr. Menzies has an extensive collection of contacts in the international scientific community.

Burak Turhan is a Postdoctoral Research Fellow at the Department of Information Processing Science, University of Oulu (Finland). His research interests include empirical studies of software quality, defect prediction, and cost estimation, as well as data mining for software engineering. He has published over 50 peer-reviewed articles on these topics, including one of the top five most cited papers in the Empirical Software Engineering journal since 2009, in which he investigated the feasibility of cross-company defect predictors with a novel filtering technique [17]. He has recently been awarded a 3-year research grant by the Academy of Finland on applying data science concepts to learn defect predictors across projects [Turhan12a, Turhan12b]. His research is carried out in close collaboration with industrial partners such as Nokia, Ericsson, Elektrobit, Turkcell, and IBM Canada. He is a steering committee member and the PC Chair of PROMISE'13, and is on the editorial board of the e-Informatica Software Engineering Journal (EISEJ).

Leandro L. Minku is a Research Fellow at the Centre of Excellence for Research in Computational Intelligence and Applications (CERCIA), School of Computer Science, The University of Birmingham (UK). He received his PhD degree in Computer Science from the University of Birmingham (UK) in 2011, and was an intern at Google Zurich for six months in 2009/2010. He was the recipient of the Overseas Research Students Award (ORSAS) from the British government and of several scholarships from the Brazilian Council for Scientific and Technological Development (CNPq). His research focuses on software prediction models, online/incremental machine learning for changing environments, and ensembles of learning machines. Along with Xin Yao, he is the author of the first approach able to improve the performance of software predictors trained on cross-company data over those trained on single-company data, by taking into account the changing environments of software prediction tasks.

Ekrem Kocaguneli is a postdoctoral research fellow at West Virginia University, where he also received his Ph.D. from the Lane Department of Computer Science and Electrical Engineering. His research focuses on empirical software engineering, the data and model problems associated with software estimation, and tackling them with smarter machine learning algorithms. His research has provided solutions to industry partners such as Turkcell and IBTech (a subsidiary of the National Bank of Greece), and he recently completed an internship at Microsoft Research Redmond. His work has been published in the IEEE TSE, ESE, and ASE journals.

Fayola Peters is a Ph.D. candidate at the Lane Department of Computer Science and Electrical Engineering, West Virginia University. She is the author of one of the only two known algorithms (hers presented at ICSE'12 [14], the other by Grechanik et al. [5]) that can privatize data while still preserving the data mining properties of that data.

REFERENCES

[1] Z. Chen, T. Menzies, D. Port, and B. Boehm. Finding the right data for software cost modeling. IEEE Software, 22(6):38–46, 2005.
[2] P. Domingos. A few useful things to know about machine learning. Communications of the ACM, 55(10):78–87, Oct. 2012.
[3] D. Barton and D. Court. Making advanced analytics work for you. Harvard Business Review, 90(10):78–83, 2012.
[4] L. L. Minku and X. Yao. Can cross-company data improve performance in software effort estimation? In PROMISE'12: Proceedings of the 8th International Conference on Predictive Models in Software Engineering, pages 69–78, 2012.
[5] M. Grechanik, C. Csallner, C. Fu, and Q. Xie. Is data privacy always good for software testing? In ISSRE'10: IEEE 21st International Symposium on Software Reliability Engineering, pages 368–377, 2010.
[6] E. Kocaguneli and T. Menzies. How to find relevant data for effort estimation? In ESEM'11: International Symposium on Empirical Software Engineering and Measurement, pages 255–264, 2011.
[7] E. Kocaguneli, T. Menzies, A. Bener, and J. W. Keung. Exploiting the essential assumptions of analogy-based effort estimation. IEEE Transactions on Software Engineering, 38(2):425–438, 2012.


[8] E. Kocaguneli, T. Menzies, and J. Keung. On the value of ensemble effort estimation. IEEE Transactions on Software Engineering, 38(6):1403–1416, 2012.
[9] T. Menzies, C. Bird, T. Zimmermann, W. Schulte, and E. Kocaguneli. The inductive software engineering manifesto: principles for industrial data mining. In MALETS'11: Proceedings of the International Workshop on Machine Learning Technologies in Software Engineering, pages 19–26, New York, NY, USA, 2011. ACM.
[10] T. Menzies, A. Butcher, D. Cok, A. Marcus, L. Layman, F. Shull, B. Turhan, and T. Zimmermann. Local vs. global lessons for defect prediction and effort estimation. IEEE Transactions on Software Engineering, 2012.
[11] T. Menzies, A. Butcher, A. Marcus, T. Zimmermann, and D. Cok. Local vs. global models for effort estimation and defect prediction. In ASE'11: 26th IEEE/ACM International Conference on Automated Software Engineering, pages 343–351, 2011.
[12] T. Menzies, J. Greenwald, and A. Frank. Data mining static code attributes to learn defect predictors. IEEE Transactions on Software Engineering, 33(1):2–13, 2007.
[13] L. L. Minku and X. Yao. Ensembles and locality: Insight on improving software effort estimation. Information and Software Technology, 2012.
[14] F. Peters and T. Menzies. Privacy and utility for defect prediction: Experiments with MORPH. In ICSE'12: 34th International Conference on Software Engineering, pages 189–199, 2012.
[15] M. Shepperd. It doesn't matter what you do, but it does matter who does it! In CREST Open Workshop, 2011.
[16] B. Turhan. On the dataset shift problem in software engineering prediction models. Empirical Software Engineering, 17:62–74, 2012.
[17] B. Turhan, T. Menzies, A. B. Bener, and J. Di Stefano. On the relative value of cross-company and within-company data for defect prediction. Empirical Software Engineering, 14(5):540–578, 2009.
[18] B. Turhan, A. T. Misirli, and A. Bener. Empirical evaluation of the effects of mixed project data on learning defect predictors. Information and Software Technology, 2012.
