Application of Data Mining in Software Engineering ...

4 downloads 1988 Views 78KB Size Report
Application of Data Mining in Software Engineering for PASED. AliReza ..... publicly available data sources, mainly from free software repositories, but that the ...
International Journal of Applied Engineering Research ISSN 0973-4562 Volume 10, Number 13 (2015) pp 33853-33857 © Research India Publications. http://www.ripublication.com

Application of Data Mining in Software Engineering for PASED AliReza Honarvar1, Fatemeh Sarikhani 2,MohamadSadegh Kargar-Khorami3 and Kazem Faridi4 1

Department of Electrical and Computer Engineering, Safashahr Branch, Islamic Azad University, Safashahr, Iran, [email protected] 2 Department of Electrical and Computer Engineering, Safashahr Branch, Islamic Azad University, Safashahr, Iran, [email protected] 3 Department of Electrical and Computer Engineering, Safashahr Branch, Islamic Azad University, Safashahr, Iran, [email protected] 4 Department of Electrical and Computer Engineering, Safashahr Branch, Islamic Azad University, Safashahr, Iran, [email protected]

Abstract Software organizations have often collected volumes of data in hope of better understanding their processes and products. Useful information has been extracted from those large volumes of data, but it is commonly believed that large amounts of useful information remains hidden in software engineering databases. Data mining has appeared as one of the tools of choice to better explore software engineering data. Data mining can be defined as the process of extracting new, non-trivial, and useful information from databases. In this paper the application of mining softer engineering dataset is considered from various aspects, and the benefits of analysing the Empirical Software Engineering in PASED School is also discussed. Keywords: Empirical Software Engineering, PASED, Data mining, Software engineering 1.

Introduction

Data mining represents a shift from verification-driven data analysis approaches to discovery-driven data analysis approaches. In the former approach, a decision maker must hypothesize the existence of information of interest, collect this information, and test the posed hypothesis against the information collected. Due to the size and complexity of data repositories nowadays, this approach is not sufficient to efficiently explore the data available in an organization. Discovery-driven approaches discover important information hidden in the data. The main reasons for the recent boom of data mining are [1]: 1. software and hardware infrastructures have matured to handle the intensive computation required by discoverydriven data analysis; 2. advances in and ability to use techniques such as pattern recognition, neural networks, association measures, and decision trees have advanced dramatically in recent years; 3. cheap data storage and repositories integration have made large amounts of data readily available in modern organizations. In this paper we discuss data mining and its applications to big software engineering data analysis.

2.

Empirical Software Engineering

organizations. Empirical software engineering has established itself as a research area within software engineering during the last two decades. 20 years ago Basili et al. [12] published a methodological paper on experimentation in software engineering. Some empirical studies were published at this time, but the number of studies was limited and very few discussed how to perform empirical studies within software engineering. Since then the use of empirical work in software engineering has grown considerably, although much work still remains. A journal devoted to empirical software engineering was launched in 1996 and conferences focusing on the topic have also been started. Today, empirical evaluations and validations are often expected in research papers. However, the progress in research must reflect on education. Computer scientists and software engineers ought to be able to run empirical studies and understand how empirical studies can be used as a tool for evaluation, validation, prediction and so forth within the area. This section presents some views on education in empirical software engineering. In particular, the chapter contributes by identifying three different educational levels for empirical software engineering. Furthermore, the chapter stresses the possibility to run both experiments and case studies as part of courses and how this facilitates research. However, to perform research as part of courses, with the educational goals of a course in mind, requires a delicate balance between different objectives. The research goals must be carefully balanced with the educational goals. Different aspects to have in mind when balancing the different objectives are presented. Finally, the need to develop guidelines for running empirical studies in a student setting is stressed. Guidelines are needed both to ensure a proper balance between different objectives and to, within the given constraints, get the maximum value from empirical studies run in a student setting. It is too simplistic to disregard empirical studies with students just because they are students instead there is a need to increase our understanding of how empirical studies with students should be conducted and to what extent results from them could be generalized.

33853

International Journal of Applied Engineering Research ISSN 0973-4562 Volume 10, Number 13 (2015) pp 33853-33857 © Research India Publications. http://www.ripublication.com Teaching Empirical Methods This section addresses issues to consider when teaching empirical software engineering on different educational levels, including bachelor, master and PhD level [5]. Furthermore, the section discusses the use of different empirical methods in particular experiments and case studies. 2.1 Educational Levels Empirical software engineering may be taught in at least three distinct ways: Integration in software engineering courses One way of making empirical studies a natural part of evaluating and assessing different methods, techniques and tools is to integrate empirical work as assignments or studies within other courses. This means that in, for example, a course on software design different ways of performing a design can be compared and evaluated empirically. Another example is to include an experiment in a verification and validation course, where, for example, unit testing and inspection could be compared. In particular, it may be interesting to evaluate which type of faults that are found with the different techniques. Integration into courses early in the curricula means that students get exposed to and used to apply empirical methods. 2.2 Separate course It is also possible to run a separate course on empirical software engineering. The major advantage is that the course becomes very focused on empiricism and hence it becomes easier to provide assignments that include designing studies, reviewing empirical work and also possible write an empirical paper. The latter may be done without actually running a study, i.e. the focus could be in identifying, designing and discussing an empirical study. A potential drawback may be that it is hard to get a separate course on empirical software engineering into the curricula. It is often many different topics that should be covered in a software engineering program. Inclusion in research methodology course The need for a research methodology course normally becomes evident when students are approaching their thesis work, independent of whether it is a bachelor, master or PhD thesis. Thus, it may be natural to include an empirical software engineering component in a more general research methodology course. Other components may be the use of an analytical approach or the use of simulation. The above three alternative ways of teaching empirical software engineering may be mapped to educational levels. The mapping is based on experiences at Blekinge Institute of Technology to integrate empirical software engineering into a general software engineering program. At Blekinge, students may get five different degrees in software engineering. The first three are regarded as being on undergraduate and graduate level and they are (counting years from the start at the university): 1) university diploma in software engineering after two years, 2) a bachelor degree after three years and 3) a master degree after 4.5 years. The two latter includes a thesis, where the thesis on the master

level normally is more research-oriented than the thesis on bachelor level. On a research level, two different degrees are awarded: 1) licentiate degree after two years of research studies and 2) a doctoral degree after four years of research studies. The licentiate degree includes a year of course work and a thesis equivalent to one year of full time research. The doctoral degree includes 1.5 years of course work and a thesis equivalent to 2.5 years of research. The work to licentiate level is expected to be included in the doctoral degree. To simplify a little and make the Swedish system comparable to an international context, the main focus is set on the bachelor, master and PhD levels. These three levels are also the ones included in a joint agreement within the European community to ensure compatibility between countries within the community. In our experience, it is suitable to try to integrate assignments with some empirical parts into courses on the bachelor level. It is also possible to successfully perform experimental studies with students as subjects on the bachelor level. However, it is crucial to ensure a very clear educational goal with such studies. This is particularly important on the bachelor level since the students are still quite far from research. On the master level, it is still important to integrate empirical methods and studies into other courses. The students are now approaching the research level and in particular they are expected to write theses with a research component, and hence it is possible to run experimental studies that are more research focused. However, it is still very important to discuss the outcome of the studies and relate to whatever topic they are studying. Before the master thesis, it is suitable to run a course on either empirical software engineering or a more general research methodology course with an empirical component. The actual choice is dependent on whether the thesis work is expected to be only empirical or if other research approaches are also possible. It also matters whether the objective is to run a joint course between several programs, for example, a joint course between software engineering and computer science. The latter is the situation at Blekinge and hence methods for empirical studies are included in a more general research methodology course. Finally, on the PhD level, we have experience from teaching specific courses on controlled experiment in software engineering, empirical software engineering and also a course on statistical methods for software engineers. The latter course is based on having data from a number of empirical studies in software engineering. The course is given using a problembased approach. This means that the students are given a set of assignments connected to the data sets provided. No lectures are given on how to solve the problems and no literature on statistics are provided. It is up to the students to find suitable literature and hence statistical methods to use based on the assignments and the data available to them. The students practice statistical methods useful both for case study data and data from controlled experiments. Students provide individual solutions that are peer reviewed before a discussion session. In the discussion session, differences in solutions are discussed and the teacher shares her or his experience with the students. The sessions include discussions on both the statistical methods used and the interpretation of the analysis

33854

International Journal of Applied Engineering Research ISSN 0973-4562 Volume 10, Number 13 (2015) pp 33853-33857 © Research India Publications. http://www.ripublication.com in a software engineering context. This course is run for the second time in 2006 with participants from five different Swedish universities. 3. Application of Data Mining engineering data

in

Software

In [6] uses Quinlan’s ID3 system to build classification trees to identify software modules with high development effort or high number of faults. The data - collected from a NASA production environment - included 4,700 software modules with 74 measured attributes. This is a seminal article on the application of machine learning-based techniques to build software engineering classification models. It has an excellent background section and good references. Using classification methods to predict software defect proneness with static code attributes has attracted a great deal of attention. The class-imbalance characteristic of software defect data makes the prediction much difficult; thus, a number of methods have been employed to address this problem. However, these conventional methods, such as sampling, cost-sensitive learning, Bagging, and Boosting, could suffer from the loss of important information, unexpected mistakes, and overfitting because they alter the original data distribution. In [r2] presents a novel method that first converts the imbalanced binary-class data into balanced multiclass data and then builds a defect predictor on the multiclass data with a specific coding scheme. A thorough experiment with four different types of classification algorithms, three data coding schemes, and six conventional imbalance data-handling methods was conducted over the 14 NASA datasets. The experimental results show that the proposed method with a one-against-one coding scheme is averagely superior to the conventional methods. Mining Software Repositories (MSR) is one of the fastest growing field and community within Software Engineering and consists of analysing the rich data available in software repositories to uncover interesting and actionable information about software systems and projects. MSR is data-driven and falls under Empirical Software Engineering (ESE). Hence, understanding and mitigating threats to validity of the findings and results is a critical element of MSR research. In [7], they present thier point of view and opinion on the issue of lack University-Industry collaboration and usage of Closed or Proprietary Source Software (CSS/PSS) in MSR research. They defend the validity of their argument by conducting a bibliometric analysis of scientific publications in MSR series of conferences. They believe that the issue of lack of University-Industry collaboration and usage of Closed or Proprietary Source Software (CSS/PSS) in MSR research is a real and important issue and their research presented in this paper provides a fresh perspective on the given subject matter. A study by Robles et al. show that MSR authors use in general publicly available data sources, mainly from free software repositories, but that the amount of publicly available processed datasets is very low [8]. This study consisting of analysis of 187 scientific publications in MSR series of conferences over a period of 5 years from 2010 -

2014 reveals lack of University-Industry collaboration and paucity of studies involving CSS/PSS dataset posing threats and concerns to practical applicability, external validity and generalizability. they believe that data sharing by Industries is a debatable and arguable point. In this position statement, we take a stand on the debate that specific and strict guidelines should be defined on the expectation of data sharing by researchers. Also, we believe that the MSR community can define a data archiving and preservation policy as a condition for publication. Software companies are focusing on delivering quality to their customers. One of the main factors that determine the perception of the end-user of quality is the degree to which the software is free of bugs. Software fault prediction aims to improve software quality by identifying fault prone modules in a timely manner by means of metric-based classification .Early detection is also crucial as the cost of correcting or reworking software is much lower if faults are discovered in the early phases of the software development cycle. In the literature, software fault prediction has been studied from different viewpoints; predicting the number of faults in each segment or using the characteristics of different segments in the software code to identify which segments are most fault prone . In the first one, software fault prediction is considered to be a regression problem while the second case regards it as a classification task. authors In [3] consider software fault prediction from a classification point of view in which modules that are fault prone are identified based on their specific metrics. Various techniques have been investigated to this end, including artificial neural networks (ANN),treebased methods , swarm intelligence, SVMs, Random Forests (RFs) and logistic regression. How-ever, it is often stated that results regarding the superiority of one technique over another are difficult to assess across different studies due to different experimental setups. 4. Empirical Software Engineering in PASED School Most of the material in this section is extracted from [4]. Empirical research has become crucial for today’s software engineering researchers. Virtually any software engineering research publication is expected to have some form of empirical validation, e.g., a case study or a controlled experiment. Moreover, software engineering is gaining maturity, moving from craftsmanship towards a consolidated scientific discipline, where conclusions are drawn based on empirical evidences rather than on common wisdom. the large majority of software engineering graduate students still do not have access to such courses and must rely on generic statistics/data mining courses from computer science/software engineering curricula. Such courses do not completely satisfy the needs of empirical software engineering researchers, because: developers and users are at the center of the development activities (either as producers or consumers), making software engineering research more similar to social sciences than to natural sciences research;

33855

International Journal of Applied Engineering Research ISSN 0973-4562 Volume 10, Number 13 (2015) pp 33853-33857 © Research India Publications. http://www.ripublication.com • many software engineering studies deal with data sets for which parametric statistics (taught in typical statistics courses) are not suitable. Thus, empirical software engineering research courses should provide solid foundations on nonparametric statistical procedures; • empirical software engineering researchers require specific skills to extract information from software repositories containing a mixture of structured and unstructured data, often incomplete and imprecise [10,11]. The educational objectives of the school were to: 1) learn to plan and conduct software engineering experiments with human subjects and collect related data; 2) learn to plan and conduct software engineering experiments involving the mining of data from (un)structured software repositories; 3) learn to build prediction and classification models from the collected data, and to use these models. At the end of the school, they asked participants to fill out a feedback form. Rather than using scales to allow empirical analysis of the forms, but constrain participants into categories, we preferred free-text forms to first understand the categories with which such a school can be evaluated and to improve the next editions. Students were excited about the “learning by doing” teaching model, as this model helped them to concretely apply the various methods taught in the courses and to reinforce learning. They also liked the variety of topics chosen and the technical depth of the lectures. Having laboratories with each lecture and with step-bystep instructions was probably the part of the school that students liked the most, because they could (1) apply what they learned during the lectures, (2) learn tools that they never used before, e.g., R, and (3) interact with speakers and teaching assistants. they also obtained suggestions to improve future editions of the school. Students liked laboratories and, thus, we would like to make them longer, possibly complemented with some general tutorials on the tools being used. Also, students asked for keynotes on writing empirical research papers, possibly explicitly related to the specific topics of the lectures. Another interesting remark was the request to see “what not to do” when conducting empirical research, possibly by referring to wrong (and published) empirical research. Finally, participants would like to see industry talks that are directly connected with the topics of the lecture, to understand their relevance in the industry.

5. Conclusion This paper presents some ways to integrate empirical software engineering into the curricula. Some issues related to this introduction are highlighted. In particular, the need to balance educational objectives with research objectives is stressed. It is argued that both education and research would benefit from as realistic empirical studies as possible when performing studies involving students. Thus, it can be concluded that the

challenge is to integrate empirical software engineering and empirical studies into the curriculum and maintain research relevance and quality. Instead of being negative towards studies with students, we should increase our understanding of how empirical studies with students can be used as part of our research process. We must address questions such as: How is empirical software engineering best introduced into the curricula to ensure that students are both able to run empirical studies and capable of understanding their value? Is it possible to effectively combine educational and research objectives when performing empirical studies in a student setting? Can empirical studies with students be a natural stepping stone in technology transfer? Empirical software engineering has established itself as an important area within software engineering. However, it still remains to effectively introduce it into computer science and software engineering curricula, and to address several challenges in relation to the introduction.

5. References [1] Mendonca, Manoel, and Nancy L. Sunderhaft. "Mining software engineering data: A survey." A DACS state-ofthe-art report, Data & Analysis Center for Software, Rome, NY (1999). [2] Tripathi, Ambika, Savita Dabral, and Ashish Surekay. "University-industry collaboration and open source software (OSS) dataset in mining software repositories (MSR) research." Software Analytics (SWAN), 2015 IEEE 1st International Workshop on. IEEE, 2015. [3] Moeyersoms, Julie, et al. "Comprehensible software fault and effort prediction: A data mining approach." Journal of Systems and Software 100 (2015): 80-90. [4] Di Penta, Massimiliano, et al. "Five days of empirical software engineering: the PASED experience." Proceedings of the 34th International Conference on Software Engineering. IEEE Press, 2012. [5] Wohlin, Claes. "Empirical software engineering: teaching methods and conducting studies." Empirical Software Engineering Issues. Critical Assessment and Future Directions. Springer Berlin Heidelberg, 2007. 135-142. [6] Selby, Richard W., and Adam A. Porter. "Learning from examples: generation and evaluation of decision trees for software resource analysis." Software Engineering, IEEE Transactions on 14.12 (1988): 1743-1757. [7] Sun, Zhongbin, Qinbao Song, and Xiaoyan Zhu. "Using coding-based ensemble learning to improve software defect prediction." Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on 42.6 (2012): 1806-1817. [8] Basili, V.; Rombach, D.; Schneider, K.; Kitchenham, B.; Pfahl, D.; Selby, R. (Eds.),Empirical Software Engineering Issues. Critical Assessment and Future Directions, Springer-Verlag, 2007, ISBN 978-3-54071300-5.

33856

International Journal of Applied Engineering Research ISSN 0973-4562 Volume 10, Number 13 (2015) pp 33853-33857 © Research India Publications. http://www.ripublication.com [9] Dick, Scott, et al. "Data mining in software metrics databases." Fuzzy Sets and Systems 145.1 (2004): 81110. [10] Antoniol, Giuliano, et al. "Is it a bug or an enhancement?: a text-based approach to classify change requests." Proceedings of the 2008 conference of the center for advanced studies on collaborative research: meeting of minds. ACM, 2008. [11] Bird, Christian, et al. "Fair and balanced?: bias in bug-fix datasets." Proceedings of the the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering. ACM, 2009. [12] Wohlin, Claes, et al. Experimentation in software engineering. Springer Science & Business Media, 2012.

33857