Int. J. Information and Decision Sciences, Vol. 2, No. 3, 2010
The application of intelligent and soft-computing techniques to software engineering problems: a review

Ramakanta Mohanty
Computer Science Department, Berhampur University, Berhampur 760 007, Orissa, India
E-mail: [email protected]

Vadlamani Ravi*
Institute for Development and Research in Banking Technology, Castle Hills Road #1, Masab Tank, Hyderabad 500 057, AP, India
E-mail: [email protected]
*Corresponding author

Manas Ranjan Patra
Computer Science Department, Berhampur University, Berhampur 760 007, Orissa, India
E-mail: [email protected]

Abstract: This paper presents a comprehensive review of the work done during 1990–2008 on the application of intelligent techniques to solve software engineering (SE) problems. The review is categorised according to the type of intelligent technique applied, viz. (1) neural networks (NNs), (2) fuzzy logic, (3) genetic algorithm, (4) decision tree, (5) case-based reasoning and (6) other techniques subsumed under soft computing. Further, the source of the dataset and the results, whenever available, are also provided. We find that NNs are the most often used non-parametric method in SE and that there exists immense scope to apply other equally well-known methods such as fuzzy logic, decision trees and rough sets. The review will be useful to researchers as a starting point, as it provides important future research directions, and to practitioners as well. This would eventually lead to better decision making in SE, thereby ensuring better, more reliable and cost-effective software products.

Keywords: SE; software engineering; software reliability prediction; software development cost estimation; NNs; neural networks; GA; genetic algorithm; decision tree; fuzzy logic; CBR; case-based reasoning; soft computing.

Reference to this paper should be made as follows: Mohanty, R., Ravi, V. and Patra, M.R. (2010) 'The application of intelligent and soft-computing techniques to software engineering problems: a review', Int. J. Information and Decision Sciences, Vol. 2, No. 3, pp.233–272.

Copyright © 2010 Inderscience Enterprises Ltd.
Biographical notes: Ramakanta Mohanty holds an MTech in Software Engineering. His research interests include software engineering, neural networks, web services, soft computing, data mining and grid computing. He is a Research Scholar at Berhampur University, Berhampur, Orissa, India.

Vadlamani Ravi has been an Assistant Professor at IDRBT, Hyderabad since April 2005. He holds a PhD in Soft Computing from Osmania University, Hyderabad and RWTH Aachen, Germany. Earlier, he was a Faculty Member at NUS, Singapore for three years. He has published 70 papers in refereed journals/conferences and invited book chapters. He edited 'Advances in Banking Technology and Management: Impacts of ICT and CRM', published by IGI Global, USA. He is a Referee for several international journals and is on the Editorial Board of IJIDS, IJDATS, IJISSS, IJITPM and IJSDS. He is listed in Marquis Who's Who in the World 2009, 2010.

Manas Ranjan Patra is an Associate Professor in the Department of Computer Science at Berhampur University, Orissa, India. He obtained his PhD in Computer Science from the University of Hyderabad, Hyderabad, India in 2002. He was a United Nations Fellow at the International Institute of Software Technology, Macau in 2000. He has 45 research publications to his credit, which include 3 book chapters. He has chaired many national and international conferences. Further, he is on the editorial committee of many international conference proceedings. His research interests include service-oriented computing, software engineering, intelligent agents and the application of data mining to intrusion detection.
1 Introduction
The software engineering (SE) community is increasingly recognising the value of empirical evidence to support research and practice. Empirical evidence provides a means to evaluate the utility of promising research areas and to help practitioners make informed technology adoption decisions. It aims to provide SE researchers and practitioners with appropriate knowledge and training in different methods and techniques to design, execute, analyse and report empirical research (Pedrycz, 2002). The goal of this review is to introduce a wide range of qualitative and quantitative methods, which can be used to gather and use empirical evidence to evaluate various SE technologies (i.e. methods, techniques and tools). It will also demonstrate different ways of applying empirical approaches to guide SE research (Basili, 1996). The review will provide a general introduction to intelligent techniques followed by a detailed discussion of controlled experiments, surveys, case studies, action research and systematic reviews. The review will also evaluate the strengths and limitations of different methods and techniques.

Empirical software engineering (ESE) provides a forum for applied SE research with a strong empirical component, and a venue for publishing empirical results relevant to both researchers and practitioners. Empirical studies usually involve the collection and analysis of data and experience that can be used to characterise, evaluate and reveal relationships between software development deliverables, practices and technologies. Over time, it is expected that such empirical results will form a body of knowledge leading to widely accepted and well-formed theories. ESE can also consist of
case studies, field studies or industrial data from development and maintenance, and anecdotal evidence gathered as a by-product of an SE task. Data gathered from ESE can describe individual or group behaviour; it can be qualitative or quantitative and can be related to human, technical, economic or educational aspects of SE (Basili, 1996).

Software has been a major enabling technology for advancing modern society, and is now an indispensable part of daily life. Because of the increased complexity of these software systems, and their societal role, more effective software development and analysis technologies are needed (He and Yu, 2007). An important advantage of machine learning (ML) over statistical analysis as a modelling technique lies in the fact that the interpretation of production rules is more straightforward and intelligible to human beings than principal components and patterns with numbers that represent their meanings. Further, all the intelligent techniques are non-parametric, unlike their statistical counterparts.

A reliable and accurate estimate of software reliability and software development effort has always been a challenge for both the software industry and the academic community. It is well known that software project management teams can greatly benefit from knowing the estimated cost of their software projects. The benefits can have greater impact if accurate cost estimations are deduced during the early stages of the project life cycle. Estimating the process and performance behaviour of a software project, early in the software life cycle, is very challenging and often very difficult. Top-down and bottom-up are traditionally the most commonly used techniques to estimate the effort of software projects (AIAA, 1992; Lyu and Nikora, 1992; Moody and Darken, 1989).

In SE, entities are categorised into three categories: processes, products and resources. Processes are software-related activities, such as constructing a specification, detailed design or testing. Products refer to the outputs of a particular process, that is, deliverables and documents that result from a process activity, such as a specification document, a design document or a segment of code. Resources are entities required by a process to do a particular task, such as personnel, software tools or hardware. The aforementioned entities have internal and external attributes. Internal attributes describe an entity itself, whereas external attributes characterise the behaviour of an entity. ML methods have been utilised to develop better software products, to be part of software products, and to make the software development process more efficient and effective (He and Yu, 2007; Zhang and Tsai, 2007). We present a partial list of SE areas where machine-learning techniques have found applications (Zhang and Tsai, 2007):
•	Predicting or estimating measurements of internal or external attributes of processes, products or resources. These include: software quality, software size, software development cost, project or software effort, maintenance task effort, software resources, correction cost, software reliability, software defects, reusability, software release timing, productivity, execution times and testability of program modules.

•	Discovering either internal or external properties of processes, products or resources. These include: loop invariants, objects in programs, boundaries of normal operations, equivalent mutants, process models and aspects in aspect-oriented programming.

•	Transforming products to accomplish some desirable or improved external attributes. These include: transforming serial programs to parallel ones, improving software modularity and mapping OO applications to heterogeneous distributed environments.
•	Synthesising or generating various products, which include test data, test resources, project management rules, software agents, design repair knowledge, design schemes, data structures, programs/scripts, project management schedules and information graphics.

•	Reusing products or processes. These include: similarity computing, active browsing, cost of rework, knowledge representation, locating and adapting software to specifications, generalising program abstractions and clustering of components.

•	Enhancing processes. These include: deriving specifications of system goals and requirements, extracting specifications from software, acquiring knowledge for specification refinement and augmentation, and acquiring and maintaining specifications consistent with scenarios (Senyard et al., 2003).

•	Managing products. These include: collecting and managing software development knowledge, and maintaining software process knowledge.
The rest of this paper is organised in the following manner. In Section 2, an overview of the review methodology is presented. Section 3 briefly describes the various stand-alone intelligent methods reviewed in this paper. Section 4 presents some of the important insights that emerged from the review. Section 5 suggests some future directions in the field and concludes the review.
2 Review methodology
The review is conducted in two broad areas:
1	stand-alone intelligent techniques
2	hybrid techniques.

The intelligent techniques covered in the study belong to
1	different neural network (NN) architectures, including the multilayer perceptron (MLP) and cascade correlation NN
2	fuzzy logic
3	genetic algorithm (GA)
4	decision tree
5	case-based reasoning (CBR)
6	soft computing (hybrid intelligent systems).

The other techniques are
1	analogy-based techniques
2	support vector machine
3	SOM, etc.
Table 1 presents the distribution of research papers in terms of the techniques applied and the SE problems solved. Further, from Table 2, where the journal-wise distribution of papers is presented, we found that the maximum number of papers appeared in IEEE Transactions on Software Engineering, followed by the Journal of Systems and Software. We did not include the papers presented in international conferences in this table.
Table 1	Distribution of various intelligent techniques applied to software engineering problems

Software reliability
	Neural networks (48 papers): Adnan et al. (2000), Aggarwal et al. (2005a), AIAA (1992), Cai et al. (1991, 2001), Dohi et al. (1999), Fahlman and Lebiere (1990), Guo and Lyu (2004), Gupta and Singh (2005), Ho et al. (2003), Houchman and Hudepohl (1997), Huang et al. (2003a,b, 2008), Kanamani et al. (2007), Karunanithi et al. (1991, 1992a,b), Khoshgoftaar and Szabo (1994), Khoshgoftaar et al. (1992, 1993, 1997, 2000, 2003), Lyu and Nikora (1992), Lyu (1996), Matsumoto et al. (1988), Musa et al. (1987), Musa (1998), Moody and Darken (1989), Ohba et al. (1982), Ohba (1984a,b), Pai and Hong (2006), Patra (2003), Park et al. (2006), Perlovsky (2000), Quah and Thwin (2004), Rajkiran and Ravi (2007b), Ravi et al. (in press), Sitte (1999), Singpurwalla and Soyer (1985), Specht (1990), Su et al. (2007), Thwin and Quah (2005), Tian and Noore (2005a,b) and Yamada and Osaki (1983a,b, 1985)
	Fuzzy logic (8 papers): Chen (2008), Cimpan and Oquendo (2003), Lee et al. (2002), Rine et al. (1999), So et al. (2002), Zadeh (1994), Zhao et al. (2003) and Zimmermann (1996)
	Genetic algorithm (9 papers): Briand et al. (2002), Costa et al. (2005), Dai et al. (2003), Goldberg (1989, 1993), Holland (1975), Jiang (2006), Minohara and Tohma (1995) and Purnapraja et al. (2006)
	Decision tree (6 papers): Gehrke et al. (1999), Khoshgoftaar et al. (1997), Kokol et al. (2001), Quinlan (1987), Rud (2000) and Selby and Porter (1988)
	Case-based reasoning (4 papers): Aamodt and Plaza (1994), Kolodner (1992, 1993) and Watson and Marir (1994)
	Soft computing (6 papers): Baisch and Liedtke (1997), Chiu and Huang (2007), Mertoguno et al. (1996), Oh et al. (2004), Rajkiran and Ravi (2007b) and Zhang and Tsai (2007)

Software cost/effort estimation
	Neural networks (13 papers): Dillibabu and Krishnainah (2005), Gray (1999), Heiat (2002), Hsu and Tenorio (1991), Idri et al. (2004), Jiang (2006), Leung (2000), Park and Beak (2008), Tronto et al. (2007), Venkatachalam (1993), Vinay Kumar et al. (2008), Wittig and Finnie (1997) and Zadeh (1965)
	Fuzzy logic (6 papers): Ahmed et al. (2005), Bajaj et al. (2006), Fahlman and Lebiere (1990), Musflek et al. (2000), Xu and Khosgoftaar (2004) and Yuan et al. (2000)
	Genetic algorithm (3 papers): Burgess and Lefley (2001), Huang et al. (2007a,b) and Yang et al. (2006)
	CBR (2 papers): Delany and Cunningham (2000) and Sakran (2006)
	Soft computing (7 papers): Aggarwal et al. (2005b), Huang et al. (2003a, 2007a), Pedrycz et al. (2007), Shukla (2000), Tadayon (2004) and Vinay Kumar et al. (2009)
Software project management
	Neural networks (1 paper): Idri et al. (2002)

Software testing
	Neural networks (1 paper): Khoshgoftaar and Szabo (1996)
	Fuzzy logic: –

Other areas of software engineering
	Other techniques (SVM, SOM, analogy-based technique, etc.) (16 papers): Basili (1996), Bellettins et al. (2007), Chen and Rine (2003), Cristiamini and Taylor (2000), Gersho and Gary (1992), He and Yu (2007), Kohonen (1997), Pawlak (1982), Pedrycz (2002), Reformat et al. (2007), Russel and Norvig (2003), Rumelhart et al. (1986), Senyard et al. (2003), Sheppered and Schofield (1997), Twala et al. (2007) and Vapnik and Haykin (1998)
Table 2	Distribution of papers in various refereed journals

S. No.	Name of the journal	Total number of papers reviewed
1	IEEE Transactions on Software Engineering	18
2	IEEE Transactions on Software Reliability	04
3	IEEE Transactions on Fuzzy Systems	01
4	IEEE Transactions on Neural Networks	01
5	Journal of Systems and Software	17
6	European Journal of Operational Research	02
7	Expert Systems with Applications	06
8	Information and Software Technology	13
9	Fuzzy Sets and Systems	08
10	Annals of Software Engineering	01
11	IEEE Application of Software Engineering	01
12	ACM SIGSOFT Software Engineering	10
13	Int. J. Project Management	02
In this context, one may note that all the intelligent techniques barring fuzzy logic are data-driven in nature. All of them possess some merits and demerits, which are presented in Table 3. A cursory glance at Table 3 indicates when each technique is applicable. Under each category of intelligent technique, the papers are reviewed in chronological order. Thus, the important dimension of the present review is the type of technique applied. The review is conducted across other dimensions as well, such as
1	the source of data
2	the results obtained in each study.
Table 3	Merits and demerits of various intelligent techniques

1	FL
	Basic idea: Models imprecision and ambiguity in the data using fuzzy sets and incorporates human experiential knowledge into the model.
	Advantages: Good at deriving human-comprehensible fuzzy 'if-then' rules; it has low computational requirements.
	Disadvantages: An arbitrary choice of membership function skews the results, although the triangular shape is the most often used one. Secondly, the plethora of choices for membership function shapes, connectives for fuzzy sets and defuzzification operators is a disadvantage.

2	NN
	Basic idea: Learns from examples using several constructs and algorithms, just as a human being learns new things.
	Advantages: Good at function approximation, forecasting, classification, clustering and optimisation tasks, depending on the neural network architecture.
	Disadvantages: The determination of various parameters associated with training algorithms is not straightforward. Many neural network architectures need a lot of training data and training cycles (iterations).

3	GA
	Basic idea: Mimics Darwinian principles of evolution to solve highly non-linear, non-convex global optimisation problems.
	Advantages: Good at finding the global optimum of a highly non-linear, non-convex function without getting trapped in local minima.
	Disadvantages: Takes a long time to converge; may not always yield the global optimal solution unless it is augmented by a suitable direct search method.

4	CBR
	Basic idea: Learns from examples using the Euclidean distance and the k-nearest neighbour method.
	Advantages: Good for small datasets and when the data appear as cases; similar to human-like decision-making.
	Disadvantages: Cannot be applied to large datasets; poor in generalisation.

5	SVM
	Basic idea: Uses statistical learning theory to perform classification and regression tasks.
	Advantages: Yields a global optimal solution as the problem gets converted to a quadratic programming problem; can work well with few samples.
	Disadvantages: Selection of the kernel and its parameters is a tricky issue. It is abysmally slow in the test phase. It has high algorithmic complexity and requires extensive memory.

6	Decision trees
	Basic idea: Use a recursive partitioning technique and measures like entropy to induce decision trees on a dataset.
	Advantages: They yield human-comprehensible binary 'if-then' rules, although many of them can solve only classification problems while CART solves both classification and regression problems.
	Disadvantages: Over-fitting can be a problem. Like neural networks, they too require a lot of data samples in order to give reliable predictions.

7	SC
	Basic idea: Hybridises fuzzy logic, neural networks and genetic algorithms, etc. in several forms to derive the advantages of all of them.
	Advantages: Amplifies the advantages of the individual intelligent techniques while simultaneously nullifying their disadvantages.
	Disadvantages: Apparently, it has no disadvantages. However, it does require a good amount of data, which is not exactly a disadvantage nowadays.
Table 3 summarises the merits and demerits of the techniques applied to different areas of SE. Further, the review concentrated on peer-reviewed journals/international conferences/edited volumes in the areas of NNs, fuzzy logic, GAs, CBR and decision trees. Moreover, when a reviewed paper compares multiple techniques in their stand-alone mode, the technique proposed in that paper is taken as the main criterion and the paper is accordingly categorised in that family. For example, when a paper compares NN, FL and logit (logistic regression) in their stand-alone mode and proposes NN, then that paper is reviewed under the NN section. Incidentally, our work is different from many reviews that appeared recently and are not related to SE. To mention one such work, Chen (2008) reviewed various artificial intelligence techniques, viz. CBR, rule-based systems, NN, GA, cellular automata, fuzzy models, multi-agent systems, swarm intelligence, reinforcement learning and hybrid systems. They also briefly provided the mathematical background of each of these techniques and mentioned one application for each of them. However, the present review explicitly focuses on the applications of intelligent and soft-computing techniques to SE problems, whereas their paper is a very general one without focus on any particular domain.
2.1 Neural networks

NNs (Kolodner, 1993; Lyu, 1996) are named after the cells in the human brain that perform intelligent operations. The brain is made up of billions of neuron cells. Each of these cells is like a tiny computer with extremely limited capabilities. NNs are formed from hundreds or thousands of simulated neurons connected together in much the same way as the brain's neurons. A NN is a system of massively parallel, interconnected computing units called neurons, arranged in layers. NNs have found extensive applications in all areas of science and engineering in general and in ESE problems in particular. The multilayer perceptron (MLP) (Rumelhart et al., 1986), radial basis function network (RBFN) (Moody and Darken, 1989), probabilistic neural network (PNN) (Specht, 1990), cascade correlation (Cascor) NN (Fahlman and Lebiere, 1990), learning vector quantisation (LVQ) (Gersho and Gary, 1992) and self-organising feature map (SOM) (Kohonen, 1997) are some of the popular NN architectures. They differ in aspects including the type of learning, the node connection mechanism, the training algorithm, etc. According to Park et al. (2006), a NN is a powerful data analysis tool that is able to capture and represent complex input/output relationships. The motivation for the development of NN technology is to create an artificial system that can perform 'intelligent' tasks similar to those performed by the human brain. NNs resemble the human brain in the following two ways:
1	NNs acquire knowledge through learning
2	a NN's knowledge is stored within inter-neuron connection strengths known as synaptic weights.
The true power and advantage of NNs lies in their ability to represent both linear and non-linear relationships and learn these relationships directly from the data being modelled (Perlovsky, 2000). Traditional linear models are simply inadequate when it comes to modelling data that contains non-linear characteristics. The most common NN model is the MLP. This type of NN is known as a supervised network because it requires
a desired output in order to learn. The goal of this type of network is to create a model that correctly maps the input to the output using historical data, so that the model can then be used to produce the output when the desired output is unknown.
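The following is a minimal illustrative sketch, not a reproduction of any method reviewed here, of how an MLP 'stores' its knowledge in synaptic weights and adjusts them by learning from examples; the two-feature synthetic dataset, network size and learning rate are arbitrary choices made only for the illustration.

```python
# Minimal sketch of a one-hidden-layer MLP trained by back propagation.
# The dataset, layer sizes and learning rate are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))                  # two input features
y = (X[:, 0] * X[:, 1] > 0).astype(float).reshape(-1, 1)   # non-linear (XOR-like) target

W1 = rng.normal(0.0, 0.5, (2, 8)); b1 = np.zeros(8)        # synaptic weights: input -> hidden
W2 = rng.normal(0.0, 0.5, (8, 1)); b2 = np.zeros(1)        # synaptic weights: hidden -> output
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

lr = 1.0
for _ in range(5000):                                      # learning from examples
    h = sigmoid(X @ W1 + b1)                               # hidden-layer activations
    out = sigmoid(h @ W2 + b2)                             # network output
    grad_out = (out - y) * out * (1.0 - out)               # error signal at the output
    grad_h = (grad_out @ W2.T) * h * (1.0 - h)             # error back-propagated to the hidden layer
    W2 -= lr * h.T @ grad_out / len(X); b2 -= lr * grad_out.mean(axis=0)
    W1 -= lr * X.T @ grad_h / len(X);  b1 -= lr * grad_h.mean(axis=0)

pred = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
print("Training accuracy:", ((pred > 0.5) == (y > 0.5)).mean())
```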
2.2 Fuzzy set theory

Fuzzy logic is a powerful problem-solving methodology with a myriad of applications in embedded control and information processing. Fuzzy logic provides a remarkably simple way to draw definite conclusions from vague, ambiguous or imprecise information. In a sense, fuzzy logic resembles human decision making with its ability to work from approximate data and find precise solutions. Unlike classical logic, which requires a deep understanding of a system, exact equations and precise numeric values, fuzzy logic incorporates an alternative way of thinking, which allows modelling complex systems using a higher level of abstraction originating from our knowledge and experience. Fuzzy logic allows expressing this knowledge with subjective concepts such as very hot, bright red and a long time, which are mapped onto exact numeric ranges. Fuzzy logic has been gaining increasing acceptance during the past few years. There are over two thousand commercially available products using fuzzy logic, ranging from washing machines to high-speed trains. Nearly every application can potentially realise some of the benefits of fuzzy logic, such as performance, simplicity, lower cost and productivity.

Fuzzy set theory, proposed by Zadeh (1965), has found a number of applications. It is a theory of graded concepts. It provides a mathematical framework in which vague, conceptual phenomena can be rigorously studied (Zimmermann, 1996). Fuzzy logic models human experiential knowledge in any domain. When applied to solve process control or prediction problems, fuzzy logic takes the help of the knowledge of the domain expert and employs fuzzy mathematics to come out with fuzzy inference systems. Fuzzy logic can also be used to derive fuzzy 'if-then' rules from data to solve classification problems.
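To make the mapping of subjective concepts onto numeric ranges concrete, a minimal sketch of a two-rule fuzzy estimator follows, assuming triangular membership functions; the linguistic variables (project size, effort), their break points and the rule consequents are invented for illustration and are not drawn from any model reviewed later.

```python
# Minimal sketch of fuzzy 'if-then' rules with triangular membership functions.
# Variables, break points and consequents are illustrative assumptions only.

def triangular(x, a, b, c):
    """Triangular membership: 0 at a and c, 1 at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

size_kloc = 42.0
small = triangular(size_kloc, 0, 10, 50)     # degree of membership in 'small project'
large = triangular(size_kloc, 10, 100, 200)  # degree of membership in 'large project'

# Rule base: IF size is small THEN effort is low (10 pm); IF size is large THEN effort is high (80 pm).
# Defuzzify with a simple weighted average of the rule consequents.
effort = (small * 10 + large * 80) / (small + large)
print(f"small={small:.2f}, large={large:.2f}, estimated effort={effort:.1f} person-months")
```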
2.3 Genetic algorithm

GAs were developed as an alternative technique for tackling general optimisation problems with large search spaces. They have the advantage that they do not need prior knowledge, expertise or logic related to the particular problem being solved, yet for most problems with a large search space a good approximation to the optimum is a likely outcome. The basic ideas are based on the Darwinian theory of evolution, which in essence says that genetic operations between chromosomes eventually lead to fitter individuals that are more likely to survive. Thus, over a long period of time, the population of a species as a whole improves.

The GA proposed by Holland (1975) is the most popular evolutionary algorithm. Solutions to a given problem are encoded into chromosome-like data structures called genotypes. The objective function of the problem to be solved is encoded into a special function called the fitness function. This function embraces all requirements imposed on solutions to the problem. The genotypes, after being decoded into phenotypes, that is, solutions in the problem domain, constitute the inputs to the fitness function. They are evaluated based on their ability to solve the problem. The result of evaluation is used in a selection process that favours individuals with higher fitness values. The selection process can be performed in numerous ways. One of them is called stochastic sampling with replacement. In this method, the entire population is
mapped onto a roulette wheel, where each individual is represented by an area proportional to its fitness. Repetitive spinning of the roulette wheel chooses the individuals of the next population. The basic process of the GA, sketched in code at the end of this subsection, is as follows:
1	generate at random a population of solutions, that is, a family of chromosomes
2	create a new population from the previous one by applying genetic operators to the fittest chromosomes, or pairs of fittest chromosomes, of the previous population
3	repeat step (2) until either the fitness of the best solution has converged or a specified number of generations has been completed.

The best solution in the final generation is taken as the best approximation to the optimum for that problem that can be attained in that run. The whole process is normally run a number of times, using different seeds for the pseudo-random number generator. The key parameters that have to be determined for any given problem are
1	the best way of representing a solution as a fixed-length binary string
2	the best combination of genetic operators; for GA, reproduction, crossover and mutation are the most common
3	choosing the best fitness function to measure the fitness of a solution
4	keeping enough diversity in the solutions in a population to allow the process to converge to the global optimum without converging prematurely to a local optimum (Purnapraja et al., 2006).
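A minimal sketch of the GA loop just described is given below; the fitness function (counting 1-bits in a fixed-length binary string), population size, operator rates and generation count are toy choices used only to make the steps concrete.

```python
# Minimal sketch of the basic GA loop: selection, crossover, mutation.
# Fitness function and parameter values are illustrative assumptions.
import random

random.seed(0)
LENGTH, POP_SIZE, GENERATIONS, MUTATION_RATE = 20, 30, 50, 0.02

def fitness(chromosome):            # fitness: number of 1-bits in the string
    return sum(chromosome)

def select(population):             # roulette-wheel (fitness-proportionate) selection
    weights = [fitness(c) + 1 for c in population]
    return random.choices(population, weights=weights, k=1)[0]

def crossover(parent_a, parent_b):  # single-point crossover
    point = random.randrange(1, LENGTH)
    return parent_a[:point] + parent_b[point:]

def mutate(chromosome):             # bit-flip mutation
    return [1 - g if random.random() < MUTATION_RATE else g for g in chromosome]

population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(POP_SIZE)]

best = max(population, key=fitness)
print("Best fitness:", fitness(best), "out of", LENGTH)
```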
2.4 Decision trees

Decision trees, proposed by Gehrke et al. (1999) and Quinlan (1987), form a part of ML, which is an important area of artificial intelligence. The majority of decision tree algorithms are used for solving classification problems. However, algorithms like classification and regression trees (CART) can be used for solving regression problems as well. All these algorithms induce a binary tree on the given training data, which in turn results in a set of 'if-then' rules. These rules can be used to solve the classification or regression problem. A number of algorithms are used for building decision trees, including chi-squared automatic interaction detection (CHAID), CART, QUEST and C5.0 (Rud, 2000).
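As an illustration of recursive partitioning and the resulting 'if-then' rules, the following minimal sketch assumes scikit-learn is available; the two-metric fault-proneness dataset is synthetic and the feature names are hypothetical.

```python
# Minimal sketch of inducing a decision tree and printing its 'if-then' rules.
# The synthetic data stand in for module metrics and a fault-prone flag.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)  # recursive partitioning
tree.fit(X, y)

# The induced binary tree can be printed as human-readable rules.
print(export_text(tree, feature_names=["metric_1", "metric_2"]))
```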
2.5 Case-based reasoning

CBR, a branch of artificial intelligence, is intuitively similar to the cognitive process humans follow in problem solving (Kolodner, 1992; Watson and Marir, 1994). CBR is a problem-solving paradigm that in many respects is fundamentally different from other major AI approaches. Instead of relying solely on general knowledge of a problem domain, or making associations along generalised relationships between problem descriptors and conclusions, CBR is able to utilise the specific knowledge of previously experienced, concrete problem situations (cases). A new problem is solved by finding a similar past case and reusing it in the new problem situation. A second important difference is that CBR is also an approach to incremental, sustained learning, since a new experience
is retained each time a problem has been solved, making it immediately available for future problems. The CBR field has grown rapidly over the last few years, leading to commercial tools and successful applications in daily use. When people confront a new problem, they often depend on past similar experiences and reuse or modify the solutions of these experiences to generate a possible answer for the problem at hand. The hallmark of CBR is its capability to give an explanation for its decision based on previous cases. Citing relevant previous experiences or cases is a way to justify a position in human decision-making (Kolodner, 1993). Comprehensibility of the decision is often crucial in solving software reliability/software cost estimation problems. When a company is identified as failing, CBR can give examples of similar companies that failed in the past as a justification for its prediction. The heart of CBR is the nearest neighbour algorithm described by Russell and Norvig (2003).
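The retrieve-and-reuse step at the heart of CBR can be sketched as follows; the tiny case base of (project metrics, effort) pairs and the choice of k are invented purely for illustration.

```python
# Minimal sketch of CBR retrieval with Euclidean distance and k nearest neighbours.
# The case base and the metrics are illustrative assumptions.
import math

case_base = [                       # (project metrics, known effort in person-months)
    ([10.0, 3.0, 1.0], 12.0),
    ([50.0, 8.0, 2.0], 60.0),
    ([55.0, 9.0, 2.0], 65.0),
    ([20.0, 4.0, 1.0], 22.0),
]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve_and_reuse(new_case, k=2):
    # Retrieve the k most similar past cases and reuse (average) their solutions.
    nearest = sorted(case_base, key=lambda case: euclidean(case[0], new_case))[:k]
    return sum(effort for _, effort in nearest) / k

print("Estimated effort:", retrieve_and_reuse([52.0, 8.5, 2.0]))
```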
2.6 Support vector machine (SVM)

SVM, introduced by Vapnik and Haykin (1998), uses a linear model to implement non-linear class boundaries by mapping input vectors non-linearly into a high-dimensional feature space. In the new space, an optimal separating hyperplane (OSH) is constructed. The training examples that are closest to the maximum-margin hyperplane are called support vectors. All other training examples are irrelevant for defining the binary class boundaries. SVM is simple enough to be analysed mathematically. In this sense, SVM may serve as a promising alternative combining the strengths of conventional statistical methods, which are more theory-driven and easy to analyse, and ML methods, which are more data-driven, distribution-free and robust. Recently, SVM has been applied to financial applications such as credit rating, where it was found to be comparable to, and even to outperform, other classifiers including BPNN, CBR, MDA and logit in terms of generalisation (Cristiamini and Taylor, 2000).
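A minimal sketch of an SVM classifier with an RBF kernel, assuming scikit-learn, is given below; the binary fault-proneness data are synthetic and the kernel parameters are default-like choices, not tuned values from any reviewed study.

```python
# Minimal sketch of an SVM classifier; data and parameters are illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=4, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# The kernel choice and its parameters (C, gamma) are the tricky part noted in Table 3.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)
print("Support vectors per class:", clf.n_support_, "Test accuracy:", clf.score(X_test, y_test))
```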
2.7 Soft computing

The paradigm of soft computing or computational intelligence refers to the seamless integration of different, seemingly unrelated, intelligent technologies such as fuzzy logic, NNs, GAs, ML (subsuming CBR and decision trees), rough set theory and probabilistic reasoning in various permutations and combinations to exploit their strengths. The term was coined by Zadeh (1994) in the early 1990s to distinguish these technologies from conventional 'hard computing', which is inspired by the mathematical methodologies of the physical sciences and focused upon precision, certainty and rigor, leaving little room for modelling error, judgement, ambiguity or compromise. In contrast, soft computing is driven by the idea that the gains achieved by precision and certainty are frequently not justified by their costs, whereas the inexact computation, heuristic reasoning and subjective decision making performed by human minds are adequate and sometimes superior for practical purposes in many contexts. Soft computing views the human mind as a role model and builds upon a mathematical formalisation of the cognitive processes that humans take for granted (Venkatachalam, 1993). Within the soft-computing paradigm, the predominant reason for the hybridisation of intelligent technologies is that they are found to be complementary rather than competitive in several aspects such as efficiency, fault and imprecision tolerance and learning from examples (Venkatachalam, 1993). Further, the resulting hybrid architectures tend to minimise the disadvantages of
the individual technologies while maximising their advantages. Some of the soft-computing architectures employed are neuro-fuzzy, fuzzy-neural, neuro-genetic, genetic-fuzzy, neuro-fuzzy-genetic, rough-neuro, etc. Multi-classification systems or ensemble classifiers are also treated as soft-computing systems.
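As one possible illustration of such a hybrid, the following sketch combines a crude evolutionary search (in the spirit of a GA, but without crossover) with an MLP to select a hidden-layer size; it assumes scikit-learn, and the data, population size and perturbation scheme are arbitrary illustrative choices rather than any published neuro-genetic design.

```python
# Minimal sketch of a neuro-genetic style hybrid: evolutionary search over the
# MLP hidden-layer size. Data, ranges and search settings are illustrative only.
import random
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor

random.seed(0)
X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

def fitness(hidden):
    # Fitness of a candidate architecture = cross-validated R^2 of the resulting MLP.
    model = MLPRegressor(hidden_layer_sizes=(hidden,), max_iter=500, random_state=0)
    return cross_val_score(model, X, y, cv=3).mean()

population = [random.randint(2, 30) for _ in range(6)]   # candidate hidden-layer sizes
for _ in range(3):                                        # a few "generations"
    population.sort(key=fitness, reverse=True)
    parents = population[:3]                              # keep the fittest candidates
    # Crude "mutation": perturb each parent; a full GA would also apply crossover.
    population = parents + [max(2, p + random.randint(-3, 3)) for p in parents]

print("Selected hidden-layer size:", max(population, key=fitness))
```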
3 Review of techniques applied to SE problems
3.1 NNs applied to predict software reliability

NNs have drawn people's attention in recent years and a good amount of effort has been devoted to them (Khoshgoftaar and Szabo, 1996; Khoshgoftaar et al., 1993, 1997; Khoshgoftaar and Rebours, 2003), since they can handle numerous factors in SE and can, in theory, approximate any non-linear continuous function. In these works, it was argued that the NN method could be applied to estimate the number of software defects, classify program modules and forecast the number of observed software failures, and that it often provided better results than existing methods. In this section, we briefly review the work carried out in the area of software reliability.

Cai et al. (1991) presented a review on software reliability modelling. The review discussed the different types of probabilistic software reliability models and their shortcomings. They developed a simple yet powerful fuzzy software reliability model, which was a powerful alternative to other methods. Then, Karunanithi et al. (1991) employed feed-forward NNs to predict software reliability. They used five different types of models, namely the exponential model, logarithmic model, inverse polynomial model, delayed S-shaped model and power model. They used three different datasets: Data 1 (Matsumoto et al., 1988), Data 2 (Musa et al., 1987) and Data 3 (Ohba, 1984a). These datasets represent the failure history of the systems. Each dataset consists of two observations: one representing the cumulative execution time and the other representing the accumulated number of faults observed. They trained the network with the execution time as the input and the observed fault count as the associated output response. They found that the NN models were consistent in prediction and that their performance was comparable to that of other parametric models. Later, Karunanithi et al. (1992b) illustrated the usefulness of connectionist models for software reliability prediction. The connectionist model was compared with five reliability growth models, viz. the exponential model, logarithmic model, delayed S-shaped model (Ohba, 1984a,b; Ohba et al., 1982), inverse polynomial model and power model. This connectionist approach offers easy construction of complex models and estimation of parameters, as well as good adaptability to different datasets. They used datasets collected from different software systems to compare the models. Based on the experiments, they showed that the scaled representation of input-output variables provided better accuracy than the binary coded (or grey coded) representation. Their experimental results showed that the connectionist networks had smaller end-point prediction errors than the parametric models. Thereafter, Karunanithi et al. (1992a) presented a solution to the scaling problem, in which they used a clipped linear unit in the output layer. With a clipped linear unit, the NN could predict positive values in any unbounded range. They demonstrated the applicability of this NN structure on three datasets. The predictive capability of NN models with clipped linear output units and networks with sigmoid output units was compared using three datasets: Data 1 (Ohba, 1984a), Data 2 (Ohba, 1984b) and Data 3 (Ohba et al., 1982). They used different types of analytical models such as
the logarithmic model, inverse polynomial model, power model, delayed S-shaped model and exponential model. Among the models, the Jordan network model exhibited better accuracy than the feed-forward model.

Subsequently, Khoshgoftaar et al. (1992) explored the use of NNs for predicting the number of faults in a program. They used static reliability modelling and compared its performance, in terms of prediction quality of fit, with that of regression models. They used a dataset obtained from an Ada development environment for the command and control of a military data-link communication system. They found that the absolute relative error of the NN was less than that of the regression model. Then, Khoshgoftaar and Szabo (1994) used principal component analysis (PCA) with NNs to improve predictive quality. They used regression modelling and NN modelling to predict the reliability and the number of faults. The datasets used for the prediction were collected from a large commercial software system. They developed three different regression models: a quality metric domain (QM) regression model, a design change request (DCR) metric regression model and a problem tracking request (PTR) model. They found that the regression models did not correspond well in linking complexity and faults. They developed two NN models: NNraw, trained using ten raw software complexity metrics as input, with 40 hidden and output neurons, and NNmod, trained using six domain metrics as input, with 24–40 hidden and output neurons. They also built QM, DCR and PTR NN models for the experiment. They found that the NN model using domain metrics as input outperformed both the regression models and the NN using raw input metrics. Further, Khoshgoftaar and Szabo (1996) used NNs to investigate the application of PCA to predict the number of faults. The datasets used for the prediction of faults were collected from commercial software systems. They extracted principal components from these measures. They trained two NNs, one with the observed (raw) data and the other with principal components. They compared the predictive quality of the two models and found that PCA yielded better results. Furthermore, Khoshgoftaar et al. (1997) reported a case study of NN modelling techniques developed for the early risk assessment of latent defects (EMERALD) for improving the reliability of telecommunication software products. They applied NN techniques to find fault-prone modules and also to detect the risk of operational problems. They used a variety of classification techniques to identify fault-prone software, including discriminant analysis, classification trees and pattern recognition. The dataset used for prediction was taken from a large telecommunication system. They compared the NN model with a non-discriminant model and found that the NN model provided better accuracy. Later, Khoshgoftaar et al. (2000) reported the implementation of CART to evaluate software quality models over several releases. The case study developed two classification tree models based on four consecutive releases of a very large telecommunication system. Model 1 used the measurements of the first release as the training dataset with 11 important predictors, while Model 2 used the measurements of the second release as the training dataset with 15 predictors. Analysis of the models yielded insight into various software development practices. They found that both models had accuracy that would be useful for predicting quality in new releases.
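To make the typical failure-history setup used in the studies above concrete (cumulative execution time as input, cumulative observed faults as output), the following is a minimal sketch assuming scikit-learn; the failure data are invented and are not taken from any of the cited datasets.

```python
# Minimal sketch of the single-input single-output reliability-growth setup:
# cumulative execution time in, cumulative observed faults out.
# The failure history below is invented, not one of the cited datasets.
import numpy as np
from sklearn.neural_network import MLPRegressor

exec_time = np.array([[1], [2], [4], [7], [11], [16], [22], [29]], dtype=float)  # cumulative hours
faults = np.array([2, 5, 9, 14, 18, 21, 23, 24], dtype=float)                    # cumulative faults

model = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=0)
model.fit(exec_time, faults)

# Predict the cumulative fault count at a future execution time.
print("Predicted faults at t=35:", model.predict([[35.0]])[0])
```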
Meanwhile, Houchman and Hudepohl (1997) compared the performance of discriminant analysis, the back propagation neural network (BPNN) and an evolutionary NN (ENN) approach on datasets collected from telecommunication systems with 12 million lines of code written in a Pascal-like proprietary language. PCA was carried out, and the dataset was split into training, testing and validation sets without significant loss of information.
The dataset contained 6,972 modules, which were classified as fault-prone or not based on the number of faults by using a data splitting algorithm. Based on the experimental results, they found that the ENNs showed superior performance over the range of data configurations used. On the other hand, Dohi et al. (1999) used MLP and recurrent NNs to estimate the software release time for the cost minimisation problem. They interpreted the cost minimisation problem as a graphical one and showed that it could be reduced to a time series forecasting problem. They used the datasets cited by Lyu (1996) and found that the predictive performance of the optimal software release time by MLP was better than that of the existing parametric SRGMs. Sitte (1999) compared the performance of two different methods of software reliability prediction, that is, NNs and recalibration for parametric models. Both methods were claimed to predict as well as conventional parametric models, but each method applied its own predictability measure, impeding a direct comparison. Sitte used a common predictability measure and common datasets for comparing the prediction results. NNs were not only much simpler to use than the recalibration method, but were equal or better trend predictors. The NN prediction was further improved by preparing the data with a running average, instead of the traditionally used averages of grouped data points. Adnan et al. (2000) reported the use of MLP for software reliability. The BPN algorithm was used for the NN implementation, which was composed of eight sets of data. They used SE failure data from NASA, which exhibited the features of reliable and sizable software. Cai et al. (2001) aimed to quantify software reliability status and behaviour by handling dynamic software reliability data. They used MLP to handle the time between successive software failures observed in successive time intervals. They further observed that:
1	the NN approach is more appropriate for handling datasets with 'smooth' trends than those with large fluctuations
2	the training results are in general much better than the test results
3	the empirical probability distribution of the test data resembles that of the training data.
A NN can qualitatively predict what it has learnt. The results showed that the NN approach does not generate satisfactory quantitative results for software reliability modelling. Patra (2003) used an MLP to predict long-term software failure. He estimated the long-term mean time to failure (MTTF) for the dataset, with the cumulative execution time as input and the cumulative number of failures as output. The results obtained through the proposed method were compared with the computer-aided software reliability estimation (CASRE) tool. He observed that the information provided by the MTTF was more accurate compared to existing methods. Ho et al. (2003) used modified Elman recurrent NNs in modelling and predicting software failures. They compared the recurrent architecture with the MLP and the Jordan recurrent model. The predictive performances were evaluated using the mean absolute deviation and directional change accuracy. The dataset used for analysing the failure data was taken from a complex military computer system. They compared the analytical models and NN models and found that the NN models outperformed the traditional software reliability models.

There is a variety of statistical techniques used in software quality modelling, and the models are based on the statistical relationship between measures of quality and measurements of software metrics. However, the relationship between static software metrics and quality factors is complex and non-linear. Quah and Thwin (2004) used the general regression neural network (GRNN) and the Ward NN to evaluate the capability of SQL
metrics for finding software development faults. Based on actual project defect data, the SQL metrics were empirically validated by analysing their relationship with the probability of fault detection across PL/SQL files. The SQL metrics were extracted from the Oracle PL/SQL code of a warehouse management database application system. The faults were collected from journal files that contain the documentation of all changes in the source files. The experimental data were collected from a set of warehouse management applications developed using the C, JAM and PL/SQL languages. They used PCA to identify the underlying orthogonal dimensions that explain the relations between variables in the dataset. From the experimental results, they found that SQL metrics should be useful in predicting faults in PL/SQL files.

One of the important issues in training NNs is generalisation. When the training set is small and deteriorated by random noise, the network becomes over-trained and fits the noise, and over-fitting the noisy data degrades the predictive accuracy of the network. Guo and Lyu (2004) proposed the use of the pseudo-inverse learning algorithm (PIL) for MLP with stacked generalisation, applied to SRGMs, to overcome the degradation of the predictive accuracy of the NNs. Their architecture has the same number of hidden neurons as the number of examples to be learned. The PIL algorithm eliminates learning errors by adding hidden layers. The PIL algorithm is feed-forward and fully automated, and includes no critical user-dependent parameters such as learning rate or momentum constants. During learning, the previously trained weights in the network are not changed. The learning errors are minimised separately on each layer instead of globally for the network as a whole. The algorithm was more effective than the standard BPN and other gradient descent algorithms. The algorithm was tested on case studies with stacked generalisation applied to software reliability growth modelling data. They used the Sys 1 and Sys 2 (Dohi et al., 1999) software failure datasets for the above experiment.

Most of the published literature used a single-input single-output NN architecture to build software reliability growth models (Karunanithi et al., 1991, 1992b). Recent studies focus on modelling software reliability based on a multiple-delayed-input single-output NN architecture. Tian and Noore (2005a) proposed an on-line adaptive software reliability prediction model using an evolutionary connectionist approach based on a multiple-delayed-input single-output architecture. They modelled the interrelationship among software failure time data instead of the relationship between failure sequence number and failure time data. Desiring an online failure prediction process, a GA was used to globally optimise the NN. The optimisation process determined the number of delayed input neurons corresponding to the previous failure time data sequence and the number of neurons in the hidden layer, as well as the training algorithm. They used four real-time control and flight dynamic application datasets. The experimental results outperformed the existing NN models for failure time prediction. Thereafter, Tian and Noore (2005b) proposed a NN modelling approach for software cumulative failure time prediction based on the multiple-delayed-input single-output architecture. A GA was used to optimise the number of delayed input neurons and the number of neurons in the hidden layers of the NN architecture.
They used a modification of the Levenberg-Marquardt algorithm with Bayesian regularisation to improve the ability to predict software cumulative failure time. They compared the prediction performance with existing approaches using real-time control and flight dynamic application datasets. They found that the proposed approach yielded better accuracy in predicting software failure time.
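The multiple-delayed-input single-output arrangement can be sketched as follows: the previous k failure times serve as inputs for predicting the next one. The failure-time series and the lag count k below are invented for illustration; the cited works additionally optimise the lag count and hidden-layer size with a GA, which is omitted here.

```python
# Minimal sketch of the multiple-delayed-input single-output setup: the previous
# k failure times predict the next one. The series and k are illustrative.
import numpy as np
from sklearn.neural_network import MLPRegressor

failure_times = np.array([5, 9, 16, 21, 30, 38, 50, 61, 75, 92, 110, 131], dtype=float)
k = 3  # number of delayed inputs

# Build (lagged inputs, next value) pairs from the series.
X = np.array([failure_times[i:i + k] for i in range(len(failure_times) - k)])
y = failure_times[k:]

model = MLPRegressor(hidden_layer_sizes=(6,), max_iter=5000, random_state=0)
model.fit(X, y)

# Predict the next failure time from the last k observed ones.
print("Next predicted failure time:", model.predict([failure_times[-k:]])[0])
```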
Aggarwal et al. (2005a) evaluated various training algorithms for a NN model and showed which algorithm was best suited for SE applications. They compared a number of algorithms such as bias learning rules, the BFGS quasi-Newton method, Levenberg-Marquardt optimisation, BPN with the Powell-Beale algorithm, gradient descent with adaptive learning and a Bayesian algorithm. The ISBSG dataset was used to train and test the NN. They found that the Bayesian regularisation algorithm yielded the best results.

Many object-oriented metrics for software quality prediction have been proposed in the last decade. Thwin and Quah (2005) presented the application of NNs to software quality estimation using object-oriented metrics. They investigated two kinds of prediction: the first was predicting the number of defects in a class and the second was predicting the number of lines changed per class. They used two types of NN models, namely the Ward NN and the GRNN. Object-oriented design metrics concerning inheritance-related measures, complexity measures, cohesion measures, coupling measures and memory allocation measures were used as the independent variables. The NN models aimed to predict object-oriented software quality by estimating the number of faults and the number of lines changed per class, using both object-oriented metrics and traditional complexity metrics. The Ward NN is a BPN network with different activation functions: hidden layer slabs are applied to detect different features in the mid-range of the data, and a Gaussian complement function in other hidden slabs detects the upper and lower extremes of the data, so that the output layer gets different views of the data. Combining the two feature sets in the output layer leads to a better prediction. They used the quality evaluation system (QUES) dataset for the experiment. From the experiment, it was found that the GRNN model predicted more accurately than the Ward NN model.

There has been extensive work on measuring reliability using the mean time between failures and the MTTF. Gupta and Singh (2005) estimated software reliability by taking the basic execution time model. In this model, they estimated the reliability after predicting the number of faults by using the pattern mapping technique of artificial NNs. The pattern mapping showed the interpolative and extrapolative behaviour of NNs. Su et al. (2007) proposed a new NN for software reliability estimation. They predicted software reliability by designing a dynamic weighted combinational model (DWCM), which was demonstrated on real software failure datasets taken from Musa et al. (1987). They compared the DWCM with the Goel-Okumoto model and the Yamada delayed S-shaped model (Yamada and Osaki, 1983a,b, 1985) and found that the DWCM outperformed them. Kanamani et al. (2007) used MLP and PNN as software fault prediction models using object-oriented metrics. The results obtained by these models were compared with those of statistical methods (discriminant analysis and logistic regression) by computing five model quality parameters. Between the two, the PNN was highly robust in nature, which was studied through the quality parameters viz. misclassification rate, correctness, effectiveness and efficiency. Recently, Rajkiran and Ravi (2007b) proposed the use of wavelet neural networks (WNN) to predict software reliability.
In the WNN, they employed two kinds of wavelets, the Morlet wavelet and the Gaussian wavelet, as transfer functions, resulting in two variants of WNN. They compared the performance of the WNN with that of multiple linear regression (MLR), multivariate adaptive regression splines (MARS), MLP, threshold accepting
trained neural network (TANN), PI-sigma network (PSN), GRNN, dynamic evolving neuro-fuzzy inference system (DENFIS) and TreeNet in terms of normalised root mean square error (NRMSE) obtained on test data. They found that WNN outperformed all other techniques.
3.2 NNs on software effort estimation/cost estimation

Due to the rapid increase in software development costs, software development imposes a heavy burden on companies seeking to implement enterprise information systems. Before undertaking costly software development processes, it is necessary to examine and evaluate the anticipated cost and profits of such projects. The accuracy of software development cost estimates plays an important role, both directly and indirectly, in a company's decision on whether to make an investment or not. Such accurate estimation has a prominent impact on the success of projects. Various methods of software development effort estimation have been proposed, such as line of code (LOC)-based models, the constructive cost model (COCOMO), function point (FP)-based regression models, NN models and case-based reasoning.

Hsu and Tenorio (1991) used MLP and SOM to construct SE models from observation data. Of the two experiments conducted, the first described the BPN and self-organising neural network (SONN) algorithms, which were used to develop effort estimation models for projects in the COCOMO database. The second experiment examined the ability of these models to predict the effort required for new projects. The experimental results showed that the NN techniques were superior in performance, learning time and modelling power. Venkatachalam (1993) explored software cost estimation using BPNN. The BPNN was constructed with 22 input nodes and two output nodes. The input nodes represented the distinguishing features of software projects and the output nodes represented the effort required in terms of person-months and the development time required to complete the project. The data used for estimation were taken from the COCOMO database. He found that the NN estimated software cost and development time accurately. Wittig and Finnie (1997) evaluated MLP for effort estimation on simulated data as well as actual data from commercial projects. The project data had large productivity variations, noise and missing data values, which enabled model evaluation under typical software development conditions. The results were encouraging, with the network showing an ability to estimate development effort within 25% of actual effort more than 75% of the time for one large commercial dataset. Later, Jun and Lee (2001) proposed a search method for finding the relevant level of effort estimation using MLP. They adopted the beam search technique and a case-set selection algorithm for effort estimation. For the selected case set, eliminating the quantitative input factors with the same value could reduce the scale of the MLP. They devised the case-set algorithm to classify the case sets and the corresponding NN models. Their proposal, using the qualitative input factors from the model, resulted in a reduced NN model. According to a paired t-test, the quasi-optimal case-selective NN model could reduce the errors significantly more than the full NN model. Further, Leung (2000) provided a general overview of software cost estimation methods. Since a number of these models rely on a software size estimate as input, he first provided an overview of common size metrics. He classified the models
into two categories, algorithmic and non-algorithmic, each with its own strengths and weaknesses. Consequently, Heiat (2002) compared the prediction performance of MLP and RBF networks with that of regression analysis for the estimation of software development effort. In the experiments, he used the Kemerer, IBM and Hallmark datasets, which include both third-generation and fourth-generation programming languages. He found that the combination of third-generation language datasets produced improved performance over conventional regression analysis in terms of mean absolute percentage error. However, Idri et al. (2002) interpreted the cost estimation model based on MLP. Their proposed idea mainly comprises the use of a method that maps the MLP to a fuzzy rule-based system. They used the COCOMO 81 dataset for their case study. Further, Idri et al. (2004) compared the performance of NN and RBFN models for software cost estimation. They found that the various NN architectures did not explain their estimates well owing to their shortcoming of being 'black box' models. They used the Benitez method to extract 'if-then' fuzzy rules. These fuzzy rules expressed the information encoded in the architecture of the network, and the interpretation of each fuzzy rule was determined by analysing its premise and its output. They also suggested another mapping method, viz. the Jang and Sun method, to extract 'if-then' fuzzy rules from the NN. This case study was based on the COCOMO '81 historical dataset.

Recent research in the area of software cost estimation has been growing rapidly owing to its practicality and the demand for it. Tadayon (2004) explored the use of expert judgement and NNs with reference to the COCOMO II approach to predict the cost of software. They found that the expert judgement system was the most common and successful technique used in software cost estimation. Further, Huang et al. (2003b) explored the accuracy of software effort estimation models established from clustered data using the International Software Benchmarking Standards Group (ISBSG) repository. The ordinary least squares (OLS) regression method was adopted to establish a respective effort estimation model for each cluster of the dataset. For building the effort estimation models, they clustered the data on different effort drivers, viz. function points (FP), maximum team size (MTS), development type (DT), development platform (DP), language type (LT), used methodology (UM), methodology acquired (MA) and application type (AT). In the feature selection stage, the statistical tests of Pearson correlation and one-way ANOVA were utilised to identify the significant effort drivers for building the software effort estimation model. They used three evaluation criteria, viz. MMRE, MdMRE and Pred(0.25). The empirical results showed that the software effort estimation models established from the homogeneous (clustered) datasets did not produce more accurate results than models established from the entire dataset.

Recently, Dillibabu and Krishnaiah (2005) employed the COCOMO II 2000 model to estimate cost in terms of the effort spent on embedded system projects. They implemented COCOMO II 2000 on 10 projects, of which 8 were development projects and 2 were porting projects. They collected the actual effort from the metrics database of the company. The lines of code were enumerated using a 'code count' tool to obtain the logical source lines of code for each project.
The calibration of COCOMO II 2000 was carried out using both a log approach and a curve-fitting approach, and the curve-fitting approach yielded better estimates of the model parameters. For the IT industry, the most important factor is to estimate accurately the effort and cost of a project in the initial stage of development. Park and Baek (2008)
proposed an NN-based effort estimation model in which they identified six input variables through expert interviews and regression analysis, and attempted to discover the combination of input variables that generates the most accurate effort estimates. To develop the NN-based effort estimation model, they collected 148 IT projects completed between 1999 and 2003 from Korean IT service vendors. They compared the performance of the NN-based model with human expert judgement and two regression models, and found that the NN-based model showed superior results compared to the existing regression analyses. Tronto et al. (2007) used MLP and stepwise regression-based predictive models to estimate the size, effort, cost and time spent in the development process for software project management. They compared the MLP and stepwise regression (SR)-based predictive models with the AFF, SLIM and COCOMO models and found that MLP and SR were competitive with the other models. Most recently, Vinay Kumar et al. (2008) proposed the use of WNN to forecast software development effort. They used two types of WNN, with the Morlet function and the Gaussian function as transfer functions, and also used the threshold acceptance training algorithm for the wavelet neural network (TAWNN). They compared WNN with other techniques, viz. MLP, RBFN, MLR, DENFIS and SVM, in terms of the mean magnitude of relative error (MMRE), using the Canadian Financial (CF) dataset and the IBM Data Processing Services (IBMDPS) dataset. From the experiments, they found that WNN-Morlet for the CF dataset and WNN-Gaussian for the IBMDPS dataset outperformed all the other techniques. Thereafter, Ravi et al. (2009) proposed WNN to predict operational risk in banks and firms by predicting software reliability. They employed two types of wavelet transfer function, viz. the Morlet wavelet and the Gaussian wavelet, and compared WNN with MLR, MARS, BPNN, TANN, threshold accepting trained wavelet neural network (TAWNN), PSN, GRNN, DENFIS and TreeNet in terms of normalised root mean square error (NRMSE). They observed that the NRMSE values of the different techniques gradually decreased with increasing lag number and that the WNN-based models outperformed all the individual techniques over all the lags for both the Gaussian-based and Morlet-based wavelet NNs.
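Several of the studies above compare effort estimation models using magnitude-of-relative-error criteria such as MMRE, MdMRE and Pred(0.25). The following minimal Python sketch is our own illustration of how these criteria are typically computed; the helper names and the toy effort values are purely hypothetical and are not taken from any of the cited papers.

    import numpy as np

    def relative_errors(actual, predicted):
        """Magnitude of relative error (MRE) for each project."""
        actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
        return np.abs(actual - predicted) / actual

    def mmre(actual, predicted):
        """Mean magnitude of relative error."""
        return relative_errors(actual, predicted).mean()

    def mdmre(actual, predicted):
        """Median magnitude of relative error."""
        return np.median(relative_errors(actual, predicted))

    def pred(actual, predicted, level=0.25):
        """Fraction of projects whose MRE is at most 'level', e.g. Pred(0.25)."""
        return (relative_errors(actual, predicted) <= level).mean()

    # Toy example: actual vs. predicted effort in person-months (illustrative values only)
    actual = [120, 45, 300, 80]
    predicted = [100, 50, 240, 95]
    print(mmre(actual, predicted), mdmre(actual, predicted), pred(actual, predicted))

A model is usually considered acceptable when MMRE is low and Pred(0.25) is high, i.e., when a large proportion of projects are estimated within 25% of their actual effort.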
3.3 Fuzzy logic approach

Uncertainty is present in our everyday lives, and it is also present in ESE. Real-world concepts transition smoothly into one another rather than abruptly. Therefore, fuzziness, as a means of modelling linguistic uncertainty, can very well be used to model real-world problems, including SE problems. The overall life cycle cost associated with product failure exceeds 10% of a corporation's yearly turnover, and a major factor contributing to this loss is the ineffective performance of software and system verification, validation and testing (VVT). In this review paper, we report the application of the fuzzy logic approach, as the second intelligent technique, to the areas of software reliability, software cost estimation, software effort estimation, etc., along with the datasets used by different authors. Rine et al. (1999) used a reusable software controller architecture for designing a specific domain of controllers (zero order, first order, …, nth order). The controller architecture was composed of two sub-models:
1 a model identifier, an on-line fuzzy rule-based system that took samples of the input and output signals and tried to approximate the behaviour of the plant model
2 a neuro-fuzzy controller, in which the fuzzy rule-based controller is realised using a feed-forward NN.
They built software modules incorporating automatic training and adaptive control technologies instead of the traditional manual approach. Fuzzy logic allowed the SE principles of modularity and testability to be applied, keeping components stable at the specification level while making them adaptable at the representation level. Yuan and Ganesan (2000) integrated fuzzy subtractive clustering with module-order modelling for software quality prediction. They predicted the number of faults using fuzzy subtractive clustering and also predicted whether modules were fault prone using module-order modelling. The dataset used for fault-proneness prediction came from a legacy telecommunication system. Different types of modelling techniques, such as discriminant power and optimal set reduction, have been used for software quality prediction. The membership function associated with a given fuzzy set was of linear, sigmoid or quadratic type. They found that fuzzy subtractive clustering could identify, with useful accuracy prior to release, modules whose faults would later be discovered by customers. So et al. (2002) proposed an automated, fuzzy logic-based approach to identify potentially error-prone software components using inspection data. The effectiveness of their approach was demonstrated on inspection data: the empirical evaluation on launch interceptor program (LIP) and conflict inspection data, as well as chi-square analysis, confirmed its validity. The advantage of this model is that it provides a natural mechanism for modelling fuzzy data; the rules used to determine error-prone modules were themselves fuzzy, a prototype system could be developed without extensive empirical data, and the system's performance could be continuously tuned as more inspection data became available. To understand the economics of software development, and thereby reduce software production cost overruns or even project cancellations, it is necessary to develop different models for software cost estimation. Musilek et al. (2000) estimated the effort/cost required for the development of software products with a fuzzy model. They used a fuzzy set-based generalisation of the COCOMO model (f-COCOMO), in which the size of the software was taken as a fuzzy set that yields the cost estimate. They estimated the software cost by analysing different membership functions, such as triangular and parabolic fuzzy sets. The f-COCOMO model emphasises the way uncertainty propagates and the visualisation of the resulting effect. Then, Zhao et al. (2003) used COCOMO-II for estimating software cost. They used lines of code and FP to assess software size, and employed the entity relationship (ER) model, used in conceptual modelling (requirements analysis) for data-intensive systems, for the estimation of software cost. However, Lee et al. (2002) focused on approaches especially devoted to extending fuzzy logic to SE along the following three directions:
1 applying fuzzy logic to requirements engineering
2 software quality prediction
3 fuzzy object oriented modelling.
They also reviewed current fuzzy approaches in SE that could provide powerful tools for requirements engineering, formal specification, software quality prediction and fuzzy object-oriented modelling. The quality of software products depends on the quality of the software development process. Cimpan and Oquendo (2003) used a fuzzy logic approach for monitoring
software process deviations. They implemented the online monitoring environment: general and adaptable (OMEGA) to monitor process models. The OMEGA monitoring definition language (MDL) was used to define monitoring models as well as the mechanism for executing such models. An executing monitoring model observed the process and detected deviations between it and the expected behaviour. The fuzzy logic-based monitoring allowed the level of conformance between the performed process and the process model to be established for different aspects of the process, such as progress, cost and structure (order between activities). Fuzzy logic enabled the handling of imprecise and uncertain as well as precise and certain information: fuzzy set theory was used for information representation, while possibility theory was used for reasoning over imprecise input information. They also used a conformance factor to measure similarity, expressing the extent to which the observed process behaved as expected. They employed a process for process monitoring (P2M), which consists of three phases, viz. definition, instantiation and enactment: the definition phase constructs the monitoring model, the instantiation phase relates it to the context of use, and the enactment phase entails the actual monitoring of a given software process. Moreover, Xu and Khoshgoftaar (2004) presented an innovative fuzzy identification cost estimation modelling technique to deal with linguistic data, which automatically generated fuzzy membership functions and rules. A case study based on the COCOMO 81 database compared the proposed model with all three COCOMO models, that is, basic, intermediate and detailed, and it was observed that the fuzzy identification model provided significantly better cost estimation than the three COCOMO models. In a recent study, Ahmed et al. (2005) presented an adaptive fuzzy logic framework for software effort prediction. The training and adaptation algorithms implemented in the framework tolerated imprecision, explained predictions through rules, incorporated experts' knowledge and offered transparency in the prediction system. The experiments were carried out on artificial datasets as well as the COCOMO database; they recorded an improvement in the performance of their training procedure in the presence of a counter-intuitive rule base and reduced RMSRE values during validation. Software estimation is a challenging activity, as no model can predict effort exactly, leading to cost and schedule overruns. Bajaj et al. (2006) used fuzzy set theory to extend the bottom-up approach to achieve better precision in the estimates. A bottom-up approach, based on estimating factors and factor counts, is advisable where a project is completely novel or no historical data are available. After the high-level design, the effort necessary to create the detailed design, code and unit tests was estimated for each component, with each team member responsible for estimating the complexity and effort of his or her assigned components and subcomponents.
Thereafter, Engel and Last (2007) proposed a set of quantitative probabilistic models for estimating the costs and risks stemming from carrying out any given VVT strategy. They demonstrated that quality costs in software-intensive projects are likely to consume as much as 60% of the total development cost. They suggested a new approach for fuzzy modelling of VVT, introducing the concept of a canonical VVT model to describe system processes and risk costs.
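To make the fuzzy-set machinery used in several of these studies concrete, the following minimal Python sketch is our own illustration of a triangular membership function of the kind employed in f-COCOMO-style models to represent an imprecise size estimate. The size range and the effort constants are hypothetical, and propagating the membership values pointwise is a simplification that is reasonable here only because the assumed effort relation is monotone in size.

    import numpy as np

    def triangular(x, a, b, c):
        """Triangular membership function with support [a, c] and peak at b."""
        x = np.asarray(x, dtype=float)
        left = (x - a) / (b - a)
        right = (c - x) / (c - b)
        return np.clip(np.minimum(left, right), 0.0, 1.0)

    # Imprecise size estimate: "about 40 KLOC", plausibly between 30 and 55 KLOC
    size = np.linspace(20, 70, 501)
    mu_size = triangular(size, 30, 40, 55)

    # Propagate through a COCOMO-like effort relation effort = A * size**B
    # (A and B are illustrative constants, not calibrated values)
    A, B = 2.8, 1.05
    effort = A * size ** B

    # Centroid (membership-weighted average) of the induced fuzzy effort as a crisp summary
    crisp_effort = np.sum(effort * mu_size) / np.sum(mu_size)
    print(f"Defuzzified effort estimate: {crisp_effort:.1f} person-months")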
3.4 Genetic algorithm

The GA (Goldberg, 1989, 1993) is a stochastic optimisation algorithm loosely based on the concepts of biological evolution. The application of GAs to combinatorial problems may require the development of new coding/representation schemes and fitness evaluation techniques that reflect the problem, or it may involve the use of conventional coding schemes and currently available operators. Burgess and Lefley (2001) critically evaluated the potential of genetic programming (GP) to improve the accuracy of software effort estimation. They compared GP with various statistical and ML techniques in terms of accuracy, with the aim of reducing inaccurate cost estimates, misleading tender bids and ineffective monitoring of progress. They used the Desharnais dataset of 81 software projects collected from a Canadian software house. The input variables were restricted to those available at the specification stage, and significant effort was put into the GP runs so that the comparison offered a realistic and fair evaluation. GA-based software integration investigates an approach to integrating software with minimum risk using a GA; this problem originally arose from the need to share common software components among various departments within the same organisation. Further, Huang et al. (2008) investigated the effect of appropriately weighted similarity measures of effort drivers on the estimation accuracy of analogy-based software effort estimation. Three weighted analogy methods, namely the unequally weighted, the linearly weighted and the non-linearly weighted methods, were investigated, using datasets obtained from the International Software Benchmarking Standards Group (ISBSG) repository and the IBM DP services database. The experimental results showed that using a GA to determine suitable weights for the similarity measure was a feasible approach to improving the accuracy of software effort estimates. Huang et al. (2007b) examined the potential of a software effort estimation model that integrates a GA with grey relational analysis (GRA); the GA was adopted to find the best-fitting weights for each software effort driver in the similarity measure. The experiments indicated that the GA-based approach obtained more precise estimates than CBR, CART and MLP. Minohara and Tohma (1995) applied a GA to the parameter estimation of the hyper-geometric distribution software reliability growth model (HGDM), using pseudo datasets for the estimation. Compared with calculus-based estimation methods, which impose restrictions on the evaluation function, the GA-based approach estimated the parameters effectively and removed those restrictions. Costa et al. (2005) explored GP as an alternative approach to derive SRGMs. GP is based on the idea of GAs and is acknowledged as a very suitable technique for regression problems; the main motivation for choosing GP for this task was its capability of learning from historical data. Consequently, Briand et al. (2002) implemented a GA to devise integration test orders in object-oriented systems so as to reduce integration testing effort. They performed a case study on a system that contained 21 classes and a large number of dependency cycles. They investigated the use of coupling measurement in the cost function (OCplx), comparing the OCplx values with the number of broken dependencies used as a cost function, with attribute coupling and method coupling used in turn within the cost function.
They concluded from the case studies that using coupling measurement might lead to significantly different test orders, involving less coupling to be broken and leading to
lower integration test expenditure. Software application integration can be categorised into database integration, operating environment integration, user interface integration and so on. Yang et al. (2006) formulated the software integration problem as a search problem and solved it using a GA, selecting the Derbyshire Fire and Rescue Services (DFRS) as a case study. Jiang (2006) addressed two major concerns in applying GAs in the SE domain: firstly, distinguishing the problems that are difficult for a GA to solve from those that are genetically solvable; secondly, identifying and classifying the problems in the SE domain and providing definitions of near-optimal solutions. Further, Dai et al. (2003) presented a GA for the testing resource allocation problem that can be used when the software system structure is complex and there are multiple objectives. Considering both system reliability and testing cost, they developed a GA to solve the complex problem of allocating testing resources in general parallel-series modular software systems with multiple objectives.
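As a simple illustration of how a GA can be combined with analogy-based effort estimation of the kind studied by Huang et al. (2008), the following Python sketch evolves feature weights for a weighted Euclidean similarity measure so as to minimise the leave-one-out MMRE. The project data, the GA operators and all parameter settings are our own illustrative assumptions and are not those of the cited study.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy project data: rows are projects, columns are (normalised) effort drivers;
    # 'effort' holds the known effort values. All values are illustrative only.
    X = rng.random((30, 4))
    effort = 20 + 80 * X @ np.array([0.5, 0.2, 0.2, 0.1]) + rng.normal(0, 2, 30)

    def analogy_estimate(weights, query_idx, k=3):
        """Estimate effort of one project from its k nearest neighbours
        under a weighted Euclidean distance (leave-one-out)."""
        d = np.sqrt(((X - X[query_idx]) ** 2 * weights).sum(axis=1))
        d[query_idx] = np.inf                  # exclude the project itself
        nearest = np.argsort(d)[:k]
        return effort[nearest].mean()

    def mmre(weights):
        """Mean magnitude of relative error of leave-one-out analogy estimates."""
        preds = np.array([analogy_estimate(weights, i) for i in range(len(X))])
        return np.mean(np.abs(effort - preds) / effort)

    def ga(pop_size=30, generations=40, mutation_sd=0.1):
        """A very small real-coded GA searching for feature weights that minimise MMRE."""
        pop = rng.random((pop_size, X.shape[1]))
        for _ in range(generations):
            fitness = np.array([mmre(w) for w in pop])
            # tournament selection of parents
            idx = rng.integers(0, pop_size, (pop_size, 2))
            parents = pop[np.where(fitness[idx[:, 0]] < fitness[idx[:, 1]], idx[:, 0], idx[:, 1])]
            # arithmetic crossover and Gaussian mutation
            partners = parents[rng.permutation(pop_size)]
            alpha = rng.random((pop_size, 1))
            children = alpha * parents + (1 - alpha) * partners
            children += rng.normal(0, mutation_sd, children.shape)
            children = np.clip(children, 0, 1)
            # elitism: keep the best individual of the current generation
            children[0] = pop[np.argmin(fitness)]
            pop = children
        fitness = np.array([mmre(w) for w in pop])
        return pop[np.argmin(fitness)], fitness.min()

    best_w, best_mmre = ga()
    print("best weights:", np.round(best_w, 2), "MMRE:", round(best_mmre, 3))

In practice, the weights would be searched on a historical project repository and validated on held-out projects rather than on the same leave-one-out estimates.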
3.5 Case-based reasoning

CBR (Kolodner, 1992, 1993) is a relatively simple concept. It involves matching the current problem against ones that have already been encountered in the same context in the past. It can be represented as a cyclical process divided into four sub-processes, as presented by Aamodt and Plaza (1994):
1 retrieve the most similar case or cases from the case base
2 reuse the case to solve the problem
3 revise the proposed solution if necessary
4 retain the solution for future problem solving.
Therefore, CBR is a problem-solving technique based on the reuse of past experience. For this reason, there is considerable optimism about its use in difficult problem-solving areas where human expertise is evidently experience based. Delany and Cunningham (2000) assessed the applicability of CBR to the difficult problem of early software project cost estimation. They suggested that the objective should be risk assessment rather than cost estimation, because a full case representation is not available early in the project, and they identified and presented a case representation capturing the prediction features available for early estimation. They proposed a measure called the productivity coefficient, which gives a measure of the potential risk revealed by comparing the characteristics of a project with previous experience. They implemented CBR for software cost estimation early in the project life cycle and showed that a suitable case representation could be drawn from the wider literature on cost estimation. Since it is impossible to identify, early in the life cycle, the features that define the size of a project, they proposed instead to pursue a solution that produces an estimate of the potential risk of a specific project based on previous experience. Accurate software cost estimation is a vital task that affects a firm's software investment decision before it commits the required resources to a project or bids for projects. Sakran (2006) proposed an improved CBR approach integrated with multi-agent technology to retrieve similar projects from multi-organisational distributed
datasets. He explored the possibility of building a software cost estimation model by collecting software cost data from distributed, predefined project cost databases.
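As an illustration of the retrieve-reuse-revise-retain cycle in an effort estimation setting, the following minimal Python sketch is entirely our own: the case data are hypothetical and the reuse step (scaling the retrieved effort by relative size) is a deliberately naive stand-in for the adaptation an expert would perform in the revise step.

    from dataclasses import dataclass, field
    from math import dist

    @dataclass
    class Case:
        features: tuple      # e.g. (size in KFP, team size, duration in months)
        effort: float        # actual effort in person-months

    @dataclass
    class CaseBase:
        cases: list = field(default_factory=list)

        def retrieve(self, query):
            """Retrieve: return the most similar past case (Euclidean distance)."""
            return min(self.cases, key=lambda c: dist(c.features, query))

        def reuse(self, case, query):
            """Reuse: adapt the retrieved effort by simple size scaling."""
            return case.effort * (query[0] / case.features[0])

        def retain(self, features, actual_effort):
            """Retain: store the solved problem as a new case."""
            self.cases.append(Case(features, actual_effort))

    cb = CaseBase()
    cb.retain((10.0, 5, 6), 120.0)       # hypothetical completed projects
    cb.retain((25.0, 8, 10), 310.0)

    query = (12.0, 5, 7)
    best = cb.retrieve(query)
    estimate = cb.reuse(best, query)     # revise: an expert would adjust this figure
    print(f"Estimated effort: {estimate:.0f} person-months")
    cb.retain(query, 150.0)              # once the project finishes, retain the real effort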
3.6 Decision tree

Solutions to the problem of learning from examples will have far-reaching benefits, and therefore the problem is one of the most widely studied in the field of ML (Selby and Porter, 1988). The decision tree-based approach provides insights through in-depth empirical characterisation and evaluation of decision trees for one problem domain, software resource data analysis. The purpose of the decision trees is to identify classes of objects (software modules) that had 'high' development effort or faults, where 'high' was defined as being in the uppermost quartile relative to past data. Selby and Porter (1988) presented such an analysis of automatic decision tree generation; the purpose of the decision tree was to identify certain classes of objects, such as software modules that are likely to be fault prone or costly to develop. The analysis focused on the characterisation and evaluation of decision tree accuracy, complexity and composition. On average across all 960 trees, the decision trees correctly identified 79.3% of the software modules that had high development effort or faults, while the trees generated under the best parameter conditions correctly identified 88.4% of the modules on average. Kokol et al. (2001) presented the use of evolutionary decision trees as a fault-predictive approach. They used different types of software complexity metrics, together with a fractal software metric, as the attributes for learning evolutionary decision trees, and reported that the fractal metric, used as an attribute alongside other software complexity measures, could successfully induce decision trees for predicting dangerous modules. Redesigning such modules, or devoting more testing or maintenance effort to them, can greatly enhance quality and reliability. Khoshgoftaar et al. (2000) employed the CART algorithm to evaluate software quality models over several releases. They used a very large legacy telecommunication system to develop two classification trees for software quality; both models yielded accuracy that would be useful to developers in various software development practices.
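The kind of module classification described above can be sketched with a small decision tree. The following Python example, using scikit-learn, is our own illustration on synthetic module metrics (lines of code, cyclomatic complexity and number of changes); labelling the upper quartile of a noisy fault count as 'fault prone' merely mimics the 'uppermost quartile' convention mentioned above and is not taken from the cited studies.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, export_text

    rng = np.random.default_rng(1)

    # Hypothetical module metrics and a synthetic, noisy fault count driven by them
    loc = rng.integers(50, 2000, 200)
    complexity = rng.integers(1, 40, 200)
    changes = rng.integers(0, 25, 200)
    faults = 0.01 * loc + 0.5 * complexity + 0.8 * changes + rng.normal(0, 3, 200)

    # A module is labelled fault prone (1) when it lies in the uppermost quartile
    y = (faults >= np.quantile(faults, 0.75)).astype(int)
    X = np.column_stack([loc, complexity, changes])

    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
    print(export_text(tree, feature_names=["loc", "complexity", "changes"]))
    print("training accuracy:", tree.score(X, y))

The printed rules are exactly the kind of human-readable output that makes decision trees attractive for flagging fault-prone or costly modules.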
3.7 Soft computing

The paradigm of soft computing, or computational intelligence, refers to the seamless integration of different, seemingly unrelated, intelligent technologies such as fuzzy logic, NNs, GAs, ML (subsuming CBR and decision trees), rough set theory and probabilistic reasoning in various permutations and combinations to exploit their strengths. Baisch and Liedtke (1997) compared multilinear discriminant analysis with a fuzzy expert system generated by a GA for software quality prediction. They used complexity metrics to predict the quality of software development projects and applied the prediction techniques to Alcatel 1000 S12 telecommunication systems. They split the dataset into training data and test data, used multilinear discriminant analysis on the training data to produce the change class, and applied a genetic optimiser to extract a fuzzy expert system for change prediction. They used the fuzzy expert system technique to construct a quality-based productivity prediction model that identified the software
components that posed potential quality problems. From the experiments, they found that the multilinear discriminant analysis approach performed better than the fuzzy rule-based one. Gray (1999) presented a simulation-based study of empirical modelling techniques using a size and effort software metric dataset. He assessed the prediction performance on hold-out samples using sampling with and without replacement. He applied different techniques, such as MLP, regression trees, linear least-median-squares robust regression (LMS), linear least-quantile-of-squares robust regression (LQS), linear least-trimmed-squares robust regression (LTS) and M-estimator robust regression, to the datasets. Comparing the different error measures, viz. RMSE, MMRE and Pred(25), he observed that the M-estimation model outperformed all the other techniques. Further, Chen and Rine (2003) employed a GA, an NN algorithm, a Monte Carlo algorithm and their combination to improve the effectiveness and efficiency of automated maintenance of reusable components. To build a relationship between the component abstract and concrete levels, each control software component was represented at the abstract level by means of a set of adaptive fuzzy membership functions. The testing phase was used for identifying faulty fuzzy elements of a component, while the adaptation phase was used for modifying the membership functions. The approach was demonstrated by maintaining an automotive cruise controller software component; the next step was to apply it to training the attitude control software components of a spacecraft in laboratory simulation. The experimental results showed that the off-line training approach supports controller software component adaptation effectively and efficiently in terms of control process accuracy and effort spent. Bellettins et al. (2007) addressed a set of related but distinct computing paradigms that included fuzzy logic, granular computing, NNs and evolutionary computation. A common aim of these paradigms is to make human activities more tolerant of imprecision, uncertainty and partial truth. They illustrated the impact of soft-computing techniques on SE research and practice; specifically, they showed how SE tasks such as reuse-oriented classification (e.g. bug detection and correction effort prediction), project cost and time estimation, planning and others can be handled by soft computing. Recently, Rajkiran and Ravi (2007a) developed an ensemble model to forecast software reliability accurately. They used MLR, MARS, DENFIS and TreeNet to develop the ensemble, designing and testing three linear ensembles and one non-linear ensemble. They concluded that the non-linear ensemble outperformed all the other ensembles as well as the constituent statistical and intelligent techniques. Mertoguno et al. (1996) presented a method for designing and modelling a neuro-expert system for the prediction of software metrics. The neuro-expert system consists of two NNs and expert systems with fuzzy reasoning, in order to achieve a better evaluation of software metrics. They also described an approach using an MLP and fuzzy rule-based systems to assist in the analysis and application of software metrics information, applied to the US army software testing and evaluation program (STEP) software metrics. The STEP neural network interface (SNNI) used past-project STEP metric datasets to train an MLP to predict the final status of an uncompleted software project.
The rule-based expert system, which was an extension of SNNI, gave a second prediction and an interpretation of the metrics. Various conventional model-based methods have had limited success, whereas intelligent prediction using neuro-computing has proven its worth in many diverse applications. Shukla (2000) presented a novel genetically trained NN built on historical data. He demonstrated a substantial improvement in accuracy by the neuro-genetic approach as
compared to the conventional regression tree-based approach as well as MLR. He used n-fold cross-validation on various portions of the merged COCOMO and Kemerer datasets comprising 78 real-time projects, and found that the genetically trained NN (GANN) was a significantly better predictor of software development effort than recursive partitioning regression (CARTX) and quick propagation trained neural networks (QPNN). More interestingly, Huang et al. (2003a) proposed a novel neuro-fuzzy COCOMO to estimate software development effort by combining the neuro-fuzzy technique with the popular COCOMO model. Using a dataset obtained from the COCOMO'81 database, they found that the neuro-fuzzy model yields better cost estimation accuracy than the COCOMO'81 model and could be a powerful tool for tackling important problems in SE. Further, Huang et al. (2007a) proposed a neuro-fuzzy constructive cost model (COCOMO) to carry out software estimation. The model dealt effectively with imprecise and uncertain input and enhanced the reliability of software cost estimation; in addition, it allowed inputs to take continuous rating values as well as linguistic values, avoiding the problem of similar projects having largely different estimated costs. The validation using industry project data showed that the model greatly improves estimation accuracy in comparison with the well-known COCOMO model. Oh et al. (2004) used the concept of self-organising neuro-fuzzy networks (SONFN), a hybrid architecture combining neuro-fuzzy networks (NFN) and polynomial neural networks (PNN). They used NASA datasets and the medical image system (MIS) dataset for cost estimation, and found that SONFN gave the best accuracy compared with the other techniques. Then, Pedrycz et al. (2007) reported the analysis of software quality data using a neuro-fuzzy model. They discussed the specificity of software data in relation to the character of neuro-fuzzy processing and showed how self-organising maps help reveal and visualise the structure of software data. The experimental part of the study was concerned with the MIS dataset available in the software quality literature, dealing with dependencies between the software complexity measures characterising software modules and the ensuing number of changes (modifications) made to them. Aggarwal et al. (2005b) proposed NN and fuzzy logic methods to carry out sensitivity analysis. They developed a model with four inputs, viz. comment ratio, average cyclomatic complexity, average number of live variables and average life span of variables. Based on a case study, they built fuzzy logic and NN models to measure software maintainability and then carried out sensitivity analysis. From the sensitivity analysis, it was found that the NN trained with the Bayesian regularisation algorithm produced smaller condition numbers than, and was therefore a more stable model than, the fuzzy model whose membership functions have derivatives with discontinuities at some points. Most recently, Vinay Kumar et al. (2009) employed soft-computing approaches to predict software development effort, developing linear and non-linear ensemble models and using the COCOMO 81 dataset for the experiments. They found that the non-linear ensemble with a radial basis function NN as the arbitrator outperformed all the other techniques.
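The ensemble idea used by Rajkiran and Ravi (2007a) and Vinay Kumar et al. (2009) can be pictured with the following Python sketch, using scikit-learn: several constituent models are trained, a linear ensemble averages their predictions, and a non-linear ensemble uses a small NN as an arbitrator trained on the constituents' outputs. The data are synthetic, the constituent models are our own choices rather than those of the cited papers, and training the arbitrator on in-sample predictions (instead of out-of-fold predictions) is a simplification.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.neural_network import MLPRegressor
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(2)

    # Synthetic effort data (illustrative only): features -> effort in person-months
    X = rng.random((80, 3))
    y = 30 + 100 * X[:, 0] + 40 * X[:, 1] ** 2 + rng.normal(0, 4, 80)
    X_train, X_test, y_train, y_test = X[:60], X[60:], y[:60], y[60:]

    # Constituent models
    models = [LinearRegression(),
              DecisionTreeRegressor(max_depth=4, random_state=0),
              MLPRegressor(hidden_layer_sizes=(16,), max_iter=3000, random_state=0)]
    preds_train = np.column_stack([m.fit(X_train, y_train).predict(X_train) for m in models])
    preds_test = np.column_stack([m.predict(X_test) for m in models])

    # Linear ensemble: simple average of the constituent predictions
    linear_ensemble = preds_test.mean(axis=1)

    # Non-linear ensemble: a small NN 'arbitrator' trained on the constituents' outputs
    arbitrator = MLPRegressor(hidden_layer_sizes=(8,), max_iter=3000, random_state=0)
    arbitrator.fit(preds_train, y_train)
    nonlinear_ensemble = arbitrator.predict(preds_test)

    for name, p in [("average", linear_ensemble), ("NN arbitrator", nonlinear_ensemble)]:
        print(name, "MMRE:", np.mean(np.abs(y_test - p) / y_test).round(3))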
3.8 Other types of intelligent techniques

The accurate prediction of software development effort has a critical effect on all stages of the software development cycle. Underestimating the resource requirements of a software project leads to underestimated costs, unrealistic time schedules and considerable work pressure on the engineers. Shepperd and Schofield (1997) described an alternative approach using analogies. The principle was to characterise projects in terms of features (e.g. the number of interfaces, the development method or the size of the functional requirements document). Completed projects are stored, and the problem then becomes one of finding the projects most similar to the one for which a prediction is required. Similarity was defined as Euclidean distance in n-dimensional space, where n is the number of project features; each dimension was standardised so that all dimensions have equal weight. The known effort values of the nearest neighbours of the new project were then used as the basis for the prediction. The method was validated on nine different industrial datasets (a total of 275 projects), and in all cases analogy outperformed algorithmic models based on stepwise regression. From this work, they argued that estimation by analogy is a viable technique that can complement current estimation approaches. Chiu and Huang (2007) investigated the effect of adjusted analogy-based estimation on the improvement of estimation accuracy. They reported that applying a suitable linear model to adjust the analogy-based estimates was a feasible approach to improving the accuracy of software effort estimation; the data used for the test were from the IBMDPS organisation and from major CF organisations. Reformat et al. (2007) focused on defect-related activities that are the core of corrective maintenance. Two aspects of these activities were considered, namely:
1 the number of software components that had to be examined during the defect removal process
2 the time needed to remove a single defect.
Analysis of the available datasets led to the development of data models, the extraction of IF-THEN rules from these models, and the construction of ensemble-based prediction systems built on these data models. The data were analysed using C5.0 and a new multilevel evolutionary algorithm. Most datasets are proprietary in nature, making replication impossible in these situations, and, as demonstrated by the NASA experiments, not all data distributions lead to similar results. This work extended previous research on dataset prediction by Khoshgoftaar and Rebours (2003), conducting nearest-neighbour analysis to gain a deeper understanding of how datasets relate to each other, and thus of the need for developing more realistic empirical/ML-based models. Though evolutionary testing has performed well in many testing areas, several open problems still require further research. The use of ML algorithms has proven to be of great practical value in solving a variety of SE problems, including the prediction of software cost and defect processes. Twala et al. (2007) experimented with rule induction (RI), providing some background and key issues on RI and further examining how RI can be utilised to handle uncertainties in data; the application of RI to prediction and other software tasks was also considered. Support vector machines (SVMs) are employed to solve non-linear regression and time series problems. Pai and Hong (2006) investigated
the feasibility of using SVMs to forecast software reliability. A simulated annealing (SA) algorithm was used to select the parameters of the SVM model. They used two numerical examples to demonstrate the forecasting performance of the SVM model: in the first example, a rolling-based forecasting procedure with a one-step-ahead policy was applied to forecasting software reliability; in the second, the software failure data obtained from Musa et al. (1987), Musa (1998) and Singpurwalla and Soyer (1985) were adopted to investigate the performance of the proposed model in forecasting software reliability.
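A rolling one-step-ahead SVM forecast of the kind just described can be sketched as follows in Python with scikit-learn. The cumulative failure data are synthetic stand-ins for a Musa-style dataset, the SVR hyperparameters are picked by hand rather than by simulated annealing as in Pai and Hong (2006), and the range-normalised RMSE used below is only one of several possible NRMSE definitions.

    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(3)

    # Synthetic cumulative failure data: failures grow roughly logarithmically
    # over 100 testing intervals (illustrative values only).
    t = np.arange(1, 101)
    failures = 50 * np.log1p(0.3 * t) + rng.normal(0, 1.5, t.size)

    # Build lagged samples: predict the next value from the previous 'lag' observations
    lag = 5
    X = np.array([failures[i:i + lag] for i in range(len(failures) - lag)])
    y = failures[lag:]

    # One-step-ahead rolling forecast over the last 20 intervals
    split = len(y) - 20
    preds = []
    for i in range(split, len(y)):
        model = SVR(C=10.0, epsilon=0.1)       # hand-picked hyperparameters
        model.fit(X[:i], y[:i])                # refit on all data seen so far
        preds.append(model.predict(X[i:i + 1])[0])

    nrmse = np.sqrt(np.mean((y[split:] - preds) ** 2)) / (y.max() - y.min())
    print("one-step-ahead NRMSE:", round(nrmse, 4))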
4 Discussion
From Table 1, we can draw the following inferences. MLP, being at the forefront of data-driven techniques, has found a huge number of applications in SE (approximately 60% of the papers reviewed here involve MLP). Fuzzy logic-based techniques, which are knowledge-driven, constitute the second highest number. GAs (subsuming GP) and soft-computing hybrids (neuro-fuzzy, neuro-genetic, etc.) account for 12 and 13 papers, respectively, followed by decision trees with six papers and CBR with four. It is rather surprising that decision trees appear in so few papers, even though their potential is quite high and they have been applied in many diverse disciplines. It is an encouraging trend that soft-computing hybrids are slowly but steadily finding acceptance among SE researchers. After going through all the research work reviewed in this paper, we make the following observations:
• Since many areas of SE, namely software cost/development effort estimation, software reliability forecasting, etc., are highly empirical in nature, data-driven techniques are extremely useful for solving these problems. The most important SE problems of practical significance, such as software development cost estimation and software reliability prediction, can be solved by utilising data from past software projects. It is clear from this review that all the data-driven intelligent techniques, viz. NNs, decision trees, etc., can be used effectively to solve these problems.
• Further, we note that the data-driven nature of SE problems is common to other engineering disciplines as well. However, the most striking feature of SE problems is their heavy dependence on human experts and on knowledge and experience of past software projects. As a result, the conspicuous presence of subjectivity and empiricism in the SE discipline cannot be neglected. This calls for the application of knowledge-driven technology such as fuzzy logic to SE problems. Even though a few of the papers reviewed here deal with the application of fuzzy logic to SE problems, there is still ample scope to apply FL-based methods in several ways.
• In view of the foregoing discussion, we remark that the peculiar nature of SE problems is that they are both data-driven and knowledge-driven, in different proportions. This curious mix indeed calls for the application of more and more soft-computing hybrids to SE problems.
• One significant observation is that rough set theory (Pawlak, 1982) is conspicuous by its absence in the present review, even though it has found interesting applications in many diverse areas of science and engineering. The advantage of rough set-based techniques is that feature selection, classification/forecasting and the generation of 'if-then' rules can all be done in one go.
• SE has always been a complicated domain, characterised by a large number of competing and inter-related constraints (Zhang and Tsai, 2007). With the rapid growth in software scale and complexity, more and more SE problems, such as initial planning, requirements analysis, cost estimation, system integration, software testing, system maintenance and so on, become vague and hard to solve, which makes it more difficult to balance these problems within a limited budget and schedule. Since traditional methods may not cope with the continuously increasing complexity, promising techniques based on artificial intelligence have been proposed and researched. Nowadays, many SE problems are being considered from a new perspective as search problems, to which metaheuristics can be applied. Metaheuristic search techniques, such as GAs, SA, tabu search, differential evolution, ant colony optimisation and particle swarm optimisation, search intelligently for optimal or near-optimal solutions to a problem within its search space. ML methods have been utilised to develop better software products, to be part of software products, and to make the software development process more efficient and effective (He and Yu, 2007; Zhang and Tsai, 2007).
• Software testing (Zhang and Tsai, 2007) is one of the significant components of SE, with many complex and inter-related constraints. Efficient testing requires systematic and automatic test data generation that satisfies pre-defined standards. However, because of the increasing complexity of software systems, the usual testing techniques have shown their limitations in certain areas. There are several types of testing, viz. functional testing, structural testing, safety testing, random testing and so on. Test data generation can be transformed into a search problem to which metaheuristic techniques can be applied. Since the search space of software testing is usually large, non-linear and discontinuous, local search methods such as hill climbing are inefficient at finding good solutions, whereas GAs are a global search strategy that has proved suitable for software testing (a small illustrative sketch follows this list). Evolutionary testing is a promising technique that utilises GAs to generate test data for various testing objectives, and is regarded as complementary and supplementary to existing approaches (Zhang and Tsai, 2007).
• Further, this review provides the merits and demerits of the intelligent techniques covered here (see Table 3) and the sources of the datasets (see Table 4), wherever available. This kind of information would be immensely useful to beginners in the SE area who pursue research in this direction, in deciding which technique should be preferred in a given circumstance.
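As the sketch promised above, the following Python example illustrates search-based test data generation in its simplest form: a GA evolves two integer inputs so that a toy function under test reaches a hard-to-hit branch, guided by a branch-distance style fitness. The function under test, the fitness definition and all GA settings are our own illustrative choices and are not drawn from any of the cited works.

    import random

    random.seed(0)

    def function_under_test(a, b):
        """Toy code under test: the 'hard' branch is taken only when a == 2 * b and b > 50."""
        return "target branch" if (a == 2 * b and b > 50) else "other branch"

    def branch_distance(a, b):
        """Branch-distance style fitness: 0 means the target branch is reached."""
        return abs(a - 2 * b) + max(0, 51 - b)

    def evolve(pop_size=40, generations=200, domain=(-1000, 1000)):
        pop = [(random.randint(*domain), random.randint(*domain)) for _ in range(pop_size)]
        for _ in range(generations):
            pop.sort(key=lambda ind: branch_distance(*ind))
            if branch_distance(*pop[0]) == 0:
                return pop[0]                      # covering test input found
            survivors = pop[: pop_size // 2]
            children = []
            while len(children) < pop_size - len(survivors):
                p1, p2 = random.sample(survivors, 2)
                child = (p1[0], p2[1])             # crossover: mix the two parents' inputs
                if random.random() < 0.3:          # mutation: nudge the inputs slightly
                    child = (child[0] + random.randint(-5, 5), child[1] + random.randint(-5, 5))
                children.append(child)
            pop = survivors + children
        return min(pop, key=lambda ind: branch_distance(*ind))

    a, b = evolve()
    print(a, b, function_under_test(a, b))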
Table 4  Description of different methods and datasets

Methods applied | Datasets | References

Software reliability – Neural networks
BPNN | NASA | Adnan et al. (2000)
FP, Bayesian regularisation (Matlab), BPNN | ISBSG Release 9 | Aggarwal et al. (2005a)
BPNN | Musa (1979) | Cai et al. (2001)
MLPN, RN | SYS1, SYS2, SYS3 (Lyu, 1996) | Dohi et al. (1999)
FP, MLP, LOC/RBF | IBM DP services, Kemerer, Hallmark datasets | Guo and Lyu (2004)
BPNN, SOM, SONN | Intermediate-level COCOMO | Hsu and Tenorio (1991)
Elman recurrent NN | Software failure data from complex military computer systems | Ho et al. (2003)
Discriminant analysis, ENN | Large telecommunication systems | Houchman and Hudepohl (1997)
OLS regression method | ISBSG database | Huang et al. (2007a,b)
BPNN, PNN | Dataset developed by Pondicherry Engineering College, India | Kanamani et al. (2007)
Logistic regression, discriminant analysis, FFN generalisation, FFN prediction, Jordan net generalisation, Jordan net prediction, logarithmic, inverse polynomial, exponential, power, delayed S-shape | Musa (1979, 1987), Ohba (1984a,b), Matsumoto et al. (1988) | Karunanithi et al. (1991, 1992a,b)
PCA | Large commercial system | Khoshgoftaar and Szabo (1994)
PCA | Multivariate datasets (Dhilon and Goldstein, 1984) | Khoshgoftaar and Szabo (1996)
NN | Ada development environment for the command and control of a military data link communication system (CCCS) | Khoshgoftaar et al. (1992)
BPNN, PCA | Very large telecommunication software | Khoshgoftaar et al. (1997)
SVM, Bayesian statistical model | Musa (1979), Singpurwalla and Soyer (1985) | Pai and Hong (2006)
BPNN | Lyu (1996) | Patra (2003)
Ward Net, GRNN | Warehouse management application | Quah and Thwin (2004)
BPNN, TANN, PSN, MARS, GRNN, MLR, TreeNet, DENFIS, Morlet-based WNN, Gaussian-based WNN | Musa (1979) | Rajkiran and Ravi (2007b)
BPNN, TANN, PSN, MARS, GRNN, MLR, TreeNet, DENFIS, Morlet-based TAWNN, Gaussian-based TAWNN, Morlet-based WNN, Gaussian-based WNN | Musa (1979) | Ravi et al. (2009)
BPNN, Goel–Okumoto model, Yamada delayed S-shaped model, inflection S-shaped model | Musa et al. (1987) | Su et al. (2007)
Ward neural network, GRNN | QUES (Wei and Henery, 1993) | Thwin and Quah (2005)
GP, Levenberg–Marquardt algorithm, Bayesian regularisation | Real-time and flight application, Musa et al. (1987) | Tian and Noore (2005a,b)

Fuzzy logic
FL based | COCOMO | Ahmed et al. (2005)
GA, BPNN, Monte Carlo algorithm and their combination | Automotive controller software | Chen and Rine (2003)
FL based | Inspection data (Ebenau) | So et al. (2002)

Decision tree
CART | Large legacy telecommunication system | Khoshgoftaar et al. (2000)
Quinlan (ID3) | NASA production environment | Selby and Porter (1988)

Genetic algorithm
GA, MLP | Desharnais | Burgess and Lefley (2001)
GP | Musa (1980) | Costa et al. (2005)
GA | Parallel-series modular software system | Dai et al. (2003)
GRA, CBR, CART, ANN | COCOMO, ALBRECHT | Huang et al. (2007a,b)
GA using hyper-geometric distribution | Pseudo dataset | Minohara and Tohma (1995)
GA-based approach | Derbyshire Fire and Rescue Services (DFRS) | Yang et al. (2006)

Soft computing
Multilinear discriminant analysis, fuzzy rule based | Telecommunication projects | Baisch and Liedtke (1997)
GA, Euclidean distance (AE), Manhattan distance (AMH), Minkowski distance (AMK) | IBM data processing services (DPS), Canadian financial (CF) | Chiu and Huang (2007)
Neuro-fuzzy | US army software testing and evaluation program (STEP) metric data | Mertoguno et al. (1996)
SONN, neuro-fuzzy network, polynomial neural network | NASA | Oh et al. (2004)

Software cost estimation – Neural networks
Logical log approach, curve-fitting approach | COCOMO II 2000 | Dillibabu and Krishnaiah (2005)
MLP, regression trees, M-estimation | COCOMO-II | Gray (1999)
MLP, RBFN | IBM DP services, Kemerer dataset, Hallmark dataset | Heiat (2002)
BPNN, SONN | Intermediate level of COCOMO | Hsu and Tenorio (1991)
RBFN, Jang and Sun | COCOMO-81 | Idri et al. (2002)
RBFN, BPNN, Jang and Sun | COCOMO-81 | Idri et al. (2004)
BPNN | Korea Electric Power Corporation, Samsung Electronics datasets | Jun and Lee (2001)
PCA | Kernel | Khoshgoftaar and Szabo (1996)
FP | Korean IT services data (1999–2003) | Park and Baek (2008)
FP, regression neural network, CBR | COCOMO-II, AKEMCO Technology Pvt. Ltd., Singapore | Zhao et al. (2003)
Adaptive neural network | COCOMO-II | Tadayon (2004)
NN, multiple regression | COCOMO database (Boehm, 1981) | Tronto et al. (2007)
BPNN | COCOMO | Venkatachalam (1993)
Regression analysis, ANN | COCOMO 81, SLIM | Vinay Kumar et al. (2008)
BPNN | Desharnais | Wittig and Finnie (1997)
FP, regression neural network, CBR | COCOMO-II, AKEMCO Technology Pvt. Ltd., Singapore | Zhao et al. (2003)
Fuzzy subtractive clustering | Legacy telecommunication system | Yuan and Ganesan (2000)

Soft computing
BPNN, TANN, PSN, MARS, GRNN, MLR, TreeNet, DENFIS | Musa (1979) | Rajkiran and Ravi (2007b)
Neuro-fuzzy | ISBSG | Aggarwal et al. (2005a)
Neuro-fuzzy | COCOMO-81 | Huang et al. (2003a, 2007a)
BPNN, GA | COCOMO | Shukla (2000)
BPNN | COCOMO-II | Tadayon (2004)
MLP, RBF, MLR, DENFIS, SVM | COCOMO-81 | Vinay Kumar et al. (2009)
As far as future directions are concerned, we observe that:
1 Decision trees can be applied more effectively and more often to solve SE problems such as software defect identification, software development cost estimation and software reliability prediction.
2 Fuzzy rule-based systems could be applied more often and exploited more effectively. As a consequence, we would end up with either crisp or fuzzy expert systems involving human-comprehensible 'if-then' rules.
3 Further, there are other NN techniques, such as group method of data handling (GMDH), counter propagation neural networks (CPNN), Bayesian neural networks and probabilistic NNs, that can be applied to solve many SE problems involving classification and forecasting tasks, such as software defect identification, software development cost estimation and software reliability prediction.
4 Rough set theory-based methods could also be applied with great success to solve classification problems in SE.
5 New soft-computing hybrids could be developed involving differential evolution, ant colony optimisation and NNs to solve problems such as software defect identification, software development cost estimation and software reliability prediction. Further, the effectiveness of principal component NN and DEWNN could also be investigated.
6 Given the data-dependent nature and performance of metaheuristics, other metaheuristics such as differential evolution, ant colony optimisation and particle swarm optimisation could be attempted as alternatives to GA, in order to obtain better test data generation, which would eventually be useful in software testing.
7 Descriptive data analysis techniques such as data envelopment analysis, the analytic hierarchy process, the analytic network process and fuzzy multi-attribute decision making could be employed to rank software projects developed in the same domain in terms of quality attributes.
5 Conclusions
A comprehensive review encompassing the application of intelligent techniques such as NNs, fuzzy logic, GAs, decision trees, CBR and soft-computing hybrids to ESE has been presented. The review is organised with the technique employed as the principal dimension, and the source of the dataset and the results are also provided wherever available. The review indicates that, of all the machine intelligence techniques, NNs were the most preferred method in the SE community for solving problems such as software development cost estimation and software reliability prediction. This review is expected to act as a starting point for any researcher who wishes to conduct research in ESE employing sophisticated techniques. For practitioners too, the review provides useful insights into the popularity, merits and demerits of the various intelligent techniques. Of particular significance is that the present review provides immensely useful future directions, which by themselves open up new areas of research in the entire gamut of SE at the interface of intelligent techniques and soft computing. This would eventually lead to better decision making in SE, thereby ensuring better, more reliable and cost-effective software products.
Acknowledgement

The authors thank the anonymous reviewers for their useful suggestions, which greatly improved the presentation of this paper.
References Aamodt, A. and Plaza, E. (1994) ‘Case based reasoning: foundational issues, methodological variations and system approaches’, AI Communications, Vol. 7, No. 1, pp.39–59. Adnan, W.A., Yaakoob, M., Anas, R. and Tamjis, M.R. (2000) ‘Artificial neural network for software reliability assessment’, IEEE Proceedings, TENCON 2000, Vol. 3, pp.446–451. Aggarwal, K.K., Singh, Y., Chandra, P. and Puri, M. (2005a) ‘Evaluation of various training algorithm in a neural network model for software engineering application’, ACM SIGSOFT Software Engineering Notes, Vol. 30, p.4. Aggarwal, K.K., Singh, Y., Chandra, P. and Puri, M. (2005b) ‘Sensitivity analysis of fuzzy and neural network models’, ACM SIGSOFT Software Engineering Notes, Vol. 4, p.1. Ahmed, M.A., Saliu, M.O. and AlGhamdi, J. (2005) ‘Adaptive fuzzy logic-based framework for software development effort prediction’, Journal of Information and Software Technology, Vol. 47, pp.31–48. American Institute of Aeronautics and Astronautics (AIAA) (1992) ‘Software reliability engineering recommended practice’, R-103-1992. Baisch, E. and Liedtke, T. (1997) ‘Comparison of conventional approaches and soft computing approaches for software quality prediction’, IEEE International Conferences on Systems, Man and Cybernetics, Vol. 2, pp.1045–1049. Bajaj, N., Tyagi, A. and Agarwal, R. (2006) ‘Software estimation – A fuzzy approach’, ACM SIGSOFT Software Engineering Notes, Vol. 31, pp.1–5.
Basili, V.R. (1996) ‘The role of experimentation in software engineering: Past, Present and future’, Keynote Address, 18th International Conferences on Software Engineering (ICSE’18), pp.442–449. Bellettins, C., Fugine, G. and Pierluigi, P. (2007) ‘Fuzzy selection of software components and web services from domain knowledge’, In E. Damiani, L.C. Jain and M. Madravio (Eds.), Soft Computing in Software Engineering, Springer Berlin, pp.1–32. Boehm, B.W. (1981) Software Engineering Economics. Englewood Cliffs, NJ: Prentice – Hall. Briand, C., Feng, J. and Labiche, Y. (2002) ‘Using genetic algorithm and coupling measures to devise optimal integration test orders’, Proceedings of the 14th International Conferences on Software Engineering and Knowledge Engineering, Vol. 27, pp.43–50. Burgess, C.J. and Lefley, M. (2001) ‘Can genetic programming improve software effort estimation? A comparative evaluation’, Information and Software Technology, Vol. 43, pp.863–873. Cai, K., Yuan, C. and Zhang, M.L. (1991) ‘A critical review on software reliability modeling’, Reliability Engineering and Systems Safety, Vol. 32, pp.357–371. Cai, K.Y., Cai, L., Wang, W.D., Yu, Z.Y. and Zhang, D. (2001) ‘On the neural network approach in software reliability modeling’, The Journal of Systems and Software, Vol. 58, pp.47–62. Chen, J. and Rine, D.C. (2003) ‘Training fuzzy logic controller software components by combining adaption algorithm’, Advances in Engineering Software, Vol. 34, pp.125–137. Chen, S.H. (2008) ‘Artificial intelligence techniques: an introduction to their use for modeling environmental systems’, Mathematics and Computers in Simulation, Vol. 78, pp.379–400. Chiu, N.H. and Huang, S.J. (2007) ‘The adjusted analogy based software estimation based on similarity distances’, The Journal of Systems and Software, Vol. 80, pp.628–640. Cimpan, S. and Oquendo, F. (2003) ‘Dealing with software process deviation using fuzzy logic based monitoring’, LLP-CESALP, ESIA Engineering School, University of Savoie, France. Costa, E.O.,Vergili, O.S.R. and Souz,.G. (2005) ‘Modeling software reliability growth with genetic algorithm’, Proceedings of 16th IEEE International Symposium on Software Reliability Engineering, pp.170–180. Cristiamini, N. and Taylor, J.S. (2000) An Introduction to Support Vector Machines, Cambridge, England: Cambridge University Press. Dai, Y.S., Xie, M., Poh, K.L. and Yang, B. (2003) ‘Optimal testing resource allocation with genetic algorithm for modular software systems’, The Journal of Systems and Software, Vol. 66, pp.47–55. Delany, S.J. and Cunningham, P. (2000) ‘The application of case based reasoning to early software project cost estimation and risk assessment’, ftp://ftp.cs.tcd.ie/pub/tech-reports/reports.00/ TCD-CS-2000-10.pdf. Dhilon, W.R. and Goldstein. M. (1984) Multivariate Analysis. New York: John Wiley. Dillibabu, R. and Krishnaiah, K. (2005) ‘Cost estimation of a software product using COCOMO II. 2000 model – a case study’, Int. Journal of Project Management, Vol. 23, pp.297–307. Dohi, T., Nishio, Y. and Osaki, S. (1999) ‘Optional software release scheduling based on artificial neural networks’, Annals of Software Engineering, Vol. 8, pp.167–185. Engel, A. and Last, M. (2007) ‘Modeling software testing costs and risks using fuzzy logic paradigm’, The Journal of Systems and Software, Vol. 80, No. 007, pp.817–835. Fahlman, S.E. and Lebiere, C. (1990) ‘The cascade-correlation learning architecture’, In Advances in Neural Information Processing Systems. 
San Mateo, CA: Morgan Kaufmann, pp.524–532. Gehrke, J., Ramakrishnan, R. and Loh, W.R. (1999) ‘BOAT-optimistic decision tree construction’, In Proceedings ACM SIGMOD International Conference on Management of Data, Philadelphia, PA, pp.169–180. Gersho, A. and Gray, R.M. (1992) Vector Quantization and Signal Compression. Norwell, MA: Kluwer.
Goldberg, D.E. (1989) Genetic Algorithms in Search, Optimization and Machine Learning. Reading, MA: Addison-Wesley.
Goldberg, D.E. (1993) 'Genetic and evolutionary algorithms come of age', Communications of the ACM, Vol. 37, No. 3, pp.113–119.
Gray, A.R. (1999) 'A simulation-based comparison of empirical modeling techniques for software metric models of development effort', Proceedings of ICONIP '99, 6th International Conference on Neural Information Processing, Vol. 2, pp.526–531.
Guo, P. and Lyu, M.R. (2004) 'A pseudo-inverse learning algorithm for feedforward neural networks with generalization application to software reliability growth data', Neurocomputing, Vol. 56, pp.101–121.
Gupta, N. and Singh, M.P. (2005) 'Estimation of software reliability with execution times model using the pattern mapping techniques of artificial neural networks', Computers and Operations Research, Vol. 32, pp.187–199.
He, X. and Yu, H. (2007) 'Formal methods for specifying and analyzing complex software systems', In D. Zhang and J.J.P. Tsai (Eds.), Advances in Machine Learning Applications in Software Engineering. London: Idea Group Publishing, pp.73–102.
Heiat, A. (2002) 'Comparison of artificial neural networks and regression models for estimating software development effort', Information and Software Technology, Vol. 44, pp.911–922.
Ho, S.L., Xie, M. and Goh, T.N. (2003) 'A study of the connectionist models for software reliability predictions', Computers and Mathematics with Applications, Vol. 46, pp.1037–1045.
Holland, J.H. (1975) Adaptation in Natural and Artificial Systems. Ann Arbor, USA: University of Michigan Press.
Houchman, R. and Hudepohl, J.P. (1997) 'A robust approach to software reliability problems', Proceedings of the Eighth International Symposium on Software Reliability Engineering, Vol. 2, No. 5, pp.13–26.
Hsu, W. and Tenorio, M.F. (1991) 'Software engineering effort models using neural networks', IEEE Transactions on Software Engineering, Vol. 2, pp.1190–1195.
Huang, X., Capretz, L.F., Ren, J. and Ho, D. (2003a) 'A neuro-fuzzy model for software cost estimation', Proceedings of the Third International Conference on Quality Software. Washington, DC, USA: IEEE Computer Society, p.126.
Huang, C.Y., Lyu, M.R. and Kuo, S.Y. (2003b) 'A unified scheme of some non-homogeneous Poisson process models for software reliability estimation', IEEE Transactions on Software Engineering, Vol. 29, No. 3, pp.261–269.
Huang, X., Ho, D., Ren, J. and Capretz, L.F. (2007a) 'Improving the COCOMO model using a neuro-fuzzy approach', Applied Soft Computing, Vol. 7, pp.29–40.
Huang, H.J., Chiu, N.H. and Chen, L.H. (2007b) 'Integration of the grey relational analysis with genetic algorithm for software effort estimation', European Journal of Operational Research, Vol. 188, No. 3, pp.898–909.
Huang, S.J., Chiu, N.H. and Liu, Y.J. (2008) 'A comparative evaluation on the accuracies of software effort estimates from clustered data', Information and Software Technology, Vol. 50, pp.879–888.
Idri, A., Khoshgoftaar, T.M. and Abran, A. (2002) 'Can neural networks be easily interpreted in software cost estimation?', Proceedings of the IEEE International Conference on Fuzzy Systems (FUZZ-IEEE '02). Available at: http://www.lrgl.uqam.ca/publications/pdf/739.pdf.
Idri, A., Mbarki, S. and Abran, A. (2004) 'Validating and understanding software cost estimation models based on neural networks', Proceedings of the International Conference on Information and Communication Technologies, pp.433–434.
Jiang, H. (2006) 'Can the genetic algorithm be a tool for software engineering searching problems?', Proceedings of the 30th Annual International Conference on Computer Software and Applications, Vol. 2, pp.362–366.
Jun, E.S. and Lee, J.S. (2001) 'Quasi-optimal case-selective neural network models for software effort estimation', Expert Systems with Applications, Vol. 21, pp.1–14.
Kanamani, U., Sankaranarayana, V.R. and Tambidurai, V. (2007) 'Object-oriented software fault prediction using neural networks', Information and Software Technology, Vol. 49, pp.483–492.
Karunanithi, N., Whitley, D. and Malaiya, Y.K. (1991) 'Prediction of software reliability using neural networks', International Symposium on Software Reliability, pp.124–130.
Karunanithi, N., Malaiya, Y.K. and Whitley, D. (1992a) 'The scaling problem in neural networks for software reliability prediction', Proceedings of the Third International IEEE Symposium on Software Reliability Engineering, Los Alamitos, CA, pp.76–82.
Karunanithi, N., Whitley, D. and Malaiya, Y.K. (1992b) 'Prediction of software reliability using connectionist models', IEEE Transactions on Software Engineering, Vol. 18, pp.563–574.
Khoshgoftaar, T.M. and Szabo, R.M. (1994) 'Predicting software quality during testing using neural network models: a comparative study', Int. J. Reliability, Quality and Safety Engineering, Vol. 1, pp.303–319.
Khoshgoftaar, T.M. and Szabo, R.M. (1996) 'Using neural networks to predict software faults during testing', IEEE Transactions on Reliability, Vol. 45, No. 3, pp.456–462.
Khoshgoftaar, T.M. and Rebours, P. (2003) 'Noise elimination with partitioning filter for software quality estimation', Int. J. Computer Applications in Technology, Vol. 27, pp.246–258.
Khoshgoftaar, T.M., Pandya, A.S. and More, H.B. (1992) 'A neural network approach for predicting software development faults', Proceedings of the Third IEEE International Symposium on Software Reliability Engineering, Los Alamitos, CA, pp.83–89.
Khoshgoftaar, T.M., Lanning, D.L. and Pandya, A.S. (1993) 'A neural network modeling for detection of high-risk program', Proceedings of the Fourth IEEE International Symposium on Software Reliability Engineering, Denver, Colorado, pp.302–309.
Khoshgoftaar, T.M., Allen, E.B., Hudepohl, J.P. and Aud, S.J. (1997) 'Application of neural networks to software quality modeling of a very large telecommunications system', IEEE Transactions on Neural Networks, Vol. 8, No. 4, pp.902–909.
Khoshgoftaar, T.M., Allen, E.B., Jones, W.D. and Hudepohl, J.P. (2000) 'Classification-tree models of software quality over multiple releases', IEEE Transactions on Reliability, Vol. 49, No. 1, pp.4–11.
Kokol, P., Podgorelec, V. and Pighim, M. (2001) 'Using software metrics and evolutionary decision trees for software quality control', Available at: http://www.escom.co.uk/conference2001/papers/kokol.pdf.
Kohonen, T. (1997) Self-Organizing Maps (2nd ed.). Berlin: Springer-Verlag.
Kolodner, J.L. (1992) 'An introduction to case-based reasoning', Artificial Intelligence Review, Vol. 6, No. 1, pp.3–34.
Kolodner, J.L. (1993) Case-Based Reasoning. CA: Morgan Kaufmann.
Lee, J., Kuo, J.Y. and Fanjang, Y.Y. (2002) 'A note on current approaches to extending software engineering with fuzzy logic', Proceedings of the 2002 IEEE International Conference on Fuzzy Systems, Vol. 2, pp.1168–1173.
Leung, H. (2000) 'Software cost estimation', Department of Computing, The Hong Kong Polytechnic University.
Lyu, M.R. (1996) Handbook of Software Reliability Engineering. New York: McGraw-Hill.
Lyu, M.R. and Nikora, A. (1992) 'Using software reliability models more effectively', IEEE Software, Vol. 9, No. 4, pp.43–53.
Matsumoto, K., Inoue, T., Kikuno, T. and Torii, K. (1988) 'Experimental evaluation of software reliability growth models', Proceedings of FTCS-18, IEEE, pp.148–153.
Mertoguno, J.S., Paul, R. and Ramamoorthy, C.V. (1996) 'A neuro-expert system for the prediction of software metrics', Elsevier Science Ltd.
Minohara, T. and Tohma, Y. (1995) 'Parameter estimation of hyper-geometric distribution software reliability growth model by genetic algorithms', Proceedings of the Sixth International Symposium on Software Reliability Engineering, pp.324–329.
Moody, J. and Darken, C.J. (1989) 'Fast learning in networks of locally tuned processing units', Neural Computation, Vol. 1, pp.281–294.
Musilek, P., Pedrycz, W., Succi, G. and Reformat, M. (2000) 'Software cost estimation with fuzzy models', ACM SIGAPP Applied Computing Review, Vol. 8, No. 2, pp.4–29.
Musa, J.D. (1998) Software Reliability Engineering: More Reliable Software, Faster Development and Testing. New York: Osborne/McGraw-Hill, ISBN 0-07-913271-5.
Musa, J.D., Iannino, A. and Okumoto, K. (1987) Software Reliability: Measurement, Prediction, Application. New York: McGraw-Hill.
Musa, J.D. (1980) 'Software reliability data', Journal of Systems and Software, Vol. 1, No. 3, pp.223–241.
Musa, J.D. (1979) Software Reliability Data. IEEE Computer Society Repository.
Ohba, M. (1984a) 'Software reliability analysis models', IBM Journal of Research and Development, Vol. 28, pp.428–443.
Ohba, M. (1984b) 'Inflection S-shaped software reliability growth models', In S. Osaki and Y. Hatoyama (Eds.), Stochastic Models in Reliability Theory. Berlin: Springer, pp.144–162.
Ohba, M., Yamada, S., Takeda, K. and Osaki, S. (1982) 'S-shaped software reliability growth curve: How good is it?', Proceedings of the Sixth IEEE International Computer Software and Applications Conference, Chicago, Illinois, pp.38–44.
Oh, S.K., Pedrycz, W. and Park, B.J. (2004) 'Self-organizing neuro-fuzzy networks in modeling software data', Fuzzy Sets and Systems, Vol. 145, pp.165–181.
Pai, P.F. and Hong, W.C. (2006) 'Software reliability forecasting by support vector machine with simulated annealing algorithms', The Journal of Systems and Software, Vol. 79, No. 6, pp.747–755.
Patra, S. (2003) A Neural Network Approach for Long Term Software MTTF Prediction. Fast abstract, ISSRE 2003, Denver, USA: Chillarege Press. Available at: www.chillarege.com/ISSRE 2003/129 - FA - 2003.pdf.
Park, S., Hoyoung, N. and Sugumaran, V. (2006) 'A semi-automated filtering technique for software process tailoring using neural networks', Expert Systems with Applications, Vol. 30, pp.179–189.
Park, H. and Baek, S. (2008) 'An empirical validation of a neural network model for software effort estimation', Expert Systems with Applications, Vol. 35, pp.929–937.
Pawlak, Z. (1982) 'Rough sets', Int. J. Computer and Information Sciences, Vol. 11, pp.341–356.
Pedrycz, W. (2002) 'Computational intelligence as an emerging paradigm of software engineering', Proceedings of the 14th International Conference on Software Engineering and Knowledge Engineering, pp.7–14.
Pedrycz, W., Reformat, M. and Pizzi, M. (2007) 'Neuro-fuzzy analysis of software quality data', In E. Damiani, L.C. Jain and M. Madravio (Eds.), Soft Computing in Software Engineering. Berlin: Springer, pp.254–273.
Perlovsky, L.I. (2000) Neural Networks and Intellect: Using Model-Based Concepts. New York: Oxford University Press.
Purnapraja, M., Reformat, M. and Pedrycz, W. (2006) 'Genetic algorithm for hardware–software partitioning and optimal resource allocation', Journal of Systems Architecture, Vol. 53, pp.339–354.
Quinlan, J.R. (1987) 'Decision trees as probabilistic classifiers', Proceedings of the Fourth International Workshop on Machine Learning, Irvine, CA, pp.31–37.
Quah, T.S. and Thwin, M.M. (2004) 'Prediction of software development faults in PL/SQL files using neural network models', Information and Software Technology, Vol. 46, pp.519–523.
Rajkiran, N. and Ravi, V. (2007a) 'Software reliability prediction by soft computing technique', The Journal of Systems and Software, Vol. 81, No. 4, pp.576–583.
Rajkiran, N. and Ravi, V. (2007b) 'Software reliability prediction using wavelet neural networks', Proceedings of the International Conference on Computational Intelligence and Multimedia Applications (ICCIMA), Vol. 1, pp.195–197.
Ravi, V., Chauhan, N.J. and Rajkiran, N. (2009) 'Software reliability prediction using intelligent techniques: application to operational risk prediction in firms', Int. J. Computational Intelligence and Applications, Vol. 8, No. 2, pp.181–194.
Reformat, M., Musilek, P. and Igbide, E. (2007) 'Intelligent analysis of software maintenance data', In D. Zhang and J.J.P. Tsai (Eds.), Advances in Machine Learning Applications in Software Engineering. London: Idea Group Publishing, pp.14–51.
Rine, D., Ahmed, M. and Chen, J. (1999) 'A reusable software adaptive fuzzy controller architecture', Proceedings of the ACM Symposium on Applied Computing, pp.633–637.
Rud, O.P. (2000) Data Mining Cookbook. New York: John Wiley and Sons.
Russell, S.J. and Norvig, P. (2003) Artificial Intelligence: A Modern Approach (2nd ed.). New Jersey, USA: Prentice-Hall.
Rumelhart, D.E., Hinton, G.E. and Williams, R.J. (1986) 'Learning internal representations by error propagation', In D.E. Rumelhart and J.L. McClelland (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1. Cambridge, MA: The MIT Press, pp.318–362.
Sakran, H.A. (2006) 'Software cost estimation model based on integration of multi-agent and case-based reasoning', Journal of Computer Science, Vol. 2, No. 3, pp.276–282.
Selby, R.W. and Porter, A.A. (1988) 'Learning from examples: generation and evaluation of decision trees for software resource analysis', IEEE Transactions on Software Engineering, Vol. 14, No. 12, pp.1743–1757.
Senyard, A., Kazmierczak, E. and Sterling, L. (2003) 'Software engineering methods for neural networks', Proceedings of the Tenth Asia-Pacific Software Engineering Conference (APSEC'03), pp.468–477.
Shepperd, M. and Schofield, C. (1997) 'Estimating software project effort using analogies', IEEE Transactions on Software Engineering, Vol. 23, No. 12, pp.736–743.
Shukla, K.K. (2000) 'Neuro-genetic prediction of software development effort', Information and Software Technology, Vol. 42, pp.701–713.
Sitte, R. (1999) 'Comparison of software-reliability-growth predictions: neural networks vs. parametric-recalibration', IEEE Transactions on Reliability, Vol. 48, No. 3, pp.285–291.
Singpurwalla, N.D. and Soyer, R. (1985) 'Assessing (software) reliability growth using a random coefficient autoregressive process and its ramifications', IEEE Transactions on Software Engineering, Vol. 11, pp.1456–1464.
So, S.S., Sung, C.D. and Kwon, Y.R. (2002) 'Empirical evaluation of a fuzzy logic-based software quality prediction model', Fuzzy Sets and Systems, Vol. 127, pp.199–208.
Specht, D.F. (1990) 'Probabilistic neural networks', Neural Networks, Vol. 3, pp.110–118.
Su, A., Chin, Y. and Huang, U. (2007) 'Neural-network based approaches for software reliability estimation using dynamic weighted combinational models', The Journal of Systems and Software, Vol. 80, pp.606–615.
Tadayon, N. (2004) 'Adaptive dynamic COCOMO II in cost estimation', Proceedings of Software Engineering Research and Practice, pp.559–556.
Thwin, M.M.T. and Quah, T.S. (2005) 'Application of neural networks for software quality prediction using object-oriented metrics', Journal of Systems and Software, Vol. 76, pp.147–156.
Tian, L. and Noore, A. (2005a) 'On-line prediction of software reliability using an evolutionary connectionist model', The Journal of Systems and Software, Vol. 77, pp.173–180.
Tian, L. and Noore, A. (2005b) 'Evolutionary neural network modeling for software cumulative failure time prediction', Reliability Engineering and System Safety, Vol. 87, pp.45–51.
Tronto, I.F.D.B., Da Silva, J.D.S. and Sant'Anna, N. (2007) 'An investigation of artificial neural networks based prediction systems in software project management', The Journal of Systems and Software, Vol. 81, No. 3, pp.356–362.
Twala, B., Cartwright, M. and Shepperd, M. (2007) 'Applying rule induction in software prediction', In D. Zhang and J.J.P. Tsai (Eds.), Advances in Machine Learning Applications in Software Engineering. London: Idea Group Publishing, pp.73–102.
Vapnik, V. (1998) Statistical Learning Theory, Adaptive and Learning Systems for Signal Processing, Communications and Control (S. Haykin, Ed.). New York: John Wiley and Sons.
Venkatachalam, A.R. (1993) 'Software cost estimation using artificial neural networks', Proceedings of the 1993 International Joint Conference on Neural Networks (IJCNN '93-Nagoya), Vol. 1, pp.987–990.
Vinay Kumar, K., Ravi, V. and Carr, M. (2008) 'Software cost estimation using wavelet neural networks', The Journal of Systems and Software, Vol. 81, No. 11, pp.1853–1867.
Vinay Kumar, K., Ravi, V. and Carr, M. (2009) 'Software cost estimation using soft computing approaches', In E. Soria, J.D. Martín, R. Magdalena, M. Martínez and A.J. Serrano (Eds.), Handbook on Machine Learning Applications and Trends: Algorithms, Methods and Techniques. USA: IGI Global.
Watson, I. and Marir, F. (1994) 'Case-based reasoning: a review', The Knowledge Engineering Review, Vol. 9, No. 4, pp.327–354.
Wei, L. and Henry, S. (1993) 'Object-oriented metrics that predict maintainability', Journal of Systems and Software, Vol. 23, No. 2, pp.111–122.
Wittig, G. and Finnie, G. (1997) 'Estimating software development effort with connectionist models', Information and Software Technology, Vol. 39, pp.469–476.
Xu, Z. and Khoshgoftaar, T.M. (2004) 'Identification of fuzzy models of software cost estimation', Fuzzy Sets and Systems, Vol. 145, pp.141–163.
Yuan, X. and Ganesan, K. (2000) 'An application of fuzzy clustering to software quality prediction', Proceedings of the 3rd IEEE Symposium on Systems and Software Engineering Technology, pp.85–90.
Yang, I., Jones, B.F. and Yang, S.H. (2006) 'Genetic algorithm based software integration with minimum software risk', Information and Software Technology, Vol. 48, pp.133–141.
Yamada, S. and Osaki, S. (1983a) 'Reliability growth models for hardware and software systems based on non-homogeneous Poisson processes: a survey', Microelectronics and Reliability, Vol. 23, pp.91–112.
Yamada, S. and Osaki, S. (1983b) 'S-shaped software reliability growth model with four types of software error data', Int. J. Systems Science, Vol. 14, pp.683–692.
Yamada, S. and Osaki, S. (1985) 'Software reliability growth modeling: models and applications', IEEE Transactions on Software Engineering, Vol. 11, pp.1431–1437.
Zadeh, L.A. (1965) 'Fuzzy sets', Information and Control, Vol. 8, pp.338–353.
Zadeh, L.A. (1994) 'Soft computing and fuzzy logic', IEEE Software, Vol. 11, No. 6, pp.48–56.
Zhao, Y., Tan, H.B.K. and Zhang, W. (2003) 'Software cost estimation through conceptual requirement', Proceedings of the Third International Conference on Quality Software (QSIC'03), IEEE.
Zhang, D. and Tsai, J.J.P. (Eds.) (2007) Advances in Machine Learning Applications in Software Engineering. London: Idea Group Publishing.
Zimmermann, H.J. (1996) Fuzzy Set Theory and its Applications. London: Kluwer Academic Publishers.